
Encyclopedia of
Data Warehousing
and Mining

John Wang
Montclair State University, USA

IDEA GROUP REFERENCE


Hershey • London • Melbourne • Singapore

Acquisitions Editor: Rene Davies
Development Editor: Kristin Roth
Senior Managing Editor: Amanda Appicello
Managing Editor: Jennifer Neidig
Copy Editors: Eva Brennan, Alana Bubnis, Rene Davies and Sue VanderHook
Typesetters: Diane Huskinson, Sara Reed and Larissa Zearfoss
Support Staff: Michelle Potter
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by


Idea Group Reference (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@idea-group.com
Web site: http://www.idea-group-ref.com

and in the United Kingdom by


Idea Group Reference (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk

Copyright © 2006 by Idea Group Inc. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Encyclopedia of data warehousing and mining / John Wang, editor.


p. cm.
Includes bibliographical references and index.
ISBN 1-59140-557-2 (hard cover) -- ISBN 1-59140-559-9 (ebook)
1. Data warehousing. 2. Data mining. I. Wang, John, 1955-
QA76.9.D37E52 2005
005.74--dc22
2005004522

British Cataloguing in Publication Data


A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this encyclopedia set is new, previously unpublished material. The views expressed in this encyclopedia set
are those of the authors, but not necessarily those of the publisher.

Editorial Advisory Board

Floriana Esposito, Università di Bari, Italy

Chandrika Kamath, Lawrence Livermore National Laboratory, USA

Huan Liu, Arizona State University, USA

Sach Mukherjee, University of Oxford, UK

Alan Oppenheim, Montclair State University, USA

Marco Ramoni, Harvard University, USA

Mehran Sahami, Stanford University, USA

Michèle Sebag, Université Paris-Sud, France

Alexander Tuzhilin, New York University, USA

Ning Zhong, Maebashi Institute of Technology, Japan

List of Contributors

Abdulghani, Amin A. / Quantiva, USA


Abidin, Taufik / North Dakota State University, USA
Abou-Seada, Magda / Middlesex University, UK
Agresti, William W. / Johns Hopkins University, USA
Amoretti, Maria Suzana Marc / Federal University of Rio Grande do Sul (UFRGS), Brazil
An, Aijun / York University, Canada
Aradhye, Hrishikesh B. / SRI International, USA
Artz, John M. / The George Washington University, USA
Ashrafi, Mafruz Zaman / Monash University, Australia
Athappilly, Kuriakose / Western Michigan University, USA
Awad, Mamoun / University of Texas at Dallas, USA
Bala, Pradip Kumar / IBAT, Deemed University, India
Banerjee, Protima / Drexel University, USA
Banerjee, Rabindra Nath / Indian Institute of Technology, Kharagpur, India
Barko, Christopher D. / The University of North Carolina at Greensboro, USA
Bashir, Ahmad / University of Texas at Dallas, USA
Becerra, Victor M. / University of Reading, UK
Bellatreche, Ladjel / LISI/ENSMA, France
Besemann, Christopher / North Dakota State University, USA
Betz, Andrew L. / Progressive Insurance, USA
Beynon, Malcolm J. / Cardiff University, UK
Bhatnagar, Shalabh / Indian Institute of Science, India
Bhowmick, Sourav Saha / Nanyang Technological University, Singapore
Bickel, Steffen / Humboldt-Universität zu Berlin, Germany
Bittanti, Sergio / Politecnico di Milano, Italy
Borges, Thyago / Catholic University of Pelotas, Brazil
Boros, Endre / RUTCOR, Rutgers University, USA
Bose, Indranil / The University of Hong Kong, Hong Kong
Boulicaut, Jean-Francois / INSA de Lyon, France
Breault, Joseph L. / Ochsner Clinic Foundation, USA
Bretschneider, Timo R. / Nanyang Technological University, Singapore
Brown, Marvin L. / Grambling State University, USA
Bruha, Ivan / McMaster University, Canada
Bruining, Nico / Erasmus Medical Thorax Center, The Netherlands
Buccafurri, Francesco / University Mediterranea of Reggio Calabria, Italy
Burr, Tom / Los Alamos National Laboratory, USA
Butler, Shane M. / Monash University, Australia
Cadot, Martine / University of Henri Poincaré/LORIA, Nancy, France
Caelli, Terry / Australian National University, Australia
Calvo, Roberto Wolfler / Université de Technologie de Troyes, France
Caramia, Massimiliano / Istituto per le Applicazioni del Calcolo IAC-CNR, Italy
Cardoso, Jorge / University of Madeira, Portugal

Carneiro, Sofia / University of Minho, Portugal
Cerchiello, Paola / University of Pavia, Italy
Chakravarty, Indrani / Indian Institute of Technology, India
Chalasani, Suresh / University of Wisconsin-Parkside, USA
Chang, Chia-Hui / National Central University, Taiwan
Chen, Qiyang / Montclair State University, USA
Chen, Shaokang / The University of Queensland, Australia
Chen, Yao / University of Massachusetts Lowell, USA
Chen, Zhong / Shanghai JiaoTong University, PR China
Chien, Chen-Fu / National Tsing Hua University, Taiwan
Cho, Vincent / The Hong Kong Polytechnic University, Hong Kong
Chu, Feng / Nanyang Technological University, Singapore
Chu, Wesley / University of California - Los Angeles, USA
Chung, Seokkyung / University of Southern California, USA
Chung, Soon M. / Wright State University, USA
Conversano, Claudio / University of Cassino, Italy
Cook, Diane J. / University of Texas at Arlington, USA
Cook, Jack / Rochester Institute of Technology, USA
Cunningham, Colleen / Drexel University, USA
Dai, Honghua / Deakin University, Australia
Daly, Olena / Monash University, Australia
Dardzinska, Agnieszka / Bialystok Technical University, Poland
Das, Gautam / The University of Texas at Arlington, USA
de Campos, Luis M. / Universidad de Granada, Spain
de Luigi, Fabio / University of Ferrara, Italy
De Meo, Pasquale / Università Mediterranea di Reggio Calabria, Italy
DeLorenzo, Gary J. / Robert Morris University, USA
Delve, Janet / University of Portsmouth, UK
Denoyer, Ludovic / University of Paris VI, France
Denton, Anne / North Dakota State University, USA
Dhaenens, Clarisse / LIFL, University of Lille 1, France
Diday, Edwin / University of Dauphine, France
Dillon, Tharam / University of Technology Sydney, Australia
Ding, Qiang / Concordia College, USA
Ding, Qin / Pennsylvania State University, USA
Domeniconi, Carlotta / George Mason University, USA
Dorado de la Calle, Julián / University of A Coruña, Spain
Dorai, Chitra / IBM T. J. Watson Research Center, USA
Drew, James H. / Verizon Laboratories, USA
Dumitriu, Luminita / Dunarea de Jos University, Romania
Ester, Martin / Simon Fraser University, Canada
Fan, Weiguo / Virginia Polytechnic Institute and State University, USA
Felici, Giovanni / Istituto di Analisi dei Sistemi ed Informatica (IASI-CNR), Italy
Feng, Ling / University of Twente, The Netherlands
Fernández, Víctor Fresno / Universidad Rey Juan Carlos, Spain
Fernández-Luna, Juan M. / Universidad de Granada, Spain
Fischer, Ingrid / Friedrich-Alexander University Erlangen-Nürnberg, Germany
Fu, Ada Wai-Chee / The Chinese University of Hong Kong, Hong Kong
Fu, Li M. / University of Florida, USA
Fu, Yongjian / Cleveland State University, USA
Fung, Benjamin C. M. / Simon Fraser University, Canada
Fung, Benny Yiu-ming / The Hong Kong Polytechnic University, Hong Kong
Gallinari, Patrick / University of Paris VI, France
Galvão, Roberto Kawakami Harrop / Instituto Tecnológico de Aeronáutica, Brazil

Ganguly, Auroop R. / Oak Ridge National Laboratory, USA
Garatti, Simone / Politecnico di Milano, Italy
Garrity, Edward J. / Canisius College, USA
Ge, Nanxiang / Aventis, USA
Gehrke, Johannes / Cornell University, USA
Georgieva, Olga / Institute of Control and System Research, Bulgaria
Giudici, Paolo / University of Pavia, Italy
Goodman, Kenneth W. / University of Miami, USA
Greenidge, Charles / University of the West Indies, Barbados
Grzymala-Busse, Jerzy W. / University of Kansas, USA
Gunopulos, Dimitrios / University of California, USA
Guo, Hong / Southern Illinois University, USA
Gupta, Amar / University of Arizona, USA
Gupta, P. / Indian Institute of Technology, India
Haastrup, Palle / European Commission, Italy
Hamdi, Mohamed Salah / UAE University, UAE
Hamel, Lutz / University of Rhode Island, USA
Hamers, Ronald / Erasmus Medical Thorax Center, The Netherlands
Hammer, Peter L. / RUTCOR, Rutgers University, USA
Han, Hyoil / Drexel University, USA
Harms, Sherri K. / University of Nebraska at Kearney, USA
Hogo, Mofreh / Czech Technical University, Czech Republic
Holder, Lawrence B. / University of Texas at Arlington, USA
Hong, Yu / BearingPoint Inc, USA
Horiguchi, Susumu / Tohoku University, Japan
Hou, Wen-Chi / Southern Illinois University, USA
Hsu, Chun-Nan / Institute of Information Science, Academia Sinica, Taiwan
Hsu, William H. / Kansas State University, USA
Hu, Wen-Chen / University of North Dakota, USA
Hu, Xiaohua / Drexel University, USA
Huang, Xiangji / York University, Canada
Huete, Juan F. / Universidad de Granada, Spain
Hwang, Sae / University of Texas at Arlington, USA
Ibaraki, Toshihide / Kwansei Gakuin University, Japan
Ito, Takao / Ube National College of Technology, Japan
Jahangiri, Mehrdad / University of Southern California, USA
Järvelin, Kalervo / University of Tampere, Finland
Jha, Neha / Indian Institute of Technology, Kharagpur, India
Jin, Haihao / University of Kentucky, USA
Jourdan, Laetitia / LIFL, University of Lille 1, France
Jun, Jongeun / University of Southern California, USA
Kanapady, Ramdev / University of Minnesota, USA
Kao, Odej / University of Paderborn, Germany
Karakaya, Murat / Bilkent University, Turkey
Katsaros, Dimitrios / Aristotle University, Greece
Kern-Isberner, Gabriele / University of Dortmund, Germany
Khan, Latifur / University of Texas at Dallas, USA
Khan, M. Riaz / University of Massachusetts Lowell, USA
Khan, Shiraj / University of South Florida, USA
Kickhöfel, Rodrigo Branco / Catholic University of Pelotas, Brazil
Kim, Han-Joon / The University of Seoul, Korea
Klawonn, Frank / University of Applied Sciences Braunschweig/Wolfenbuettel, Germany
Koeller, Andreas / Montclair State University, USA
Kokol, Peter / University of Maribor, FERI, Slovenia

Kontio, Juha / Turku Polytechnic, Finland
Koppelaar, Henk / Delft University of Technology, The Netherlands
Kroeze, Jan H. / University of Pretoria, South Africa
Kros, John F. / East Carolina University, USA
Kryszkiewicz, Marzena / Warsaw University of Technology, Poland
Kusiak, Andrew / The University of Iowa, USA
la Tendresse, Ingo / Technical University of Clausthal, Germany
Lax, Gianluca / University Mediterranea of Reggio Calabria, Italy
Layos, Luis Magdalena / Universidad Politécnica de Madrid, Spain
Lazarevic, Aleksandar / University of Minnesota, USA
Lee, Chung-Hong / National Kaohsiung University of Applied Sciences, Taiwan
Lee, Chung-wei / Auburn University, USA
Lee, JeongKyu / University of Texas at Arlington, USA
Lee, Tzai-Zang / National Cheng Kung University, Taiwan, ROC
Lee, Zu-Hsu / Montclair State University, USA
Lee-Post, Anita / University of Kentucky, USA
Lenič, Mitja / University of Maribor, FERI, Slovenia
Levary, Reuven R. / Saint Louis University, USA
Li, Tao / Florida International University, USA
Li, Wenyuan / Nanyang Technological University, Singapore
Liberati, Diego / Consiglio Nazionale delle Ricerche, Italy
Licthnow, Daniel / Catholic University of Pelotas, Brazil
Lim, Ee-Peng / Nanyang Technological University, Singapore
Lin, Beixin (Betsy) / Montclair State University, USA
Lin, Tsau Young / San Jose State University, USA
Lindell, Yehuda / Bar-Ilan University, Israel
Lingras, Pawan / Saint Mary's University, Canada
Liu, Chang / Northern Illinois University, USA
Liu, Huan / Arizona State University, USA
Liu, Li / Aventis, USA
Liu, Xiaohui / Brunel University, UK
Liu, Xiaoqiang / Delft University of Technology, The Netherlands, and Donghua University, China
Lo, Victor S.Y. / Fidelity Personal Investments, USA
Lodhi, Huma / Imperial College London, UK
Loh, Stanley / Catholic University of Pelotas, Brazil, and Lutheran University of Brasil, Brazil
Long, Lori K. / Kent State University, USA
Lorenzi, Fabiana / Universidade Luterana do Brasil, Brazil
Lovell, Brian C. / The University of Queensland, Australia
Lu, June / University of Houston-Victoria, USA
Lu, Xinjian / California State University, Hayward, USA
Lutu, Patricia E.N. / University of Pretoria, South Africa
Ma, Sheng / IBM T.J. Watson Research Center, USA
Maj, Jean-Baptiste / LORIA/INRIA, France
Maloof, Marcus A. / Georgetown University, USA
Mangamuri, Murali / Wright State University, USA
Mani, D. R. / Massachusetts Institute of Technology, USA, and Harvard University, USA
Maniezzo, Vittorio / University of Bologna, Italy
Manolopoulos, Yannis / Aristotle University, Greece
Marchetti, Carlo / Università di Roma La Sapienza, Italy
Martí, Rafael / Universitat de València, Spain
Masseglia, Florent / INRIA Sophia Antipolis, France
Mathieu, Richard / Saint Louis University, USA
McLeod, Dennis / University of Southern California, USA
Mecella, Massimo / Università di Roma La Sapienza, Italy

Meinl, Thorsten / Friedrich-Alexander University Erlangen-Nürnberg, Germany
Meo, Rosa / Università degli Studi di Torino, Italy
Mishra, Nilesh / Indian Institute of Technology, India
Mladenić, Dunja / Jozef Stefan Institute, Slovenia
Mobasher, Bamshad / DePaul University, USA
Mohania, Mukesh / IBM India Research Lab, India
Morantz, Brad / Georgia State University, USA
Moreira, Adriano / University of Minho, Portugal
Motiwalla, Luvai / University of Massachusetts Lowell, USA
Muhlenbach, Fabrice / EURISE, Université Jean Monnet - Saint-Etienne, France
Mukherjee, Sach / University of Oxford, UK
Murty, M. Narasimha / Indian Institute of Science, India
Muruzábal, Jorge / University Rey Juan Carlos, Spain
Muselli, Marco / Italian National Research Council, Italy
Musicant, David R. / Carleton College, USA
Muslea, Ion / SRI International, USA
Nanopoulos, Alexandros / Aristotle University, Greece
Nasraoui, Olfa / University of Louisville, USA
Nayak, Richi / Queensland University of Technology, Australia
Nemati, Hamid R. / The University of North Carolina at Greensboro, USA
Ng, Vincent To-yee / The Hong Kong Polytechnic University, Hong Kong
Ng, Wee-Keong / Nanyang Technological University, Singapore
Nicholson, Scott / Syracuse University School of Information Studies, USA
O'Donnell, Joseph B. / Canisius College, USA
Oh, JungHwan / University of Texas at Arlington, USA
Oppenheim, Alan / Montclair State University, USA
Owens, Jan / University of Wisconsin-Parkside, USA
Oza, Nikunj C. / NASA Ames Research Center, USA
Pang, Les / National Defense University, USA
Paquet, Eric / National Research Council of Canada, Canada
Pasquier, Nicolas / Université de Nice-Sophia Antipolis, France
Pathak, Praveen / University of Florida, USA
Perlich, Claudia / IBM Research, USA
Perrizo, William / North Dakota State University, USA
Peter, Hadrian / University of the West Indies, Barbados
Peterson, Richard L. / Montclair State University, USA
Pharo, Nils / Oslo University College, Norway
Piltcher, Gustavo / Catholic University of Pelotas, Brazil
Poncelet, Pascal / Ecole des Mines d'Alès, France
Portougal, Victor / The University of Auckland, New Zealand
Povalej, Petra / University of Maribor, FERI, Slovenia
Primo, Tiago / Catholic University of Pelotas, Brazil
Provost, Foster / New York University, USA
Psaila, Giuseppe / Università degli Studi di Bergamo, Italy
Quattrone, Giovanni / Università Mediterranea di Reggio Calabria, Italy
Rabuñal Dopico, Juan R. / University of A Coruña, Spain
Rahman, Hakikur / SDNP, Bangladesh
Rakotomalala, Ricco / ERIC, Université Lumière Lyon 2, France
Ramoni, Marco F. / Harvard Medical School, USA
Ras, Zbigniew W. / University of North Carolina, Charlotte, USA
Rea, Alan / Western Michigan University, USA
Rehm, Frank / German Aerospace Center, Germany
Ricci, Francesco / eCommerce and Tourism Research Laboratory, ITC-irst, Italy
Rivero Cebrián, Daniel / University of A Coruña, Spain

Sacharidis, Dimitris / University of Southern California, USA
Saldaña, Ramiro / Catholic University of Pelotas, Brazil
Sanders, G. Lawrence / State University of New York at Buffalo, USA
Santos, Maribel Yasmina / University of Minho, Portugal
Saquer, Jamil M. / Southwest Missouri State University, USA
Sayal, Mehmet / Hewlett-Packard Labs, USA
Saygin, Yücel / Sabanci University, Turkey
Scannapieco, Monica / Università di Roma La Sapienza, Italy
Schafer, J. Ben / University of Northern Iowa, USA
Scheffer, Tobias / Humboldt-Universität zu Berlin, Germany
Schneider, Michel / Blaise Pascal University, France
Scime, Anthony / State University of New York College Brockport, USA
Sebastiani, Paola / Boston University School of Public Health, USA
Segall, Richard S. / Arkansas State University, USA
Shah, Shital C. / The University of Iowa, USA
Shahabi, Cyrus / University of Southern California, USA
Shen, Hong / Japan Advanced Institute of Science and Technology, Japan
Sheng, Yihua Philip / Southern Illinois University, USA
Siciliano, Roberta / University of Naples Federico II, Italy
Simitsis, Alkis / National Technical University of Athens, Greece
Simões, Gabriel / Catholic University of Pelotas, Brazil
Sindoni, Giuseppe / ISTAT - National Institute of Statistics, Italy
Singh, Richa / Indian Institute of Technology, India
Smets, Philippe / Université Libre de Bruxelles, Belgium
Smith, Kate A. / Monash University, Australia
Song, Il-Yeol / Drexel University, USA
Song, Min / Drexel University, USA
Sounderpandian, Jayavel / University of Wisconsin-Parkside, USA
Souto, Nieves Pedreira / University of A Coruña, Spain
Stanton, Jeffrey / Syracuse University School of Information Studies, USA
Sundaram, David / The University of Auckland, New Zealand
Sural, Shamik / Indian Institute of Technology, Kharagpur, India
Talbi, El-Ghazali / LIFL, University of Lille 1, France
Tan, Hee Beng Kuan / Nanyang Technological University, Singapore
Tan, Rebecca Boon-Noi / Monash University, Australia
Taniar, David / Monash University, Australia
Teisseire, Maguelonne / University of Montpellier II, France
Terracina, Giorgio / Università della Calabria, Italy
Thelwall, Mike / University of Wolverhampton, UK
Theodoratos, Dimitri / New Jersey Institute of Technology, USA
Thomasian, Alexander / New Jersey Institute of Technology, USA
Thuraisingham, Bhavani / The MITRE Corporation, USA
Tininini, Leonardo / CNR - Istituto di Analisi dei Sistemi e Informatica Antonio Ruberti, Italy
Troutt, Marvin D. / Kent State University, USA
Truemper, Klaus / University of Texas at Dallas, USA
Tsay, Li-Shiang / University of North Carolina, Charlotte, USA
Tzacheva, Angelina / University of North Carolina, Charlotte, USA
Ulusoy, Özgür / Bilkent University, Turkey
Ursino, Domenico / Università Mediterranea di Reggio Calabria, Italy
Vardaki, Maria / University of Athens, Greece
Vargas, Juan E. / University of South Carolina, USA
Vatsa, Mayank / Indian Institute of Technology, India
Viertl, Reinhard / Vienna University of Technology, Austria
Viktor, Herna L. / University of Ottawa, Canada

Virgillito, Antonino / Università di Roma La Sapienza, Italy
Viswanath, P. / Indian Institute of Science, India
Walter, Jörg Andreas / University of Bielefeld, Germany
Wang, Dajin / Montclair State University, USA
Wang, Hai / Saint Mary's University, Canada
Wang, Ke / Simon Fraser University, Canada
Wang, Lipo / Nanyang Technological University, Singapore
Wang, Shouhong / University of Massachusetts Dartmouth, USA
Wang, Xiong / California State University at Fullerton, USA
Webb, Geoffrey I. / Monash University, Australia
Wen, Ji-Rong / Microsoft Research Asia, China
West, Chad / IBM Canada Limited, Canada
Wickramasinghe, Nilmini / Cleveland State University, USA
Wieczorkowska, Alicja A. / Polish-Japanese Institute of Information Technology, Poland
Winkler, William E. / U.S. Bureau of the Census, USA
Wong, Raymond Chi-Wing / The Chinese University of Hong Kong, Hong Kong
Woon, Yew-Kwong / Nanyang Technological University, Singapore
Wu, Chien-Hsing / National University of Kaohsiung, Taiwan, ROC
Xiang, Yang / University of Guelph, Canada
Xing, Ruben / Montclair State University, USA
Yan, Feng / Williams Power, USA
Yan, Rui / Saint Mary's University, Canada
Yang, Hsin-Chang / Chang Jung University, Taiwan
Yang, Hung-Jen / National Kaohsiung Normal University, Taiwan
Yang, Ying / Monash University, Australia
Yao, James E. / Montclair State University, USA
Yao, Yiyu / University of Regina, Canada
Yavas, Gökhan / Bilkent University, Turkey
Yeh, Jyh-haw / Boise State University, USA
Yoo, Illhoi / Drexel University, USA
Yu, Lei / Arizona State University, USA
Zendulka, Jaroslav / Brno University of Technology, Czech Republic
Zhang, Bin / Hewlett-Packard Research Laboratories, USA
Zhang, Chengqi / University of Technology Sydney, Australia
Zhang, Shichao / University of Technology Sydney, Australia
Zhang, Yu-Jin / Tsinghua University, Beijing, China
Zhao, Qiankun / Nanyang Technological University, Singapore
Zhao, Yan / University of Regina, Canada
Zhao, Yuan / Nanyang Technological University, Singapore
Zhou, Senqiang / Simon Fraser University, Canada
Zhou, Zhi-Hua / Nanjing University, China
Zhu, Dan / Iowa State University, USA
Zhu, Qiang / University of Michigan, USA
Ziadé, Tarek / NUXEO, France
Ziarko, Wojciech / University of Regina, Canada
Zorman, Milan / University of Maribor, FERI, Slovenia
Zou, Qinghua / University of California - Los Angeles, USA

Contents
by Volume

VOLUME I

Action Rules / Zbigniew W. Ras, Angelina Tzacheva, and Li-Shiang Tsay ........................................................... 1

Active Disks for Data Mining / Alexander Thomasian ........................................................................................... 6

Active Learning with Multiple Views / Ion Muslea ................................................................................................. 12

Administering and Managing a Data Warehouse / James E. Yao, Chang Liu, Qiyang Chen, and June Lu .......... 17

Agent-Based Mining of User Profiles for E-Services / Pasquale De Meo, Giovanni Quattrone,
Giorgio Terracina, and Domenico Ursino ......................................................................................................... 23

Aggregate Query Rewriting in Multidimensional Databases / Leonardo Tininini ................................................. 28

Aggregation for Predictive Modeling with Relational Data / Claudia Perlich and Foster Provost ....................... 33

API Standardization Efforts for Data Mining / Jaroslav Zendulka ......................................................................... 39

Application of Data Mining to Recommender Systems, The / J. Ben Schafer ........................................................ 44

Approximate Range Queries by Histograms in OLAP / Francesco Buccafurri and Gianluca Lax ........................ 49

Artificial Neural Networks for Prediction / Rafael Martí ......................................................................................... 54

Association Rule Mining / Yew-Kwong Woon, Wee-Keong Ng, and Ee-Peng Lim ................................................ 59

Association Rule Mining and Application to MPIS / Raymond Chi-Wing Wong and Ada Wai-Chee Fu ............. 65

Association Rule Mining of Relational Data / Anne Denton and Christopher Besemann ..................................... 70

Association Rules and Statistics / Martine Cadot, Jean-Baptiste Maj, and Tarek Ziadé ..................................... 74

Automated Anomaly Detection / Brad Morantz ..................................................................................................... 78

Automatic Musical Instrument Sound Classification / Alicja A. Wieczorkowska ................................................... 83

Bayesian Networks / Ahmad Bashir, Latifur Khan, and Mamoun Awad ............................................................... 89

Best Practices in Data Warehousing from the Federal Perspective / Les Pang ....................................................... 94

Bibliomining for Library Decision-Making / Scott Nicholson and Jeffrey Stanton ................................................. 100

Biomedical Data Mining Using RBF Neural Networks / Feng Chu and Lipo Wang ................................................ 106

Building Empirical-Based Knowledge for Design Recovery / Hee Beng Kuan Tan and Yuan Zhao ...................... 112

Business Processes / David Sundaram and Victor Portougal ............................................................................... 118

Case-Based Recommender Systems / Fabiana Lorenzi and Francesco Ricci ....................................................... 124

Categorization Process and Data Mining / Maria Suzana Marc Amoretti ............................................................. 129

Center-Based Clustering and Regression Clustering / Bin Zhang ........................................................................... 134

Classification and Regression Trees / Johannes Gehrke ........................................................................................ 141

Classification Methods / Aijun An ........................................................................................................................... 144

Closed-Itemset Incremental-Mining Problem / Luminita Dumitriu ......................................................................... 150

Cluster Analysis in Fitting Mixtures of Curves / Tom Burr ..................................................................................... 154

Clustering Analysis and Algorithms / Xiangji Huang ............................................................................................. 159

Clustering in the Identification of Space Models / Maribel Yasmina Santos, Adriano Moreira,
and Sofia Carneiro .............................................................................................................................................. 165

Clustering of Time Series Data / Anne Denton ......................................................................................................... 172

Clustering Techniques / Sheng Ma and Tao Li ....................................................................................................... 176

Clustering Techniques for Outlier Detection / Frank Klawonn and Frank Rehm ................................................. 180

Combining Induction Methods with the Multimethod Approach / Mitja Lenič, Peter Kokol, Petra Povalej,
and Milan Zorman ............................................................................................................................................... 184

Comprehensibility of Data Mining Algorithms / Zhi-Hua Zhou .............................................................................. 190

Computation of OLAP Cubes / Amin A. Abdulghani .............................................................................................. 196

Concept Drift / Marcus A. Maloof ............................................................................................................................ 202

Condensed Representations for Data Mining / Jean-Francois Boulicaut ............................................................. 207

Content-Based Image Retrieval / Timo R. Bretschneider and Odej Kao ................................................................. 212

Continuous Auditing and Data Mining / Edward J. Garrity, Joseph B. O'Donnell,
and G. Lawrence Sanders ................................................................................................................................... 217

Data Driven vs. Metric Driven Data Warehouse Design / John M. Artz ................................................................. 223

Data Management in Three-Dimensional Structures / Xiong Wang ........................................................................ 228

Data Mining and Decision Support for Business and Science / Auroop R. Ganguly, Amar Gupta,
and Shiraj Khan .................................................................................................................................................. 233

Data Mining and Warehousing in Pharma Industry / Andrew Kusiak and Shital C. Shah .................................... 239

Data Mining for Damage Detection in Engineering Structures / Ramdev Kanapady
and Aleksandar Lazarevic .................................................................................................................................. 245

Data Mining for Intrusion Detection / Aleksandar Lazarevic ................................................................................. 251

Data Mining in Diabetes Diagnosis and Detection / Indranil Bose ........................................................................ 257

Data Mining in Human Resources / Marvin D. Troutt and Lori K. Long ............................................................... 262

Data Mining in the Federal Government / Les Pang ................................................................................................ 268

Data Mining in the Soft Computing Paradigm / Pradip Kumar Bala, Shamik Sural,
and Rabindra Nath Banerjee .............................................................................................................................. 272

Data Mining Medical Digital Libraries / Colleen Cunningham and Xiaohua Hu ................................................... 278

Data Mining Methods for Microarray Data Analysis / Lei Yu and Huan Liu ......................................................... 283

Data Mining with Cubegrades / Amin A. Abdulghani ............................................................................................. 288

Data Mining with Incomplete Data / Hai Wang and Shouhong Wang .................................................................... 293

Data Quality in Cooperative Information Systems / Carlo Marchetti, Massimo Mecella, Monica Scannapieco,
and Antonino Virgillito ...................................................................................................................................... 297

Data Quality in Data Warehouses / William E. Winkler .......................................................................................... 302

Data Reduction and Compression in Database Systems / Alexander Thomasian .................................................. 307

Data Warehouse Back-End Tools / Alkis Simitsis and Dimitri Theodoratos ......................................................... 312

Data Warehouse Performance / Beixin (Betsy) Lin, Yu Hong, and Zu-Hsu Lee ..................................................... 318

Data Warehousing and Mining in Supply Chains / Richard Mathieu and Reuven R. Levary ............................... 323

Data Warehousing Search Engine / Hadrian Peter and Charles Greenidge ......................................................... 328

Data Warehousing Solutions for Reporting Problems / Juha Kontio ..................................................................... 334

Database Queries, Data Mining, and OLAP / Lutz Hamel ....................................................................................... 339

Database Sampling for Data Mining / Patricia E.N. Lutu ....................................................................................... 344

DEA Evaluation of Performance of E-Business Initiatives / Yao Chen, Luvai Motiwalla, and M. Riaz Khan ...... 349

Decision Tree Induction / Roberta Siciliano and Claudio Conversano ............................................................... 353

Diabetic Data Warehouses / Joseph L. Breault ....................................................................................................... 359

Discovering an Effective Measure in Data Mining / Takao Ito ............................................................................... 364

Discovering Knowledge from XML Documents / Richi Nayak .............................................................................. 372

Discovering Ranking Functions for Information Retrieval / Weiguo Fan and Praveen Pathak ............................ 377

Discovering Unknown Patterns in Free Text / Jan H. Kroeze ................................................................................. 382

Discovery Informatics / William W. Agresti ............................................................................................................. 387

Discretization for Data Mining / Ying Yang and Geoffrey I. Webb .......................................................................... 392

Discretization of Continuous Attributes / Fabrice Muhlenbach and Ricco Rakotomalala .................................. 397

Distributed Association Rule Mining / Mafruz Zaman Ashrafi, David Taniar, and Kate A. Smith ........................ 403

Distributed Data Management of Daily Car Pooling Problems / Roberto Wolfler Calvo, Fabio de Luigi,
Palle Haastrup, and Vittorio Maniezzo ............................................................................................................. 408

Drawing Representative Samples from Large Databases / Wen-Chi Hou, Hong Guo, Feng Yan,
and Qiang Zhu ..................................................................................................................................................... 413

Efficient Computation of Data Cubes and Aggregate Views / Leonardo Tininini .................................................. 421

Embedding Bayesian Networks in Sensor Grids / Juan E. Vargas .......................................................................... 427

Employing Neural Networks in Data Mining / Mohamed Salah Hamdi .................................................................. 433

Enhancing Web Search through Query Log Mining / Ji-Rong Wen ....................................................................... 438

Enhancing Web Search through Web Structure Mining / Ji-Rong Wen ................................................................. 443

Ensemble Data Mining Methods / Nikunj C. Oza ................................................................................................... 448

Ethics of Data Mining / Jack Cook ......................................................................................................................... 454

Ethnography to Define Requirements and Data Model / Gary J. DeLorenzo ......................................................... 459

Evaluation of Data Mining Methods / Paolo Giudici ............................................................................................. 464

Evolution of Data Cube Computational Approaches / Rebecca Boon-Noi Tan ...................................................... 469

Evolutionary Computation and Genetic Algorithms / William H. Hsu ..................................................................... 477

Evolutionary Data Mining For Genomics / Laetitia Jourdan, Clarisse Dhaenens, and El-Ghazali Talbi ............ 482

Evolutionary Mining of Rule Ensembles / Jorge Muruzábal .................................................................................. 487

Explanation-Oriented Data Mining / Yiyu Yao and Yan Zhao ................................................................................. 492

Factor Analysis in Data Mining / Zu-Hsu Lee, Richard L. Peterson, Chen-Fu Chien, and Ruben Xing ............... 498

Financial Ratio Selection for Distress Classification / Roberto Kawakami Harrop Galvão, Victor M. Becerra,
and Magda Abou-Seada ..................................................................................................................................... 503

Flexible Mining of Association Rules / Hong Shen ................................................................................................. 509

Formal Concept Analysis Based Clustering / Jamil M. Saquer .............................................................................. 514

Fuzzy Information and Data Analysis / Reinhard Viertl ......................................................................................... 519

General Model for Data Warehouses, A / Michel Schneider .................................................................................. 523

Genetic Programming / William H. Hsu .................................................................................................................... 529

Graph Transformations and Neural Networks / Ingrid Fischer ............................................................................... 534

Graph-Based Data Mining / Lawrence B. Holder and Diane J. Cook .................................................................... 540

Group Pattern Discovery Systems for Multiple Data Sources / Shichao Zhang and Chengqi Zhang ................... 546

Heterogeneous Gene Data for Classifying Tumors / Benny Yiu-ming Fung and Vincent To-yee Ng .................... 550

Hierarchical Document Clustering / Benjamin C. M. Fung, Ke Wang, and Martin Ester ....................................... 555

High Frequency Patterns in Data Mining / Tsau Young Lin .................................................................................... 560

Homeland Security Data Mining and Link Analysis / Bhavani Thuraisingham ..................................................... 566

Humanities Data Warehousing / Janet Delve .......................................................................................................... 570

Hyperbolic Space for Interactive Visualization / Jörg Andreas Walter ................................................................... 575

VOLUME II

Identifying Single Clusters in Large Data Sets / Frank Klawonn and Olga Georgieva ......................................... 582

Immersive Image Mining in Cardiology / Xiaoqiang Liu, Henk Koppelaar, Ronald Hamers,
and Nico Bruining ............................................................................................................................................... 586

Imprecise Data and the Data Mining Process / Marvin L. Brown and John F. Kros .............................................. 593

Incorporating the People Perspective into Data Mining / Nilmini Wickramasinghe .............................................. 599

Incremental Mining from News Streams / Seokkyung Chung, Jongeun Jun, and Dennis McLeod ........................ 606

Inexact Field Learning Approach for Data Mining / Honghua Dai ......................................................................... 611

Information Extraction in Biomedical Literature / Min Song, Il-Yeol Song, Xiaohua Hu, and Hyoil Han .............. 615

Instance Selection / Huan Liu and Lei Yu ............................................................................................................... 621

Integration of Data Sources through Data Mining / Andreas Koeller .................................................................... 625

Intelligence Density / David Sundaram and Victor Portougal .............................................................................. 630

Intelligent Data Analysis / Xiaohui Liu ................................................................................................................... 634

Intelligent Query Answering / Zbigniew W. Ras and Agnieszka Dardzinska ........................................................ 639

Interactive Visual Data Mining / Shouhong Wang and Hai Wang .......................................................................... 644

Interscheme Properties' Role in Data Warehouses / Pasquale De Meo, Giorgio Terracina,
and Domenico Ursino .......................................................................................................................................... 647

Inter-Transactional Association Analysis for Prediction / Ling Feng and Tharam Dillon .................................... 653

Interval Set Representations of Clusters / Pawan Lingras, Rui Yan, Mofreh Hogo, and Chad West .................... 659

Kernel Methods in Chemoinformatics / Huma Lodhi .............................................................................................. 664

Knowledge Discovery with Artificial Neural Networks / Juan R. Rabuñal Dopico, Daniel Rivero Cebrián,
Julián Dorado de la Calle, and Nieves Pedreira Souto .................................................................................... 669

Learning Bayesian Networks / Marco F. Ramoni and Paola Sebastiani ............................................................... 674

Learning Information Extraction Rules for Web Data Mining / Chia-Hui Chang and Chun-Nan Hsu .................. 678

Locally Adaptive Techniques for Pattern Classification / Carlotta Domeniconi and Dimitrios Gunopulos ........ 684

Logical Analysis of Data / Endre Boros, Peter L. Hammer, and Toshihide Ibaraki .............................................. 689

Lsquare System for Mining Logic Data, The / Giovanni Felici and Klaus Truemper ............................................ 693

Marketing Data Mining / Victor S.Y. Lo .................................................................................................................. 698

Material Acquisitions Using Discovery Informatics Approach / Chien-Hsing Wu and Tzai-Zang Lee ................. 705

Materialized Hypertext View Maintenance / Giuseppe Sindoni .............................................................................. 710

Materialized Hypertext Views / Giuseppe Sindoni ................................................................................................... 714

Materialized View Selection for Data Warehouse Design / Dimitri Theodoratos and Alkis Simitsis ..................... 717

Methods for Choosing Clusters in Phylogenetic Trees / Tom Burr ........................................................................ 722

Microarray Data Mining / Li M. Fu .......................................................................................................................... 728

Microarray Databases for Biotechnology / Richard S. Segall ................................................................................ 734

Mine Rule / Rosa Meo and Giuseppe Psaila ........................................................................................................... 740

Mining Association Rules on a NCR Teradata System / Soon M. Chung and Murali Mangamuri ....................... 746

Mining Association Rules Using Frequent Closed Itemsets / Nicolas Pasquier ................................................... 752

Mining Chat Discussions / Stanley Loh, Daniel Licthnow, Thyago Borges, Tiago Primo,
Rodrigo Branco Kickhöfel, Gabriel Simões, Gustavo Piltcher, and Ramiro Saldaña ..................................... 758

Mining Data with Group Theoretical Means / Gabriele Kern-Isberner .................................................................. 763

Mining E-Mail Data / Steffen Bickel and Tobias Scheffer ....................................................................................... 768

Mining for Image Classification Based on Feature Elements / Yu-Jin Zhang ......................................................... 773

Mining for Profitable Patterns in the Stock Market / Yihua Philip Sheng, Wen-Chi Hou, and Zhong Chen ......... 779

Mining for Web-Enabled E-Business Applications / Richi Nayak ......................................................................... 785

Mining Frequent Patterns via Pattern Decomposition / Qinghua Zou and Wesley Chu ......................................... 790

Mining Group Differences / Shane M. Butler and Geoffrey I. Webb ....................................................................... 795

Mining Historical XML / Qiankun Zhao and Sourav Saha Bhowmick .................................................................. 800

Mining Images for Structure / Terry Caelli ............................................................................................................. 805

Mining Microarray Data / Nanxiang Ge and Li Liu ................................................................................................ 810

Mining Quantitative and Fuzzy Association Rules / Hong Shen and Susumu Horiguchi ..................................... 815

Model Identification through Data Mining / Diego Liberati .................................................................................. 820

Modeling Web-Based Data in a Data Warehouse / Hadrian Peter and Charles Greenidge ................................. 826

Moral Foundations of Data Mining / Kenneth W. Goodman ................................................................................... 832

Mosaic-Based Relevance Feedback for Image Retrieval / Odej Kao and Ingo la Tendresse ................................. 837

Multimodal Analysis in Multimedia Using Symbolic Kernels / Hrishikesh B. Aradhye and Chitra Dorai ............ 842

Multiple Hypothesis Testing for Data Mining / Sach Mukherjee ........................................................................... 848

Music Information Retrieval / Alicja A. Wieczorkowska ......................................................................................... 854

Negative Association Rules in Data Mining / Olena Daly and David Taniar ....................................................... 859

Neural Networks for Prediction and Classification / Kate A. Smith ......................................................................... 865

Off-Line Signature Recognition / Indrani Chakravarty, Nilesh Mishra, Mayank Vatsa, Richa Singh,
and P. Gupta ........................................................................................................................................................ 870

Online Analytical Processing Systems / Rebecca Boon-Noi Tan ........................................................................... 876

Online Signature Recognition / Indrani Chakravarty, Nilesh Mishra, Mayank Vatsa, Richa Singh,
and P. Gupta ........................................................................................................................................................ 885

Organizational Data Mining / Hamid R. Nemati and Christopher D. Barko .......................................................... 891

Path Mining in Web Processes Using Profiles / Jorge Cardoso ............................................................................. 896

Pattern Synthesis for Large-Scale Pattern Recognition / P. Viswanath, M. Narasimha Murty,
and Shalabh Bhatnagar ..................................................................................................................................... 902

Physical Data Warehousing Design / Ladjel Bellatreche and Mukesh Mohania .................................................. 906

Predicting Resource Usage for Capital Efficient Marketing / D. R. Mani, Andrew L. Betz, and James H. Drew .... 912

Privacy and Confidentiality Issues in Data Mining / Yücel Saygin ......................................................................... 921

Privacy Protection in Association Rule Mining / Neha Jha and Shamik Sural ..................................................... 925

Profit Mining / Senqiang Zhou and Ke Wang ......................................................................................................... 930

Pseudo Independent Models / Yang Xiang ............................................................................................................ 935

Reasoning about Frequent Patterns with Negation / Marzena Kryszkiewicz ......................................................... 941

Recovery of Data Dependencies / Hee Beng Kuan Tan and Yuan Zhao ................................................................ 947

Reinforcing CRM with Data Mining / Dan Zhu ........................................................................................................ 950

Resource Allocation in Wireless Networks / Dimitrios Katsaros, Gökhan Yavas, Alexandros Nanopoulos,
Murat Karakaya, Özgür Ulusoy, and Yannis Manolopoulos ............................................................................ 955

Retrieving Medical Records Using Bayesian Networks / Luis M. de Campos, Juan M. Fernández-Luna,
and Juan F. Huete ............................................................................................................................................... 960

Robust Face Recognition for Data Mining / Brian C. Lovell and Shaokang Chen ............................................... 965

Rough Sets and Data Mining / Jerzy W. Grzymala-Busse and Wojciech Ziarko .................................................... 973

Rule Generation Methods Based on Logic Synthesis / Marco Muselli .................................................................. 978

Rule Qualities and Knowledge Combination for Decision-Making / Ivan Bruha .................................................... 984

Sampling Methods in Approximate Query Answering Systems / Gautam Das ...................................................... 990

Scientific Web Intelligence / Mike Thelwall ............................................................................................................ 995

Search Situations and Transitions / Nils Pharo and Kalervo Järvelin .................................................................. 1000

Secure Multiparty Computation for Privacy Preserving Data Mining / Yehuda Lindell .......................................... 1005

Semantic Data Mining / Protima Banerjee, Xiaohua Hu, and Illhoi Yoo ............................................................... 1010

Semi-Structured Document Classification / Ludovic Denoyer and Patrick Gallinari ............................................ 1015

Semi-Supervised Learning / Tobias Scheffer ............................................................................................................ 1022

Sequential Pattern Mining / Florent Masseglia, Maguelonne Teisseire, and Pascal Poncelet ............................ 1028

Software Warehouse / Honghua Dai ....................................................................................................................... 1033

Spectral Methods for Data Clustering / Wenyuan Li ............................................................................................... 1037

Statistical Data Editing / Claudio Conversano and Roberta Siciliano .................................................................. 1043

Statistical Metadata in Data Processing and Interchange / Maria Vardaki ........................................................... 1048

Storage Strategies in Data Warehouses / Xinjian Lu .............................................................................................. 1054

Subgraph Mining / Ingrid Fischer and Thorsten Meinl .......................................................................................... 1059

Support Vector Machines / Mamoun Awad and Latifur Khan ............................................................................... 1064

Support Vector Machines Illuminated / David R. Musicant .................................................................................... 1071

Survival Analysis and Data Mining / Qiyang Chen, Alan Oppenheim, and Dajin Wang ...................................... 1077

Symbiotic Data Mining / Kuriakose Athappilly and Alan Rea .............................................................................. 1083

Symbolic Data Clustering / Edwin Diday and M. Narasimha Murthy .................................................................... 1087

Synthesis with Data Warehouse Applications and Utilities / Hakikur Rahman .................................................... 1092

Temporal Association Rule Mining in Event Sequences / Sherri K. Harms ........................................................... 1098

Text Content Approaches in Web Content Mining / Víctor Fresno Fernández and Luis Magdalena Layos ....... 1103

Text Mining-Machine Learning on Documents / Dunja Mladenić ........................................................................ 1109

Text Mining Methods for Hierarchical Document Indexing / Han-Joon Kim ......................................................... 1113

Time Series Analysis and Mining Techniques / Mehmet Sayal .............................................................................. 1120

Time Series Data Forecasting / Vincent Cho ........................................................................................................... 1125

Topic Maps Generation by Text Mining / Hsin-Chang Yang and Chung-Hong Lee ............................................. 1130

Transferable Belief Model / Philippe Smets ............................................................................................................ 1135

Tree and Graph Mining / Dimitrios Katsaros and Yannis Manolopoulos ............................................................. 1140

Trends in Web Content and Structure Mining / Anita Lee-Post and Haihao Jin .................................................. 1146

Trends in Web Usage Mining / Anita Lee-Post and Haihao Jin ............................................................................ 1151

Unsupervised Mining of Genes Classifying Leukemia / Diego Liberati, Sergio Bittanti,
and Simone Garatti ............................................................................................................................................. 1155

Use of RFID in Supply Chain Data Processing / Jan Owens, Suresh Chalasani,
and Jayavel Sounderpandian ............................................................................................................................. 1160

Using Dempster-Shafer Theory in Data Mining / Malcolm J. Beynon .................................................................... 1166

Using Standard APIs for Data Mining in Prediction / Jaroslav Zendulka .............................................................. 1171

Utilizing Fuzzy Decision Trees in Decision Making / Malcolm J. Beynon .............................................................. 1175

Vertical Data Mining / William Perrizo, Qiang Ding, Qin Ding, and Taufik Abidin .............................................. 1181

Video Data Mining / JungHwan Oh, JeongKyu Lee, and Sae Hwang .................................................................... 1185

Visualization Techniques for Data Mining / Herna L. Viktor and Eric Paquet ...................................................... 1190

Wavelets for Querying Multidimensional Datasets / Cyrus Shahabi, Dimitris Sacharidis,
and Mehrdad Jahangiri ...................................................................................................................................... 1196

Web Mining in Thematic Search Engines / Massimiliano Caramia and Giovanni Felici .................................... 1201

Web Mining Overview / Bamshad Mobasher ......................................................................................................... 1206

Web Page Extension of Data Warehouses / Anthony Scime ................................................................................... 1211

Web Usage Mining / Bamshad Mobasher ............................................................................................................... 1216

Web Usage Mining and Its Applications / Yongjian Fu ......................................................................................... 1221

Web Usage Mining Data Preparation / Bamshad Mobasher ................................................................................... 1226

Web Usage Mining through Associative Models / Paolo Giudici and Paola Cerchiello .................................... 1231

World Wide Web Personalization / Olfa Nasraoui ................................................................................................. 1235

World Wide Web Usage Mining / Wen-Chen Hu, Hung-Jen Yang, Chung-wei Lee, and Jyh-haw Yeh ............... 1242


Foreword

There has been much interest in the data mining field, both in academia and in industry, over the past
10-15 years. The number of researchers and practitioners working in the field and the number of scientific papers
published in various data mining outlets increased drastically over this period. Major commercial vendors incorporated
various data mining tools into their products, and numerous applications in many areas, including life sciences, finance,
CRM, and Web-based applications, have been developed and successfully deployed.
Moreover, this interest is no longer limited to the researchers working in the traditional fields of statistics, machine
learning and databases, but has recently expanded to other fields, including operations research/management science
(OR/MS) and mathematics, as evidenced from various data mining tracks organized at different INFORMS meetings,
special issues of OR/MS journals and the recent conference on Mathematical Foundations of Learning Theory
organized by mathematicians.
As the Encyclopedia of Data Warehousing and Mining amply demonstrates, all these diverse interests from
different groups of researchers and practitioners helped to shape data mining as a broad and multi-faceted discipline
spanning a large class of problems in such diverse areas as life sciences, marketing (including CRM and e-commerce),
finance, telecommunications, astronomy, and many other fields (the so-called "data mining and X" phenomenon, where
X constitutes a broad range of fields where data mining is used for analyzing the data). This also resulted in a process
of cross-fertilization of ideas generated by these diverse groups of researchers interacting across the traditional
boundaries of their disciplines.
Despite all this progress, data mining still faces several challenges that make the field ripe with future research
opportunities. First, despite the cross-fertilization of ideas spanning various disciplines, the convergence among
different disciplines proceeds gradually, and more work is required to arrive at a unified view of data mining widely
accepted by different groups of researchers. Second, despite a considerable progress, still more work is required on
the theoretical foundations of data mining, as was recently stated by the participants of the Dagstuhl workshop "Data
Mining: The Next Generation" organized by R. Agrawal, J.-C. Freytag and R. Ramakrishnan and also expressed by
various other data mining researchers. Third, the data mining community must address the privacy and security
problems for data mining to be accepted by the privacy advocates and the Congress. Fourth, as the field advances, so
is the scope of data mining applications. The challenge to the field is to develop more advanced data mining methods
that would work in these increasingly demanding applications. Fifth, despite a considerable progress in developing more
user-friendly data mining tools, more work is required in this area with the goal of making these tools accessible to a
large audience of naïve data mining users. In particular, one of the challenges is to devise methods that would
smoothly embed data mining tools into corresponding applications on the front-end and would integrate these tools
with databases on the back-end. Achieving such capabilities is very important since this would allow data mining to
"cross the chasm" (using Geoffrey Moore's terminology) and become a mainstream technology utilized by millions of
users. Finally, more work is required on actionability and on the development of better methods for discovering
actionable patterns in the data. Currently, discovering actionable patterns in data constitutes a laborious and
challenging process. It is important to streamline and simplify this process and make it more efficient.
Given significant and rapid advancements in data mining and data warehousing, it is important to take periodic
snapshots of the field every few years. The data mining community addressed this issue by producing publications
covering the state of the art of the field every few years starting with the first volume Advances in Knowledge
Discovery and Data Mining (edited by U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy) published by
AAAI/MIT Press in 1996. This encyclopedia provides the latest snapshot of the field and surveys a broad array of
topics ranging from the basic theories to the recent advancements in the field and covers a diverse range of problems


from the analysis of microarray data to the analysis of multimedia and Web data. It also identifies future directions and
trends in data mining and data warehousing. Therefore, this volume should become an excellent guide to researchers
and practitioners.

Alexander Tuzhilin
New York University, USA


Preface

How can a data-flooded manager get out of the mire? How can a confused decision maker pass through a maze?
How can an over-burdened problem solver clean up a mess? How can an exhausted scientist decipher a myth?
The answer is an interdisciplinary subject and a powerful tool known as data mining (DM). DM can turn data into
dollars; transform information into intelligence; change pattern into profit; and convert relationship into resources.
As the third branch of operations research and management science (OR/MS) and the third milestone of data
management, DM can help attack the third category of decision making by elevating our raw data into the third stage
of knowledge creation.
The term "third" has been mentioned four times above. Let's go backward and look at the three stages of knowledge
creation. Managers are often drowning in data (the first stage) but starving for knowledge. A collection of data is not
information (the second stage); and a collection of information is not knowledge. Data begets information which begets
knowledge. The whole subject of DM has a synergy of its own and represents more than the sum of its parts.
There are three categories of decision making: structured, semi-structured and unstructured. Decision making
processes fall along a continuum that ranges from highly structured decisions (sometimes called programmed) to highly
unstructured (non-programmed) decisions (Turban et al., 2005, p. 12).
At one end of the spectrum, structured processes are routine and typically repetitive problems for which standard
solutions exist. Unfortunately, rather than being static, deterministic and simple, the majority of real world problems
are dynamic, probabilistic, and complex. Many professional and personal problems are classified as unstructured, or
marginally as semi-structured, or even in between, since the boundaries between them may not be crystal-clear.
In addition to developing normative models (such as linear programming, economic order quantity) for solving
structured (or programmed) problems, operation researchers and management scientists have created many descriptive
models, such as simulation and goal programming, to deal with semi-structured alternatives. Unstructured problems,
however, fall in a gray area for which there are no cut-and-dried solution methods. The current two branches of OR/MS
hit a dead end with unstructured problems.
To gain knowledge, one must understand the patterns that emerge from information. Patterns are not just simple
relationships among data; they exist separately from information, as archetypes or standards to which emerging
information can be compared so that one may draw inferences and take action. Over the last 40 years, the tools and
techniques used to process data and information have continued to evolve from databases (DBs) to data warehousing
(DW) and further to DM. DW applications, the middle of these three stages, have become business-critical. However,
DM can help deliver even more value from these huge repositories of information.
Certainly, there are many statistical models that have emerged over time. Machine learning has marked a milestone
in the evolution of computer science (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996). Although DM is still
in its infancy, it is now being used in a wide range of industries and for a range of tasks in a variety of contexts (Wang,
2003). DM is synonymous with knowledge discovery in databases, knowledge extraction, data/pattern analysis, data
archeology, data dredging, data snooping, data fishing, information harvesting, and business intelligence (Giudici,
2003; Hand et al., 2001; Han & Kamber, 2000). There are unprecedented opportunities in the future to utilize DM.
Data warehousing and mining (DWM) is the science of managing and analyzing large datasets and discovering novel
patterns. In recent years, DWM has emerged as a particularly exciting and industrially relevant area of research.
Prodigious amounts of data are now being generated in domains as diverse and elusive as market research, functional
genomics and pharmaceuticals. Intelligently analyzing data to discover knowledge with the aim of answering crucial
questions and helping make informed decisions is the challenge that lies ahead.
The Encyclopedia of Data Warehousing and Mining provides theories, methodologies, functionalities, and
applications to decision makers, problem solvers, and data miners in business, academia, and government. DWM lies


at the junction of database systems, artificial intelligence, machine learning and applied statistics, which makes it a
valuable area for researchers and practitioners. With a comprehensive overview, The Encyclopedia of Data Warehous-
ing and Mining offers a thorough exposure to the issues of importance in this rapidly changing field. The encyclopedia
also includes a rich mix of introductory and advanced topics while providing a comprehensive source of technical,
functional and legal references to DWM.
After spending more than a year preparing this book, with a strictly peer-reviewed process, I am delighted to see
it published. The standard for selection was very high. Each article went through at least three peer reviews; additional
third-party reviews were sought in cases of controversy. There have been innumerable instances where this feedback
has helped to improve the quality of the content, and even influenced authors in how they approach their topics.
The primary objective of this encyclopedia is to explore the myriad of issues regarding DWM. A broad spectrum
of practitioners, managers, scientists, educators, and graduate students who teach, perform research, and/or implement
these discoveries, are the envisioned readers of this encyclopedia.
The encyclopedia contains a collection of 234 articles, written by an international team of 361 experts representing
leading scientists and talented young scholars from 34 countries. They have contributed great effort to create a source
of solid, practical information, informed by sound underlying theory that should become a resource for all people
involved in this dynamic new field. Let's take a peek at a few articles:
The evaluation of DM methods requires a great deal of attention. A valid model evaluation and comparison can
improve considerably the efficiency of a DM process. Paolo Giudici has presented several ways to perform model
comparison, each of which has its advantages and disadvantages.
According to Zbigniew W. Ras, the main objective of action rules is to generate special types of rules for a database
that point the direction for re-classifying objects with respect to some distinguishing attributes (called decision
attributes). This creates flexible attributes that form a basis for action rules construction.
With the constraints imposed by computer memory and mining algorithms, we can experience selection pressures
more than ever. The main point of instance selection is approximation. Our task is to achieve as good mining results
as possible by approximating the whole dataset with the selected instances and hope to do better in DM with instance
selection as it is possible to remove noisy and irrelevant data in the process. Huan Liu and Lei Yu have presented an
initial attempt to review and categorize the methods of instance selection in terms of sampling, classification, and
clustering.
Shichao Zhang and Chengqi Zhang introduce a group of pattern discovery systems for dealing with the multiple
data source (MDS) problem, mainly including a logical system for enhancing data quality; a logical system for resolving
conflicts; a data cleaning system; a database clustering system; a pattern discovery system and a post-mining system.
Based on his extensive experience, Gautam Das surveys recent state-of-the-art solutions to the problem of
approximate query answering in databases, in which "ballpark answers" (i.e., approximate answers) to queries can be
provided within acceptable time limits. These techniques sacrifice accuracy to improve running time; typically through
some sort of lossy data compression. Also, Han-Joon Kim (the holder of two patents on text mining applications)
discusses a comprehensive text-mining solution to document indexing problems on topic hierarchies (taxonomy).
Condensed representations have been proposed as a useful concept for the optimization of typical DM tasks. It
appears as a key concept within the emerging inductive DB framework, where inductive query evaluation calls for
effective constraint-based DM techniques. Jean-François Boulicaut introduces this research domain, its achievements
in the context of frequent itemset mining from transactional data and its future trends.
Zhi-Hua Zhou discusses complexity issues in DM. Although we still have a long way to go in order to produce
patterns that can be understood by most people involved with DM tasks, endeavors on improving the comprehensibility
of complicated algorithms have proceeded at a promising pace.
Pattern classification poses a difficult challenge in finite settings and high-dimensional spaces because of the curse
of dimensionality. Carlotta Domeniconi and Dimitrios Gunopulos discuss classification techniques, including the
authors' own work, to mitigate the curse of dimensionality and reduce bias by estimating local feature relevance and
selecting features accordingly. This issue has both theoretical and practical relevance, since learning tasks abound in
which data are represented as a collection of a very large numbers of features. Thus, many applications can benefit from
improvements in predicting error.
Qinghua Zou proposes using pattern decomposition algorithms to find frequent patterns in large datasets. Pattern
decomposition is a DM technology that uses known, frequent or infrequent patterns to decompose long itemsets to
many short ones. It identifies frequent patterns in a dataset using a bottom-up methodology and reduces the size of
the dataset in each step. The algorithm avoids the process of candidate set generation and decreases the time for
counting supports due to the reduced dataset.


Perrizo, Ding, et al. review a category of DM approaches using vertical data structures. They demonstrate their
applications in various DM areas, such as association rule mining and multi-relational DM. Vertical DM strategy aims
at addressing scalability issues by organizing data in vertical layouts and conducting logical operations on vertically
partitioned data instead of scanning the entire DB horizontally.
Integration of data sources refers to the task of developing a common schema, as well as data transformation
solutions, for a number of data sources with related content. The large number and size of modern data sources makes
manual approaches to integration increasingly impractical. Andreas Koeller provides a comprehensive overview over
DM techniques which can help to partially or fully automate the data integration process.
DM applications often involve testing hypotheses regarding thousands or millions of objects at once. The statistical
concept of multiple hypothesis testing is of great practical importance in such situations, and an appreciation of the
issues involved can vastly reduce errors and associated costs. Sach Mukherjee provides an introductory look at multiple
hypothesis testing in the context of DM.
Maria Vardaki illustrates the benefits of using statistical metadata by information systems, depicting also how such
standardization can improve the quality of statistical results. She proposes a common, semantically rich, and object-
oriented data/metadata model for metadata management that integrates the main steps of data processing and covers
all aspects of DW that are essential for DM requirements. Finally, she demonstrates how a metadata model can be
integrated in a web-enabled statistical information system to ensure quality of statistical results.
A major obstacle in DM applications is the gap between statistic-based pattern extraction and value-based decision-
making. Profit mining aims at reducing this gap. The concept and techniques proposed by Ke Wang and Senqiang Zhou
are applicable to applications under a general notion of utility.
Although a tremendous amount of progress has been made in DM over the last decade or so, many important
challenges still remain. For instance, there are still no solid standards of practice; it is still too easy to misuse DM
software; secondary data analysis without appropriate experimental design is still common; and it is still hard to choose
the right kind of analysis methods for the problem at hand. Xiao Hui Liu points out that intelligent data analysis (IDA)
is an interdisciplinary study concerning the effective analysis of data, which may help advance the state of the art in the
field.
In recent years, the need to extract complex tree-like or graph-like patterns in massive data collections (e.g., in
bioinformatics, semistructured or Web DBs) has become a necessity. This has led to the emergence of the research field
of graph and tree mining. This field provides many promising topics for both theoretical and engineering achievements,
and many expect this to be one of the key fields in DM research in the years ahead. Katsaros and Manolopoulos review
the most important strategic application-domains where frequent structure mining (FSM) provides significant results.
A survey is presented of the most important algorithms that have been proposed for mining graph-like and tree-like
substructures in massive data collections.
Lawrence B. Holder and Diane J. Cook are among the pioneers in the field of graph-based DM and have developed
the widely-disseminated Subdue graph-based DM system (http://ailab.uta.edu/subdue). They have directed multi-
million dollar government-funded projects in the research, development and application of graph-based DM in real-
world tasks ranging from bioinformatics to homeland security.
Graphical models such as Bayesian networks (BNs) and decomposable Markov networks (DMNs) have been widely
applied to probabilistic reasoning in intelligent systems. Automatic discovery of such models from data is desirable,
but is NP-hard in general. Common learning algorithms use single-link look-ahead searches for efficiency. However,
pseudo-independent (PI) probabilistic domains are not learnable by such algorithms. Yang Xiang introduces funda-
mentals of PI domains and explains why common algorithms fail to discover them. He further offers key ideas as to how
they can efficiently be discovered, and predicts advances in the near future.
Semantic DM is a novel research area that uses graph-based DM techniques and ontologies to identify complex
patterns in large, heterogeneous data sets. Tony Hu's research group at Drexel University is involved in the
development and application of semantic DM techniques to the bioinformatics and homeland security domains.
Yu-Jin Zhang presents a novel method for image classification based on feature element through association rule
mining. The feature elements can capture well the visual meanings of images according to the subjective perception
of human beings, and are suitable for working with rule-based classification models. Techniques are adapted for mining
the association rules which can find associations between the feature elements and class attributes of the image, and
the mined rules are applied to image classifications.
Results of image DB queries are usually presented as a thumbnail list. Subsequently, each of these images can be
used for refinement of the initial query. This approach is not suitable for queries by sketch. In order to receive the desired
images, the user has to recognize misleading areas of the sketch and modify these images appropriately. This is a non-


trivial problem, as the retrieval often is based on complex, non-intuitive features. Therefore, Odej Kao presents a mosaic-
based technique for sketch feedback, which combines the best sections contained in an image DB into a single query
image.
Andrew Kusiak and Shital C. Shah emphasize the need for an individual-based paradigm, which may ensure the well-
being of patients and the success of pharmaceutical industry. The new methodologies are illustrated with various
medical informatics research projects on topics such as predictions for dialysis patients, significant gene/SNP
identifications, hypoplastic left heart syndrome for infants, and epidemiological and clinical toxicology. DWM and data
modeling will ultimately lead to targeted drug discovery and individualized treatments with minimum adverse effects.
The use of microarray DBs has revolutionized the way in which biomedical research and clinical investigation can
be conducted in that high-density arrays of specified DNA sequences can be fabricated onto a single glass slide or
chip. However, the analysis and interpretation of the vast amount of complex data produced by this technology poses
an unprecedented challenge. LinMin Fu and Richard Segall present a state-of-the-art review of microarray DM problems
and solutions.
Knowledge discovery from genomic data has become an important research area for biologists. An important
characteristic of genomic applications is the very large amount of data to be analyzed, and most of the time, it is not
possible to apply only classical statistical methods. Therefore, Jourdan, Dhaenens and Talbi propose to model
knowledge discovery tasks associated with such problems as combinatorial optimization tasks, in order to apply
efficient optimization algorithms to extract knowledge from those large datasets.
As Indrani Chakravarty et al.'s research shows, the handwritten signature is a behavioral biometric. There
are two methods used for recognition of handwritten signatures: offline and online. While offline methods extract static
features of signature instances by treating them as images, online methods extract and use temporal or dynamic features
of signatures for recognition purposes. Temporal features are difficult to imitate, and hence online recognition methods
offer higher accuracy rates than offline methods.
Neurons are small processing units that are able to store some information. When several neurons are connected,
the result is a neural network, a model inspired by biological neural networks like the brain. Kate Smith provides useful
guidelines to ensure successful learning and generalization of the neural network model. Also, a special version in the
form of probabilistic neural networks (PNNs) is explained by Ingrid Fischer with the help of graphic transformations.
The sheer volume of multimedia data available has exploded on the Internet in the past decade in the form of webcasts,
broadcast programs and streaming audio and video. Automated content analysis tools for multimedia depend on face
detectors and recognizers; videotext extractors; speech and speaker identifiers; people/vehicle trackers; and event
locators resulting in large sets of multimodal features that can be real-valued, discrete, ordinal, or nominal. Multimedia
metadata based on such a multimodal collection of features, poses significant difficulties to subsequent tasks such as
classification, clustering, visualization and dimensionality reduction which traditionally deal only with continuous-
valued data. Aradhye and Dorai discuss mechanisms that extend tasks traditionally limited to continuous-valued feature
spaces to multimodal multimedia domains with symbolic and continuous-valued features, including (a) dimensionality
reduction, (b) de-noising, (c) visualization, and (d) clustering.
Brian C. Lovell and Shaokang Chen review the recent advances in the application of face recognition for multimedia
DM. While the technology for mining text documents in large DBs could be said to be relatively mature, the same cannot
be said for mining other important data types such as speech, music, images and video. Yet these forms of multimedia
data are becoming increasingly common on the Internet and intranets.
The goal of Web usage mining is to capture, model, and analyze the behavioral patterns and profiles of users
interacting with a Web site. Bamshad Mobasher and Yongjian Fu provide an overview of the three primary phases of
the Web mining process: data preprocessing, pattern discovery, and pattern analysis. The primary focus of their articles
is on the types of DM and analysis tasks most commonly used in Web usage mining, as well as some of their typical
applications in areas such as Web personalization and Web analytics. Ji-Rong Wen explores the ways of enhancing
Web search using query log mining and Web structure mining.
In line with Mike Thelwalls opinion, scientific Web intelligence (SWI) is a research field that combines techniques
from DM, Web intelligence and scientometrics to extract useful information from the links and text of academic-related
Web pages, using various clustering, visualization and counting techniques. SWI is a type of Web mining that combines
Web structure mining and text mining. Its main uses are in addressing research questions concerning the Web, or Web-
related phenomena, rather than in producing commercially useful knowledge.
Web-enabled electronic business is generating massive amount of data on customer purchases, browsing patterns,
usage times and preferences at an increasing rate. DM techniques can be applied to all the data being collected. Richi
Nayak presents issues associated with DM for Web-enabled electronic-business.


Tobias Scheffer gives an overview of common email mining tasks including email filing, spam filtering and mining
communication networks. The main section of his work focuses on recent developments in mining email data for support
of the message creation process. Approaches to mining question-answer pairs and sentences are also reviewed.
Stanley Loh describes a computer-supported approach to mine discussions that occurred in chat rooms. Dennis
Mcleod explores incremental mining from news streams. JungHwan Oh summarizes the current status of video DM. J.
Ben Schafer addresses the technology used to generate recommendations.
In the abstract, a DW can be seen as a set of materialized views defined over source relations. During the initial design
of a DW, the designer faces the problem of deciding which views to materialize in the DW. This problem has been
addressed in the literature for different classes of queries and views, and with different design goals. Theodoratos and
Simitsis identify the different design goals used to formulate alternative versions of the problem and highlight the
techniques used to solve it.
Michel Schneider addresses the problem of designing a DW schema. He suggests a general model for this purpose
that integrates a majority of existing models: the notion of a well-formed structure is proposed to support the design process;
a graphic representation is suggested for drawing well-formed structures; and the classical star-snowflake structure
is represented.
Anthony Scime presents a methodology for adding external information from the World Wide Web to a DW, in
addition to the DWs domain information. The methodology assures decision makers that the added Web based data
are relevant to the purpose and current data of the DW.
Privacy and confidentiality of individuals are important issues in the information technology age. Advances in DM
technology have increased privacy concerns even more. Jack Cook and Yücel Saygın highlight the privacy and
confidentiality issues in DM, and survey state-of-the-art solutions and approaches for achieving privacy-preserving
DM.
Ken Goodman provides one of the first overviews of ethical issues that arise in DM. He shows that while privacy
and confidentiality often are paramount in discussions of DM, other issues including the characterization of
appropriate uses and users, and data miners' intentions and goals must be considered. Machine learning in genomics
and in security surveillance are set aside as special issues requiring attention.
Increased concern about privacy and information security has led to the development of privacy preserving DM
techniques. Yehuda Lindell focuses on the paradigms for defining security in this setting, and the need for a rigorous
approach. Shamik Sural et al. present some of important approaches to privacy protection in association rule mining.
Human-computer interaction is crucial in the knowledge discovery process in order to accomplish a variety of novel
goals of DM. In Shou Hong Wang's opinion, interactive visual DM is human-centered DM, implemented through
knowledge discovery loops coupled with human-computer interaction and visual representations.
Symbiotic DM is an evolutionary approach that shows how organizations analyze, interpret, and create new
knowledge from large pools of data. Symbiotic data miners are trained business and technical professionals skilled in
applying complex DM techniques and business intelligence tools to challenges in a dynamic business environment.
Athappilly and Rea open the discussion on how businesses and academia can work to help professionals learn, and
fuse the skills of business, IT, statistics, and logic to create the next generation of data miners.
Yiyu Yao and Yan Zhao first make an immediate comparison between scientific research and DM and add an
explanation construction and evaluation task to the existing DM framework. Explanation-oriented DM offers a new
perspective, which has a significant impact on the understanding of the complete process of DM and effective
applications of DM results.
Traditional DM views the output from any DM initiative as a homogeneous knowledge product. Knowledge
however, is always a multifaceted construct, exhibiting many manifestations and forms. It is the thesis of Nilmini
Wickramasinghe's discussion that a more complete and macro perspective, and a more balanced approach to knowledge
creation, can best be provided by taking a broader perspective of the knowledge product resulting from the KDD
process: namely, by incorporating a people-based perspective into the traditional KDD process, and viewing knowledge
as the multifaceted construct it is. This in turn will serve to enhance the knowledge base of an organization, and facilitate
the realization of effective knowledge.
Fabrice Muhlenbach and Ricco Rakotomalala are the authors of an original supervised multivariate discretization
method called HyperCluster Finder. Their major contributions to the research community are present in the DM software
called TANAGRA, which is freely available on the Internet.
Recently there have been many efforts to apply DM techniques to security problems, including homeland security
and cyber security. Bhavani Thuraisingham (the inventor of three patents for MITRE) examines some of these
developments in DM in general and link analysis in particular, and shows how DM and link analysis techniques may
be applied for homeland security applications. Some emerging trends are also discussed.


In order to reduce financial statement errors and fraud, Garrity, O'Donnell and Sanders proposed an architecture
that provides auditors with a framework for an effective continuous auditing environment that utilizes DM.
The applications of DWM are everywhere: from Kernel Methods in Chemoinformatics to Data Mining for Damage
Detection in Engineering Structures; from Predicting Resource Usage for Capital Efficient Marketing to Mining for
Profitable Patterns in the Stock Market; from Financial Ratio Selection for Distress Classification to Material
Acquisitions Using Discovery Informatics Approach; from Resource Allocation in Wireless Networks to Reinforcing
CRM with Data Mining; from Data Mining Medical Digital Libraries to Immersive Image Mining in Cardiology; from
Data Mining in Diabetes Diagnosis and Detection to Distributed Data Management of Daily Car Pooling Problems;
and from Mining Images for Structure to Automatic Musical Instrument Sound Classification. The list of DWM
applications is endless and the future of DWM is promising.
Knowledge explosion pushes DWM, a multidisciplinary subject, to ever-expanding regions. Inclusion, omission,
emphasis, evolution and even revolution are part of our professional life. In spite of our efforts to be careful, should
you find any ambiguities or perceived inaccuracies, please contact me at wangj@mail.montclair.edu.

REFERENCES

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data
mining. AAAI/MIT Press.
Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. John Wiley.
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. MIT Press.
Turban, E., Aronson, J. E., & Liang, T. P. (2005). Decision support systems and intelligent systems. Upper Saddle River,
NJ: Pearson Prentice Hall.
Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.


Acknowledgments

The editor would like to thank all of the authors for their insights and excellent contributions to this book. I also want
to thank the group of anonymous reviewers who assisted me in the peer-reviewing process and provided comprehen-
sive, critical, and constructive reviews. Each Editorial Advisory Board member has made a big contribution in guidance
and assistance.
The editor wishes to acknowledge the help of all involved in the development process of this book, without whose
support the project could not have been satisfactorily completed. Linxi Liao and MinSun Ku, two graduate assistants,
are hereby graciously acknowledged for their diligent work. I owe my thanks to Karen Dennis for lending a hand in the
tedious process of proof-reading. A further special note of thanks goes to the staff at Idea Group Inc., whose
contributions have been invaluable throughout the entire process, from inception to final publication. Particular thanks
go to Sara Reed, Jan Travers, and Rene Davies, who continuously prodded via e-mail to keep the project on schedule,
and to Mehdi Khosrow-Pour, whose enthusiasm motivated me to accept his invitation to join this project.
My appreciation is also due the Global Education Center at MSU for awarding me a Global Education Fund. I would
also like to extend my thanks to my brothers Zhengxian, Shubert (an artist, http://www.portraitartist.com/wang/bio.htm),
and sister Jixian, who stood solidly behind me and contributed in their own sweet little ways. Thanks go to all Americans,
since it would not have been possible for the four of us to come to the U.S. without their support of different scholarships.
Finally, I want to thank my family: my parents, Houde Wang and Junyan Bai for their encouragement; my wife Hongyu
for her unfailing support, and my son Leigh for being without a dad during this project. By the way, our second boy
Leon was born on August 4, 2004. Like a baby, DWM has a bright and promising future.

John Wang, Ph.D.


Wangj@mail.montclair.edu
Dept. Information and Decision Sciences
School of Business
Montclair State University
Montclair, New Jersey
USA


About the Editor

John Wang, Ph.D., is a full professor in the Department of Information and Decision Sciences at Montclair State
University (MSU), USA. Professor Wang has published 89 refereed papers and three books. He is on the editorial board
of the International Journal of Cases on Electronic Commerce and has been a guest editor and referee for Operations
Research, IEEE Transactions on Control Systems Technology, and many other highly prestigious journals. His long-
term research goal is on the synergy of operations research, data mining and cybernetics.


Action Rules
Zbigniew W. Ras
University of North Carolina, Charlotte, USA

Angelina Tzacheva
University of North Carolina, Charlotte, USA

Li-Shiang Tsay
University of North Carolina, Charlotte, USA

INTRODUCTION

There are two aspects of interestingness of rules that have been studied in the data mining literature: objective and subjective measures (Liu, 1997; Adomavicius & Tuzhilin, 1997; Silberschatz & Tuzhilin, 1995, 1996). Objective measures are data-driven and domain-independent. Generally, they evaluate the rules based on their quality and the similarity between them. Subjective measures, including unexpectedness, novelty and actionability, are user-driven and domain-dependent.
A rule is actionable if a user can do an action to his/her advantage based on this rule (Liu, 1997). This definition, in spite of its importance, is too vague, and it leaves the door open to a number of different interpretations of actionability. In order to narrow it down, a new class of rules (called action rules), constructed from certain pairs of association rules, was proposed in Ras & Wieczorkowska (2000). A formal definition of an action rule was independently proposed in Geffner & Wainer (1998). These rules have been investigated further in Tsay & Ras (2004) and Tzacheva & Ras (2004).
To give an example justifying the need for action rules, let us assume that a number of customers have closed their accounts at one of the banks. We construct, possibly the simplest, description of that group of people and next search for a new description, similar to the one we have, with the goal of identifying a new group of customers from which no one left that bank. If these descriptions have the form of rules, then they can be seen as actionable rules. Now, by comparing these two descriptions, we may find the cause why these accounts have been closed and formulate an action which, if undertaken by the bank, may prevent other customers from closing their accounts. Such actions are stimulated by action rules, and they are seen as precise hints for actionability of rules. For example, an action rule may say that by inviting people from a certain group of customers for a glass of wine by the bank, it is guaranteed that these customers will not close their accounts and do not move to another bank. Sending invitations by regular mail to all these customers, or inviting them personally by giving them a call, are examples of an action associated with that action rule.
In the paper by Ras & Gupta (2002), the authors assume that the information system is distributed and its sites are autonomous. They show that it is wise to search for action rules at remote sites when the action rules extracted at the client site cannot be implemented in practice (suggested actions are too expensive or too risky). The composition of two action rules, not necessarily extracted at the same site, was defined in Ras & Gupta (2002). The authors gave assumptions guaranteeing the correctness of such a composition. One of these assumptions requires that the semantics of attributes, including the interpretation of null values, be the same at both sites. This assumption is relaxed in Tzacheva & Ras (2004), since the authors allow different granularities of the same attribute at the involved sites. In the same paper, they introduce the notion of the cost and feasibility of an action rule. Usually, a number of action rules or chains of action rules can be applied to reclassify a certain set of objects. The cost associated with changes of values within one attribute is usually different than the cost associated with changes of values within another attribute. The strategy of replacing the initially extracted action rule by a composition of new action rules, dynamically built, was proposed in Tzacheva & Ras (2004). This composition of rules uniquely defines a new action rule, and it is built with the goal of lowering the cost of reclassifying the objects supported by the initial action rule.

BACKGROUND

In the paper by Ras & Wieczorkowska (2000), the notion of an action rule was introduced. The main idea was to generate, from a database, a special type of rules which basically form a hint to users showing a way to

Action Rules

reclassify objects with respect to some distinguished attribute (called a decision attribute). Clearly, each relational schema gives a list of attributes used to represent objects stored in a database. Values of some of these attributes, for a given object, can be changed, and this change can be influenced and controlled by the user. However, some of these changes (for instance, profit) cannot be made directly to a decision attribute. In such a case, definitions of this decision attribute in terms of other attributes (called classification attributes) have to be learned. These new definitions are used to construct action rules showing what changes in values of some attributes, for a given class of objects, are needed to reclassify objects the way users want. But users may still be either unable or unwilling to proceed with actions leading to such changes. In all such cases, we may search for definitions of values of any classification attribute listed in an action rule. By replacing the value of such an attribute by its definition, extracted either locally or at remote sites (if the system is distributed), we construct new action rules, which might be of more interest to business users than the initial rule.

MAIN THRUST

The technology dimension will be explored to clarify the meaning of actionable rules, including action rules and extended action rules.

Action Rules Discovery in a Stand-alone Information System

An information system is used for representing all knowledge. Its definition, given here, is due to Pawlak (1991).
By an information system we mean a pair S = (U, A), where:

1. U is a nonempty, finite set of objects (object identifiers),
2. A is a nonempty, finite set of attributes, that is, a : U → Va for a ∈ A, where Va is called the domain of a.

Information systems can be seen as decision tables. In any decision table, together with the set of attributes, a partition of that set into conditions and decisions is given. Additionally, we assume that the set of conditions is partitioned into stable and flexible conditions (Ras & Wieczorkowska, 2000).
An attribute a ∈ A is called stable for the set U if its values assigned to objects from U cannot change in time. Otherwise, it is called flexible. Date of Birth is an example of a stable attribute. The interest rate on any customer account is an example of a flexible attribute. For simplicity reasons, we will consider decision tables with only one decision. We adopt the following definition of a decision table:
By a decision table we mean an information system S = (U, A1 ∪ A2 ∪ {d}), where d ∉ A1 ∪ A2 is a distinguished attribute called the decision. The elements of A1 are called stable conditions, whereas the elements of A2 ∪ {d} are called flexible conditions. Our goal is to change values of attributes in A2 for some objects from U so that the values of the attribute d for these objects may change as well. Certain relationships between attributes from A1 ∪ A2 and the attribute d will have to be discovered first.
By Dom(r) we mean all attributes listed in the IF part of a rule r extracted from S. For example, if r = [(a1,3)*(a2,4) → (d,3)] is a rule, then Dom(r) = {a1, a2}. By d(r) we denote the decision value of rule r. In our example, d(r) = 3.
If r1, r2 are rules and B ⊆ A1 ∪ A2 is a set of attributes, then r1/B = r2/B means that the conditional parts of rules r1, r2 restricted to attributes B are the same. For example, if r1 = [(a1,3) → (d,3)], then r1/{a1} = r/{a1}.
Assume also that (a, v → w) denotes the fact that the value of attribute a has been changed from v to w. Similarly, the term (a, v → w)(x) means that a(x) = v has been changed to a(x) = w. In other words, the property (a,v) of an object x has been changed to the property (a,w).
Assume now that rules r1, r2 have been extracted from S, r1/A1 = r2/A1, d(r1) = k1, d(r2) = k2 and k1 < k2. Also, assume that (b1, b2, …, bp) is a list of all attributes in Dom(r1) ∩ Dom(r2) ∩ A2 on which r1, r2 differ, with r1(b1) = v1, r1(b2) = v2, …, r1(bp) = vp and r2(b1) = w1, r2(b2) = w2, …, r2(bp) = wp.
By a (r1,r2)-action rule on x ∈ U we mean the statement:

[(b1, v1 → w1) ∧ (b2, v2 → w2) ∧ … ∧ (bp, vp → wp)](x) ⇒ [(d, k1 → k2)](x).

If the value of the rule on x is true, then the rule is valid. Otherwise, it is false.
Let us denote by U<r1> the set of all customers in U supporting the rule r1. If the (r1,r2)-action rule is valid on x ∈ U<r1>, then we say that the action rule supports the new profit ranking k2 for x.

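To make the construction operational, the short Python sketch below assembles the list of (attribute, v → w) changes for a pair of rules that agree on their shared stable attributes. The dictionary encoding, the function name and the example values are illustrative assumptions, not part of the original formulation; the two rules used happen to be the ones discussed with Table 2 later in this article.

```python
# Hypothetical helper (not from the article): given two classification rules that
# agree on the stable attributes they share, list the flexible-attribute changes
# (b, v -> w) and the induced change of the decision attribute d.
def action_rule(r1, r2, stable, flexible, decision="d"):
    """r1, r2: dicts mapping attributes to values, with the decision value under 'd'."""
    common_stable = set(r1) & set(r2) & set(stable)
    if any(r1[a] != r2[a] for a in common_stable):
        return None                                  # r1/A1 != r2/A1: no (r1,r2)-action rule
    changes = [(b, r1[b], r2[b])                     # attributes in Dom(r1) & Dom(r2) & A2
               for b in sorted(set(r1) & set(r2) & set(flexible))
               if r1[b] != r2[b]]
    return {"changes": changes, "effect": (decision, r1[decision], r2[decision])}

# r1 = [(a,1)*(b,1) -> (d,L)],  r2 = [(c,2)*(a,2) -> (d,H)];  stable: b, c;  flexible: a
r1 = {"a": 1, "b": 1, "d": "L"}
r2 = {"c": 2, "a": 2, "d": "H"}
print(action_rule(r1, r2, stable={"b", "c"}, flexible={"a"}))
# {'changes': [('a', 1, 2)], 'effect': ('d', 'L', 'H')}   i.e.  [(a, 1 -> 2)](x) => [(d, L -> H)](x)
```

In the same spirit, the extended version of such a rule (introduced next) additionally carries the stable conditions of r2 as a guard on the objects to which the changes are applied.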

To define an extended action rule (Ras & Tsay, 2003), let us assume that two rules are considered. We present them in Table 1 to better clarify the process of constructing extended action rules. Here, St means a stable classification attribute and Fl means a flexible one.

Table 1.

A (St)  B (Fl)  C (St)  E (Fl)  G (St)  H (Fl)  D (Decision)
a1      b1      c1      e1                      d1
a1      b2                      g2      h2      d2

In a classical representation, these two rules have the form:

r1 = [ a1 * b1 * c1 * e1 → d1 ],
r2 = [ a1 * b2 * g2 * h2 → d2 ].

Assume now that object x supports rule r1, which means that it is classified as d1. In order to reclassify x to class d2, we need to change its value of B from b1 to b2, but we also have to require that G(x) = g2 and that the value of H for object x be changed to h2. This is the meaning of the extended (r1,r2)-action rule given below:

[(B, b1 → b2) ∧ (G = g2) ∧ (H, → h2)](x) ⇒ (D, d1 → d2)(x).

Assume now that by Sup(t) we mean the number of tuples having property t. By the support of the extended (r1,r2)-action rule (given above) we mean:

Sup[(A=a1)*(B=b1)*(G=g2)].

By the confidence of the extended (r1,r2)-action rule (given above) we mean:

[Sup[(A=a1)*(B=b1)*(G=g2)*(D=d1)] / Sup[(A=a1)*(B=b1)*(G=g2)]] ⋅ [Sup[(A=a1)*(B=b2)*(C=c1)*(D=d2)] / Sup[(A=a1)*(B=b2)*(C=c1)]].

To give another example of an extended action rule, assume that S = (U, A1 ∪ A2 ∪ {d}) is a decision table represented by Table 2, with A1 = {c, b} and A2 = {a}.

Table 2.

      c   a   b   d
x1    2   1   1   L
x2    1   2   2   L
x3    2   2   1   H
x4    1   1   1   L

For instance, rules r1 = [(a,1)*(b,1) → (d,L)] and r2 = [(c,2)*(a,2) → (d,H)] can be extracted from S, where U<r1> = {x1, x4}. The extended (r1,r2)-action rule

[(a, 1 → 2) ∧ (c = 2)](x) ⇒ [(d, L → H)](x)

is only supported by object x1. The corresponding (r1,r2)-action rule

[(a, 1 → 2)](x) ⇒ [(d, L → H)](x)

is supported by x1 and x4.
The confidence of an extended action rule is higher than the confidence of the corresponding action rule, because all objects making the confidence of that action rule lower have been removed from its set of support.

Action Rules Discovery in Distributed Autonomous Information Systems

In Ras & Dardzinska (2002), the notion of a Distributed Autonomous Knowledge System (DAKS) framework was introduced. DAKS is seen as a collection of knowledge systems, where each knowledge system is initially defined as an information system coupled with a set of rules (called a knowledge base) extracted from that system. These rules are transferred between sites due to the requests of a query answering system associated with the client site. Each rule transferred from one site of DAKS to another remains at both sites.
Assume now that information system S represents one of the DAKS sites. If rules extracted from S = (U, A1 ∪ A2 ∪ {d}), describing values of attribute d in terms of attributes from A1 ∪ A2, do not lead to any useful action rules (the user is not willing to undertake any actions suggested by the rules), we may:

1) search for definitions of flexible attributes listed in the classification parts of these rules in terms of other local flexible attributes (local mining for rules),
2) search for definitions of flexible attributes listed in the classification parts of these rules in terms of flexible attributes from another site (mining for rules at remote sites),
3) search for definitions of decision attributes of these rules in terms of flexible attributes from another site (mining for rules at remote sites).

Another problem which has to be taken into consideration is the semantics of attributes that are common to a client site and some of the remote sites. This semantics may easily differ from site to site. Sometimes, such a difference in semantics can be repaired quite easily. For instance, if Temperature in Celsius is used at one site and Temperature in Fahrenheit at the other, a simple mapping will fix the problem. If information systems are complete and two attributes have the same name and differ only in their granularity level, a new hierarchical attribute can be formed to fix the problem.

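As a check on the Table 2 example, the sketch below (an illustration using the same assumed dictionary representation as before, not code from the article) counts supports in the decision table and combines them into the support and confidence of the extended rule [(a, 1 → 2) ∧ (c = 2)](x) ⇒ [(d, L → H)](x).

```python
# Table 2 of the article, encoded as a dict of objects (an assumed representation).
table = {
    "x1": {"c": 2, "a": 1, "b": 1, "d": "L"},
    "x2": {"c": 1, "a": 2, "b": 2, "d": "L"},
    "x3": {"c": 2, "a": 2, "b": 1, "d": "H"},
    "x4": {"c": 1, "a": 1, "b": 1, "d": "L"},
}

def sup(**cond):
    """Number of objects satisfying every (attribute = value) condition."""
    return sum(all(row[k] == v for k, v in cond.items()) for row in table.values())

print([x for x, row in table.items() if row["a"] == 1 and row["b"] == 1])   # U<r1> = ['x1', 'x4']
support = sup(a=1, b=1, c=2)                       # objects the extended rule can act on: only x1
confidence = (sup(a=1, b=1, c=2, d="L") / sup(a=1, b=1, c=2)) * \
             (sup(c=2, a=2, d="H") / sup(c=2, a=2))
print(support, confidence)                         # 1 1.0
```

Both factors of the confidence product equal 1 here because, in this small table, each of the two underlying rules is exact on its supporting objects.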

If databases are incomplete, the problem is more complex because of the number of options available to interpret incomplete values (including null values). The problem is especially difficult in a distributed framework, when chase techniques based on rules extracted at the client and at remote sites are used by a client site to impute current values by values which are less incomplete. These problems are presented, and partial solutions given, in Ras & Dardzinska (2002).
Now, let us assume that the action rule

r = [(b1, v1 → w1) ∧ (b2, v2 → w2) ∧ … ∧ (bp, vp → wp)](x) ⇒ (d, k1 → k2)(x),

extracted from system S, does not provide any useful hint to a user for its actionability. In this case we may look for a new action rule (extracted either from S or from some of its remote sites)

r1 = [(bj1, vj1 → wj1) ∧ (bj2, vj2 → wj2) ∧ … ∧ (bjq, vjq → wjq)](y) ⇒ (bj, vj → wj)(y)

which, concatenated with r, may provide a better hint for its actionability. For simplicity reasons, we assume that the semantics and the granularity levels of all attributes listed in both information systems are the same.
The concatenation [r1 ∘ r] is a new action rule (called global), defined as:

[(b1, v1 → w1) ∧ [(bj1, vj1 → wj1) ∧ (bj2, vj2 → wj2) ∧ … ∧ (bjq, vjq → wjq)] ∧ … ∧ (bp, vp → wp)](x) ⇒ (d, k1 → k2)(x)

where x is an object in S = (X, A, V).
Some of the attributes in {bj1, bj2, …, bjq} may not belong to A. Also, the support of r1 is calculated in the information system from which r1 was extracted. Let us denote that system by Sm = (Xm, Am, Vm) and the set of objects in Xm supporting r1 by SupSm(r1). Assume that SupS(r) is the set of objects in S supporting rule r. The domain of [r1 ∘ r] is the same as the domain of r, which is equal to SupS(r).
Before we define the notion of a similarity between two objects belonging to two different information systems, we assume that A = {b1, b2, b3, b4}, Am = {b1, b2, b3, b5, b6}, and objects x ∈ X, y ∈ Xm are defined by Table 3 given below.

Table 3.

      b1   b2   b3   b4   b5   b6
x     v1   v2   v3   v4
y     v1   w2   w3        w5   w6

The similarity ρ(x, y) between x and y is defined as: [1 + 0 + 0 + 1/2 + 1/2 + 1/2] / 6 = 5/12.
To give a more formal definition of similarity, we assume that:

ρ(x, y) = [Σ{ρ(bi(x), bi(y)) : bi ∈ (A ∪ Am)}] / card(A ∪ Am), where:

ρ(bi(x), bi(y)) = 0, if bi(x) ≠ bi(y),
ρ(bi(x), bi(y)) = 1, if bi(x) = bi(y),
ρ(bi(x), bi(y)) = 1/2, if either bi(x) or bi(y) is undefined.

Also, assume that

ρ(x, SupSm(r1)) = max{ρ(x, y) : y ∈ SupSm(r1)}, for each x ∈ SupS(r).

By the confidence of the action rule [r1 ∘ r] we mean

[Σ{ρ(x, SupSm(r1)) : x ∈ SupS(r)} / card(SupS(r))] ⋅ Conf(r1) ⋅ Conf(r)

where Conf(r) is the confidence of the rule r in S and Conf(r1) is the confidence of the rule r1 in Sm.
If we allow concatenating action rules extracted from S with action rules extracted at other sites of DAKS, we increase the total number of generated action rules, and at the same time our chance of finding more suitable action rules for reclassifying objects in S is also increased.
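The sketch below (again an illustrative assumption about the data representation, with missing dictionary keys standing for undefined values) implements the similarity just defined and reproduces the 5/12 value obtained for the objects x and y of Table 3.

```python
def rho(x, y):
    """Similarity of two objects given as attribute -> value dicts; missing keys are undefined."""
    attrs = set(x) | set(y)                       # plays the role of A union Am
    def per_attribute(a):
        vx, vy = x.get(a), y.get(a)
        if vx is None or vy is None:              # either value undefined
            return 0.5
        return 1.0 if vx == vy else 0.0
    return sum(per_attribute(a) for a in attrs) / len(attrs)

x = {"b1": "v1", "b2": "v2", "b3": "v3", "b4": "v4"}                # object of S
y = {"b1": "v1", "b2": "w2", "b3": "w3", "b5": "w5", "b6": "w6"}    # object of Sm
print(rho(x, y))                                  # 0.41666...  (= 5/12)
```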
FUTURE TRENDS

A business user may be either unable or unwilling to proceed with actions leading to desired reclassifications of objects. Undertaking the actions may be trivial, feasible to an acceptable degree, or practically very difficult. Therefore, the notion of the cost of an action rule is of very great importance. New strategies for discovering action rules of the lowest cost in DAKS, based on ontologies, will be investigated.

CONCLUSION

Attributes are divided into two groups: stable and flexible. By stable attributes we mean attributes whose values cannot be changed (for instance, age or maiden name). On the other hand, attributes (like percentage rate or loan approval to buy a house) whose values can be changed are called flexible. Rules are extracted from a decision table, using standard KD methods, with preference given to flexible attributes, so mainly they are listed in the classification part of rules. Most of these rules


can be seen as actionable rules and then used to construct action rules.

REFERENCES

Adomavicius, G., & Tuzhilin, A. (1997). Discovery of actionable patterns in databases: The action hierarchy approach. In Proceedings of the KDD'97 Conference, Newport Beach, CA. Menlo Park, CA: AAAI Press.

Geffner, H., & Wainer, J. (1998). Modeling action, knowledge and control. In H. Prade (Ed.), ECAI 98, 13th European Conference on AI (pp. 532-536). New York: John Wiley & Sons.

Liu, B., Hsu, W., & Chen, S. (1997). Using general impressions to analyze discovered classification rules. In Proceedings of the KDD'97 Conference, Newport Beach, CA. Menlo Park, CA: AAAI Press.

Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer.

Ras, Z., & Dardzinska, A. (2002). Handling semantic inconsistencies in query answering based on distributed knowledge mining. In Foundations of Intelligent Systems, Proceedings of the ISMIS'02 Symposium (pp. 66-74), LNAI (No. 2366). Berlin: Springer-Verlag.

Ras, Z., & Gupta, S. (2002). Global action rules in distributed knowledge systems. Fundamenta Informaticae Journal, 51(1-2), 175-184.

Ras, Z., & Wieczorkowska, A. (2000). Action rules: How to increase profit of a company. In D.A. Zighed, J. Komorowski, & J. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery, Proceedings of PKDD'00 (pp. 587-592), LNAI (No. 1910), Lyon, France. Berlin: Springer-Verlag.

Silberschatz, A., & Tuzhilin, A. (1995). On subjective measures of interestingness in knowledge discovery. In Proceedings of the KDD'95 Conference. Menlo Park, CA: AAAI Press.

Silberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 5(6).

Tsay, L.-S., & Ras, Z.W. (2005). Action rules discovery system DEAR, method and experiments. Journal of Experimental and Theoretical Artificial Intelligence, Special Issue on Knowledge Discovery, 17(1-2), 119-128.

Tzacheva, A., & Ras, Z.W. (2003). Discovering non-standard semantics of semi-stable attributes. In I. Russell & S. Haller (Eds.), Proceedings of FLAIRS-2003 (pp. 330-334), St. Augustine, Florida. Menlo Park, CA: AAAI Press.

Tzacheva, A., & Ras, Z.W. (2005). Action rules mining. International Journal of Intelligent Systems, Special Issue on Knowledge Discovery (in press).

KEY TERMS

Actionable Rule: A rule is actionable if the user can do an action to his/her advantage based on this rule.

Autonomous Information System: An information system existing as an independent entity.

Domain of Rule: The attributes listed in the IF part of a rule.

Flexible Attribute: An attribute is called flexible if its value can be changed in time.

Knowledge Base: A collection of rules defined as expressions written in predicate calculus. These rules have the form of associations between conjuncts of values of attributes.

Ontology: An explicit formal specification of how to represent objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships holding among them. Systems that share the same ontology are able to communicate about a domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with the definitions in the ontology.

Semantics: The meaning of expressions written in some language, as opposed to their syntax, which describes how symbols may be combined independently of their meaning.

Stable Attribute: An attribute is called stable for the set U if its values assigned to objects from U cannot change in time.


Active Disks for Data Mining


Alexander Thomasian
New Jersey Institute of Technology, USA

INTRODUCTION

Active disks allow the downloading of certain types of processing from the host computer onto disks, more specifically the disk controller, which has access to a limited amount of local memory serving as the disk cache. Data read from a disk may be sent directly to the host computer as usual or processed locally at the disk. Only filtered information is uploaded to the host computer. In this article, we are interested in downloading data mining and database applications to the disk controller.
This article is organized as follows. A section on background information is followed by a discussion of data-mining operations and hardware technology trends. The more influential active disk projects are reviewed next. Future trends and conclusions appear last.

BACKGROUND

Data mining has been defined as the use of algorithms to extract information and patterns as part of Knowledge Discovery in Databases (KDD) (Dunham, 2003). Certain aspects of data mining introduced a decade ago were computationally challenging for the computer systems at that time. This was partially due to the high cost of computing and the uncertainty associated with the value of information extracted by data mining. Data mining has become more viable economically with the advent of cheap computing power based on UNIX, Linux, and Windows operating systems.
The past decade has shown a tremendous increase in the processing power of computer systems, which has made possible the previously impossible. On the other hand, with higher disk transfer rates, a single processor is required to process the incoming data from one disk, but the number of disks associated with a multiprocessor server usually exceeds the number of processors; that is, the data transfer rate is not matched by the processing power of the host computer.
Computer systems have been classified into shared-everything systems (multiple processors sharing the main memory and several disks), shared-nothing systems (multiple computers with connectivity via an interconnection network), and shared-disk systems (several computers sharing a set of disks). Parallel data mining is appropriate for large-scale data mining (Zaki, 1999), and the active disk paradigm can be considered as a low-cost scheme for parallel data mining.
In active disks the host computer partially or fully downloads an application, such as data mining, onto the microprocessors serving as the disk controllers. These microprocessors have less computing power than those associated with servers, but because servers tend to have a large number of disks, the raw computing power associated with the disk controllers may easily exceed the (raw) computing power of the server. Computing power is estimated as the sum of the MIPS (millions of instructions per second) ratings of the microprocessors. The computing power associated with the disks comes in the form of a shared-nothing system, with connectivity limited to indirect communication via the host. Amdahl's law on the efficiency of parallel processing is applicable both to multiprocessors and disk controllers: if a fraction F of the processing can be carried out in parallel with a degree of parallelism P, then the effective degree of parallelism is 1/((1 - F) + F/P) (Hennessy & Patterson, 2003). There is a higher degree of interprocessor communication cost for disk controllers.
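As a quick illustration of the Amdahl's law expression above, the following minimal Python sketch (added here for illustration; it is not from the original article, and the variable names are merely descriptive) computes the effective degree of parallelism for a given parallel fraction F and degree of parallelism P:

    def effective_parallelism(f, p):
        """Amdahl's law: effective speedup when a fraction f of the work
        runs in parallel on p units and the remainder runs serially."""
        return 1.0 / ((1.0 - f) + f / p)

    # Example: 90% of the processing is parallelizable across 16 disk controllers.
    print(effective_parallelism(0.9, 16))   # roughly 6.4, far below the ideal 16

The example makes the point in the text concrete: even with many disk controllers, the serial fraction and inter-controller communication limit the achievable speedup.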
A concise specification of active disks is as follows:

A number of important I/O intensive applications can take advantage of computational power available directly at storage devices to improve their overall performance, more effectively balance their consumption of system wide resources, and provide functionality that would not be otherwise available (Riedel, 1999).

Active disks should be helpful from the following viewpoint. The size of the database and the processing requirements for decision support systems (DSS) are growing rapidly. This is attributable to the length of the history, the level of the detail being saved, and the increased number of users and queries (Keeton, Patterson, & Hellerstein, 1998). The first two factors contribute to the capacity requirement, while all three factors contribute to the processing requirement.
With the advent of relational databases (in the mid-1970s), perceived and actual inefficiencies associated with



the processing of SQL (structured query language) queries, as compared to low-level data manipulation languages (DMLs) for hierarchical and network databases, led to numerous proposals for database machines (Stanley & Su, 1983). The following categorization is given: (a) intelligent secondary storage devices, proposed to speed up text retrieval but later modified to handle relational algebra operators, (b) database filters to accelerate table scan, such as the content-addressable file store (CAFS) from International Computers Limited (ICL), (c) associative memory systems, which retrieve data by content, and (d) database computers, which are mainly multicomputers.
Intelligent secondary storage devices can be further classified (Riedel, 1999): (a) processor per track (PPT), (b) processor per head (PPH), and (c) processor per disk (PPD). Given that modern disks have thousands of tracks, the first solution is out of the question. The second solution may require the R/W heads to be aligned simultaneously to access all the tracks on a cylinder, which is not feasible. NCR's Teradata DBC/1012 database machine (1985) is a multicomputer PPD system.
To summarize, according to the active disk paradigm, the host computer offloads the processing of data-warehousing and data-mining operators onto the embedded microprocessor controller in the disk drive. There is usually a cache associated with each disk drive, which is used to hold prefetched data but can also be used as a small memory, as mentioned previously.

MAIN THRUST

Data mining, which requires high data access bandwidths and is computationally intensive, is used to illustrate active disk applications.

Data-Mining Applications

The three main areas of data mining are (a) classification, (b) clustering, and (c) association rule mining (Dunham, 2003). A brief review is given of the methods discussed in this article.
Classification assigns items to appropriate classes by using the attributes of each item. When regression is used for this purpose, the input values are the item attributes, and the output is its class. The k-nearest-neighbor (k-NN) method uses a training set, and a new item is placed in the set whose entries appear most among the k-NNs of the target item.
k-NN queries are also used in similarity search, for example, content-based image retrieval (CBIR). Objects (images) are represented by feature vectors in the areas of object color, texture, and so forth. Similarity of objects is determined by the distance of their feature vectors. k-NN queries find the k objects with the smallest (squared) Euclidean distances.
Clustering methods group items, but unlike classification, the groups are not predefined. A distance measure, such as the Euclidean distance between feature vectors of the objects, is used to form the clusters. The agglomerative clustering algorithm is a hierarchical algorithm, which starts with as many clusters as there are data items. Agglomerative clustering tends to be expensive.
Non-hierarchical or partitional algorithms compute the clusters more efficiently. The popular k-means clustering method is in the family of squared-error clustering algorithms, which can be implemented as follows: (1) designate k randomly selected points from n points as the centroids of the clusters; (2) assign a point to the cluster whose centroid is closest to it, based on Euclidean or some other distance measure; (3) recompute the centroids for all clusters based on the items assigned to them; (4) repeat steps (2) through (3) with the new centroids until there is no change in point membership. One measure of the quality of clustering is the sum of squared distances (SSD) of points in each cluster with respect to its centroid. The algorithm may be applied several times and the results of the iteration with the smallest SSD selected. Clustering of large disk-resident datasets is a challenging problem (Dunham, 2003).
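The following short Python sketch (an illustration added to accompany this article, not code from it) implements the four k-means steps just described, using squared Euclidean distance:

    import random

    def kmeans(points, k, max_iter=100):
        """Basic k-means over a list of equal-length numeric tuples."""
        # (1) Designate k randomly selected points as the initial centroids.
        centroids = random.sample(points, k)
        assignment = [None] * len(points)
        for _ in range(max_iter):
            # (2) Assign each point to the cluster with the closest centroid.
            new_assignment = [
                min(range(k),
                    key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
                for pt in points
            ]
            # (4) Stop when no point changes its cluster membership.
            if new_assignment == assignment:
                break
            assignment = new_assignment
            # (3) Recompute each centroid as the mean of its assigned points.
            for c in range(k):
                members = [pt for pt, a in zip(points, assignment) if a == c]
                if members:
                    centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        return centroids, assignment

Running the sketch several times and keeping the result with the smallest SSD corresponds to the quality criterion mentioned above.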
Association rule mining (ARM) considers market-basket or shopping-cart data, that is, the items purchased on a particular visit to the supermarket. ARM first determines the frequent sets, which have to meet a certain support level. For example, s% support for two items A and B, such as bread and butter, implies that they appear together in s percent of transactions. Another measure is the confidence level, which is the ratio of the support for the set intersection of A and B divided by the support for A by itself. If bread and butter appear together in most market-basket transactions, then there is high confidence that customers who buy bread also buy butter. On the other hand, this is meaningful only if a significant fraction of customers bought bread, that is, the support level is high. Multiple passes over the data are required to find all association rules (with a lower bound for the support) when the number of objects is large.
Algorithms to reduce the cost of ARM include sampling, partitioning (the argument for why this works is that a frequent set of items must be frequent in at least one partition), and parallel processing (Zaki, 1999).
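To make the support and confidence measures concrete, here is a small illustrative Python snippet (an addition to this discussion, with made-up transactions) that computes both quantities for the rule bread => butter:

    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "butter"},
    ]

    n = len(transactions)
    support_bread = sum("bread" in t for t in transactions) / n              # 3/4
    support_both = sum({"bread", "butter"} <= t for t in transactions) / n   # 2/4
    confidence = support_both / support_bread                                # 2/3

    print(support_both, confidence)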
Hardware Technology Trends

A computer system, which may be a server, a workstation, or a PC, has three components most affecting its


performance: one or more microprocessors, a main memory, and magnetic disk storage. My discussion of technology trends is based on Patterson and Keeton (1998).
The main memory consists of dynamic random access memory (DRAM) chips. Memory chip capacity is quadrupled every three years, so that the memory capacity versus cost ratio is increasing 25% per year. The memory access latency is about 150 nanoseconds and, in the case of RAMBUS DRAM, the bandwidth is 800 to 1600 MB per second. The access latency is dropping 7% per year, and the bandwidth is increasing 20% per year. According to Moore's law, processor speed increases 60% per year versus 7% for main memories, so that the processor-memory performance gap grows 50% per year. Multilevel cache memories are used to bridge this gap and reduce the number of processor cycles per instruction (Hennessy & Patterson, 2003).
According to Greg Papadopulos (at Sun Microsystems), the demand for database applications exceeds the increase in central processing unit (CPU) speed according to Moore's law (Patterson & Keeton, 1998). Both are a factor of two, but the former is in 9 to 12 months, and the latter is in 18 months. Consequently, the so-called database gap is increasing with time.
Disk capacity is increasing at the rate of 60% per year. This is due to dramatic increases in magnetic recording density. Disk access time consists of queueing time, controller time, seek time, rotational latency, and transfer time. The increased disk RPMs (rotations per minute) and especially increased linear recording densities have resulted in very high transfer rates, which are increasing 60% annually. The sum of seek time and rotational latency, referred to as positioning time, is of the order of 10 milliseconds and is decreasing very slowly (8% per year). Utilizing disk access bandwidth is a very important consideration and is the motivation behind freeblock scheduling (Lumb, Schindler, Ganger, Nagle, & Riedel, 2000; Riedel, Faloutsos, Ganger, & Nagle, 2000).

Active Disk Projects

There have been several concurrent activities in the area of active disks at Carnegie Mellon University, the University of California at Santa Barbara, University of Maryland, University of California at Berkeley, and in the SmartSTOR project at IBM's Almaden Research Center.

Active Disk Projects at CMU1

Most of this effort is summarized in Riedel, Gibson, and Faloutsos (1998).
Table scan is a costly operation in relational databases, which is applied selectively when the search argument is not sargable, that is, not indexed. The data filtering for table scans can be carried out in parallel if the data is partitioned across several disks. A GROUP BY SQL statement computes a certain value, such as the minimum, the maximum, or the mean, based on some classification of the rows in the table under consideration, for example, undergraduate standing. To compute the overall average GPA, each disk sends its mean GPA and the number of participating records to the host computer.
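As an illustration of how such a distributed GROUP BY can be combined at the host, the following small Python sketch (added here for this discussion; the per-disk numbers are invented) merges per-disk partial aggregates into an overall mean GPA:

    # Each disk reports (mean GPA, number of records) for its local partition.
    partials = [(3.2, 1000), (3.5, 800), (2.9, 1200)]

    total_records = sum(count for _, count in partials)
    overall_mean = sum(mean * count for mean, count in partials) / total_records

    print(round(overall_mean, 3))   # weighted mean over all partitions

Only two numbers per disk cross the interconnect, which is the filtering benefit argued for throughout this article.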
A table scan outperforms indexing in processing k-NN queries for high dimensions. There is the additional cost of building and maintaining the index. A synthetic dataset associated with the IBM Almaden Quest project for loan applications is used to experiment with k-NN queries. The relational table contains the following attributes: age, education, salary, commission, zip code, make of car, cost of house, loan amount, years owned. In the case of categorical attributes, an exact match is required.
The Apriori algorithm for ARM is also considered in determining whether customers who purchase a particular set of items also purchase an additional item, but this is meaningful at a certain level of support. More rules are generated for lower values of support, so the limited size of the disk cache may become a bottleneck.
Analytical models and measurement results from a prototype show that active disks scale beyond the point where the server saturates.

Freeblock Scheduling

This CMU project emphasizes disk performance from the viewpoint of maximizing disk arm utilization (Lumb et al., 2000). Freeblock scheduling utilizes opportunistic reading of low-priority blocks of data from disk while the arm, having completed the processing of a high-priority request, is moving to process another such request. Opportunities for freeblock scheduling diminish as more and more blocks are being read. This is because blocks located centrally will be accessed right away, although other disk blocks located at extreme disk cylinders may require an explicit access.
In one scenario a disk processes requests by an OLTP (online transaction processing) application and a background ARM application. OLTP requests have a higher priority because transaction response time should be as low as possible, but ARM requests are processed as freeblock requests. OLTP requests access specific records, but ARM requires multiple passes over the


dataset in any order. This is a common feature of algorithms suitable for freeblock scheduling.
In the experimental study, OLTP requests are generated by transactions running at a given multiprogramming level (MPL) with a certain think time, that is, the time before the transaction generates its next request. Requests are to 8 KB blocks, and the Read:Write ratio is 2:1, as in the TPC-C benchmark (see http://www.tpc.org), while the background process accesses 4 KB blocks. The bandwidth due to freeblock requests increases with increasing MPL. Initiating low-priority requests when the disk has a low utilization is not considered in this study, because such accesses would result in an increase in the response time of disk accesses on behalf of the OLTP application. Additional seeks are required to access the remaining blocks, so that the last 5% of requests takes 30% of the time of a full scan.

Active Disks at UCSB/Maryland

The host computer acts as a coordinator, scheduler, and combiner of results, while the bulk of processing is carried out at the disks (Acharya, Uysal, & Saltz, 1998). The computation at the host initiates disklets at the disks. Disklets are disallowed to initiate disk accesses and to allocate and free memory in the disk cache. All these functions are carried out by the host computer's operating system (OS). Disklets can only access memory locations (in a disk's cache) within certain bounds specified by the host computer's OS.
Implemented are SELECT, GROUP BY, and the DATACUBE operator, which computes GROUP BYs for all possible combinations of a list of attributes (Dunham, 2003), as well as external sort, image convolution, and generating composite satellite images.
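As a toy illustration of the kind of disk-side filtering such operators enable (this sketch is an addition to the article, not code from the cited prototype), a disklet-style SELECT evaluates the predicate locally so that only qualifying records are shipped to the host:

    def disklet_select(records, predicate):
        """Disk-side SELECT: evaluate the predicate locally and return only matches."""
        return [rec for rec in records if predicate(rec)]

    # Hypothetical loan-application rows stored on one disk.
    rows = [
        {"age": 34, "salary": 52000, "loan": 9000},
        {"age": 51, "salary": 31000, "loan": 22000},
    ]
    # Only rows with salary >= 40000 cross the interconnect to the host.
    print(disklet_select(rows, lambda rec: rec["salary"] >= 40000))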

The SmartSTOR Project2

SmartSTOR is significantly different from the previous projects. It is argued in Hsu, Smith, and Young (2000) that enhancing the computing power of individual disks is not economically viable due to stringent constraints on the power budget and the cost of individual disks. The low market share for such high-cost, high-performance, and high-functionality disks would further increase their cost.
The following alternatives are considered (Hsu et al., 2000): (a) no offloading, (b) offloading operations on single tables, and (c) offloading operations on multiple tables. The third approach, which makes the most sense, is implemented in SmartSTOR, which is the controller of multiple disks and can coordinate the joining of relational tables residing on the disks. The host, which runs application programs, can offload processing to SmartSTORs, which then deliver results back to the host. Experimental results with the TPC-D benchmark are presented (see http://www.tpc.org).

FUTURE TRENDS

The offloading of activities from the host to peripheral devices has been carried out successfully in the past. SmartSTOR is an intermediate step, and the active disk is a more distant possibility, which would require standardization activities such as object-based storage devices (OSD) (http://www.snia.org).

CONCLUSION

The largest benefit stemming from active disks comes from the parallel processing capability provided by a large number of disks, in that the aggregate processing power of disk controllers may exceed the computing power of servers.
The filtering effect is another benefit of active disks. Disk transfer rates are increasing rapidly, so that by eliminating unnecessary I/O transfers, more disks can be placed on I/O buses or storage area networks (SANs).

ACKNOWLEDGMENT

Supported by NSF through Grant 0105485 in Computer Systems Architecture.

REFERENCES

Acharya, A., Uysal, M., & Saltz, J. H. (1998). Active disks: Programming model, algorithms, and evaluation. Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 81-91), USA.

Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J., Ramakrishnan, R. et al. (1996). On the computation of multidimensional aggregates. Proceedings of the 22nd International Conference on Very Large Data Bases (pp. 506-521), India.

Dunham, M. H. (2003). Data mining: Introductory and advanced topics. Prentice-Hall.


Hennessy, J. L., & Patterson, D. A. (2003). Computer architecture: A quantitative approach (3rd ed.). Morgan Kaufmann.

Hsu, W. W., Smith, A. J., & Young, H. (2000). Projecting the performance of decision support workloads with smart storage (SmartSTOR). Proceedings of the Seventh International Conference on Parallel and Distributed Systems (pp. 417-425), Japan.

Keeton, K., Patterson, D. A., & Hellerstein, J. M. (1998). A case for intelligent disks (IDISKs). ACM SIGMOD Record, 27(3), 42-52.

Lumb, C., Schindler, J., Ganger, G. R., Nagle, D. F., & Riedel, E. (2000). Towards higher disk head utilization: Extracting free bandwidth from busy disk drives. Proceedings of the Fourth Symposium on Operating Systems Design and Implementation (pp. 87-102), USA.

Patterson, D. A., & Keeton, K. (1998). Hardware technology trends and database opportunities. Keynote address of the ACM SIGMOD Conference on Management of Data. Retrieved from http://www.cs.berkeley.edu/~pattrsn/talks.html

Ramakrishnan, R., & Gehrke, J. (2003). Database management systems (3rd ed.). McGraw-Hill.

Riedel, E. (1999). Active disk - Remote execution for network attached storage (Tech. Rep. No. CMU-CS-99-177). CMU, Department of Computer Science.

Riedel, E., Faloutsos, C., Ganger, G. R., & Nagle, D. F. (2000). Data mining in an OLTP system (nearly) for free. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 13-21), USA.

Riedel, E., Gibson, G. A., & Faloutsos, C. (1998). Active storage for large scale data mining and multimedia applications. Proceedings of the 24th International Very Large Data Base Conference (pp. 62-73), USA.

Su, W.-Y.S. (1983). Advanced database machine architecture. Prentice-Hall.

Zaki, M. J. (1999). Parallel and distributed associative rule mining: A survey. IEEE Concurrency, 7(4), 14-25.

KEY TERMS

Active Disk: A disk whose controller runs application code, which can process data on disk.

Content-Addressable File Store (CAFS): Specialized hardware from ICL (the UK's International Computers Limited) used as a filter for database applications.

Database Computer or Machine: A specialized computer for database applications, which usually works in conjunction with a host computer.

Database Gap: Processing demand for database applications is twofold in nine to 12 months, but it takes 18 months for the processor speed to increase that much according to Moore's law.

Disk Access Time: Sum of seek time (ST), rotational latency (RL), and transfer time (TT). ST is the time to move the read/write heads (attached to the disk arm) to the appropriate concentric track on the disk. There is also a head selection time to select the head on the appropriate track on the disk. RL for small block transfers is half of the disk rotation time. TT is the ratio of the block size and the average disk transfer rate.

Freeblock Scheduling: A disk arm scheduling method that uses opportunistic accesses to disk blocks required for a low-priority activity.

Processor per Track/Head/Disk: The last organization corresponds to active disks.

Shared Everything/Nothing/Disks System: The main memory and disks are shared by the (multiple) processors in the first case, nothing is shared in the second case (i.e., standalone computers connected via an interconnection network), and disks are shared by processor-memory combinations in the third case.

SmartSTOR: A scheme where the disk array controller for multiple disks assists the host in processing database applications.

Table Scan: The sequential reading of all the blocks of a relational table to select a subset of its attributes based on a selection argument, which is either not indexed (called a sargable argument) or the index is not clustered.

Transaction Processing Council: This council has published numerous benchmarks for transaction processing (TPC-C), decision support (TPC-H and TPC-R), and a transactional Web benchmark that also supports browsing (TPC-W).

ENDNOTES

1 There have been several concurrent activities in the area of Active Disks: (a) Riedel, Gibson, and Faloutsos at Carnegie-Mellon University (1998) and Riedel (1999); (b) Acharya, Uysal, and Saltz (1998) at the University of California at Santa


Barbara (UCSB) and University of Maryland; (c) intelligent disks by Keeton, Patterson, and Hellerstein (1998) at the University of California (UC) at Berkeley; and (d) the SmartSTOR project at UC Berkeley and the IBM Almaden Research Center by Hsu, Smith, and Young (2000).
2 The SQL SELECT and GROUP BY statements are easy to implement. The datacube operator computes GROUP BYs for all possible combinations of a list of attributes (Dunham, 2003). The PipeHash algorithm represents the datacube as a lattice of related GROUP BYs. A directed edge connects a GROUP BY i to a GROUP BY j, if j can be generated by i and has one less attribute (Agarwal, Agrawal, Deshpande, Gupta, Naughton, Ramakrishnan, et al., 1996). Three other applications dealing with an external sort, image convolution, and generating composite satellite images are beyond the scope of this discussion.


Active Learning with Multiple Views


Ion Muslea
SRI International, USA

INTRODUCTION

Inductive learning algorithms typically use a set of labeled examples to learn class descriptions for a set of user-specified concepts of interest. In practice, labeling the training examples is a tedious, time-consuming, error-prone process. Furthermore, in some applications, the labeling of each example also may be extremely expensive (e.g., it may require running costly laboratory tests). In order to reduce the number of labeled examples that are required for learning the concepts of interest, researchers proposed a variety of methods, such as active learning, semi-supervised learning, and meta-learning.
This article presents recent advances in reducing the need for labeled data in multi-view learning tasks; that is, in domains in which there are several disjoint subsets of features (views), each of which is sufficient to learn the target concepts. For instance, as described in Blum and Mitchell (1998), one can classify segments of televised broadcast based either on the video or on the audio information; or one can classify Web pages based on the words that appear either in the pages or in the hyperlinks pointing to them. In summary, this article focuses on using multiple views for active learning and on improving multi-view active learners by using semi-supervised learning and meta-learning.

BACKGROUND

Active, Semi-Supervised, and Multi-view Learning

Most of the research on multi-view learning focuses on semi-supervised learning techniques (Collins & Singer, 1999; Pierce & Cardie, 2001) (i.e., learning concepts from a few labeled and many unlabeled examples). By themselves, the unlabeled examples do not provide any direct information about the concepts to be learned. However, as shown by Nigam, et al. (2000) and Raskutti, et al. (2002), their distribution can be used to boost the accuracy of a classifier learned from the few labeled examples.
Intuitively, semi-supervised, multi-view algorithms proceed as follows: first, they use the small labeled training set to learn one classifier in each view; then, they bootstrap the views from each other by augmenting the training set with unlabeled examples on which the other views make high-confidence predictions. Such algorithms improve the classifiers learned from labeled data by also exploiting the implicit information provided by the distribution of the unlabeled examples.
In contrast to semi-supervised learning, active learners (Tong & Koller, 2001) typically detect and ask the user to label only the most informative examples in the domain, thus reducing the user's data-labeling burden. Note that active and semi-supervised learners take different approaches to reducing the need for labeled data; the former explicitly search for a minimal set of labeled examples from which to perfectly learn the target concept, while the latter aim to improve a classifier learned from a (small) set of labeled examples by exploiting some additional unlabeled data.
In keeping with the active learning approach, this article focuses on minimizing the amount of labeled data without sacrificing the accuracy of the learned classifiers. We begin by analyzing co-testing (Muslea, 2002), which is a novel approach to active learning. Co-testing is a multi-view active learner that maximizes the benefits of labeled training data by providing a principled way to detect the most informative examples in a domain, thus allowing the user to label only these.
Then, we discuss two extensions of co-testing that cope with its main limitations: the inability to exploit the unlabeled examples that were not queried and the lack of a criterion for deciding whether a task is appropriate for multi-view learning. To address the former, we present Co-EMT (Muslea et al., 2002a), which interleaves co-testing with a semi-supervised, multi-view learner. This hybrid algorithm combines the benefits of active and semi-supervised learning by detecting the most informative examples, while also exploiting the remaining unlabeled examples. Second, we discuss Adaptive View Validation (Muslea et al., 2002b), which is a meta-learner that uses the experience acquired while solving past learning tasks to predict whether multi-view learning is appropriate for a new, unseen task.

A Motivating Problem: Wrapper Induction

Information agents such as Ariadne (Knoblock et al., 2001) integrate data from pre-specified sets of Web


sites so that they can be accessed and combined via database-like queries. For example, consider the agent in Figure 1, which answers queries such as the following:

Show me the locations of all Thai restaurants in L.A. that are A-rated by the L.A. County Health Department.

To answer this query, the agent must combine data from several Web sources:

- from Zagat's, it obtains the name and address of all Thai restaurants in L.A.;
- from the L.A. County Web site, it gets the health rating of any restaurant of interest;
- from the Geocoder, it obtains the latitude/longitude of any physical address;
- from Tiger Map, it obtains the plot of any location, given its latitude and longitude.

Figure 1. An information agent that combines data from the Zagat's restaurant guide, the L.A. County Health Department, the ETAK Geocoder, and the Tiger Map service

Information agents typically rely on wrappers to extract the useful information from the relevant Web pages. Each wrapper consists of a set of extraction rules and the code required to apply them. As manually writing the extraction rules is a time-consuming task that requires a high level of expertise, researchers designed wrapper induction algorithms that learn the rules from user-provided examples (Muslea et al., 2001).
In practice, information agents use hundreds of extraction rules that have to be updated whenever the format of the Web sites changes. As manually labeling examples for each rule is a tedious, error-prone task, one must learn high-accuracy rules from just a few labeled examples. Note that both the small training sets and the high-accuracy rules are crucial to the successful deployment of an agent. The former minimizes the amount of work required to create the agent, thus making the task manageable. The latter is required in order to ensure the quality of the agent's answer to each query: when the data from multiple sources is integrated, the errors of the corresponding extraction rules get compounded, thus affecting the quality of the final result; for instance, if only 90% of the Thai restaurants and 90% of their health ratings are extracted correctly, the result contains only 81% (90% x 90% = 81%) of the A-rated Thai restaurants.
We use wrapper induction as the motivating problem for this article because, despite the practical importance of learning accurate wrappers from just a few labeled examples, there has been little work on active learning for this task. Furthermore, as explained in Muslea (2002), existing general-purpose active learners cannot be applied in a straightforward manner to wrapper induction.

MAIN THRUST

In the context of wrapper induction, we intuitively describe three novel algorithms: Co-Testing, Co-EMT, and Adaptive View Validation. Note that these algorithms are not specific to wrapper induction, and they have been applied to a variety of domains, such as text classification, advertisement removal, and discourse tree parsing (Muslea, 2002).

Co-Testing: Multi-View Active Learning

Co-Testing (Muslea, 2002; Muslea et al., 2000), which is the first multi-view approach to active learning, works as follows:

- first, it uses a small set of labeled examples to learn one classifier in each view;
- then, it applies the learned classifiers to all unlabeled examples and asks the user to label one of the examples on which the views predict different labels;
- it adds the newly labeled example to the training set and repeats the whole process.

Intuitively, Co-Testing relies on the following observation: if the classifiers learned in each view predict a different label for an unlabeled example, at least one of them makes a mistake on that prediction. By asking the user to label such an example, Co-Testing is guaranteed to provide useful information for the view that made the mistake.
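A minimal Python sketch of the Co-Testing loop just described is given below (this is an illustration added here, not the authors' implementation; learn, predict, and ask_user_for_label stand for whatever base learner and labeling interface are assumed to be available):

    def co_testing(labeled, unlabeled, views, learn, predict, ask_user_for_label, n_queries):
        """Multi-view active learning: query only contention points, i.e.,
        unlabeled examples on which the view classifiers disagree."""
        for _ in range(n_queries):
            # Learn one classifier per view from the current labeled set.
            classifiers = [learn(labeled, view) for view in views]
            # Find the unlabeled examples on which the views predict different labels.
            contention = [x for x in unlabeled
                          if len({predict(c, x, v) for c, v in zip(classifiers, views)}) > 1]
            if not contention:
                break
            # Ask the user to label one contention point and add it to the training set.
            query = contention[0]
            labeled.append((query, ask_user_for_label(query)))
            unlabeled.remove(query)
        return [learn(labeled, view) for view in views]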


To illustrate Co-Testing for wrapper induction, consider the task of extracting restaurant phone numbers from documents similar to the one shown in Figure 2. To extract this information, the wrapper must detect both the beginning and the end of the phone number. For instance, to find where the phone number begins, one can use the following rule:

R1 = SkipTo( Phone:<i> )

This rule is applied forward, from the beginning of the page, and it ignores everything until it finds the string Phone:<i>. Note that this is not the only way to detect where the phone number begins. An alternative way to perform this task is to use the following rule:

R2 = BackTo( Cuisine ) BackTo( (Number) )

which is applied backward, from the end of the document. R2 ignores everything until it finds Cuisine and then, again, skips to the first number between parentheses.

Figure 2. The forward rule R1 and the backward rule R2 detect the beginning of the phone number. Forward and backward rules have the same semantics and differ only in terms of from where they are applied (start/end of the document) and in which direction
R1: SkipTo( Phone :<i> )    R2: BackTo(Cuisine) BackTo( (Number) )
Name: <i>Ginos </i> <p>Phone :<i> (800)111-1717 </i> <p> Cuisine :

Note that R1 and R2 represent descriptions of the same concept (i.e., beginning of phone number) that are learned in two different views (see Muslea et al. [2001] for details on learning forward and backward rules). That is, views V1 and V2 consist of the sequences of characters that precede and follow the beginning of the item, respectively. View V1 is called the forward view, while V2 is the backward view. Based on V1 and V2, Co-Testing can be applied in a straightforward manner to wrapper induction. As shown in Muslea (2002), Co-Testing clearly outperforms existing state-of-the-art algorithms, both on wrapper induction and on a variety of other real-world domains.
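To make the semantics of such forward and backward rules concrete, here is a small illustrative Python sketch (not from the original article; the landmarks come from the example above, and the (Number) landmark of R2 is approximated by searching for the opening parenthesis):

    def skip_to(doc, landmark, start=0):
        """Forward rule: scan from 'start' and return the index just after 'landmark'."""
        i = doc.find(landmark, start)
        return -1 if i < 0 else i + len(landmark)

    def back_to(doc, landmark, end=None):
        """Backward rule: scan backward from 'end' and return where 'landmark' begins."""
        return doc.rfind(landmark, 0, end)

    doc = "Name: <i>Ginos </i> <p>Phone :<i> (800)111-1717 </i> <p> Cuisine :"

    r1 = skip_to(doc, "Phone :<i>")          # R1, applied from the start of the page
    pos = back_to(doc, "Cuisine")            # R2, first step, applied from the end
    r2 = back_to(doc, "(", pos)              # R2, second step

    # r1 lands just after "Phone :<i>"; r2 lands on the "(" that opens the phone number.
    print(r1, r2)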
Co-EMT: Interleaving Active and Semi-Supervised Learning

To further reduce the need for labeled data, Co-EMT (Muslea et al., 2002a) combines active and semi-supervised learning by interleaving Co-Testing with Co-EM (Nigam & Ghani, 2000). Co-EM, which is a semi-supervised, multi-view learner, can be seen as the following iterative, two-step process: first, it uses the hypotheses learned in each view to probabilistically label all the unlabeled examples; then it learns a new hypothesis in each view by training on the probabilistically labeled examples provided by the other view.
By interleaving active and semi-supervised learning, Co-EMT creates a powerful synergy. On one hand, Co-Testing boosts Co-EM's performance by providing it with highly informative labeled examples (instead of random ones). On the other hand, Co-EM provides Co-Testing with more accurate classifiers (learned from both labeled and unlabeled data), thus allowing Co-Testing to make more informative queries.
Co-EMT was not yet applied to wrapper induction, because the existing algorithms are not probabilistic learners; however, an algorithm similar to Co-EMT was applied to information extraction from free text (Jones et al., 2003). To illustrate how Co-EMT works, we describe now the generic algorithm Co-EMTWI, which combines Co-Testing with the semi-supervised wrapper induction algorithm described next.
In order to perform semi-supervised wrapper induction, one can exploit a third view, which is used to evaluate the confidence of each extraction. This new content-based view (Muslea et al., 2003) describes the actual item to be extracted. For example, in the phone number extraction task, one can use the labeled examples to learn a simple grammar that describes the field content: (Number) Number Number. Similarly, when extracting URLs, one can learn that a typical URL starts with the string http://www., ends with the string .html, and contains no HTML tags.
Based on the forward, backward, and content-based views, one can implement the following semi-supervised wrapper induction algorithm. First, the small set of labeled examples is used to learn a hypothesis in each view. Then, the forward and backward views feed each other with unlabeled examples on which they make high-confidence extractions (i.e., strings that are extracted by either the forward or the backward rule and are also compliant with the grammar learned in the third, content-based view).
Given the previous Co-Testing and the semi-supervised learner, Co-EMTWI combines them as follows. First, the sets of labeled and unlabeled examples are used for semi-supervised learning. Second, the extraction rules that are learned in the previous step are used for Co-Testing. After making a query, the newly labeled example is added to the training set, and the whole process is repeated for a number of iterations. The empirical study in Muslea et al. (2002a) shows that, for a large variety of text classification tasks, Co-EMT outperforms both Co-Testing and the three state-of-the-art semi-supervised learners considered in that comparison.


View Validation: Are the Views Adequate for Multi-View Learning?

The problem of view validation is defined as follows: given a new unseen multi-view learning task, how does a user choose between solving it with a multi- or a single-view algorithm? In other words, how does one know whether multi-view learning will outperform pooling all features together and applying a single-view learner? Note that this question must be answered while having access to just a few labeled and many unlabeled examples: applying both the single- and multi-view active learners and comparing their relative performances is a self-defeating strategy, because it doubles the amount of required labeled data (one must label the queries made by both algorithms).
The need for view validation is motivated by the following observation: while applying Co-Testing to dozens of extraction tasks, Muslea et al. (2002b) noticed that the forward and backward views are appropriate for most, but not all, of these learning tasks. This view adequacy issue is related tightly to the best extraction accuracy reachable in each view. Consider, for example, an extraction task in which the forward and backward rules lead to a high- and low-accuracy rule, respectively. Note that Co-Testing is not appropriate for solving such tasks; by definition, multi-view learning applies only to tasks in which each view is sufficient for learning the target concept (obviously, the low-accuracy view is insufficient for accurate extraction).
To cope with this problem, one can use Adaptive View Validation (Muslea et al., 2002b), which is a meta-learner that uses the experience acquired while solving past learning tasks to predict whether the views of a new unseen task are adequate for multi-view learning. The view validation algorithm takes as input several solved extraction tasks that are labeled by the user as having views that are adequate or inadequate for multi-view learning. Then, it uses these solved extraction tasks to learn a classifier that, for new unseen tasks, predicts whether the views are adequate for multi-view learning.
The (meta-) features used for view validation are properties of the hypotheses that, for each solved task, are learned in each view (i.e., the percentage of unlabeled examples on which the rules extract the same string, the difference in the complexity of the forward and backward rules, the difference in the errors made on the training set, etc.). For both wrapper induction and text classification, Adaptive View Validation makes accurate predictions based on a modest amount of training data (Muslea et al., 2002b).

FUTURE TRENDS

There are several major areas of future work in the field of multi-view learning. First, there is a need for a view detection algorithm that automatically partitions a domain's features in views that are adequate for multi-view learning. Such an algorithm would remove the last stumbling block against the wide applicability of multi-view learning (i.e., the requirement that the user provides the views to be used). Second, in order to reduce the computational costs of active learning (re-training after each query is CPU-intensive), one must consider look-ahead strategies that detect and propose (near) optimal sets of queries. Finally, Adaptive View Validation has the limitation that it must be trained separately for each application domain (e.g., once for wrapper induction, once for text classification, etc.). A major improvement would be a domain-independent view validation algorithm that, once trained on a mixture of tasks from various domains, can be applied to any new learning task, independently of its application domain.

CONCLUSION

In this article, we focus on three recent developments that, in the context of multi-view learning, reduce the need for labeled training data.

- Co-Testing: A general-purpose, multi-view active learner that outperforms existing approaches on a variety of real-world domains.
- Co-EMT: A multi-view learner that obtains a robust behavior over a wide spectrum of learning tasks by interleaving active and semi-supervised multi-view learning.
- Adaptive View Validation: A meta-learner that uses past experiences to predict whether multi-view learning is appropriate for a new unseen learning task.

REFERENCES

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the Conference on Computational Learning Theory (COLT-1998).

Collins, M., & Singer, Y. (1999). Unsupervised models for named entity classification. Empirical Methods in

Natural Language Processing & Very Large Corpora (pp. 100-110).

Jones, R., Ghani, R., Mitchell, T., & Riloff, E. (2003). Active learning for information extraction with multiple view feature sets. Proceedings of the ECML-2003 Workshop on Adaptive Text Extraction and Mining.

Knoblock, C. et al. (2001). The Ariadne approach to Web-based information integration. International Journal of Cooperative Information Sources, 10, 145-169.

Muslea, I. (2002). Active learning with multiple views [doctoral thesis]. Los Angeles: Department of Computer Science, University of Southern California.

Muslea, I., Minton, S., & Knoblock, C. (2000). Selective sampling with redundant views. Proceedings of the National Conference on Artificial Intelligence (AAAI-2000).

Muslea, I., Minton, S., & Knoblock, C. (2001). Hierarchical wrapper induction for semi-structured sources. Journal of Autonomous Agents & Multi-Agent Systems, 4, 93-114.

Muslea, I., Minton, S., & Knoblock, C. (2002a). Active + semi-supervised learning = robust multi-view learning. Proceedings of the International Conference on Machine Learning (ICML-2002).

Muslea, I., Minton, S., & Knoblock, C. (2002b). Adaptive view validation: A first step towards automatic view detection. Proceedings of the International Conference on Machine Learning (ICML-2002).

Muslea, I., Minton, S., & Knoblock, C. (2003). Active learning with strong and weak views: A case study on wrapper induction. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-2003).

Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability of co-training. Proceedings of the Conference on Information and Knowledge Management (CIKM-2000).

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3), 103-134.

Pierce, D., & Cardie, C. (2001). Limitations of co-training for natural language learning from large datasets. Empirical Methods in Natural Language Processing, 1-10.

Raskutti, B., Ferra, H., & Kowalczyk, A. (2002). Using unlabeled data for text classification through addition of cluster parameters. Proceedings of the International Conference on Machine Learning (ICML-2002).

Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 45-66.

KEY TERMS

Active Learning: Detecting and asking the user to label only the most informative examples in the domain (rather than randomly-chosen examples).

Inductive Learning: Acquiring concept descriptions from labeled examples.

Meta-Learning: Learning to predict the most appropriate algorithm for a particular task.

Multi-View Learning: Explicitly exploiting several disjoint sets of features, each of which is sufficient to learn the target concept.

Semi-Supervised Learning: Learning from both labeled and unlabeled data.

View Validation: Deciding whether a set of views is appropriate for multi-view learning.

Wrapper Induction: Learning (highly accurate) rules that extract data from a collection of documents that share a similar underlying structure.

Administering and Managing a Data Warehouse
James E. Yao
Montclair State University, USA

Chang Liu
Northern Illinois University, USA

Qiyang Chen
Montclair State University, USA

June Lu
University of Houston-Victoria, USA

INTRODUCTION

As internal and external demands on information from managers are increasing rapidly, especially the information that is processed to serve managers' specific needs, regular databases and decision support systems (DSS) cannot provide the information needed. Data warehouses came into existence to meet these needs, consolidating and integrating information from many internal and external sources and arranging it in a meaningful format for making accurate business decisions (Martin, 1997). In the past five years, there has been a significant growth in data warehousing (Hoffer, Prescott, & McFadden, 2005). Correspondingly, this occurrence has brought up the issue of data warehouse administration and management. Data warehousing has been increasingly recognized as an effective tool for organizations to transform data into useful information for strategic decision-making. To achieve competitive advantages via data warehousing, data warehouse management is crucial (Ma, Chou, & Yen, 2000).

BACKGROUND

Since the advent of computer storage technology and higher level programming languages (Inmon, 2002), organizations, especially larger organizations, have put an enormous amount of investment in their information system infrastructures. In a 2003 IT spending survey, 45% of American company participants indicated that their 2003 IT purchasing budgets had increased compared with their budgets in 2002. Among the respondents, database applications ranked top in areas of technology being implemented or had been implemented, with 42% indicating a recent implementation (Information, 2004). The fast growth of databases enables companies to capture and store a great deal of business operation data and other business-related data. The data that are stored in the databases, either historical or operational, have been considered corporate resources and an asset that must be managed and used effectively to serve the corporate business for competitive advantages.
A database is a computer structure that houses a self-describing collection of related data (Kroenke, 2004; Rob & Coronel, 2004). This type of data is primitive, detailed, and used for day-to-day operation. The data in a warehouse is derived, meaning it is integrated, subject-oriented, time-variant, and nonvolatile (Inmon, 2002). A data warehouse is defined as an integrated decision support database whose content is derived from various operational databases (Hoffer, Prescott, & McFadden, 2005; Sen & Jacob, 1998). Often a data warehouse can be referred to as a multidimensional database because each occurrence of the subject is referenced by an occurrence of each of several dimensions or characteristics of the subject (Gillenson, 2005). Some multidimensional databases operate on a technological foundation optimal for slicing and dicing the data, where data can be thought of as existing in multidimensional cubes (Inmon, 2002). Regular databases load data in two-dimensional tables. A data warehouse can use OLAP (online analytical processing) to provide users with multidimensional views of their data, which can be visually represented as a cube for three dimensions (Senn, 2004).
With the host of differences between a database for day-to-day operation and a data warehouse for supporting the management decision-making process, the administration and management of a data warehouse is of course far from similar. For instance, a data warehouse team requires someone who does routine data extraction, transformation, and loading (ETL) from operational data-

bases into data warehouse databases. Thus the team requires a technical role called ETL Specialist. On the other hand, a data warehouse is intended to support the business decision-making process. Someone like a business analyst is also needed to ensure that business information requirements are crossed to the data warehouse development. Data in the data warehouse can be very sensitive and cross functional areas, such as personal medical records and salary information. Therefore, a higher level of security on the data is needed. Encrypting the sensitive data in the data warehouse is a potential solution. Issues such as these in data warehouse administration and management need to be defined and discussed.

MAIN THRUST

Data warehouse administration and management covers a wide range of fields. This article focuses only on data warehouse and business strategy, the data warehouse development life cycle, the data warehouse team, process management, and security management to present the current concerns and issues in data warehouse administration and management.

Data Warehouse and Business Strategy

"Data is the blood of an organization. Without data, the corporation has no idea where it stands and where it will go" (Ferdinandi, 1999, p. xi). With data warehousing, today's corporations can collect and house large volumes of data. Does the size of data volume simply guarantee you a success in your business? Does it mean that the more data you have, the more strategic advantages you have over your competitors? Not necessarily. There is no predetermined formula that can turn your information into competitive advantages (Inmon, Terdeman, & Imhoff, 2000). Thus, top management and the data administration team are confronted with the question of how to convert corporate information into competitive advantages.
A well-managed data warehouse can assist a corporation in its strategy to gain competitive advantages. This can be achieved by using an exploration warehouse, which is a direct product of the data warehouse, to identify environmental factors, formulate strategic plans, and determine business-specific objectives:

- Identifying Environmental Factors: Quantified analysis can be used for identifying a corporation's products and services, market share of specific products and services, and financial management.
- Formulating Strategic Plans: Environmental factors can be matched up against the strategic plan by identifying current market positioning, financial goals, and opportunities.
- Determining Specific Objectives: An exploration warehouse can be used to find patterns; if found, these patterns are then compared with patterns discovered previously to optimize corporate objectives (Inmon, Terdeman, & Imhoff, 2000).

While managing a data warehouse for business strategy, what needs to be taken into consideration is the difference between companies. No one formula fits every organization. Avoid using so-called templates from other companies. The data warehouse is used for your company's competitive advantages. You need to follow your company's user information requirements for strategic advantages.

Data Warehouse Development Cycle

Data warehouse system development phases are similar to the phases in the systems development life cycle (SDLC) (Adelman & Rehm, 2003). However, Barker (1998) thinks that there are some differences between the two due to the unique functional and operational features of a data warehouse. As business and information requirements change, new corporate information models evolve and are synthesized into the data warehouse in the Synthesis of Model phase. These models are then used to exploit the data warehouse in the Exploit phase. The data warehouse is updated with new data using appropriate updating strategies and linked to various data sources.
Inmon (2002) sees system development for the data warehouse environment as almost exactly the opposite of the traditional SDLC. He thinks that the traditional SDLC is concerned with and supports primarily the operational environment. The data warehouse operates under a very different life cycle called CLDS (the reverse of the SDLC). The CLDS is a classic data-driven development life cycle, but the SDLC is a classic requirements-driven development life cycle.

The Data Warehouse Team

Building a data warehouse is a large system development process. Participants of data warehouse development can range from a data warehouse administrator (DWA) (Hoffer, Prescott, & McFadden, 2005) to a business analyst (Ferdinandi, 1999). The data warehouse team is supposed to lead the organization into assuming their roles and thereby bringing about a part-

nership with the business (McKnight, 2000). A data warehouse team may have the following roles (Barker, 1998; Ferdinandi, 1999; Inmon, 2000, 2003; McKnight, 2000):

- Data Warehouse Administrator (DWA): responsible for integrating and coordinating metadata and data across many different data sources, as well as data source management, physical database design, operation, backup and recovery, security, and performance and tuning.
- Manager/Director: responsible for the overall management of the entire team to ensure that the team follows the guiding principles, business requirements, and corporate strategic plans.
- Project Manager: responsible for data warehouse project development, including matching each team member's skills and aspirations to tasks on the project plan.
- Executive Sponsor: responsible for garnering and retaining adequate resources for the construction and maintenance of the data warehouse.
- Business Analyst: responsible for determining what information is required from a data warehouse to manage the business competitively.
- System Architect: responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware and software to the client desktop configurations.
- ETL Specialist: responsible for routine work on data extraction, transformation, and loading for the warehouse databases.
- Front End Developer: responsible for developing the front-end, whether it is client-server or over the Web.
- OLAP Specialist: responsible for the development of data cubes, a multidimensional view of data in OLAP.
- Data Modeler: responsible for modeling the existing data in an organization into a schema that is appropriate for OLAP analysis.
- Trainer: responsible for training the end-users to use the system so that they can benefit from the data warehouse system.
- End User: responsible for providing feedback to the data warehouse team.

In terms of the size of the data warehouse administrator team, Inmon (2003) has several recommendations:

- a large warehouse requires more analysts;
- every 100 GB of data in a data warehouse requires another data warehouse administrator;
- a new data warehouse administrator is required for each year a data warehouse is up and running and is being used successfully;
- if an ETL tool is being written manually, many data warehouse administrators are needed; if an automation tool is used, much less staffing is required;
- an automated data warehouse database management system (DBMS) requires fewer data warehouse administrators; otherwise, more administrators are needed;
- fewer supporting staff is required if the corporate information factory (CIF) architecture is followed more closely; conversely, more staff is needed.

McKnight (2000) suggests that all the technical roles be performed full-time by dedicated personnel and that each responsible person receives specific data warehouse training.
Data warehousing is growing rapidly. As the scope and data storage size of the data warehouse change, the roles and size of a data warehouse team should be adjusted accordingly. In general, the extremes should be avoided. Without sufficient professionals, the job may not be done satisfactorily. On the other hand, too many people will certainly get the team overstaffed.

Process Management

Developing a data warehouse has become a popular but exceedingly demanding and costly activity in information systems development and management. Data warehouse vendors are competing intensively for their customers because so much of their money and prestige are at stake. Consulting vendors have redirected their attention toward this rapidly expanding market segment. User companies are faced with a serious question on which product they should buy. Sen and Jacob's (1998) advice is to first understand the process of data warehouse development before selecting the tools for its implementation. A data warehouse development process refers to the activities required to build a data warehouse (Barquin, 1997). Sen and Jacob (1998) and Ma, Chou, and Yen (2000) have identified some of these activities, which need to be managed during the data warehouse development cycle: initializing the project, establishing the technical environment, tool integration, determining scalability, developing an enterprise information architecture, designing the data warehouse database, data extraction/transformation, managing metadata, developing the end-user interface, managing the production environment, managing decision support tools and applications, and developing warehouse roll-out.
As mentioned before, data warehouse development is a large system development process. Process management is not required in every step of the development process. Devlin (1997) states that process management is required in the following areas: process schedule, which consists of a network of tasks and decision points; process map definition, which defines and maintains the network of tasks and decision points that make up a process; task initiation, which supports initiating tasks on all of the hardware/software platforms in the entire data warehouse environment; and status information enquiry, which enquires about the status of components that are running on all platforms.

Security Management

In recent years, information technology (IT) security has become one of the hottest and most important topics facing both users and providers (Senn, 2005). The goal of database security is the protection of data from accidental or intentional threats to its integrity and access (Hoffer, Prescott, & McFadden, 2005). The same is true for a data warehouse. However, higher security methods, in addition to common practices such as view-based control, integrity control, processing rights, and DBMS security, need to be used for the data warehouse due to the differences between a database and a data warehouse. One of the differences that demands a higher level of security for a data warehouse is the scope and detail level of the data in the data warehouse, such as financial transactions, personal medical records, and salary information. A method that can be used to protect data requiring a high level of security in a data warehouse is encryption and decryption.

Confidential and sensitive data can be stored in a separate set of tables to which only authorized users have access. These data can be encrypted while they are being written into the data warehouse. In this way, the data captured and stored in the data warehouse are secure and can only be accessed on an authorized basis. Three levels of security can be offered by using encryption and decryption. The first level is that only authorized users can have access to the data in the data warehouse. Each group of users, internal or external, ranging from executives to information consumers, should be granted different rights for security reasons. Unauthorized users are totally prevented from seeing the data in the data warehouse. The second level is protection from unauthorized dumping and interpretation of data. Without the right key, an unauthorized user will not be allowed to write anything into the tables; on the other hand, the existing data in the tables cannot be decrypted. The third level is protection from unauthorized access during the transmission process. Even if unauthorized access occurs during transmission, there is no harm to the encrypted data unless the user has the decryption code (Ma, Chou, & Yen, 2000).
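As a purely illustrative sketch of how such column-level encryption and access control could be realized, the following statements assume PostgreSQL with the pgcrypto extension; the table salary_fact_secure, the staging table staging_salary, the role dw_admin, and the literal key string are hypothetical names introduced here only to make the three levels concrete:

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Hypothetical separate table holding a sensitive measure in encrypted form
CREATE TABLE salary_fact_secure (
    employee_key  INTEGER,
    month_key     INTEGER,
    salary_enc    BYTEA        -- salary value encrypted at load time
);

-- First level: only an authorized role may read the table at all
REVOKE ALL ON salary_fact_secure FROM PUBLIC;
GRANT SELECT ON salary_fact_secure TO dw_admin;

-- Second level: values are encrypted while being written into the warehouse,
-- so a dump of the table is useless without the key
INSERT INTO salary_fact_secure (employee_key, month_key, salary_enc)
SELECT employee_key, month_key,
       pgp_sym_encrypt(salary::text, 'replace-with-managed-key')
FROM staging_salary;

-- Third level: authorized users holding the key decrypt on read; data
-- intercepted in transmission remains unreadable without the key
SELECT employee_key, month_key,
       pgp_sym_decrypt(salary_enc, 'replace-with-managed-key')::numeric AS salary
FROM salary_fact_secure;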
FUTURE TRENDS

Data warehousing administration and management is facing several challenges as data warehousing becomes a mature part of the infrastructure of organizations. More legislative work is necessary to protect individual privacy from abuse by government or commercial entities that have large volumes of data concerning those individuals. This protection also calls for tightened security through technology, as well as user efforts toward workable rules and regulations, while at the same time still granting a data warehouse the ability to process large datasets for meaningful analyses (Marakas, 2003).

Today's data warehouse is limited to the storage of structured data in the form of records, fields, and databases. Unstructured data, such as multimedia, maps, graphs, pictures, sound, and video files, are increasingly demanded in organizations. How to manage the storage and retrieval of unstructured data, and how to search for specific data items, pose a real challenge for data warehouse administration and management. Alternative storage, especially near-line storage (one of the two forms of alternative storage), is considered to be one of the best future solutions for managing the storage and retrieval of unstructured data in data warehouses (Marakas, 2003).

The past decade has seen the rapid rise of the Internet and the World Wide Web. Today, Web-enabled versions of all leading vendors' warehouse tools are becoming available (Moeller, 2001). This recent growth in Web use and advances in e-business applications have pushed the data warehouse from the back office, where it is accessed by only a few business analysts, to the front lines of the organization, where all employees and every customer can use it.

To accommodate this move to the front line of the organization, the data warehouse demands massive scalability for data volume as well as for performance. As the number and types of users increase rapidly, enterprise data volume is doubling in size every 9 to 12 months. Around-the-clock access to the data warehouse is becoming the norm. The data warehouse will require fast implementation, continuous scalability, and ease of management (Marakas, 2003).

Additionally, building distributed warehouses, which are normally called data marts, will be on the rise.
Other technical advances in data warehousing will include an increasing ability to exploit parallel processing, automated information delivery, greater support of object extensions, very large database support, and user-friendly Web-enabled analysis applications. These capabilities should make data warehouses of the future more powerful and easier to use, which will further increase the importance of data warehouse technology for business strategic decision making and competitive advantages (Ma, Chou, & Yen, 2000; Marakas, 2003; Pace University, 2004).

CONCLUSION

The data that organizations have captured and stored are considered organizational assets. Yet the data themselves cannot do anything until they are put to intelligent use. One way to accomplish this goal is to use data warehouse and data mining technology to transform corporate information into business competitive advantages.

What impacts data warehouses the most is Internet and Web technology. The Web browser will become the universal interface for corporations, allowing employees to browse their data warehouse worldwide on public and private networks and eliminating the need to replicate data across diverse geographic locations. Thus, strong data warehouse management sponsorship and an effective administration team may become crucial factors in providing an organization with the information services it needs.

REFERENCES

Adelman, S., & Relm, C. (2003, November 5). What are the various phases in implementing a data warehouse solution? DMReview. Retrieved from http://www.dmreview.com/article_sub.cfm?articleId=7660

Barker, R. (1998, February). Managing a data warehouse. Chertsey, UK: Veritas Software Corporation.

Barquin, F. (1997). Building, using, and managing the data warehouse. Upper Saddle River, NJ: Prentice Hall.

Devlin, B. (1997). Data warehouse: From architecture to implementation. Reading, MA: Addison-Wesley.

Ferdinandi, P.L. (1999). Data warehouse advice for managers. New York: AMACOM American Management Association.

Gillenson, M.L. (2005). Fundamentals of database management systems. New York: John Wiley & Sons Inc.

Hoffer, J.A., Prescott, M.B., & McFadden, F.R. (2005). Modern database management (7th ed.). Upper Saddle River, NJ: Prentice Hall.

Information Technology Toolbox. (2004). 2003 IToolbox spending survey. Retrieved from http://datawarehouse.ittoolbox.com/research/survey.asp

Inmon, W.H. (2000). Building the data warehouse: Getting started. Retrieved from http://www.billinmon.com/library/whiteprs/earlywp/ttbuild.pdf

Inmon, W.H. (2002). Building the data warehouse (3rd ed.). New York: John Wiley & Sons Inc.

Inmon, W.H. (2003). Data warehouse administration. Retrieved from http://www.billinmon.com/library/other/dwadmin.asp

Inmon, W.H., Terdeman, R.H., & Imhoff, C. (2000). Exploration warehousing. New York: John Wiley & Sons Inc.

Kroenke, D.M. (2004). Database processing: Fundamentals, design, and implementation (9th ed.). Upper Saddle River, NJ: Prentice Hall.

Ma, C., Chou, D.V., & Yen, D.C. (2000). Data warehousing, technology assessment and management. Industrial Management + Data Systems, 100(3), 125-137.

Marakas, G.M. (2003). Modern data warehousing, mining, and visualization: Core concepts. Upper Saddle River, NJ: Prentice Hall.

Martin, J. (1997, September). New tools for decision making. DM Review, 7, 80.

McKnight Associates, Inc. (2000). Effective data warehouse organizational roles and responsibilities. Sunnyvale, CA.

Moeller, R.A. (2001). Distributed data warehousing using web technology: How to build a more cost-effective and flexible warehouse. New York: AMACOM American Management Association.

Pace University. (2004). Emerging technology. Retrieved from http://webcomposer.pace.edu/ea10931w/Tappert/Assignment2.htm

Post, G.V. (2005). Database management systems: Designing & building business applications (3rd ed.). New York: McGraw-Hill/Irwin.

Rob, P., & Coronel, C. (2004). Database systems: Design, implementation, and management (6th ed.). Boston, MA: Course Technology.

Sen, A., & Jacob, V.S. (1998). Industrial strength data warehousing: Why process is so important and so often ignored. Communications of the ACM, 41(9), 29-31.

Senn, J.A. (2004). Information technology: Principles, practices, opportunities (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
KEY TERMS

Alternative Storage: An array of storage media that consists of two forms of storage: near-line storage and/or secondary storage.

CLDS: The facetiously named system development life cycle for analytical, DSS systems; CLDS is so named because it is in fact the reverse of the classical SDLC.

Corporate Information Factory (CIF): A logical architecture whose purpose is to deliver business intelligence and business management capabilities driven by data provided from business operations.

Data Mart: A data warehouse that is limited in scope and facility, but for a restricted domain.

Database Management System (DBMS): A set of programs used to define, administer, and process the database and its applications.

Metadata: Data about data; data concerning the structure of data in a database stored in the data dictionary.

Near-Line Storage: Siloed tape storage where siloed cartridges of tape are archived, accessed, and managed robotically.

Online Analytical Processing (OLAP): Decision Support System (DSS) tools that use multidimensional data analysis techniques to provide users with multidimensional views of their data.

System Development Life Cycle (SDLC): The methodology used by most organizations for developing large information systems.
Agent-Based Mining of User Profiles for E-Services

Pasquale De Meo
Università Mediterranea di Reggio Calabria, Italy

Giovanni Quattrone
Università Mediterranea di Reggio Calabria, Italy

Giorgio Terracina
Università della Calabria, Italy

Domenico Ursino
Università Mediterranea di Reggio Calabria, Italy

INTRODUCTION

An electronic service (e-service) can be defined as a collection of network-resident software programs that collaborate for supporting users in both accessing and selecting data and services of their interest present in a provider site. Examples of e-services are e-commerce, e-learning, and e-government applications. E-services are undoubtedly one of the engines presently supporting the Internet revolution (Hull, Benedikt, Christophides & Su, 2003). Indeed, nowadays, a large number and a great variety of providers offer their services also or exclusively via the Internet.

BACKGROUND

In spite of their spectacular development and present relevance, e-services are yet to be considered a stable technology, and various improvements could be considered for them. Many of the present suggestions for bettering them are based on the concept of adaptivity (i.e., the capability to make them more flexible in such a way as to adapt their offers and behavior to the environment in which they are operating). In this context, systems capable of constructing, maintaining, and exploiting profiles of users accessing e-services appear capable of playing a key role in the future.

Both in the past and in the present, various e-service providers exploit (usually rough) user profiles for proposing personalized offers. However, in most cases, the profile construction methodology they adopt presents some problems. Indeed, it often requires a user to spend a certain amount of time constructing and updating the user's profile; in addition, it stores only information about the proposals that the user claims to be interested in, without considering other ones somehow related to those just provided, possibly interesting the user in the future, which the user did not take into account in the past.

In spite of present user profile managers, generally when accessing an e-service, a user must personally search for the proposals of interest through it. As an example, consider the bookstore section of Amazon; whenever a customer looks for a book of interest, the customer must carry out an autonomous personal search for it throughout the pages of the site. We argue that, for improving the effectiveness of e-services, it is necessary to increase the interaction between the provider and the user, on the one hand, and to construct a rich profile of the user, taking into account the user's desires, interests, and behavior, on the other hand.

In addition, it is necessary to take into account a further important factor. Nowadays, electronic and telecommunications technology is rapidly evolving in such a way as to allow cell phones, palmtops, and wireless PDAs to navigate on the Web. These mobile devices do not have the same display or bandwidth capabilities as their desktop counterparts; nonetheless, present e-service providers deliver the same content to all device typologies (Communications of the ACM, 2002).

In the past, various approaches have been proposed for handling e-service activities; many of them are agent-based. For example:

• In Terziyan and Vitko (2002), an agent-based framework for managing commercial transactions between a buyer and a seller is proposed. It exploits a user profile that is handled by means of a content-based policy.
• In Garcia, Paternò, and Gil (2002), a multi-agent system called e-CoUSAL, capable of supporting Web-shop activities, is presented. Its activity is based on the maintenance and the exploitation of user profiles.
• In Lau, Hofstede, and Bruza (2000), WEBS, an agent-based approach for supporting e-commerce activities, is proposed. It exploits probabilistic logic rules for allowing the customer preferences for other products to be deduced.
• Ardissono, et al. (2001) describe SETA, a multi-agent system conceived for developing adaptive Web stores. SETA uses knowledge representation techniques to construct, maintain, and exploit user profiles.
• In Bradley and Smyth (2003), the system CASPER, for handling recruitment services, is proposed. Given a user, CASPER first ranks job advertisements according to an applicant's desires and then recommends job proposals to the applicant on the basis of the applicant's past behavior.
• In Razek, Frasson, and Kaltenbach (2002), a multi-agent prototype for e-learning called CITS (Confidence Intelligent Tutoring Agent) is proposed. The approach of CITS aims at being adaptive and dynamic.
• In Shang, Shi, and Chen (2001), IDEAL (Intelligent Distributed Environment for Active Learning), a multi-agent system for active distance learning, is proposed. In IDEAL, course materials are decomposed into small components called lecturelets. These are XML documents containing JAVA code; they are dynamically assembled to cover course topics according to learner progress.
• In Zaiane (2002), an approach for exploiting Web-mining techniques to build a software agent supporting e-learning activities is presented.

All these systems construct, maintain, and exploit a user profile; therefore, we can consider them adaptive w.r.t. the user; however, to the best of our knowledge, none of them is adaptive w.r.t. the device.

On the other side, in various areas of computer science research, a large variety of approaches adapting their behavior to the device the user is exploiting has been proposed. As an example:

• In Anderson, Domingos, and Weld (2001), a framework called MINPATH, capable of simplifying the browsing activity of a mobile user and taking into account the device the user is exploiting, is presented.
• In Macskassy, Dayanik, and Hirsh (2000), a framework named i-Valets is proposed for allowing a user to visit an information source by using different devices.
• Samaras and Panayiotou (2002) present a flexible agent-based system for providing wireless users with personalized access to Internet services.
• In Araniti, De Meo, Iera, and Ursino (2003), a novel XML-based multi-agent system for QoS management in wireless networks is presented.

These approaches are particularly general and interesting; however, to the best of our knowledge, none of them has been conceived for handling e-services.

MAIN THRUST

Challenges to Face

In order to overcome the problems outlined previously, some challenges must be tackled.

First, a user can access many e-services, operating in the same or in different application contexts; a faithful and complete profile of the user can be constructed only by taking into account the user's behavior while accessing all the sites. In other words, it should be possible to construct a unique structure on the user side, storing the user's profile and, therefore, representing the user's behavior while accessing all the sites.

Second, for a given user and e-service provider, it should be possible to compare the profile of the user with the offers of the provider for extracting those proposals that probably will interest the user. Existing techniques for satisfying such a requirement are based mainly on the exploitation of either log files or cookies. Techniques based on log files can register only some information about the actions carried out by the user upon accessing an e-service; however, they cannot match user preferences and e-service proposals. Vice versa, techniques based on cookies are able to carry out a certain, even if primitive, match; however, they need to know and exploit some personal information that a user might consider private.

Third, it should be necessary to overcome the typical one-size-fits-all philosophy of present e-service providers by developing systems capable of adapting their behavior to both the profile of the user and the characteristics of the device the user is exploiting for accessing them (Communications of the ACM, 2002).

System Description
The system presented in this article, the e-service adaptive manager (ESA-Manager), aims at solving all three problems mentioned previously. It is an XML-based multi-agent system for handling user accesses to e-services, capable of adapting its behavior to both user and device profiles.

In ESA-Manager, a service provider agent is present for each e-service provider, handling the proposals stored therein as well as the interaction with the user. In addition, an agent is associated with each user, adapting its behavior to the profiles of both the user and the device the user is exploiting for visiting the sites. Actually, since a user can access e-service providers by means of different devices, the user's profile cannot be stored in only one of them; as a matter of fact, it is necessary to have a unique copy of the user profile that registers the user's behavior in visiting the e-service providers during the various sessions, possibly carried out by means of different devices. For this reason, the profile of a user must be handled and stored in a support different from the devices generally exploited by the user for accessing e-service providers. As a consequence, on the user side, the exploitation of a profile agent appears compulsory, storing the profiles of both involved users and devices, as does a user-device agent, associated with a specific user operating by means of a specific device, supporting the user in his or her activities.

As previously pointed out, for each user, a unique profile is mined and maintained, storing information about the user's behavior in accessing all e-service providers1; the techniques for mining, maintaining, and exploiting user profiles are quite complex and differ slightly in the various application domains; the interested reader can find examples of them, along with the corresponding validation issues, in De Meo, Rosaci, Sarnè, Terracina, and Ursino (2003) for e-commerce and in De Meo, Garro, Terracina, and Ursino (2003) for e-learning. In this way, ESA-Manager solves the first problem mentioned previously.

Whenever a user accesses an e-service by means of a certain device, the corresponding service provider agent sends information about its proposals to the user device agent associated with the service provider agent and the device he or she is exploiting. The user device agent determines similarities between the proposals presented by the provider and the interests of the user. For each of these similarities, both the service provider agent and the user device agent cooperate for presenting to the user a group of Web pages, adapted to the exploited device, illustrating the proposal.

We argue that this behavior provides ESA-Manager with the capability of supporting the user in the search for proposals of the user's interest offered by the provider. In addition, the algorithms underlying ESA-Manager allow it to identify not only the proposals probably interesting for the user in the present, but also other ones possibly interesting for the user in the future that the user disregarded in the past (see De Meo, Rosaci, Sarnè, Terracina & Ursino [2003] for a specialization of these algorithms to e-commerce). In our opinion, this is a particularly interesting feature for a novel approach devoted to dealing with e-services.

Last, but not least, it is worth observing that since the user profile management is carried out at the user side, no information about the user profile is sent to the e-service providers. In this way, ESA-Manager solves privacy problems left open by cookies.

All the reasonings presented show that ESA-Manager is capable of solving also the second problem mentioned previously.

In ESA-Manager, the device profile plays a central role. Indeed, the proposals of a provider shown to a user, as well as their presentation formats, depend on the characteristics of the device the user is presently exploiting. However, the ESA-Manager capability of adapting its behavior to the device the user is exploiting is not restricted to the presentation format of the proposals; indeed, the exploited device can also influence the computation of the interest degree shown by a user for the proposals presented by each provider.

More specifically, one of the parameters that the interest degree associated with a proposal is based on is the time the user spends visiting the corresponding Web pages. This time is not to be considered an absolute measure, but must be normalized w.r.t. both the characteristics of the exploited device and the navigation costs (Chan, 2000). The following example allows this intuition to be clarified. Assume that a user visits a Web page two times and that each visit takes n seconds. Suppose, also, that during the first access the user exploits a mobile phone having a low processor clock and supporting a connection characterized by a low bandwidth and a high cost. During the second visit, the user uses a personal computer having a high processor clock and supporting a connection characterized by a high bandwidth and a low cost. It is possible to argue that the interest the user exhibited for the page in the former access is greater than what the user exhibited in the latter one. Also, other device parameters influence the behavior of ESA-Manager (see De Meo, Rosaci, Sarnè, Terracina & Ursino [2003] for a detailed specification of the role of these parameters). This reasoning allows us to argue that ESA-Manager also solves the third problem mentioned previously.

As already pointed out, many agents are simultaneously active in ESA-Manager; they strongly interact with each other and continuously exchange information. In this scenario, an efficient management of information exchange appears crucial. One of the most promising solutions to this problem has been the adoption of XML.
XML capabilities make it particularly suited to be exploited in agent research. In ESA-Manager, the role of XML is central; indeed, (1) the agent ontologies are stored as XML documents; (2) the agent communication language is ACML; (3) the extraction of information from the various data structures is carried out by means of XQuery; and (4) the manipulation of agent ontologies is performed by means of the Document Object Model (DOM).

FUTURE TRENDS

The spectacular growth of the Internet during the last decade has strongly conditioned the e-service landscape. Such growth is particularly striking in some application domains, such as financial services or e-government.

As an example, Internet technology has enabled the expansion of financial services by integrating the already existing, quite variegated financial data and services and by providing new channels for information delivery. For instance, in 2004, the number of households in the U.S. that will use online banking is expected to exceed approximately 24 million, nearly double the number of households at the end of 2000.

Moreover, e-services are not a leading paradigm only in business contexts; they are an emerging standard in several application domains. As an example, they are applied vigorously by governmental units at national, regional, and local levels around the world. Moreover, e-service technology is currently successfully exploited in some metropolitan networks for providing mediation tools in a democratic system in order to make citizen participation in rule- and decision-making processes more feasible and direct. These are only two examples of the role e-services can play in the e-government context. Handling and managing this technology in all these environments is one of the most challenging issues for present and future researchers.

CONCLUSION

In this article, we have proposed ESA-Manager, an XML-based and adaptive multi-agent system for supporting a user accessing an e-service provider in the search for proposals present therein that appear appealing according to the user's past interests and behavior.

We have shown that ESA-Manager is adaptive w.r.t. the profile of both the user and the device the user is exploiting for accessing the e-service provider. Finally, we have seen that it is XML-based, since XML is exploited both for storing the agent ontologies and for handling the agent communication.

As for future work, we argue that various improvements could be performed on ESA-Manager to better its effectiveness and completeness. As an example, it might be interesting to categorize involved users on the basis of their profiles, as well as involved providers on the basis of their proposals. As a further example of profitable features with which our system could be enriched, we consider the derivation of association rules representing and predicting user behavior on accessing one or more providers to be extremely promising. Finally, ESA-Manager could be made even more adaptive by considering the possibility of adapting its behavior on the basis not only of the device a user is exploiting during a certain access, but also of the context (e.g., job, holidays) in which the user is currently operating.

REFERENCES

Adaptive Web. (2002). Communications of the ACM, 45(5).

Anderson, C.R., Domingos, P., & Weld, D.S. (2001). Adaptive Web navigation for wireless devices. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, Washington.

Araniti, G., De Meo, P., Iera, A., & Ursino, D. (2003). Adaptively controlling the QoS of multimedia wireless applications through user-profiling techniques. Journal of Selected Areas in Communications, 21(10), 1546-1556.

Ardissono, L. et al. (2001). Agent technologies for the development of adaptive Web stores. Agent Mediated Electronic Commerce, The European AgentLink Perspective (pp. 194-213). Lecture Notes in Computer Science, Springer.

Bradley, K., & Smyth, B. (2003). Personalized information ordering: A case study in online recruitment. Knowledge-Based Systems, 16(5-6), 269-275.

Chan, P.K. (2000). Constructing Web user profiles: A non-invasive learning approach. Web Usage Analysis and User Profiling, 39-55. Springer.

De Meo, P., Garro, A., Terracina, G., & Ursino, D. (2003). X-Learn: An XML-based, multi-agent system for supporting user-device adaptive e-learning. Proceedings of the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE 2003), Taormina, Italy.
De Meo, P., Rosaci, D., Sarnè, G.M.L., Terracina, G., & Ursino, D. (2003). An XML-based adaptive multi-agent system for handling e-commerce activities. Proceedings of the International Conference on Web Services-Europe (ICWS-Europe 03), Erfurt, Germany.

Garcia, F.J., Paternò, F., & Gil, A.B. (2002). An adaptive e-commerce system definition. Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH 2002), Malaga, Spain.

Hull, R., Benedikt, M., Christophides, V., & Su, J. (2003). E-services: A look behind the curtain. Proceedings of the Symposium on Principles of Database Systems (PODS 2003), San Diego, California.

Lau, R., Hofstede, A., & Bruza, P. (2000). Adaptive profiling agents for electronic commerce. Proceedings of the CollECTeR Conference on Electronic Commerce (CollECTeR 2000), Breckenridge, Colorado.

Macskassy, S.A., Dayanik, A.A., & Hirsh, H. (2000). Information valets for intelligent information access. Proceedings of the AAAI Spring Symposia Series on Adaptive User Interfaces (AUI-2000), Stanford, California.

Razek, M.A., Frasson, C., & Kaltenbach, M. (2002). Toward more effective intelligent distance learning environments. Proceedings of the International Conference on Machine Learning and Applications (ICMLA 2002), Las Vegas, Nevada.

Samaras, G., & Panayiotou, C. (2002). Personalized portals for the wireless user based on mobile agents. Proceedings of the International Workshop on Mobile Commerce, Atlanta, Georgia.

Shang, Y., Shi, H., & Chen, S. (2001). An intelligent distributed environment for active learning. Proceedings of the ACM International Conference on World Wide Web (WWW 2001), Hong Kong.

Terziyan, V., & Vitko, O. (2002). Intelligent information management in mobile electronic commerce. Artificial Intelligence News, Journal of Russian Association of Artificial Intelligence, 5.

Zaiane, O.R. (2002). Building a recommender agent for e-learning systems. Proceedings of the International Conference on Computers in Education (ICCE 2002), Auckland, New Zealand.

KEY TERMS

ACML: The XML encoding of the Agent Communication Language defined by the Foundation for Intelligent Physical Agents (FIPA).

Adaptive System: A system adapting its behavior on the basis of the environment it is operating in.

Agent: A computational entity capable of both perceiving dynamic changes in the environment it is operating in and autonomously performing user-delegated tasks, possibly by communicating and cooperating with other similar entities.

Agent Ontology: A description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.

Device Profile: A model of a device storing information about both its costs and capabilities.

E-Service: A collection of network-resident software programs that collaborate for supporting users in both accessing and selecting data and services of their interest handled by a provider site. Examples of e-services are e-commerce, e-learning, and e-government applications.

eXtensible Markup Language (XML): The novel language, standardized by the World Wide Web Consortium, for representing, handling, and exchanging information on the Web.

Multi-Agent System (MAS): A loosely coupled network of software agents that interact to solve problems that are beyond the individual capacities or knowledge of each of them. An MAS distributes computational resources and capabilities across a network of interconnected agents. The agent cooperation is handled by means of an Agent Communication Language.

User Modeling: The process of gathering information specific to each user, either explicitly or implicitly. This information is exploited in order to customize the content and the structure of a service to the user's specific and individual needs.

User Profile: A model of a user representing both the user's preferences and behavior.

ENDNOTE

1 It is worth pointing out that providers could be either homogeneous (i.e., all of them operate in the same application context, such as e-commerce) or heterogeneous (i.e., they operate in different application contexts).
Aggregate Query Rewriting in Multidimensional Databases

Leonardo Tininini
CNR - Istituto di Analisi dei Sistemi e Informatica Antonio Ruberti, Italy

INTRODUCTION

An efficient query engine is certainly one of the most important components in data warehouses (also known as OLAP systems or multidimensional databases), and its efficiency is influenced by many other aspects, both logical (data model, policy of view materialization, etc.) and physical (multidimensional or relational storage, indexes, etc.). As is evident, OLAP queries are often based on the usual metaphor of the data cube and the concepts of facts, measures and dimensions and, in contrast to conventional transactional environments, they require the classification and aggregation of enormous quantities of data. In spite of that, one of the fundamental requirements for these systems is the ability to perform multidimensional analyses in online response times. Since the evaluation from scratch of a typical OLAP aggregate query may require several hours of computation, this can only be achieved by pre-computing several queries, storing the answers permanently in the database and then reusing them in the query evaluation process. These pre-computed queries are commonly referred to as materialized views, and the problem of evaluating a query by using (possibly only) these precomputed results is known as the problem of answering/rewriting queries using views. In this paper we briefly analyze the difference between the query answering and query rewriting approaches and why query rewriting is preferable in a data warehouse context. We also discuss the main techniques proposed in the literature to rewrite aggregate multidimensional queries using materialized views.

BACKGROUND

Multidimensional data are obtained by applying aggregations and statistical functions to elementary data, or more precisely to data groups, each containing a subset of the data and homogeneous with respect to a given set of attributes. For example, the data "Average duration of calls in 2003 by region and call plan" is obtained from the so-called fact table, which is usually the product of complex source integration activities (Lenzerini, 2002) on the raw data corresponding to each phone call in that year. Several groups are defined, each consisting of calls made in the same region and with the same call plan, and finally the average aggregation function is applied to the duration attribute of the data in each group. The pair of values (region, call plan) is used to identify each group and is associated with the corresponding average duration value. In multidimensional databases, the attributes used to group data define the dimensions, whereas the aggregate values define the measures.
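As an illustration of how such an aggregate is derived from the fact table, a group-by query along the following lines could be used; the table and column names (calls, region, call_plan, duration, call_year) are hypothetical and serve only to make the example concrete:

SELECT region, call_plan, AVG(duration) AS avg_duration
FROM calls
WHERE call_year = 2003
GROUP BY region, call_plan

Here the pair (region, call_plan) identifies each group (the dimensions), and the average duration is the associated measure.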
The term multidimensional data comes from the well-known metaphor of the data cube (Gray, Bosworth, Layman, & Pirahesh, 1996). For each of the n attributes used to identify a single measure, a dimension of an n-dimensional space is considered. The possible values of the identifying attributes are mapped to points on the dimension's axis, and each point of this n-dimensional space is thus mapped to a single combination of the identifying attribute values and hence to a single aggregate value. The collection of all these points, along with all possible projections in lower dimensional spaces, constitutes the so-called data cube. In most cases, dimensions are structured in hierarchies, representing several granularity levels of the corresponding measures (Jagadish, Lakshmanan, & Srivastava, 1999). Hence a time dimension can be organized into days, months and years; a territorial dimension into towns, regions and countries; a product dimension into brands, families and types. When querying multidimensional data, the user specifies the measures of interest and the level of detail required by indicating the desired hierarchy level for each dimension. In a multidimensional environment querying is often an exploratory process, where the user moves along the dimension hierarchies by increasing or reducing the granularity of the displayed data. The drill-down operation corresponds to an increase in detail, for example, by requesting the number of calls by region and month, starting from data on the number of calls by region or by region and year. Conversely, roll-up allows the user to view data at a coarser level of granularity (Agrawal, Gupta, & Sarawagi, 1997; Cabibbo & Torlone, 1997).
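For instance, assuming a hypothetical star schema with a calls fact table and a time_dim dimension table, the drill-down from the (region, year) level to the (region, month) level simply corresponds to grouping the same facts by a finer level of the time hierarchy (roll-up is the reverse step):

SELECT c.region, t.year, COUNT(*) AS num_calls
FROM calls c, time_dim t
WHERE c.time_key = t.time_key
GROUP BY c.region, t.year

-- drill-down: the same facts at the finer (region, year, month) level
SELECT c.region, t.year, t.month, COUNT(*) AS num_calls
FROM calls c, time_dim t
WHERE c.time_key = t.time_key
GROUP BY c.region, t.year, t.month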
Multidimensional querying systems are commonly known as OLAP (Online Analytical Processing) systems, in contrast to conventional OLTP (Online Transactional Processing) systems.
The two types have several contrasting features, although they share the same requirement of fast online response times. In particular, one of the key differences between OLTP and OLAP queries is the number of records required to calculate the answer. OLTP queries typically involve a rather limited number of records, accessed through primary keys or other specific indexes, which need to be processed for short, isolated transactions or to be issued on a user interface. In contrast, multidimensional queries usually require the classification and aggregation of a huge amount of data (Gupta, Harinarayan, & Quass, 1995), and fast response times are made possible by the extensive use of pre-computed queries, called materialized views (whose answers are stored permanently in the database), and by sophisticated techniques enabling the query engine to exploit these pre-computed results.

MAIN THRUST

The problem of evaluating the answer to a query by using pre-computed (materialized) views has been extensively studied in the literature and generically denoted as answering queries using views (Levy, Mendelzon, Sagiv, & Srivastava, 1995; Halevy, 2001). The problem can be informally stated as follows: given a query Q and a collection of views V over the same schema s, is it possible to evaluate the answer to Q by using (only) the information provided by V? A more rigorous distinction has also been made between view-based query rewriting and query answering, corresponding to two distinct approaches to the general problem (Calvanese, De Giacomo, Lenzerini, & Vardi, 2000; Halevy, 2001). This is strictly related to the distinction between view definition and view extension, which is analogous to the standard distinction between schema and instance in the database literature. Broadly speaking, the view definition corresponds to the way the query is syntactically defined, for example to the corresponding SQL expression, while its extension corresponds to the set of returned tuples, that is, the result obtained by evaluating the view on a specific database instance.

Query Answering vs. Query Rewriting

Query rewriting is based on the use of view definitions to produce a new rewritten query, expressed in terms of available view names and equivalent to the original. The answer can then be obtained by using the rewritten query and the view extensions (instances). Query answering, in contrast, is based on the exploitation of both view definitions and extensions and attempts to determine the best possible answer, possibly a subset of the exact answer, which can be extracted from the view extensions (Abiteboul & Duschka, 1998; Grahne & Mendelzon, 1999).

In general, query answering techniques are preferable in contexts where exact answers are unlikely to be obtained (e.g., integration of heterogeneous data sources, like Web sites), and response time requirements are not very stringent. However, as noted in Grahne & Mendelzon (1999), query answering methods can be extremely inefficient, as it is difficult or even impossible to process only the "useful" views and apply optimization techniques such as pushing selections and joins. As a consequence, the rewriting approach is more appropriate in contexts such as OLAP systems, where there is a very large amount of data and fast response times are required (Goldstein & Larson, 2001), and for query optimization, where different query plans need to be maintained in main memory and efficiently compared (Afrati, Li, & Ullman, 2001).

Rewriting and Answering: An Example

Consider a fact table Cens of elementary census data on the simplified schema (Census_tract_ID, Sex, Empl_status, Educ_status, Marital_status) and a collection of aggregate data representing the resident population by sex and marital status, stored in a materialized view on the schema V: (Sex, Marital_status, Pop_res). For simplicity, it is assumed that the dimensional tables are collapsed in the fact table Cens. A typical multidimensional query will be shown in the next section. The view V is computed by a simple count(*)-group-by query on the table Cens:

CREATE VIEW V AS
SELECT Sex, Marital_status, COUNT(*) AS Pop_res
FROM Cens
GROUP BY Sex, Marital_status

The query Q expressed by

SELECT Marital_status, COUNT(*)
FROM Cens
GROUP BY Marital_status

corresponding to the resident population by marital status can be computed without accessing the data in Cens, and can be rewritten as follows:

SELECT Marital_status, SUM(Pop_res)
FROM V
GROUP BY Marital_status

Note that the rewritten query can be obtained very efficiently by simple syntactic manipulations on Q and V, and its applicability does not depend on the records in V.

29

TEAM LinG
Aggregate Query Rewriting in Multidimensional Databases

Suppose now that some subsets of (views on) Cens are available, corresponding to the employment statuses students, employed and retired, called V_ST, V_EMP and V_RET respectively. For example, V_RET may be defined by:

CREATE VIEW V_RET AS
SELECT *
FROM Cens
WHERE Empl_status = 'retired'

It is evident that no rewriting can be obtained by using only the specified views, both because some individuals are not present in any of the views (e.g., young children, unemployed, housewives, etc.) and because some may be present in two views (a student may also be employed). However, a query answering technique tries to collect each useful accessible record and build the "best possible" answer, possibly by introducing approximations. By using the information on the census tract and a matching algorithm, most overlapping records may be determined and an estimate (lower bound) of the result obtained by summing the non-replicated contributions from the views. Obviously, this would require considerable computation time, but it might be able to produce an approximated answer in a situation where rewriting techniques would produce no answer at all.
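A rough SQL sketch of such an approximate answer, built only from the three views, could take the union of the views before aggregating; note that this only crudely approximates the record-matching step described above, since UNION eliminates a duplicate only when two rows coincide on all attributes, whereas a real matching algorithm would exploit the census tract information, and the estimate still ignores individuals outside the three views:

SELECT Marital_status, COUNT(*) AS Pop_res_estimate
FROM (SELECT * FROM V_ST
      UNION
      SELECT * FROM V_EMP
      UNION
      SELECT * FROM V_RET) Partial_cens
GROUP BY Marital_status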
Rewriting Aggregate Queries

A typical elementary multidimensional query is described by the join of the fact table with two or more dimension tables, to which an aggregate group-by query is applied (see the example query Q1 below). As a consequence, the rewriting of this form of query and view has been studied by many researchers.

SELECT D1.dim1, D2.dim2, AGG(F.measure)
FROM fact_table F, dim_table1 D1, dim_table2 D2
WHERE F.dimKey1 = D1.dimKey1
AND F.dimKey2 = D2.dimKey2
GROUP BY D1.dim1, D2.dim2     (Q1)
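As a concrete illustration (not drawn from the specific works cited below), suppose AGG is SUM and a materialized view V1 pre-aggregates the same fact table at the finer granularity (dim1, dim2, dim3); under the usual star-schema assumption that every fact row joins to exactly one row of each dimension table, Q1 can be rewritten to read only V1, exactly as in the census example of the previous section:

CREATE VIEW V1 AS
SELECT D1.dim1, D2.dim2, D3.dim3, SUM(F.measure) AS s
FROM fact_table F, dim_table1 D1, dim_table2 D2, dim_table3 D3
WHERE F.dimKey1 = D1.dimKey1
AND F.dimKey2 = D2.dimKey2
AND F.dimKey3 = D3.dimKey3
GROUP BY D1.dim1, D2.dim2, D3.dim3

-- rewriting of Q1 that accesses only the materialized view
SELECT dim1, dim2, SUM(s)
FROM V1
GROUP BY dim1, dim2

The rewriting algorithms surveyed below formalize when such a match between a query and the available views is legitimate.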
In Gupta, Harinarayan, & Quass (1995), an algorithm is proposed to rewrite conjunctive queries with aggregations using views of the same form. The technique is based on the concept of generalized projection (GP) and some transformation rules usable by an optimizer, which enable the query and views to be put in a particular normal form, based on GPSJ (Generalized Projection/Selection/Join) expressions. The query and views are analyzed in terms of their query tree, that is, the tree representing how to calculate them by applying selections, joins and generalized projections on the base relations. By using the transformation rules, the algorithm tries to produce a match between one or more view trees and subtrees (and consequently to replace the calculations with access to the corresponding materialized views). The results are extended to NGPSJ (Nested GPSJ) expressions in Golfarelli & Rizzi (2000).

In Srivastava, Dar, Jagadish, & Levy (1996) an algorithm is proposed to rewrite a single-block (conjunctive) SQL query with GROUP BY and aggregations using various views of the same form. The aggregate functions considered are MIN, MAX, COUNT and SUM. The algorithm is based on the detection of homomorphisms from view to query, as in the non-aggregate context (Levy, Mendelzon, Sagiv, & Srivastava, 1995). However, it is shown that more restrictive conditions must be considered when dealing with aggregates, as the view has to produce not only the right tuples, but also their correct multiplicities.

In Cohen, Nutt, & Serebrenik (1999, 2000) a somewhat different approach is proposed: the original query, usable views and rewritten query are all expressed in an extension of Datalog with aggregate functions (again COUNT, SUM, MIN and MAX) as the query language. Queries and views are assumed to be conjunctive. Several candidates for rewriting of particular forms are considered and, for each candidate, the views in its body are unfolded (i.e., replaced by their body in the view definition). Finally, the unfolded candidate is compared with the original query to verify equivalence by using known equivalence criteria for aggregate queries, particularly those proposed in Nutt, Sagiv, & Shurin (1998) for COUNT, SUM, MIN and MAX queries. The technique can be extended by using the equivalence criteria for AVG queries presented in Grumbach, Rafanelli, & Tininini (1999), based on the syntactic notion of isomorphism modulo a product.

In query rewriting it is important to identify the views that may actually be useful in the rewriting process: this is often referred to as the view usability problem. In the non-aggregate context, it is shown (Levy, Mendelzon, Sagiv, & Srivastava, 1995) that a conjunctive view can be used to produce a conjunctive rewritten query if a homomorphism exists from the body of the view to that of the query. Grumbach, Rafanelli, & Tininini (1999) demonstrate that more restrictive (necessary and sufficient) conditions are needed for the usability of conjunctive count views for the rewriting of conjunctive count queries, based on the concept of sound homomorphisms. It is also shown that in the presence of aggregations it is not sufficient to consider only rewritten queries of conjunctive form: more complex forms may be required, particularly those based on the concept of isomorphism modulo a product.

All rewriting algorithms proposed in the literature are based on trying to obtain a rewritten query of a particular form by using (possibly only) the available views.
An interesting question is: "Can I rewrite more by considering rewritten queries of more complex form?", and the even more ambitious one, "Given a collection of views, is the information they provide sufficient to rewrite a query?" In Grumbach & Tininini (2003) the problem is investigated in a general framework based on the concept of query subsumption. Basically, the information content of a query is characterized by its distinguishing power, that is, by its ability to determine that two database instances are different. Hence a collection of views subsumes a query if it is able to distinguish any pair of instances also distinguishable by the query, and it is shown that a query rewriting using various views exists if the views subsume the query. In the particular case of count and sum queries defined over the same fact table, an algorithm is proposed which is demonstrated to be complete. In other words, even if the algorithm (as with any algorithm of practical use) considers rewritten queries of particular forms, it is shown that no improvement could be obtained by considering rewritten queries of more complex forms.

Finally, in Grumbach & Tininini (2000) a completely different approach to the problem of aggregate rewriting is proposed. The technique is based on the idea of formally expressing the relationships (metadata) between raw and aggregate data and also among aggregate data of different types and/or levels of detail. Data is stored in standard relations, while the metadata are represented by numerical dependencies, namely Horn clauses formally expressing the semantics of the aggregate attributes. The mechanism is tested by transforming the numerical dependencies into Prolog rules and then exploiting the Prolog inference engine to produce the rewriting.

FUTURE TRENDS

Although query rewriting techniques are currently considered preferable to query answering in OLAP systems, the ever-increasing processing capabilities of modern computers may change the relevance of query answering techniques in the near future. Meanwhile, the limitations in the applicability of several rewriting algorithms show that a substantial effort is still needed, and important contributions may stem from results in other research areas like logic programming and automated reasoning. In particular, aggregate query rewriting is strictly related to the problem of query equivalence for aggregate queries, and current equivalence criteria only apply to rather simple forms of query and don't consider, for example, the combination of conjunctive formulas with nested aggregations.

Also, the results on view usability and query subsumption can be considered only preliminary, and it would be interesting to study the property of completeness of known rewriting algorithms and to provide necessary and sufficient conditions for the usability of a view to rewrite a query, even when both the query and the view are aggregate and of non-trivial form (e.g., allowing disjunction and some limited form of negation).

CONCLUSION

This paper has discussed a fundamental issue related to multidimensional query evaluation, that is, how a multidimensional query expressed in a given language can be translated, using some available materialized views, into an (efficient) evaluation plan which retrieves the necessary information and calculates the required results. We have analyzed the difference between the query answering and query rewriting approaches and discussed the main techniques proposed in the literature to rewrite aggregate multidimensional queries using materialized views.

REFERENCES

Abiteboul, S., & Duschka, O.M. (1998). Complexity of answering queries using materialized views. In ACM Symposium on Principles of Database Systems (PODS'98) (pp. 254-263).

Afrati, F.N., Li, C., & Ullman, J.D. (2001). Generating efficient plans for queries using views. In ACM International Conference on Management of Data (SIGMOD'01) (pp. 319-330).

Agrawal, R., Gupta, A., & Sarawagi, S. (1997). Modeling multidimensional databases. In International Conference on Data Engineering (ICDE'97) (pp. 232-243).

Cabibbo, L., & Torlone, R. (1997). Querying multidimensional databases. In International Workshop on Database Programming Languages (DBPL'97) (pp. 319-335).

Calvanese, D., De Giacomo, G., Lenzerini, M., & Vardi, M.Y. (2000). What is view-based query rewriting? In International Workshop on Knowledge Representation meets Databases (KRDB'00) (pp. 17-27).

Cohen, S., Nutt, W., & Serebrenik, A. (1999). Rewriting aggregate queries using views. In ACM Symposium on Principles of Database Systems (PODS'99) (pp. 155-166).

Cohen, S., Nutt, W., & Serebrenik, A. (2000). Algorithms for rewriting aggregate queries using views. In ABDIS-DASFAA Conference 2000 (pp. 65-78).
Goldstein, J., & Larson, P. (2001). Optimizing queries using materialized views: A practical, scalable solution. In ACM International Conference on Management of Data (SIGMOD'01) (pp. 331-342).

Golfarelli, M., & Rizzi, S. (2000). Comparing nested GPSJ queries in multidimensional databases. In Workshop on Data Warehousing and OLAP (DOLAP 2000) (pp. 65-71).

Grahne, G., & Mendelzon, A.O. (1999). Tableau techniques for querying information sources through global schemas. In International Conference on Database Theory (ICDT'99) (pp. 332-347).

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In International Conference on Data Engineering (ICDE'96) (pp. 152-159).

Grumbach, S., Rafanelli, M., & Tininini, L. (1999). Querying aggregate data. In ACM Symposium on Principles of Database Systems (PODS'99) (pp. 174-184).

Grumbach, S., & Tininini, L. (2000). Automatic aggregation using explicit metadata. In International Conference on Scientific and Statistical Database Management (SSDBM'00) (pp. 85-94).

Grumbach, S., & Tininini, L. (2003). On the content of materialized aggregate views. Journal of Computer and System Sciences, 66(1), 133-168.

Gupta, A., Harinarayan, V., & Quass, D. (1995). Aggregate-query processing in data warehousing environments. In International Conference on Very Large Data Bases (VLDB'95) (pp. 358-369).

Halevy, A.Y. (2001). Answering queries using views. VLDB Journal, 10(4), 270-294.

Jagadish, H.V., Lakshmanan, L.V.S., & Srivastava, D. (1999). What can hierarchies do for data warehouses? In International Conference on Very Large Data Bases (VLDB'99) (pp. 530-541).

Lenzerini, M. (2002). Data integration: A theoretical perspective. In ACM Symposium on Principles of Database Systems (PODS'02) (pp. 233-246).

Levy, A.Y., Mendelzon, A.O., Sagiv, Y., & Srivastava, D. (1995). Answering queries using views. In ACM Symposium on Principles of Database Systems (PODS'95) (pp. 95-104).

Nutt, W., Sagiv, Y., & Shurin, S. (1998). Deciding equivalences among aggregate queries. In ACM Symposium on Principles of Database Systems (PODS'98) (pp. 214-223).

Srivastava, D., Dar, S., Jagadish, H.V., & Levy, A.Y. (1996). Answering queries with aggregation using views. In International Conference on Very Large Data Bases (VLDB'96) (pp. 318-329).

KEY TERMS

Data Cube: A collection of aggregate values classified according to several properties of interest (dimensions). Combinations of dimension values are used to identify the single aggregate values in the cube.

Dimension: A property of the data used to classify it and navigate the corresponding data cube. In multidimensional databases dimensions are often organized into several hierarchical levels; for example, a time dimension may be organized into days, months and years.

Drill-Down (Roll-Up): A typical OLAP operation, by which aggregate data are visualized at a finer (coarser) level of detail along one or more analysis dimensions.

Fact: A single elementary datum in an OLAP system, the properties of which correspond to dimensions and measures.

Fact Table: A table of (integrated) elementary data grouped and aggregated in the multidimensional querying process.

Materialized View: A particular form of query whose answer is stored in the database to accelerate the evaluation of further queries.

Measure: A numeric value obtained by applying an aggregate function (such as count, sum, min, max or average) to groups of data in a fact table.

Query Answering: The process by which the (possibly approximate) answer to a given query is obtained by exploiting the stored answers and definitions of a collection of materialized views.

Query Rewriting: The process by which a source query is transformed into an equivalent one referring (almost exclusively) to a collection of materialized views. In multidimensional databases, query rewriting is fundamental in achieving acceptable (online) response times.
32

TEAM LinG
33

Aggregation for Predictive Modeling with A


Relational Data
Claudia Perlich
IBM Research, USA

Foster Provost
New York University, USA

INTRODUCTION and the information aggregation is based on existential


unification. More recent relational learning approaches
Most data mining and modeling techniques have been include distance-based methods (Kirsten et al., 2001),
developed for data represented as a single table, where propositionalization (Kramer et al., 2001; Knobbe et
every row is a feature vector that captures the character- al., 2001; Krogel et al., 2003), and upgrades of proposi-
istics of an observation. However, data in most domains tional learners such as Nave Bayes (Neville et al.,
are not of this form and consist of multiple tables with 2003), Logistic Regression (Popescul et al., 2002),
several types of entities. Such relational data are ubiq- Decision Trees (Jensen & Neville, 2002) and Bayesian
uitous; both because of the large number of multi-table Networks (Koller & Pfeffer, 1998). Similar to manual
relational databases kept by businesses and government feature construction, both upgrades and
organizations, and because of the natural, linked nature propositionalization use Boolean conditions and com-
of people, organizations, computers, and etc. Relational mon aggregates like min, max, or sum to transform
data pose new challenges for modeling and data mining, either explicitly (propositionalization) or implicitly
including the exploration of related entities and the aggre- (upgrades) the original relational domain into a tradi-
gation of information from multi-sets (bags) of related tional feature-vector representation.
entities. Recent work by Knobbe et al. (2001) and Wrobel &
Krogel (2001) recognizes the essential role of aggrega-
tion in all relational modeling and focuses specifically
BACKGROUND on the effect of aggregation choices and parameters.
Wrobel & Krogel (2003) present one of the few empiri-
Relational learning differs from traditional feature- cal comparisons of aggregation in propositionalization
vector learning both in the complexity of the data repre- approaches (however with inconclusive results). Perlich
sentation and in the complexity of the models. The & Provost (2003) show that the choice of aggregation
relational nature of a domain manifests itself in two operator can have a much stronger impact on the result-
ways: (1) entities are not limited to a single type, and (2) ant models generalization performance than the choice
entities are related to other entities. Relational learning of the model induction method (decision trees or logis-
allows the incorporation of knowledge from entities in tic regression, in their study).
multiple tables, including relationships between ob-
jects of varying cardinality. Thus, in order to succeed,
relational learners have to be able to identify related MAIN THRUST
objects and to aggregate information from bags of re-
lated objects into a final prediction. For illustration, imagine a direct marketing task where
Traditionally, the analysis of relational data has in- the objective is to identify customers who would re-
volved the manual construction by a human expert of spond to a special offer. Available are demographic
attributes (e.g., the number of purchases of a customer information and all previous purchase transactions, which
during the last three months) that together will form a include PRODUCT, TYPE and PRICE. In order to take
feature vector. Automated analysis of relational data is advantage of these transactions, information has to be
becoming increasingly important as the number and aggregated. The choice of the aggregation operator is
complexity of databases increases. Early research on crucial, since aggregation invariably involves loss of
automated relational learning was dominated by Induc- (potentially discriminative) information.
tive Logic Programming (Muggleton, 1992), where the Typical aggregation operators like min, max and sum
classification model is a set of first-order-logic clauses can only be applied to sets of numeric values, not to

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Aggregation for Predictive Modeling with Relational Data

objects (an exception being count). It is therefore neces- worked data with identifier attributes. As Knobbe et al.
sary to assume class-conditional independence and ag- (1999) point out, traditional aggregation operators like
gregate the attributes independently, which limits the min, max, and count are based on histograms. A histo-
expressive power of the model. Perlich & Provost (2003) gram itself is a crude approximation of the underlying
discuss in detail the implications of various assump- distribution. Rather than estimating one distribution for
tions and aggregation choices on the expressive power every bag of attributes, as done by traditional aggrega-
of resulting classification models. For example, cus- tion operators, this new aggregation approach estimates
tomers who buy mostly expensive books cannot be in a first step only one distribution for each class, by
identified if price and type are aggregated separately. In combining all bags of objects for the same class. The
contrast, ILP methods do not assume independence and combination of bags of related objects results in much
can express an expensive book (TYPE=BOOK and better estimates of the distribution, since it uses many
PRICE>20); however aggregation through existential more observations. The number of parameters differs
unification can only capture whether a customer bought across distributions: for a normal distribution only two
at least one expensive book, not whether he has bought parameters are required, mean and variance, whereas
primarily expensive books. Only two systems, POLKA distributions of categorical attributes have as many
(Knobbe et al., 2001) and REGLAGGS (Wrobel & parameters as possible attribute values. In a second step,
Krogel, 2001) combine Boolean conditions and nu- the bags of attributes of related objects are aggregated
meric aggregates to increase the expressive power of through vector distances (e.g., Euclidean, Cosine, Like-
the model. lihood) between a normalized vector-representation of
Another challenge is posed by categorical attributes the bag and the two class-conditional distributions.
with many possible values, such as ISBN numbers of Imagine the following example of a document clas-
books. Categorical attributes are commonly aggregated sification domain with two tables (Document and Au-
using mode (the most common value) or the count for thor) shown in Figure 1.
all values if the number of different values is small. The first aggregation step estimates the class-
These approaches would be ineffective for ISBN: it has conditional distributions DClass n of authors from the
many possible values and the mode is not meaningful Author table. Under the alphabetical ordering of
since customers usually buy only one copy of each position:value pairs, 1:A, 2:B, and 3:C, the value for
book. Many relational domains include categorical at- DClass n at position k is defined as:
tributes of this type. One common class of such do-
mains involves networked data, where most of the infor-
Number of occurrences of author k in the set of authors related to documents of class n
mation is captured by the relationships between objects, DClass n[k] =
Number of authors related to documents of class n

possibly without any further attributes. The identity of


an entity (e.g., Bill Gates) in social, scientific, and
The resulting estimates of the class-conditional dis-
economic networks may play a much more important
tributions for our example are given by:
role than any of its attributes (e.g., age or gender).
Identifiers such as name, ISBN, or SSN are categorical
DClass 0 = [0.5 0 0.5] and DClass 1 = [0.4 0.4 0.2]
attributes with excessively many possible values that
cannot be accounted for by either mode or count.
The second aggregation step is the representation
Perlich and Provost (2003) present a new multi-step
of every document as a vector:
aggregation methodology based on class-conditional
distributions that shows promising performance on net-
Number of occurrences of author k related to the document Pn
DPn[k] =
Number of authors related to document Pn

Figure 1. Example domain with two tables that are


linked through Paper ID The vector-representation for the above examples
are D P1 = [1 0 0], D P2 = [0.5 0.5 0], D P3 = [0.33 0.33
Document Table Author Table
0.33], and DP4 = [0 0 1].
Paper ID Class Paper ID Author Name
The third aggregation step calculates vector dis-
P1 0 P1 A
P2 1 P2 B
tances (e.g., cosine) between the class-conditional distri-
P3 1 P2 A
bution and the documents DP1,...,DP4. The new Document
P4 0 P3 B table with the additional cosine features is shown in
P3 A Figure 2. In this simple example, the distance from DClass
P3 C 1
separates the examples perfectly; the distance from DClass
P4 C 0
does not.

34

TEAM LinG
Aggregation for Predictive Modeling with Relational Data

Figure 2. Extended document table with new cosine high-dimensional categorical fields (author names and
features added document identifiers) are not applicable. A
The generalization performance of the new aggrega-
Document Table tion approach is related to a number of properties that are
Paper ID Class Cosine(Pn, DClass 1) Cosine(Pn, DClass 0)
of particular relevance and advantage for predictive
P1 0 0.667 0.707
P2 1 0.707 0.5 modeling:
P3 1 0.962 0.816
P4 0 0.333 0.707 Dimensionality Reduction: The use of distances
compresses the high-dimensional space of pos-
sible categorical values into a small set of dimen-
sions one for each class and distance metric. In
By taking advantage of DClass 1 and D Class 0 another new
particular, this allows the aggregation of object
aggregation approach becomes possible. Rather than
identifiers.
constructing counts for all distinct values (impossible
Preservation of Discriminative Information:
for high-dimensional categorical attributes) one can se-
Changing the class labels of the target objects
lect a small subset of values where the absolute differ-
will change the values of the aggregates. The loss
ence between entries in DClass 0 and D Class 1 is maximal. This
of discriminative information is lower since the
method would identify author B as the most discriminative.
class-conditional distributions capture signifi-
These new features, constructed from class-condi-
cant differences.
tional distributions, show superior classification per-
Domain Independence: The density estimation
formance on a variety of relational domains (Perlich &
does not require any prior knowledge about the
Provost, 2003, 2004). Table 1 summarizes the relative
application domain and therefore is suitable for a
out-of-sample performances (averaged over 10 experi-
variety of domains.
ments with standard deviations in parentheses) as pre-
Applicability to Numeric Attributes: The ap-
sented in Perlich (2003) on the CORA document classi-
proach is not limited to categorical values but can
fication task (McCallum et al., 2000) for 400 training
also be applied to numeric attributes after
examples. The data set includes information about the
discretization. Note that using traditional aggre-
authorship, citations, and the full text. This example also
gation through mean and variance assumes im-
demonstrates the opportunities arising from the ability
plicitly a normal distribution; whereas this aggre-
of relational models to take advantage of additional
gation makes no prior distributional assumptions
background information such as citations and authorship
and can capture arbitrary numeric distributions.
over simple text classification. The comparison includes
Monotonic Relationship: The use of distances to
in addition to two distribution-based feature construc-
class-conditional densities constructs numerical
tion approaches (1 and 2) using logistic regression for
features that are monotonic in the probability of
model induction: 3) a Nave Bayes classifier using the
class membership. This makes logistic regression
full text learned by the Rainbow (McCallum, 1996)
a natural choice for the model induction step.
system, 4) a Probabilistic Relational Model (Koller &
Aggregation of Identifiers: By using object iden-
Pfeffer, 1998) using traditional aggregates on both text
tifiers such as names it can overcome some of the
and citation/authorship with the results reported by Taskar
limitations of the independence assumptions and
et al. (2001), and 5) a Simple Relational Classifier (Macskassy
even allow the learning from unobserved object
& Provost, 2003) that uses only the known class labels of
properties (Perlich & Provost, 2004). The identifier
related (e.g., cited) documents. It is important to observe
represents the full information of the object and in
that traditional aggregation operators such as mode for

Table 1. Comparative classification performance

Method Used Information Accuracy

1) Class-Conditional Distributions (Authorship & Citations) 0.78 (0.01)


2) Class-Conditional Distributions and Most Discriminative Counts (Authorship & Citations) 0.81 (0.01)
3) Nave Bayes Classifier using Rainbow (Text) 0.74 (0.03)
4) Probabilistic Relational Model (Text, Authorship & Citiations) 0.74 (0.01)
5) Simple Relational Model (Related Class Labels) 0.68 (0.01)

35

TEAM LinG
Aggregation for Predictive Modeling with Relational Data

particular the joint distribution of all other attributes The potential complexity of relational models and the
and even further unknown properties. resulting computational complexity of relational modeling
Task-Specific Feature Construction: The advan- remains an obstacle to real-time applications. This limita-
tages outlined above are possible through the use tion has spawned work in efficiency improvements (Yin et
the target value during feature construction. This al., 2003; Tang et al., 2003) and will remain an important
practice requires the splitting of the training set into task.
two separate portions for 1) the class-conditional
density estimation and feature construction and 2)
the estimation of the classification model. CONCLUSION
To summarize, most relational modeling has limited Relational modeling is a burgeoning topic within ma-
itself to a small set of existing aggregation operators. chine learning research, and is applicable commonly in
The recognition of the limited expressive power moti- real-world domains. Many domains collect large
vated the combination of Boolean conditioning and amounts of transaction and interaction data, but so far
aggregation, and the development of new aggregation lack a reliable and automated mechanism for model
methodologies that are specifically designed for pre- estimation to support decision-making. Relational
dictive relational modeling. modeling with appropriate aggregation methods has the
potential to fill this gap and allow the seamless integration
of model estimation on top of existing relational data-
FUTURE TRENDS bases, relieving the analyst from the manual, time-con-
suming, and omission-prone task of feature construction.
Computer-based analysis of relational data is becoming
increasingly necessary as the size and complexity of
databases grow. Many important tasks, including REFERENCES
counter-terrorism (Tang et al., 2003), social and eco-
nomic network analysis (Jensen & Neville, 2002), docu- Deroski, S. (2001). Relational data mining applications:
ment classification (Perlich, 2003), customer relation- An overview, In S. D eroski & N. Lavra (Eds.), Rela-
ship management, personalization, fraud detection tional data mining (pp. 339-364). Berlin: Springer Verlag.
(Fawcett & Provost, 1997), and genetics [e.g., see the
overview by Deroski (2001)], used to be approached with Fawcett, T., & Provost, F. (1997). Adaptive fraud detec-
special-purpose algorithms, but now are recognized as tion. Data Mining and Knowledge Discovery, (1).
inherently relational. These application domains both
Jensen, D., & Neville, J. (2002). Data mining in social
profit from and contribute to research in relational model-
networks. In R. Breiger, K. Carley, & P. Pattison (Eds.),
ing in general and aggregation for feature construction in
Dynamic social networks modeling and analysis (pp.
particular.
287-302). The National Academies Press.
In order to accommodate such a variety of domains,
new aggregators must be developed. In particular, it is Kirsten, M., Wrobel, S., & Horvath, T. (2001). Distance
necessary to account for domain-specific dependencies based approaches to relational learning and clustering.
between attributes and entities that currently are ig- In S. Deroski & N. Lavra (Eds.), Relational data mining
nored. One common type of such dependency is the (pp. 213-234). Berlin: Springer Verlag.
temporal order of events which is important for the
discovery of causal relationships. Knobbe A.J., de Haas, M., & Siebes, A. (2001).
Aggregation as a research topic poses the opportu- Propositionalisation and aggregates. In L. DeRaedt & A.
nity for significant theoretical contributions. There is Siebes (Eds.), Proceedings of the Fifth European Confer-
little theoretical work on relational model estimation ence on Principles of Data Mining and Knowledge
outside of first-order logic. In contrast to a large body Discovery (LNAI 2168) (pp. 277-288). Berlin: Springer
of work in mathematics and the estimation of functional Verlag.
dependencies that map well-defined input spaces to Koller, D., & Pfeffer, A. (1998). Probabilistic frame-
output spaces, aggregation operators have not been in- based systems. In Proceedings of Fifteenth/Tenth Con-
vestigated nearly as thoroughly. Model estimation tasks ference on Artificial Intelligence/Innovative Applica-
are usually framed as search over a structured (either in tion of Artificial Intelligence (pp. 580-587). American
terms of parameters or increasing complexity) space of Association for Artificial Intelligence.
possible solutions. But the structuring of a search space
of aggregation operators remains an open question.

36

TEAM LinG
Aggregation for Predictive Modeling with Relational Data

Kramer, S., Lavra , N., & Flach, P. (2001). Proposition- fier attributes. Working Paper CeDER-04-04. Stern School
alization approaches to relational data mining. In S. of Business. A
Deroski & N. Lavra (Eds.), Relational data mining (pp.
Popescul, L., Ungar, H., Lawrence, S., & Pennock, D.M.
262-291). Berlin: Springer Verlag.
(2002). Structural logistic regression: Combining rela-
Krogel, M.A., Rawles, S., Zelezny, F., Flach, P.A., Lavrac, tional and statistical learning. In Proceedings of the
N., & Wrobel, S. (2003). Comparative evaluation of ap- Workshop on Multi-Relational Data Mining.
proaches to propositionalization. In T. Horvth & A.
Tang, L.R., Mooney, R.J., & Melville, P. (2003). Scaling up
Yamamoto (Eds.), Proceedings of the 13th International
ILP to large examples: Results on link discovery for
Conference on Inductive Logic Programming (LNAI
counter-terrorism. In Proceedings of the Workshop on
2835) (pp. 197-214). Berlin: Springer-Verlag.
Multi-Relational Data Mining (pp. 107-121).
Krogel, M.A., & Wrobel, S. (2001). Transformation-based
Taskar, B., Segal, E., & Koller, D. (2001). Probabilistic
learning using multirelational aggregation. In C. Rouveirol
classification and clustering in relational data. In Pro-
& M. Sebag (Eds.), Proceedings of the Eleventh Interna-
ceedings of the 17th International Joint Conference
tional Conference on Inductive Logic Programming (ILP)
on Artificial Intelligence (pp. 870-878).
(LNAI 2157) (pp. 142-155). Berlin: Springer Verlag.
Yin, X., Han, J., & Yang, J. (2003). Efficient multi-
Krogel M.A., & Wrobel, S. (2003). Facets of aggregation
relational classification by tuple ID propagation. In
approaches to propositionalization. In T. Horvth & A.
Proceedings of the Workshop on Multi-Relational
Yamamoto (Eds.), Proceedings of the Work-in-Progress
Data Mining.
Track at the 13th International Conference on Inductive
Logic Programming (pp. 30-39).
Macskassy, S.A., & Provost, F. (2003). A simple relational
classifier. In Proceedings of the Workshop of Multi- KEY TERMS
Relational Data Mining at SIGKDD-2003.
Aggregation: Also commonly called a summary, an
McCallum, A.K. (1996). Bow: A toolkit for statistical aggregation is the calculation of a value from a bag or
language modeling, text retrieval, classification and (multi)set of entities. Typical aggregations are sum,
clustering. Retrieved from http://www.cs.cmu.edu/ count, and average.
~mccallum/bow
Discretization: Conversion of a numeric variable
McCallum, A.K., Nigam, K., Rennie, J., & Seymore, K. into a categorical variable, usually though binning. The
(2000). Automating the construction of Internet portals entire range of the numeric values is split into a number
with machine learning. Information Retrieval, 3(2), 127- of bins. The numeric value of the attributes is replaced by
163. the identifier of the bin into which it falls.
Muggleton, S. (Ed.). (1992). Inductive logic program- Class-Conditional Independence: Property of a multi-
ming. London: Academic Press. variate distribution with a categorical class variable c and
a set of other variables (e.g., x and y). The probability of
Neville J., Jensen, D., & Gallagher, B. (2003). Simple
observing a combination of variable values given the
estimators for relational bayesian classifers. In Pro-
class label is equal to the product of the probabilities of
ceedings of the Third IEEE International Conference
each variable value given the class: P(x,y|c) = P(x|c)*P(y|c).
on Data Mining (pp. 609-612).
Inductive Logic Programming: A field of research at
Perlich, C. (2003). Citation-based document classifi-
the intersection of logic programming and inductive ma-
cation. In Proceedings of the Workshop on Informa-
chine learning, drawing ideas and methods from both
tion Technology and Systems (WITS).
disciplines. The objective of ILP methods is the inductive
Perlich, C., & Provost, F. (2003). Aggregation-based construction of first-order Horn clauses from a set of
feature invention and relational concept classes. In Pro- examples and background knowledge in relational form.
ceedings of the Ninth ACM SIGKDD International Con-
Propositionalization: The process of transforming a
ference on Knowledge Discovery and Data Mining.
multi-relational dataset, containing structured examples,
Perlich, C., & Provost, F. (2004). ACORA: Distribution- into a propositional data set (one table) with derived
based aggregation for relational learning from identi- attribute-value features, describing the structural proper-
ties of the example.

37

TEAM LinG
Aggregation for Predictive Modeling with Relational Data

Relational Data: Data where the original information Relational Learning: Learning in relational domains
cannot be represented in a single table but requires two that include information from multiple tables, not based
or more tables in a relational database. Every table can on manual feature construction.
either capture the characteristics of entities of a particular
type (e.g., person or product) or relationships between Target Objects: Objects in a particular target tables
entities (e.g., person bought product). for which a prediction is to be made. Other objects reside
in additional background tables, but are not the focus
of the prediction task.

38

TEAM LinG
39

API Standardization Efforts for Data Mining A


Jaroslav Zendulka
Brno University of Technology, Czech Republic

INTRODUCTION software vendors to be easily plugged into applications.


A software package that provides data mining services is
Data mining technology just recently became actually called data mining provider and an application that
usable in real-world scenarios. At present, the data mining employs these services is called data mining consumer.
models generated by commercial data mining and statis- The data mining provider itself includes three basic archi-
tical applications are often used as components in other tectural components (Hornick et al., 2002):
systems in such fields as customer relationship manage-
ment, risk management or processing scientific data. API the End User Visible Component: An appli-
Therefore, it seems to be natural that most data mining cation developer using a data mining provider has
products concentrate on data mining technology rather to know only its API.
than on the easy-to-use, scalability, or portability. It is Data Mining Engine (or Server): the core com-
evident that employing common standards greatly simpli- ponent of a data mining provider. It provides an
fies the integration, updating, and maintenance of appli- infrastructure that offers a set of data mining
cations and systems containing components provided by services to data mining consumers.
other producers (Grossman, Hornick, & Meyer, 2002). Metadata Repository: a repository that serves to
Data mining models generated by data mining algorithms store data mining metadata.
are good examples of such components.
Currently, established and emerging standards ad- The standard APIs presented in this paper are not
dress especially the following aspects of data mining: designed to support the entire knowledge discovery
process but the data mining step only (Han & Kamber,
Metadata: for representing data mining metadata 2001). They do not provide all necessary facilities for data
that specify a data mining model and results of cleaning, transformations, aggregations, and other data
model operations (CWM, 2001). preparation operations. It is assumed that data prepara-
Application Programming Interfaces (APIs): for tion is done before an appropriate data mining algorithm
employing data mining components in applications. offered by the API is applied.
Process: for capturing the whole knowledge dis- There are four key concepts that are supported by the
covery process (CRISP-DM, 2000). APIs: a data mining model, data mining task, data mining
technique, and data mining algorithm. The data mining
In this paper, we focus on standard APIs. The objec- model is a representation of a given set of data. It is the
tive of these standards is to facilitate integration of data result of one of the data mining tasks, during which a data
mining technology with application software. Probably mining algorithm for a given data mining technique
the best-known initiatives in this field are OLE DB for Data builds the model. For example, a decision tree as one of the
Mining (OLE DB for DM), SQL/MM Data Mining (SQL/ classification models is the result of a run of a decision
MM DM), and Java Data Mining (JDM). tree-based algorithm.
Another standard, which is not an API but is important The basic data mining tasks that the standard APIs
for integration and interoperability of data mining prod- support enable users to:
ucts and applications, is a Predictive Model Markup
Language (PMML). It is a standard format for data mining 1. Build a data mining model. This task consists of two
model exchange developed by Data Mining Group (DMG) steps. First the data model is defined, that is, the
(PMML, 2003). It is supported by all the standard APIs source data that will be mined is specified, the
presented in this paper. source data structure (referred to as physical schema)
is mapped on inputs of a data mining algorithm
(referred to as logical schema), and the algorithm
BACKGROUND used to build the data mining model is specified.
Then, the data mining model is built from training
The goal of data mining API standards is to make it data.
possible for different data mining algorithms from various
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
API Standardization Efforts for Data Mining

2. Test the quality of a mining model by applying oriented specification for a set of data access interfaces
testing data. designed for record-oriented data stores. It employs SQL
3. Apply a data mining model to new data. commands as arguments of interface operations. The
4. Browse a data mining model for reporting and visu- approach in defining OLE DB for DM was not to extend
alization applications. OLE DB interfaces but to expose data mining interfaces in
a language-based API.
The APIs support several commonly accepted and OLE DB for DM treats a data mining model as if it were
widely used techniques both for predictive and descrip- a special type of table: (a) Input data in the form of a set
tive data mining (see Table 1). Not all techniques need all of cases is associated with a data mining model and
the tasks listed above. For example, association rule additional meta-information while defining the data min-
mining does not require testing and application to new ing model. (b) When input data is inserted into the data
data, whereas classification does. mining model (it is populated), a mining algorithm
The goals of the APIs are very similar but the approach builds an abstraction of the data and stores it into this
of each of them is different. OLE DB for DM is a language- special table. For example, if the data model represents a
based interface, SQL/MM DM is based on user-defined decision tree, the table contains a row for each leaf node
data types in SQL:1999, and JDM contains packages of of the tree (Netz et al., 2001). Once the data mining model
data mining oriented Java interfaces and classes. is populated, it can be used for prediction, or it can be
In the next section, each of the APIs is briefly charac- browsed for visualization.
terized. An example showing their application in predic- OLE DB for DM extends syntax of several SQL state-
tion is presented in another article in this encyclopedia. ments for defining, populating, and using a data mining
model see Figure 1.

MAIN THRUST SQL/MM Data Mining

OLE DB for Data Mining SQL/MM DM is an international ISO/IEC standard (SQL,


2002), which is part of the SQL Multimedia and Applica-
OLE DB for DM (OLE DB, 2000) is Microsofts API that tion Packages (SQL/MM) (Melton & Eisenberg, 2001). It
aims to become the industry standard. It provides a set of is based on SQL:1999 and its structured user-defined data
extensions to OLE DB, which is a Microsofts object- types (UDT). The structured UDT is the fundamental

Table 1. Supported data mining techniques

Technique OLE DB for DM SQL/MM DM JDM


Association rules X X X
Clustering (segmentation) X X X
Classification X X X
Sequence and deviation analysis X
Density estimation X
Regression X
Approximation X
Attribute importance X

Figure 1. Extended SQL statements in OLE DB for DM

source data columns


INSERT model columns SELECT

mining algorithm
algorithm settings

CREATE

Populating the data Defining a data Testing, applying, browsing


2 1 mining model 3
mining model the data mining model

40

TEAM LinG
API Standardization Efforts for Data Mining

facility in SQL:1999 that supports object orientation (Melton FUTURE TRENDS


& Simon, 2001). The idea of SQL/MM DM is to provide A
UDTs and associated methods for defining input data, a OLE DB for DM is a Microsofts standard which aims to
data mining model, data mining task and its settings, and for be an industry standard. A reference implementation of
results of testing or applying the data mining model. Train- a data mining provider based on this standard is available
ing, testing, and application data must be stored in a table. in Microsoft SQL Server 2000 (Netz et al., 2001). SQL/MM
Relations of the UDTs are shown in Figure 2. DM was adopted as an international ISO/IEC standard.
Some of the UDTs are related to mining techniques. As it is based on a user-defined data type feature of
Their names contain XX in the figure, which should be SQL:1999, support of UDTs in database management
Clas, Rule, Clus, and Reg for classification, asso- systems is essential for implementations of data mining
ciation rules, clustering and regression, respectively. providers based on this standard.
JDM must still go through several steps before being
Java Data Mining accepted as an official Java standard. At the time of
writing this paper, it was in the stage of final draft public
Java Data Mining (JDM) known as a Java Specification review. The Oracle9i Data Mining (Oracle9i, 2002) API
Request 73 (JSR-73) (Hornick et al., 2004) is a Java provides an early look at concepts and approaches
standard being developed under SUNs Java Community proposed for JDM. It is assumed to comply with the JDM
Process. The standard is based on a generalized, object- standard when the standard is published.
oriented, data mining conceptual model. JDM supports All the standards support PMML as a format for data
common data mining operations, as well as the creation, mining model exchange. They enable a data mining model
persistence, access, and maintenance of metadata sup- to be imported and exported in this format. In OLE DB for
porting mining activities. DM, a new model can be created from a PMML document.
Compared with OLE DB for DM and SQL/MM DM, SQL/MM DM provides methods of the DM_XXModel
JDM is more complex because it does not rely on any other UDT to import and export a PMML document. Similarly,
built-in support, such as OLE DB or SQL. It is a pure Java JDM specifies interfaces for import and export tasks.
API that specifies a set of Java interfaces and classes,
which must be implemented in a data mining provider.
Some of JDM concepts are close to those in SQL/MM DM CONCLUSION
but the number of Java interfaces and classes in JDM is
higher than the number of UDTs in SQL/MM DM. JDM Three standard APIs for data mining were presented in
specifies interfaces for objects which provide an abstrac- this paper. However, their implementation is not avail-
tion of the metadata needed to execute data mining tasks. able yet or is only a reference one. Schwenkreis (2001)
Once a task is executed, another object that represents the commented on this situation, as, the fact that implemen-
result of the task is created. tations come after standards is a general trend in todays
standardization efforts. It seems that in case of data

Figure 2. Relations of data mining UDTs introduced in SQL/MM DM

Testing data

DM_MiningData
DM_MinigData
DM_XXTestResult

DM_LogicalDataSpec DM_XXModel
Training data

DM_XXSettings
DM_ApplicationData DM_XXResult

DM_XXBldTask

Application data

41

TEAM LinG
API Standardization Efforts for Data Mining

mining, standards are not only intended to unify existing SAS Enterprise Miner to support PMML. (September
products with well-known functionality but to (partially) 17, 2002). Retrieved from http://www.sas.com/news/
design the functionality such that future products match preleases/091702/news1.html
real world requirements. A simple example of using the
APIs in prediction is presented in another article of this Schwenkreis, F. (2001). Data mining Technology driven
book (Zendulka, 2005). by standards? Retrieved from http://www.research.
microsoft.com/~jamesrh/hpts2001/submissions/
FriedemannSchwenkreis.htm
REFERENCES SQL Multimedia and Application Packages. Part 6: Data
Mining. ISO/IEC 13249-6. (2002).
Common Warehouse Metamodel Specification: Data
Mining. Version 1.0. (2001). Retrieved from http:// Zendulka, J. (2005). Using standard APIs for data mining
www.omg.org/docs/ad/01-02-01.pdf in prediction. In J. Wang (Ed.) Encyclopedia of data
warehousing and mining. Hershey, PA: Idea Group Ref-
Cross Industry Standard Process for Data Mining erence.
(CRISP-DM). Version 1.0. (2000). Retrieved from http://
www.crisp-dm.org/
KEY TERMS
Grossman, R.L., Hornick, M.F., & Meyer, G. (2002). Data
mining standards initiatives. Communications of the ACM, API: Application programming interface (API) is a
45 (8), 59-61. description of the way one piece of software asks another
Han, J., & Kamber, M. (2001). Data mining: concepts and program to perform a service. A standard API for data
techniques. Morgan Kaufmann Publishers. mining enables for different data mining algorithms from
various vendors to be easily plugged into application
Hornick, M. et al. (2004). JavaSpecification Request programs.
73: JavaData Mining (JDM). Version 0.96. Retrieved
from http://jcp.org/aboutJava/communityprocess/first/ Data Mining Model: A high-level global description
jsr073/ of a given set of data which is the result of a data mining
technique over the set of data. It can be descriptive or
Melton, J., & Eisenberg, A. (2001). SQL Multimedia and predictive.
Application Packages (SQL/MM). SIGMOD Record, 30
(4), 97-102. DMG: Data Mining Group (DMG) is a consortium of
data mining vendors for developing data mining stan-
Melton, J., & Simon, A. (2001). SQL: 1999. Understanding dards. They have developed a Predictive Model Markup
relational language components. Morgan Kaufmann language (PMML).
Publishers.
JDM: Java Data Mining (JDM) is an emerging stan-
Microsoft Corporation. (2000). OLE DB for Data Mining dard API for the programming language Java. It is an
Specification Version 1.0. object-oriented interface that specifies a set of Java classes
and interfaces supporting data mining operations for
Netz, A. et al. (2001, April). Integrating data mining with building, testing, and applying a data mining model.
SQL Databases: OLE DB for data mining. In Proceedings
of the 17 th International Conference on Data Engineer- OLE DB for DM: OLE DB for Data Mining (OLE DB for
ing (ICDE 01) (pp. 379-387). Heidelberg, Germany. DM) is a Microsofts language-based standard API that
introduces several SQL-like statements supporing data
Oracle9i Data Mining. Concepts. Release 9.2.0.2. (2002). mining operations for building, testing, and applying a
Viewable CD Release 2 (9.2.0.2.0). data mining model.
PMML Version 2.1. (2003). Retrieved from http:// PMML: Predictive Model Markup Language (PMML)
www.dmg.org/pmml-v2-1.html is an XML-based language which provides a quick and
Saarenvirta, G. (2001, Summer). Operation Data Mining. easy way for applications to produce data mining models
DB2 Magazine, 6(2). International Business Machines in a vendor-independent format and to share them be-
Corporation. Retrieved from http://www.db2mag.com/ tween compliant applications.
db_area/archives/2001/q2/saarenvirta.shtml

42

TEAM LinG
API Standardization Efforts for Data Mining

SQL1999: Structured Query Language (SQL): 1999. SQL/MM DM: SQL Multimedia and Application Pack-
The version of the standard database language SQL ages Part 6: Data Mining (SQL/MM DM) is an interna- A
adapted in 1999, which introduced object-oriented fea- tional standard the purpose of which is to define data
tures. mining user-defined types and associated routines for
building, testing, and applying data mining models. It is
based on structured user-defined types of SQL:1999.

43

TEAM LinG
44

The Application of Data Mining to


Recommender Systems
J. Ben Schafer
University of Northern Iowa, USA

INTRODUCTION ries to a user concerning the latest news regarding a


senators re-election campaign.
In a world where the number of choices can be overwhelm- Without computers, a person often receives recom-
ing, recommender systems help users find and evaluate mendations by listening to what people around him have
items of interest. They connect users with items to con- to say. If many people in the office state that they enjoyed
sume (purchase, view, listen to, etc.) by associating the a particular movie, or if someone he tends to agree with
content of recommended items or the opinions of other suggests a given book, then he may treat these as recom-
individuals with the consuming users actions or opin- mendations. Collaborative filtering (CF) is an attempt to
ions. Such systems have become powerful tools in do- facilitate this process of word of mouth. The simplest of
mains from electronic commerce to digital libraries and CF systems provide generalized recommendations by
knowledge management. For example, a consumer of just aggregating the evaluations of the community at large.
about any major online retailer who expresses an interest More personalized systems (Resnick & Varian, 1997)
in an item either through viewing a product description employ techniques such as user-to-user correlations or a
or by placing the item in his shopping cart will likely nearest-neighbor algorithm.
receive recommendations for additional products. These The application of user-to-user correlations derives
products can be recommended based on the top overall from statistics, where correlations between variables are
sellers on a site, on the demographics of the consumer, or used to measure the usefulness of a model. In recommender
on an analysis of the past buying behavior of the con- systems correlations are used to measure the extent of
sumer as a prediction for future buying behavior. This agreement between two users (Breese, Heckerman, &
paper will address the technology used to generate rec- Kadie, 1998) and used to identify users whose ratings will
ommendations, focusing on the application of data min- contain high predictive value for a given user. Care must
ing techniques. be taken, however, to identify correlations that are actu-
ally helpful. Users who have only one or two rated items
in common should not be treated as strongly correlated.
BACKGROUND Herlocker et al. (1999) improved system accuracy by
applying a significance weight to the correlation based on
Many different algorithmic approaches have been ap- the number of co-rated items.
plied to the basic problem of making accurate and efficient Nearest-neighbor algorithms compute the distance
recommender systems. The earliest recommender sys- between users based on their preference history. Dis-
tems were content filtering systems designed to fight tances vary greatly based on domain, number of users,
information overload in textual domains. These were often number of recommended items, and degree of co-rating
based on traditional information filtering and information between users. Predictions of how much a user will like an
retrieval systems. Recommender systems that incorpo- item are computed by taking the weighted average of the
rate information retrieval methods are frequently used to opinions of a set of neighbors for that item. As applied in
satisfy ephemeral needs (short-lived, often one-time recommender systems, neighbors are often generated
needs) from relatively static databases. For example, re- online on a query-by-query basis rather than through the
questing a recommendation for a book preparing a sibling off-line construction of a more thorough model. As such,
for a new child in the family. Conversely, recommender they have the advantage of being able to rapidly incorpo-
systems that incorporate information-filtering methods rate the most up-to-date information, but the search for
are frequently used to satisfy persistent information (long- neighbors is slow in large databases. Practical algorithms
lived, often frequent, and specific) needs from relatively use heuristics to search for good neighbors and may use
stable databases in domains with a rapid turnover or opportunistic sampling when faced with large popula-
frequent additions. For example, recommending AP sto- tions.

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
The Application of Data Mining to Recommender Systems

Both nearest-neighbor and correlation-based of features for the items being classified or data about
recommenders provide a high level of personalization in relationships among the items. The category is a domain- A
their recommendations, and most early systems using specific classification such as malignant/benign for tumor
these techniques showed promising accuracy rates. As classification, approve/reject for credit requests, or in-
such, CF-based systems have continued to be popular in truder/authorized for security checks. One way to build a
recommender applications and have provided the bench- recommender system using a classifier is to use informa-
marks upon which more recent applications have been tion about a product and a customer as the input, and to
compared. have the output category represent how strongly to
recommend the product to the customer. Classifiers may
be implemented using many different machine-learning
DATA MINING IN RECOMMENDER strategies including rule induction, neural networks, and
APPLICATIONS Bayesian networks. In each case, the classifier is trained
using a training set in which ground truth classifications
The term data mining refers to a broad spectrum of math- are available. It can then be applied to classify new items
ematical modeling techniques and software tools that are for which the ground truths are not available. If subse-
used to find patterns in data and user these to build quent ground truths become available, the classifier may
models. In this context of recommender applications, the be retrained over time.
term data mining is used to describe the collection of For example, Bayesian networks create a model based
analysis techniques used to infer recommendation rules on a training set with a decision tree at each node and
or build recommendation models from large data sets. edges representing user information. The model can be
Recommender systems that incorporate data mining tech- built off-line over a matter of hours or days. The resulting
niques make their recommendations using knowledge model is very small, very fast, and essentially as accurate
learned from the actions and attributes of users. These as CF methods (Breese, Heckerman, & Kadie, 1998). Baye-
systems are often based on the development of user sian networks may prove practical for environments in
profiles that can be persistent (based on demographic or which knowledge of consumer preferences changes slowly
item consumption history data), ephemeral (based on with respect to the time needed to build the model but are
the actions during the current session), or both. These not suitable for environments in which consumer prefer-
algorithms include clustering, classification techniques, ence models must be updated rapidly or frequently.
the generation of association rules, and the production of Classifiers have been quite successful in a variety of
similarity graphs through techniques such as Horting. domains ranging from the identification of fraud and
Clustering techniques work by identifying groups of credit risks in financial transactions to medical diagnosis
consumers who appear to have similar preferences. Once to intrusion detection. Good et al. (1999) implemented
the clusters are created, averaging the opinions of the induction-learned feature-vector classification of movies
other consumers in her cluster can be used to make and compared the classification with CF recommenda-
predictions for an individual. Some clustering techniques tions; this study found that the classifiers did not perform
represent each user with partial participation in several as well as CF, but that combining the two added value over
clusters. The prediction is then an average across the CF alone.
clusters, weighted by degree of participation. Clustering One of the best-known examples of data mining in
techniques usually produce less-personal recommenda- recommender systems is the discovery of association
tions than other methods, and in some cases, the clusters rules, or item-to-item correlations (Sarwar et. al., 2001).
have worse accuracy than CF-based algorithms (Breese, These techniques identify items frequently found in as-
Heckerman, & Kadie, 1998). Once the clustering is com- sociation with items in which a user has expressed
plete, however, performance can be very good, since the interest. Association may be based on co-purchase data,
size of the group that must be analyzed is much smaller. preference by common users, or other measures. In its
Clustering techniques can also be applied as a first step simplest implementation, item-to-item correlation can be
for shrinking the candidate set in a CF-based algorithm or used to identify matching items for a single item, such
for distributing neighbor computations across several as other clothing items that are commonly purchased with
recommender engines. While dividing the population into a pair of pants. More powerful systems match an entire set
clusters may hurt the accuracy of recommendations to of items, such as those in a customers shopping cart, to
users near the fringes of their assigned cluster, pre- identify appropriate items to recommend. These rules can
clustering may be a worthwhile trade-off between accu- also help a merchandiser arrange products so that, for
racy and throughput. example, a consumer purchasing a childs handheld video
Classifiers are general computational models for as- game sees batteries nearby. More sophisticated temporal
signing a category to an input. The inputs may be vectors data mining may suggest that a consumer who buys the

45

TEAM LinG
The Application of Data Mining to Recommender Systems

video game today is likely to buy a pair of earplugs in the traditional CF algorithms do not consider. In one study
next month. using synthetic data, Horting produced better predic-
Item-to-item correlation recommender applications tions than a CF-based algorithm (Wolf et al., 1999).
usually use current interest rather than long-term cus-
tomer history, which makes them particularly well suited
for ephemeral needs such as recommending gifts or locat- FUTURE TRENDS
ing documents on a topic of short lived interest. A user
merely needs to identify one or more starter items to elicit As data mining algorithms have been tested and vali-
recommendations tailored to the present rather than the dated in their application to recommender systems, a
past. variety of promising applications have evolved. In this
Association rules have been used for many years in section we will consider three of these applications
merchandising, both to analyze patterns of preference meta-recommenders, social data mining systems, and
across products, and to recommend products to consum- temporal systems that recommend when rather than what.
ers based on other products they have selected. An asso- Meta-recommenders are systems that allow users to
ciation rule expresses the relationship that one product is personalize the merging of recommendations from a va-
often purchased along with other products. The number of riety of recommendation sources employing any number
possible association rules grows exponentially with the of recommendation techniques. In doing so, these sys-
number of products in a rule, but constraints on confi- tems let users take advantage of the strengths of each
dence and support, combined with algorithms that build different recommendation method. The SmartPad super-
association rules with itemsets of n items from rules with market product recommender system (Lawrence et al.,
n-1 item itemsets, reduce the effective search space. As- 2001) suggests new or previously unpurchased prod-
sociation rules can form a very compact representation of ucts to shoppers creating shopping lists on a personal
preference data that may improve efficiency of storage as digital assistant (PDA). The SmartPad system considers
well as performance. They are more commonly used for a consumers purchases across a stores product tax-
larger populations rather than for individual consumers, onomy. Recommendations of product subclasses are
and they, like other learning methods that first build and based upon a combination of class and subclass associa-
then apply models, are less suitable for applications where tions drawn from information filtering and co-purchase
knowledge of preferences changes rapidly. Association rules drawn from data mining. Product rankings within a
rules have been particularly successfully in broad applica- product subclass are based upon the products sales
tions such as shelf layout in retail stores. By contrast, rankings within the users consumer cluster, a less per-
recommender systems based on CF techniques are easier sonalized variation of collaborative filtering. MetaLens
to implement for personal recommendation in a domain (Schafer et al., 2002) allows users to blend content re-
where consumer opinions are frequently added, such as quirements with personality profiles to allow users to
online retail. determine which movie they should see. It does so by
In addition to use in commerce, association rules have merging more persistent and personalized recommenda-
become powerful tools in recommendation applications in tions, with ephemeral content needs such as the lack of
the domain of knowledge management. Such systems offensive content or the need to be home by a certain
attempt to predict which Web page or document can be time. More importantly, it allows the user to customize
most useful to a user. As Géry (2003) writes, "The problem the process by weighting the importance of each indi-
of finding Web pages visited together is similar to finding vidual recommendation.
associations among itemsets in transaction databases." While a traditional CF-based recommender typically
Once transactions have been identified, each of them requires users to provide explicit feedback, a social data
could represent a basket, and each web resource an item. mining system attempts to mine the social activity records
Systems built on this approach have been demonstrated to of a community of users to implicitly extract the impor-
produce both high accuracy and precision in the coverage tance of individuals and documents. Such activity may
of documents recommended (Geyer-Schulz et al., 2002). include Usenet messages, system usage history, cita-
Horting is a graph-based technique in which nodes are tions, or hyperlinks. TopicShop (Amento et al., 2003) is
users, and edges between nodes indicate degree of simi- an information workspace which allows groups of com-
larity between two users (Wolf et al., 1999). Predictions are mon Web sites to be explored, organized into user de-
produced by walking the graph to nearby nodes and fined collections, manipulated to extract and order com-
combining the opinions of the nearby users. Horting dif- mon features, and annotated by one or more users. These
fers from collaborative filtering as the graph may be walked actions on their own may not be of large interest, but the
through other consumers who have not rated the product collection of these actions can be mined by TopicShop
in question, thus exploring transitive relationships that and redistributed to other users to suggest sites of


general and personal interest. Agrawal et al. (2003) ex- that data mining algorithms can be and will continue to be
plored the threads of newsgroups to identify the relation- an important part of the recommendation process.
ships between community members. Interestingly, they
concluded that due to the nature of newsgroup postings
users are more likely to respond to those with whom they REFERENCES
disagree; links between users are more likely to sug-
gest that users should be placed in differing partitions Adomavicius, G., & Tuzhilin, A. (2001). Extending
rather than the same partition. Although this technique recommender systems: A multidimensional approach.
has not been directly applied to the construction of IJCAI-01 Workshop on Intelligent Techniques for Web
recommendations, such an application seems a logical Personalization (ITWP2001), Seattle, Washington.
field of future study.
Although traditional recommenders suggest what item Agrawal, R., Rajagopalan, S., Srikant, R., & Xu, Y. (2003).
a user should consume they have tended to ignore changes Mining newsgroups using networks arising from social
over time. Temporal recommenders apply data mining behavior. In Proceedings of the Twelfth World Wide
techniques to suggest when a recommendation should be Web Conference (WWW12) (pp. 529-535), Budapest, Hun-
made or when a user should consume an item. Adomavicius gary.
and Tuzhilin (2001) suggest the construction of a recom- Amento, B., Terveen, L., Hill, W., Hix, D., & Schulman, R.
mendation warehouse, which stores ratings in a hypercube. (2003). Experiments in social data mining: The TopicShop
This multidimensional structure can store data on not System. ACM Transactions on Computer-Human Inter-
only the traditional user and item axes, but also for action, 10 (1), 54-85.
additional profile dimensions such as time. Through this
approach, queries can be expanded from the traditional Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical
what items should we suggest to user X to at what analysis of predictive algorithms for collaborative filter-
times would user X be most receptive to recommendations ing. In Proceedings of the 14th Conference on Uncer-
for product Y. Hamlet (Etzioni et al., 2003) is designed tainty in Artificial Intelligence (UAI-98) (pp. 43-52),
to minimize the purchase price of airplane tickets. Hamlet Madison, Wisconsin.
combines the results from time series analysis, Q-learn-
ing, and the Ripper algorithm to create a multi-strategy Etzioni, O., Knoblock, C.A., Tuchinda, R., & Yates, A.
data-mining algorithm. By watching for trends in airline (2003). To buy or not to buy: Mining airfare data to
pricing and suggesting when a ticket should be pur- minimize ticket purchase price. In Proceedings of the
chased, Hamlet was able to save the average user 23.8% Ninth ACM SIGKDD International Conference on Knowl-
when savings was possible. edge Discovery and Data Mining (pp. 119-128), Wash-
ington. D.C.
Géry, M., & Haddad, H. (2003). Evaluation of Web usage
CONCLUSION mining approaches for user's next request prediction. In
Fifth International Workshop on Web Information and
Recommender systems have emerged as powerful tools Data Management (pp. 74-81), Madison, Wisconsin.
for helping users find and evaluate items of interest.
These systems use a variety of techniques to help users Geyer-Schulz, A., & Hahsler, M. (2002). Evaluation of
identify the items that best fit their tastes or needs. While recommender algorithms for an Internet information bro-
popular CF-based algorithms continue to produce mean- ker based on simple association rules and on the repeat-
ingful, personalized results in a variety of domains, data buying theory. In Fourth WEBKDD Workshop: Web Min-
mining techniques are increasingly being used in both ing for Usage Patterns & User Profiles (pp. 100-114),
hybrid systems, to improve recommendations in previ- Edmonton, Alberta, Canada.
ously successful applications, and in stand-alone Good, N. et al. (1999). Combining collaborative filtering
recommenders, to produce accurate recommendations in with personal agents for better recommendations. In Pro-
previously challenging domains. The use of data mining ceedings of Sixteenth National Conference on Artificial
algorithms has also changed the types of recommenda- Intelligence (AAAI-99) (pp. 439-446), Orlando, Florida.
tions as applications move from recommending what to
consume to also recommending when to consume. While Herlocker, J., Konstan, J.A., Borchers, A., & Riedl, J.
recommender systems may have started as largely a (1999). An algorithmic framework for performing collabo-
passing novelty, they clearly appear to have moved into rative filtering. In Proceedings of the 1999 Conference on
a real and powerful tool in a variety of applications, and Research and Development in Information Retrieval,
(pp. 230-237), Berkeley, California.


Lawrence, R.D. et al. (2001). Personalization of supermar- KEY TERMS


ket product recommendations. Data Mining and Knowl-
edge Discovery, 5(1/2), 11-32. Association Rules: Used to associate items in a data-
Lin, W., Alvarez, S.A., & Ruiz, C. (2002). Efficient adap- base sharing some relationship (e.g., co-purchase infor-
tive-support association rule mining for recommender mation). Often takes the form "if this, then that," such as,
systems. Data Mining and Knowledge Discovery, 6(1), "If the customer buys a handheld videogame, then the
83-105. customer is likely to purchase batteries."

Resnick, P., & Varian, H.R. (1997). Communications of the Collaborative Filtering: Selecting content based on
Association of Computing Machinery Special issue on the preferences of people with similar interests.
Recommender Systems, 40(3), 56-89. Meta-Recommenders: Provide users with personal-
Sarwar, B., Karypis, G., Konstan, J.A., & Reidl, J. (2001). ized control over the generation of a single recommenda-
Item-based collaborative filtering recommendation algo- tion list formed from the combination of rich recommenda-
rithms. In Proceedings of the Tenth International Con- tion data from multiple information sources and recom-
ference on World Wide Web (pp. 285-295), Hong Kong. mendation techniques.

Schafer, J.B., Konstan, J.A., & Riedl, J. (2001). E-Com- Nearest-Neighbor Algorithm: A recommendation
merce Recommendation Applications. Data Mining and algorithm that calculates the distance between users
Knowledge Discovery, 5(1/2), 115-153. based on the degree of correlations between scores in the
users' preference histories. Predictions of how much a
Schafer, J.B., Konstan, J.A., & Riedl, J. (2002). Meta- user will like an item are computed by taking the weighted
recommendation systems: User-controlled integration of average of the opinions of a set of nearest neighbors for
diverse recommendations. In Proceedings of the Elev- that item (a brief sketch of this computation follows these key terms).
enth Conference on Information and Knowledge (CIKM-
02) (pp. 196-203), McLean, Virginia. Recommender Systems: Any system that provides a
recommendation, prediction, opinion, or user-configured
Shoemaker, C., & Ruiz, C. (2003). Association rule mining list of items that assists the user in evaluating items.
algorithms for set-valued data. Lecture Notes in Com-
puter Science, 2690, 669-676. Social Data-Mining: Analysis and redistribution of
information from records of social activity such as
Wolf, J., Aggarwal, C., Wu, K-L., & Yu, P. (1999). Horting newsgroup postings, hyperlinks, or system usage his-
hatches an egg: A new graph-theoretic approach to col- tory.
laborative filtering. In Proceedings of ACM SIGKDD
International Conference on Knowledge Discovery & Temporal Recommenders: Recommenders that incor-
Data Mining (pp. 201-212), San Diego, CA. porate time into the recommendation process. Time can be
either an input to the recommendation function, or the
output of the function.
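As promised under the Nearest-Neighbor Algorithm entry above, a minimal sketch of that weighted-average prediction follows. The rating data, the use of cosine similarity, and the function names are assumptions made for illustration; they are not drawn from any of the systems cited above.

```python
# Hedged sketch of nearest-neighbor collaborative filtering: predict a user's
# score for an item as the similarity-weighted average of neighbors' scores.
import math

def cosine_sim(a, b):
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def predict(ratings, user, item, k=2):
    """ratings: {user: {item: score}}; returns a predicted score or None."""
    neighbors = [(cosine_sim(ratings[user], r), r[item])
                 for u, r in ratings.items() if u != user and item in r]
    top = sorted(neighbors, reverse=True)[:k]
    num = sum(sim * score for sim, score in top)
    den = sum(sim for sim, _ in top)
    return num / den if den else None

ratings = {"ann": {"m1": 5, "m2": 3}, "bob": {"m1": 4, "m2": 3, "m3": 4},
           "cid": {"m1": 2, "m3": 1}}
print(predict(ratings, "ann", "m3"))
```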


Approximate Range Queries by Histograms


in OLAP
Francesco Buccafurri
University Mediterranea of Reggio Calabria, Italy

Gianluca Lax
University Mediterranea of Reggio Calabria, Italy

INTRODUCTION Ramakrishnan, 2000). The main advantage of sampling


techniques is that they are very easy to implement.
Online analytical processing applications typically ana- Besides sampling, regression techniques try to model
lyze a large amount of data by means of repetitive data as a function in such a way that only a small set of
queries involving aggregate measures on such data. In coefficients representing such a function is stored,
fast OLAP applications, it is often advantageous to rather than the original data. The simplest regression
provide approximate answers to queries in order to technique is the linear one, which models a data distri-
achieve very high performances. A way to obtain this bution as a linear function. Despite its simplicity, not
goal is by submitting queries on compressed data in allowing the capture of complex relationships among
place of the original ones. Histograms, initially intro- data, this technique often produces acceptable results.
duced in the field of query optimization, represent one There are also nonlinear regressions, significantly
of the most important techniques used in the context of more complex than the linear one from the computa-
OLAP for producing approximate query answers. tional point of view, but applicable to a much larger set
of cases.
Another possibility for facing the range query esti-
BACKGROUND mation problem consists of using wavelets-based tech-
niques (Chakrabarti, Garofalakis, Rastogi & Shim, 2001;
Computing aggregate information is a widely exploited Garofalakis & Gibbons, 2002; Garofalakis & Kumar,
task in many OLAP applications. Every time it is neces- 2004). Wavelets are mathematical transformations stor-
sary to produce fast query answers and a certain estima- ing data in a compact and hierarchical fashion used in
tion error can be accepted, it is possible to inquire many application contexts, like image and signal pro-
summary data rather than the original ones and to per- cessing (Kacha, Grenez, De Doncker & Benmahammed,
form suitable interpolations. The typical OLAP query is 2003; Khalifa, 2003). There are several types of trans-
the range query. formations, each belonging to a family of wavelets. The
The range query estimation problem in the one- result of each transformation is a set of values, called
dimensional case can be stated as follows: given an wavelet coefficients. The advantage of this technique is
attribute X of a relation R, and a range I belonging to the that, typically, the value of a (possibly large) number of
domain of X, estimate the number of records of R with wavelet coefficients results to be below a fixed thresh-
value of X lying in I. The challenge is finding methods old, so that such coefficients can be approximated by 0.
for achieving a small estimation error by consuming a Clearly, the overall approximation of the technique as
fixed amount of storage space. well as the compression ratio depends on the value of
A possible solution to this problem is using sampling such a threshold. In the last years, wavelets have been
methods; only a small number of suitably selected records exploited in data mining and knowledge discovery in
of R, representing R well, are stored. The range query is databases, thanks to time and space efficiency and data
then evaluated by exploiting this sample instead of the hierarchical decomposition characterizing them. For a
full relation R. Recently, Wu, Agrawal, and Abbadi (2002) deeper treatment about wavelets, see Li, Li, Zhu, and
have shown that in terms of accuracy, sampling tech- Ogihara (2002).
niques based on the cumulative distribution function are Besides sampling and wavelets, histograms are used
definitely better than the methods based on tuple sam- widely for estimating range queries. Although some-
pling (Chaudhuri, Das & Narasayya, 2001; Ganti, Lee & times wavelets are viewed as a particular class of histo-
grams, we prefer to describe histograms separately.


MAIN THRUST we first consider the Max-Diff histogram and the V-


Optimal histogram. Even though they are not the most
Histograms are a lossy compression technique widely recent techniques, we cite them, since they are still
applied in various application contexts, like query opti- considered points of reference.
mization, statistical and temporal databases, and OLAP We start by describing the Max-Diff histogram.
applications. In OLAP, compression allows us to obtain Let V = {v1, ..., vn} be the set of values of the attribute
fast approximate answers by evaluating queries on re- X actually appearing in the relation R and f(v i) be the
duced data in place of the original ones. Histograms are number of tuples of R having value v i in X. A MaxDiff
well suited to this purpose, especially in the case of histogram with h buckets is obtained by putting a bound-
range queries. ary between two adjacent attribute values v i and vi+1 of V
A histogram is a compact representation of a rela- if the difference between f(vi+1)·si+1 and f(vi)·si is one
tion R. It is obtained by partitioning an attribute X of the of the h-1 largest such differences (where s i denotes the
relation R into k subranges, called buckets, and by spread of vi, that is the distance from vi to the next non-
maintaining for each of them a few pieces of informa- null value).
tion, typically corresponding to the bucket boundaries, A V-Optimal histogram, which is the other classical
the number of tuples with value of X belonging to the histogram we describe, produces more precise results
subrange associated to the bucket (often called sum of than the Max-Diff histogram. It is obtained by selecting
the bucket), and the number of distinct values of X of the boundaries for each bucket i so that Σi SSEi is
such a subrange occurring in some tuple of R (i.e., the minimal, where SSE i is the standard squared error of the
number of non-null frequencies of the subrange). bucket i-th.
Recall that a range query, defined on an interval I of V-Optimal histogram uses a dynamic programming
X, evaluates the number of occurrences in R with value technique in order to find the optimal partitioning w.r.t.
of X in I. Thus, buckets embed a set of precomputed a given error metrics. Even though the V-Optimal histo-
disjoint range queries capable of covering the whole gram results more accurate than Max-Diff, its high
active domain of X in R (here, active means attribute space and time complexities make it rarely used in
values actually appearing in R). As a consequence, the practice.
histogram, in general, does not give the possibility of In order to overcome such a drawback, an approxi-
evaluating exactly a range query not corresponding to mate version of the V-Optimal histogram has been pro-
one of the precomputed embedded queries. In other posed. The basic idea is quite simple. First, data are
words, while the contribution to the answer coming partitioned into l disjoint chunks, and then the V-Opti-
from the subranges coinciding with entire buckets can mal algorithm is used in order to compute a histogram
be returned exactly, the contribution coming from the within each chunk. The consequent problem is how to
subranges that partially overlap buckets can only be allocate buckets to the chunks such that exactly B buck-
estimated, since the actual data distribution is not avail- ets are used. This is solved by implementing a dynamic
able. programming scheme. It is shown that an approximate
Constructing the best histogram thus may mean de- V-Optimal histogram with B + l buckets has the same
fining the boundaries of buckets in such a way that the accuracy as the non-approximate V-Optimal with B buck-
estimation of the non-precomputed range queries be- ets. Moreover, the time required for executing the
comes more effective (e.g., by avoiding that large fre- approximate algorithm is reduced by multiplicative fac-
quency differences arise inside a bucket). This approach tor equal to 1/l.
corresponds to finding, among all possible sets of pre- We call the histograms so far described classical
computed range queries, the set that guarantees the best histograms.
estimation of the other (non-precomputed) queries, Besides accuracy, new histograms tend to satisfy
once a technique for estimating such queries is defined. other properties in order to allow their application to
Besides this problem, which we call the partition new environments (e.g., knowledge discovery). In par-
problem, there is another relevant issue to investigate: ticular, (1) the histogram should maintain in a certain
how to improve the estimation inside the buckets. We measure the semantic nature of original data, in such a
discuss both of these issues in the following two sections. way that meaningful queries for mining activities can be
submitted to reduced data in place of original ones.
The Partition Problem Then, (2) for a given kind of query, the accuracy of the
reduced structure should be guaranteed. In addition, (3)
This issue has been analyzed widely in the past, and a the histogram should efficiently support hierarchical
number of techniques has been proposed. Among these, range queries in order not to limit too much the capabil-
ity of drilling down and rolling up over data.
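The Max-Diff construction described earlier on this page can be sketched in a few lines. The fragment below is an illustration only, not the published algorithm: the sample data, the helper names, and the use of frequency-times-spread values with absolute differences are assumptions.

```python
# Illustrative Max-Diff-style bucketing: place boundaries where the difference
# between (frequency x spread) of adjacent values is largest, then store
# (inf, sup, sum) per bucket. Data and names are hypothetical.
def maxdiff_buckets(values, freqs, h):
    """values: sorted distinct attribute values; freqs: their frequencies."""
    spreads = [values[i + 1] - values[i] for i in range(len(values) - 1)] + [1]
    areas = [f * s for f, s in zip(freqs, spreads)]
    diffs = [(abs(areas[i + 1] - areas[i]), i) for i in range(len(areas) - 1)]
    cuts = sorted(i for _, i in sorted(diffs, reverse=True)[:h - 1])
    buckets, start = [], 0
    for cut in cuts + [len(values) - 1]:
        buckets.append((values[start], values[cut], sum(freqs[start:cut + 1])))
        start = cut + 1
    return buckets

values = [1, 2, 3, 10, 11, 12, 50]
freqs  = [4, 5, 4, 90, 88, 91, 2]
print(maxdiff_buckets(values, freqs, h=3))
```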


Classical histograms lack the last point, since they one or more buckets can be computed exactly, while if it
are flat structures. Many proposals have been presented partially overlaps a bucket, then the result can only be
in order to guarantee the three properties previously estimated.
described, and we report some of them in the following. The simplest adopted estimation technique is the
Requirement (3) was introduced by Koudas, Continuous Value Assumption (CVA). Given a bucket
Muthukrishnan, and Srivastava (2000), where the authors of size s and sum c, a range query overlapping the bucket
have shown the insufficient accuracy of classical histo- in i points is estimated as (i / s ) c . This corresponds to
grams in evaluating hierarchical range queries. Therein, a estimating the partial contribution of the bucket to the
polynomial-time algorithm for constructing optimal his- range query result by linear interpolation.
tograms with respect to hierarchical queries is proposed. Another possibility is to use the Uniform Spread
The selectivity estimation problem for non-hierarchi- Assumption (USA). It assumes that values are distrib-
cal range queries was studied by Gilbert, Kotidis, uted at equal distance from each other and that the
Muthukrishnan, and Strauss (2001), and, according to overall frequency sum is equally distributed among
property (2), optimal and approximate polynomial (in the them. In this case, it is necessary to know the number of
database size) algorithms with a provable approximation non-null frequencies belonging to the bucket. Denoting
guarantee for constructing histograms are also presented. by t such a value, the range query is estimated by
Guha, Koudas, and Srivastava (2002) have proposed
efficient algorithms for the problem of approximating
the distribution of measure attributes organized into
hierarchies. Such algorithms are based on dynamic pro- [(s - 1) + (i - 1)(t - 1)] / [(s - 1) t] c .
gramming and on a notion of sparse intervals.
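A small sketch may help compare the two estimators just given. The code below is only an illustration (the bucket parameters and function names are invented); it applies the CVA and USA formulas to a bucket of size s, sum c, and t non-null values for a query covering its first i points.

```python
# Hedged sketch of intra-bucket estimation: Continuous Value Assumption (CVA)
# versus Uniform Spread Assumption (USA), as defined in the text.
def cva_estimate(i, s, c):
    # linear interpolation: the query covers i of the s points of the bucket
    return (i / s) * c

def usa_estimate(i, s, c, t):
    # t non-null values assumed equally spread, each carrying c / t occurrences
    return ((s - 1) + (i - 1) * (t - 1)) / ((s - 1) * t) * c

s, c, t = 10, 200, 4          # bucket size, bucket sum, non-null values
for i in (1, 5, 10):          # overlap of 1, 5, and all 10 points
    print(i, round(cva_estimate(i, s, c), 1), round(usa_estimate(i, s, c, t), 1))
```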
Algorithms returning both optimal and suboptimal An interesting problem is understanding whether,
solutions for approximating range queries by histograms by exploiting information typically contained in histo-
and their dynamic maintenance by additive changes are gram buckets and possibly by adding some concise
provided by Muthukrishnan and Strauss (2003). The best summary information, the frequency estimation inside
algorithm, with respect to the construction time, return- buckets and, then, the histogram accuracy can be im-
ing an optimal solution takes polynomial time. proved. To this aim, starting from a theoretical analysis
Buccafurri and Lax (2003) have presented a histo- about limits of CVA and USA, Buccafurri, Pontieri,
gram based on a hierarchical decomposition of the data Rosaci, and Saccà (2002) have proposed to use an addi-
distribution kept in a full binary tree. Such a tree, con- tional storage space of 32 bits, called 4LT, in each
taining a set of precomputed hierarchical queries, is bucket in order to store the approximate representation
encoded by using bit saving for obtaining a smaller of the data distribution inside the bucket. In particular, 4LT
structure and, thus, for efficiently supporting hierarchi- is used to save approximate cumulative frequencies at
cal range queries. seven equidistant intervals internal to the bucket.
Besides bucket-based histograms, there are other Clearly, approaches similar to that followed in
kinds of histograms whose construction is not driven by Buccafurri, Pontieri, Rosaci, and Saccà (2002) have to
the search of a suitable partition of the attribute domain, deal with the trade-off between the extra storage space
and, further, their structure is more complex than simply required for each bucket and the number of total buck-
a set of buckets. This class of histograms is called non- ets the allowed total storage space consents.
bucket based histograms. Wavelets are an example of
such kind of histograms.
In the next section, we deal with the second problem FUTURE TRENDS
introduced earlier concerning the estimation of range
queries partially involving buckets. Data streams is an emergent issue that in the last two
years has captured the interest of many scientific com-
Estimation Inside a Bucket munities. The crucial problem arising in several appli-
cation contexts like network monitoring, sensor net-
While finding the optimal bucket partition has been works, financial applications, security, telecommuni-
widely investigated in past years, the problem of estimat- cation data management, Web applications, and so on is
ing queries partially involving a bucket has received a dealing with continuous data flows (i.e., data streams)
little attention. having the following characteristics: (1) they are time
Histograms are well suited to range query evaluation, dependent; (2) their size is very large, so that they
since buckets basically correspond to a set of precom- cannot be stored totally due to the actual memory
puted range queries. A range query that involves entirely


limitation; and (3) data arrival is very fast and unpredict- Buccafurri, F., & Lax, G. (2003). Pre-computing approxi-
able, so that each data management operation should be mate hierarchical range queries in a tree-like histogram.
very efficient. Proceedings of the International Conference on Data
Since a data stream consists of a large amount of Warehousing and Knowledge Discovery.
data, it is usually managed on the basis of a sliding
window, including only the most recent data (Babcock, Buccafurri, F., & Lax, G. (2004). Reducing data stream
Babu, Datar, Motwani & Widom, 2002). Thus, any tech- sliding windows by cyclic tree-like histograms. Pro-
nique capable of compressing sliding windows by main- ceedings of the 8th European Conference on Principles
taining a good approximate representation of data dis- and Practice of Knowledge Discovery in Databases.
tribution is certainly relevant in this field. Typical que- Buccafurri, F., Pontieri, L., Rosaci, D., & Saccà, D.
ries performed on sliding windows are similarity que- (2002). Improving range query estimation on histo-
ries and other analyses, like change mining queries grams. Proceedings of the International Conference
(Dong, Han, Lakshmanan, Pei, Wang & Yu, 2003) useful on Data Engineering.
for trend analysis and, in general, for understanding the
dynamics of data. Also in this field, histograms may Chakrabarti, K., Garofalakis, M., Rastogi, R., & Shim,
become an important analysis tool. The challenge is K. (2001). Approximate query processing using wave-
finding new histograms that (1) are fast to construct and lets. VLDB Journal, The International Journal on
to maintain; that is, the required updating operations Very Large Data Bases, 10(2-3), 199-223.
(performed at each data arrival) are very efficient; (2) Chaudhuri, S., Das, G., & Narasayya, V. (2001). A ro-
maintain a good accuracy in approximating data distri- bust, optimization-based approach for approximate an-
bution; and (3) support continuous querying on data. swering of aggregate queries. Proceedings of the 2001
An example of the above emerging approaches is ACM SIGMOD International Conference on Manage-
reported in Buccafurri and Lax (2004), where a tree-like ment of Data.
histogram with cyclic updating is proposed. By using
such a compact structure, many mining techniques, which Dong, G. et al. (2003). Online mining of changes from data
would take computational cost very high if used on real streams: Research problems and preliminary results. Pro-
data streams, can be implemented effectively. ceedings of the ACM SIGMOD Workshop on Manage-
ment and Processing of Data Streams.
Ganti, V., Lee, M. L., & Ramakrishnan, R. (2000).
CONCLUSION Icicles: Self-tuning samples for approximate query an-
swering. Proceedings of 26th International Confer-
Data reduction represents an important task both in data ence on Very Large Data Bases.
mining and in OLAP, since it allows us to represent
very large amounts of data in a compact structure on Garofalakis, M., & Gibbons, P.B. (2002). Wavelet syn-
which mining techniques or OLAP queries can operate opses with error guarantees. Proceedings of the ACM
efficiently. Time and memory cost advantages arising from SIGMOD International Conference on Management
data compression, provided that a sufficient degree of of Data.
accuracy is guaranteed, may improve considerably the
capabilities of mining and OLAP tools. Garofalakis, M., & Kumar, A. (2004). Deterministic wave-
This opportunity (added to the necessity, coming let thresholding for maximum error metrics. Proceedings
from emergent research fields such as data streams) of of the Twenty-third ACM SIGMOD-SIGACT-SIGART Sym-
producing more and more compact representations of posium on Principles of Database Systems.
data explains the attention that the research community Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., & Strauss,
is giving toward techniques like histograms and wave- M.J. (2001). Optimal and approximate computation of
lets, which provide a concrete answer to the previous summary statistics for range aggregates. Proceedings
requirements. of the Twentieth ACM SIGMOD-SIGACT-SIGART Sym-
posium on Principles of Database Systems.

REFERENCES Guha, S., Koudas, N., & Srivastava, D. (2002). Fast


algorithms for hierarchical range histogram construc-
Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, tion. Proceedings of the Twenty-First ACM SIGMOD-
J. (2002). Models and issues in data stream systems. SIGACT-SIGART Symposium on Principles of Data-
Proceedings of the ACM SIGMOD-SIGACT-SIGART base Systems.
Symposium on Principles of Database Systems.


Kacha, A., Grenez, F., De Doncker, P., & Benmahammed, K. Bucket-Based Histogram: A type of histogram whose
(2003). A wavelet-based approach for frequency estima- construction is driven by the search of a suitable partition
tion of interference signals in printed circuit boards. Pro- of the attribute domain into buckets.
ceedings of the 1st International Symposium on Informa-
tion and Communication Technologies. Continuous Value Assumption (CVA): A tech-
nique allowing us to estimate values inside a bucket by
Khalifa, O. (2003). Image data compression in wavelet linear interpolation.
transform domain using modified LBG algorithm. Pro-
ceedings of the 1st International Symposium on Infor- Data Preprocessing: The application of several
mation and Communication Technologies. methods preceding the mining phase, done for improv-
ing the overall data mining results. Usually, it consists
Koudas, N., Muthukrishnan, S., & Srivastava, D. (2000). of (1) data cleaning, a method for fixing missing values,
Optimal histograms for hierarchical range queries (ex- outliers, and possible inconsistent data; (2) data inte-
tended abstract). Proceedings of the Nineteenth ACM gration, the union of (possibly heterogeneous) data
SIGMOD-SIGACT-SIGART Symposium on Principles coming from different sources into a unique data store;
of Database Systems. and (3) data reduction, the application of any technique
working on data representation capable of saving stor-
Li, T., Li, Q., Zhu, S., & Ogihara, M. (2002). Survey on age space without compromising the possibility of in-
wavelet applications in data mining. ACM SIGKDD Ex- quiring them.
plorations, 4(2), 49-68.
Histogram: A set of buckets implementing a parti-
Muthukrishnan, S., & Strauss, M. (2003). Rangesum tion of the overall domain of a relation attribute.
histograms. Proceedings of the Fourteenth Annual
ACM-SIAM Symposium on Discrete Algorithms. Range Query: A query returning aggregate infor-
mation (e.g., sum, average) about data belonging to a
Wu, Y., Agrawal, D., & Abbadi, A.E. (2002). Query given interval of the domain.
estimation by adaptive sampling. Proceedings of the
International Conference on Data Engineering. Uniform Spread Assumption (USA): A technique
for estimating values inside a bucket by assuming that
values are distributed at an equal distance from each
other and that the overall frequency sum is distributed
KEY TERMS equally among them.

Bucket: An element obtained by partitioning the Wavelets: Mathematical transformations imple-


domain of an attribute X of a relation into non-overlap- menting hierarchical decomposition of functions lead-
ping intervals. Each bucket consists of a tuple <inf, sup, ing to the representation of functions through sets of
val>, where val is aggregate information (e.g., sum, wavelet coefficients.
average, count, etc.) about tuples with value of X be-
longing to the interval (inf, sup).


Artificial Neural Networks for Prediction


Rafael Martí
Universitat de València, Spain

INTRODUCTION BACKGROUND

The design and implementation of intelligent systems From a technical point of view, ANNs offer a general
with human capabilities is the starting point to design framework for representing nonlinear mappings from
Artificial Neural Networks (ANNs). The original idea several input variables to several output variables. They
takes after neuroscience theory on how neurons in the are built by tuning a set of parameters known as weights
human brain cooperate to learn from a set of input and can be considered as an extension of the many
signals to produce an answer. Because the power of the conventional mapping techniques. In classification or
brain comes from the number of neurons and the mul- recognition problems, the net's outputs are categories,
tiple connections between them, the basic idea is that while in prediction or approximation problems, they are
connecting a large number of simple elements in a continuous variables. Although this article focuses on
specific way can form an intelligent system. the prediction problem, most of the key issues in the net
Generally speaking, an ANN is a network of many functionality are common to both.
simple processors called units, linked to certain neigh- In the process of training the net (supervised learn-
bors with varying coefficients of connectivity (called ing), the problem is to find the values of the weights w
weights) that represent the strength of these connec- that minimize the error across a set of input/output pairs
tions. The basic unit of ANNs, called an artificial neu- (patterns) called the training set E. For a single output
ron, simulates the basic functions of natural neurons: it and input vector x, the error measure is typically the
receives inputs, processes them by simple combination root mean squared difference between the predicted
and threshold operations, and outputs a final result. output p(x,w) and the actual output value f(x) for all the
ANNs often employ supervised learning in which elements x in E (RMSE); therefore, the training is an
training data (including both the input and the desired unconstrained nonlinear optimization problem, where
output) is provided. Learning basically refers to the the decision variables are the weights, and the objective
process of adjusting the weights to optimize the net- is to reduce the training error. Ideally, the set E is a
work performance. ANNs belong to machine-learning is to reduce the training error. Ideally, the set E is a
algorithms because the changing of a network's connec- representative sample of points in the domain of the
tion weights causes it to gain knowledge in order to tice it is usually a set of points for which you know the
solve the problem at hand. f-value.
Neural networks have been widely used for both
classification and prediction. In this article, I focus on
the prediction or estimation problem (although with
some few changes, my comments and descriptions also
min_w error(E, w) = sqrt( Σ_{x ∈ E} (f(x) - p(x, w))^2 / |E| )     (1)
apply to classification). Estimating and forecasting fu-
ture conditions are involved in different business activi-
The main goal in the design of an ANN is to obtain a
ties. Some examples include cost estimation, predic-
model that makes good predictions for new inputs (i.e.,
tion of product demand, and financial planning. More-
to provide good generalization). Therefore, the net must
over, the field of prediction also covers other activities,
represent the systematic aspects of the training data
such as medical diagnosis or industrial process model-
rather than their specific details. The standard way to
ing.
measure the generalization provided by the net consists
In this short article I focus on the multilayer neural
of introducing a second set of points in the domain of f
networks because they are the most common. I describe
called the testing set, T. Assume that no point in T
their architecture and some of the most popular training
belongs to E and f(x) is known for all x in T. After the
methods. Then I finish with some associated conclu-
optimization has been performed and the weights have
sions and the appropriate list of references to provide
been set to minimize the error in E (w=w*), the error
some pointers for further study.
across the testing set T is computed (error(T,w*)). The
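Ahead of the formal architecture description in the next section, a minimal sketch may help fix ideas. The fragment below is an illustration only (the tiny data set, the random weights, and the function names are assumptions); it evaluates a single-hidden-layer sigmoid network p(x, w) and the RMSE of equation (1) on a training set E and a testing set T.

```python
# Hedged sketch: forward pass of a one-hidden-layer sigmoid network and the
# RMSE of equation (1); weights are random here, not trained.
import math, random

def forward(x, W_h, b_h, w_o, b_o):
    """p(x, w): hidden sigmoids feeding one linear output node."""
    hidden = [1.0 / (1.0 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))
              for w, b in zip(W_h, b_h)]
    return b_o + sum(wo * h for wo, h in zip(w_o, hidden))

def rmse(pairs, W_h, b_h, w_o, b_o):
    """error(set, w): root mean squared difference between f(x) and p(x, w)."""
    se = [(fx - forward(x, W_h, b_h, w_o, b_o)) ** 2 for x, fx in pairs]
    return math.sqrt(sum(se) / len(se))

random.seed(0)
n, m = 2, 3                                   # inputs and hidden neurons
W_h = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
b_h = [random.uniform(-1, 1) for _ in range(m)]
w_o = [random.uniform(-1, 1) for _ in range(m)]
b_o = random.uniform(-1, 1)

E = [([0.1, 0.9], 0.5), ([0.4, 0.2], 0.3)]    # training pairs (x, f(x))
T = [([0.8, 0.6], 0.7)]                       # testing pairs
print(rmse(E, W_h, b_h, w_o, b_o), rmse(T, W_h, b_h, w_o, b_o))
```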


net must exhibit a good fit between the target f-values and restrict our attention to real functions f: R^n → R). Figure
the output (prediction) in the training set and also in the 1 shows a net where NI = {1, 2, ..., n}, NH = {n+1, n+2, ...,
testing set. If the RMSE in T is significantly higher than n+m} and NO = {s}.
that one in E, you say that the net has memorized the data Given an input pattern x=(x1,...,xn), the neural net-
instead of learning them (i.e., the net has overfitted the work provides the user with an associated output NN(x,w),
training data). which is a function of the weights w. Each node i in the
The optimization of the function given in (1) is a hard input layer receives a signal of amount xi that it sends
problem by itself. Moreover, keep in mind that the final through all its incident arcs to the nodes in the hidden
objective is to obtain a set of weights that provides low layer. Each node n+j in the hidden layer receives a signal
values of error(T,w*) for any set T. In the following input(n+j) according to the expression
sections I summarize some of the most popular and
other not so popular but more efficient methods to train
the net (i.e., to compute appropriate weight values). input(n+j) = wn+j + Σ_{i=1..n} xi wi,n+j

where wn+j is the bias value for node n+j, and wi,n+j is the
MAIN THRUST weight value on the arc from node i in the input layer to
node n+j in the hidden layer. Each hidden node trans-
Several models inspired by biological neural networks forms its input by means of a nonlinear activation func-
have been proposed throughout the years, beginning tion: output(j)=sig(input(j)). The most popular choice
with the perceptron introduced by Rosenblatt (1962). for the activation function is the sigmoid function
He studied a simple architecture where the output of the sig(x)= 1/(1+e-x). Laguna and Mart (2002) test two
net is a transformation of a linear combination of the activation functions for the hidden neurons and con-
input variables and the weights. Minskey and Papert clude that the sigmoid presents superior performance.
(1969) showed that the perceptron can only solve lin- Each hidden node n+j sends the amount of signal
early separable classification problems and is therefore output(n+j) through the arc (n+j,s). The node s in the
of limited interest. A natural extension to overcome its output layer receives the weighted sum of the values
limitations is given by the so-called multilayer- coming from the hidden nodes. This sum, NN(x,w), is the
perceptron, or, simply, multilayer neural networks. I nets output according to the expression:
have considered this architecture with a single hidden
m
layer. A schematic representation of the network ap-
pears in Figure 1. NN(x,w) = ws + output (n + j) w
j =1
n+ j ,s

Neural Network Architecture In the process of training the net (supervised learn-
ing), the problem is to find the values of the weights
Let NN=(N, A) be an ANN where N is the set of nodes and (including the bias factors) that minimize the error
A is the set of arcs. N is partitioned into three subsets: (RMSE) across the training set E. After the optimization
NI, input nodes, NH, hidden nodes, and NO, output nodes. has been performed and the weights have been set
I assume that n variables exist in the function that I want (w=w*),the net is ready to produce the output for any
to predict or approximate, therefore |N I|= n. The neural input value. The testing error Error(T,w*) computes the
network has m hidden neurons (|N H|= m) with a bias term Root Mean Squared Error across the elements in the
in each hidden neuron and a single output neuron (we testing set T={y1, y2,..,ys}, where no one belongs to the
training set E:

Figure 1. Neural network diagram s

inp u ts w 1, n+ 1 error ( y , w ) i

Error(T,w*) = i =1 .
x1 1 n+1 s
w n+ 1 ,s
ou tp ut
Training Methods
x2 2 n+2 s

Considering the supervised learning described in the


previous section, many different training methods have
xn n n+m been proposed. Here, I summarize some of the most
relevant, starting with the well-known backpropagation


method. For a deeper understanding of them, see the based on strategy can provide useful clues about how
excellent book by Bishop (1995). the strategy may profitably be changed.
Backpropagation (BP) was the first method for neu- As far as I know, the first tabu search approach for
ral network training and is still the most widely used neural network training is due to Sexton et al. (1998).
algorithm in practical applications. It is a gradient de- A short description follows. An initial solution x0 is
scent method that searches for the global optimum of the randomly drawn from a uniform distribution in the
network weights. Each iteration consists of two steps. range [-10,10]. Solutions are randomly generated in
First, partial derivatives ∂Error/∂w are computed for this range for a given number of iterations. When
each weight in the net. Then weights are modified to generating a new point xnew, aspiration level and tabu
reduce the RMSE according to the direction given by the conditions are checked. If f(xnew)<f(xbest), then the point
gradient. There have been different modifications to this is automatically accepted and both xbest and f(xbest) are
basic procedure; the most significant is the addition of a updated; otherwise, the tabu conditions are tested. If
momentum term to prevent zigzagging in the search. there is one solution xi in the tabu list (TL) such as
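One such gradient step can be sketched as follows. This is an illustration only: it uses a numerical (finite-difference) gradient instead of the analytic derivatives of backpropagation proper, and the learning rate, momentum value, toy error function, and names are assumptions.

```python
# Hedged sketch of a gradient-descent weight update with momentum, using a
# finite-difference gradient of the training error (names and data invented).
def gradient_step(weights, error_fn, velocity, lr=0.1, momentum=0.9, eps=1e-5):
    grad = []
    for k in range(len(weights)):
        bumped = list(weights)
        bumped[k] += eps
        grad.append((error_fn(bumped) - error_fn(weights)) / eps)
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    weights = [w + v for w, v in zip(weights, velocity)]
    return weights, velocity

# toy quadratic error standing in for error(E, w)
error_fn = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2
w, v = [0.0, 0.0], [0.0, 0.0]
for _ in range(200):
    w, v = gradient_step(w, error_fn, v)
print([round(x, 3) for x in w])   # converges toward [1.0, -2.0]
```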
Because the neural network training problem can be f(xnew) ∈ [f(xi)-0.01*f(xi), f(xi)+0.01*f(xi)], then the com-
expressed as a nonlinear unconstrained optimization plete test is applied to xnew and xi; otherwise, the point
problem, I might use more elaborated nonlinear methods is accepted. The test checks whether all the weights in
than the gradient descent to solve it. A selection of the xnew are within 0.01 from xi; in this case the point is
best established algorithms in unconstrained nonlinear rejected; otherwise, the point is accepted, and xnew and
optimization has also been used in this context. Specifi- f(x new) are entered into TL. This process continues for
cally, the nonlinear simplex method, the direction set 1,000 iterations of accepted solutions. Then another
method, the conjugate gradient method, the Levenberg- cycle of 1,000 iterations of random sampling begins.
Marquardt algorithm (More, 1978), and the GRG2 These cycles will continuously repeat while f(x best)
(Smith and Lasdon, 1992). improves.
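A single sampling cycle of the tabu test described above might be sketched as follows. This is one reading of that description, not the authors' code; the stand-in objective function, the stopping rule, and the names are assumptions.

```python
# Hedged sketch of the tabu acceptance test described above: a candidate whose
# objective value is within 1% of a tabu-listed solution is examined further,
# and rejected if all of its weights lie within 0.01 of that solution's.
import random

def is_tabu(x_new, f_new, tabu_list):
    for x_i, f_i in tabu_list:
        if abs(f_new - f_i) <= 0.01 * abs(f_i):              # value test
            if all(abs(a - b) <= 0.01 for a, b in zip(x_new, x_i)):
                return True                                  # too close: reject
    return False

def tabu_sampling(f, dim, iters=1000, seed=0):
    rng = random.Random(seed)
    x_best = [rng.uniform(-10, 10) for _ in range(dim)]
    f_best, tabu_list, accepted = f(x_best), [], 0
    while accepted < iters:
        x_new = [rng.uniform(-10, 10) for _ in range(dim)]
        f_new = f(x_new)
        if f_new < f_best:                                   # aspiration: always accept
            x_best, f_best = x_new, f_new
        elif is_tabu(x_new, f_new, tabu_list):
            continue
        tabu_list.append((x_new, f_new))
        accepted += 1
    return x_best, f_best

f = lambda w: sum((wi - 1.0) ** 2 for wi in w)               # stand-in for RMSE(w)
print(tabu_sampling(f, dim=3))
```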
Recently, metaheuristic methods have also been Martí and El-Fallahi (2004) propose an improved
adapted to this problem. Specifically, on one hand you tabu search method that consists of three phases:
can find those methods based on local search proce- MultiRSimplex, TSProb, and TSFreq. After the initial-
dures, and on the other, those methods based on popula- ization with the MultiRSimplex phase, the procedure
tion of solutions known as evolutionary methods. In the performs iterations in a loop consisting in alternating
first category, two methods have been applied, simulated both phases, TSProb and TSFreq, to intensify and diver-
annealing and tabu search, while in the second you can sify the search respectively. In this work, a computa-
find the so-called genetic algorithms, the scatter search, tional study of 12 methods for neural network training
and, more recently, a path relinking implementation. is presented, including nonlinear and local-search-based
Several studies (Sexton, 1998) have shown that tabu optimizers. Overall, experiments with 45 functions
search outperforms the simulated annealing implemen- from the literature were performed to compare the
tation; therefore, I first focus on the different tabu procedures. The experiments show that some functions
search implementations for ANN training. cannot be approximated with a reasonable accuracy
level when training the net for a limited number of
Tabu Search iterations. The experimentation also shows that the
proposed TS provides, on average, the best solutions
Tabu search (TS) is based on the premise that in order to (best approximations).
qualify as intelligent, problem solving must incorporate
adaptive memory and responsive exploration. The adap- Evolutionary Methods
tive memory feature of TS allows the implementation of
procedures that are capable of searching the solution The idea of applying the biological principle of natural
space economically and effectively. Because local evolution to artificial systems, introduced more than
choices are guided by information collected during the three decades ago, has seen impressive growth in the
search, TS contrasts with memoryless designs that heavily past few years. Evolutionary algorithms have been suc-
rely on semirandom processes that implement a form of cessfully applied to numerous problems from different
sampling. The emphasis on responsive exploration in domains, including optimization, automatic program-
tabu search, whether in a deterministic or probabilistic ming, machine learning, economics, ecology, popula-
implementation, derives from the supposition that a bad tion genetics, studies of evolution and learning, and
strategic choice can yield more information than a good social systems.
random choice. In a system that uses memory, a bad choice A genetic algorithm is an iterative procedure that
consists of a constant-size population of individuals,


each represented by a finite string of symbols, known as solutions, a reference set update method to build and
the genome, encoding a possible solution in a given maintain a reference set consisting of the b best solu-
problem space. This space, referred to as the search tions found, a subset generation method to operate on the
space, comprises all possible solutions to the problem reference set in order to produce a subset of its solutions
at hand. Solutions to a problem were originally encoded as a basis for creating combined solutions, and a solution
as binary strings due to certain computational advan- combination method to transform a given subset of
tages associated with such encoding. Also, the theory solutions produced by the subset generation method into
about the behavior of algorithms was based on binary one or more combined solution vectors. An exhaustive
strings. Because in many instances it is impractical to description of these methods and how they operate can
represent solutions by using binary strings, the solution be found in Laguna and Martí (2003).
representation has been extended in recent years to Laguna and Martí (2002) proposed a three-step Scat-
include character-based encoding, real-valued encod- ter Search algorithm for ANNs. El-Fallahi, Martí, and
ing, and tree representations. Lasdon (in press) propose a new training method based
The standard genetic algorithm proceeds as follows. on the path relinking methodology. Path relinking starts
An initial population of individuals is generated at ran- from a given set of elite solutions obtained during a
dom or heuristically. Every evolutionary step, known as previous search process. Path relinking and its cousin,
a generation, the individuals in the current population Scatter Search, are mainly based on two elements: com-
are decoded and evaluated according to some predefined binations and local search. Path relinking generalizes
quality criterion, referred to as the fitness, or fitness the concept of combination beyond its usual application
function. To form a new population (the next genera- to consider paths between solutions. Local search, per-
tion), individuals are selected according to their fitness. formed now with the GRG2 optimizer, intensifies the
Many selection procedures are currently in use, one of search by seeking local optima. The paper shows an
the simplest being Holland's original fitness-propor- empirical comparison of the proposed method with the
tionate selection, where individuals are selected with a best previous evolutionary approaches, and the associ-
probability proportional to their relative fitness. This ated experiments show the superiority of the new method
ensures that the expected number of times an individual in terms of solution quality (prediction accuracy). On
is chosen is approximately proportional to its relative the other hand, these experiments confirm again that a
performance in the population. Thus, high-fitness few functions cannot be approximated with any of the
(good) individuals stand a better chance of reproduc- current training methods.
ing, while low-fitness ones are more likely to disappear.
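The general scheme just outlined can be sketched for ANN training, where, as noted below, an individual is a vector of candidate weights and its fitness is derived from the training error. The fragment is an illustration only (population size, mutation rate, the stand-in error function, and the names are assumptions), not the hybrid algorithms discussed next.

```python
# Hedged sketch of a genetic algorithm over real-valued weight vectors with
# fitness-proportionate selection; error function and settings are invented.
import random
rng = random.Random(1)

def evolve(error_fn, dim, pop_size=20, generations=100, mut_rate=0.2):
    pop = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [1.0 / (1e-9 + error_fn(ind)) for ind in pop]   # lower error -> fitter
        def select():
            return rng.choices(pop, weights=fitness, k=1)[0]
        children = []
        while len(children) < pop_size:
            a, b = select(), select()
            cut = rng.randrange(1, dim) if dim > 1 else 0
            child = a[:cut] + b[cut:]                             # one-point crossover
            child = [g + rng.gauss(0, 0.1) if rng.random() < mut_rate else g
                     for g in child]                              # mutation
            children.append(child)
        pop = children
    return min(pop, key=error_fn)

error_fn = lambda w: sum((wi - 0.5) ** 2 for wi in w)             # stand-in for RMSE(w)
print([round(g, 2) for g in evolve(error_fn, dim=4)])
```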
In terms of ANN training, a solution (or individual)
consists of an array with the net's weights and its asso- FUTURE TRENDS
ciated fitness is usually the RMSE obtained with this
solution in the training set. You can find a lot of research An open problem in the context of prediction is to
in GA implementations to ANNs. Consider, for in- compare ANNs with some modern approximation tech-
stance, the recent work by Alba and Chicano (2004), in niques developed in statistics. Specifically, the non-
which a hybrid GA is proposed. Here, the hybridization parametric additive models and local regression can
refers to the inclusion of problem-dependent knowl- also offer good solutions to the general approximation
edge in a general search template. The hybrid algorithms or prediction problem. The development of hybrid sys-
used in this work are combinations of two algorithms tems from both technologies could give the starting
(weak hybridization), where one of them acts as an point for a new generation of prediction systems.
operator in the other. This kind of combinations has
produced the most successful training methods in the
last few years. The authors proposed here the combina- CONCLUSION
tion of GA with the BP algorithm as well as GA with the
Levenberg-Marquardt for training ANNs. In this work I review the most representative methods for
Scatter search (SS) was first introduced in Glover neural network training. Several computational studies
(1977) as a heuristic for integer programming. The with some of these methods reveal that the best results
following template is a standard for implementing scat- are achieved with a combination of a metaheuristic
ter search that consists of five methods. A diversifica- procedure with a nonlinear optimizer. These experi-
tion generation method to generate a collection of ments also show that from a practical point of view,
diverse trial solutions, an improvement method to trans- some functions cannot be approximated.
form a trial solution into one or more enhanced trial


REFERENCES Sexton. (1998). Global optimization for artificial neural


networks: A tabu search application. European Journal
Alba, E., & Chicano, J. F. (2004). Training neural networks of Operational Research, 106, 570-584.
with GA Hybrid algorithms. In K. Deb (Ed.), Proceedings Sexton, R. S., Dorsey, R. E., & Johnson, J. D. (1999).
of the Genetic and Evolutionary Computation Confer- Optimization of neural networks: A comparative analy-
ence, USA. sis of the genetic algorithm and simulated annealing.
Bishop, C. M. (1995). Neural networks for pattern European Journal of Operational Research, 114, 589-601.
recognition. New York: Oxford University Press. Smith, S., & Lasdon, L. (1992). Solving large nonlinear
El-Fallahi, A., Mart, R., & Lasdon, L. (in press). Path programs using GRG. ORSA Journal on Computing,
relinking and GRG for artificial neural networks. Euro- 4(1), 2-15.
pean Journal of Operational Research.
Glover, F. (1977). Heuristics for integer programming
using surrogate constraints. Decision Sciences, 8, 156-166. KEY TERMS
Glover, F., & Laguna, M. (1993). Tabu search. In C. Reeves Classification: Also known as a recognition prob-
(Ed.), Heuristic techniques for combinatorial problems lem; the identification of the class to which a given
(pp. 70-150). object belongs.
Laguna, M., & Mart, R. (2002). Neural network predic- Genetic Algorithm: An iterative procedure that
tion in a system for optimizing simulations. IIE Trans- consists of a constant-size population of individuals,
actions, 34(3), 273-282. each represented by a finite string of symbols, known as
Laguna, M., & Mart, R. (2003). Scatter search: Meth- the genome, encoding a possible solution in a given
odology and implementations in C. Kluwer Academic. problem space.

Mart, R., & El-Fallahi, A. (2004). Multilayer neural Metaheuristic: A master strategy that guides and
networks: An experimental evaluation of on-line train- modifies other heuristics to produce solutions beyond
ing methods. Computers and Operations Research, those that are normally generated in a quest for local
31, 1491-1513. optimality.

Mart, R., Laguna, M., & Glover, F. (in press). Principles Network Training: The process of finding the val-
of scatter search. European Journal of Operational ues of the network weights that minimize the error
Research. across a set of input/output pairs (patterns) called the
training set.
Minsky, M. L., & Papert, S.A. (1969). Perceptrons
(Expanded ed.). Cambridge, MA: MIT Press. Optimization: The quantitative study of optima and
the methods for finding them.
More, J. J. (1978). The Levenberg-Marquardt algo-
rithm: Implementation and theory. In G. Watson (Ed.), Prediction: Consists of approximating unknown
Lecture Notes in Mathematics: Vol. 630. functions. The nets input is the values of the function
variables, and the output is the estimation of the func-
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & tion image.
Flannery, B. P. (1992). Numerical recipes: The art of
scientific computing. Cambridge, MA: Cambridge Uni- Scatter Search: A metaheuristic that belongs to the
versity Press. evolutionary methods.

Rosenblatt, F. (1962). Principles of neurodynamics: Tabu Search: A metaheuristic procedure based on


Perceptrons and theory of brain mechanisms. Washing- principles of intelligent search. Its premise is that prob-
ton, DC: Spartan. lem solving, in order to qualify as intelligent, must incor-
porate adaptive memory and responsive exploration.

58

TEAM LinG
59

Association Rule Mining


Yew-Kwong Woon
Nanyang Technological University, Singapore

Wee-Keong Ng
Nanyang Technological University, Singapore

Ee-Peng Lim
Nanyang Technological University, Singapore

INTRODUCTION BACKGROUND

Association Rule Mining (ARM) is concerned with how Recently, a new class of problems emerged to challenge
items in a transactional database are grouped together. It ARM researchers: Incoming data is streaming in too
is commonly known as market basket analysis, because fast and changing too rapidly in an unordered and un-
it can be likened to the analysis of items that are fre- bounded manner. This new phenomenon is termed data
quently put together in a basket by shoppers in a market. stream (Babcock, Babu, Datar, Motwani, & Widom,
From a statistical point of view, it is a semiautomatic 2002).
technique to discover correlations among a set of vari- One major area where the data stream phenomenon
ables. is prevalent is the World Wide Web (Web). A good
ARM is widely used in myriad applications, includ- example is an online bookstore, where customers can
ing recommender systems (Lawrence, Almasi, Kotlyar, purchase books from all over the world at any time. As
Viveros, & Duri, 2001), promotional bundling (Wang, a result, its transactional database grows at a fast rate
Zhou, & Han, 2002), Customer Relationship Manage- and presents a scalability problem for ARM. Traditional
ment (CRM) (Elliott, Scionti, & Page, 2003), and cross- ARM algorithms, such as Apriori, were not designed to
selling (Brijs, Swinnen, Vanhoof, & Wets, 1999). In handle large databases that change frequently (Agrawal
addition, its concepts have also been integrated into & Srikant, 1994). Each time a new transaction arrives,
other mining tasks, such as Web usage mining (Woon, Apriori needs to be restarted from scratch to perform
Ng, & Lim, 2002), clustering (Yiu & Mamoulis, 2003), ARM. Hence, it is clear that in order to conduct ARM on
outlier detection (Woon, Li, Ng, & Lu, 2003), and the latest state of the database in a timely manner, an
classification (Dong & Li, 1999), for improved effi- incremental mechanism to take into consideration the
ciency and effectiveness. latest transaction must be in place.
CRM benefits greatly from ARM as it helps in the In fact, a host of incremental algorithms have already
understanding of customer behavior (Elliott et al., 2003). been introduced to mine association rules incremen-
Marketing managers can use association rules of prod- tally (Sarda & Srinivas, 1998). However, they are only
ucts to develop joint marketing campaigns to acquire incremental to a certain extent; the moment the univer-
new customers. The application of ARM for the cross- sal itemset (the number of unique items in a database)
selling of supermarket products has been successfully (Woon, Ng, & Das, 2001) is changed, they have to be
attempted in many cases (Brijs et al., 1999). In one restarted from scratch. The universal itemset of any
particular study involving the personalization of super- online store would certainly be changed frequently,
market product recommendations, ARM has been ap- because the store needs to introduce new products and
plied with much success (Lawrence et al., 2001). To- retire old ones for competitiveness. Moreover, such
gether with customer segmentation, ARM helped to incremental ARM algorithms are efficient only when
increase revenue by 1.8%. the database has not changed much since the last mining.
In the biology domain, ARM is used to extract novel The use of data structures in ARM, particularly the
knowledge on protein-protein interactions (Oyama, trie, is one viable way to address the data stream phe-
Kitano, Satou, & Ito, 2002). It is also successfully nomenon. Data structures first appeared when program-
applied in gene expression analysis to discover biologi- ming became increasingly complex during the 1960s. In
cally relevant associations between different genes or his classic book, The Art of Computer Programming
between different environment conditions (Creighton Knuth (1968) reviewed and analyzed algorithms and data
& Hanash, 2003). structures that are necessary for program efficiency.

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Association Rule Mining

Since then, the traditional data structures have been The Frequent Pattern-growth (FP-growth) algo-
extended, and new algorithms have been introduced for rithm is a recent association rule mining algorithm that
them. Though computing power has increased tremen- achieves impressive results (Han, Pei, Yin, & Mao,
dously over the years, efficient algorithms with custom- 2004). It uses a compact tree structure called a Fre-
ized data structures are still necessary to obtain timely quent Pattern-tree (FP-tree) to store information about
and accurate results. This fact is especially true for ARM, frequent 1-itemsets. This compact structure removes
which is a computationally intensive process. the need for multiple database scans and is constructed
The trie is a multiway tree structure that allows fast with only 2 scans. In the first database scan, frequent 1-
searches over string data. In addition, as strings with itemsets are obtained and sorted in support descending
common prefixes share the same nodes, storage space order. In the second scan, items in the transactions are
is better utilized. This makes the trie very useful for first sorted according to the order of the frequent 1-
storing large dictionaries of English words. Figure 1 itemsets. These sorted items are used to construct the
shows a trie storing four English words (ape, apple, FP-tree. Figure 2 shows an FP-tree constructed from
base, and ball). Several novel trielike data structures the database in Table 1.
have been introduced to improve the efficiency of ARM, FP-growth then proceeds to recursively mine FP-
and we discuss them in this section. trees of decreasing size to generate frequent itemsets
Amir, Feldman, & Kashi (1999) presented a new way without candidate generation and database scans. It does
of mining association rules by using a trie to preprocess so by examining all the conditional pattern bases of
the database. In this approach, all transactions are mapped the FP-tree, which consists of the set of frequent itemsets
onto a trie structure. This mapping involves the extrac- occurring with the suffix pattern. Conditional FP-trees
tion of the powerset of the transaction items and the are constructed from these conditional pattern bases,
updating of the trie structure. Once built, there is no and mining is carried out recursively with such trees to
longer a need to scan the database to obtain support discover frequent itemsets of various sizes. However,
counts of itemsets, because the trie structure contains because both the construction and the use of the FP-
all their support counts. To find frequent itemsets, the trees are complex, the performance of FP-growth is
structure is traversed by using depth-first search, and reduced to be on par with Apriori at support thresholds
itemsets with support counts satisfying the minimum of 3% and above. It only achieves significant speed-ups
support threshold are added to the set of frequent at support thresholds of 1.5% and below. Moreover, it is
itemsets. only incremental to a certain extent, depending on the
Drawing upon that work, Yang, Johar, Grama, & FP-tree watermark (validity support threshold). As new
Szpankowski (2000) introduced a binary Patricia trie to transactions arrive, the support counts of items in-
reduce the heavy memory requirements of the prepro- crease, but their relative support frequency may de-
cessing trie. To support faster support queries, the crease, too. Suppose, however, that the new transactions
authors added a set of horizontal pointers to index cause too many previously infrequent itemsets to become
nodes. They also advocated the use of some form of
primary threshold to further prune the structure. How-
ever, the compression achieved by the compact Patricia Table 1. A sample transactional database
trie comes at a hefty price: It greatly complicates the
TID Items
horizontal pointer index, which is a severe overhead. In 100 AC
addition, after compression, it will be difficult for the 200 BC
Patricia trie to be updated whenever the database is 300 ABC
altered. 400 ABCD

Figure 1. An example of a trie for storing English


words Figure 2. An FP-tree constructed from the database in
Table 1 at a support threshold of 50%
ROOT

A B ROOT
P A
C
E P S L
A B
L E L

E B

60

TEAM LinG
Association Rule Mining

frequent that is, the watermark is raised too high (in It is highly scalable with respect to the size of both
order to make such itemsets infrequent) according to a the database and the universal itemset. A
user-defined level then the FP-tree must be recon- It is incrementally updated as transactions are
structed. added or deleted.
The use of lattice theory in ARM was pioneered by It is constructed independent of the support
Zaki (2000). Lattice theory allows the vast search space threshold and thus can be used for various support
to be decomposed into smaller segments that can be thresholds.
tackled independently in memory or even in other ma- It helps to speed up ARM algorithms to a certain
chines, thus promoting parallelism. However, they re- extent that allows results to be obtained in real-
quire additional storage space as well as different tra- time.
versal and construction techniques. To complement the
use of lattices, Zaki uses a vertical database format, We shall now discuss our novel trie data structure
where each itemset is associated with a list of transac- that not only satisfies the above requirements but also
tions known as a tid-list (transaction identifierlist). outperforms the discussed existing structures in terms
This format is useful for fast frequency counting of of efficiency, effectiveness, and practicality. Our struc-
itemsets but generates additional overheads because most ture is termed Support-Ordered Trie Itemset (SOTrieIT
databases have a horizontal format and would need to be pronounced so-try-it). It is a dual-level support-
converted first. ordered trie data structure used to store pertinent
The Continuous Association Rule Mining Algorithm itemset information to speed up the discovery of fre-
(CARMA), together with the support lattice, allows the quent itemsets.
user to change the support threshold and continuously As its construction is carried out before actual
displays the resulting association rules with support and mining, it can be viewed as a preprocessing step. For
confidence bounds during its first scan/phase (Hidber, every transaction that arrives, 1-itemsets and 2-itemsets
1999). During the second phase, it determines the pre- are first extracted from it. For each itemset, the
cise support of each itemset and extracts all the frequent SOTrieIT will be traversed in order to locate the node
itemsets. CARMA can readily compute frequent itemsets that stores its support count. Support counts of 1-
for varying support thresholds. However, experiments itemsets and 2-itemsets are stored in first-level and
reveal that CARMA only performs faster than Apriori at second-level nodes, respectively. The traversal of the
support thresholds of 0.25% and below, because of the SOTrieIT thus requires at most two redirections, which
tremendous overheads involved in constructing the sup- makes it very fast. At any point in time, the SOTrieIT
port lattice. contains the support counts of all 1-itemsets and 2-
The adjacency lattice, introduced by Aggarwal & Yu itemsets that appear in all the transactions. It will then
(2001), is similar to Zakis boolean powerset lattice, be sorted level-wise from left to right according to the
except the authors introduced the notion of adjacency support counts of the nodes in descending order.
among itemsets, and it does not rely on a vertical data- Figure 3 shows a SOTrieIT constructed from the
base format. Two itemsets are said to be adjacent to each database in Table 1. The bracketed number beside an item
other if one of them can be transformed to the other with is its support count. Hence, the support count of itemset
the addition of a single item. To address the problem of {AB} is 2. Notice that the nodes are ordered by support
heavy memory requirements, a primary threshold is de- counts in a level-wise descending order.
fined. This term signifies the minimum support thresh- In algorithms such as FP-growth that use a similar
old possible to fit all the qualified itemsets into the data structure to store itemset information, the structure
adjacency lattice in main memory. However, this ap- must be rebuilt to accommodate updates to the universal
proach disallows the mining of frequent itemsets at
support thresholds lower than the primary threshold.
Figure 3. A SOTrieIT structure

ROOT
MAIN THRUST

As shown in our previous discussion, none of the exist-


ing data structures can effectively address the issues C(4) A(3) B(3) D(1)
induced by the data stream phenomenon. Here are the
desirable characteristics of an ideal data structure that
can help ARM cope with data streams:
D(1) C(3) B(2) D(1) C(3) D(1)

61

TEAM LinG
Association Rule Mining

itemset. The SOTrieIT can be easily updated to accommo- more varied to cater to a broad customer base; transac-
date the new changes. If a node for a new item in the tion databases will grow in both size and complexity.
universal itemset does not exist, it will be created and Hence, association rule mining research will certainly
inserted into the SOTrieIT accordingly. If an item is continue to receive much attention in the quest for
removed from the universal itemset, all nodes contain- faster, more scalable and more configurable algorithms.
ing that item need only be removed, and the rest of the
nodes would still be valid.
Unlike the trie structure of Amir et al. (1999), the CONCLUSION
SOTrieIT is ordered by support count (which speeds up
mining) and does not require the powersets of transac- Association rule mining is an important data mining task
tions (which reduces construction time). The main weak- with several applications. However, to cope with the
ness of the SOTrieIT is that it can only discover frequent current explosion of raw data, data structures must be
1-itemsets and 2-itemsets; its main strength is its speed utilized to enhance its efficiency. We have analyzed
in discovering them. They can be found promptly be- several existing trie data structures used in association
cause there is no need to scan the database. In addition, rule mining and presented our novel trie structure, which
the search (depth first) can be stopped at a particular has been proven to be most useful and practical. What
level the moment a node representing a nonfrequent lies ahead is the parallelization of our structure to
itemset is found, because the nodes are all support further accommodate the ever-increasing demands of
ordered. todays need for speed and scalability to obtain associa-
Another advantage of the SOTrieIT, compared with tion rules in a timely manner. Another challenge is to
all previously discussed structures, is that it can be design new data structures that facilitate the discovery
constructed online, meaning that each time a new trans- of trends as association rules evolve over time. Differ-
action arrives, the SOTrieIT can be incrementally up- ent association rules may be mined at different time
dated. This feature is possible because the SOTrieIT is points and, by understanding the patterns of changing rules,
constructed without the need to know the support thresh- additional interesting knowledge may be discovered.
old; it is support independent. All 1-itemsets and 2-
itemsets in the database are used to update the SOTrieIT
regardless of their support counts. To conserve storage REFERENCES
space, existing trie structures such as the FP-tree have
to use thresholds to keep their sizes manageable; thus, Aggarwal, C. C., & Yu, P. S. (2001). A new approach to
when new transactions arrive, they have to be recon- online generation of association rules. IEEE Transac-
structed, because the support counts of itemsets will tions on Knowledge and Data Engineering, 13(4), 527-
have changed. 540.
Finally, the SOTrieIT requires far less storage space
than a trie or Patricia trie because it is only two levels Agrawal, R., & Srikant, R. (1994). Fast algorithms for
deep and can be easily stored in both memory and files. mining association rules. Proceedings of the 20th In-
Although this causes some input/output (I/O) overheads, ternational Conference on Very Large Databases (pp.
it is insignificant as shown in our extensive experiments. 487-499), Chile.
We have designed several algorithms to work synergis-
tically with the SOTrieIT and, through experiments with Amir, A., Feldman, R., & Kashi, R. (1999). A new and
existing prominent algorithms and a variety of databases, versatile method for association generation. Informa-
we have proven the practicality and superiority of our tion Systems, 22(6), 333-347.
approach (Das, Ng, & Woon, 2001; Woon et al., 2001). In Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom,
fact, our latest algorithm, FOLD-growth, is shown to J. (2002). Models and issues in data stream systems.
outperform FP-growth by more than 100 times (Woon, Proceedings of the ACM SIGMOD/PODS Conference
Ng, & Lim, 2004). (pp. 1-16), USA.
Brijs, T., Swinnen, G., Vanhoof, K., & Wets, G. (1999).
FUTURE TRENDS Using association rules for product assortment deci-
sions: A case study. Proceedings of the Fifth ACM
The data stream phenomenon will eventually become SIGKDD Conference (pp. 254-260), USA.
ubiquitous as Internet access and bandwidth become Creighton, C., & Hanash, S. (2003). Mining gene expres-
increasingly affordable. With keen competition, prod- sion databases for association rules. Bioinformatics,
ucts will become more complex with customization and 19(1), 79-86.

62

TEAM LinG
Association Rule Mining

Das, A., Ng, W. K., & Woon, Y. K. (2001). Rapid associa- Woon, Y. K., Ng, W. K., & Lim, E. P. (2002). Online and
tion rule mining. Proceedings of the 10th International incremental mining of separately grouped web access A
Conference on Information and Knowledge Manage- logs. Proceedings of the Third International Conference
ment (pp. 474-481), USA. on Web Information Systems Engineering (pp. 53-62),
Singapore.
Dong, G., & Li, J. (1999). Efficient mining of emerging
patterns: Discovering trends and differences. Proceed- Woon, Y. K., Ng, W. K., & Lim, E. P. (2004). A support-
ings of the Fifth International Conference on Knowl- ordered trie for fast frequent itemset discovery. IEEE
edge Discovery and Data Mining (pp. 43-52), USA. Transactions on Knowledge and Data Engineering,
16(5).
Elliott, K., Scionti, R., & Page, M. (2003). The confluence
of data mining and market research for smarter CRM. Yang, D. Y., Johar, A., Grama, A., & Szpankowski, W.
Retrieved from http://www.spss.com/home_page/ (2000). Summary structures for frequency queries on
wp133.htm large transaction sets. Proceedings of the Data Com-
pression Conference (pp. 420-429).
Han, J., Pei, J., Yin Y., & Mao, R. (2004). Mining frequent
patterns without candidate generation: A frequent-pat- Yiu, M. L., & Mamoulis, N. (2003). Frequent-pattern based
tern tree approach. Data Mining and Knowledge Discov- iterative projected clustering. Proceedings of the Third
ery, 8(1), 53-97. International Conference on Data Mining, USA.
Hidber, C. (1999). Online association rule mining. Pro- Zaki, M. J. (2000). Scalable algorithms for association
ceedings of the ACM SIGMOD Conference (pp. 145-154), mining. IEEE Transactions on Knowledge and Data
USA. Engineering, 12(3), 372-390.
Knuth, D.E. (1968). The art of computer programming, Vol.
1. Fundamental Algorithms. Addison-Wesley Publish-
ing Company. KEY TERMS
Lawrence, R. D., Almasi, G. S., Kotlyar, V., Viveros, M. S.,
& Duri, S. (2001). Personalization of supermarket product Apriori: A classic algorithm that popularized asso-
recommendations. Data Mining and Knowledge Discov- ciation rule mining. It pioneered a method to generate
ery, 5(1/2), 11-32. candidate itemsets by using only frequent itemsets in
the previous pass. The idea rests on the fact that any
Oyama, T., Kitano, K., Satou, K., & Ito, T. (2002). Extrac- subset of a frequent itemset must be frequent as well.
tion of knowledge on protein-protein interaction by asso- This idea is also known as the downward closure prop-
ciation rule discovery. Bioinformatics, 18(5), 705-714. erty.
Sarda, N. L., & Srinivas, N. V. (1998). An adaptive algo- Itemset: An unordered set of unique items, which
rithm for incremental mining of association rules. Pro- may be products or features. For computational effi-
ceedings of the Ninth International Conference on Da- ciency, the items are often represented by integers. A
tabase and Expert Systems (pp. 240-245), Austria. frequent itemset is one with a support count that ex-
ceeds the support threshold, and a candidate itemset is
Wang, K., Zhou, S., & Han, J. (2002). Profit mining: From a potential frequent itemset. A k-itemset is an itemset
patterns to actions. Proceedings of the Eighth Interna- with exactly k items.
tional Conference on Extending Database Technology
(pp. 70-87), Prague. Key: A unique sequence of values that defines the
location of a node in a tree data structure.
Woon, Y. K., Li, X., Ng, W. K., & Lu, W. F. (2003).
Parameterless data compression and noise filtering us- Patricia Trie: A compressed binary trie. The
ing association rule mining. Proceedings of the Fifth Patricia (Practical Algorithm to Retrieve Information
International Conference on Data Warehousing and Coded in Alphanumeric) trie is compressed by avoiding
Knowledge Discovery (pp. 278-287), Prague. one-way branches. This is accomplished by including in
each node the number of bits to skip over before making
Woon, Y. K., Ng, W. K., & Das, A. (2001). Fast online the next branching decision.
dynamic association rule mining. Proceedings of the
Second International Conference on Web Information SOTrieIT: A dual-level trie whose nodes represent
Systems Engineering (pp. 278-287), Japan. itemsets. The position of a node is ordered by the support
count of the itemset it represents; the most frequent

63

TEAM LinG
Association Rule Mining

itemsets are found on the leftmost branches of the rithm has to be executed many times before this value can
SOTrieIT. be well adjusted to yield the desired results.
Support Count of an Itemset: The number of transac- Trie: An n-ary tree whose organization is based on
tions that contain a particular itemset. key space decomposition. In key space decomposition,
the key range is equally subdivided, and the splitting
Support Threshold: A threshold value that is used to position within the key range for each node is pre-
decide if an itemset is interesting/frequent. It is defined by defined.
the user, and generally, an association rule mining algo-

64

TEAM LinG
65

Association Rule Mining and Application to A


MPIS
Raymond Chi-Wing Wong
The Chinese University of Hong Kong, Hong Kong

Ada Wai-Chee Fu
The Chinese University of Hong Kong, Hong Kong

INTRODUCTION frequent itemsets. It is an iterative approach and there are


two steps in each iteration. The first step generates a set
Association rule mining (Agrawal, Imilienski, & Swami, of candidate itemsets. Then, the second step prunes all
1993) has been proposed for understanding the relation- disqualified candidates (i.e., all infrequent itemsets). The
ships among items in transactions or market baskets. For iterations begin with size 2 itemsets and the size is
instance, if a customer buys butter, what is the chance that incremented at each iteration. The algorithm is based on
he/she buys bread at the same time? Such information may the closure property of frequent itemsets: if a set of items
be useful for decision makers to determine strategies in a is frequent, then all its proper subsets are also frequent.
store. The weaknesses of this algorithm are the generation of a
large number of candidate itemsets and the requirement to
scan the database once in each iteration.
BACKGROUND A data structure called FP-tree and an efficient algo-
rithm called FP-growth are proposed by Han, Pei, & Yin
Given a set I = {I1, I2,, In} of items (e.g., carrot, orange and (2000) to overcome the above weaknesses. The idea of FP-
knife) in a supermarket. The database contains a number tree is fetching all transactions from the database and
of transactions. Each transaction t is a binary vector with inserting them into a compressed tree structure. Then,
t[k]=1 if t bought item Ik and t[k]=0 otherwise (e.g., {1, algorithm FP-growth reads from the FP-tree structure to
mine frequent itemsets.
0, 0, 1, 0}). An association rule is of the form X Ij, where
X is a set of some items in I, and Ij is a single item not in X
(e.g., {Orange, Knife} Plate).
A transaction t satisfies X if for all items Ik in X, t[k]
MAIN THRUST
= 1. The support for a rule X Ij is the fraction of trans-
actions that satisfy the union of X and Ij. A rule X Ij has Variations in Association Rules
confidence c% if and only if c% of transactions that
satisfy X also satisfy Ij. Many variations on the above problem formulation have
The mining process of association rule can be divided been suggested. The association rules can be classified
into two steps: based on the following (Han & Kamber, 2000):

1. Frequent Itemset Generation: Generate all sets of 1. Association Rules Based on the Type of Values of
items that have support greater than or equal to a Attribute: Based on the type of values of attributes,
certain threshold, called minsupport there are two kinds Boolean association rule,
2. Association Rule Generation: From the frequent which is presented above, and quantitative associa-
itemsets, generate all association rules that have tion rule. Quantitative association rule describes
confidence greater than or equal to a certain thresh- the relationships among some quantitative attributes
old called minconfidence (e.g., income and age). An example is
income(40K..50K) age(40..45). One proposed
Step 1 is much more difficult compared with Step 2. method is grid-based dividing each attribute into
Thus, researchers have focused on the studies of fre- a fixed number of partitions [Association Rule Clus-
quent itemset generation. tering System (ARCS) in Lent, Swami & Widom
The Apriori Algorithm is a well-known approach, (1997)]. Srikant & Agrawal (1996) proposed to par-
which was proposed by Agrawal & Srikant (1994), to find tition quantitative attributes dynamically and to

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Association Rule Mining and Application to MPIS

merge the partitions based on a measure of partial Figure 2. A concept hierarchy of the fruit
completeness. Another non-grid based approach is
found in Zhang, Padmanabhan, & Tuzhilin (2004).
2. Association Rules based on the Dimensionality of fruit
Data: Association rules can be divided into single-
dimensional association rules and multi-dimen-
sional association rules. One example of single-
dimensional rule is buys({Orange, Knife}) apple orange banana
buys(Plate), which contains only the dimension
buys. Multi-dimensional association rule is the one
containing attributes for more than one dimension.
For example, income(40K..50K) buys(Plate). One exists no itemset X such that (1) X X and (2) trans-
mining approach is to borrow the concept of data actions t, X is in t implies X is in t. These considerations
cube in the field of data warehousing. Figure 1 can reduce the resulting number of frequent itemsets
shows a lattice for the data cube for the dimensions significantly.
age, income and buys. Researchers (Kamber, Han, Another variation of the frequent itemset problem is
& Chiang, 1997) applied the data cube model and mining top-K frequent itemsets (Cheung & Fu, 2004). The
used the aggregate techniques for mining. problem is to find K frequent itemsets with the greatest
3. Association Rules based on the Level of Abstrac- supports. It is often more reasonable to assume the
tions of Attribute: The rules discussed in previous parameter K, instead of the data-distribution dependent
sections can be viewed as single-level association parameter of minsupport because the user typically would
rule. A rule that references different levels of ab- not have the knowledge of the data distribution before
straction of attributes is called a multilevel associa- data mining.
tion rule. Suppose there are two rules The other variations of the problem are the incremen-
income(10K..20K) buys(fruit) and tal update of mining association rules (Hidber, 1999),
income(10K..20K) buys(orange). There are two constraint-based rule mining (Grahne & Lakshmanan,
different levels of abstractions in these two rules 2000), distributed and parallel association rule mining
because fruit is a higher-level abstraction of or- (Gilburd, Schuster, & Wolff, 2004), association rule min-
ange. Han & Fu (1995) apply a top-down strategy ing with multiple minimum supports/without minimum
to the concept hierarchy in the mining of frequent support (Chiu, Wu, & Chen, 2004), association rule
itemsets. mining with weighted item and weight support (Tao,
Murtagh, & Farid, 2003), and fuzzy association rule min-
Other Extensions to Association Rule ing (Kuok, Fu, & Wong, 1998).
Mining Association rule mining has been integrated with
other data mining problems. There have been the integra-
There are other extensions to association rule mining. tion of classification and association rule mining (Wang,
Some of them (Bayardo, 1998) find maxpattern (i.e., maxi- Zhou, & He, 2000) and the integration of association rule
mal frequent patterns) while others (Zaki & Hsiao, 2002) mining with relational database systems (Sarawagi, Tho-
find frequent closed itemsets. Maxpattern is a frequent mas, & Agrawal, 1998).
itemset that does not have a frequent item superset. A
frequent itemset is a frequent closed itemsets if there Application of the Concept of
Association Rules to MPIS
Figure 1. A lattice showing the data cube for the
dimensions age, income, and buys Other than market basket analysis (Blischok, 1995), asso-
ciation rules can also help in applications such as intru-
() sion detection (Lee, Stolfo, & Mok, 1999), heterogeneous
genome data (Satou et al., 1997), mining remotely sensed
income images/data (Dong, Perrizo, Ding, & Zhou, 2000) and
(age) (buys)
product assortment decisions (Wong, Fu, & Wang, 2003;
Wong & Fu, 2004). Here we focus on the application on
(age, income) income, buys) product assortment decisions, as it is one of very few
age, buys)
examples where the association rules are not the end
(age, income, buys) mining results.

66

TEAM LinG
Association Rule Mining and Application to MPIS

Transaction database in some applications can be very fidence of I d, the more likely the profit of I should
large. For example, Hedberg (1995) quoted that Wal-Mart not be counted. This is the reasoning behind the above A
kept about 20 million sales transactions per day. Such data definition. In the above example, suppose we choose
requires sophisticated analysis. As pointed out by Blischok monitor and telephone. Then, d = {keyboard}. All profits
(1995), a major task of talented merchants is to pick the of monitor will be lost if, in the history, we find conf(I
profit generating items and discard the losing items. It may d)=1, where I = monitor. This example illustrates the
be simple enough to sort items by their profit and do the importance of the consideration of cross-selling factor in
selection. However, this ignores a very important aspect the profit estimation.
in market analysis the cross-selling effect. There can be Wong, Fu, & Wang (2003) propose two algorithms to
items that do not generate much profit by themselves but deal with this problem. In the first algorithm, they ap-
they are the catalysts for the sales of other profitable items. proximate the total profit of the item selection in qua-
Recently, some researchers (Kleinberg, Papadimitriou, & dratic form and solve a quadratic optimization problem.
Raghavan, 1998) suggest that concepts of association The second one is a greedy approach called MPIS_Alg,
rules can be used in the item selection problem with the which prunes items iteratively according to an estimated
consideration of relationships among items. function based on the formula of the total profit of the
One example of the product assortment decisions is item selection until J items remain.
Maximal-Profit Item Selection (MPIS) with cross-selling Another product assortment decision problem is stud-
considerations (Wong, Fu, & Wang, 2003). Consider the ied by Wong & Fu, (2004), which addresses the problem
major task of merchants to pick profit-generating items and of selecting a set of marketing items in order to boost the
discard the losing items. Assume we have a history record sales of the store.
of the sales (transactions) of all items. This problem is to
select a subset from the given set of items so that the
estimated profit of the resulting selection is maximal among FUTURE TRENDS
all choices.
Suppose a shop carries office equipment composed of A new area for investigation of the problem of mining
monitors, keyboards and telephones, with profits of frequent itemsets is mining data streaming for frequent
$1000K, $100K and $300K, respectively. If now the shop itemsets (Manku & Motwani, 2002; Yu, Chong, Lu, &
decides to remove one of the three items from its stock, the Zhou, 2004). In such a problem, the data is so massive
question is which two we should choose to keep. If we that all data cannot be stored in the memory of a computer
simply examine the profits, we may choose to keep moni- and the data cannot be processed by traditional algo-
tors and telephones, and so the total profit is $1300K. rithms. The objective of all proposed algorithm is to store
However, we know that there is strong cross-selling effect as few as possible data and to minimize the error gener-
between monitor and keyboard (see Table 1). If the shop ated by some estimation in the model.
stops carrying keyboard, the customers of monitor may Privacy preservation on the association rule mining
choose to shop elsewhere to get both items. The profit is also rigorously studied in these few years (Vaidya &
from monitor may drop greatly, and we may be left with Clifton, 2002; Agrawal, Evfimievski, & Srikant, 2003). The
profit of $300K from telephones only. If we choose to keep problem is to mine from two or more different sources
both monitors and keyboards, then the profit can be without exposing individual transaction data to each
expected to be $1100K, which is higher. other.
MPIS will give us the desired solution. MPIS utilizes
the concept of the relationship between selected items and
unselected items. Such relationship is modeled by the CONCLUSION
cross-selling factor. Suppose d is the set of unselected
items and I is the selected item. A loss rule is proposed in Association rule mining plays an important role in the
the form I d, where d means the purchase of any item literature of data mining. It poses many challenging
in d. The rule indicates that from the history, whenever a
customer buys the item I, he/she also buys at least one of
Table 1.
the items in d. Interpreting this as a pattern of customer
behavior, and assuming that the pattern will not change
Monitor Keyboard Telephone
even when some items were removed from the stock, if 1 1 0
none of the items in d are available then the customer also 1 1 0
will not purchase I. This is because if the customer still 0 0 1
0 0 1
purchases I, without purchasing any items in d, then the 0 0 1
pattern would be changed. Therefore, the higher the con- 1 1 1

67

TEAM LinG
Association Rule Mining and Application to MPIS

issues for the development of efficient and effective Hedberg, S. (1995, October). The data gold rush. BYTE, 83-99.
methods. After taking a closer look, we find that the
application of association rules requires much more in- Hidber, C. (1999), Online association rule mining. SIGMOD,
vestigations in order to aid in more specific targets. We 145-156.
may see a trend towards the study of applications of Kamber, M., Han, J., & Chiang, J.Y. (1997). Metarule-
association rules. guided mining of multi-dimensional association rules
using data cubes. In Proceeding of the 3rd International
Conference on Knowledge Discovery and Data Mining
REFERENCES (pp. 207-210).

Agrawal, R., Evfimievski, A., & Srikant, R. (2003). Informa- Kleinberg, J., Papadimitriou, C., & Raghavan, P. (1998). A
tion sharing across private database. SIGMOD, 86-97. microeconomic view of data mining. Knowledge Discov-
ery Journal, 2(4), 311-324.
Agrawal, R., Imilienski, T., & Swami. (1993). Mining asso-
ciation rules between sets of items in large databases. Kuok, C.M., Fu, A.W.C., & Wong, M.H., (1998). Mining
SIGMOD, 129-140. fuzzy association rules in databases. ACM SIGMOD
Record, 27(1), 41-46.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for
mining association rules. In Proceedings of the 20th VLDB Lee, W., Stolfo, S.J., & Mok, K.W. (1999). A data mining
Conference (pp. 487-499). framework for building intrusion detection models. In
IEEE Symposium on Security and Privacy (pp. 120-132).
Bayardo, R.J. (1998). Efficiently mining long patterns
from databases. SIGMOD, 85-93. Lent, B., Swami, A.N., & Widom, J. (1997). Clustering
Association Rules. In ICDE (pp. 220-231).
Blischok, T. (1995). Every transaction tells a story. Chain
Store Age Executive with Shopping Center Age, 71(3), Manku, G.S., & Motwani, R. (2002). Approximate fre-
50-57. quency counts over data streams. In Proceedings of the
20 th International Conference on VLDB (pp. 346-357).
Cheung, Y.L., & Fu, A.W.-C. (2004). Mining association
rules without support threshold: With and without Item Sarawagi, S., Thomas, S., & Agrawal, R. (1998). Integrat-
Constraints. TKDE, 16(9), 1052-1069. ing association rule mining with relational database sys-
tems: Alternatives and implications. SIGMOD, 343-354.
Chiu, D.-Y., Wu, Y.-H., & Chen, A.L.P. (2004). An efficient
algorithm for mining frequent sequences by a new strat- Satou, K., Shibayama, G., Ono, T., Yamamura, Y., Furuichi,
egy without support counting. ICDE, 375-386. E., Kuhara, S., & Takagi, T. (1997). Finding association
rules on heterogeneous genome data. In Pacific Sympo-
Dong, J., Perrizo, W., Ding, Q., & Zhou, J. (2000). The sium on Biocomputing (PSB) (pp. 397-408).
application of association rule mining to remotely sensed
data. In Proceedings of the 2000 ACM symposium on Srikant, R., & Agrawal, R. (1996). Mining quantitative
Applied computing (pp. 340-345). association rules in large relational tables. SIGMOD, 1-12.

Gilburd, B., Schuster, A., & Wolff, R. (2004). A new Tao, F., Murtagh, F., & Farid, M. (2003). Weighted asso-
privacy model and association-rule mining algorithm for ciation rule mining using weighted support and signifi-
large-scale distributed environments. SIGKDD. cance framework. In The Ninth ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data
Grahne, G., Lakshmanan, L., & Wang, X. (2000). Efficient Mining (pp. 661-666).
mining of constrained correlated sets. ICDE, 512-521
Vaidya, J., & Clifton, C. (2002). Privacy preserving asso-
Han, J., & Fu, Y. (1995). Discovery of multiple-level asso- ciation rule mining in vertically partitioned data. In The
ciation rules from large databases. In Proceedings of the Eighth ACM SIGKDD International Conference on
1995 International Conference on VLDB (pp. 420-431). Knowledge Discovery and Data Mining (pp. 639-644).
Han, J., & Kamber, M. (2000). Data mining: Concepts and Wang, K., Zhou, S., & He, Y. (2000). Growing decision
techniques. San Mateo, CA: Morgan Kaufmann Publishers. trees on support-less association rules. In Sixth ACM
SIGKDD International Conference on Knowledge Dis-
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns covery & Data Mining (pp. 265-269).
without candidate generation. SIGMOD, 1-12.

68

TEAM LinG
Association Rule Mining and Application to MPIS

Wong, R.C.-W., & Fu, A.W.-C. (2004). ISM: Item selection is a set of items and Ij is a single item not in X, is the fraction
for marketing with cross-selling considerations. In Ad- of the transactions containing all items in set X that also A
vances in Knowledge Discovery and Data Mining, 8th contain item Ij.
Pacific-Asia Conference (PAKDD) (pp. 431-440), Lecture
Notes in Computer Science 3056. Berlin: Springer. Frequent Itemset/Pattern: The itemset with support
greater than or equal to a certain threshold, called
Wong, R.C.-W., Fu, A.W.-C., & Wang, K. (2003). MPIS: minsupport.
Maximal-profit item selection with cross-selling consider-
ations. In IEEE International Conference on Data Min- Infrequent Itemset: Itemset with support smaller than
ing (ICDM) (pp. 371-378). a certain threshold, called minsupport.

Yu, J.X., Chong, Z., Lu, H., & Zhou, A. (2004). False Itemset: A set of items
positive or false negative: Mining frequent itemsets from K-Itemset: Itemset with k items
high speed transactional data streams. In Proceedings of
the Thirtieth International Conference on Very Large Maximal-Profit Item Selection (MPIS): The problem
Data Bases. of item selection, which selects a set of items in order to
maximize the total profit with the consideration of cross-
Zaki, M.J., & Hsiao, C.J. (2002). CHARM: An efficient selling effect
algorithm for closed itemset mining. In SIAM Interna-
tional Conference on Data Mining (SDM). Support (Itemset) Or Frequency: The support of an
itemset X is the fraction of transactions containing all
Zhang, H., Padmanabhan, B., & Tuzhilin, A. (2004). On the items in X.
discovery of significant statistical quantitative rules. In
Proceedings of the 10th ACM SIGKDD Knowledge Dis- Support (Rule): The support of a rule X Ij, where X
covery and Data Mining Conference. is a set of items and Ij is a single item not in X, is the fraction
of the transactions containing all items in set X that also
contain item Ij.
KEY TERMS
Transaction: A record containing the items bought
Association Rule: A kind of rule in the form X Ij, where by a customer.
X is a set of some items and Ij is a single item not in X.
Confidence: The confidence of a rule X I j, where X

69

TEAM LinG
70

Association Rule Mining of Relational Data


Anne Denton
North Dakota State University, USA

Christopher Besemann
North Dakota State University, USA

INTRODUCTION than one property per node or edge. Data associated


with nodes and edges can be modeled within the
Most data of practical relevance are structured in more relational algebra.
complex ways than is assumed in traditional data mining
algorithms, which are based on a single table. The concept Association rule mining of relational data incorpo-
of relations allows for discussing many data structures rates important aspects of these areas to form an innovative
such as trees and graphs. Relational data have much data mining technique of important practical relevance.
generality and are of significant importance, as demon-
strated by the ubiquity of relational database manage-
ment systems. It is, therefore, not surprising that popular MAIN THRUST
data mining techniques, such as association rule mining,
have been generalized to relational data. An important The general concept of association rule mining of rela-
aspect of the generalization process is the identification tional data will be explored, as well as the special case of
of problems that are new to the generalized setting. mining a relationship that corresponds to a graph.

General Concept
BACKGROUND
Two main challenges have to be addressed when applying
Several areas of databases and data mining contribute to association rule mining to relational data. Combined min-
advances in association rule mining of relational data: ing of multiple tables leads to a search space that is
typically large even for moderately sized tables. Perfor-
Relational Data Model: underlies most commercial mance is, thereby, commonly an important issue in rela-
database technology and also provides a strong tional data mining algorithms. A less obvious problem lies
mathematical framework for the manipulation of in the skewing of results (Jensen & Neville, 2002). The
complex data. Relational algebra provides a natural relational join operation combines each record from one
starting point for generalizations of data mining table with each occurrence of the corresponding record in
techniques to complex data types. a second table. That means that the information in one
Inductive Logic Programming, ILP (Deroski & record is represented multiple times in the joined table.
Lavra , 2001): a form of logic programming, in Data mining algorithms that operate either explicitly or
which individual instances are generalized to make implicitly on joined tables, thereby, use the same informa-
hypotheses about unseen data. Background knowl- tion multiple times. Note that this problem also applies to
edge is incorporated directly. algorithms in which tables are joined on-the-fly by iden-
Association Rule Mining, ARM (Agrawal, tifying corresponding records as they are needed. Further
Imielinski, & Swami, 1993): identifies associa- specific issues may have to be addressed when reflexive
tions and correlations in large databases. Associa- relationships are present. These issues will be discussed
tion rules are defined based on items, such as in the section on relations that represent a graph.
objects in a shopping cart. Efficient algorithms are A variety of techniques have been developed for data
designed by limiting output to sets of items that mining of relational data (D eroski & Lavra , 2001). A
occur more frequently than a given threshold. typical approach is called inductive logic programming,
Graph Theory: addresses networks that consist of ILP. In this approach relational structure is represented in
nodes, which are connected by edges. Traditional the form of Prolog queries, leaving maximum flexibility to
graph theoretic problems typically assume no more the user. While the notation of ILP differs from the

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Association Rule Mining of Relational Data

relational notation it can be noted that all relational A typical example of an association rule mining prob-
operators can also be represented in ILP. The approach lem is mining of annotation data of proteins in the pres- A
does thereby not limit the types of problems that can be ence of a protein-protein interaction graph (Oyama, Kitano,
addressed. It should, however, also be noted that while Satou, & Ito, 2002). Associations are extracted that relate
relational database management system are developed functions and localizations of one protein with those of
with performance in mind there may be a trade-off between interacting proteins. Oyama et al. use association rule
the generality of Prolog-based environments and their mining, as applied to joined relations, for this work.
limitations in speed. Another example could be association rule mining of
Application of ARM within the ILP setting corre- attributes associated with scientific publications on the
sponds to a search for frequent Prolog queries as a graph of their mutual citations.
generalization of traditional association rules (Dehaspe & A problem of the straight-forward approach of mining
De Raedt, 1997). Examples of association rule mining of joined tables directly becomes obvious upon further
relational data using ILP (Dehaspe & Toivonen, 2001) study of the rules: In most cases the output is dominated
could be shopping behavior of customers where relation- by rules that involve the same item as it occurs in different
ships between customers are included in the reasoning. entity instances that participate in a relationship. In the
While ILP does not use a relational joining step as such, example of protein annotations within the protein interac-
it does also associate individual objects with multiple tion graph a protein in the nucleus is found to fre-
occurrences of corresponding objects. Problems with quently interact with another protein that is also located
skewing are, thereby, also encountered in this approach. in the nucleus. Similarities among relational neighbors
An alternative to the ILP approach is to apply the have been observed more generally for relational data-
standard definition of association rule mining to relations bases (Macskassy & Provost, 2003). It can be shown that
that are joined using the relational join operation. While filtering of output is not a consistent solution to this
such an approach is less general it is often more efficient problem, and items that are repeated for multiple nodes
since the join operation is highly optimized in standard should be eliminated in a preprocessing step (Besemann
database systems. It is important to note that a join & Denton, 2004). This is an example of a problem that does
operation typically changes the support of an item set, not occur in association rule mining of a single table and
and any support calculation should therefore be based on requires special attention when moving to multiple rela-
the relation that uses the smallest number of join opera- tions. The example also highlights the need to discuss
tions (Cristofor & Simovici, 2001). Equivalent changes in differences between sets of items of related objects are
item set weighting occur in ILP. (Besemann, Denton, Yekkirala, Hutchison, & Anderson, 2004).
Interestingness of rules is an important issue in any
type of association rule mining. In traditional association Related Research Areas
rule mining the problem of rule interest has been ad-
dressed in a variety of work on redundant rules, including A related research area is graph-based ARM (Inokuchi,
closed set generation (Zaki, 2000). Additional rule metrics Washio, & Motoda, 2000; Yan & Han, 2002). Graph-based
such as lift and conviction have been defined (Brin, ARM does not typically consider more than one label on
Motwani, Ullman, & Tsur, 1997). In relational association each node or edge. The goal of graph-based ARM is to
rule mining the problem has been approached by the find frequent substructures based on that one label,
definition of a deviation measure (Dehaspe & Toivonen, focusing on algorithms that scale to large subgraphs. In
2001). In general it can be noted that relational data mining relational ARM multiple item are associated with each
poses many additional problems related to skewing of node and the main problem is to achieve scaling with
data compared with traditional mining on a single table respect to the number of items per node. Scaling to large
(Jensen & Neville, 2002). subgraphs is usually irrelevant due to the small world
property of many types of graphs. For most networks of
Relations that Represent a Graph practical interest any node can be reached from almost
any other by means of no more than some small number
One type of relational data set has traditionally received of edges (Barabasi & Bonabeau, 2003). Association rules
particular attention, albeit under a different name. A that involve longer distances are therefore unlikely to
relation representing a relationship between entity in- produce meaningful results.
stances of the same type, also called a reflexive relation- There are other areas of research on ARM in which
ship, can be viewed as the definition of a graph. Graphs related transactions are mined in some combined fashion.
have been used to represent social networks, biological Sequential pattern or episode mining (Agrawal & Srikant,
networks, communication networks, and citation graphs, 1995; Yan, Han, & Afshar, 2003) and inter-transaction
just to name a few. mining (Tung, Lu, Han, & Feng, 1999) are two main

71

TEAM LinG
Association Rule Mining of Relational Data

categories. Generally the interest in association rule min- Besemann, C., & Denton, A. (2004, June). UNIC: UNique
ing is moving beyond the single-table setting to incorpo- item counts for association rule mining in relational
rate the complex requirements of real-world data. data. Technical Report, North Dakota State University,
Fargo, North Dakota.
Besemann, C., Denton, A., Yekkirala, A., Hutchison, R.,
FUTURE TRENDS & Anderson, M. (2004, Aug.). Differential association
rule mining for the study of protein-protein interaction
The consensus in the data mining community of the impor- networks. In Proceedings ACM SIGKDD Workshop on
tance of relational data mining was recently paraphrased Data Mining in Bioinformatics, Seattle, WA.
by Dietterich (2003) as I.i.d. learning is dead. Long live
relational learning. The statistics, machine learning, and Brin, S., Motwani, R., Ullman, J.D., & Tsur, S. (1997).
ultimately data mining communities have invested de- Dynamic itemset counting and implication rules for mar-
cades into sound theories based on a single table. It is now ket basket data. In Proceedings of the ACM SIGMOD
time to afford as much rigor to relational data. When taking International Conference on Management of Data,
this step it is important to not only specify generalizations Tucson, AZ.
of existing algorithms but to also identify novel questions
that may be asked that are specific to the relational setting. Cristofor, L., & Simovici, D. (2001). Mining association
It is, furthermore, important to identify challenges that rules in entity-relationship modeled databases. Tech-
only occur in the relational setting, including skewing due nical Report, University of Massachusetts Boston.
to the application of the relational join operator, and Dehaspe, L., & De Raedt, L. (1997, Dec.). Mining associa-
correlations that are frequent in relational neighbors. tion rules in multiple relations. In Proceedings of the 7th
International Workshop on Inductive Logic Program-
ming (pp. 125-132), Prague, Czech Republic.
CONCLUSION
Dehaspe, L., & Toivonen, H. (2001). Discovery of rela-
Association rule mining of relational data is a powerful tional association rules. In S. D eroski, & N. Lavra
frequent pattern mining technique that is useful for several (Eds.), Relational data mining. Berlin: Springer.
data structures including graphs. Two main approaches Dietterich, T. (2003, Nov.). Sequential supervised learn-
are distinguished. Inductive logic programming provides ing: Methods for sequence labeling and segmentation.
a high degree of flexibility, while mining of joined relations Invited Talk, 3rd IEEE International Conference on
is a fast technique that allows the study problems related Data Mining, Melbourne, FL, USA.
to skewed or uninteresting results. The potential compu-
tational complexity of relational algorithms and specific D eroski, S., & Lavra , N. (2001). Relational data min-
properties of relational data make its mining an important ing. Berlin: Springer.
current research topic. Association rule mining takes a Inokuchi, A., Washio, T., & Motoda, H. (2000) An apriori-
special role in this process, being one of the most impor- based algorithm for mining frequent substructures from
tant frequent pattern algorithms. graph data. In Proceedings of the 4th European Confer-
ence on Principles of Data Mining and Knowledge
Discovery (pp. 13-23), Lyon, France.
REFERENCES
Jensen, D., & Neville, J. (2002). Linkage and
Agrawal, R., Imielinski, T., & Swami, A.N. (1993, May). autocorrelation cause feature selection bias in relational
Mining association rules between sets of items in large learning. In Proceedings of the 19th International Con-
databases. In Proceedings of the ACM International ference on Machine Learning (pp. 259-266), Sydney,
Conference on Management of Data (pp. 207-216), Wash- Australia.
ington, D.C. Macskassy, S., & Provost, F. (2003). A simple relational
Agrawal, R., & Srikant, R. (1995). Mining sequential pat- classifier. In Proceedings of the 2nd Workshop on Multi-
terns. In Proceedings of the 11 th International Conference Relational Data Mining at KDD03, Washington, D.C.
on Data Engineering (pp. 3-14), IEEE Computer Society Oyama, T., Kitano, K., Satou, K., & Ito, T. (2002). Extrac-
Press, Taipei, Taiwan. tion of knowledge on protein-protein interaction by as-
Barabasi, A.L., & Bonabeau, E. (2003). Scale-free net- sociation rule discovery. Bioinformatics, 18(8), 705-714.
works. Scientific American, 288(5), 60-69.

72

TEAM LinG
Association Rule Mining of Relational Data

KEY TERMS

Antecedent: The set of items A in the association rule A → B.

Apriori: Association rule mining algorithm that uses the fact that the support of a non-empty subset of an item set cannot be smaller than the support of the item set itself.

Association Rule: A rule of the form A → B, meaning that if the set of items A is present in a transaction, then the set of items B is likely to be present too. A typical example constitutes associations between items purchased at a supermarket.

Confidence: The confidence of a rule is the support of the item set consisting of all items in the rule (A ∪ B) divided by the support of the antecedent.

Entity-Relationship Model (E-R Model): A model to represent real-world requirements through entities, their attributes, and a variety of relationships between them. E-R models can be mapped automatically to the relational model.

Inductive Logic Programming (ILP): Research area at the interface of machine learning and logic programming. Predicate descriptions are derived from examples and background knowledge. All examples, background knowledge, and final descriptions are represented as logic programs.

Redundant Association Rule: An association rule is redundant if it can be explained based entirely on one or more other rules.

Relation: A mathematical structure similar to a table in which every row is unique, and neither rows nor columns have a meaningful order.

Relational Database: A database that has relations and relational algebra operations as underlying mathematical concepts. All relational algebra operations result in relations as output. A join operation is used to combine relations. The concept of a relational database was introduced by E. F. Codd at IBM in 1970.

Support: The support of an item set is the fraction of transactions that have all items in that item set.
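To make the support and confidence measures concrete, the short Python sketch below (not part of the original article; the toy market-basket transactions are invented purely for illustration) computes both quantities for a rule A → B.

```python
# Toy illustration of support and confidence for a rule A -> B.
# The transactions are hypothetical.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"bread", "milk", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent U consequent) / support(antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

if __name__ == "__main__":
    A, B = {"bread"}, {"milk"}
    print("support(A -> B)    =", support(A | B, transactions))    # 3/5 = 0.6
    print("confidence(A -> B) =", confidence(A, B, transactions))  # 0.6 / 0.8 = 0.75
```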

Association Rules and Statistics


Martine Cadot
University of Henri Poincaré/LORIA, Nancy, France

Jean-Baptiste Maj
LORIA/INRIA, France

Tarek Ziadé
NUXEO, France

INTRODUCTION

A manager would like to have a dashboard of his company without manipulating data. Usually, statistics have solved this challenge, but nowadays, data have changed (Jensen, 1992); their size has increased, and they are badly structured (Han & Kamber, 2001). A recent method, data mining, has been developed to analyze this type of data (Piatetski-Shapiro, 2000). A specific method of data mining, which fits the goal of the manager, is the extraction of association rules (Hand, Mannila & Smyth, 2001). This extraction is a part of attribute-oriented induction (Guyon & Elisseeff, 2003).
The aim of this paper is to compare both types of extracted knowledge: association rules and results of statistics.

BACKGROUND

Statistics have been used for a century by people who want to extract knowledge from data (Freedman, 1997). Statistics can describe, summarize, and represent the data. In this paper, data are structured in tables, where lines are called objects, subjects, or transactions, and columns are called variables, properties, or attributes. For a specific variable, the value of an object can be of different types: quantitative, ordinal, qualitative, or binary. Furthermore, statistics tell whether an effect is significant or not; they are then called inferential statistics.
Data mining (Srikant, 2001) has been developed to process the huge amounts of data that result from progress in digital data acquisition, storage technology, and computational power. The association rules produced by data-mining methods express links between database attributes. The knowledge brought by the association rules falls into two different parts: the first describes general links, and the second finds specific links (knowledge nuggets) (Fabris & Freitas, 1999; Padmanabhan & Tuzhilin, 2000). In this article, only the first part is discussed and compared to statistics. Furthermore, in this article, only data structured in tables are used for association rules.

MAIN THRUST

The problem differs with the number of variables. In the sequel, problems with two, three, or more variables are discussed.

Two Variables

The link between two variables (A and B) depends on the coding. The outcome of statistics is better when data are quantitative. A common model is linear regression. For instance, the salary (S) of a worker can be expressed by the following equation:

S = 100 Y + 20000 + ε    (1)

where Y is the number of years in the company, and ε is a random error term. This model means that the salary of a newcomer in the company is $20,000 and increases by $100 per year.
The association rule for this model is Y→S. This means that only a few senior workers have a small paycheck. For this, the variables are translated into binary variables: Y is no longer the number of years, but the property "has seniority," which is not quantitative but of type Yes/No. The same transformation is applied to the salary S, which becomes the property "has a big salary."
Therefore, these two methods both provide the link between the two variables, and each has its own instruments for measuring the quality of the link. For statistics, there are the tests of the regression model (Baillargeon, 1996); for association rules, there are measures like support, confidence, and so forth (Kodratoff, 2001). But, depending on the type of data, one model is more appropriate than the other (Figure 1).

Figure 1. Coding and analysis methods (data coded as quantitative, ordinal, qualitative, or Yes/No; association rules become more appropriate toward the Yes/No end of the scale, statistics toward the quantitative end)

Three Variables

If a third variable E, the experience of the worker, is integrated, the equation (1) becomes:

S = 100 Y + 2000 E + 19000 + ε    (2)

E is the property "has experience." If E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,000. The increase of the salary, as a function of seniority (Y), is the same in both cases of experience.

S = 50 Y + 1500 E + 50 E Y + 19500 + ε    (3)

Now, if E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,500. The increase of the salary, as a function of seniority (Y), is $50 higher for experienced workers.
These regression models belong to the linear model of statistics (Prum, 1996), where, in equation (3), the third variable has a particular effect on the link between Y and S, called interaction (Winer, Brown & Michels, 1991).
The association rules for this model are:

Y→S, E→S for equation (2)
Y→S, E→S, YE→S for equation (3)

The statistical test of the regression model allows one to choose between the model with interaction (3) and the model without it (2). For the association rules, it is necessary to prune the set of three rules, because their measures do not give the choice between a model of two rules and a model of three rules (Zaki, 2000; Zhu, 1998).

More Variables

With more variables, it is difficult to use statistical models to test the link between variables (Megiddo & Srikant, 1998). However, there are still some ways to group variables: clustering, factor analysis, and taxonomy (Govaert, 2003). But the complex links between variables, like interactions, are not given by these models, and this decreases the quality of the results.

Table 1. Comparison between statistics and association rules

                          Statistics                                     Data Mining
                          Tests              Taxonomy                    Association rules
  Decision                Tests (+)          Threshold defined (-)       Threshold defined (-)
  Level of Knowledge      Low (-)            High and simple (+)         High and complex (+)
  No. of Variables        Small (-)          High (+)                    Small and high (+)
  Complex Link            Yes (-)            No (+)                      No (-)

Figure 2. (a) Regression equations; (b) taxonomy; (c) association rules

Comparison

Table 1 briefly compares statistics with the association rules. Two types of statistics are described: by tests and by taxonomy. Statistical tests are applied to a small amount of variables and the taxonomy to a great amount of
variables. In statistics, the decision is easy to make out of test results, unlike association rules, where a difficult choice among several indices and thresholds has to be performed. For the level of knowledge, the statistical results need more interpretation relative to the taxonomy and the association rules. Finally, graphs of the regression equations (Hayduk, 1987), taxonomy (Foucart, 1997), and association rules (Gras & Bailleul, 2001) are depicted in Figure 2.

FUTURE TRENDS

With association rules, some researchers try to find the right indices and thresholds with stochastic methods. More development needs to be done in this area. Another sensitive problem is that the set of association rules is not made for deductive reasoning. One of the most common solutions is pruning, to suppress redundancies, contradictions, and loss of transitivity. Pruning is a new method and needs to be developed.

CONCLUSION

With association rules, the manager can have a fully detailed dashboard of his or her company without manipulating data. The advantage of the set of association rules relative to statistics is a high level of knowledge. This means that the manager does not have the inconvenience of reading tables of numbers and making interpretations. Furthermore, the manager can find knowledge nuggets that are not present in statistics.
The association rules have some inconveniences; however, it is a new method that still needs to be developed.

REFERENCES

Baillargeon, G. (1996). Méthodes statistiques de l'ingénieur: Vol. 2. Trois-Rivières, Quebec: Editions SMG.

Fabris, C., & Freitas, A. (1999). Discovering surprising patterns by detecting occurrences of Simpson's paradox: Research and development in intelligent systems XVI. Proceedings of the 19th Conference of Knowledge-Based Systems and Applied Artificial Intelligence, Cambridge, UK.

Foucart, T. (1997). L'analyse des données, mode d'emploi. Rennes, France: Presses Universitaires de Rennes.

Freedman, D. (1997). Statistics. New York: W.W. Norton & Company.

Govaert, G. (2003). Analyse de données. Lavoisier, France: Hermes-Science.

Gras, R., & Bailleul, M. (2001). La fouille dans les données par la méthode d'analyse statistique implicative. Colloque de Caen. Ecole polytechnique de l'Université de Nantes, Nantes, France.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection: Special issue on variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Hayduk, L.A. (1987). Structural equation modelling with LISREL. Maryland: Johns Hopkins Press.

Jensen, D. (1992). Induction with randomization testing: Decision-oriented analysis of large data sets [doctoral thesis]. Washington University, Saint Louis, MO.

Kodratoff, Y. (2001). Rating the interest of rules induced from data and within texts. Proceedings of the 12th IEEE International Conference on Database and Expert Systems Applications-DEXA, Munich, Germany.

Megiddo, N., & Srikant, R. (1998). Discovering predictive association rules. Proceedings of the Conference on Knowledge Discovery in Data, New York.

Padmanabhan, B., & Tuzhilin, A. (2000). Small is beautiful: Discovering the minimal set of unexpected patterns. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Piatetski-Shapiro, G. (2000). Knowledge discovery in databases: 10 years after. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Prum, B. (1996). Modèle linéaire: Comparaison de groupes et régression. Paris, France: INSERM.

Srikant, R. (2001). Association rules: Past, present, future. Proceedings of the Workshop on Concept Lattice-Based Theory, Methods and Tools for Knowledge Discovery in Databases, California.

Winer, B.J., Brown, D.R., & Michels, K.M. (1991). Statistical principles in experimental design. New York: McGraw-Hill.

Zaki, M.J. (2000). Generating non-redundant association rules. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Zhu, H. (1998). On-line analytical mining of association rules [doctoral thesis]. Simon Fraser University, Burnaby, Canada.

KEY TERMS

Attribute-Oriented Induction: Association rules, classification rules, and characterization rules are written with attributes (i.e., variables). These rules are obtained from data by induction and not from theory by deduction.

Badly Structured Data: Data, like texts of a corpus or log sessions, often do not contain explicit variables. To extract association rules, it is necessary to create variables (e.g., keyword) after defining their values (frequency of appearance in corpus texts, or simply appearance/non-appearance).

Interaction: Two variables, A and B, are in interaction if their actions are not separate.

Linear Model: A variable is fitted by a linear combination of other variables and interactions between them.

Pruning: The algorithms of extraction for association rules are optimized for computational cost but not for other constraints. This is why a suppression has to be performed on the results that do not satisfy special constraints.

Structural Equations: System of several regression equations with numerous possibilities. For instance, the same variable can appear in different equations, and a latent (not defined in data) variable can be accepted.

Taxonomy: This belongs to clustering methods and is usually represented by a tree. Often used in life categorization.

Tests of Regression Model: Regression models and analysis of variance models have numerous hypotheses, e.g., normal distribution of errors. These constraints allow one to determine whether a coefficient of the regression equation can be considered null at a fixed level of significance.

Automated Anomaly Detection


Brad Morantz
Georgia State University, USA

INTRODUCTION

Preparing a dataset is a very important step in data mining. If the input to the process contains problems, noise, or errors, then the results will reflect this, as well. Not all possible combinations of the data should exist, as the data represent real-world observations. Correlation is expected among the variables. If all possible combinations were represented, then there would be no knowledge to be gained from the mining process.
The goal of anomaly detection is to identify and/or remove questionable or incorrect observations. These occur because of keyboard error, measurement or recording error, human mistakes, or other causes. Using knowledge about the data, some standard statistical techniques, and a little programming, a simple data-scrubbing program can be written that identifies or removes faulty records. Duplicates can be eliminated, because they contribute no new knowledge. Real valued variables could be within measurement error or tolerance of each other, yet each could represent a unique rule. Statistically categorizing the data would eliminate or, at least, greatly reduce this.
In application of this process with actual datasets, accuracy has been increased significantly, in some cases double or more.

BACKGROUND

Data mining is an exploratory process looking for as yet unknown patterns (Westphal & Blaxton, 1998). The data represent real-world occurrences, and there is correlation among the variables. Some are principled in their construction, one event triggering another. Sometimes events occur in a certain order (Westphal & Blaxton, 1998). Not all possible combinations of the data are to be expected. If this were not the case, then we would learn nothing from this data. These methods allow us to see patterns and regularities in large datasets (Mitchell, 1999).
Credit reporting agencies have been examining large datasets of credit histories for quite some time, trying to determine rules that will help discern between problematic and responsible consumers (Mitchell, 1999). Datasets have been mined looking for indications of everything from boiler explosion probabilities to high-risk pregnancies to consumer purchasing patterns. This is the semiotics of data, as we transform data to information and finally to knowledge.
Dirty data, or data containing errors, are a major problem in this process. The old saying is, "garbage in, garbage out" (Statsoft, 2004). Heuristic estimates are that 60-80% of the effort should go into preparing the data for mining, and only the small remaining portion actually is required for the data-mining effort itself. The data records that are deviations from the common rule are called anomalies.
Data are always dirty and have been called the curse of data mining (Berry & Linoff, 2000). Several factors can be responsible for attenuating the quality of the data, among them errors, missing values, and outliers (Webb, 2002). Missing data have many causes, varying from recording error to illegible writing to simply not being supplied. This is closely related to incorrect values, which can also be caused by poor penmanship as well as measurement error, keypunch mistakes, different or incorrect metrics, a misplaced decimal, and other similar causes.
Fuzzy definitions, where the meaning of a value is either unclear or inconsistent, are another problem (Berry & Linoff, 2000). Often, when something is being measured and recorded, mistakes happen. Even automated processes can produce dirty data (Bloom, 1998). Micro-array data has errors due to base pairs on the probe not matching correctly to genes in the test material (Shavlik et al., 2004). The sources of error are numerous, and it is necessary to have a process that finds these anomalies and identifies them.
In real valued datasets, the possible combinations are (almost) unlimited. A dataset with eight variables, each with four significant digits, could yield as many as 10^32 combinations. Mining such a dataset would not only be tedious and time-consuming, but possibly could yield an overly large number of patterns. Using (six-range) categorical data, the same problem would only have 1.67 × 10^6 combinations. Gauss normally distributed data can be separated into plus or minus 1, 2, or 3 sigma. Other distributions can use Chebyshev or other distributions with similar dividing points. There is no real loss of data, yet the process is greatly simplified.
Finding the potentially bad observations or records is the first problem. The second problem is what to do once they are found. In many cases it is possible to go back and verify the value, correcting it, if necessary. If this is
possible, the program should flag values that are to be verified. This may not always be possible, or it may be too expensive. Not all situations repeat within a reasonable time, if at all (e.g., observation of Halley's comet).
There are two schools of thought, the first being to substitute the mean value for the missing or wrong value. The problem with this is that it might not be a reasonable value, and it can create a new rule, one that could be false (i.e., shoe size for a giant is not average). It might introduce sample bias, as well (Berry & Linoff, 2000).
Deleting the observation is the other common solution. Quite often, in large datasets, a duplicate exists, so deleting causes no loss. The cost of improper commission is greater than that of omission. Sometimes an outlier tells a story. So, one has to be careful about deletions.

THE AUTOMATED ANOMALY DETECTION PROCESS

Methodology

To illustrate the process, a public dataset is used. This particular one is available from the University of California at Irvine Machine Learning Repository (University of California, 2003). Known as the Abalone dataset, it consists of 4,400 observations of abalones that were captured in the wild, with several measurements of each one. Natural variation exists, as well as human error, both in making the measurements and in the recording. Also listed on the Web site were some studies that used the data and their results. Accuracy in the form of hit rate varied between 0-35%.
While it may seem overly simple and obvious, plotting the data is the first step. These graphical views can provide much insight into the data (Webb, 2002). The data for each variable can be plotted vs. frequency of occurrence to visually determine distribution. Combining this with knowledge of the research will help to determine the correct distribution to use for each included variable. A sum of independent terms would tend to support a Gauss normal distribution, while the product of a number of independent terms might suggest using log normal. This plotting also might suggest necessary transformations.
It is necessary to understand the acceptable range for each field. Some values obtained might not be reasonable. If there is a zero in a field, is it indicative of a missing value, or is it an acceptable value? No value is not the same as zero. Some values, while within bounds, might not be possible. It is also necessary to check for obvious mistakes, inconsistencies, or out-of-bounds values.
Knowledge about the subject of study is necessary. From this, rules can be made. In the example of the abalone, the animal in the shell must weigh more than when it is shucked (removed from the shell), for obvious reasons. Other such rules from domain knowledge can be created (abalone.net, 2004; University of Capetown, 2004; World Aquaculture, 2004). Sometimes, they may seem too obvious, but they are effective. The rules can be programmed into a subroutine specific to the dataset.
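A minimal sketch of such a rule subroutine is given below; it is not the author's program, and the column names and the example record are assumptions made purely for illustration.

```python
def rule_violations(record):
    """Return a list of domain-rule violations for one abalone record.

    `record` is assumed to be a dict with keys such as 'whole_weight',
    'shucked_weight', 'viscera_weight', 'shell_weight', 'height', and
    'rings'; the checks mirror the kind of rules described in the text.
    """
    problems = []
    if record["shucked_weight"] >= record["whole_weight"]:
        problems.append("shucked animal weighs more than the whole animal")
    if record["viscera_weight"] + record["shell_weight"] > record["whole_weight"]:
        problems.append("component weights exceed the whole weight")
    if record["height"] <= 0:
        problems.append("non-positive height")
    if record["rings"] < 1:
        problems.append("impossible ring count")
    return problems

# Example: a hypothetical record that should be flagged for verification or deletion.
suspect = {"whole_weight": 0.50, "shucked_weight": 0.62, "viscera_weight": 0.10,
           "shell_weight": 0.15, "height": 0.09, "rings": 7}
print(rule_violations(suspect))
```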
Regression can be used to check for variables that are not statistically significant. Step-wise regression is a handy tool for identifying significant variables. Other ratio variables can be created and then checked for significance using regression. Again, domain knowledge can help create these variables, as well as insight and some luck. Insignificant variables can be deleted from the dataset, and new ones can be added.
If the dataset is real valued, it is possible that records exist that are within tolerance or measurement error of each other. There are two ways to reduce the number of unique observations. (1) Attenuate the accuracy by rounding to reduce the number of significant digits; rounding each variable to one less significant digit reduces the number of possible patterns by an order of magnitude. (2) Calculate a mean and standard deviation for the cleaned dataset. Using an appropriate distribution, sort the values by standard deviations from the mean. Testing whether the chosen distribution is correct is accomplished by using a Chi-square test, a Kolmogorov-Smirnov test, or the empirical test. The number of standard deviations replaces the real valued data, and a simple categorical dataset will exist. This allows for simple comparisons between observations. Otherwise, records with values differing by as little as .0001% would be considered unique and different. While some of the precision of the original data is lost, this process is exploratory and finds the general patterns that are in the data. This allows one to gain insight into the database using a combination of statistics and artificial intelligence (Pazzani, 2000), using human knowledge and skill as the catalyst to improve the results.
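The categorization step can be sketched as follows; this is an illustrative implementation of the idea (replace each value by its signed integer number of standard deviations from the mean), not the author's code, and the sample values are hypothetical measurements assumed to be roughly Gauss normal.

```python
import statistics

def to_sigma_categories(values):
    """Replace each real value by the integer number of standard deviations
    it lies from the mean, giving a small set of categories such as -2..+2."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [int((v - mean) / std) for v in values]

shell_ratio = [4.9, 5.1, 5.0, 5.3, 4.7, 6.4, 5.2, 3.8]  # hypothetical measurements
print(to_sigma_categories(shell_ratio))
```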
The final step before mining the data is to remove duplicates, as they add no additional information. As the collection of observations gets increasingly larger, it gets harder to introduce new experiences. This process can be incorporated into the computer program by a simple procedure that is similar to bubblesort. Instead of comparing to see which row is greater, it just looks for differences. If none are found, then the row is deleted.
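A simple way to drop exact duplicate rows from the categorized dataset is sketched below (the data are hypothetical; a set of already-seen rows is used here instead of the bubblesort-style pairwise comparison described above, but the effect is the same).

```python
def remove_duplicates(rows):
    """Keep only the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

data = [[1, 0, -1], [0, 0, 2], [1, 0, -1], [2, -1, 0], [0, 0, 2]]
print(remove_duplicates(data))  # [[1, 0, -1], [0, 0, 2], [2, -1, 0]]
```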
Example Results

A few variables were plotted, producing some very unusual graphs. These were definitely not the graphs that were expected. This was the first indication that the dataset was noisy. Abalones are born in very large numbers, but with an extremely high infant mortality rate (over
99%) (Bamfield Marine Science Centre, 2004). This graph did not reflect that.
An initial scan of the data showed some inconsistent points, like a five-year-old infant, a shucked animal weighing more than a complete one, and other similar abnormalities. Another problem with most analyses of these datasets is that gender is not ratio or ordinal data and, therefore, had to be converted to a dummy variable.
Step-wise regression removed all but five variables. The remaining variables were: diameter, height, whole weight, shucked weight, and viscera weight. Two new variables were created: shell ratio (whole weight divided by shell weight) and weight to diameter ratio. Since the diameter is directly proportional to volume, this variable is proportional to density. The proof of its significance was a t value of 39 and an F value of 1561. These are both statistically significant. A plot of shell ratio vs. frequency yielded a fairly Gauss normal looking curve.
As these are real valued data with four digits given, it is possible to have observations that vary by as little as 0.01%. This value is even less than the accuracy of the measuring instruments. In other words, there are really a relatively small number of possibilities, described by a large number of almost identical examples, some within measurement tolerance of each other.
The mean and standard deviation were calculated for each of the remaining and new variables of the dataset. The empirical test was done to verify approximate meeting of the Gauss normal distribution. Each value then was replaced by the integer number of standard deviations it is from the mean, creating a categorical dataset. Simple visual inspection showed two things: (1) there was, indeed, correlation among the observations; and (2) it became increasingly more difficult to introduce a new pattern.
The duplicate removal process was the next step. As expected, the first 50 observations only had 22% duplicates, but by the time the entire dataset was processed, 65% of the records were removed, because they presented no new information.
To better understand the quality of the data, least squares regression was performed. The model produced an ANOVA F value of 22.4, showing good confidence in it. But the Pearsonian correlation coefficient R² of only 0.25 indicated that there was some problem. Visual observation of the dataset and its plots led to some suspicion of the group with one ring (age = 2.5 years). OLS regression was performed on this group, yielding an F of 27, but an R² of only 0.03. This tells us that this portion of the data is only muddying the water and attenuating the performance of our model.
Upon removal of this group of observations, OLS regression was performed on the remaining data, giving an improved F of 639 (showing that, indeed, it is a good model) and an R² of 0.53, an acceptable level and one that can adequately describe the variation in the criterion.
The results listed at the Web site where the dataset was obtained are as follows. Sam Waugh in the Computer Science Department at the University of Tasmania used this dataset in 1995 for his doctoral dissertation (University of California, 2003). His results, while the first recorded attempt, did not have good accuracy at predicting the age. The problem was encoded as a classification task.

24.86%  Cascade Correlation (no hidden nodes)
26.25%  Cascade Correlation (five hidden nodes)
21.5%   C4.5
0.0%    Linear Discriminant Analysis
3.57%   k=5 Nearest Neighbor

Clark, et al. (1996) did further work on this dataset. They split the ring classification into three groups: 1 to 8, 9 to 10, and 11 and up. This reduced the number of targets and made each one bigger, in effect making each easier to hit.
Their results were much better, as shown in the following:

64%  Back propagation
55%  Dystal

The results obtained from the answer tree using the new cleaned dataset are shown in Table 1.
All of the one-ring observations were filtered out in a previous step, and the extraction was 100% accurate in not predicting any as being one-ring. The hit rates are as follows:

Table 1. Hit Rate

                        Actual Category
  Predicted        1      2       3      4    Total
  1                0      0       0      0        0
  2               11    269      79      2      361
  3              231    140     953    280     1604
  4               22      3      27     81      133
  Total          264    412    1059    363     2098
1 ring: 100.0% correct
2 ring: 74.5% correct
3 ring: 59.4% correct
4 ring: 60.9% correct
Overall accuracy: 62.1% correct

FUTURE TRENDS

A program could be written that would input simple things like the number of variables, the number of observations, and some classification results. A rule input mechanism would accept the domain-specific rules and make them part of the analysis. Further improvements would be the inclusion of fuzzy logic. Type I would allow the use of linguistic variables (i.e., big, small, hot, cold) in the records, and type II would allow for some fuzzy overlap and fit.

CONCLUSION

Data mining is an exploratory process to see what is in the data and what patterns can be found. Noise and errors in the dataset are reflected in the results from the mining process. Cleaning the data and identifying anomalies should be performed. Marked observations should be verified and corrected, if possible. If this cannot be done, they should be deleted. In real valued datasets, the values can be categorized with accepted statistical techniques. Anomaly detection, after some manual viewing and analysis, can be automated. Part of the process is specific to the knowledge domain of the dataset, and part could be standardized. In our example problem, this cleaning process improved results, and the mining produced a more accurate rule set.

REFERENCES

Abalone.net. (2003). All about abalone: An online guide. Retrieved from http://www.abalone.net

Bamfield Marine Sciences Centre Public Education Programme. (2004). Oceanlink. Retrieved from http://oceanlink.island.net/oinfo/Abalone/abalone.html

Berry, M.J.A., & Linoff, G.S. (2000). Mastering data mining: The art and science of customer relationship management. New York, NY: Wiley & Sons, Inc.

Bloom, D. (1998). Technology, experimentation, and the quality of survey data. Science, 280(5365), 847-848.

Clark, D., Schreter, Z., & Adams, A. (1996). A quantitative comparison of dystal and backpropagation. Proceedings of the Australian Conference on Neural Networks (ACNN 96), Canberra, Australia.

Mitchell, T. (1999). Machine learning and data mining. Communications of the ACM, 42(11), 30-36.

Pazzani, M.J. (2000). Knowledge discovery from data? IEEE Intelligent Systems, 15(2), 10-13.

Shavlik, J., Molla, M., Waddell, M., & Page, D. (2004). Using machine learning to design and interpret gene-expression microarrays. American Association for Artificial Intelligence, 25(1), 23-44.

Statsoft, Inc. (2004). Electronic statistics textbook. Retrieved from http://www.statsoft.com

Tasmanian Abalone Council Ltd. (2004). http://tasabalone.com.au

University of California at Irvine. (2003). Machine learning repository, abalone database. Retrieved from http://ics.uci.edu/~mlearn/MLRepository

University of Capetown Zoology Department. (2004). http://web.uct.ac.za/depts/zoology/abnet

Webb, A. (2002). Statistical pattern recognition. West Sussex, England: Wiley & Sons.

Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York: Wiley & Sons.

World Aquaculture. (2004). http://www7.taosnet.com/platinum/data/light/species/abalone.html

KEY TERMS

Anomaly: A value or observation that deviates from the rule or analogy. A potentially incorrect value.

ANOVA or Analysis of Variance: A powerful statistical method for studying the relationship between a response or criterion variable and a set of one or more predictor or independent variable(s).

Correlation: Amount of relationship between two variables, how they change relative to each other; range: -1 to +1.

F Value: Fisher value, a statistical distribution, used here to indicate the probability that an ANOVA model is good. In the ANOVA calculations, it is the ratio of squared variances. A large number translates to confidence in the model.

Ordinal Data: Data that is in order but has no relationship between the values or to an external value.

Pearsonian Correlation Coefficient: Defines how much of the variation in the criterion variable(s) is caused by the model. Range: 0 to 1.

Ratio Data: Data that is in order and has fixed spacing; a relationship between the points that is relative to a fixed external point.

Step-Wise Regression: An automated procedure in statistical programs that adds one predictor variable at a time, and if it is not statistically significant, removes it from the model. Some work in both directions, either adding to or removing from the model, one variable at a time.

T Value, also called Student's t: A statistical distribution for smaller sample sizes. In regression routines in statistical programs, it indicates whether a predictor variable is statistically significant or if it truly is contributing to the model. A value of more than about 3 is required for this indication.

Automatic Musical Instrument Sound Classification
Alicja A. Wieczorkowska
Polish-Japanese Institute of Information Technology, Poland

INTRODUCTION

The aim of musical instrument sound classification is to process information from audio files by a classificatory system and accurately identify musical instruments playing the processed sounds. This operation and its results are called automatic classification of musical instrument sounds.

BACKGROUND

Musical instruments are grouped into the following categories (Hornbostel & Sachs, 1914):

• Idiophones: Made of solid, non-stretchable, sonorous material.
• Membranophones: Skin drums; membranophones and idiophones are called percussion.
• Chordophones: Stringed instruments.
• Aerophones: Wind instruments: woodwinds (single-reed, double-reed, flutes) and brass (lip-vibrated).

Idiophones are classified according to the material, number of idiophones and resonators in a single instrument, and whether pitch or tuning is important. Subcategories include idiophones struck together by concussion (e.g., castanets), struck (gong), rubbed (musical glasses), scraped (washboards), stamped (hard floors stamped with tap shoes), shaken (rattles), and plucked (jew's harp).
Membranophones are classified according to their shape, material, number of heads, if they have snares, etc., whether and how the drum is tuned, and how the skin is fixed and played. Subcategories include drums (cylindrical, conical, barrel, hourglass, goblet, footed, long, kettle, frame, friction drum, and mirliton/kazoo).
Chordophones are classified with respect to the relationship of the strings to the body of the instrument, if they have frets (low bridges on the neck or body, where strings are stopped) or movable bridges, number of strings, and how they are played and tuned. Subcategories include zither, lute plucked, and bowed (e.g., guitars, violin), harp, lyre, and bow.
Aerophones are classified according to how the air is set in motion, mainly depending on the mouthpiece: blow hole, whistle, reed, and lip-vibrated. Subcategories include flutes (end-blown, side-blown, nose, globular, multiple), panpipes, whistle mouthpiece (recorder), single- and double-reed (clarinet, oboe), air chamber (pipe organs), lip-vibrated (trumpet or horn), and free aerophone (bullroarers) (SIL, 1999).
The description of properties of musical instrument sounds is usually given in vague subjective terms, like sharp, nasal, bright, and so forth, and only some of them (i.e., brightness) have numerical counterparts. Therefore, one of the main problems in this research is to prepare the appropriate numerical sound description for instrument recognition purposes.
Automatic classification of musical instrument sounds aims at classifying audio data accurately into appropriate groups representing instruments. This classification can be performed at instrument level, instrument family level (e.g., brass), or articulation level (i.e., how sound is struck, sustained, and released, e.g., vibrato - varying the pitch of a note up and down) (Smith, 2000). As a preprocessing step, the audio data are usually parameterized (i.e., numerical or other parameters or attributes are assigned, and then data mining techniques are applied to the parameterized data). Accuracy of classification varies, depending on the audio data used in the experiments, number of instruments, parameterization, classification, and validation procedure applied. Automatic classification compares favorably with human performance. Listeners identify musical instruments with accuracy far from perfect, with results depending on the sounds chosen and experience of listeners. Classification systems allow instrument identification without participation of human experts. Therefore, such systems can be valuable assistance for users of audio data searching for specific timbre, especially if they are not experienced musicians and when the amount of available audio data is huge, thus making manual searching impractical, if possible at all. When combined with a melody-searching system, automatic instrument classification may provide a handy tool for finding favorite tunes performed by favorite instruments in audio databases.

MAIN THRUST

Research on automatic classification of musical instruments so far has been performed mainly on isolated, singular sounds; works on polyphonic sounds usually aim at source separation and operations like pitch tracking of these sounds (Viste & Evangelista, 2003). Most commonly used data include the MUMS compact discs (Opolko & Wapnick, 1987), the University of Iowa samples (Fritts, 1997), and IRCAM's Studio on Line (IRCAM, 2003).
A broad range of data mining techniques was applied in this research, aiming at extraction of information hidden in audio data (i.e., sound features that are common for a given instrument and differentiate it from the others). Descriptions of musical instrument sounds are usually subjective, and finding appropriate numeric descriptors (parameters) is a challenging task. Sound parameterization is arbitrarily chosen by the researchers, and the parameters may reflect features that are known to be important for the human in the instrument recognition task, like descriptors of sound evolution in time (i.e., onset features, depth of sound vibration, etc.), subjective timbre features (i.e., brightness of the sound), and so forth. Basically, parameters characterize coefficients of sound analysis, since they are relatively easy to calculate.
On the basis of parameterization, further research can be performed. Clustering applied to parameter vectors reveals similarity among sounds and adds a new glance on instrument classification, usually based on instrument construction or sound articulation. Decision rules and trees allow identification of the most descriptive sound features. Transformation of sound parameters may produce new descriptors better suited for automatic instrument classification. The classification can be performed hierarchically, taking instrument families and articulation into account. Classifiers represent a broad range of methods, from simple statistical tools to new advanced algorithms rooted in artificial intelligence.

Parameterization Methods

Sound is a physical disturbance in the medium (e.g., air) through which it is propagated (Whitaker & Benson, 2002). Periodic fluctuations are perceived as sound having pitch. The audible frequency range is about 20-20,000 Hz (hertz, or cycles per second). The parameterization aims at capturing the most distinctive sound features regarding sound amplitude evolution in time, static spectral features (frequency contents) of the most stable part of the sound, and evolution of frequency content in time. These features are based on the Fourier spectrum and time-frequency sound representations like the wavelet transform. Some analyses are adjusted to the properties of human hearing, which perceives changes of sound amplitude and frequency in a logarithmic-like manner (e.g., frequency contents analysis in mel scale). The results based on such analysis are easier to interpret in subjective terms. Also, statistical and mathematical operations are applied to the sound representation, yielding good results, too. Some descriptors require calculating the pitch of the sound, and any inaccuracies in pitch calculation (e.g., octave errors) may lead to erroneous results.
Parameter sets investigated in the research are usually a mixture of various types, since such combinations allow capturing a more representative sound description for instrument classification purposes.
The following analysis and parameterization methods are used to describe musical instrument sounds:

• Autocorrelation and cross-correlation functions investigating periodicity of the signal, and statistical parameters of the spectrum obtained via the Fourier transform: average amplitude and frequency variations (wide in vibrated sounds), standard deviations (Ando & Yamaguchi, 1993).
• Contents of selected groups of partials in the spectrum (Pollard & Jansson, 1982; Wieczorkowska, 1999a), including the amount of even and odd harmonics (Martin & Kim, 1998), allowing identification of clarinet sounds.
• Vibrato strength and other changes of sound features in time (Martin & Kim, 1998; Wieczorkowska et al., 2003), and the temporal envelope of the sound.
• Statistical moments of the time wave, spectral centroid (gravity center), coefficients of the cepstrum (i.e., the Fourier transform applied to the logarithm of the amplitude plot of the spectrum), and constant-Q coefficients (i.e., for logarithmically-spaced spectral bins) (Brown, 1999; Brown et al., 2001).
• Wavelet analysis, providing a time-frequency plot based on decomposition of the sound signal into functions called wavelets (Kostek & Czyzewski, 2001; Wieczorkowska, 1999b).
• Mel-frequency coefficients (i.e., in mel scale, adjusted to the properties of human hearing) and linear prediction cepstral coefficients, where future values are estimated as a linear function of previous values (Eronen, 2001).
• Multidimensional Scaling Analysis (MSA) trajectories obtained through Principal Component Analysis (PCA) applied to constant-Q spectral snapshots to determine the most significant attributes of each sound (Kaminskyj, 2002). PCA transforms a set of variables into a smaller set of
uncorrelated variables, which keep as much of the variability in the data as possible.
• MPEG-7 audio descriptors, including log-attack time (i.e., logarithm of onset duration), fundamental frequency (pitch), spectral envelope and spread, etc. (ISO, 2003; Peeters et al., 2000).

Feature vectors obtained via parameterization of musical instrument sounds are used as inputs for classifiers, both for training and recognition purposes.
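As a concrete illustration of how two of the simpler descriptors mentioned above could be computed, the sketch below derives the spectral centroid and spread for a single sound frame. It is an illustration under assumed conditions (a mono frame given as a NumPy array, and a synthetic test tone), not code from any of the cited systems.

```python
import numpy as np

def spectral_centroid_and_spread(frame, sample_rate):
    """Compute two simple spectral descriptors for one sound frame:
    the centroid (gravity center of the magnitude spectrum) and the
    spread (standard deviation of frequency around that centroid)."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    weights = spectrum / spectrum.sum()
    centroid = float(np.sum(freqs * weights))
    spread = float(np.sqrt(np.sum(((freqs - centroid) ** 2) * weights)))
    return centroid, spread

# Synthetic test tone: 440 Hz plus a weaker 880 Hz partial.
sr = 44100
t = np.arange(0, 0.1, 1.0 / sr)
tone = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)
print(spectral_centroid_and_spread(tone, sr))
```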
Classification Techniques

Automatic classification is the process by which a classificatory system processes information in order to automatically classify data accurately, or the result of such a process. A class may represent an instrument, articulation, instrument family, and so forth. Classifiers applied to this task range from probabilistic and statistical algorithms, through methods based on learning by example, where classification is based on the distance between the observed sample and the nearest known neighbor, to methods originating from artificial intelligence, like neural networks, which mimic neural connections in the brain. Each classifier yields a new sound description (representation). Some classifiers produce an explicit set of classification rules (e.g., decision trees or rough-set-based algorithms), giving insight into relationships between specific sound timbres and the calculated features. Since human-performed recognition of musical instruments is based on subjective criteria and difficult to formalize, learning algorithms that allow extraction of precise rules of sound classification are broadening our knowledge and giving a formal representation of subjective sound features.
The following algorithms can be applied to musical instrument sound classification:

• Bayes decision rule (i.e., a probabilistic classification method for the assignment of unknown samples to the classes). In Brown (1999), training data were grouped into clusters obtained through the k-means algorithm, and Gaussian probability density functions were formed from the mean and variance of each cluster.
• K-Nearest Neighbor (k-NN) algorithm, where the class (instrument) for a tested sound sample is assigned on the basis of the distances between the vector of parameters for this sample and the majority of the k nearest vectors representing known samples (Kaminskyj, 2002; Martin & Kim, 1998); a small illustrative sketch follows this list. To improve performance, genetic algorithms are additionally applied to find the optimal set of weights for the parameters (Fujinaga & McMillan, 2000).
• A statistical pattern-recognition technique; a maximum a posteriori classifier based on Gaussian models (introducing prior probabilities), obtained via Fisher multiple discriminant analysis, which projects the high-dimensional feature space into a space of one dimension fewer than the number of classes, in which the classes are separated maximally (Martin & Kim, 1998).
• Neural networks, designed by analogy with a simplified model of the neural connections in the brain and trained to find relationships in the data; multi-layer nets and self-organizing feature maps have been used (Cosi et al., 1994; Kostek & Czyzewski, 2001).
• Decision trees, where nodes are labeled with sound parameters, edges are labeled with parameter values, and leaves represent classes (Wieczorkowska, 1999b).
• Rough-set-based algorithms; rough sets are defined by a lower approximation, containing elements that belong to the set for sure, and an upper approximation, containing elements that may belong to the set (Wieczorkowska, 1999a).
• Support vector machines, which aim at finding the hyperplane that best separates observations belonging to different classes (Agostini et al., 2003).
• Hidden Markov Models (HMM), used for representing sequences of states; in this case, they can be used for representing long sequences of feature vectors that define an instrument sound (Herrera et al., 2000).
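The sketch below illustrates the k-NN idea on made-up feature vectors (the two features and the instrument labels are invented for demonstration; real systems use many more parameters and much larger training sets).

```python
from collections import Counter

def knn_classify(sample, training_data, k=3):
    """Label a feature vector by majority vote among its k nearest
    training vectors (Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(sample, features)) ** 0.5, label)
        for features, label in training_data
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (spectral centroid in kHz, log-attack time) -> instrument.
training = [((2.1, -1.8), "flute"), ((2.3, -1.6), "flute"),
            ((1.1, -2.4), "trumpet"), ((1.3, -2.2), "trumpet"),
            ((0.6, -0.4), "violin pizzicato"), ((0.7, -0.5), "violin pizzicato")]

print(knn_classify((2.0, -1.7), training))  # expected: "flute"
```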
Classifiers are first trained and then tested with respect to their generalization abilities (i.e., whether they work properly on unknown samples).

Validation and Results

Parameterization and classification methods yield various results, depending on the sound data and the validation procedure when classifiers are tested on unseen samples. Usually, the available data are divided into training and test sets. For instance, 70% of the data is used for training and the remaining 30% for testing; this procedure is usually repeated a number of times, and the final result is the average of all runs. Other popular divisions are in proportions 80/20 or 90/10. Also, the leave-one-out procedure is used, where only one sample is used for testing. Generally, the higher the percentage of training data in proportion to the test data and the smaller the number of classes, the higher the accuracy that is obtained. Some instruments are easily identified with high accuracy, whereas others frequently are misclassified, especially with those from
the same family. Classification of instruments is sometimes performed hierarchically: articulation or family is recognized first, and then the instrument is identified.
Following is an overview of results obtained so far in the research on musical instrument sound classification:

• Brown (1999) reported an average 84.1% recognition accuracy for two classes (oboe and saxophone), using cepstral coefficients as features and Bayes decision rules for clusters obtained via the k-means algorithm.
• Brown, Houix, and McAdams (2001), in experiments with four classes, obtained 79-84% accuracy for bin-to-bin differences of constant-Q coefficients, and cepstral and autocorrelation coefficients, using the Bayesian method.
• K-NN classification applied to mel-frequency and linear prediction cepstral coefficients (Eronen, 2001), with training on 29 orchestral instruments and testing on 16 instruments from various recordings, yielded 35% accuracy for instruments and 77% for families. K-NN combined with genetic algorithms (Fujinaga & McMillan, 2000) yielded 50% correctness in leave-one-out tests on spectral features representing 23 orchestral instruments played with various articulations.
• Kaminskyj (2002) applied k-NN to constant-Q and cepstral coefficients, MSA trajectories, amplitude envelope, and spectral centroid. He obtained 89-92% accuracy for instruments, 96% for families, and 100% in identifying impulsive vs. sustained sounds in leave-one-out tests for MUMS data. Tests on other recordings initially yielded 33-61% accuracy, and 87-90% after improvements.
• Multilayer neural networks applied to wavelet and Fourier-based parameterization yielded 72-99% accuracy for various groups of four instruments (Kostek & Czyzewski, 2001).
• The statistical pattern-recognition technique and k-NN algorithm, applied to sounds representing 14 orchestral instruments played with various articulation, yielded 71.6% accuracy for instruments, 86.9% for families, and 98.8% in discriminating continuant sounds vs. pizzicato (Martin & Kim, 1998) in 70/30 tests. The features included pitch, spectral centroid, ratio of odd-to-even harmonic energy, onset asynchrony, and the strength of vibrato and tremolo (quick changes of sound amplitude or note repetitions).
• Discriminant analysis and support vector machines yielded about 70% accuracy in leave-one-out tests with spectral features for 27 instruments (Agostini et al., 2003).
• Rough-set-based classifiers and decision trees applied to data representing 18 classes (11 orchestral instruments, various articulation), parameterized using Fourier and wavelet-based attributes, yielded 68-77% accuracy in 90/10 tests, and 64-68% in 70/30 tests (Wieczorkowska, 1999b).
• K-NN and rough-set-based classifiers, applied to spectral and temporal sound parameterization, yielded 68% accuracy in 80/20 tests for 18 classes, representing 11 orchestral instruments and various articulation (Wieczorkowska et al., 2003).

Generally, instrument families or sustained/impulsive sounds are identified with accuracy exceeding 90%, whereas instruments, if there are more than 10, are identified with accuracy reaching about 70%. These results compare favorably with human performance and exceed results obtained for inexperienced listeners.

FUTURE TRENDS

Automatic indexing and searching of audio files is gaining increasing interest. The MPEG-7 standard addresses the issue of content description in multimedia data, and the audio descriptors provided in this standard form a basis for further research. Constant growth of audio resources available on the Internet causes an increasing need for content-based search of audio data. Therefore, we can expect intensification of research in this domain and progress of studies on automatic classification of musical instrument sounds.

CONCLUSION

Results obtained so far in automatic musical instrument sound classification vary, depending on the size of the data, sound parameterization, classifier, and testing method. Also, some instruments are identified easily with high accuracy, whereas others are misclassified frequently, in case of both human and machine performance.
Increasing interest in content-based searching through audiovisual data and the growth of the amount of multimedia data available via the Internet raise the need and perspective for further progress in automatic classification of audio data.

REFERENCES

Agostini, G., Longari, M., & Pollastri, E. (2003). Musical instrument timbres classification with spectral features. EURASIP Journal on Applied Signal Processing, 1, 1-11.

Ando, S., & Yamaguchi, K. (1993). Statistical study of spectral parameters in musical instrument tones. Journal of the Acoustical Society of America, 94(1), 37-45.

Brown, J.C. (1999). Computer identification of musical instruments using pattern recognition with cepstral coefficients as features. Journal of the Acoustical Society of America, 105, 1933-1941.

Brown, J.C., Houix, O., & McAdams, S. (2001). Feature dependence in the automatic identification of musical woodwind instruments. Journal of the Acoustical Society of America, 109, 1064-1072.

Cosi, P., De Poli, G., & Lauzzana, G. (1994). Auditory modelling and self-organizing neural networks for timbre classification. Journal of New Music Research, 23, 71-98.

Eronen, A. (2001). Comparison of features for musical instrument recognition. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA 2001, New York, NY, USA.

Fritts, L. (1997). The University of Iowa musical instrument samples. Retrieved 2004 from http://theremin.music.uiowa.edu/MIS.html

Fujinaga, I., & McMillan, K. (2000). Realtime recognition of orchestral instruments. Proceedings of the International Computer Music Conference, Berlin, Germany.

Herrera, P., Amatriain, X., Batlle, E., & Serra, X. (2000). Towards instrument segmentation for music content description: A critical review of instrument classification techniques. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.

Hornbostel, E.M.V., & Sachs, C. (1914). Systematik der Musikinstrumente. Ein Versuch. Zeitschrift für Ethnologie, 46(4-5), 553-90.

IRCAM, Institut de Recherche et Coordination Acoustique/Musique. (2003). Studio on line. Retrieved 2004 from http://forumnet.ircam.fr/rubrique.php3?id_rubrique=107

ISO, International Organisation for Standardisation. (2003). MPEG-7 overview. Retrieved 2004 from http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm

Kaminskyj, I. (2002). Multi-feature musical instrument sound classifier w/user determined generalisation performance. Proceedings of the Australasian Computer Music Association Conference ACMC 2002, Melbourne, Australia.

Kostek, B., & Czyzewski, A. (2001). Representing musical instrument sounds for their automatic classification. Journal of the Audio Engineering Society, 49(9), 768-785.

Martin, K.D., & Kim, Y.E. (1998). Musical instrument identification: A pattern-recognition approach. Proceedings of the 136th Meeting of the Acoustical Society of America, Norfolk, Virginia.

Opolko, F., & Wapnick, J. (1987). MUMS: McGill University master samples [CD-ROM]. McGill University, Montreal, Quebec, Canada.

Peeters, G., McAdams, S., & Herrera, P. (2000). Instrument sound description in the context of MPEG-7. Proceedings of the International Computer Music Conference ICMC 2000, Berlin, Germany.

Pollard, H.F., & Jansson, E.V. (1982). A tristimulus method for the specification of musical timbre. Acustica, 51, 162-171.

SIL. (1999). LinguaLinks library. Retrieved 2004 from http://www.sil.org/LinguaLinks/Anthropology/ExpnddEthnmsclgyCtgrCltrlMtrls/MusicalInstrumentsSubcategorie.htm

Smith, R. (2000). Rod's encyclopedic dictionary of traditional music. Retrieved 2004 from http://www.sussexfolk.freeserve.co.uk/ency/a.htm

Viste, H., & Evangelista, G. (2003). Separation of harmonic instruments with overlapping partials in multichannel mixtures. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA-03, New Paltz, New York.

Whitaker, J.C., & Benson, K.B. (Eds.). (2002). Standard handbook of audio and radio engineering. New York: McGraw-Hill.

Wieczorkowska, A. (1999a). Rough sets as a tool for audio signal classification. Foundations of Intelligent Systems, LNCS/LNAI 1609, 11th Symposium on Methodologies for Intelligent Systems, Proceedings/ISMIS'99, Warsaw, Poland.

Wieczorkowska, A. (1999b). Skuteczność rozpoznawania dźwięków instrumentów muzycznych w zależności od sposobu parametryzacji i rodzaju klasyfikatora (Efficiency of musical instrument sound recognition depending on parameterization and classifier) [doctoral thesis] [in Polish]. Gdansk: Technical University of Gdansk.

Wieczorkowska, A., Wróblewski, J., Synak, P., & Ślęzak, D. (2003). Application of temporal descriptors to musical instrument sound recognition. Journal of Intelligent Information Systems, 21(1), 71-93.
KEY TERMS

Aerophones: The category of musical instruments, called wind instruments, producing sound by the vibration of air. Woodwinds and brass (lip-vibrated) instruments belong to this category, including single reed woodwinds (e.g., clarinet), double reeds (oboe), flutes, and brass (trumpet).

Articulation: The process by which sounds are formed; the manner in which notes are struck, sustained, and released. Examples: staccato (shortening and detaching of notes), legato (smooth), pizzicato (plucking strings), vibrato (varying the pitch of a note up and down), muted (stringed instruments, by sliding a block of rubber or similar material onto the bridge; brass, by inserting a conical device into the bell).

Automatic Classification: The process by which a classificatory system processes information in order to classify data accurately; also, the result of such a process.

Chordophones: The category of musical instruments producing sound by means of a vibrating string; stringed instruments. Examples: guitar, violin, piano, harp, lyre, musical bow.

Idiophones: The category of musical instruments made of solid, non-stretchable, sonorous material. Subcategories include idiophones struck together by concussion (e.g., castanets), struck (gong), rubbed (musical glasses), scraped (washboards), stamped (hard floors stamped with tap shoes), shaken (rattles), and plucked (jew's harp).

Membranophones: The category of musical instruments; skin drums. Examples: daraboukka, tambourine.

Parameterization: The assignment of parameters to represent processes that usually are not easily described by equations.

Sound: A physical disturbance in the medium through which it is propagated. This fluctuation may change routinely, and such periodic sound is perceived as having pitch. The audible frequency range is about 20-20,000 Hz (hertz, or cycles per second). A harmonic sound wave consists of frequencies that are integer multiples of the first component (fundamental frequency), corresponding to pitch.

Spectrum: The distribution of component frequencies of the sound, each being a sine wave, with their amplitudes and phases (time locations of these components). These frequencies can be determined through Fourier analysis.
Bayesian Networks

Ahmad Bashir
University of Texas at Dallas, USA

Latifur Khan
University of Texas at Dallas, USA

Mamoun Awad
University of Texas at Dallas, USA
INTRODUCTION

A Bayesian network is a graphical model that finds probabilistic relationships among variables of a system. The basic components of a Bayesian network include a set of nodes, each representing a unique variable in the system, their inter-relations, as indicated graphically by edges, and associated probability values. By using these probabilities, termed conditional probabilities, and their interrelations, we can reason and calculate unknown probabilities. Furthermore, Bayesian networks have distinct advantages compared to other methods, such as neural networks, decision trees, and rule bases, which we shall discuss in this paper.

BACKGROUND

Bayesian classification is based on naïve Bayesian classifiers, which we discuss in this section. Naïve Bayesian classification is the popular name for a probabilistic classification. The term naïve Bayes refers to the fact that the probability model can be derived by using Bayes theorem and that it incorporates strong independence assumptions that often have no bearing in reality; hence, they are deliberately naïve. Depending on the model, naïve Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naïve Bayes models uses the method of maximum likelihood.

Abstractly, the desired probability model for a classifier is a conditional model

P(C | F1, ..., Fn)

over a dependent class variable C with a small number of outcomes, or classes, conditional on several feature variables F1 through Fn. The naïve conditional independence assumptions play a role at this stage, when the probabilities are being computed. In this model, the assumption is that each feature Fi is conditionally independent of every other feature Fj. This situation is mathematically represented as:

P(Fi | C, Fj) = P(Fi | C)

Such naïve models are easier to compute, because they factor into P(C) and a series of independent probability distributions. The naïve Bayes classifier combines this model with a decision rule. The common rule is to choose the label that is most probable, known as the maximum a posteriori or MAP decision rule.

In a supervised learning setting, one wants to estimate the parameters of the probability model. Because of the independent feature assumption, it suffices to estimate the class prior and the conditional feature models independently by using the method of maximum likelihood.

The naïve Bayes classifier has several properties that make it simple and practical, although the independence assumptions are often violated. The overall classifier is robust to serious deficiencies of its underlying naïve probability model, and in general, the naïve Bayes approach is more powerful than might be expected from the extreme simplicity of its model; however, in the presence of nonindependent attributes wi, the naïve Bayesian classifier must be upgraded to the Bayesian classifier, which will more appropriately model the situation.

MAIN THRUST

The basic concept in the Bayesian treatment of certainties in causal networks is conditional probability. When the probability of an event A, P(A), is known, then it is conditioned by other known factors. A conditional probability statement has the following form:

Given that event B has occurred, the probability of the event A occurring is x.
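To make the class prior, the per-feature conditional models, and the MAP decision rule concrete, the following minimal Python sketch (not part of the original article; the toy feature names, labels, and smoothing constant are purely illustrative assumptions) estimates the counts from labeled examples and then picks the most probable class for a new instance:

import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_dict, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    # feature_counts[label][feature][value] = number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in examples:
        for name, value in features.items():
            feature_counts[label][name][value] += 1
    return class_counts, feature_counts

def classify(features, class_counts, feature_counts, alpha=1.0):
    """MAP decision rule: return the most probable class label."""
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)           # log P(C)
        for name, value in features.items():      # + sum of log P(Fi | C)
            seen = feature_counts[label][name]
            score += math.log((seen[value] + alpha) /
                              (sum(seen.values()) + alpha * (len(seen) + 1)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy usage with invented data
data = [({"outlook": "sunny"}, "play"), ({"outlook": "rain"}, "stay"),
        ({"outlook": "sunny"}, "play")]
print(classify({"outlook": "sunny"}, *train_naive_bayes(data)))   # "play"

The add-alpha smoothing in the denominator is one common way to keep unseen feature values from producing zero probabilities; it is a design choice of this sketch, not a requirement of the model described above.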
Graphical Models

A graphical model visually illustrates conditional independencies among variables in a given problem. Two variables that are conditionally independent have no direct impact on each other's values. Furthermore, the graphical model shows any intermediary variables that separate two conditionally independent variables. Through these intermediary variables, two conditionally independent variables affect one another.

A graph is composed of a set of nodes, which represent variables, and a set of edges. Each edge connects two nodes, and an edge can have an optional direction assigned to it. For X1 and X2, if a causal relationship between the variables exists, the edge will be directional, leading from the cause variable to the effect variable; if just a correlation between the variables exists, the edge will be undirected.

We use an example with three variables to illustrate these concepts. In this example, two conditionally independent variables, A and C, are directly related to another variable, B. To represent this situation, an edge must exist between the nodes of the variables that are directly related, that is, between A and B and between B and C. Furthermore, the relationships between A and B and between B and C are correlations as opposed to causal relations; hence, the respective edges will be undirected. Figure 1 illustrates this example. Due to conditional independence, nodes A and C still have an indirect influence on one another; however, variable B encodes the information from A that impacts C, and vice versa.

A Bayesian network is a specific type of graphical model, with directed edges and no cycles (Stephenson, 2000). The edges in Bayesian networks are viewed as causal connections, where each parent node causes an effect on its children.

In addition, nodes in a Bayesian network contain a conditional probability table, or CPT, which stores all probabilities that may be used to reason or make inferences within the system.

Figure 1. Graphical model of two independent variables A and C that are directly related to a third variable B (the figure shows nodes A and C, each joined by an undirected edge to node B)

Bayesian Probabilities

Probability calculus does not require that the probabilities be based on theoretical results or frequencies of repeated experiments, commonly known as relative frequencies. Probabilities may also be completely subjective estimates of the certainty of an event.

Consider an example of a basketball game. If one were to bet on an upcoming game between Team A and Team B, it is important to know the probability of Team A winning the game. This probability is definitely not a ratio, a relative frequency, or even an estimate of a relative frequency; the game cannot be repeated many times under exactly the same conditions. Rather, the probability represents only one's belief concerning Team A's chances of winning. Such a probability is termed a Bayesian or subjective probability and makes use of Bayes theorem to calculate unknown probabilities.

A Bayesian probability may also be referred to as a personal probability. The Bayesian probability of an event x is a person's degree of belief in that event. A Bayesian probability is a property of the person who assigns the probability, whereas a classical probability is a physical property of the world, meaning it is the physical probability of an event.

An important difference between physical probability and Bayesian probability is that repeated trials are not necessary to measure the Bayesian probability. The Bayesian method can assign a probability for events that may be difficult to experimentally determine. An oft-voiced criticism of the Bayesian approach is that probabilities seem arbitrary, but this is a probability assessment issue that does not take away from the many possibilities that Bayesian probabilities provide.

Causal Influence

Bayesian networks require an operational method for identifying causal relationships in order to model the domain accurately. Hence, causal influence is defined in the following manner: If the action of making variable X take some value sometimes changes the value taken by variable Y, then X is assumed to be responsible for sometimes changing Y's value, and one may conclude that X is a cause of Y. More formally, X is manipulated when we force X to take some value, and we say X causes Y if some manipulation of X leads to a change in the probability distribution of Y.

Furthermore, if manipulating X leads to a change in the probability distribution of Y, then X obtaining a value by any means whatsoever also leads to a change in the probability distribution of Y. Hence, one can make the natural conclusion that causes and their effects are statistically correlated.
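As an illustration of how a node's CPT can be stored, the short Python sketch below (not from the article; the probability values are invented, and the edges are drawn here as directed parent-to-child links, as they would appear in a Bayesian network rather than in the undirected Figure 1) represents a three-node network in which A and C are parents of B, so B's table holds P(B | A, C) for every combination of parent values:

# Each variable is binary (True/False) in this toy example.
# Root nodes store their marginal distribution; B stores P(B=True | A, C).
network = {
    "A": {"parents": [], "cpt": {(): 0.3}},            # P(A=True)
    "C": {"parents": [], "cpt": {(): 0.6}},            # P(C=True)
    "B": {"parents": ["A", "C"],
          "cpt": {(True, True): 0.95, (True, False): 0.70,
                  (False, True): 0.40, (False, False): 0.05}},
}

def prob(node, value, assignment):
    """P(node=value | parents as given in assignment)."""
    parents = tuple(assignment[p] for p in network[node]["parents"])
    p_true = network[node]["cpt"][parents]
    return p_true if value else 1.0 - p_true

# Example: P(B=True | A=True, C=False)
print(prob("B", True, {"A": True, "C": False}))   # 0.70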
However, note that variables can be correlated in less direct ways; that is, one variable may not necessarily cause the other. Rather, some intermediaries may be involved.

Bayesian Networks: A First Look

This section provides a detailed definition of Bayesian networks in an attempt to merge the probabilistic, logic, and graphical modeling ideas that we have presented thus far.

Causal relations, apart from the reasoning that they model, also have a quantitative side, namely their strength. This is expressed by associating numbers to the links. Let the variable A be a parent of the variable B in a causal network. Using probability calculus, it will be normal to let the conditional probability be the strength of the link between these variables. On the other hand, if the variable C is also a parent of the variable B, then the conditional probabilities P(B|A) and P(B|C) do not provide any information on how the impacts from variable A and variable C interact. They may cooperate or counteract in various ways. Therefore, the specification of P(B|A,C) is required.

If loops exist within the reasoning, the domain model may contain feedback cycles; these cycles are difficult to model quantitatively. For such networks, no calculus coping with feedback cycles has been developed. Therefore, the network must be loop free. In fact, from a practical perspective, it should be modeled as closely to a tree as possible.

Evidential Reasoning

As stated previously, Bayesian networks accomplish such an economy by pointing out, for each variable Xi, the conditional probabilities P(Xi | pai), where pai is the set of parents (of Xi) that render Xi independent of all its other parents. After giving this specification, the joint probability distribution can be calculated by the product

P(x1, ..., xn) = ∏i P(xi | pai)

Using this product, all probabilistic queries can be found coherently with probability calculus. Passing evidence up and down a Bayesian network is known as belief propagation, and exact probability calculation is NP-hard, meaning that the calculation time to compute exact probabilities grows exponentially with the size of the Bayesian network. A number of algorithms are available for probabilistic calculations in Bayesian networks (Huang, King, & Lyu, 2002), beginning with Pearl's message passing algorithm.

Advantages

A Bayesian network is a graphical model that finds probabilistic relationships among variables of the system, but a number of models are available for data analysis, including rule bases, decision trees, and artificial neural networks. Several techniques for data analysis also exist, including classification, density estimation, regression, and clustering. However, Bayesian networks have distinct advantages that we discuss here.

One of the biggest advantages of Bayesian networks is that they have a bidirectional message passing architecture. Learning from the evidence can be interpreted as unsupervised learning. Similarly, expectation of an action can be interpreted as unsupervised learning. Because Bayesian networks pass data between nodes and see the expectations from the world model, they can be considered as bidirectional learning systems (Helsper & van der Gaag, 2002). In addition to bidirectional message passing, Bayesian networks have several important features, such as the allowance of subjective a priori judgments, direct representation of causal dependence, and the ability to imitate the human thinking process.

Bayesian networks handle incomplete data sets without difficulty because they discover dependencies among all the variables. When one of the inputs is not observed, most other models will end up with an inaccurate prediction because they do not calculate the correlation between the input variables. Bayesian networks suggest a natural way to encode these dependencies. For this reason, they are a natural choice for image classification or annotation, because the lack of a classification can also be viewed as missing data.

Considering the Bayesian statistical techniques, Bayesian networks facilitate the combination of domain knowledge and data (Neapolitan, 2004). Prior or domain knowledge is crucially important if one performs a real-world analysis, particularly when data are inadequate or expensive. The encoding of causal prior knowledge is straightforward because Bayesian networks have causal semantics. Additionally, Bayesian networks encode the strength of causal relationships with probabilities. Therefore, prior knowledge and data can be put together with well-studied techniques from Bayesian statistics for both forward and backward reasoning with the Bayesian network.

Bayesian networks also ease many of the theoretical and computational difficulties of rule-based systems by utilizing graphical structures for representing and managing probabilistic knowledge. Independencies can be dealt with explicitly; they can be articulated, encoded graphically, read off the network, and reasoned about, and yet they forever remain robust to numerical imprecision.
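Continuing the toy network sketched earlier, the following hedged Python fragment (illustrative only; it assumes the network dictionary and the prob() helper defined in the previous sketch) shows how the product formula above turns a full assignment into a joint probability, and how a brute-force sum over unobserved variables answers a query. It is exactly this enumeration that grows exponentially with network size and motivates belief propagation and related algorithms:

from itertools import product

VARS = ["A", "C", "B"]   # parents listed before children (topological order)

def joint(assignment):
    """P(A, C, B) as the product of each node's CPT entry."""
    result = 1.0
    for var in VARS:
        result *= prob(var, assignment[var], assignment)
    return result

def query(target, evidence):
    """P(target=True | evidence) by summing the joint over hidden variables."""
    weights = {True: 0.0, False: 0.0}
    hidden = [v for v in VARS if v != target and v not in evidence]
    for target_value in (True, False):
        for combo in product([True, False], repeat=len(hidden)):
            assignment = dict(zip(hidden, combo))
            assignment.update(evidence)
            assignment[target] = target_value
            weights[target_value] += joint(assignment)
    return weights[True] / (weights[True] + weights[False])

# Example of backward reasoning: belief in A after observing B
print(query("A", {"B": True}))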
Moreover, graphical representations uncover several opportunities for efficient computation and serve as understandable logic diagrams.

Bayesian networks can simulate humanlike reasoning; this fact is not, however, due to any structural similarities with the human brain. Rather, it is because of the resemblance between the ways Bayesian networks and humans reason. The resemblance is more psychological than biological but nevertheless a true benefit.

Bayesian Inference

Inference is the task of computing the probability of each value of a node in a Bayesian network when other variables' values are known (Jensen, 1999). This concept is what makes Bayesian networks so powerful, as it allows the user to apply knowledge toward forward or backward reasoning. Suppose that a specific value for one or more of the variables in the network has been observed. If one variable has a definite value, or evidence, the probabilities, or belief values, for the other variables need to be revised, as this variable is now a defined value. This calculation of the updated probabilities for system variables based on new evidence is precisely the definition of inference.

FUTURE TRENDS

The future of Bayesian networks lies in determining new ways to tackle the issues of Bayesian inferencing and in building a Bayesian structure that accurately represents a particular system. As we discuss in this paper, conditional dependencies can be mapped into a graph in several ways, each with subtle semantic and statistical differences. Future research will give way to Bayesian networks that can understand system semantics and adapt accordingly, not only with respect to the conditional probabilities within each node but also with respect to the graph itself.

CONCLUSION

A Bayesian network consists of the following elements:

• A set of variables and a set of directed edges between variables
• A finite set of mutually exclusive states for each variable
• A directed acyclic graph (DAG), constructed from the variables coupled with the directed edges
• A conditional probability table (CPT) P(A | B1, B2, ..., Bn) that is associated with each variable A with parents B1, B2, ..., Bn

Bayesian networks continue to play a vital role in prediction and classification within data mining (Niedermayer, 1998). They are a marriage between probability theory and graph theory, providing a natural tool for dealing with two problems that occur throughout applied mathematics and engineering: uncertainty and complexity. Also, Bayesian networks play an increasingly important role in the design and analysis of machine learning algorithms, serving as a promising way to approach present and future problems related to artificial intelligence and data mining (Choudhury, Rehg, Pavlovic, & Pentland, 2002; Doshi, Greenwald, & Clarke, 2002; Fenton, Cates, Forey, Marsh, Neil, & Tailor, 2003).

REFERENCES

Choudhury, T., Rehg, J. M., Pavlovic, V., & Pentland, A. (2002). Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection. Proceedings of the International Conference on Pattern Recognition (ICPR), Canada, III (pp. 789-794).

Doshi, P., Greenwald, L., & Clarke, J. (2002). Towards effective structure learning for large Bayesian networks. Proceedings of the AAAI Workshop on Probabilistic Approaches in Search, Canada (pp. 16-22).

Fenton, N., Cates, P., Forey, S., Marsh, W., Neil, M., & Tailor, M. (2003). Modelling risk in complex software projects using Bayesian networks (RADAR Tech. Rep.). London: Queen Mary University.

Helsper, E. M., & van der Gaag, L. C. (2002). Building Bayesian networks through ontologies. Proceedings of the 15th European Conference on Artificial Intelligence, Lyon, France (pp. 680-684).

Huang, K., King, I., & Lyu, M. R. (2002). Learning maximum likelihood semi-naive Bayesian network classifier. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 3, 6, Hammamet, Tunisia.

Jensen, F. (1999). Gradient descent training of Bayesian networks. Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (pp. 190-200).

Neapolitan, R. E. (2004). Learning Bayesian networks. Upper Saddle River, NJ: Prentice-Hall.

Niedermayer, D. (1998). An introduction to Bayesian networks and their contemporary applications. Retrieved October 2004, from http://www.niedermayer.ca/papers/bayesian/
Stephenson, T. (2000). An introduction to Bayesian network theory and usage. Retrieved October 2004, from http://www.idiap.ch/publications/todd00a.bib.abs.html

KEY TERMS

Bayes Theorem: Result in probability theory that states the conditional probability of a variable A, given B, in terms of the conditional probability of variable B, given A, and the marginal probability of A alone.

Conditional Probability: Probability of some event A, assuming event B, written mathematically as P(A|B).

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships such as classification, prediction, estimation, or affinity grouping.

Independent: Two random variables are independent when knowing something about the value of one of them does not yield any information about the value of the other.

Joint Probability: The probability of two events occurring in conjunction.

Maximum Likelihood: Method of point estimation using as an estimate of an unobservable population parameter the member of the parameter space that maximizes the likelihood function.

Neural Networks: Learning systems that are designed by analogy with a simplified model of the neural connections in the brain and can be trained to find nonlinear relationships in data.

Supervised Learning: A machine learning technique for creating a function from training data; the task of the supervised learner is to predict the value of the function for any valid input object after having seen only a small number of training data.
Best Practices in Data Warehousing from the Federal Perspective

Les Pang
National Defense University, USA
INTRODUCTION

Data warehousing has been a successful approach for supporting the important concept of knowledge management, one of the keys to organizational success at the enterprise level. Based on successful implementations of warehousing projects, a number of lessons learned and best practices were derived from these project experiences. The scope was limited to projects funded and implemented by federal agencies, military institutions, and organizations directly supporting them.

Projects and organizations reviewed include the following:

• Census 2000 Cost and Progress System
• Defense Dental Standard System
• Defense Medical Logistics Support System Data Warehouse Program
• Department of Agriculture Rural Development Data Warehouse
• DOD Computerized Executive Information System
• Department of Transportation (DOT) Executive Reporting Framework System
• Environmental Protection Agency (EPA) Envirofacts Warehouse
• Federal Credit Union
• Housing and Urban Development (HUD) Enterprise Data Warehouse
• Internal Revenue Service (IRS) Compliance Data Warehouse
• U.S. Army Operational Testing and Evaluation Command
• U.S. Coast Guard Executive Information System
• U.S. Navy Type Commanders Readiness Management System

BACKGROUND

Data warehousing involves the consolidation of data from various transactional data sources in order to support the strategic needs of an organization. This approach links the various silos of data that are distributed throughout an organization. By applying this approach, an organization can gain significant competitive advantages through the new level of corporate knowledge.

Various agencies in the Federal Government attempted to implement a data warehousing strategy in order to achieve data interoperability. Many of these agencies have achieved significant success in improving internal decision processes as well as enhancing the delivery of products and services to the citizen. This chapter aims to identify the best practices that were implemented as part of the successful data warehousing projects within the federal sector.

MAIN THRUST

Each best practice (indicated in boldface) and its rationale are listed below. Following each practice is a description of an illustrative project or projects (indicated in italics) that support the practice.

Ensure the Accuracy of the Source Data to Maintain the User's Trust of the Information in a Warehouse

The user of a data warehouse needs to be confident that the data in a data warehouse is timely, precise, and complete. Otherwise, a user that discovers suspect data in the warehouse will likely cease using it, thereby reducing the return on investment involved in building the warehouse. Within government circles, the appearance of suspect data takes on a new perspective.

HUD Enterprise Data Warehouse - Gloria Parker, HUD Chief Information Officer, spearheaded data warehousing projects at the Department of Education and at HUD. The HUD warehouse effort was used to profile performance, detect fraud, profile customers, and do what-if analysis. Business areas served include Federal Housing Administration loans, subsidized properties, and grants. She emphasizes that the public trust of the information is critical. Government agencies do not want to jeopardize the public trust by putting out bad data; bad data will result in major ramifications not only from citizens but also from the government auditing arm, the General Accounting Office, and from Congress (Parker, 1999).
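As a purely illustrative sketch of the kind of source-data screening this practice implies (the field names, rules, and file layout below are invented and are not drawn from any of the agency projects described here), a loading job can reject or flag records that fail basic accuracy tests before they ever reach the warehouse:

import csv

REQUIRED_FIELDS = ["loan_id", "state", "amount", "report_date"]

def validate_record(row):
    """Return a list of problems found in one source record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not row.get(field, "").strip():
            problems.append(f"missing {field}")
    try:
        if float(row.get("amount", "0")) < 0:
            problems.append("negative amount")
    except ValueError:
        problems.append("non-numeric amount")
    return problems

def screen_source_file(path):
    """Split a source extract into clean rows and rejected rows."""
    clean, rejected = [], []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            problems = validate_record(row)
            (rejected if problems else clean).append((row, problems))
    return clean, rejected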
EPA Envirofacts Warehouse - The Envirofacts data warehouse comprises information from 12 different environmental databases for facility information, including toxic chemical releases, water discharge permit compliance, hazardous waste handling processes, Superfund status, and air emission estimates. Each program office provides its own data and is responsible for maintaining this data. Initially, the Envirofacts warehouse architects noted some data integrity problems, namely, issues with accurate data, understandable data, properly linked data, and standardized data. The architects had to work hard to address these key data issues so that the public can trust the quality of data in the warehouse (Garvey, 2003).

U.S. Navy Type Commander Readiness Management System - The Navy uses a data warehouse to support the decisions of its commanding officers. Data at the lower unit levels is aggregated to the higher levels and then interfaced with other military systems for a joint military assessment of readiness as required by the Joint Chiefs of Staff. The Navy found that it was spending too much time to determine its readiness, and some of its reports contained incorrect data. The Navy developed a user-friendly, Web-based system that provides quick and accurate assessment of readiness data at all levels within the Navy. The system collects, stores, reports, and analyzes mission readiness data from air, sub, and surface forces for the Atlantic and Pacific Fleets. Although this effort was successful, the Navy learned that data originating from the lower levels still needs to be accurate. The reason is that a number of legacy systems, which serve as the source data for the warehouse, lacked validation functions (Microsoft, 2000).

Standardize the Organization's Data Definitions

A key attribute of a data warehouse is that it serves as a single version of the truth. This is a significant improvement over the different and often conflicting versions of the truth that come from an environment of disparate silos of data. To achieve this singular version of the truth, there needs to be consistent definitions of data elements to afford the consolidation of common information across different data sources. These consistent data definitions are captured in a data warehouse's metadata repository.

DoD Computerized Executive Information System (CEIS) is a 4-terabyte data warehouse that holds the medical records of the 8.5 million active members of the U.S. military health care system who are treated at 115 hospitals and 461 clinics around the world. The Defense Department wanted to convert its fixed-cost health care system to a managed-care model to lower costs and increase patient care for the active military, retirees, and their dependents. Over 12,000 doctors, nurses, and administrators use it. Frank Gillett, an analyst at Forrester Research, Inc., stated that "what kills these huge data warehouse projects is that the human beings don't agree on the definition of data. Without that . . . all that $450 million [cost of the warehouse project] could be thrown out the window" (Hamblen, 1998).

Be Selective on What Data Elements to Include in the Warehouse

Users are unsure of what they want, so they place an excessive number of data elements in the warehouse. This results in an immense, unwieldy warehouse in which query performance is impaired.

Federal Credit Union - The data warehouse architect for this organization suggests that users know which data they use most, although they will not always admit to what they use least (Deitch, 2000).

Select the Extraction-Transformation-Loading (ETL) Strategy Carefully

Having an effective ETL strategy that extracts data from the various transactional systems, transforms the data to a common format, and loads the data into a relational or multidimensional database is the key to a successful data warehouse project. If the ETL strategy is not effective, it will mean delays in refreshing the data warehouse, contamination of the data warehouse with dirty data, and increased costs in maintaining the warehouse.

IRS Compliance Warehouse supports research and decision support, allowing the IRS to analyze, develop, and implement business strategies for increasing voluntary compliance, improving productivity, and managing the organization. It also provides projections, forecasts, quantitative analysis, and modeling. Users are able to query this data for decision support.

A major hurdle was to transform the large and diverse legacy online transactional data sets for effective use in an analytical architecture. The IRS needed a way to process custom hierarchical data files and convert them to ASCII for local processing and mapping to relational databases, and it ended up developing a script program to do all of this. ETL is a major challenge and may be a showstopper for a warehouse implementation (Kmonk, 1999).
(CEIS) is a 4-terabyte data warehouse holds the medical
records of the 8.5 million active members of the U.S.
Leverage the Data Warehouse to
military health care system who are treated at 115 Provide Auditing Capability
hospitals and 461 clinics around the world. The Defense
Department wanted to convert its fixed-cost health care An overlooked benefit of data warehouses is its capabil-
system to a managed-care model to lower costs and ity of serving as an archive of historic knowledge that
increase patient care for the active military, retirees and can be used as an audit trail for later investigations.
U.S. Army Operational Testing and Evaluation Command (OPTEC) is charged with developing test criteria and evaluating the performance of extremely complex weapons equipment in every conceivable environment and condition. Moreover, as national defense policy is undergoing a transformation, so do the weapon systems, and thus the testing requirements. The objective of their warehouse was to consolidate a myriad of test data sets to provide analysts and auditors with access to the specific information needed to make proper decisions.

OPTEC was having fits when audit agencies, such as the General Accounting Office (GAO), would show up to investigate a weapon system. For instance, if problems with a weapon show up five years after it is introduced into the field, people are going to want to know what tests were performed and the results of those tests. A warehouse with its metadata capability made data retrieval much more efficient (Microsoft, 2000).

Leverage the Web and Web Portals for Warehouse Data to Reach Dispersed Users

In many organizations, users are geographically distributed, and the World Wide Web has been very effective as a gateway for these dispersed users to access the key resources of their organization, which include data warehouses and data marts.

U.S. Army OPTEC developed a Web-based front end for its warehouse so that information can be entered and accessed regardless of the hardware available to users. It supports the geographically dispersed nature of OPTEC's mission. Users performing tests in the field can be anywhere from Albany, New York, to Fort Hood, Texas. That is why the browser client the Army developed is so important to the success of the warehouse (Microsoft, 2000).

DoD Defense Dental Standard System supports more than 10,000 users at 600 military installations worldwide. The solution consists of three main modules: Dental Charting, Dental Laboratory Management, and Workload and Dental Readiness Reporting. The charting module helps dentists graphically record patient information. The lab module automates the workflow between dentists and lab technicians. The reporting module allows users to see key information through Web-based online reports, which is a key to the success of the defense dental operations.

IRS Compliance Data Warehouse includes a Web-based query and reporting solution that provides high-value, easy-to-use data access and analysis capabilities, can be quickly and easily installed and managed, and can scale to support hundreds of thousands of users. With this portal, the IRS found that portals provide an effective way to access diverse data sources via a single screen (Kmonk, 1999).

Make Warehouse Data Available to All Knowledge Workers (Not Only to Managers)

The early data warehouses were designed to support upper management decision-making. However, over time, organizations have realized the importance of knowledge sharing and collaboration and its relevance to the success of the organizational mission. As a result, upper management has become aware of the need to disseminate the functionality of the data warehouse throughout the organization.

IRS Compliance Data Warehouse supports a diversity of user types: economists, research analysts, and statisticians, all of whom are searching for ways to improve customer service, increase compliance with federal tax laws, and increase productivity. It is not just for upper management decision making anymore (Kmonk, 1999).

Supply Data in a Format Readable by Spreadsheets

Although online analytical tools such as those supported by Cognos and Business Objects are useful for data analysis, the spreadsheet is still the basic tool used by most analysts.

U.S. Army OPTEC wanted users to transfer data and work with information on applications that they are familiar with. In OPTEC, they transfer the data into a format readable by spreadsheets so that analysts can really crunch the data. Specifically, pivot tables found in spreadsheets allow the analysts to manipulate the information to put meaning behind the data (Microsoft, 2000).
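As a small illustration of the pivot-table idea (the column names and sample values are hypothetical and are not drawn from the OPTEC system), the same cross-tabulation an analyst would build in a spreadsheet can be produced from a warehouse extract with pandas:

import pandas as pd

tests = pd.DataFrame({
    "system":  ["A", "A", "B", "B", "A", "B"],
    "site":    ["Ft. Hood", "Albany", "Ft. Hood", "Albany", "Albany", "Ft. Hood"],
    "outcome": [1, 0, 1, 1, 1, 0],   # 1 = test passed
})

# Rows are weapon systems, columns are test sites, cells are pass rates.
summary = tests.pivot_table(index="system", columns="site",
                            values="outcome", aggfunc="mean")
print(summary)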
Restrict or Encrypt Classified/Sensitive Data

Depending on requirements, a data warehouse can contain confidential information that should not be revealed to unauthorized users. If privacy is breached, the organization may become legally liable for damages and suffer a negative reputation with the ensuing loss of customers' trust and confidence. Financial consequences can result.

DoD Computerized Executive Information System uses an online analytical processing tool from a popular vendor that could be used to restrict access to certain data, such as HIV test results, so that any confidential data would not be disclosed (Hamblen, 1998).
Considerations must be made in the architecting of such a data warehouse. One alternative is to use a roles-based architecture that allows access to sensitive data by only authorized users, together with the encryption of data in the event of data interception.

Perform Analysis During the Data Collection Process

Most data analyses involve completing data collection before analysis can begin. With data warehouses, a new approach can be undertaken.

Census 2000 Cost and Progress System was built to consolidate information from several computer systems. The data warehouse allowed users to perform analyses during the data collection process, something previously not possible. The system allowed executives to take a more proactive management role. With this system, Census directors, regional offices, managers, and congressional oversight committees have the ability to track the 2000 census, which had never been done before (SAS, 2000).

Leverage User Familiarity with Browsers to Reduce Training Requirements

The interface of a Web browser is very familiar to most employees. Navigating through a learning management system using a browser may be more user friendly than using the navigation system of a proprietary training software package.

U.S. Department of Agriculture (USDA) Rural Development, Office of Community Development, administers funding programs for the Rural Empowerment Zone Initiative. There was a need to tap into legacy databases to provide accurate and timely rural funding information to top policy makers. Through Web accessibility using an intranet system, there were dramatic improvements in financial reporting accuracy and timely access to data. Prior to the intranet, questions such as "What were the Rural Development investments in 1997 for the Mississippi Delta region?" required weeks of laborious data gathering and analysis, yet yielded obsolete answers with only an 80 percent accuracy factor. Now, similar analysis takes only a few minutes to perform, and the accuracy of the data is as high as 98 percent. More than 7,000 Rural Development employees nationwide can retrieve the information at their desktops, using a standard Web browser. Because employees are familiar with the browser, they did not need training to use the new data mining system (Ferris, 2003).

Use Information Visualization Techniques Such as Geographic Information Systems (GIS)

A GIS combines layers of data about a physical location to give users a better understanding of that location. GIS allows users to view, understand, question, interpret, and visualize data in ways simply not possible in paragraphs of text or in the rows and columns of a spreadsheet or table.

EPA Envirofacts Warehouse includes the capability of displaying its output via the EnviroMapper GIS system. It maps several types of environmental information, including drinking water, toxic and air releases, hazardous waste, water discharge permits, and Superfund sites at the national, state, and county levels (Garvey, 2003). Individuals familiar with Mapquest and other online mapping tools can easily navigate the system and quickly get the information they need.

FUTURE TRENDS

Data warehousing will continue to grow as long as there are disparate silos of data sources throughout an organization. However, the irony is that there will be a proliferation of data warehouses as well as data marts, which will not interoperate within an organization. Some experts predict the evolution toward a federated architecture for the data warehousing environment. For example, there will be a common staging area for data integration and, from this source, data will flow among several data warehouses. This will ensure that the single-truth requirement is maintained throughout the organization (Hackney, 2000).

Another important trend in warehousing is one away from the historic nature of data in warehouses and toward real-time distribution of data so that information visibility will be instantaneous (Carter, 2004). This is a key factor for business decision-making in a constantly changing environment. Emerging technologies, namely service-oriented architectures and Web services, are expected to be the catalyst for this to occur.

CONCLUSION

An organization needs to understand how it can leverage data from a warehouse or mart to improve its level of service and the quality of its products and services. Also, the organization needs to recognize that its most valuable resource, the workforce, needs to be adequately trained in accessing and utilizing a data warehouse.
The workforce should recognize the value of the knowledge that can be gained from data warehousing and how to apply it to achieve organizational success.

A data warehouse should be part of an enterprise architecture, which is a framework for visualizing the information technology assets of an enterprise and how these assets interrelate. It should reflect the vision and business processes of an organization. It should also include standards for the assets and interoperability requirements among these assets.

REFERENCES

AMS. (1999). Military marches toward next-generation health care service: The Defense Dental Standard System.

Carter, M. (2004). The death of data warehousing. Loosely Coupled.

Deitch, J. (2000). Technicians are from Mars, users are from Venus: Myths and facts about data warehouse administration (Presentation).

Ferris, N. (1999). 9 hot trends for '99. Government Executive.

Ferris, N. (2003). Information is power. Government Executive.

Garvey, P. (2003). Envirofacts warehouse public access to environmental data over the Web (Presentation).

Gerber, C. (1996). Feds turn to OLAP as reporting tool. Federal Computer Week.

Hackney, D. (2000). Data warehouse delivery: The federated future. DM Review.

Hamblen, M. (1998). Pentagon to deploy huge medical data warehouse. Computer World.

Kirwin, B. (2003). Management update: Total cost of ownership analysis provides many benefits. Gartner Research, IGG-08272003-01.

Kmonk, J. (1999). Viador information portal provides Web data access and reporting for the IRS. DM Review.

Matthews, W. (2000). Digging digital gold. Federal Computer Week.

Microsoft Corporation. (2000). OPTEC adopts data warehousing strategy to test critical weapons systems.

Microsoft Corporation. (2000). U.S. Navy ensures readiness using SQL Server.

Parker, G. (1999). Data warehousing at the federal government: A CIO perspective. In Proceedings from Data Warehouse Conference '99.

PriceWaterhouseCoopers. (2001). Technology forecast.

SAS. (2000). The U.S. Bureau of the Census counts on a better system.

Schwartz, A. (2000). Making the Web safe. Federal Computer Week.

KEY TERMS

ASCII: American Standard Code for Information Interchange. Serves as a code for representing English characters as numbers, with each letter assigned a number from 0 to 127.

Data Warehousing: A compilation of data designed for decision support by executives, managers, analysts, and other key stakeholders in an organization. A data warehouse contains a consistent picture of business conditions at a single point in time.

Database: A collection of facts, figures, and objects that is structured so that it can easily be accessed, organized, managed, and updated.

Enterprise Architecture: A business and performance-based framework to support cross-agency collaboration, transformation, and organization-wide improvement.

Extraction-Transformation-Loading (ETL): A key transitional set of steps in migrating data from the source systems to the database housing the data warehouse. Extraction refers to drawing out the data from the source system, transformation concerns converting the data to the format of the warehouse, and loading involves storing the data into the warehouse.

Geographic Information Systems: Map-based tools used to gather, transform, manipulate, analyze, and produce information related to the surface of the Earth.

Hierarchical Data Files: Database systems that are organized in the shape of a pyramid, with each row of objects linked to objects directly beneath it. This approach has generally been superseded by relational database systems.

Knowledge Management: A concept where an organization deliberately and comprehensively gathers, organizes, and analyzes its knowledge, then shares it internally and sometimes externally.
system in which an organization has invested consider- equal to roughly 1,000 gigabytes. B
able time and money and resides on a mainframe or
minicomputer. Total Cost of Ownership: Developed by Gartner
Group, an accounting method used by organizations
Outsourcing: Acquiring services or products from seeking to identify their both direct and indirect sys-
an outside supplier or manufacturer in order to cut costs tems costs.
and/or procure outside expertise.
Performance Metrics: Key measurements of sys-
tem attributes that is used to determine the success of NOTE
the process.
Pivot Tables: An interactive table found in most The views expressed in this article are those of the
spreadsheet programs that quickly combines and com- author and do not reflect the official policy or position
pares typically large amounts of data. One can rotate its of the National Defense University, the Department of
rows and columns to see different arrangements of the Defense or the U.S. Government.
source data, and also display the details for areas of
interest.

Bibliomining for Library Decision-Making

Scott Nicholson
Syracuse University School of Information Studies, USA

Jeffrey Stanton
Syracuse University School of Information Studies, USA
INTRODUCTION

Most people think of a library as the little brick building in the heart of their community or the big brick building in the center of a college campus. However, these notions greatly oversimplify the world of libraries. Most large commercial organizations have dedicated in-house library operations, as do schools; nongovernmental organizations; and local, state, and federal governments. With the increasing use of the World Wide Web, digital libraries have burgeoned, serving a huge variety of different user audiences. With this expanded view of libraries, two key insights arise. First, libraries are typically embedded within larger institutions. Corporate libraries serve their corporations, academic libraries serve their universities, and public libraries serve taxpaying communities who elect overseeing representatives. Second, libraries play a pivotal role within their institutions as repositories and providers of information resources. In the provider role, libraries represent in microcosm the intellectual and learning activities of the people who comprise the institution. This fact provides the basis for the strategic importance of library data mining: By ascertaining what users are seeking, bibliomining can reveal insights that have meaning in the context of the library's host institution.

Use of data mining to examine library data might be aptly termed bibliomining. With widespread adoption of computerized catalogs and search facilities over the past quarter century, library and information scientists have often used bibliometric methods (e.g., the discovery of patterns in authorship and citation within a field) to explore patterns in bibliographic information. During the same period, various researchers have developed and tested data-mining techniques, which are advanced statistical and visualization methods to locate nontrivial patterns in large datasets. Bibliomining refers to the use of these bibliometric and data-mining techniques to explore the enormous quantities of data generated by the typical automated library.

BACKGROUND

Forward-thinking authors in the field of library science began to explore sophisticated uses of library data some years before the concept of data mining became popularized. Nutter (1987) explored library data sources to support decision making but lamented that "the ability to collect, organize, and manipulate data far outstrips the ability to interpret and to apply them" (p. 143). Johnston and Weckert (1990) developed a data-driven expert system to help select library materials, and Vizine-Goetz, Weibel, and Oskins (1990) developed a system for automated cataloging based on book titles (see also Morris, 1992, and Aluri & Riggs, 1990). A special section of Library Administration and Management, "Mining your automated system," included articles on extracting data to support system management decisions (Mancini, 1996), extracting frequencies to assist in collection decision making (Atkins, 1996), and examining transaction logs to support collection management (Peters, 1996).

More recently, Banerjee (1998) focused on describing how data mining works and how to use it to provide better access to the collection. Guenther (2000) discussed data sources and bibliomining applications but focused on the problems with heterogeneous data formats. Doszkocs (2000) discussed the potential for applying neural networks to library data to uncover possible associations between documents, indexing terms, classification codes, and queries. Liddy (2000) combined natural language processing with text mining to discover information in digital library collections. Lawrence, Giles, and Bollacker (1999) created a system to retrieve and index citations from works in digital libraries. Gutwin, Paynter, Witten, Nevill-Manning, and Frank (1999) used text mining to support resource discovery.

These projects all shared a common focus on improving and automating two of the core functions of a library: acquisitions and collection management. A few authors have recently begun to address the need to support management by focusing on understanding library users:
Schulman (1998) discussed using data mining to examine changing trends in library user behavior; Sallis, Hill, Jancee, Lovette, and Masi (1999) created a neural network that clusters digital library users; and Chau (2000) discussed the application of Web mining to personalize services in electronic reference.

The December 2003 issue of Information Technology and Libraries was a special issue dedicated to the bibliomining process. Nicholson presented an overview of the process, including the importance of creating a data warehouse that protects the privacy of users. Zucca discussed the implementation of a data warehouse in an academic library. Wormell; Suárez-Balseiro, Iribarren-Maestro, and Casado; and Geyer-Schultz, Neumann, and Thede used bibliomining in different ways to understand the use of academic library sources and to create appropriate library services.

We extend these efforts by taking a more global view of the data generated in libraries and the variety of decisions that those data can inform. Thus, the focus of this work is on describing ways in which library and information managers can use data mining to understand patterns of behavior among library users and staff and patterns of information resource use throughout the institution.

MAIN THRUST

Integrated Library Systems and Data Warehouses

Most managers who wish to explore bibliomining will need to work with the technical staff of their Integrated Library System (ILS) vendors to gain access to the databases that underlie the system and create a data warehouse. The cleaning, preprocessing, and anonymizing of the data can absorb a significant amount of time and effort. Only by combining and linking different data sources, however, can managers uncover the hidden patterns that can help them understand library operations and users.

Exploration of Data Sources

Available library data sources are divided into three groups for this discussion: data from the creation of the library, data from the use of the collection, and data from external sources not normally included in the ILS.

ILS Data Sources from the Creation of the Library System

Bibliographic Information

One source of data is the collection of bibliographic records and searching interfaces that represents materials in the library, commonly known as the Online Public Access Catalog (OPAC). In a digital library environment, the same type of information collected in a bibliographic library record can be collected as metadata. The concepts parallel those in a traditional library: Take an agreed-upon standard for describing an object, apply it to every object, and make the resulting data searchable. Therefore, digital libraries use conceptually similar bibliographic data sources to traditional libraries.

Acquisitions Information

Another source of data for bibliomining comes from acquisitions, where items are ordered from suppliers and tracked until they are received and processed. Because digital libraries do not order physical goods, somewhat different acquisition methods and vendor relationships exist. Nonetheless, in both traditional and digital library environments, acquisition data have untapped potential for understanding, controlling, and forecasting information resource costs.

ILS Data Sources from Usage of the Library System

User Information

In order to verify the identity of users who wish to use library services, libraries maintain user databases. In libraries associated with institutions, the user database is closely aligned with the organizational database. Sophisticated public libraries link user records through zip codes with demographic information in order to learn more about their user population. Digital libraries may or may not have any information about their users, based upon the login procedure required. No matter what data are captured about the patron, it is important to ensure that the identification information about the patron is separated from the demographic information before this information is stored in a data warehouse; doing so protects the privacy of the individual.
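A minimal sketch of that separation step (purely illustrative; the field names and the choice of a salted hash are assumptions of this example, not a prescription from the article) replaces the patron identifier with a one-way pseudonym before demographic records enter the warehouse:

import hashlib

SALT = "library-specific secret"   # kept outside the warehouse

def pseudonym(patron_id):
    """One-way token that lets records be linked without exposing identity."""
    return hashlib.sha256((SALT + patron_id).encode("utf-8")).hexdigest()[:16]

def split_patron_record(record):
    """Separate identifying fields from the demographics sent to the warehouse."""
    identity = {k: record[k] for k in ("patron_id", "name", "address") if k in record}
    demographics = {k: v for k, v in record.items() if k not in identity}
    demographics["patron_key"] = pseudonym(record["patron_id"])
    return identity, demographics   # only the demographics side is loaded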
Circulation and Usage Information

The richest sources of information about library user behavior are circulation and usage records. Legal and ethical issues limit the use of circulation data, however. A data warehouse can be useful in this situation, because basic demographic information and details about the circulation could be recorded without infringing upon the privacy of the individual.

Digital library services have a greater difficulty in defining circulation, as viewing a page does not carry the same meaning as checking a book out of the library, although requests to print or save a full-text information resource might be similar in meaning. Some electronic full-text services already implement the server-side capture of such requests from their user interfaces.

Searching and Navigation Information

The OPAC serves as the primary means of searching for works owned by the library. Additionally, because most OPACs use a Web browser interface, users may also access bibliographic databases, the World Wide Web, and other online resources during the same session; all this information can be useful in library decision making. Digital libraries typically capture logs from users who are searching their databases and can track, through clickstream analysis, the elements of Web-based services visited by users. In addition, the combination of a login procedure and cookies allows the connection of user demographics to the services and searches they used in a session.

External Data Sources

Reference Desk Interactions

In the typical face-to-face or telephone interaction with a library user, the reference librarian records very little information about the interaction. Digital reference transactions, however, occur through an electronic format, and the transaction text can be captured for later analysis, which provides a much richer record than is available in traditional reference work. The utility of these data can be increased if identifying information about the user can be captured as well, but again, anonymization of these transactions is a significant challenge.

Item Use Information

Fussler and Simon (as cited in Nutter, 1987) estimated that 75 to 80% of the use of materials in academic libraries is in house. Some types of materials never circulate, and therefore, tracking in-house use is also vital in discovering patterns of use. This task becomes much easier in a digital library, as Web logs can be analyzed to discover what sources the users examined.

Interlibrary Loan and Other Outsourcing Services

Many libraries use interlibrary loan and/or other outsourcing methods to get items on a need-by-need basis for users. The data produced by this class of transactions will vary by service but can provide a window to areas of need in a library collection.

Applications of Bibliomining through a Data Warehouse

Bibliomining can provide an understanding of the individual sources listed previously in this article; however, much more information can be discovered when sources are combined through common fields in a data warehouse.

Bibliomining to Improve Library Services

Most libraries exist to serve the information needs of users, and therefore, understanding the needs of individuals or groups is crucial to a library's success. For many decades, librarians have suggested works; market basket analysis can provide the same function through usage data in order to aid users in locating useful works. Bibliomining can also be used to determine areas of deficiency and to predict future user needs. Common areas of item requests and unsuccessful searches may point to areas of collection weakness. By looking for patterns in high-use items, librarians can better predict the demand for new items.
Virtual reference desk services can build a database of questions and expert-created answers, which can be used in a number of ways. Data mining could be used to discover patterns for tools that will automatically assign questions to experts based upon past assignments. In addition, by mining the question/answer pairs for patterns, an expert system could be created that can provide users an immediate answer and a pointer to an expert for more information.

Bibliomining for Organizational Decision Making Within the Library

Just as the user behavior is captured within the ILS, the behavior of library staff can also be discovered by

Just as the user behavior is captured within the ILS, the behavior of library staff can also be discovered by connecting various databases to supplement existing performance review methods. Although monitoring staff through their performance may be an uncomfortable concept, tighter budgets and demands for justification require thoughtful and careful performance tracking. In addition, research has shown that incorporating clear, objective measures into performance evaluations can actually improve the fairness and effectiveness of those evaluations (Stanton, 2000).

Low-use statistics for a work may indicate a problem in the selection or cataloging process. Looking at the associations between assigned subject headings, call numbers, and keywords, along with the responsible party for the catalog record, may lead to a discovery of system inefficiencies. Vendor selection and price can be examined in a similar fashion to discover if a staff member consistently uses a more expensive vendor when cheaper alternatives are available. Most libraries acquire works both by individual orders and through automated ordering plans that are configured to fit the size and type of that library. Although these automated plans do simplify the selection process, if some or many of the works they recommend go unused, then the plan might not be cost effective. Therefore, merging the acquisitions and circulation databases and seeking patterns that predict low use can aid in appropriate selection of vendors and plans.

Bibliomining for External Reporting and Justification

The library may often be able to offer insights to their parent organization or community about their user base through patterns detected with bibliomining. In addition, library managers are often called upon to justify the funding for their library when budgets are tight. Likewise, managers must sometimes defend their policies, particularly when faced with user complaints. Bibliomining can provide the data-based justification to back up the anecdotal evidence usually used for such arguments.

Bibliomining of circulation data can provide a number of insights about the groups who use the library. By clustering the users by materials circulated and tying demographic information into each cluster, the library can develop conceptual user groups that provide a model of the important constituencies of the institution's user base; this grouping, in turn, can fulfill some common organizational needs for understanding where common interests and expertise reside in the user community. This capability may be particularly valuable within large organizations where research and development efforts are dispersed over multiple locations.
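A minimal sketch of this clustering step, assuming a hypothetical user-by-subject circulation matrix and one demographic attribute per user; it uses scikit-learn's k-means purely for illustration, not any particular bibliomining toolkit.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical matrix: one row per user, one column per subject area,
# each cell holding how many items that user circulated in that subject.
circulation_counts = np.array([
    [12, 0, 3],   # user 1
    [10, 1, 4],   # user 2
    [0, 8, 7],    # user 3
    [1, 9, 6],    # user 4
])
departments = ["Engineering", "Engineering", "History", "History"]  # demographics

# Group users into conceptual user groups by their circulation profiles
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(circulation_counts)

# Tie demographic information to each cluster to describe the constituencies
for cluster_id in range(2):
    members = [departments[i] for i, label in enumerate(kmeans.labels_) if label == cluster_id]
    print(f"Cluster {cluster_id}: {members}")
```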
FUTURE TRENDS

Consortial Data Warehouses

One future path of bibliomining is to combine the data from multiple libraries through shared data warehouses. This merger will require standards if the libraries use different systems. One such standard is the COUNTER project (2004), which is a standard for reporting the use of digital library resources. Libraries working together to pool their data will be able to gain a competitive advantage over publishers and have the data needed to make better decisions. This type of data warehouse can power evidence-based librarianship, another growing area of research (Eldredge, 2000).

Combining these data sources will allow library science research to move from making statements about a particular library to making generalizations about librarianship. These generalizations can then be tested on other consortial data warehouses and in different settings and may be the inspiration for theories. Bibliomining and other forms of evidence-based librarianship can therefore encourage the expansion of the conceptual and theoretical frameworks supporting the science of librarianship.

Bibliomining, Web Mining, and Text Mining

Web mining is the exploration of patterns in the use of Web pages. Bibliomining uses Web mining as its base but adds some knowledge about the user. This aids in one of the shortcomings of Web mining: many times, nothing is known about the user. This lack still holds true in some digital library applications; however, when users access password-protected areas, the library has the ability to map some information about the patron onto the usage information. Therefore, bibliomining uses tools from Web usage mining but has more data available for pattern discovery.

Text mining is the exploration of the context of text in order to extract information and understand patterns. It helps to add information to the usage patterns discovered through bibliomining. To use terms from information science, bibliomining focuses on patterns in the data that label and point to the information container, while text mining focuses on the information within that container. In the future, organizations that fund digital libraries can look to text mining to greatly improve access to materials beyond the current cataloging/metadata solutions.

The quality and speed of text mining continues to improve. Liddy (2000) has researched the extraction of information from digital texts; implementing these technologies can allow a digital library to move from suggesting texts that might contain the answer to just providing the answer by extracting it from the appropriate text or texts. The use of such tools risks taking textual material out of context and also provides few hints about the quality of the material, but if these extractions were links directly into the texts, then context could emerge along with an answer. This situation could provide a substantial asset to organizations that maintain large bodies of technical texts, because it would promote rapid, universal access to previously scattered and/or uncataloged materials.

Example of Hybrid Approach

Hwang and Chuang (in press) have recently combined bibliomining, Web mining, and text mining in a recommender system for an academic library. They started by using data mining on Web usage data for articles in a digital library and combining that information with information about the users. They then built a system that looked at patterns between works based on their content by using text mining. By combining these two systems into a hybrid system, they found that the hybrid system provides more accurate recommendations for users than either system taken separately. This example is a perfect representation of the future of bibliomining and how it can be used to enhance the text-mining research projects already in progress.

CONCLUSION

Libraries have gathered data about their collections and users for years but have not always used those data for better decision making. By taking a more active approach based on applications of data mining, data visualization, and statistics, these information organizations can get a clearer picture of their information delivery and management needs. At the same time, libraries must continue to protect their users and employees from the misuse of personally identifiable data records. Information discovered through the application of bibliomining techniques gives the library the potential to save money, provide more appropriate programs, meet more of the users' information needs, become aware of the gaps and strengths of their collection, and serve as a more effective information source for its users. Bibliomining can provide the data-based justifications for the difficult decisions and funding requests library managers must make.

REFERENCES

Atkins, S. (1996). Mining automated systems for collection management. Library Administration & Management, 10(1), 16-19.

Banerjee, K. (1998). Is data mining right for your library? Computers in Libraries, 18(10), 28-31.

Chau, M. Y. (2000). Mediating off-site electronic reference services: Human-computer interactions between libraries and Web mining technology. IEEE Fourth International Conference on Knowledge-Based Intelligent Engineering Systems & Allied Technologies, USA, 2 (pp. 695-699).

Chaudhry, A. S. (1993). Automation systems as tools of use studies and management information. IFLA Journal, 19(4), 397-409.

COUNTER (2004). COUNTER: Counting online usage of networked electronic resources. Retrieved from http://www.projectcounter.org/about.html

Doszkocs, T. E. (2000). Neural networks in libraries: The potential of a new information technology. Retrieved from http://web.simmons.edu/~chen/nit/NIT%2791/027~dos.htm

Eldredge, J. (2000). Evidence-based librarianship: An overview. Bulletin of the Medical Library Association, 88(4), 289-302.

Geyer-Schulz, A., Neumann, A., & Thede, A. (2003). An architecture for behavior-based library recommender systems. Information Technology and Libraries, 22(4), 165-174.

Guenther, K. (2000). Applying data mining principles to library data collection. Computers in Libraries, 20(4), 60-63.

Gutwin, C., Paynter, G., Witten, I., Nevill-Manning, C., & Frank, E. (1999). Improving browsing in digital libraries with keyphrase indexes. Decision Support Systems, 27, 81-104.

Hwang, S., & Chuang, S. (in press). Combining article content and Web usage for literature recommendation in digital libraries. Online Information Review.

Johnston, M., & Weckert, J. (1990). Selection advisor: An expert system for collection development. Information Technology and Libraries, 9(3), 219-225.

Lawrence, S., Giles, C. L., & Bollacker, K. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32(6), 67-71.

Liddy, L. (2000, November/December). Text mining. Bulletin of the American Society for Information Science, 13-14.

Mancini, D. D. (1996). Mining your automated system for systemwide decision making. Library Administration & Management, 10(1), 11-15.

Morris, A. (Ed.). (1992). Application of expert systems in library and information centers. London: Bowker-Saur.

Nicholson, S. (2003). The bibliomining process: Data warehousing and data mining for library decision-making. Information Technology and Libraries, 22(4), 146-151.

Nutter, S. K. (1987). Online systems and the management of collections: Use and implications. Advances in Library Automation Networking, 1, 125-149.

Peters, T. (1996). Using transaction log analysis for library management information. Library Administration & Management, 10(1), 20-25.

Sallis, P., Hill, L., Janee, G., Lovette, K., & Masi, C. (1999). A methodology for profiling users of large interactive systems incorporating neural network data mining techniques. Proceedings of the 1999 Information Resources Management Association International Conference (pp. 994-998).

Schulman, S. (1998). Data mining: Life after report generators. Information Today, 15(3), 52.

Stanton, J. M. (2000). Reactions to employee performance monitoring: Framework, review, and research directions. Human Performance, 13, 85-113.

Suárez-Balseiro, C. A., Iribarren-Maestro, I., & Casado, E. S. (2003). A study of the use of the Carlos III University of Madrid Library's online database service in Scientific Endeavor. Information Technology and Libraries, 22(4), 179-182.

Vizine-Goetz, D., Weibel, S., & Oskins, M. (1990). Automating descriptive cataloging. In R. Aluri & D. Riggs (Eds.), Expert systems in libraries (pp. 123-127). Norwood, NJ: Ablex Publishing Corporation.

Wormell, I. (2003). Matching subject portals with the research environment. Information Technology and Libraries, 22(4), 158-166.

Zucca, J. (2003). Traces in the clickstream: Early work on a management information repository at the University of Pennsylvania. Information Technology and Libraries, 22(4), 175-178.

KEY TERMS

Bibliometrics: The study of regularities in citations, authorship, subjects, and other extractable facets from scientific communication by using quantitative and visualization techniques. This study allows researchers to understand patterns in the creation and documented use of scholarly publishing.

Bibliomining: The application of statistical and pattern-recognition tools to large amounts of data associated with library systems in order to aid decision making or to justify services. The term bibliomining comes from the combination of bibliometrics and data mining, which are the two main toolsets used for analysis.

Data Warehousing: The gathering and cleaning of data from disparate sources into a single database, which is optimized for exploration and reporting. The data warehouse holds a cleaned version of the data from operational systems, and data mining requires the type of cleaned data that live in a data warehouse.

Evidence-Based Librarianship: The use of the best available evidence, combined with the experiences of working librarians and the knowledge of the local user base, to make the best decisions possible (Eldredge, 2000).

Integrated Library System: The automation system for libraries that combines modules for cataloging, acquisition, circulation, end-user searching, database access, and other library functions through a common set of interfaces and databases.

Online Public Access Catalog (OPAC): The module of the Integrated Library System designed for use by the public to allow discovery of the library's holdings through the searching of bibliographic surrogates. As libraries acquire more digital materials, they are linking those materials to the OPAC entries.

NOTE

This work is based on Nicholson, S., & Stanton, J. (2003). Gaining strategic advantage through bibliomining: Data mining for management decisions in corporate, special, digital, and traditional libraries. In H. Nemati & C. Barko (Eds.), Organizational data mining: Leveraging enterprise data resources for optimal performance (pp. 247-262). Hershey, PA: Idea Group.

Biomedical Data Mining Using RBF Neural Networks
Feng Chu
Nanyang Technological University, Singapore

Lipo Wang
Nanyang Technological University, Singapore

INTRODUCTION

Accurate diagnosis of cancers is of great importance for doctors to choose a proper treatment. Furthermore, it also plays a key role in the searching for the pathology of cancers and drug discovery. Recently, this problem attracts great attention in the context of microarray technology. Here, we apply radial basis function (RBF) neural networks to this pattern recognition problem. Our experimental results in some well-known microarray data sets indicate that our method can obtain very high accuracy with a small number of genes.

BACKGROUND

Microarray is also called gene chip or DNA chip. It is a newly appeared biotechnology that allows biomedical researchers to monitor thousands of genes simultaneously (Schena, Shalon, Davis, & Brown, 1995). Before the appearance of microarrays, a traditional molecular biology experiment usually works on only one gene or several genes, which makes it difficult to have a whole picture of an entire genome. With the help of microarrays, researchers are able to monitor, analyze and compare expression profiles of thousands of genes in one experiment.

On account of their features, microarrays have been used in various tasks such as gene discovery, disease diagnosis, and drug discovery. Since the end of the last century, cancer classification based on gene expression profiles has attracted great attention in both the biological and the engineering fields. Compared with traditional cancer diagnostic methods based mainly on the morphological appearances of tumors, the method using gene expression profiles is more objective, accurate, and reliable. More importantly, some types of cancers have subtypes with very similar appearances that are very hard to be classified by traditional methods. It has been proven that gene expression has a good capability to clarify this previously muddy problem.

Thus, to develop accurate and efficient classifiers based on gene expression becomes a problem of both theoretical and practical importance. Recent approaches on this problem include artificial neural networks (Khan et al., 2001), support vector machines (Guyon, Weston, Barnhill, & Vapnik, 2002), k-nearest neighbor (Olshen & Jain, 2002), nearest shrunken centroids (Tibshirani, Hastie, Narashiman, & Chu, 2002), and so on.

A solution to this problem is to find out a group of important genes that contribute most to differentiate cancer subtypes. In the meantime, we should also provide proper algorithms that are able to make correct predictions based on the expression profiles of those genes. Such work will benefit early diagnosis of cancers. In addition, it will help doctors choose proper treatment. Furthermore, it also throws light on the relationship between the cancers and those important genes.

From the point of view of machine learning and statistical learning, cancer classification using gene expression profiles is a challenging problem. The reason lies in the following two points. First, typical gene expression data sets usually contain very few samples (from several to several tens for each type of cancers). In other words, the training data are scarce. Second, such data sets usually contain a large number of genes, for example, several thousands. That is, the data are high dimensional. Therefore, this is a special pattern recognition problem with a relatively small number of patterns and very high dimensionality. To provide such a problem with a good solution, appropriate algorithms should be designed.

In fact, a number of different approaches such as k-nearest neighbor (Olshen & Jain, 2002), support vector machines (Guyon et al., 2002), artificial neural networks (Khan et al., 2001) and some statistical methods have been applied to this problem since 1995. Among these approaches, some obtained very good results. For example, Khan et al. (2001) classified small round blue cell tumors (SRBCTs) with 100% accuracy by using 96 genes. Tibshirani et al. (2002) successfully classified SRBCTs with 100% accuracy by using only 43 genes.

They also classified three different subtypes of lymphoma with 100% accuracy by using 48 genes (Tibshirani, Hastie, Narashiman, & Chu, 2003).

However, there are still a lot of things that can be done to improve present algorithms. In this work, we use and compare two gene selection schemes, i.e., principal components analysis (PCA) (Simon, 1999) and a t-test-based method (Tusher, Tibshirani, & Chu, 2001). After that, we introduce an RBF neural network (Fu & Wang, 2003) as the classification algorithm.

MAIN THRUST

After a comparative study of gene selection methods, a detailed description of the RBF neural network and some experimental results are presented in this section.

Microarray Data Sets

We analyze three well-known gene expression data sets, i.e., the SRBCT data set (Khan et al., 2001), the lymphoma data set (Alizadeh et al., 2000), and the leukemia data set (Golub et al., 1999).

The lymphoma data set (http://llmpp.nih.gov/lymphoma) (Alizadeh et al., 2000) contains 4026 well measured clones belonging to 62 samples. These samples belong to the following types of lymphoid malignancies: diffuse large B-cell lymphoma (DLBCL, 42 samples), follicular lymphoma (FL, nine samples) and chronic lymphocytic leukemia (CLL, 11 samples). In this data set, a small part of the data is missing. A k-nearest neighbor algorithm was used to fill those missing values (Troyanskaya et al., 2001).

The SRBCT data set (http://research.nhgri.nih.gov/microarray/Supplement/) (Khan et al., 2001) contains the expression data of 2308 genes. There are totally 63 training samples and 25 testing samples. Five of the testing samples are not SRBCTs. The 63 training samples contain 23 Ewing family of tumors (EWS), 20 rhabdomyosarcoma (RMS), 12 neuroblastoma (NB), and eight Burkitt lymphomas (BL). And the 20 testing samples contain six EWS, five RMS, six NB, and three BL.

The leukemia data set (http://www-genome.wi.mit.edu/cgi-bin/cancer/publications) (Golub et al., 1999) has two types of leukemia, i.e., acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Among these samples, 38 of them are for training; the other 34 blind samples are for testing. The entire leukemia data set contains the expression data of 7,129 genes. Different from the cDNA microarray data, the leukemia data are oligonucleotide microarray data. Because such expression data are raw data, we need to normalize them to reduce the systemic bias induced during experiments. We follow the normalization procedure used by Dudoit, Fridlyand, and Speed (2002). Three preprocessing steps were applied: (a) thresholding with a floor of 100 and a ceiling of 16000; (b) filtering, exclusion of genes with max/min < 5 or (max - min) < 500, where max and min refer to the maximum and the minimum of the gene expression values, respectively; and (c) base 10 logarithmic transformation. There are 3571 genes that survived these three steps. After that, the data were standardized across experiments, i.e., minus the mean and divided by the standard deviation of each experiment.
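The following NumPy sketch shows one way to carry out these three preprocessing steps plus the per-experiment standardization on a genes-by-samples matrix. It is only an illustration of the procedure as described; the function name and the random stand-in data are ours, not part of the original study.

```python
import numpy as np

def preprocess_leukemia(raw, floor=100.0, ceiling=16000.0):
    """Apply the three preprocessing steps described above to a
    genes-by-samples matrix of raw oligonucleotide intensities."""
    # (a) thresholding with a floor of 100 and a ceiling of 16000
    x = np.clip(raw, floor, ceiling)

    # (b) filtering: exclude genes with max/min < 5 or (max - min) < 500
    gene_max = x.max(axis=1)
    gene_min = x.min(axis=1)
    keep = (gene_max / gene_min >= 5) & (gene_max - gene_min >= 500)
    x = x[keep]

    # (c) base 10 logarithmic transformation
    x = np.log10(x)

    # standardize across experiments: zero mean, unit variance per sample (column)
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    return x, keep

# Usage with a small random stand-in for the 7,129-gene matrix
raw = np.random.uniform(20, 20000, size=(7129, 38))
normalized, kept_genes = preprocess_leukemia(raw)
print(normalized.shape)
```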
Methods for Gene Selection

As mentioned in the former part, the gene expression data are very high-dimensional. The dimension of input patterns is determined by the number of genes used. In a typical microarray experiment, usually several thousands of genes take part. Therefore, the dimension of patterns is several thousands. However, only a small number of the genes contribute to correct classification; some others even act as noise. Gene selection can eliminate the influence of such noise. Furthermore, the fewer the genes used, the lower the computational burden to the classifier. Finally, once a smaller subset of genes is identified as relevant to a particular cancer, it helps biomedical researchers focus on these genes that contribute to the development of the cancer. The process of gene selection is ranking genes' discriminative ability first and then retaining the genes with high ranks.

As a critical step for classification, gene selection has been studied intensively in recent years. There are two main approaches: one is principal component analysis (PCA) (Simon, 1999), perhaps the most widely used method; the other is a t-test-based approach which has been more and more widely accepted. In the important papers (Alizadeh et al., 2000; Khan et al., 2001), PCA was used. The basic idea of PCA is to find the most informative genes that contain most of the information in the data set. Another approach is based on the t-test that is able to measure the difference between two groups. Thomas, Olsen, Tapscott, and Zhao (2001) recommended this method. Tusher et al. (2001) and Pan (2002) also proposed their methods based on the t-test, respectively. Besides these two main methods, there are also some other methods. For example, a method called Markov blanket was proposed by Xing, Jordan, and Karp (2001). Li, Weinberg, Darden, and Pedersen (2001) applied another method which combined a genetic algorithm and K-nearest neighbor.

PCA (Simon, 1999) aims at reducing the input dimension by transforming the input space into a new space described by principal components (PCs). All the PCs are orthogonal and they are ordered according to the absolute value of their eigenvalues. The k-th PC is the vector with the k-th largest eigenvalue. By leaving out the vectors with small eigenvalues, the input space's dimension is reduced. In fact, the PCs indicate the directions with largest variations of input vectors. Because PCA chooses vectors with largest eigenvalues, it covers directions with largest variations of vectors. In the directions determined by the vectors with small eigenvalues, the variations of vectors are very small. In a word, PCA intends to capture the most informative directions (Simon, 1999).
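For illustration, the sketch below computes principal components of a samples-by-genes matrix with a singular value decomposition and keeps the leading components; it is a generic PCA, not the exact implementation used in the studies cited, and the matrix shown is a random stand-in.

```python
import numpy as np

def top_principal_components(data, num_components):
    """data: samples-by-genes matrix. Returns the projection of each sample
    onto the num_components directions with the largest variance."""
    centered = data - data.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:num_components].T

samples = np.random.randn(62, 4026)          # stand-in for the lymphoma data
pcs = top_principal_components(samples, 36)  # e.g., feed the top 36 PCs to a classifier
print(pcs.shape)                              # (62, 36)
```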
We tested PCA in the lymphoma data set (Alizadeh et al., 2000). We obtained 62 PCs from the 4026 genes in the data set by using PCA. Then, we ranked those PCs according to their eigenvalues (absolute values). Finally, we used our RBF neural network that will be introduced in the latter part to classify the lymphoma data set.

At first, we randomly divided the 62 samples into two parts, 31 samples for training and the other 31 samples for testing. We then input the 62 PCs one by one to the RBF network according to their eigenvalue ranks starting with the PC ranked one. That is, we first used only a single PC that is ranked 1 as the input to the RBF network. We trained the network with the training data and subsequently tested the network with the testing data. We repeated this process with the top two PCs, then the top three PCs, and so on. Figure 1 shows the testing error. From this result, we found that the RBF network cannot reach 100% accuracy. The best testing accuracy is 93.55%, which happened when 36 or 61 PCs were input to the classifier. The classification result using the t-test-based gene selection method will be shown in the next section, which is much better than the PCA approach.

The t-test-based gene selection measures the difference of genes' distribution using a t-test-based scoring scheme, i.e., t-score (TS). After that, only the genes with the highest TSs are to be put into our classifier. The TS of gene i is defined as follows (Tusher et al., 2001):

TS_i = \max\left\{ \frac{|\bar{x}_{ik} - \bar{x}_i|}{d_k s_i},\ k = 1, 2, \ldots, K \right\}

where:

\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k

\bar{x}_i = \sum_{j=1}^{n} x_{ij} / n

s_i^2 = \frac{1}{n - K} \sum_{k} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2

d_k = \sqrt{1/n_k + 1/n}

There are K classes. max{y_k, k = 1, 2, ..., K} is the maximum of all y_k, k = 1, 2, ..., K. C_k refers to class k that includes n_k samples. x_ij is the expression value of gene i in sample j. x̄_ik is the mean expression value in class k for gene i. n is the total number of samples. x̄_i is the general mean expression value for gene i. s_i is the pooled within-class standard deviation for gene i. Actually, the t-score used here is a t-statistic between a specific class and the overall centroid of all the classes.
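A small NumPy sketch of this t-score computation, assuming a hypothetical genes-by-samples expression matrix and integer class labels; the variable names are ours and the code is meant only to mirror the definitions above.

```python
import numpy as np

def t_scores(X, labels):
    """X: genes-by-samples expression matrix; labels: class index per sample.
    Returns the TS value of each gene as defined above."""
    classes = np.unique(labels)
    n = X.shape[1]
    K = len(classes)

    overall_mean = X.mean(axis=1)   # general mean expression value per gene
    class_means = np.stack([X[:, labels == k].mean(axis=1) for k in classes], axis=1)
    n_k = np.array([(labels == k).sum() for k in classes])

    # pooled within-class standard deviation s_i
    within = sum(((X[:, labels == k] - class_means[:, [j]]) ** 2).sum(axis=1)
                 for j, k in enumerate(classes))
    s = np.sqrt(within / (n - K))

    d_k = np.sqrt(1.0 / n_k + 1.0 / n)
    scores = np.abs(class_means - overall_mean[:, None]) / (d_k[None, :] * s[:, None])
    return scores.max(axis=1)   # TS_i = maximum over the K classes

# Rank genes by TS and keep the top ones (e.g., the top 96 as in the SRBCT experiment)
X = np.random.randn(2308, 63)              # stand-in expression data
labels = np.random.randint(0, 4, size=63)  # stand-in class labels
top_genes = np.argsort(t_scores(X, labels))[::-1][:96]
```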
To compare the t-test-based method with PCA, we also applied it to the lymphoma data set with the same procedure as what we did by using PCA. This method obtained 100% accuracy with only the top six genes. The results are shown in Figure 1. This comparison indicated that the t-test-based method was much better than PCA in this problem.

Figure 1. Classification results of using PCA and the t-test-based method as gene selection methods (testing error vs. number of genes)

An RBF Neural Network

An RBF neural network (Haykin, 1999) has three layers. The first layer is an input layer; the second layer is a hidden layer that includes some radial basis functions, also known as hidden kernels; the third layer is an output layer. An RBF neural network can be regarded as a mapping of the input domain X onto the output domain Y. Mathematically, an RBF neural network can be described as follows:

y_m(x) = \sum_{i=1}^{N} w_{mi} \, G(\|x - t_i\|) + b_m, \quad m = 1, 2, \ldots, M

Here ‖·‖ stands for the Euclidean norm. M is the number of outputs. N is the number of hidden kernels. y_m(x) is the output m corresponding to the input x. t_i is the position of kernel i. w_mi is the weight between the kernel i and the output m. b_m is the bias on the output m. G(‖x − t_i‖) is the kernel function. Usually, an RBF neural network uses Gaussian kernel functions as follows:

G(\|x - t_i\|) = \exp\left( -\frac{\|x - t_i\|^2}{2\sigma_i^2} \right)

where σ_i is the radius of the kernel i.

The main steps to construct an RBF neural network include: (a) determining the positions of all the kernels (t_i); (b) determining the radius of each kernel (σ_i); and (c) calculating the weights between each kernel and each output node.

In this paper, we use a novel RBF neural network proposed by Fu and Wang (2003), which allows for large overlaps of hidden kernels belonging to the same class.
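To make the structure concrete, here is a minimal generic RBF classifier in NumPy: Gaussian hidden kernels at fixed centers and an output layer fitted by least squares on one-hot class targets. It is not the Fu and Wang (2003) network used in this article, and the centers, radius, and toy data below are illustrative assumptions.

```python
import numpy as np

class SimpleRBFNet:
    """Generic three-layer RBF network: Gaussian hidden kernels plus a
    linear output layer fitted by least squares. A sketch only."""

    def __init__(self, centers, sigma):
        self.centers = np.asarray(centers)   # kernel positions t_i
        self.sigma = sigma                    # shared kernel radius
        self.weights = None                   # output weights and biases

    def _hidden(self, X):
        # G(||x - t_i||) for every input x and kernel t_i
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        G = np.exp(-d2 / (2.0 * self.sigma ** 2))
        return np.hstack([G, np.ones((len(X), 1))])   # extra column for the bias b_m

    def fit(self, X, y, num_classes):
        targets = np.eye(num_classes)[y]               # one-hot outputs y_m
        H = self._hidden(np.asarray(X))
        self.weights, *_ = np.linalg.lstsq(H, targets, rcond=None)
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X)) @ self.weights

# Toy usage: a few selected genes as inputs, four tumour classes as outputs
X_train = np.random.randn(63, 7)
y_train = np.random.randint(0, 4, size=63)
net = SimpleRBFNet(centers=X_train[:10], sigma=1.0).fit(X_train, y_train, num_classes=4)
predicted_class = net.predict(np.random.randn(20, 7)).argmax(axis=1)
```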

Results

In the SRBCT data set, we first ranked the entire 2308 genes according to their TSs (Tusher et al., 2001). Then we picked out 96 genes with the highest TSs. We applied our RBF neural network to classify the SRBCT data set. The SRBCT data set contains 63 samples for training and 20 blind samples for testing. We input the selected 96 genes one by one to the RBF network according to their TS ranks starting with the gene ranked one. We repeated this process with the top two genes, then the top three genes, and so on. Figure 2 shows the testing errors with respect to the number of genes. The testing error decreased to 0 when the top seven genes were input into the RBF network.

In the leukemia data set, we chose 56 genes with the highest TSs (Tusher et al., 2001). We followed the same procedure as in the SRBCT data set. We did classification with one gene, then two genes, then three genes, and so on. Our RBF neural network got an accuracy of 97.06%, i.e., one error in all 34 samples, when 12, 20, 22, 32 genes were input, respectively.

Figure 2. The testing result in the SRBCT data set (testing error vs. number of genes)

Figure 3. The testing result in the leukemia data set (testing error vs. number of genes)

FUTURE TRENDS

Until now, the focus of work is investigating the information with statistical importance in microarray data sets. In the near future, we will try to incorporate more biological knowledge into our algorithm, especially the correlations of genes.

In addition, with more and more microarray data sets produced in laboratories around the world, we will try to mine multiple data sets with our RBF neural network, i.e., we will try to process the combined data sets. Such an attempt will hopefully bring us a much broader and deeper insight into those data sets.

CONCLUSION

Through our experiments, we conclude that the t-test-based gene selection method is an appropriate feature selection/dimension reduction approach, which can find more important genes than PCA can.

The results in the SRBCT data set and the leukemia data set proved the effectiveness of our RBF neural network. In the SRBCT data set, it obtained 100% accuracy with only seven genes. In the leukemia data set, it made only one error with 12, 20, 22, and 32 genes, respectively. In view of this, we also conclude that our RBF neural network outperforms almost all the previously published methods in terms of accuracy and the number of genes required.

REFERENCES

Alizadeh, A.A. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.

Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77-87.

Fu, X., & Wang, L. (2003). Data dimensionality reduction with application to simplifying RBF neural network structure and improving classification performance. IEEE Trans. Syst., Man, Cybernetics, Part B: Cybernetics, 33, 399-409.

Golub, T.R. et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389-422.

Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). New Jersey, USA: Prentice-Hall, Inc.

Khan, J.M. et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673-679.

Li, L., Weinberg, C.R., Darden, T.A., & Pedersen, L.G. (2001). Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17, 1131-1142.

Olshen, A.B., & Jain, A.N. (2002). Deriving quantitative conclusions from microarray expression data. Bioinformatics, 18, 961-970.

Pan, W. (2002). A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics, 18, 546-554.

Schena, M., Shalon, D., Davis, R.W., & Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467-470.

Thomas, J.G., Olsen, J.M., Tapscott, S.J., & Zhao, L.P. (2001). An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Research, 11, 1227-1236.

Tibshirani, R., Hastie, T., Narashiman, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA, 99, 6567-6572.

Tibshirani, R., Hastie, T., Narashiman, B., & Chu, G. (2003). Class prediction by nearest shrunken centroids with applications to DNA microarrays. Statistical Science, 18, 104-117.

Troyanskaya, O., Cantor, M., Sherlock, G., et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520-525.

Tusher, V.G., Tibshirani, R., & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98, 5116-5121.

Xing, E.P., Jordan, M.I., & Karp, R.M. (2001). Feature selection for high-dimensional genomic microarray data. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 601-608). Morgan Kaufmann Publishers, Inc.

KEY TERMS

Feature Extraction: Feature extraction is the process to obtain a group of features with the characteristics we need from the original data set. It usually uses a transform (e.g., principal component analysis) to obtain a group of features at one time of computation.

Feature Selection: Feature selection is the process to select some features we need from all the original features. It usually measures the characteristic (e.g., t-test score) of each feature first and then chooses some features we need.

Gene Expression Profile: Through microarray chips, an image that describes to what extent genes are expressed can be obtained. It usually uses red to indicate the high expression level and green to indicate the low expression level. This image is also called a gene expression profile.

Microarray: A microarray is also called a gene chip or a DNA chip. It is a newly appeared biotechnology that allows biomedical researchers to monitor thousands of genes simultaneously.

Principal Components Analysis: Principal components analysis transforms one vector space into a new space described by principal components (PCs). All the PCs are orthogonal to each other and they are ordered according to the absolute value of their eigenvalues. By leaving out the vectors with small eigenvalues, the dimension of the original vector space is reduced.

Radial Basis Function (RBF) Neural Network: An RBF neural network is a kind of artificial neural network. It usually has three layers, i.e., an input layer, a hidden layer, and an output layer. The hidden layer of an RBF neural network contains some radial basis functions, such as Gaussian functions or polynomial functions, to transform the input vector space into a new non-linear space. An RBF neural network has the universal approximation ability, i.e., it can approximate any function to any accuracy, as long as there are enough hidden neurons.

T-Test: T-test is a kind of statistical method that measures how large the difference is between two groups of samples.

Testing a Neural Network: To know whether a trained neural network is the mapping or the regression we need, we test this network with some data that have not been used in the training process. This procedure is called testing a neural network.

Training a Neural Network: Training a neural network means using some known data to build the structure and tune the parameters of this network. The goal of training is to make the network represent a mapping or a regression we need.

Building Empirical-Based Knowledge for Design Recovery
Hee Beng Kuan Tan
Nanyang Technological University, Singapore

Yuan Zhao
Nanyang Technological University, Singapore

INTRODUCTION

Although the use of statistically probable properties is very common in the area of medicine, it is not so in software engineering. The use of such properties may open a new avenue for the automated recovery of designs from source codes. In fact, the recovery of designs can also be called program mining, which in turn can be viewed as an extension of data mining to the mining in program source codes.

BACKGROUND

Today, most of the tasks in software verification, testing, and re-engineering remain manually intensive (Beizer, 1990), time-consuming, and error prone. As many of these tasks require the recognition of designs from program source codes, automation of the recognition is an important means to improve these tasks. However, many designs are difficult (if not impossible) to recognize automatically from program source codes through theoretical knowledge alone (Biggerstaff, Mitbander, & Webster, 1994; Kozaczynski, Ning, & Engberts, 1992). Most of the approaches proposed for the recognition of designs from program source codes are based on plausible inference (Biggerstaff et al., 1994; Ferrante, Ottenstein, & Warren, 1987; Kozaczynski et al., 1992). That is, they are actually based on empirical-based knowledge (Kitchenham, Pfleeger, Pickard, Jones, Hoaglin, Emam, & Rosenberg, 2002). However, to the best of our knowledge, the building of empirical-based knowledge to supplement theoretical knowledge for the recognition of designs from program source code has not been formally discussed in the literature.

This paper introduces an approach for the building and applying of empirical-based knowledge to supplement theoretical knowledge for the recognition of designs from program source codes. The first section discusses the application of the proposed approach for the recovery of functional dependencies enforced in database transactions. The final section shows our conclusion.

MAIN THRUST

The Approach

Many types of designs are usually implemented through a few methods. The use of a method has a direct influence on the programs that implement the designs. As a result, these programs may have some certain characteristics. And we may be able to recognize the designs or their properties through recognizing these characteristics, from either a theoretical or empirical basis or the combination of the two.

An overview of the approach for building empirical-based knowledge for design recovery through program analysis is shown in Figure 1. In the figure, arcs show interactions between tasks. In the approach, we first research the designs or their properties, which can be recognized from some characteristics in the programs that implement them through automated program analysis. This task requires domain knowledge or experience. The reason for using design properties is that some designs could be too complex to recognize directly. In the latter case, we first recognize the properties, then use them to infer the designs. We aim for characteristics that are not only sufficient but also necessary for recognizing designs or their properties. If the characteristics are sufficient but not necessary, even if we cannot infer the target designs, they do not imply the nonexistence of the designs. If we cannot find characteristics from which a design can be formally proved, then we will look for characteristics that have significant statistical evidence. These empirical-based characteristics are taken as hypotheses. With the use of hypotheses and theoretical knowledge, a theory is built for the inference of designs.

Secondly, we design experiments to validate the hypotheses. Some software tools should be developed to automate or semiautomate the characterization of the properties stated in the hypotheses. We may merge multiple hypotheses together as a single hypothesis for the convenience of hypothesis testing. An experiment is designed to conduct a binomial test (Gravetter & Wallnau, 2000) for each resulting hypothesis. If altogether we have k hypotheses denoted by H1, ..., Hk and we would like the probability of validity of the proposed design recovery to be more than or equal to q, then we must choose p1, ..., pk such that p1 × ... × pk ≥ q. For each hypothesis Hj (1 ≤ j ≤ k), the null and alternate hypotheses of the binomial test state that less than pj*100% and equal or more than pj*100%, respectively, of the cases are those in which Hj holds. That is:

Hj0 (null hypothesis): probability that Hj holds < pj
Hj1 (alternate hypothesis): probability that Hj holds ≥ pj

For the use of the normal approximation for the binomial test, both npj and n(1-pj) must be greater than or equal to 10. As such, the sample size n must be greater than or equal to max(10/pj, 10/(1-pj)). The experiment is designed to draw a random sample of size n to test the hypothesis. For each case in the sample, the validity of the hypothesis is examined. The total number of cases, X, that the hypothesis holds is recorded and substituted in the following binomial test statistic:

z = \frac{X/n - p_j}{\sqrt{p_j (1 - p_j)/n}}

Let α be the Type I error probability. If z is greater than z_α, we reject the null hypothesis; otherwise, we accept the null hypothesis, where the probability of the standard normal model for z ≥ z_α is α.
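A small Python sketch of this test statistic, using the n = 263, X = 263, and p = 0.96 figures reported later in this article; the function and variable names are ours.

```python
import math

def binomial_test_z(X, n, p):
    """Normal-approximation z statistic for the one-sided binomial test."""
    return (X / n - p) / math.sqrt(p * (1 - p) / n)

# Sample size must satisfy the normal-approximation conditions np >= 10 and n(1-p) >= 10
n, X, p, alpha = 263, 263, 0.96, 0.001
assert n >= max(10 / p, 10 / (1 - p))

z = binomial_test_z(X, n, p)
z_alpha = 3.09  # critical value of the standard normal for alpha = 0.001
print(round(z, 2), z > z_alpha)  # 3.31 True -> reject the null hypothesis
```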
Thirdly, we develop the required software tools and conduct the experiments to test each hypothesis according to the design drawn in the previous step.

Fourthly, if all the hypotheses are accepted, algorithms will be developed for the use of the theory to automatically recognize the designs from program source codes. A software tool will also be developed to implement the algorithms. Some experiments should also be conducted to validate the effectiveness of the method.

Figure 1. An overview of the proposed design recovery (tasks: identify designs or their properties; identify program characteristics and formulate hypotheses; develop software tools to aid experiments for hypotheses testing; develop a theory for the inference of designs; conduct experiments for hypotheses testing; if the hypotheses are accepted, develop algorithms for automated design recovery; develop a software tool to implement the algorithms; conduct an experiment to validate the effectiveness)

Applying the Proposed Approach for the Recovery of Functional Dependencies

Let R be a record type and X be a sequence of attributes of R. For any record r in R, its sequence of values of the attributes in X is referred to as the X-value of r. Let R be a record type, and X and Y be sequences of attributes of R. We say that the functional dependency (FD), X → Y of R, holds at time t, if at time t, for any two R records r and s, the X-values of r and s are identical, then the Y-values of r and s are also identical. We say that the functional dependency holds in a database if, except in the midst of a transaction execution (which updates some record types involved in the dependency), the dependency always holds (Ullman, 1982).
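As an illustration of this definition, the sketch below checks whether X → Y currently holds over a list of records; the record layout and attribute names are hypothetical, and the check is independent of the recovery method described in this article.

```python
def fd_holds(records, X, Y):
    """Return True if, for the given records (dicts), equal X-values imply equal Y-values."""
    seen = {}
    for r in records:
        x_value = tuple(r[a] for a in X)
        y_value = tuple(r[a] for a in Y)
        if x_value in seen and seen[x_value] != y_value:
            return False
        seen[x_value] = y_value
    return True

plant_equipment = [
    {"plant-name": "P1", "equipment-name": "E1",
     "equipment-manufacturer": "M1", "manufacturer-address": "A1"},
    {"plant-name": "P2", "equipment-name": "E2",
     "equipment-manufacturer": "M1", "manufacturer-address": "A1"},
]
print(fd_holds(plant_equipment, ["equipment-manufacturer"], ["manufacturer-address"]))  # True
```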
Many of the world's database applications have been built on old generation DBMSs (database management systems). Due to the nature of system development, many functional dependencies are not discovered in the initial system development. They are only identified during the system maintenance stage.

Although keys can be used to implement functional dependencies in old generation DBMSs, due to the effort in restructuring databases during the system maintenance stage, most functional dependencies identified during this stage are not defined explicitly as keys in the databases. They are enforced in transactions. Furthermore, most of the conventional files and relational databases allow only the definition of one key. As such, most of the candidate keys are enforced in transactions. As a result, many functional dependencies in a legacy database are enforced in the transactions that update the database. Our previous work (Tan, 2004) has proven that if all the transactions that update a database satisfy a set of properties with reference to a functional dependency, X → Y of R, then the functional dependency holds in the database. Before proceeding further, we shall discuss these properties.

For generality, we shall express a program path in the form (a1, ..., an), where ai is either a node or in the form {ni1, ..., nik}, in which ni1, ..., nik are nodes. If ai is in the form {ni1, ..., nik}, then in (a1, ..., an), after the predecessor node of ai, the subsequence of nodes, ni1, ..., nik, may occur any number of times (possibly zero), before proceeding to the successor of ai.

Before proceeding further, we shall introduce some notations to represent certain types of nodes in control flow graphs of transactions that will be used throughout this paper:

- rd(R, W == b): A node that reads or selects an R record in which the W-value is b if it exists.
- mdf(R, W == b, Z1 := c1, ..., Zn := cn): A node that modifies the Z1-value, ..., Zn-value of an R record, in which the original W-value is b, to c1, ..., cn, respectively.
- ins(R, Z1 := c1, ..., Zn := cn): A node that inserts an R record, in which its Z1-value, ..., Zn-value are set to c1, ..., cn, respectively.

Here, R is a record type. W, Z1, ..., Zn are sequences of R attributes. The values of those attributes that are not mentioned in the mdf and ins nodes can be modified and set to any value.

We shall also use xclu_fd(X → Y of R) to denote a node in the control flow graph of a transaction that does not perform any updating that may lead to the violation of the functional dependency X → Y of R.

A program path in the form (rd(R, X == x), {xclu_fd(X → Y of R)}, {{xclu_fd(X → Y of R)}, ins(R, X := x, Y := y), {xclu_fd(X → Y of R)}}), such that if the rd node successfully reads an R record specified, then y is identical to the Y-value of the record read, is called an insertion pattern for enforcing the FD, X → Y of R.

A program path in the form ({{xclu_fd(X → Y of R)}, mdf(R, X == x0, Y := y), {xclu_fd(X → Y of R)}}), such that all the R records in which the X-values are equal to x0 are modified by the mdf node, is called a Y-modification pattern for enforcing the FD, X → Y of R.

A program path in the form (rd(R, X == x), {xclu_fd(X → Y of R)}, {{xclu_fd(X → Y of R)}, mdf(R, X == x0, X := x, Y := unchange), {xclu_fd(X → Y of R)}}), such that the mdf node is only executed if the rd node does not successfully read an R record specified, is called an X-modification pattern for enforcing the FD, X → Y of R.
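To make the insertion pattern concrete, here is a toy transaction over an in-memory table: it reads for an existing record with the same X-value and, if one exists, keeps the inserted Y-value identical to that record's Y-value, so the FD X → Y cannot be violated. The table representation and function names are ours, purely for illustration.

```python
def insert_with_fd(table, X, Y, new_record):
    """Toy transaction following the insertion pattern for the FD X -> Y:
    rd(R, X == x) followed by ins(R, ...) with y taken from any existing match."""
    x_value = tuple(new_record[a] for a in X)
    # rd(R, X == x): read an existing R record with the same X-value, if any
    existing = next((r for r in table if tuple(r[a] for a in X) == x_value), None)
    if existing is not None:
        # keep y identical to the Y-value of the record read, preserving the FD
        for a in Y:
            new_record[a] = existing[a]
    table.append(new_record)   # ins(R, X := x, Y := y)

table = []
insert_with_fd(table, ["equipment-manufacturer"], ["manufacturer-address"],
               {"equipment-manufacturer": "M1", "manufacturer-address": "A1"})
insert_with_fd(table, ["equipment-manufacturer"], ["manufacturer-address"],
               {"equipment-manufacturer": "M1", "manufacturer-address": "A2"})
print(table[1]["manufacturer-address"])  # "A1" -- the FD still holds
```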
We have proven the following rules (Tan, 2004):

- Nonviolation of FD: In a transaction, if all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency, X → Y of R, then the transaction does not violate the functional dependency.
- FD in Database: Each transaction that updates any record type involved in a functional dependency does not violate the functional dependency if and only if it is a functional dependency designed in the database.

Theoretically, the property stated in the first rule is not a necessary property in order for the functional dependency to hold. As such, we may not be able to recover all functional dependencies enforced by recognizing these properties.

Fortunately, other than very exceptional cases, most of the enforcement of functional dependency does result in the previously mentioned property. As such, empirically, the property is usually also necessary for the functional dependency, X → Y of R, to hold in the database. Thus, we take the following hypothesis.

Hypothesis 1: If a transaction does not violate the functional dependency, X → Y of R, then all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency.

With the hypothesis, the result discussed by Tan and Thein (in press) can be extended to the following theorem.

Theorem 1: A functional dependency, X → Y of R, holds in a database if and only if in each transaction that updates any record type involved in the functional dependency, all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency.

The proof of sufficiency can be found in previous work (see Tan & Thein, 2004). The necessity is implied directly from the hypothesis.

This theorem can be used to recover all the functional dependencies in a database. We have developed an algorithm for the recovery. The algorithm is similar to the algorithm presented by Tan and Thein (2004).

We have conducted an experiment to validate the hypothesis. In this experiment, we get each of our 71 postgraduate students who take the software requirements analysis and design course to design the following three transactions in pseudocode to update the table, plant-equipment (plant-name, equipment-name, equipment-manufacturer, manufacturer-address), such that each transaction ensures that the functional dependency, equipment-manufacturer → manufacturer-address of plant-equipment, holds:

- Insert a new tuple in the table to store a user input each time. The input has four fields, inp-plant-name, inp-equipment-name, inp-equipment-manufacturer and inp-manufacturer-address. In the tuple, the value of each attribute is set according to the corresponding input field.
- For each user input that has a single field, inp-equipment-manufacturer, delete the tuple in the table in which the value of the attribute, equipment-manufacturer, is identical to the value of inp-equipment-manufacturer.
- For each user input that has two fields, old-equipment-manufacturer and new-equipment-manufacturer, update the equipment-manufacturer in each tuple in which the value of equipment-manufacturer is identical to the value of old-equipment-manufacturer to the value of new-equipment-manufacturer.

We also analyzed 50 transactions from three existing database applications. Table 1 summarizes the result of the transactions designed by the students.

Table 1. The statistics of an experiment

Transaction   Enforcement of the Functional Dependency
              Correct   Wrong
1             55        16
2             65        6
3             36        35

We examined each transaction that enforces the functional dependency. We found that in each of these transactions, all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency. As such, Hypothesis 1 holds in each of these transactions. Because each transaction that does not enforce the functional dependency correctly violates the functional dependency, Hypothesis 1 always holds in such a transaction. Therefore, Hypothesis 1 holds in all the 213 transactions. For the 50 transactions that we drew from three existing database applications, each database application enforces only one functional dependency in their transactions (other functional dependencies are implemented as keys in the databases). We checked each transaction on Hypothesis 1 with respect to the functional dependency that is enforced in the transactions in the database application to which the transaction belongs. And Hypothesis 1 held in all the transactions. Therefore, Hypothesis 1 holds for all the 263 transactions in our sample.

Taking 0.001 as the Type I error probability (α), if the binomial test statistic z computed from the formula discussed previously in this paper is greater than 3.09, we reject the null hypothesis; otherwise, we accept the null hypothesis. In our experiment, we have n = 263, X = 263, and p = 0.96, so our z score is 3.31. Note that our sample size is greater than max(10/0.96, 10/(1-0.96)). Thus, we reject the null hypothesis, and the test gives evidence that Hypothesis 1 holds for equal or more than 96% of the cases at the 0.1% level of significance. Because only one hypothesis is involved in the approach for the inference of functional dependencies, the test gives evidence that the approach holds for equal or more than 96% of the cases at the 0.1% level of significance.

FUTURE TRENDS

In general, many designs are very difficult (if not impossible) to formally prove from program characteristics (Chandra, Godefroid, & Palm, 2002; Clarke, Grumberg, & Peled, 1999; Deng & Kothari, 2002). Therefore, the use of empirical-based knowledge is important for the automated recognition of designs from program source codes (Embury & Shao, 2001; Tan, Ling, & Goh, 2002; Wong, 2001).

We believe that the integration of empirical-based properties into existing program analysis and model-checking techniques will be a fruitful direction in the future.

CONCLUSION

Empirical-based knowledge has been used in the recognition of designs from source codes through automated program analysis. It is a promising research direction. This chapter introduces the approach for building empirical-based knowledge, a vital part for such research exploration. We have also applied it in the recognition of functional dependencies enforced in database transactions from the transactions. We believe that our approach will encourage more exploration of the discovery and use of empirical-based knowledge in this area. Recently, we have completed our work on the recovery of posttransaction user-input error (PTUIE) handling for database transactions. This approach appeared in IEEE Transactions on Knowledge and Data Engineering (Tan & Thein, 2004).

REFERENCES

Basili, V. R. (1996). The role of experimentation in software engineering: Past, current, and future. The 18th International Conference on Software Engineering (pp. 442-449), Germany.

Beizer, B. (1990). Software testing techniques. New York: Van Nostrand Reinhold.

Chandra, S., Godefroid, P., & Palm, C. (2002). Software model checking in practice: An industrial case study. Proceedings of the International Conference on Software Engineering (pp. 431-441), USA.

Clarke, E. M., Grumberg, O., & Peled, D. A. (1999). Model checking. MIT Press.

Deng, Y. B., & Kothari, S. (2002). Recovering conceptual roles of data in a program. Proceedings of the International Conference on Software Maintenance (pp. 342-350), Canada.

Embury, S. M., & Shao, J. (2001). Assisting the comprehension of legacy transactions. Proceedings of the Working Conference on Reverse Engineering (pp. 345-354), Germany.

Ferrante, J., Ottenstein, K. J., & Warren, J. O. (1987). The program dependence graph and its use in optimisation. ACM Transactions on Programming Languages and Systems, 9(3), 319-349.

Gravetter, F. J., & Wallnau, L. B. (2000). Statistics for the behavioral sciences. Belmont, CA: Wadsworth.

Kitchenham, B. A., Pfleeger, S. L., Pickard, L. M., Jones, P. W., Hoaglin, D. C., Emam, K. E., & Rosenberg, J. (2002). Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8), 721-734.

Kozaczynski, W., Ning, J., & Engberts, A. (1992). Program concept recognition and transformation. IEEE Transactions on Software Engineering, 18(12), 1065-1075.

Tan, H. B. K., Ling, T. W., & Goh, C. H. (2002). Exploring into programs for the recovery of data dependencies designed. IEEE Transactions on Knowledge and Data Engineering, 14(4), 825-835.

Tan, H. B. K., & Thein, N. L. (in press). Recovery of PTUIE handling from source codes through recognizing its probable properties. IEEE Transactions on Knowledge and Data Engineering, 16(10), 1217-1231.

Ullman, J. D. (1982). Principles of database systems (2nd ed.). Rockville, MD: Computer Science Press.

Wong, K. (2001). Research challenges in reverse engineering community. Proceedings of the International Workshop on Program Comprehension (pp. 323-332), Canada.

KEY TERMS

Control Flow Graph: An abstract data structure used in compilers. It is an abstract representation of a procedure or program, maintained internally by a compiler. Each node in the graph represents a basic block. Directed edges are used to represent jumps in the control flow.

Design Recovery: Recreates design abstractions from a combination of code, existing design documentation (if available), personal experience, and general knowledge about problem and application domains.

Functional Dependency: For any record r in a record type, its sequence of values of the attributes in X is referred to as the X-value of r. Let R be a record type, and X and Y be sequences of attributes of R. We say that the functional dependency, X → Y of R, holds at time t, if at time t, for any two R records r and s, the X-values of r and s are identical, then the Y-values of r and s are also identical.

116

TEAM LinG
Building Empirical-Based Knowledge for Design Recovery

due to random chance (as stated in the null hypothesis) to the set of values or behaviors arising dynamically at run-
or to true differences in the samples (as stated in the time when executing a program on a computer. B
alternate hypothesis).
PTUIE: Posttransaction user-input error. An error
Model Checking: A method for formally verifying made by users in an input to a transaction execution and
finite-state concurrent systems. Specifications about discovered only after completion of the execution.
the system are expressed as temporal logic formulas,
and efficient symbolic algorithms are used to traverse the Transaction: An atomic set of processing steps in a
model defined by the system and check whether the database application such that all the steps are per-
specification holds. formed either fully or not at all.

Program Analysis: Offers static compile-time tech-


niques for predicting safe and computable approximations

117

TEAM LinG
118

Business Processes
David Sundaram
The University of Auckland, New Zealand

Victor Portougal
The University of Auckland, New Zealand

INTRODUCTION all functional areas were presented as subsystems sup-


ported by a separate functional areas information
The concept of a business process is central to many system. However, in a real business environment func-
areas of business systems, specifically to business sys- tional areas are interdependent, each requiring data
tems based on modern information technology. In the from the others. This fostered the development of the
new era of computer-based business management the concept of a business process as a multi-functional set
design of business process has eclipsed the previous of activities designed to produce a specified output.
functional design. Lindsay, Downs, and Lunn (2003) This concept has a customer focus. Suppose, a de-
suggest that business process may be divided into mate- fective product was delivered to a customer, it is the
rial, information and business parts. Further search for business function of customer service to accept the
efficiency and cost reduction will be predominantly defective item. The actual repair and redelivery of the
through office automation. The information part, in- item, however, is a business process that involves sev-
cluding data warehousing, data mining and increasing eral functional areas and functions within those areas.
informatisation of the office processes will play a key The customer is not concerned about how the product
role. Data warehousing and data mining, in and of them- was made, or how its components were purchased, or
selves, are business processes that are aimed at increas- how it was repaired, or the route the delivery truck took
ing the intelligent density (Dhar & Stein, 1997) of the to get to her house. The customer wants the satisfaction
data. But more important is the significant roles they play of having a working product at a reasonable price. Thus,
within the context of larger business and decision pro- the customer is looking across the companys func-
cesses. Apart from these the creation and maintenance of tional areas in her process. Business managers are now
a data warehouse itself comprises a set of business trying to view their business operations from the per-
processes (see Scheer, 2000). Hence a proper under- spective of a satisfied customer. Thinking in terms of
standing of business processes is essential to better business processes helps managers to look at their
understand data warehousing as well as data mining. organization from the customers perspective. ERP pro-
grams help to manage company wide business pro-
cesses, using a common database and shared manage-
BACKGROUND ment reporting tools. ERP software supports the effi-
cient operation of business processes by integrating
Companies that make products or provide services have business activities, including sales, marketing, manu-
several functional areas of operations. Each functional facturing, accounting, and staffing.
area comprises a variety of business functions or busi-
ness activities. For example, functional area of financial
accounting includes the business functions of financials, MAIN THRUST
controlling, asset management, and so on. Human re-
sources functional area includes the business functions In the following sections we first look at some defini-
of payroll, benefit administration, workforce planning, tions of business processes and follow this by some
and application data administration. Historically, orga- representative classifications of business process. Af-
nizational structures have separated functional areas, ter laying this foundation we look at business processes
and business education has been similarly organized, so that are specific to data warehousing and data mining.
each functional area was taught as a separate course. In The section culminates by looking at a business process
materials requirement planning (MRP) systems, prede- modelling methodology (namely ARIS) and briefly dis-
cessors of enterprise resource planning (ERP) systems, cussing the dark side of business process (namely mal-
processes).

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Business Processes

Definitions of Business Processes initiated in response to a specific event


B
Many definitions have been put forward to define business Every process is initiated by an event.
processes: some broad and some narrow. The broad ones The event is a request for the result produced by
help us understand the range and scope of business pro- the process.
cesses but the narrow ones are also valuable in that they are
actionable/pragmatic definitions that help us in defining, work tasks
modelling, and reengineering business processes.
Ould (1995) lists a few key features of Business The business process is a collection of clearly
Processes; it contains purposeful activity, it is carried identifiable tasks executed by one or more actors
out collaboratively by a group, it often crosses func- (person or organisation or machine or depart-
tional boundaries, it is invariably driven by outside ment).
agents or customers. Jacobson (1995) on the other hand A task could potentially be divided up into more
succinctly describes a business process as: The set of and finer steps.
internal activities performed to serve a customer. Bider
(2000) suggests that the business process re-engineer- a collection of interrelated
ing (BPR) community feel there is no great mystery
about what a process is - they follow the most general Such steps and tasks are not necessarily sequential
definition of business processes proposed by Hammer but could have parallel flows connected with com-
and Champy (1993) that a process is a set of partially plex logic.
ordered activities intended to reach a goal. The steps are interconnected through their dealing
Davenport (1993) defines process broadly as a with or processing one (or more) common work
structured, measured set of activities designed to pro- item(s) or business object(s)
duce a specified output for a particular customer or
market and more specifically as a specific order of Due to the importance of the business process
work activities across time and place, with a beginning, concept to the developing of the computerized enter-
an end, and clearly identified inputs and outputs: a prise management the work on refining the definition is
structure for action. going on. Lindsay, Downs, and Lunn (2003) argue that
While these definitions are useful they are not ad- the definitions of business process given in much of the
equate. Sharp and McDermott (2001) provide an excel- literature on Business Process Management (BPM) are
lent working definition of a business process: limited in depth and their related models of business
processes are too constrained. Because they are too
A business process is a collection of interrelated work limited to express the true nature of business processes,
tasks, initiated in response to an event that achieves they need to be further developed and adapted to todays
a specific result for the customer of the process. challenging environment.

It is worth exploring each of the phrases within this Business Process Classification
definition.
Over the years many classifications of processes have
achieves a particular result been suggested. The American Productivity & Quality
Center (Process Classification Framework, 1996) dis-
The result might be Goods and/or Services. tinguishes two types of processes 1) operating pro-
It should be possible to identify and count the cesses and 2) management and support processes.
result eg. Fulfilment of Orders, Resolution of Operating processes include processes such as:
Complaints, Raising of Purchase Orders, etc.
Understanding Markets and Customers
for the customer of the process Development of Vision and Strategy
Design of Product and Services
Every process has a customer. The customer maybe Marketing and Selling of Products and Services
internal (employee) or external (organisation). Production and Delivery of Products and Services
A key requirement is that customer should be able Invoicing and Servicing of Customers
to give feedback on the process

119

TEAM LinG
Business Processes

In contrast management and support processes in- standable patterns in data. Some of the key steps of the
clude processes such as: data mining business process involve:

Development and Management of Human Resources Understanding the business


Management of Information Preparation of the data this would usually in-
Management of Financial and Physical Resources volve data selection, pre-processing, and poten-
Management of External Relationships tially transformation of the data into a form that
Management of Improvement and Change is more amenable to modelling and understanding
Execution of Environmental Management Program Modelling using data mining techniques
Interpretation and evaluation of the model and
Most organisations would be able to fit their pro- results which will hopefully help us to better
cesses within this broad classification. However the understand the business
Gartner group (Genovese, Bond, Zrimsek, & Frey, 2001) Deploy the solution
has come up with a slightly different set of processes
that has a lifecycle orientation: Obviously the above steps are iterative at all stages
of the process and could feed backward. Tools such as
Prospect to Cash and Care Clementine from SPSS support all the above steps.
Requisition to Payment
Planning and Execution Business Processes Modelling
Plan to Performance
Design to Retirement The business processes frequently have to be recorded.
Human Capital Management According to Scheer (2000) the components of a pro-
cess description are events, statuses, users, organiza-
Each and every one of these processes could be tional units, and information technology resources.
potentially supported by data warehousing and data min- This multitude of elements would severely complicate
ing processes and technologies. the business process model. In order to reduce this
complexity, many researchers suggest that the model
Data Warehousing Business Processes is divided into individual views that can be handled
(largely) independent from one another. This is done
The information movement and use come under the for example in ARIS (see Scheer, 2000). The relation-
category of a business process when it involves human- ships between the components within a view are tightly
computer interaction. Data warehousing includes a set of coupled while the individual views are rather loosely
business processes, classified above as management of linked.
information. The set includes: ARIS includes four basic views, represented by:

Data input Data view: statuses and events represented as


Data export data and the other data structures of the process
Special query and reporting (regular reporting is a Function view: the description of functions used
fully computer process) in business processes
Security and safety procedures
Organization view: the users and the organiza-
The bulk of the processes are of course in the data tional units as well as their relationships and the
input category. It is here is that many errors originate relevant structures
that can disrupt the smooth operations of the manage- Control view: presenting the relationships be-
ment system. Strict formalisation of the activities and tween the other three views.
their sequence brings the major benefits. However, the
other categories are important as well, and their formal The integration of these three views through a sepa-
description and enforcement help to avoid mal-func- rate view is the main difference between ARIS and
tioning of the system. other descriptive methodologies, like Object Oriented
Design (OOD) (Jacobson, 1995).
Data Mining Business Processes Breaking down the initial business process into
individual views reduces the complexity of its descrip-
Data mining is a key business process that enables tion. Figure 1 illustrates these four views.
organisations to identify valid, novel, useful, and under-

120

TEAM LinG
Business Processes

Figure 1. ARIS views/house of business engineering view. This view, however, is significant for the subject-
(Scheer, 2000) related view of business processes only when it gives an B
opportunity for describing in full the other components
that are more directly linked toward the business.

Organisation View Mal-Processes

Most definitions of business processes suggest that


they have a goal of accomplishing a given task for the
customer of the process. This positive orientation has
been reflected in business process modelling method-
ologies, tools, techniques and even in the reference
Data Control Function models and implementation approaches. They do not
View View View explicitly consider the negative business process sce-
narios that could result in the accomplishment of unde-
sirable outcomes.
Consider a production or service organisation that
implements an ERP system. Let us assume the design
The control view can be depicted using event-driven stage is typical and business process oriented. We use
process chains (EPC). The four key components of the requirements elicitation techniques to define the as-
EPC are events (statuses), functions (transformations), is business process and engineer to-be processes
organization, and information (data). Events (initial) trig- (reference) using as much as possible, fragments of the
ger functions that then result in events (final). The func- reference model of a corresponding ERP system.
tions are carried out by one or more organizational units As a result of the design, a set of business processes
using appropriate data. A simplified depiction, of the will be created, defining a certain workflow for manag-
interconnection between the four basic components of ers (employees) of the company. These processes will
the EPC is illustrated in Figure 2. This was modeled using consist of functions that should be executed in a certain
the ARIS software (IDS Scheer, 2000). order defined by events. The employee responsible for
One or more events can be connected to two or more execution of a business process or a part of it will be
functions (and vice versa) using the three logical opera- called its user. Every business process starts from a
tors OR, AND, and XOR (exclusive OR). These basic pre-defined set of events, and performs a pre-defined
elements and operators enable us to model even the set of functions which involve either data input or data
most complex process. modification (when a definite data structure is taken
The SAP Reference Model (Curran, Keller, & Ladd, from the data-base, modified and put back, or trans-
1998) is also based on the four ARIS views, it employs ferred to another destination).
several other views, which are not so essential, but When executing a process, a user may intentionally
nevertheless helpful in depicting, storing and retrieving put wrong data into the database, or modify the data in the
business processes. wrong way: e.g. execute an incomplete business process,
For example, information technology resources can or execute a wrong branch of the business process, or
constitute a separate descriptive object, the resource even create an undesirable branch of a business process.
Such processes are called mal-processes.
Hence mal-processes are behaviour to be avoided
Figure 2. Basic components of the EPC
by the system. It is a sequence of actions that a system
can perform, interacting with a legal user of the system,
Initial Event resulting in harm for the organization or stakeholder if
the sequence is allowed to complete. The focus is on the
actions, which may be done intentionally or through
neglect. Similar effects produced accidentally are a
Transforming Organizational
Data
Function unit
subject of interest for data safety. Just as we have best
business practices, mal-processes illustrate the wrong
business practices. So, it would be natural, by analogy
with the reference model (that represents a repository
Final Event of business models reflecting best business practices),
to create a repository of business models reflecting

121

TEAM LinG
Business Processes

wrong business practices. Thus a designer could avoid Data Mining (CRISP-DM, 2004) that provides a lifecycle
wrong design solutions. process oriented approach to the mining of data in
A repository of such mal-processes would enable organizations. Apart from this data warehousing and data
organisations to avoid typical mistakes in enterprise mining techniques are a key element in various steps of
system design. Such a repository can be a valuable asset the process lifecycle alluded to in the trends. Data
in education of ERP designers and it can be useful in warehousing and data mining techniques help us not only
general management education as well. Another appli- in process identification and analysis (identifying and
cation of this repository might be troubleshooting. For analyzing candidates for improvement) but also in ex-
example, sources of some nasty errors in the sales and ecution and monitoring and control of processes. Data
distribution system might be found in the mal-pro- warehousing technologies enable us in collecting, ag-
cesses of data entry in finished goods handling. gregating, slicing and dicing of process information
while data mining technologies enable us in looking for
patterns in process information enabling us to monitor
FUTURE TRENDS and control the organizational process more efficiently.
Thus there is a symbiotic mutually enhancing relation-
There are three key trends that characterise business ship both at the conceptual level as well as at the tech-
processes: digitisation (automation), integration (intra nology level between business processes and data ware-
and inter organisational), and lifecycle management housing and data mining.
(Kalakota & Robinson, 2003). Digitisation involves the
attempts by many organisations to completely automate
as many of their processes as possible. Another equally REFERENCES
important initiative is the seamless integration and co-
ordination of processes within and without the APQC (1996). Process classification framework (pp. 1-6).
organisation: backward to the supplier, forward to the APQCs International Benchmark Clearinghouse & Arthur
customers, and vertically of operational, tactical, and Andersen & Co.
strategic business processes. The management of both
these initiatives/trends depends to a large extent on the Bider, I., Johannesson, P., & Perjons, E. (2002). Goal-
proper management of processes throughout their oriented patterns for business processes. Workshop on
lifecycle: from process identification, process model- Goal-Oriented Business Process Modelling, London.
ling, process analysis, process improvement, process CRISP-DM (2004). Retrieved from http://www.crisp-dm.org
implementation, process execution, to process monitor-
ing/controlling (Rosemann, 2001). Implementing such a Curran, T., Keller, G., & Ladd, A. (1998). SAP R/3 business
lifecycle orientation enables organizations to move in blueprint: Understanding business process reference
benign cycles of improvement and sense, respond, and model. Upper Saddle River, NJ: Prentice Hall.
adapt to the changing environment (internal and external). Davenport, T.H. (1993). Process innovation. Boston, MA:
All these trends require not only the use of Enterprise Harvard Business School Press.
Systems as a foundation but also data warehousing and
data mining solutions. These trends will continue to be Davis, R., (2001) Business process modelling with
major drivers of the enterprise of the future. ARIS: A practical guide. UK: Springer-Verlag.
Genovese, Y., Bond, B., Zrimsek, B., & Frey, N. (2001).
The transition to ERP II: Meeting the challenges.
CONCLUSION Gartner Group.
In this modern landscape, business processes and tech- Hammer, M., & Champy, J. (1993). Re-engineering the
niques and tools for data warehousing and data mining Corporation: A manifesto for business revolution. New
are intricately linked together. The impacts are not just York: Harper Business.
one way. Concepts from business processes could be
and are used to make the data mining and data warehous- Jacobson. (1995). The object advantage. Addison-
ing processes an integral part of organizational pro- Wesley.
cesses. Data warehousing and data mining processes are Kalakota, R., & Robinson, M. (2003). Services blue-
a regular part of organizational business processes, print: Roadmap for execution. Boston: Addison-
enabling the conversion of operational information into Wesley.
tactical and strategic level information. An example of
this is the Cross Industry Standard Process (CRISP) for

122

TEAM LinG
Business Processes

Lindsay, D., Downs, K., & Lunn. (2003). Business pro- Business Process: A business process is a collection
cessesattempts to find a definition. Information and of interrelated work tasks, initiated in response to an B
Software Technology, 45, 1015-1019. event that achieves a specific result for the customer of
the process.
Ould, A. M. (1995). Business processes: Modelling and
analysis for reengineering. Wiley. Digitisation: Measures that automate processes.
Rosemann, M. (2001, March). Business process lifecycle ERP: Enterprise resource planning system, a software
management. Queensland University of Technology. system for enterprise management. It is also referred to as
Enterprise Systems (ES).
Scheer, (2000). ARIS methods. IDS.
Functional Areas: Companies that make products
Scheer, A.-W., & Habermann, F. (2000). Making ERP a to sell have several functional areas of operations.
success. Communications of the Association of Com- Each functional area comprises a variety of business
puting Machinery, 43(4) 57-61. functions or business activities.
Sharp, A., & McDermott, P. (2001). Just what are pro- Integration of Processes: The coordination and
cesses anyway? Workflow modeling: tools for process integration of processes seamlessly within and without
improvement and application development (pp. 53-69). the organization.
Mal-Processes: A sequence of actions that a sys-
tem can perform, interacting with a legal user of the
KEY TERMS system, resulting in harm for the organization or stake-
holder.
ARIS: Architecture of Integrated Information Sys-
tems, a modeling and design tool for business pro- Process Lifecycle Management: Activities under-
cesses. taken for the proper management of processes such as
identification, analysis, improvement, implementation,
as-is Business Process: Current business pro- execution, and monitoring.
cess.
to-be Business Process: Re-engineered busi-
ness process.

123

TEAM LinG
124

Case-Based Recommender Systems


Fabiana Lorenzi
Universidade Luterana do Brasil, Brazil

Francesco Ricci
eCommerce and Tourism Research Laboratory, ITC-irst, Italy

INTRODUCTION about what products fit the customers preferences (Burke,


2000). The most important advantage is that knowledge
Recommender systems are being used in e-commerce can be expressed as a detailed user model, a model of the
web sites to help the customers in selecting products selection process or a description of the items that will be
more suitable to their needs. The growth of Internet and suggested. Knowledge-based recommenders can exploit
the business to consumer e-Commerce has brought the the knowledge contained in case or encoded in a similarity
need for such a new technology (Schafer, Konstan, & metric. Case-Based Reasoning (CBR) is one of the meth-
Riedl., 2001). odologies used in the knowledge-based approach. CBR is
a problem solving methodology that faces a new problem
by first retrieving a past, already solved similar case, and
BACKGROUND then reusing that case for solving the current problem
(Aaamodt & Plaza, 1994). In a CBR recommender system
In the past years, a number of research projects have (CBR-RS) a set of suggested products is retrieved from
focused on recommender systems. These systems imple- the case base by searching for cases similar to a case
ment various learning strategies to collect and induce described by the user (Burke, 2000). In the simplest
user preferences over time and automatically suggest application of CBR to recommendation problem solving,
products that fit the learned user model. the user is supposed to look for some product to pur-
The most popular recommendation methodology is chase. He/she inputs some requirements about the prod-
collaborative filtering (Resnick, Iacovou, Suchak, uct and the system searches in the case base for similar
Bergstrom, & Riedl, 1994) that aggregates data about products (by means of a similarity metric) that match the
customers preferences (ratings) to recommend new user requirements. A set of cases is retrieved from the case
products to the customers. Content-based filtering base and these cases can be recommender to the user. If
(Burke, 2000) is another approach that builds a model of the user is not satisfied with the recommendation he/she
user interests, one for each user, by analyzing the spe- can modify the requirements, i.e. build another query, and
cific customer behavior. In collaborative filtering the a new cycle of the recommendation process is started.
recommendation depends on the previous customers In a CBR-RS the effectiveness of the recommendation
information, and a large number of previous user/sys- is based on: the ability to match user preferences with
tem interactions are required to build reliable recom- product description; the tools used to explain the match
mendations. In content-based systems only the data of and to enforce the validity of the suggestion; the function
the current user are exploited and it requires either provided for navigating the information space. CBR can
explicit information about user interest, or a record of support the recommendation process in a number of
implicit feedback to build a model of user interests. ways. In the simplest approach the CBR retrieval is called
Content-based systems are usually implemented as clas- taking in input a partial case defined by a set of user
sifier systems based on machine learning research preferences (attribute-value pairs) and a set of products
(Witten & Frank, 2000). In general, both approaches do matching these preferences are returned to the user.
not exploit specific knowledge of the domain. For in-
stance, if the domain is computer recommendation, the
two above approaches, in building the recommendation MAIN THRUST
for a specific customer, will not exploit knowledge
about how a computer works and what is the function of CBR systems implement a problem solving cycle very
a computer component. similar to the recommendation process. It starts with a
Conversely, in a third approach called knowledge- new problem, retrieves similar cases from the case base
based, specific domain knowledge is used to reason and shows to the user an old solution or adapts it to better

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Case-Based Recommender Systems

Figure 1. CBR-RS framework (Lorenzi & Ricci, 2004) This means that a case c = (x, u, s, e) CB, generally
consists of four (optional) sub-elements x, u, s, e, which C
are elements of the spaces X, U, S, E respectively. Each
CBR-RS adopts a particular model for the spaces X, U,
S, E. These spaces could be empty, vector, set of docu-
ment (textual), labeled graphs, etc.

Content model (X): the content model describes


the attributes of the product.
User profile (U): the user profile models personal
user information, such as, name, address, and age
or also past information about the user, such as her
preferred products.
Session model (S): the session model is intro-
duced to collect information about the recom-
solve the new problem and finishes retaining the new case mendation session (problem solving loop). In
in the case base. Considering the classic CBR cycle (see DieToRecs, for instance, a case includes a tree-
Aamodt & Plaza, 1994) we specialized this general frame- based model of the user interaction with the sys-
work to the specific tasks of product recommendation. In tem and it is built incrementally during the recom-
Figure 1 the boxes, corresponding to the classical CBR mendation session.
steps (retrieve, reuse, revise, review, and retain), contain Evaluation model (E): the evaluation model de-
references to systems or functionalities (acronyms) that scribe the outcome of the recommendation, i.e., if
will be described in the next sections. the suggestion was appropriate or not. This could
We now provide a general description of the frame- be a user a-posteriori evaluation, or, as in
work by making some references to systems that will be (Montaner, Lopez, & la Rosa, 2002), the outcome
better described in the rest of the paper. The first user- of an evaluation algorithm that guesses the good-
system interaction in the recommendation cycle occurs ness of the recommendation (exploiting the case
in the input stage. According to Bergmann, Richter, base of previous recommendations).
Schmitt, Stahl, and Vollrath (2001), there are different
strategies to interact with the user, depending on the level Actually, in CBR-RSs there is a large variability in
of customer assistance offered during the input. The most what a case really models and therefore what compo-
popular strategy is the dialog-based, where the system nents are really implemented. There are systems that
offers guidance to the user by asking questions and use only the content model, i.e., they consider a case as
presenting products alternatives, to help the user to a product, and other systems that focus on the perspec-
decide. Several CBR recommender systems ask the user tive of cases as recommendation sessions.
for input requirements to have an idea of what the user The first step of the recommendation cycle is the
is looking for. In the First Case system (McSherry, retrieval phase. This is typically the main phase of the
2003a), for instance, the user provides the features of a CBR cycle and the majority of CBR-RSs can be de-
personal computer that he/she is looking for, such as, scribed as sophisticated retrieval engines. For example,
type, price, processor or speed. Expertclerk (Shimazu, in the Compromise-Driven Retrieval (McSherry, 2003b)
2002) asks the user to answer some questions instead of the system retrieves similar cases from the case base
provide requirements. And with the set of the answered but also groups the cases, putting together those offer-
questions the system creates the query. ing to the user the same compromise, and presents to the
In CBR-RSs, the knowledge is stored in the case user just a representative case for each group.
base. A case is a piece of knowledge related to a particular After the retrieval, the reuse stage decides if the case
context and representing an experience that teaches an solution can be reused in the current problem. In the
essential lesson to reach the goal of the problem-solving simplest CBR-RSs, the system reuses the retrieved
activity. Case modeling deals with the problem of deter- cases showing them to the user. In more advanced solu-
mining which information should be represented and tions, such as (Montaner, Lopez, & la Rosa, 2002) or
which formalism of representation would be suitable. In (Ricci et al., 2003), the retrieved cases are not recom-
CBR-RSs a case should represent a real experience of mended but used to rank candidate products identified
solving a user recommendation problem. In a CBR-RS, our with other approaches (e.g. Ricci et al., 2003) with an
analysis has identified a general internal structure of the interactive query management component.
case base: CB = [X U S E].

125

TEAM LinG
Case-Based Recommender Systems

In the next stage the reused case component is adapted DieToRecs DTR
to better fit in the new problem. Mostly, the adaptation in
CBR-RSs is implemented by allowing the user to customize DieToRecs helps the user to plan a leisure travel
the retrieved set of products. This can also be implemented (Fesenmaier, Ricci, Schaumlechner, Wober, & Zanella
as a query refinement task. For example, in Comparison- 2003). We present here two different approaches (de-
based Retrieval (McGinty & Smyth, 2003) the system asks cision styles) implemented in DieToRecs: the single
user feedbacks (positive or negative) about the retrieved item recommendation (SIR), and the travel completion
product and with this information it updates the user (TC). A case represents a user interaction with the sys-
query. tem, and it is built incrementally during the recommenda-
The last step of the CBR recommendation cycle is the tion session. A case comprises all the quoted models:
retain phase (or learning), where the new case is retained content, user profile, session and evaluation model.
in the case base. In DieToRecs (Fesenmaier, Ricci, SIR starts with the user providing some prefer-
Schaumlechner, Wober, & Zanella, 2003), for instance, ences. The system searches in the catalog for products
all the user/system recommendation sessions are stored that (logically) match with these preferences and re-
as cases in the case base. turns a result set. This is not to be confused with the
The next subsections describe very shortly some repre- retrieval set that contains a set of similar past recom-
sentative CBR-RSs, focusing on their peculiar characteris- mendation session. The products in the result set are
tic (see Lorenzi & Ricci, 2004) for the complete report. then ranked with a double similarity function (Ricci et
al., 2003) in the revise stage, after a set of relevant
Interest Confidence Value ICV recommendation sessions are retrieved.
In the TC function the cycle starts with the users
Montaner, Lopez, and la Rosa (2002) assume that the preferences too but the system retrieves from the case
users interest in a new product is similar to the users base cases matching users preferences. Before rec-
interest in similar past products. This means that when a ommending the retrieved cases to the user, the system
new product comes up, the recommender system pre- in the revise stage updates, or replaces, the travel
dicts the users interest in it based on interest attributes products contained in the case exploiting up-to-date
of similar experiences. A case is modeled by objective information taken from the catalogues. In the review
attributes describing the product (content model) and phase the system allows the user to reconfigure the
subjective attributes describing implicit or explicit in- recommended travel plan. The system allows the user
terests of the user in this product (evaluation model), to replace, add or remove items in the recommended
i.e., c X E. In the evaluation model, the authors intro- travel. When the user accepts the outcome (the final
duced the drift attribute, which models a decaying im- version of the recommendation shown to the user), the
portance of the case as time goes and the case is not used. system retains this new case in the case base.
The system can recommend in two different ways:
prompted or proactive. In prompted mode, the user pro- Compromise-Driven Retrieval CDR
vides some preferences (weights in the similarity metric)
and the system retrieves similar cases. In the proactive CDR models a case only by the content component
recommendation, the system does not have the user prefer- (McSherry, 2003a). In CDR, if a given case c1 is more
ences, so it estimates the weights using past interactions. similar to the target query than another case c2, and
In the reuse phase the system extracts the interest differs from the target query in a subset of the at-
values of retrieved cases and in the revise phase it calcu- tributes in which c2 differs from the target query, then
lates the interest confidence value of a restaurant to c 1 is more acceptable than c2.
decide if this should be recommender to the user or not. In the CDR retrieval algorithm the system sorts all
The adaptation is done asking to the user the correct the cases in the case-base according to the similarity to
evaluation of the product and after that a new case (the a given query. In a second stage, it groups together the
product and the evaluation) is retained in the case base. cases making the same compromise (do not match a
It is worth noting that in this approach the recommended user preferred attribute value) and builds a reference
product is not retrieved from the case base, but the set with just one case for each compromise group. The
retrieved cases are used to estimate the user interest in cases in the reference set are recommended to the user.
this new product. This approach is similar to that used in The user can also refine (review) the original query,
DieToRecs in the single item recommendation function. accepting one compromise, and adding some preference
on a different attribute (not that already specified). The
system will further decompose the set of cases corre-

126

TEAM LinG
Case-Based Recommender Systems

sponding to the selected compromise. The revise and Table 1. Comparison of the CBR-RSs
retain phases do not appear in this approach.
Approach
ICV
Retrieval
Similarity
Reuse
IC value
Revise
IC computation
Review
Feedback
Retain
Default
C
SIR Similarity Selective Rank User edit Default
ExpertClerk EC TC
OBR
Similarity
Similarity +
Default
Default
Logical query
None
User edit
Tweak
Default
None
Ordering
CDR Similarity + Default None Tweak None
Expertclerk is a tool for developing a virtual salesclerk Grouping
EC Similarity Default None Feedback None
system as a front-end of e-commerce websites (Shimazu,
2002). The system implements a question selection method
(decision tree with information gain). Using navigation- area have already delivered, how the existing CBR-RSs
by-asking, the system starts the recommendation session behave and which are the topics that could be better
asking questions to user. The questions are nodes in a exploit in future systems.
decision tree. A question node subdivides the set of
answer nodes and each one of these represents a different
answer to the question posed by the question node. The CONCLUSION
system concatenates all the answer nodes chosen by the
user and then constitutes the SQL retrieval condition In the previous sections we have briefly analyzed eight
expression. different CBR recommendations. Table 1 shows the
This query is applied to the case base to retrieve the main features of these approaches.
set of cases that best match the user query. Then, the Some observations are in order. The majority of the
system shows three samples products to the user and CBR-RSs stress the importance of the retrieval phase.
explains their characteristics (positive and negative). Some systems perform retrieval in two steps. First,
In the review phase, the system switches to the cases are retrieved by similarity, and then the cases are
navigation-by-proposing conversation mode and allows grouped or filtered. The use of pure similarity does not
the user to refine the query. After refinement, the sys- seem to be enough to retrieve a set of cases that satisfy
tem applies the new query to the case base and retrieves the user. This seems to be true especially in those
new cases. These cases are ranked and shown to the user. application domains that require a complex case struc-
The cycle continues until the user finds a preferred ture (e.g. travel plans) and therefore require the devel-
product. In this approach the revise and the retain phases opment of hybrid solutions for case retrieval.
are not implemented. The default reuse phase is used in the majority of the
CBR-RSs, i.e., all the retrieved cases are recommended
to the user. ICV and SIR have implemented the reuse
FUTURE TRENDS case in different way. In SIR, for instance, the system
can retrieve just part of the case. The same systems that
This paper presented a review of the literature on CBR implemented non-trivial reuse approaches, have also
recommender systems. We have found that it is often implemented both the revise phase, where the cases are
unclear how and why the proposed recommendation adapted, and the retain phase, where the new case (adapted
methodology can be defined as case-based and there- case) is stored.
fore we have introduced a general framework that can All the CBR-RSs analyzed implement the review phase,
illustrate similarities and differences of various ap- allowing the user to refine the query. Normally the sys-
proaches. Moreover, we have found that the classical tem expects some feedback from the user (positive or
CBR problem-solving loop is implemented only par- negative), new requirements or a product selection.
tially and sometime is not clear whether a CBR stage
(retrieve, reuse, revise, review, retain) is implemented
or not. For this reason, the proposed unifying frame- REFERENCES
work makes possible a coherent description of different
CBR-RSs. In addition an extensive usage of this frame- Aamodt, A., & Plaza, E. (1994). Case-based reasoning:
work can help describing in which sense a recommender Foundational issues, methodological variations, and
system exploits the classical CBR cycle, and can point system approaches. AI Communications, 7(1), 39-59.
out new interesting issues to be investigated in this area.
For instance, the possible ways to adapt retrieved cases Bergmann, R., Richter, M., Schmitt, S., Stahl, A., &
to improve the recommendation and how to learn these Vollrath, I. (2001). Utility-oriented matching: New re-
adapted cases. search direction for case-based reasoning. 9th German
We believe, that with such a common view it will be Workshop on Case-Based Reasoning, GWCBR01 (pp.
easier to understand what the research projects in the 14-16), Baden-Baden, Germany.

127

TEAM LinG
Case-Based Recommender Systems

Burke, R. (2000). Knowledge-based recommender sys- active query management and twofold similarity. 5th Inter-
tems. Encyclopedia of Library and Information Science, national Conference on Case-Based Reasoning, ICCBR
Vol. 69. 2003 (pp. 479-493), Trondheim, Norway.
Fesenmaier, D, Ricci, F, Schaumlechner, E, Wober, K., Schafer, J.B, Konstan, J. A., & Riedl, J. (2001). E-
& Zanella, C. (2003). DIETORECS: Travel advisory for commerce recommendation applications. Data Mining
multiple decision styles. Information and Communi- and Knowledge Discovery, 5(1/2), 115-153.
cation Technologies in Tourism, 232-241.
Shimazu, H. (2002). Expertclerk: A conversational case-
Lorenzi, F., & Ricci, F. (2004). A unifying framework based reasoning tool for developing salesclerk agents in
for case-base reasoning recommender systems. Tech- e-commerce webshops. Artificial Intelligence Review,
nical Report, IRST. 18, 223-244.
McGinty, L., & Smyth, B. (2002). Comparison-based Witten, I. H., & Frank, E. (2000). Data mining. Morgan
recommendation. Advances in Case-Based Reason- Kaufmann Publisher.
ing, 6th European Conference on Case Based Reason-
ing, ECCBR 2002 (pp. 575-589), Aberdeen, Scotland.
McGinty, L., & Smyth, B. (2003). The power of sugges- KEY TERMS
tion. 18th International Joint Conference on Artificial
Intelligence, IJCAI-03 (pp. 276-290), Acapulco, Mexico. Case-Based Reasoning: It is an Artificial Intelli-
McSherry, D. (2003a). Increasing dialogue efficiency in gence approach that solves new problems using the
case-based reasoning without loss of solution quality. solutions of past cases.
18 th International Joint Conference on Artificial Intelli- Collaborative Filtering: Approach that collects
gence, IJCAI-03 (pp. 121-126), Acapulco, Mexico. user ratings on currently proposed products to infer the
McSherry, D. (2003b). Similarity and compromise. 5th In- similarity between users.
ternational Conference on Case-Based Reasoning, Content-Based Filtering: Approach where the user
ICCBR 2003 (pp. 291-305), Trondheim, Norway. expresses needs and preferences on a set of attributes
Montaner, M., Lopez, B., & la Rosa, J.D. (2002). Improv- and the system retrieves the items that match the de-
ing case representation and case base maintenance in scription.
recommender systems. Advances in Case-Based Rea- Conversational Systems: Systems that can com-
soning, 6th European Conference on Case Based Reason- municate with users through a conversational paradigm
ing, ECCBR 2002 (pp. 234-248),Aberdeen, Scotland.
Machine Learning: The study of computer algo-
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & rithms that improve automatically through experience.
Riedl, J. (1994). Grouplens: An open architecture for
collaborative filtering of Netnews. ACM Conference Recommender Systems: Systems that help the user to
on Computer-Supported Cooperative Work (pp. 175- choose products, taking into account his/her preferences.
186). Web Site Personalization: Web sites that are person-
Ricci, F., Venturini, A., Cavada, D., Mirzadeh, N., Blaas, D., alized to each user, knowing the user interests and needs.
& Nones, M. (2003). Product recommendation with inter-

128

TEAM LinG
129

Categorization Process and Data Mining C


Maria Suzana Marc Amoretti
Federal University of Rio Grande do Sul (UFRGS), Brazil

INTRODUCTION rization process. The computational techniques from sta-


tistics and pattern recognition are used to do this data-
For some time, the fields of computer science and cogni- mining practice.
tion have diverged. Researchers in these two areas know
ever less about each others work, and their important
discoveries have had diminishing influence on each other. BACKGROUND
In many universities, researchers in these two areas are in
different laboratories and programs, and sometimes in Semiotics is a theory of the signification (representations,
different buildings. One might conclude from this lack of symbols, categories) and meaning extraction. It is a strongly
contact that computer science and semiotics functions, multi-disciplinary field of study, and mathematical tools
such as perception, language, memory, representation, of semiotics include those used in pattern recognition.
and categorization, reflect our independent systems. But Semiotics is also an inclusive methodology that incorpo-
for the last several decades, the divergence between rates all aspects of dealing with symbolic systems of
cognition and computer science tends to disappear. These signs. Signification join all the concepts in an elementary
areas need to be studied together, and the cognitive structure of signification. This structure is a related net
science approach can afford this interdisciplinary view. that allows the construction of a stock of formal defini-
This article refers to the possibility of circulation be- tions such as semantic category. Hjelmeslev considers
tween the self-organization of the concepts and the rel- the category as a paradigm, where elements can be intro-
evance of each conceptual property of the data-mining duced only in some positions.
process and especially discusses categorization in terms of Text categorization process, also known as text clas-
a prototypical theory, based on the notion of prototype and sification or topic spotting, is the task of automatically
basic level categories. Categorization is a basic means of classifying a set of documents into categories from a
organizing the world around us and offers a simple way to predefined set. This task has several applications, includ-
process the mass of stimuli that one perceives every day. ing automated indexing of scientific articles according to
The ability to categorize appears early in infancy and has an predefined thesauri of technical terms, filing patents into
important role for the acquisition of concepts in a prototypi- patent directories, and selective dissemination of infor-
cal approach. Prototype structures have cognitive represen- mation to information users. There are many new catego-
tations as representations of real-world categories. rization methods to realize the categorization task, includ-
The senses of the English words cat or table are ing, among others, (1) the language model based classi-
involved in a conceptual inclusion in which the extension fication; the maximum entropy classification, which is a
of the superordinated (animal/furniture) concept includes probability distribution estimation technique used for a
the extension of the subordinated (Persian cat/dining variety of natural language tasks, such as language mod-
room table) concept, while the intention of the more eling, part-of-speech tagging, and text segmentation (the
general concept is included by the intention of the more theory underlying maximum entropy is that without exter-
specific concept. This study is included in the categori- nal knowledge, one should prefer distributions that are
zation process. Categorization is a fundamental process uniform); (2) the Nave Bayes classification; (3) the Near-
of mental representation used daily for any person or for est Neighbor (the approach clusters words into groups
any science, and it is also a central problem in semiotics, based on the distribution of class labels associated with
linguistics, and data mining. each word); (4) distributional clustering of words to
Data mining also has been defined as a cognitive document classification; (5) the Latent Semantic Index-
strategy for searching automatically new information ing (LSI), in which we are able to compress the feature
from large datasets or selecting a document, which is space more aggressively, while still maintaining high
possible with computer science and semiotics tools. Data document classification accuracy (this information re-
mining is an analytic process to explore data in order to trieval method improves the users ability to find relevant
find interesting pattern motifs and/or variables in the information, the text categorization method based on a
great quantity of data; it depends mostly on the catego- combination of distributional features with a Support

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Categorization Process and Data Mining

Vector Machine (SVM) classifier, and the feature se- technologies in the categorization process through the
lection approach uses distributional clustering of words making of conceptual maps, especially the possibility
via the recently introduced information bottleneck of creating a collaborative map made by different users,
method, which generates a more efficient representa- points out the cultural aspects of the concept represen-
tion of the documents); (6) the taxonomy method, based tation in terms of existing coincidences as to the choice
on hierarchical text categorization that documents are of the prototypical element by the same cultural group.
assigned to leaf-level categories of a category tree (the Thus, the technologies of information, focused on the
taxonomy is a recently emerged subfield of the seman- study of individual maps, demand revisited discussions
tic networks and conceptual maps). After the previous on the popular perceptions concerning concepts used
work in hierarchical classification focused on docu- daily (folk psychology). It aims to identify ideological
ments category, the tree classify method internal cat- similarity and cognitive deviation, both based on the
egories with a top down level-based classification that prototypes and on the levels of categorization developed
can classify concepts in the document. in the maps, with an emphasis on the cultural and semiotic
aspects of the investigated groups.
The networks of semantic values thus created and It attempted to show how the semiotic and linguistic
stabilized constitute the cultural-metaphorical worlds analysis of the categorization process can help in the
which are discursively real for the speakers of particular identification of the ideological similarity and cognitive
languages. The elements of these networks, though deviations, favoring the involvement of subjects in the
ultimately rooted in the physical-biological realm can map production, exploring and valuing the relation be-
and do operate independently of the latter, and form the tween the categorization process and the cultural experi-
stuff of our everyday discourses (Manjali, 1997, p. 1). ence of the subject in the world, both parts of the cogni-
tive process of conceptual map construction.
The prototype has given way to a true revolution (the
Roschian revolution) regarding classic lexical semantics. The concept maps, or the semantic nets, are space graphic
If we observe the conceptual map for chair, for instance, representations of the concepts and their relationships.
we will realize that the choice of most representative chair The concept maps represent, simultaneously, the
types; that is, our prototype of chair, supposes a double organization process of the knowledge, by the
adequacy: referential because the sign (concept of chair) relationships (links) and the final product, through the
must integrate the features retained from the real or concepts (nodes). This way, besides the relationship
imaginary world, and structural, because the sign must be between linguistic and visual factors is the interaction
pertinent (ideological criterion) and distinctive concern- among their objects and their codes (Amoretti, 2001, p.
ing the other neighbor concepts of chair. When I say that 49).
this object is a chair, it is supposed that I have an idea of
the chair sign, forming the use of a lexical or visual image The building of a map involves collaboration when the
competence coming from my referential experience, and subjects/students/users share information, still without
that my prototypical concept of chair is more adequate modifying the data, and involves cooperation, when users
than its neighbors bench or couch, because I perceive not only share their knowledge but also may interfere and
that there is a back part and there are no arms. Then, it is modify the information received from the other users, acting
useless to try to explain the creation of a prototype inside in a asynchronized way to build a collective map. Both
a language, because it is formed from context interactions. cooperation and collaboration attest the autonomy of the
The double origin of a prototype is bound, then, to ongoing cognitive process, the direction given by the users
shared knowledge relation between the subjects and themselves when trying to adequate their knowledge.
their communities (Amoretti, 2003). When people do a conceptual map, they usually privi-
lege the level where the prototype is. The basic concept
map starts with a general concept at the top of the map and
MAIN THRUST then works its way down through a hierarchical structure
to more specific concepts. The empirical concept (Kant)
Hypertext poses new challenges for a data-mining pro- of cat and chair has been studied by users with map
cess, especially for text categorization research, because software. They make an initial map at the beginning of the
metadata extracted from Web sites provide rich informa- semester and another about the same subject at the end
tion for classifying hypertext documents, and it is a new of the semester. I first discussed how cats and chairs
kind of problem to solve, to know how to appropriately appear, what could be called the structure of cat and chair
represent that information and automatically learn statis- appearance. Second, I discussed how cat and chair are
tical patterns for hypertext categorization. The use of perceived and which attributes make a cat a cat and a chair

130

TEAM LinG
Categorization Process and Data Mining

a chair. Finally, I will consider cat and chair as an experi- same category. A chair is a more central member than a
ential category, so the point of departure is our experience television, which, in turn, is a rather marginal member. C
in the world about cat and chair. The acquisition of the Rosch (2000) claims that prototypes can only constrain,
concept cat and chair is mediated by concrete experiences. but do not determine, models of representations.
Thus, the learner must possess relevant prior knowledge The main thrust of my argument is that it is very
and a mental scheme to acquire a prototypical concept. important to data mining to know, besides the cognitive
The expertise changes the conceptual level organiza- categorization process, what is the prototypical concept
tion competences. In the first maps, the novice chair map in a dataset. This basic level of concept organization
privileged the basic level, the most important exemplar of reflects the social representation in a better way than the
a class, the chair prototype. This level has a high coher- other levels (i.e., superordinate and subordinate levels).
ence and distinctiveness. After thinking about this con- The prototype knowledge affords a variety of culture
cept, students (now chair experts) repeated the experiment representations. The conceptual mapping system con-
and carried out again the expert chair map with much more tains prototype data on the hierarchical way of concept
details in the superordinate level, showing eight different levels. Using different software (three measuring lev-
kinds of chairs: dining room chair, kitchen chair, garden elssuperordinate, basic, and subordinate), I suggest
chair, and so forth. This level has high coherence and low the construction of different maps for each concept to
distinctiveness (Rosh, 2000). So, users learn by doing the analyze the categorization cognitive process with maps
categorization process. and to show how the categorization performance of
Language system arbitrarily cuts up the concepts into individual and collective or organizational team overtime
discrete categories (Hjelmeslev, 1968), and all categories is important in a data-mining work.
have equal status. Human language is both natural and Categorization is a part of Jakobsons (2000) commu-
cultural. According to the prototype theory, the role played nication model (also appropriated from information
by non-linguistic factors like perception and environment theory) with cultural aspects (context). This principle
is demonstrated throughout the concept as the prototype allows to locate on a definite gradient objects and rela-
from the subjects of each community. tions that are observed, based on similarity and contigu-
A concept is a sort of scheme. An effective way of ity associations (frame/script semantic) (Schank, 1999)
representing a concept is to retain only its most important and based on hierarchically relations (prototypic seman-
properties. This group of most important properties of a tic) (Kleiber, 1990; Rosch, 2000), in terms of perceived
concept is called prototype. The idea of prototype makes family resemblance among category members.
it possible for the subject to have a mental construction,
identifying the typical features of several categories, and,
when the subject finds a new object, he or she may compare FUTURE TRENDS
it to the prototype in his or her memory. Thus, the proto-
type of chair, for instance, allows new objects to be Data-mining language technology systems typically have
identified and labeled as chairs. In individual conceptual focused on the factual aspect of content analysis. How-
maps creation, one may confirm the presence of variables ever, there are other categorization aspects, including
for the same concept. pragmatics, point of view, and style, which must receive
The notion of prototype originated in the 1970s, greatly more attention like types and models of subjective clas-
due to Eleanor Roschs (2000) psychological research on sification information and categorization characteristics
the organization of conceptual categories. Its revolution- such as centrality, polarity, intensity, and different lev-
ary character marked a new era for the discussions on els of granularity (i.e., expression, clause, sentence,
categorization and brought existing theories, such as the discourse segment, document, hypertext).
classical view of the prototype question. On the basis of It is also important to define properties heritage
Roschs results, it is argued that members of the so-called among different category levels, viewed throughout hi-
Aristotelian (or classical) categories share all the same erarchical relations as one that allowed to virtually add
properties and showed that categories are structured in an certain pairs of value (attributes from a unit to another).
entirely different way; members that constitute them are We should also think of concepts managing that, in a
assigned in terms of gradual participation, and the cat- given category, are considered as an exception. It would
egorical attribution is made by human beings according to be necessary to allow the heritage blockage of certain
the more or less centrality/marginality of collocation within attributes. I will be opening new perspectives to the data-
the categorical structure. Elements recognized as central mining research of categorization and prototype study,
members of the category represent the prototype. For which shows the ideological similarity perception medi-
instance, a chair is a very good example of the category ated by collaborative conceptual maps.
furniture, while a television is a less typical example of the

131

TEAM LinG
Categorization Process and Data Mining

Much is still unknowable about the future of data Andler, D. (1987). Introduction aux sciences cognitives.
mining in higher education and in the business intelli- Paris: Gallimard.
gence process. The categorization process is a factor that
will affect this future and can be identified with the crucial Cordier, F. (1989). Les notions de typicalit et niveau
role played by the prototypes. Linguistics have not yet dabstracion: Analyse des proprits des representa-
paid this principle due attention. However, some conse- tions [thse de doctorat ddtat]. Paris: Sud University.
quences should already necessarily follow from its proto- das, J. (1966). Smantique structurale. Recherche de
type recognition. The extremely powerful explanation of mthode. Paris: PUF.
prototype categorization constitutes the most salient
feature in data mining. So, a very important application in Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1992).
the data-mining methodology is the results of the proto- Knowledge discovery in databases: An overwiev. AI
type categorization research like a form of retrieval of Magazine, 13(2), 57-70.
unexpected information. Greimas, A.J. (1966). Smantique structurale. Recherche
de mthode. Paris: PUF.

CONCLUSION Hand, D., Mannila, H., & Smyth, P. (2001). Principles of


data mining. Boston: MIT Press.
Categorization explains aspects of peoples cultural logic. Hjelmeslev, L. (1968). Prolgomnes une thorie du
In this chapter, statistical pattern recognition approaches langage. Paris: ditions Minuit.
are used to classify concepts which are present in a given
dataset and expressed by conceptual maps. Prototypes Jakobson, R. (2000). Linguistics and poetics. Londres:
whose basis is the concepts representation from the Lodge and Wood.
heritage par dfaut that allows a great economy in the
acquisition and managing of the information. The objec- Kleiber, G. (1990). La smantique du prototype: Catgories
tive of this study was to investigate the potential of et sens lexical. Paris: PUF.
founding the categories in the text with conceptual maps, Manjali, F.D. (1997). Dynamical models in semiotics/
to use these maps as data-mining tools. Based on user semantics. Retrieved from http://www.chass.utoronto.ca/
cognitive characteristics of knowledge organization and epc/srb/cyber/manout.html
based on the prevalence of the basic level (the prototypi-
cal level), a case is of the necessity of the categorization Minksy, M.L. (1977). A framework for representing knowl-
cognitive process like as a step of data mining. edge. In P.H. Winston (Ed.), The psychology of computer
vision (pp. 211-277). New York: McGraw-Hill.
Rosch, E. et al. (2000). The embodied mind. London: MIT.
REFERENCES
Schank, R. (1999). Dynamic memory revisited. Cambridge,
Amoretti, M.S.M. (2001). Prottipos e esteretipos: MA: Cambridge University Press.
aprendizagem de conceitos: Mapas conceituais: Uma Sebastiani, F. (2002). Machine learning in automated text
experincia em Educao Distncia. Proceedings of the categorization. ACM Computing Surveys (CSUR).
Revista Informtica na Educao. Teoria e Prtica,
Porto Alegre, Brazil. Yiming, Y. (1999). An evaluation of statistical approaches
to text categorization. Information Retrieval.
Amoretti, M.S.M. (2003). Conceptual maps: A metacognitive
strategy to learn concepts. Proceedings of the 25th Annual
Meeting of the Cognitive Science Society, Boston, Mas-
sachusetts. KEY TERMS
Amoretti, M.S.M. (2004a). Categorization process and
conceptual maps. Proceedings of the First International Categorization: A cognitive process based on simi-
Conference on Concept Mapping, Pamplona, Spain. larity of mental schemes and concepts that subjects
establish conditions that are both necessary and suffi-
Amoretti, M.S.M. (2004b). Collaborative learning con- cient (properties) to capture meaning and/or the hierarchy
cepts in distance learning: Conceptual map: Analysis of inclusion (as part of a set) by family ressemblences shared
prototypes and categorization levels. Proceedings of the by their members. Every category has a prototypical
CCM Digital Government Symposium, Tuscaloosa, Ala- internal structure, depending on the context.
bama.

132

TEAM LinG
Categorization Process and Data Mining

Concept: A sort of scheme produced by repeated Prototype: An effective way of representing a concept
experiences. Concepts are essentially each little idea that is to retain only its most important properties or the most C
we have in our heads about anything. This includes not typical element of a category, which serves as a cognitive
only everything, but every attribute of everything. reference point with respect to a cultural community. This
group of most important properties or most typical ele-
Conceptual Maps: Semiotic representation (linguistic ments of a concept is called prototype. The idea of
and visual) of the concepts (nodes) and their relation- prototype makes possible that the subject has a mental
ships (links); represent the organization process of the construction, identifying the typical features of several
knowledge When people do a conceptual map, they categories. Prototype is defined as the object that is a
usually privilege the level where the prototype is. They categorys best model.
prefer to categorize at an intermediate level; this basic
level is the first level learned, the most common level
named, and the most general level where visual shape and
attributes are maintained.

133

TEAM LinG
134

Center-Based Clustering and Regression


Clustering
Bin Zhang
Hewlett-Packard Research Laboratories, USA

INTRODUCTION
With guarantee of convergence to only a local opti-
Center-based clustering algorithms are generalized to mum, the quality of the converged results, measured by
more complex model-based, especially regression- the performance function of the algorithm, could be far
model-based, clustering algorithms. This article briefly from its global optimum. Several researchers explored
reviews three center-based clustering algorithmsK- alternative initializations to achieve the convergence to
Means, EM, and K-Harmonic Meansand their gener- a better local optimum (Bradley & Fayyad, 1998; Meila
alizations to regression clustering algorithms. More & Heckerman, 1998; Pena et al., 1999).
details can be found in the referenced publications. K-Harmonic Means (KHM) (Zhang, 2001; Zhang et
al., 2000) is a recent addition to the family of center-
based clustering algorithms. KHM takes a very different
BACKGROUND approach from improving the initializations. It tries to
address directly the source of the problema single
Center-based clustering is a family of techniques with cluster is capable of trapping far more centers than its
applications in data mining, statistical data analysis fair share. This is the main reason for the existence of a
(Kaufman et al., 1990), data compression (vector quanti- very large number of local optima under K-Means and
zation) (Gersho & Gray, 1992), and many others. K- EM when K>10. With the introduction of a dynamic
means (KM) (MacQueen, 1967; Selim & Ismail, 1984), weighting function of data, KHM is much less sensitive
and the Expectation Maximization (EM) (Dempster et al., to initialization, demonstrated through a large number
1977; McLachlan & Krishnan, 1997; Rendner & Walker, of experiments in Zhang (2003). The dynamic weighting
1984) with linear mixing of Gaussian density functions function reduces the ability of a single data cluster,
are two of the most popular clustering algorithms. trapping many centers.
K-Means is the simplest among the three. It starts Replacing the point-centers by more complex data
model centers, especially regression models, in the
with initializing a set of centers M = {mk | k = 1,..., K } and second part of this article, a family of model-based
iteratively refines the location of these centers to find clustering algorithms is created. Regression clustering
the clusters in a dataset. Here are the steps: has been studied under a number of different names:
Clusterwise Linear Regression by Spath (1979, 1981,
K-Means Algorithm 1983, 1985), DeSarbo and Cron (1988), Hennig (1999,
2000) and others; Trajectory clustering using mixtures
Step 1: Initialize all centers (randomly or based of regression models by Gaffney and Smith (1999);
on any heuristic). Fitting Regression Model to Finite Mixtures by Will-
Step 2: Associate each data point with the nearest iams (2000); Clustering Using Regression by Gawrysiak,
et. al. (2000); Clustered Partial Linear Regression by
center. This step partitions the data set into K
Torgo, et. al. (2000). Regression clustering is a better
disjoint subsets (Voronoi Partition).
name for the family, because it is not limited to linear or
Step 3: Calculate the best center locations (i.e., the
piecewise regressions.
centroids of the partitions) to maximize a perfor- Spath (1979, 1981, 1982) used linear regression and
mance function (2), which is the total squared dis- partition of the dataset, similar to K-means, in his
tance from each data point to the nearest center. algorithm that locally minimizes the total mean square
Step 4: Repeat Steps 2 and 3 until there are no error over all K-regressions. He also developed an
more changes on the membership of the data points incremental version of his algorithm. He visualized his
(proven to converge). piecewise linear regression concept in his book (Spath,

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Center-Based Clustering and Regression Clustering

1985) exactly as he named his algorithm. DeSarbo (1988) where Sk X is the subset of x that are closer to mk than
used a maximum likelihood method for performing to all other centers (the Voronoi partition). C
clusterwise linear regression. Hennig (1999) studied clus-
tered linear regression, as he named it, using the same linear
K
mixing of Gaussian density functions. 1
(b) EM: d ( x, M ) = log pk * EXP( || xi ml ||2 ) , where
( )
D
k =1

MAIN THRUST { pk }1K is a set of mixing probabilities.

For K-Means, EM, and K-Harmonic means, both their A linear mixture of K identical spherical (Gaussian
performance functions and their iterative algorithms density) functions, which is still a probability density
are treated uniformly in this section for comparison. function, is used here.
This uniform treatment is carried over to the three
regression clustering algorithms, RC-KM, RC-EM and (c) K-Harmonic Means: d ( x, M ) = 1HA (|| x mk || p ) , the
RC-KHM, in the second part. k K

harmonic average of the K distances,


Performance Functions of the Center-
Based Clustering K
Perf KHM ( X , M ) =
1 , where p > 2.
Among many clustering algorithms, center-based clus-
xX
|| x m ||
mM
2
(3)
i l
tering algorithms stand out in two important aspects
a clearly defined objective function that the algorithm K-Means and K-Harmonic Means performance func-
minimizes, compared with agglomerative clustering al- tions also can be written similarly to the EM, except that
gorithms that do not have a predefined objective; and a only a positive function takes the place where this
low runtime cost, compared with many other types of probability function is (Zhang, 2001).
clustering algorithms. The time complexity per itera-
tion for all three algorithms is linear in the size of the
Center-Based Clustering Algorithms
dataset N, the number of clusters K, and the dimension-
ality of data D. The number of iterations it takes to
K-Means algorithms are shown in the Introduction. We
converge is very insensitive to N.
list EM and K-Harmonic Means algorithms here to
Let X = {xi | i = 1,..., N } be a dataset with K clusters, show their similarity.
iid sampled from a hidden distribution, and
M = {mk | k = 1,..., K } be a set of K centers. K-Means, EM, EM (with Linear Mixing of Spherical
and K-Harmonic Means find the clustersthe (local) Gausian Densities) Algorithm
optimal locations of the centersby minimizing a func-
tion of the following form over the K centers, Step 1: Initialize the centers and the mixing prob-
abilities { pk }1K .
Perf ( X , M ) = d ( x, M ) (1) Step 2: Calculate the expected membership prob-
xX
abilities (see item <B>).
where d(x,M) measures the distance from a data point to Step 3: Maximize the likelihood to the current
the set of centers. Each algorithm uses a different membership by finding the best centers.
distance function: Step 4: Repeat Steps 2 and 3 until a chosen con-
vergence criterion is satisfied.
(a) K-Means: d ( x, M ) = MIN (|| x mk ||) , which makes
1 k K
K-Harmonic Means Algorithm
(1) the same as the more popular form
Step 1: Initialize the centers.
K
Perf KM ( X , M ) = || x mk || ,
2
(2)
Step 2: Calculate the membership probabilities
k =1 xSk and the dynamic weighting (see item <C>).

135

TEAM LinG
Center-Based Clustering and Regression Clustering

Step 3: Find the best centers to maximize its perfor- N


mk(u ) =
1 N
1
xi , di ,k =|| xi mk(u 1) ||
mance function.
2 2
K 1 K 1
i =1
dip,k+ 2 p
i =1
d p+2
p (7)

i ,k
Step 4: Repeat Steps 2 and 3 until a chosen conver- l =1 d i,l l =1 d i,l

gence criterion is satisfied.


as (Zhang, 2001)
All three iterative algorithms also can be written uni-
formly as: K
1
d p+2
1
p +2
a ( xi ) =
k =1 i ,k d
p(m p(mk( u 1) | xi ) = K ,k , i = 1,..., N
i
( u 1)
k | x) a ( u 1) ( x) x K 1
2
and 1 . (8)
m (u )
= xX
, p p +2
l =1 d i ,l
k
p(m ( u 1)
k | x) a (u 1) ( x) k =1 di ,k
xX

The dynamic weighting function a ( x) 0 approaches


K
zero when x approaches one of the centers. Intuitively,
a ( u 1)
( x) > 0, p(m ( u 1)
k | x ) 0 and p(m
l =1
( u 1)
l | x) = 1 (4)
the closer a data point is to a center, the smaller weight
it gets in the next iteration. This weighting reduces the
(We dropped the iteration index u-1 on p() for shorter ability of a cluster trapping more than one center. The
effect is clearly observed in the visualization of hundreds
notations). Function a ( u 1) ( x) is a weight on the data point
of experiments conducted (see next section). Compared
x in the current iteration. It is called a dynamic weighting to the KHM, both KM and EM have all data points fully
function, because it changes in each iteration. Functions participate in all iterations (weighting function is a con-
p (mk(u 1) | x) are soft-membership functions, or the prob- stant 1). They do not have a dynamic weighting function
ability of x being associated to the center mk( u 1) . For each as K-Harmonic Means does. EM and KHM both have
soft-membership functions, but K-Means has 0/1 mem-
algorithm, the details on a() and p(,) are: bership function.

A. K-Means: a ( u 1) ( x) =1 for all x in all iterations, and Empirical Comparisons of Center-


p (m ( u 1)
k | x ) =1 if m ( u 1)
k is the closest center to x, Based Clustering Algorithms
otherwise p (mk( u 1) | x) =0. Intuitively, each x is 100%
Empirical comparisons of K-Means, EM, and K-Har-
associated with the closest center, and there is no monic Means on 1,200 randomly generated data sets
weighting on the data points. can be found in the paper by Zhang (2003). Each data set
B. EM: a ( u 1) ( x) =1 for all x in all iterations, and has 50 clusters ranging from well-separated to signifi-
cantly overlapping. The dimensionality of data ranges
p (mk( u 1) | x )
from 2 to 8. All three algorithms are run on each
dataset, starting from the same initialization of the
p ( x | mk(u 1) ) * p (mk(u 1) ) centers, and the converged results are measured by a
p (mk( u 1) | x ) = , common quality measurethe K-Meansfor com-
p( x | mk(u 1) ) * p(mk(u 1) )
xX
(5)
parison. Sensitivity to initialization is studied by a
rerun of all the experiments on different types of
initializations. Major conclusions from the empirical
1
p (mk(u 1) ) = p(mk(u 2) | x),
| X | xX
(6) study are as follows:

1. For low dimensional datasets, the performance


and p ( x | m ( u 1)
k ) is the spherical Gausian density ranking of three algorithms is KHM > KM > EM
(> means better). For low dimensional datasets
function centered at mk( u 1) . (up to 8), the difference is significant.
2. KHMs performance has the smallest variation
C. K-Harmonic Means: a ( x) and p (m ( u 1) ( u 1)
| x ) are under different datasets and different
k
initializations. EMs performance has the biggest
extracted from the KHM algorithm:
variation. Its results are most sensitive to
initializations.

136

TEAM LinG
Center-Based Clustering and Regression Clustering

3. Reproducible results become even more important Regression in (9) is not effective when the dataset
when we use these algorithms to different datasets contains a mixture of very different response character- C
that are sampled from the same hidden distribution. istics, as shown in Figure 1a; it is much better to find the
The results from KHM better represent the proper- partitions in the data and to learn a separate function on
ties of the distribution and are less dependent on a each partition, as shown in Figure 1b. This is the idea of
particular sample set. EMs results are more depen- Regression-Clustering (RC). Regression provides a model
dent on the sample set. for the clusters; clustering partitions the data to best fit
the models. The linkage between the two algorithms is a
The details on the setup of the experiments, quantita- common objective function shared between the regres-
tive comparisons of the results, and the Matlab source sions and the clustering.
code of K-Harmonic Means can be found in the paper. RC algorithms can be viewed as replacing the K
geometric-point centers in center-based clustering al-
Generalization to Complex Model- gorithms by a set of model-based centers, particularly a
Based ClusteringRegression set of regression functions M = { f1 ,..., f K } . With
Clustering the same performance function as defined in (1), but the
distance from a data point to the set of centers replaced
Clustering applies to datasets without response infor- by the following, ( e( f ( x), y) =|| f ( x) y ||2 )
mation (unsupervised); regression applies to datasets
with response variables chosen. Given a dataset with
a) d (( x, y ), M ) = MIN (e( f ( x ), y )) for RC with K-Means
responses, Z = ( X , Y ) = {( xi , yi ) | i = 1,..., N } , a family of f M

(RC-KM)
functions = { f } (a function class making the optimi-
K
zation problem well defined, such as polynomials of up 1
b) d (( x, y ), M ) = log pk * EXP( e( f ( x), y )) for EM
( )
D
to a certain degree) and a loss function e() 0 , regres- k =1

sion solves the following minimization problem (Mont- and
gomery et al., 2001):
c) d (( x, y ), M ) = HA(e( f ( x ), y )) for RC K-Harmonic
f M

N Means (RC-KHM).
f opt = arg min e( f ( xi ), yi ) (9)
f i =1
The three iterative algorithmsRC-KM, RC-EM, and
RC-KHMminimizing their corresponding performance
m
function, take the following common form (10). Regres-
Commonly, = { l h( x, al ) | l R, al R n } , linear
l =1 sion with weighting takes the places of weighted averag-
expansion of simple parametric functions, such as poly- ing in (4). The regression function centers in the uth
nomials of degree up to m, Fourier series of bounded iteration are the solution of the minimization,
frequency, neural networks. Usually,
e( f ( x), y ) =|| f ( x) y || , with p=1,2 most widely used
p N
f k( u ) = arg min a ( zi ) p ( Z k | zi ) || f ( xi ) yi ||2 (10)
(Friedman, 1999). f i =1

where the weighting a ( zi ) and the probability p( Z k | zi )


of data point zi in cluster Z k are both calculated from
Figure 1. (a) Left: a single function is regressed on all
training data, which is a mixture of three different the (u-1)-iterations centers { f k( u 1) } as follows:
distributions; (b) Right: three regression functions,
each regressed on a subset found by RC. The residue (a) For RC-K-Means, a( zi ) = 1 and p( Z k | zi ) =1 if
errors are much smaller.
e( f k(u 1) ( xi ), yi ) < e( f k('u 1) ( xi ), yi ) k ' k , otherwise
p ( Z k | zi ) =0. Intuitively, RC-K-Means has the fol-
lowing steps:

137

TEAM LinG
Center-Based Clustering and Regression Clustering

Step 1: Initialize the regression functions. Empirical Comparison of Center-Based


Step 2: Associate each data point (x,y) with the Clustering Algorithms
regression function that provides the best ap-
Comparison of the three RC-algorithms on randomly gen-
proximation ( arg kmin{e( f k ( x ), y ) | k = 1,..., K } .
erated datasets can be found in the paper by Zhang

Step 3: Recalculate the regression function on (2003a). RC-KHM is shown to be less sensitive to
each partition that maximizes the performance initialization than RC-KM and RC-EM. Details on imple-
function. menting the RC algorithms with extended linear regres-
Step 4: Repeat Steps 2 and 3 until no more data sion models are also available in the same paper.
points change their membership.
Comparing these steps with the steps of K-Means, the
only differences are that point-centers are replaced by FUTURE TRENDS
regression-functions; distance from a point to a center is
replaced by the residue error of a pair (x,y) approximated Improving the understanding of dynamic weighting to
by a regression function. the convergence behavior of clustering algorithms and
finding systemic design methods to develop better per-
(b) For RC-EM, a( zi ) = 1 and forming clustering algorithms require more research.
Some of the work in this direction is appearing. Nock
and Nielsen (2004) took the dynamic weighting idea and
1
pk(u 1) EXP ( e( f k(u 1) ( xi ), yi ))
2
developed a general framework similar to boosting
p ( u ) ( Z k | zi ) = 1 N


K
1
pk(u 1) EXP( e( f k(u 1) ( xi ), yi ))
and pk(u 1) =
N
p( Z
i =1
( u 1)
k | zi ) . theory in supervised learning.
k =1 2 Regression clustering will find many applications in
analyzing real-word data. Single-function regression
The same parallel structure can be observed between has been used very widely for data analysis and forecast-
the center-based EM clustering algorithm and the RC- ing. Data collected in an uncontrolled environment, like
EM algorithm. in stocks, marketing, economy, government census, and
many other real-world situations, are very likely to
(c) For RC-K-Harmonic Means, with contain a mixture of different response characters. Re-
e( f ( x), y ) =|| f ( xi ) yi || , p' gression clustering is a natural extension to the classi-
cal single-function regression.
K K K
a p ( zi ) = dip,l' + 2 d p'
i ,l
p '+ 2
and p ( Z k | zi ) = di ,k d p '+ 2
i ,l
l =1 l =1 l =1 CONCLUSION

where d i ,l =|| f (u 1) ( xi ) yi || . ( p > 2 is used.) Replacing the simple geometric-point centers in cen-
The same parallel structure can be observed between ter-based clustering algorithms by more complex data
the center-based KHM clustering algorithm and the RC- models provides a general scheme for deriving other
KHM algorithm. model-based clustering algorithms. Regression models
Sensitivity to initialization in center-based cluster- are used in this presentation to demonstrate the process.
ing carries over to regression clustering. In addition, a The key step in the generalization is defining the dis-
new form of local optimum is illustrated in Figure 2. tance function from a data point to the set of models
It happens to all three RC algorithms, RC-KM, RC- the regression functions in this special case.
KHM, and RC-EM. Among the three algorithms, EM has a strong foun-
dation in probability theory. It is the convergence to
only a local optimum and the existence of a very large
Figure 2. A new kind of local optimum occurs in number of optima when the number of clusters is more
regression clustering. than a few (>5, for example) that keeps practitioners
from the benefits of its theory. K-Means is the simplest
and its objective function the most intuitive. But it has
the similar problem as the EMs sensitivity to initializa-
tion of the centers. K-Harmonic Means was developed
with close attention to the dynamics of its convergence;
it is much more robust than the other two on low dimen-

138

TEAM LinG
Center-Based Clustering and Regression Clustering

sional data. Improving the convergence of center-based gence Artificial in Uncertainty (pp. 386-395). Morgan
clustering algorithms on higher dimensional data (dim > Kaufman. C
10) still needs more research.
Montgomery, D.C., Peck, E.A., & Vining, G.G. (2001).
Introduction to linear regression analysis. John Wiley
& Sons.
REFERENCES
Nock, R., & Nielsen, F. (2004). An abstract weighting
Bradley, P., & Fayyad, U.M. (1998). Refining initial points framework for clustering algorithms. Proceedings of
for KM clustering. MS Technical Report MSR-TR-98-36. the Fourth International SIAM Conference on Data
Mining. Orlando, Florida.
Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maxi-
mum likelihood from incomplete data via the EM algo- Pena, J., Lozano, J., & Larranaga, P. (1999). An empirical
rithm. Journal of the Royal Statistical Society, 39(1), 1- comparison of four initialization methods for the K-
38. means algorithm. Pattern Recognition Letters, 20,
1027-1040.
DeSarbo, W.S., & Corn, L.W. (1988). A maximum likeli-
hood methodology for clusterwise linear regression. Jour- Rendner, R.A., & Walker, H.F. (1984). Mixture densi-
nal of Classification, 5, 249-282. ties, maximum likelihood and the EM algorithm. SIAM
Review, 26(2).
Duda, R., & Hart, P. (1972). Pattern classification and
scene analysis. John Wiley & Sons. Schapire, R.E. (1999). Theoretical views of boosting and
applications. Proceedings of the Tenth International
Friedman, J., Hastie, T., & Tibshirani. R. (1998). Additive Conference on Algorithmic Learning Theory.
logistic regression: A statistical view of boosting [tech-
nical report]. Department of Statistics, Stanford Univer- Selim, S.Z., & Ismail, M.A (1984). K-means type algo-
sity. rithms: A generalized convergence theorem and charac-
terization of local optimality. IEEE Transactions on
Gersho, A., & Gray, R.M. (1992). Vector quantization and PAMI-6, 1.
signal compression. Kluwer Academic Publishers.
Silverman, B.W. (1998). Density estimation for statis-
Hamerly, G., & Elkan, C. (2002). Alternatives to the k- tics and data analysis. Chapman & Hall/CRC.
means algorithm that find better clusterings. Proceed-
ings of the ACM conference on information and knowl- Spath, H. (1981). Correction to algorithm 39:
edge management (CIKM). Clusterwise linear regression. Computing, 26, 275.
Hamerly, G., & Elkan, C. (2003). Learning the k in k-means. Spath, H. (1982). Algorithm 48: A fast algorithm for
Proceedings of the Seventeenth Annual Conference on clusterwise linear regression. Computing, 29, 175-181.
Neural Information Processing Systems.
Spath, H. (1985). Cluster dissection and analysis. New
Hennig, C. (1997). Datenanalyse mit modellen fur cluster York: Wiley.
linear regression [Dissertation]. Hamburg, Germany:
Institut Fur Mathmatsche Stochastik, Universitat Ham- Tibshirani, R., Walther, G., & Hastie, T. (2000). Estimating
burg. the number of clusters in a dataset via the gap statistic.
Retrieved from http://www-stat.stanford.edu/~tibs /
Kaufman, L., & Rousseeuw, P.J. (1990). Finding groups in research.html
data: An introduction to cluster analysis. John Wiley &
Sons Zhang, B. (2001). Generalized K-harmonic meansDy-
namic weighting of data in unsupervised learning. Pro-
MacQueen, J. (1967). Some methods for classification and ceedings of the First SIAM International Conference on
analysis of multivariate observations. Proceedings of the Data Mining (SDM2001), Chicago, Illinois.
Fifth Berkeley Symposium on Mathematical Statistics
and Probability, Berkeley, California. Zhang, B. (2003). Comparison of the performance of cen-
ter-based clustering algorithms. Proceedings of PAKDD-
McLachlan, G. J., & Krishnan, T. (1997). EM algorithm 03, Seoul, South Korea.
and extensions. John Wiley & Sons.
Zhang, B. (2003a). Regression clustering. Proceedings of
Meila, M., & Heckerman, D. (1998). An experimental com- the IEEE International Conference on Data Mining,
parison of several clustering and initialization methods. In Melbourne, Florida.
Proceedings of the Fourteenth Conference on Intelli-

139

TEAM LinG
Center-Based Clustering and Regression Clustering

Zhang, B., Hsu, M., & Dayal, U. (2000). K-harmonic means. Model-Based Clustering: A mixture of simpler distri-
Proceedings of the International Workshop on Tempo- butions is used to fit the data, which defines the clusters
ral, Spatial and Spatio-Temporal Data Mining, Lyon, of the data. EM with linear mixing of Gaussian density
France. functions is the best example, but K-Means and K-Har-
monic Means are the same type. Regression clustering
algorithms are also model-based clustering algorithms
with mixing of more complex distributions as its model.
KEY TERMS
Regression: A statistical method of learning the rela-
Boosting: Assigning and updating weights on data tionship between two sets of variables from data. One set
points according to a particular formula in the process of is the independent variables or the predictors, and the
refining classification models. other set is the response variables.

Center-Based Clustering: Similarity among the data Regression Clustering: Combining the regression
points is defined through a set of centers. The distance methods with center-based clustering methods. The
from each data point to a center determined the data points simple geometric-point centers in the center-based clus-
association with that center. The clusters are represented tering algorithms are replaced by regression models.
by the centers. Sensitivity to Initialization: Center-based clus-
Clustering: Grouping data according to similarity tering algorithms are iterative algorithms that minimiz-
among them. Each clustering algorithm has its own defi- ing the value of its performance function. Such algo-
nition of similarity. Such grouping can be hierarchical. rithms converge to only a local optimum of its perfor-
mance function. The converged positions of the centers
Dynamic Weighting: Reassigning weights on the depend on the initial positions of the centers where the
data points in each iteration of an iterative algorithm. algorithm start with.

140

TEAM LinG
141

Classification and Regression Trees C


Johannes Gehrke
Cornell University, USA

INTRODUCTION unlabeled record, we start at the root node. If the record


satisfies the predicate associated with the root node, we
It is the goal of classification and regression to build a follow the tree to the left child of the root, and we go to the
data-mining model that can be used for prediction. To right child otherwise. We continue this pattern through a
construct such a model, we are given a set of training unique path from the root of the tree to a leaf node, where
records, each having several attributes. These attributes we predict the value of the dependent attribute associated
either can be numerical (e.g., age or salary) or categorical with this leaf node. An example decision tree for a classi-
(e.g., profession or gender). There is one distinguished fication problem, a classification tree, is shown in Figure
attributethe dependent attribute; the other attributes 1. Note that a decision tree automatically captures inter-
are called predictor attributes. If the dependent attribute actions between variables, but it only includes interac-
is categorical, the problem is a classification problem. If tions that help in the prediction of the dependent at-
the dependent attribute is numerical, the problem is a tribute. For example, the rightmost leaf node in the example
regression problem. It is the goal of classification and shown in Figure 1 is associated with the classification
regression to construct a data-mining model that predicts rule: If (Age >= 40) and (Gender=male), then YES; as
the (unknown) value for a record, where the value of the classification rule that involves an interaction between
dependent attribute is unknown. (We call such a record an the two predictor attributes age and salary.
unlabeled record.) Classification and regression have a Decision trees can be mined automatically from a
wide range of applications, including scientific experi- training database of records, where the value of the
ments, medical diagnosis, fraud detection, credit approval, dependent attribute is known: A decision tree construc-
and target marketing (Hand, 1997). tion algorithm selects which attribute(s) to involve in the
Many classification and regression models have been splitting predicates, and the algorithm decides also on the
proposed in the literature; among the more popular mod- shape and depth of the tree (Murthy, 1998).
els are neural networks, genetic algorithms, Bayesian
methods, linear and log-linear models and other statistical
methods, decision tables, and tree-structured models, MAIN THRUST
which is the focus of this article (Breiman, Friedman,
Olshen & Stone, 1984). Tree-structured models, so-called Let us discuss how decision trees are mined from a
decision trees, are easy to understand; they are non- training database. A decision tree usually is constructed
parametric and, thus, do not rely on assumptions about in two phases. In the first phase, the growth phase, an
the data distribution; and they have fast construction overly large and deep tree is constructed from the training
methods even for large training datasets (Lim, Loh & Shih, data. In the second phase, the pruning phase, the final size
2000). Most data-mining suites include tools for classifi- of the tree is determined with the goal to minimize the
cation and regression tree construction (Goebel & expected misprediction error (Quinlan, 1993).
Gruenwald, 1999).

Figure 1. An example classification tree


BACKGROUND

Let us start by introducing decision trees. For the ease of Age <
40
explanation, we are going to focus on binary decision
trees. In binary decision trees, each internal node has two
children nodes. Each internal node is associated with a No Gender=M
predicate, called the splitting predicate, which involves
only the predictor attributes. Each leaf node is associated
with a unique value for the dependent attribute. A deci- No Yes
sion encodes a data-mining model as follows. For an

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Classification and Regression Trees

There are two problems that make decision tree con- databases (Gehrke, Ramakrishnan & Ganti, 2000; Shafer,
struction a hard problem. First, construction of the opti- Agrawal & Mehta, 1996).
mal tree for several measures of optimality is an NP-hard In most classification and regression scenarios, we
problem. Thus, all decision tree construction algorithms also have costs associated with misclassifying a record,
grow the tree top-down according to the following greedy or with being far off in our prediction of a numerical
heuristic: At the root node, the training database is dependent value. Existing decision tree algorithms can
examined, and a splitting predicate is selected. Then the take costs into account, and they will bias the model
training database is partitioned according to the splitting toward minimizing the expected misprediction cost in-
predicate, and the same method is applied recursively at stead of the expected misclassification rate, or the ex-
each child node. The second problem is that the training pected difference between the predicted and true value of
database is only a sample from a much larger population the dependent attribute.
of records. The decision tree has to perform well on
records drawn from the population, not on the training
database. (For the records in the training database, we FUTURE TRENDS
already know the value of the dependent attribute.)
Three different algorithmic issues need to be ad- Recent developments have expanded the types of models
dressed during the tree construction phase. The first that a decision tree can have in its leaf nodes. So far, we
issue is to devise a split selection algorithm, such that the assumed that each leaf node just predicts a constant value
resulting tree models the underlying dependency rela- for the dependent attribute. Recent work, however, has
tionship between the predictor attributes and the depen- shown how to construct decision trees with linear models
dent attribute well. During split selection, we have to make in the leaf nodes (Dobra & Gehrke, 2002). Another recent
two decisions. First, we need to decide which attribute we development in the general area of data mining is the use
will select as the splitting attribute. Second, given the of ensembles of models, and decision trees are a popular
splitting attribute, we have to decide on the actual split- model for use as a base model in ensemble learning
ting predicate. For a numerical attribute X, splitting predi- (Caruana, Niculescu-Mizil, Crew & Ksikes, 2004). An-
cates are usually of the form X c, where c is a constant. other recent trend is the construction of data-mining
For example, in the tree shown in Figure 1, the splitting models of high-speed data streams, and there have been
predicate of the root node is of this form. For a categorical adaptations of decision tree construction algorithms to
attribute X, splits are usually of the form X in C, where C such environments (Domingos & Hulten, 2002). A last
is a set of values in the domain of X. For example, in the recent trend is to take adversarial behavior into account
tree shown in Figure 1, the splitting predicate of the right (e.g., in classifying spam). In this case, an adversary who
child node of the root is of this form. There exist decision produces the records to be classified actively changes his
trees that have a larger class of possible splitting predi- or her behavior over time to outsmart a static classifier
cates; for example, there exist decision trees with linear (Dalvi, Domingos, Mausam, Sanghai & Verma, 2004).
combinations of numerical attribute values as splitting
predicates a iXi+c0, where i ranges over all attributes)
(Loh & Shih, 1997). Such splits, also called oblique splits, CONCLUSION
result in shorter trees; however, the resulting trees are no
longer easy to interpret. Decision trees are one of the most popular data-mining
The second issue is to devise a pruning algorithm that models. Decision trees are important, since they can result
selects the tree of the right size. If the tree is too large, then in powerful predictive models, while, at the same time,
the tree models the training database too closely instead they allow users to get insight into the phenomenon that
of modeling the underlying population. One possible is being modeled.
choice of pruning a tree is to hold out part of the training
set as a test set and to use the test set to estimate the
misprediction error of trees of different size. We then REFERENCES
simply select the tree that minimizes the misprediction
error. Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J.
The third issue is to devise an algorithm for intelligent (1984). Classification and regression trees. Kluwer Aca-
management of the training database in case the training demic Publishers.
database is very large (Ramakrishnan & Gehrke, 2002).
This issue has only received attention in the last decade, Caruana, R., Niculescu-Mizil, A., Crew, R., & Ksikes, A.
but there exist now many algorithms that can construct (2004). Ensemble selection from libraries of models. Pro-
decision trees over extremely large, disk-resident training

142

TEAM LinG
Classification and Regression Trees

ceedings of the Twenty-First International Conference, Quinlan, J.R. (1993). C4.5: Programs for machine learn-
Banff, Alberta, Canada. ing. Morgan Kaufman. C
Dalvi, N., Domingos, P., Mausam, S.S., & Verma, D. (2004). Ramakrishnan, R. & Gehrke, J. (2002). Database manage-
Adversarial classification. Proceedings of the Tenth Inter- ment systems (3rd ed.). McGrawHill.
national Conference on Knowledge Discovery and Data
Mining, Seattle, Washington. Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A
scalable parallel classifier for data mining. Proceedings
Dobra, A., & Gehrke, J. (2002). SECRET: A scalable linear of the 22nd International Conference on Very Large
regression tree algorithm. Proceedings of the Eighth ACM Databases, Bombay, India.
SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, Edmonton, Alberta, Canada.
Domingos, P., & Hulten, G. (2002). Learning from infinite KEY TERMS
data in finite time. Advances in Neural Information Pro-
cessing Systems, 14, 673-680. Attribute: Column of a dataset.
Gehrke, J., Ramakrishnan, R., & Ganti, V. (2000). Categorical Attribute: Attribute that takes values
RainforestA framework for fast decision tree construc- from a discrete domain.
tion of large datasets. Data Mining and Knowledge Dis-
covery, 4(2/3), 127-162. Classification Tree: A decision tree where the de-
pendent attribute is categorical.
Goebel, M., & Gruenwald, L. (1999). A survey of data
mining software tools. SIGKDD Explorations, 1(1), 20-33. Decision Tree: Tree-structured data mining model
used for prediction, where internal nodes are labeled with
Hand, D. (1997). Construction and assessment of classifi- predicates (decisions), and leaf nodes are labeled with
cation rules. Chichester, England: John Wiley & Sons. data-mining models.
Lim, T.-S., Loh, W.-Y., & Shih, Y.-S. (2000). A comparison Numerical Attribute: Attribute that takes values
of prediction accuracy, complexity, and training time of from a continuous domain.
thirty-three old and new classification algorithms. Ma-
chine Learning, 48, 203-228. Regression Tree: A decision tree where the depen-
dent attribute is numerical.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods
for classification trees. Statistica Sinica, 7(4), 815-840. Splitting Predicate: Predicate at an internal node of
the tree; it decides which branch a record traverses on its
Murthy, S.K. (1998). Automatic construction of decision way from the root to a leaf node.
trees from data: A multi-disciplinary survey. Data Mining
and Knowledge Discovery, 2(4), 345-389.

143

TEAM LinG
144

Classification Methods
Aijun An
York University, Canada

INTRODUCTION sian classification), classifying an example based on dis-


tances in the feature space (e.g. the k-nearest neighbor
Generally speaking, classification is the action of as- method), and constructing a classification tree that clas-
signing an object to a category according to the charac- sifies examples based on tests on one or more predictor
teristics of the object. In data mining, classification variables (i.e., classification tree analysis).
refers to the task of analyzing a set of pre-classified data In the field of machine learning, attention has more
objects to learn a model (or a function) that can be used focused on generating classification expressions that
to classify an unseen data object into one of several are easily understood by humans. The most popular
predefined classes. A data object, referred to as an machine learning technique is decision tree learning,
example, is described by a set of attributes or variables. which learns the same tree structure as classification
One of the attributes describes the class that an example trees but uses different criteria during the learning
belongs to and is thus called the class attribute or class process. The technique was developed in parallel with
variable. Other attributes are often called independent the classification tree analysis in statistics. Other ma-
or predictor attributes (or variables). The set of ex- chine learning techniques include classification rule
amples used to learn the classification model is called learning, neural networks, Bayesian classification, in-
the training data set. Tasks related to classification stance-based learning, genetic algorithms, the rough set
include regression, which builds a model from training approach and support vector machines. These tech-
data to predict numerical values, and clustering, which niques mimic human reasoning in different aspects to
groups examples to form categories. Classification be- provide insight into the learning process.
longs to the category of supervised learning, distin- The data mining community inherits the classifica-
guished from unsupervised learning. In supervised learn- tion techniques developed in statistics and machine
ing, the training data consists of pairs of input data learning, and applies them to various real world prob-
(typically vectors), and desired outputs, while in unsu- lems. Most statistical and machine learning algorithms
pervised learning there is no a priori output. are memory-based, in which the whole training data set
Classification has various applications, such as learn- is loaded into the main memory before learning starts.
ing from a patient database to diagnose a disease based In data mining, much effort has been spent on scaling up
on the symptoms of a patient, analyzing credit card the classification algorithms to deal with large data sets.
transactions to identify fraudulent transactions, auto- There is also a new classification technique, called
matic recognition of letters or digits based on handwrit- association-based classification, which is based on as-
ing samples, and distinguishing highly active compounds sociation rule learning.
from inactive ones based on the structures of com-
pounds for drug discovery.
MAIN THRUST

BACKGROUND Major classification techniques are described below.


The techniques differ in the learning mechanism and in
Classification has been studied in statistics and machine the representation of the learned model.
learning. In statistics, classification is also referred to
as discrimination. Early work on classification focused Decision Tree Learning
on discriminant analysis, which constructs a set of
discriminant functions, such as linear functions of the Decision tree learning is one of the most popular clas-
predictor variables, based on a set of training examples sification algorithms. It induces a decision tree from
to discriminate among the groups defined by the class data. A decision tree is a tree structured prediction
variable. Modern studies explore more flexible classes model where each internal node denotes a test on an
of models, such as providing an estimate of the join attribute, each outgoing branch represents an outcome
distribution of the features within each class (e.g. Baye- of the test, and each leaf node is labeled with a class or

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Classification Methods

Decision Rule Learning

Decision rules are a set of if-then rules. They are the most expressive and human-readable representation of classification models (Mitchell, 1997). An example of a decision rule is "if X < 1 and Y = B, then the example belongs to Class 2". This type of rule is referred to as a propositional rule. Rules can be generated by translating a decision tree into a set of rules, one rule for each leaf node in the tree. A second way to generate rules is to learn rules directly from the training data. There is a variety of rule induction algorithms. The algorithms induce rules by searching in a hypothesis space for a hypothesis that best matches the training data. The algorithms differ in the search method (e.g., general-to-specific, specific-to-general, or two-way search), the search heuristics that control the search, and the pruning method used. The most widespread approach to rule induction is sequential covering, in which a greedy general-to-specific search is conducted to learn a disjunctive set of conjunctive rules. It is called sequential covering because it sequentially learns a set of rules that together cover the set of positive examples for a class. Algorithms belonging to this category include CN2 (Clark & Boswell, 1991), RIPPER (Cohen, 1995) and ELEM2 (An & Cercone, 1998).

Naive Bayesian Classifier

The naive Bayesian classifier is based on Bayes' theorem. Suppose that there are m classes, C1, C2, ..., Cm. The classifier predicts an unseen example X as belonging to the class having the highest posterior probability conditioned on X. In other words, X is assigned to class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.

By Bayes' theorem, we have

P(Ci|X) = P(X|Ci) P(Ci) / P(X).

As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized. Given a set of training data, P(Ci) can be estimated by counting how often each class occurs in the training data. To reduce the computational expense in estimating P(X|Ci) for all possible X's, the classifier makes a naive assumption that the attributes used in describing X are conditionally independent of each other given the class of X. Thus, given the attribute values (x1, x2, ..., xn) that describe X, we have

P(X|Ci) = ∏_{j=1}^{n} P(xj|Ci).

The probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) can be estimated from the training data.
The naive Bayesian classifier is simple to use and efficient to learn. It requires only one scan of the training data. Despite the fact that the independence assumption is often violated in practice, naive Bayes often competes well with more sophisticated classifiers. Recent theoretical analysis has shown why the naive Bayesian classifier is so robust (Domingos & Pazzani, 1997; Rish, 2001).
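A minimal sketch of this classifier follows. The toy attribute-value data and the add-one (Laplace) smoothing of the conditional estimates are illustrative choices, not part of the description above.

```python
# Sketch of a naive Bayesian classifier: class priors P(Ci) and per-attribute
# conditionals P(xj|Ci) are estimated by counting; prediction maximizes
# P(Ci) * prod_j P(xj|Ci). Add-one smoothing is an assumed practical detail.
from collections import Counter, defaultdict

def train(examples, labels):
    priors = Counter(labels)                 # class counts
    conds = defaultdict(Counter)             # (attribute, class) -> value counts
    values = defaultdict(set)                # attribute -> observed values
    for x, y in zip(examples, labels):
        for attr, value in x.items():
            conds[(attr, y)][value] += 1
            values[attr].add(value)
    return priors, conds, values

def predict(x, priors, conds, values):
    n = sum(priors.values())
    def score(c):
        s = priors[c] / n
        for attr, value in x.items():
            # add-one smoothing over the attribute's observed value set
            s *= (conds[(attr, c)][value] + 1) / (priors[c] + len(values[attr]))
        return s
    return max(priors, key=score)

examples = [{"X": "<1", "Y": "A"}, {"X": "<1", "Y": "A"},
            {"X": "<1", "Y": "B"}, {"X": ">=1", "Y": "B"}]
labels = ["Class 1", "Class 1", "Class 2", "Class 2"]
model = train(examples, labels)
print(predict({"X": "<1", "Y": "A"}, *model))   # -> "Class 1"
```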
Bayesian Belief Networks

A Bayesian belief network, also known as a Bayesian network or a belief network, is a directed acyclic graph whose nodes represent variables and whose arcs represent dependence relations among the variables. If there is an arc from node A to another node B, then we say that A is a parent of B and B is a descendent of A. Each variable is conditionally independent of its nondescendents in the graph, given its parents. The variables may correspond to actual attributes given in the data or to hidden variables believed to form a relationship. A variable in the network can be selected as the class attribute. The classification process can return a probability distribution for the class attribute based on the network structure and some conditional probabilities estimated from the training data, which predicts the probability of each class.
The Bayesian network provides an intermediate approach between the naive Bayesian classification and the Bayesian classification without any independence assumptions. It describes dependencies among attributes, but allows conditional independence among subsets of attributes.
The training of a belief network depends on the scenario. If the network structure is known and the variables are observable, training the network only consists of estimating some conditional probabilities from the training data, which is straightforward. If the network structure is given and some of the variables are hidden, a method of gradient descent can be used to train the network (Russell, Binder, Koller, & Kanazawa, 1995). Algorithms also exist for learning the network structure from training data given observable variables (Buntine, 1994; Cooper & Herskovits, 1992; Heckerman, Geiger, & Chickering, 1995).

The k-Nearest Neighbour Classifier

The k-nearest neighbour classifier classifies an unknown example to the most common class among its k nearest neighbors in the training data. It assumes all the examples correspond to points in an n-dimensional space. A neighbour is deemed nearest if it has the smallest distance, in the Euclidean sense, in the n-dimensional feature space. When k = 1, the unknown example is classified into the class of its closest neighbour in the training set. The k-nearest neighbour method stores all the training examples and postpones learning until a new example needs to be classified. This type of learning is called instance-based or lazy learning.
The k-nearest neighbour classifier is intuitive, easy to implement and effective in practice. It can construct a different approximation to the target function for each new example to be classified, which is advantageous when the target function is very complex but can be described by a collection of less complex local approximations (Mitchell, 1997). However, its cost of classifying new examples can be high due to the fact that almost all the computation is done at classification time. Some refinements to the k-nearest neighbour method include weighting the attributes in the distance computation and weighting the contribution of each of the k neighbors during classification according to their distance to the example to be classified.
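The following sketch illustrates the basic classifier just described; the points, labels and the choice k = 3 are made up for the example.

```python
# Sketch of a k-nearest neighbour classifier: store all training examples and
# assign a query the most common class among its k closest points (Euclidean).
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    ranked = sorted(range(len(train_points)),
                    key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

points = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
labels = ["Class 1", "Class 1", "Class 2", "Class 2"]
print(knn_predict(points, labels, (0.2, 0.1), k=3))   # -> "Class 1"
```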
Neural Networks

Neural networks, also referred to as artificial neural networks, are studied to simulate the human brain, although brains are much more complex than any artificial neural network developed so far. A neural network is composed of a few layers of interconnected computing units (neurons or nodes). Each unit computes a simple function. The inputs of the units in one layer are the outputs of the units in the previous layer. Each connection between units is associated with a weight. Parallel computing can be performed among the units in each layer. The units in the first layer take input and are called the input units. The units in the last layer produce the output of the network and are called the output units. When the network is in operation, a value is applied to each input unit, which then passes its given value to the connections leading out from it, and on each connection the value is multiplied by the weight associated with that connection. Each unit in the next layer then receives a value which is the sum of the values produced by the connections leading into it, and in each unit a simple computation is performed on the value; a sigmoid function is typical. This process is then repeated, with the results being passed through subsequent layers of nodes until the output nodes are reached. Neural networks can be used for both regression and classification. To model a classification function, we can use one output unit per class. An example can be classified into the class corresponding to the output unit with the largest output value.
Neural networks differ in the way in which the neurons are connected, in the way the neurons process their input, and in the propagation and learning methods used (Nurnberger, Pedrycz, & Kruse, 2002). Learning a neural network is usually restricted to modifying the weights based on the training data; the structure of the initial network is usually left unchanged during the learning process. A typical network structure is the multilayer feed-forward neural network, in which none of the connections cycles back to a unit of a previous layer. The most widely used method for training a feed-forward neural network is backpropagation (Rumelhart, Hinton, & Williams, 1986).

Support Vector Machines

The support vector machine (SVM) is a recently developed technique for multidimensional function approximation. The objective of support vector machines is to determine a classifier or regression function which minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error) (Vapnik, 1998).
Given a set of N linearly separable training examples S = {xi ∈ Rⁿ | i = 1, 2, ..., N}, where each example belongs to one of two classes, represented by yi ∈ {+1, −1}, the SVM learning method seeks the optimal hyperplane w·x + b = 0 as the decision surface, which separates the positive and negative examples with the largest margin. The decision function for classifying linearly separable data is

f(x) = sign(w·x + b),

where w and b are found from the training set by solving a constrained quadratic optimization problem. The final decision function is

f(x) = sign( Σ_{i=1}^{N} αi yi (xi·x) + b ).

The function depends on the training examples for which αi is non-zero. These examples are called support vectors. Often the number of support vectors is only a small fraction of the original dataset. The basic SVM formulation can be extended to the nonlinear case by using nonlinear kernels that map the input space to a high-dimensional feature space. In this high-dimensional feature space, linear classification can be performed. The SVM classifier has become very popular due to its high performance in practical applications such as text classification and pattern recognition.
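The decision function above can be evaluated directly once the support vectors, the multipliers αi and the offset b are known. The sketch below assumes illustrative values for those quantities rather than solving the quadratic program.

```python
# Sketch of evaluating the SVM decision function
# f(x) = sign(sum_i alpha_i * y_i * (x_i . x) + b).
# The support vectors, alphas and b are hypothetical values standing in for
# the solution of the constrained quadratic optimization problem.
def svm_decision(x, support_vectors, alphas, ys, b):
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    score = sum(a * y * dot(sv, x)
                for a, y, sv in zip(alphas, ys, support_vectors)) + b
    return 1 if score >= 0 else -1

support_vectors = [(1.0, 1.0), (2.0, 2.5)]   # hypothetical support vectors
alphas = [0.8, 0.8]                          # hypothetical multipliers
ys = [-1, +1]                                # their class labels
b = -1.0
print(svm_decision((3.0, 3.0), support_vectors, alphas, ys, b))   # -> +1
print(svm_decision((0.0, 0.0), support_vectors, alphas, ys, b))   # -> -1
```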
FUTURE TRENDS

Classification is a major data mining task. As data mining becomes more popular, classification techniques are increasingly applied to provide decision support in business, biomedicine, financial analysis, telecommunications and so on. For example, there are recent applications of classification techniques to identify fraudulent usage of credit cards based on credit card transaction databases, and various classification techniques have been explored to identify highly active compounds for drug discovery. To better solve application-specific problems, there has been a trend toward the development of more application-specific data mining systems (Han & Kamber, 2001).
Traditional classification algorithms assume that the whole training data can fit into the main memory. As automatic data collection becomes a daily practice in many businesses, large volumes of data that exceed the memory capacity become available to the learning systems. Scalable classification algorithms become essential. Although some scalable algorithms for decision tree learning have been proposed, there is still a need to develop scalable and efficient algorithms for other types of classification techniques, such as decision rule learning.
Previously, the study of classification techniques focused on exploring various learning mechanisms to improve the classification accuracy on unseen examples. However, recent study of imbalanced data sets has shown that classification accuracy is not an appropriate measure of classification performance when the data set is extremely unbalanced, in which almost all the examples belong to one or more larger classes and far fewer examples belong to a smaller, usually more interesting class. Since many real-world data sets are unbalanced, there has been a trend toward adjusting existing classification algorithms to better identify examples in the rare class.
Another issue that has become more and more important in data mining is privacy protection. As data mining tools are applied to large databases of personal records, privacy concerns are rising. Privacy-preserving data mining is currently one of the hottest research topics in data mining and will remain so in the near future.

CONCLUSION

Classification is a form of data analysis that extracts a model from data to classify future data. It has been studied in parallel in statistics and machine learning, and is currently a major technique in data mining with a broad application spectrum. Since many application problems can be formulated as classification problems and the volume of the available data has become overwhelming, developing scalable, efficient, domain-specific, and privacy-preserving classification algorithms is essential.
REFERENCES

An, A., & Cercone, N. (1998). ELEM2: A learning system for more accurate classifications. Proceedings of the 12th Canadian Conference on Artificial Intelligence (pp. 426-441).

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Wadsworth International Group.

Buntine, W.L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225.

Castillo, E., Gutiérrez, J.M., & Hadi, A.S. (1997). Expert systems and probabilistic network models. New York: Springer-Verlag.

Clark, P., & Boswell, R. (1991). Rule induction with CN2: Some recent improvements. Proceedings of the 5th European Working Session on Learning (pp. 151-163).

Cohen, W.W. (1995). Fast effective rule induction. Proceedings of the 11th International Conference on Machine Learning (pp. 115-123). Morgan Kaufmann.

Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). RainForest: A framework for fast decision tree construction of large datasets. Proceedings of the 24th International Conference on Very Large Data Bases (pp. 416-427).

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Heckerman, D., Geiger, D., & Chickering, D.M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243.

Mitchell, T.M. (1997). Machine learning. McGraw-Hill.

Nurnberger, A., Pedrycz, W., & Kruse, R. (2002). Neural network approaches. In Klosgen & Zytkow (Eds.), Handbook of data mining and knowledge discovery (pp. 304-317). Oxford University Press.

Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3), 241-288.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Rish, I. (2001). An empirical study of the naive Bayes classifier. Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. Proceedings of the 14th Joint International Conference on Artificial Intelligence, 2 (pp. 1146-1152).

Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Proceedings of the 22nd International Conference on Very Large Data Bases (pp. 544-555).

Vapnik, V. (1998). Statistical learning theory. New York: John Wiley & Sons.

KEY TERMS

Backpropagation: A neural network training algorithm for feedforward networks where the errors at the output layer are propagated back to the previous layer to update connection weights in learning. If the previous layer is not the input layer, then the errors at this hidden layer are propagated back to the layer before.

Disjunctive Set of Conjunctive Rules: A conjunctive rule is a propositional rule whose antecedent consists of a conjunction of attribute-value pairs. A disjunctive set of conjunctive rules consists of a set of conjunctive rules with the same consequent. It is called disjunctive because the rules in the set can be combined into a single disjunctive rule whose antecedent consists of a disjunction of conjunctions.

Genetic Algorithm: An algorithm for optimizing a binary string based on an evolutionary mechanism that uses replication, deletion, and mutation operators carried out over many generations.

Information Gain: Given a set E of classified examples and a partition P = {E1, ..., En} of E, the information gain is defined as

entropy(E) − Σ_{i=1}^{n} entropy(Ei) · |Ei| / |E|,

where |X| is the number of examples in X, and entropy(X) = −Σ_{j=1}^{m} pj log2(pj) (assuming there are m classes
in X and pj denotes the probability of the jth class in X). Intuitively, the information gain measures the decrease of the weighted average impurity of the partitions E1, ..., En, compared with the impurity of the complete set of examples E.

Machine Learning: The study of computer algorithms that develop new knowledge and improve their performance automatically through past experience.

Rough Set Data Analysis: A method for modeling uncertain information in data by forming lower and upper approximations of a class. It can be used to reduce the feature set and to generate decision rules.

Sigmoid Function: A mathematical function defined by the formula

P(t) = 1 / (1 + e^(−t)).

Its name is due to the sigmoid shape of its graph. This function is also called the standard logistic function.

Closed-Itemset Incremental-Mining Problem


Luminita Dumitriu
Dunarea de Jos University, Romania

INTRODUCTION

Association rules, introduced by Agrawal, Imielinski and Swami (1993), provide useful means to discover associations in data. The problem of mining association rules in a database is defined as finding all the association rules that hold with more than a user-given minimum support threshold and a user-given minimum confidence threshold. According to Agrawal, Imielinski and Swami, this problem is solved in two steps:

1. Find all frequent itemsets in the database.
2. For each frequent itemset I, generate all the association rules I′ → I \ I′, where I′ ⊂ I.

The second problem can be solved in a straightforward manner after the first step is completed. Hence, the problem of mining association rules is reduced to the problem of finding all frequent itemsets. This is not a trivial problem, because the number of possible frequent itemsets is equal to the size of the power set of I, 2^|I|.
Many algorithms are proposed in the literature, most of them based on the Apriori mining method (Agrawal & Srikant, 1994), which relies on a basic property of frequent itemsets: all subsets of a frequent itemset are frequent. This property also says that all supersets of an infrequent itemset are infrequent. This approach works well on weakly correlated data, such as market-basket data. For overcorrelated data, such as census data, there are other approaches, including Close (Pasquier, Bastide, Taouil, & Lakhal, 1999), CHARM (Zaki & Hsiao, 1999) and Closet (Pei, Han, & Mao, 2000), which are more appropriate.
An interesting study of specific approaches is performed in Zheng, Kohavi and Mason (2001), qualifying CHARM as the algorithm best adjusted to real-world data. Some later improvements of Closet are mentioned in Wang, Han, and Pei (2003) concerning speed and memory usage, while a different support-counting method is proposed in Zaki and Gouda (2001). These approaches search for closed itemsets structured in lattices that are closely related with the concept lattice in formal concept analysis (Ganter & Wille, 1999). The main advantage of a closed-itemset approach is the smaller size of the resulting concept lattice versus the number of frequent itemsets, that is, search space reduction.
In this article, I describe the closed-itemset approaches, considering the fact that an association rule mining process leads to a large amount of results that are, most of the time, difficult for the user to understand. I take into account the interactivity of the data-mining process proposed by Ankerst (2001).
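The second step can be sketched directly from the definition of confidence; the tiny support table below is illustrative.

```python
# Sketch of rule generation from a frequent itemset: given support values for
# frequent itemsets, emit every rule I' -> I \ I' whose confidence
# support(I) / support(I') clears the minimum confidence threshold.
from itertools import combinations

support = {frozenset("A"): 0.6, frozenset("B"): 0.7, frozenset("AB"): 0.5}

def rules_from_itemset(itemset, support, minconf):
    itemset = frozenset(itemset)
    out = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:
                out.append((set(antecedent), set(itemset - antecedent), conf))
    return out

for lhs, rhs, conf in rules_from_itemset("AB", support, minconf=0.7):
    print(lhs, "->", rhs, round(conf, 2))
```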
BACKGROUND

In this section I first describe the use of closed-itemset lattices as the theoretical framework for the closed-itemset approach. The application of Formal Concept Analysis to the association rule problem was first mentioned in Zaki and Ogihara (1998). For more details on lattice theory, see Ganter and Wille (1999).
The closed-itemset approach is described below. I define a context (T, I, D), the Galois connection of a context ((T, I, D), s, t), a concept of the context (X, Y), and the set of concepts in the context, denoted ℬ(T, I, D). The main result in the Formal Concept Analysis theory is as follows:

Fundamental theorem of Formal Concept Analysis (FCA): Let (T, I, D) be a context. Then ℬ(T, I, D) is a complete lattice with join and meet operators given by closed set intersection and reunion operators.

How does FCA apply to the association rule problem? First, T is the set of transaction ids, I is the set of items in the database, and D is the database itself. Second, the mapping s associates to a transaction set X the maximal itemset Y present in all transactions in X. The mapping t associates to any itemset Y the maximal transaction set X, where each transaction comprises all the items in the itemset Y. The resulting frequent concepts are considered in mining applications only for their itemset side, the transaction set side being ignored.
What is the advantage of FCA application? Among the results in Apriori, there are itemsets — I will call them Y and Y′, where Y is included in Y′ — that have the same support. These two itemsets are two distinct results of Apriori, even if they characterize differently the
same transaction set. In fact, the longest itemset is the most precise characterization of that transaction set, all the others being partial and redundant definitions. Due to the observation that s∘t and t∘s are closure operators, the concepts in the lattice of concepts eliminate the presence of any unclosed itemsets. Under these circumstances, the FCA-based approaches have only subunitary-confidence association rules as results, due to the fact that all unitary-confidence association rules are considered redundant behaviors. All unitary-confidence association rules can be expressed through a base, the pseudo-intent set.
If I consider the data-mining process in the vision of Ankerst (2001), the resulting data model in Apriori is a long list of frequent itemsets, while in FCA it is a conceptual structure, namely a lattice of concepts, free of any redundancy.
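On a toy Boolean database the mappings s and t and the closure s(t(Y)) can be written down directly, which is a convenient way to check whether an itemset is closed. The three transactions below are illustrative.

```python
# Sketch of the Galois connection on a toy transaction database: t(Y) is the
# set of transactions containing every item of Y, s(X) is the set of items
# common to every transaction of X, and Y is closed exactly when s(t(Y)) = Y.
D = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}}

def t(Y):
    return {tid for tid, items in D.items() if Y <= items}

def s(X):
    return set.intersection(*(D[tid] for tid in X)) if X else set.union(*D.values())

def closure(Y):
    return s(t(Y))

print(closure({"b"}))                       # {'a', 'b'}: {b} is not closed
print(closure({"a", "b"}) == {"a", "b"})    # True: {a, b} is closed
```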
MAIN THRUST

Many interesting algorithms are issued from the FCA approach to the association rule problem. Although many differences exist between the support counting, memory usage and performance of all these algorithms, the general lines are almost the same. In the following sections, I take into account some of the main characteristics of these algorithms.

Resulting Data Model

Most of the FCA approaches do not generate a lattice of concepts but a spanning tree on that lattice. The argument is that some of the pairs of adjacent concepts in the lattice can be inferred later. The main algorithms that follow this principle are CHARM and Closet.
One approach, called ERA, builds the entire lattice (Dumitriu, 2002). The argument for building the entire lattice resides in the fact that missing pairs of adjacent concepts in the spanning-tree-based approaches can have an identical support count, thus transforming one of the concepts in the pair into a nonconcept. In fact, CHARM has a later stage of results checking from this point of view. Another strong point of the ERA algorithm is that it offers the pseudo-intents as results as well, thus completely characterizing the data.

Itemset-Building Strategy

While Apriori is a breadth-first result-building algorithm, most of the FCA-based algorithms are depth-first. Only ERA has a different strategy: each item in the database is used to enlarge an already existing data model built upon the previously selected items, thus generating at all times a new and extended data model. This strategy generates results layer by layer, just like an onion.
The main difference between the depth-first strategy and the layer-based strategy is that interactivity is offered to the user. Just like peeling an onion, one can take a previously found data model, reduce it or enlarge it with some items, and reach the data view that is the most revealing to the individual. In breadth-first as well as depth-first strategies, it is impossible to provide interactivity to the user due to the fact that all items of interest for the mining process have to be available from the start.

FUTURE TRENDS

I am considering that the most challenging trends would manifest in quantitative attribute mapping as well as in online mining. The first case has a well-known problem in what concerns the quality of numerical-to-Boolean attribute mapping. Solutions to this problem are already considered in Aumann and Lindell (1999), Hong, Kuo, Chi, and Wang (2000), Imberman and Domanski (2001), and Webb (2001), but the problem is far from resolution. An interactive association rule approach may help, in a generate-and-test paradigm, to find the most suitable mappings in a particular database context.
Online mining is very much like cognitive processes: a data model is flooded with facts; either they are consistent with the data model, contradict it, or have a neutral character. When contradicting facts become important (in number, frequency, or any other way), the model has to change. The real problem of the data-mining process is that the model is too large in number of concepts or the concepts are too rigidly related to support change.

CONCLUSION

I have introduced the idea of a data model, expressed as a frequent closed-itemset lattice, with a base for global implications, if needed. In the closed-itemset incremental-mining solution, two new and important operations are applicable to data models: extension and reduction with several items. The main advantages of this approach are:

•	The construction of small models of data, which makes them more understandable for the user; also, the response time is small
•	The extension of data models with a set of new items returns to the user only the supplementary results, hence a smaller amount of results; the response time is considerably smaller than when building the model from scratch
•	Whenever data models are incomprehensible, some of the items can be removed, thus obtaining an easy-to-understand data model
•	The extension or reduction of a model spares the time spent building it, thus reusing knowledge

After many successful attempts to make it faster, the mining process becomes more interactive and flexible due to the increased number of human interventions.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993, May). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data (pp. 207-216), USA.

Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499), Chile.

Ankerst, M. (2001, May). Human involvement and interactivity of the next generation's data mining tools. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 178-188), USA.

Aumann, Y., & Lindell, Y. (1999, August). A statistical theory for quantitative association rules. Proceedings of the International Conference on Knowledge Discovery in Databases (pp. 261-270), USA.

Dumitriu, L. (2002). Interactive mining and knowledge reuse for the closed-itemset incremental-mining problem. Newsletter of the ACM SIG on Knowledge Discovery and Data Mining, 3(2), 28-36. Retrieved from http://www.acm.org/sigs/sigkdd/explorations/issue3-2/contents.htm

Ganter, B., & Wille, R. (1999). Formal concept analysis: Mathematical foundations. Berlin, Germany: Springer-Verlag.

Hong, T.-P., Kuo, C.-S., Chi, S.-C., & Wang, S.-L. (2000, March). Mining fuzzy rules from quantitative data based on the AprioriTid algorithm. Proceedings of the ACM Symposium on Applied Computing (pp. 534-536), Italy.

Imberman, S., & Domanski, B. (2001, August). Finding association rules from quantitative data using data Booleanization. Proceedings of the Seventh Americas Conference on Information Systems, USA.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999, January). Discovering frequent closed itemsets for association rules. Proceedings of the International Conference on Database Theory (pp. 398-416), Israel.

Pei, J., Han, J., & Mao, R. (2000, May). CLOSET: An efficient algorithm for mining frequent closed itemsets. Proceedings of the Conference on Data Mining and Knowledge Discovery (pp. 11-20), USA.

Valtchev, P., Missaoui, R., Godin, R., & Meridji, M. (2002). A framework for incremental generation of frequent closed itemsets using Galois (concept) lattice theory. Journal of Experimental and Theoretical Artificial Intelligence, 14(2/3), 115-142.

Wang, J., Han, J., & Pei, J. (2003, August). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 236-245), USA.

Webb, G. I. (2001, August). Discovering associations with numeric variables. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 383-388), USA.

Zaki, M. J., & Gouda, K. (2001). Fast vertical mining using diffsets (Tech. Rep. No. 01-1). Rensselaer Polytechnic Institute, Department of Computer Science.

Zaki, M. J., & Hsiao, C. J. (1999). CHARM: An efficient algorithm for closed association rule mining (Tech. Rep. No. 99-10). Rensselaer Polytechnic Institute, Department of Computer Science.

Zaki, M. J., & Ogihara, M. (1998, June). Theoretical foundations of association rules. Proceedings of the Third ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 7:1-7:8), USA.

Zheng, Z., Kohavi, R., & Mason, L. (2001, August). Real world performance of association rule algorithms. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 401-406), USA.
KEY TERMS

Association Rule: A pair of frequent itemsets (A, B), where the ratio between the supports of the A∪B and A itemsets is greater than a predefined threshold, denoted minconf.

Closure Operator: Let S be a set and c: ℘(S) → ℘(S); c is a closure operator on S if, for all X, Y ⊆ S, c satisfies the following properties:

1. extension, X ⊆ c(X);
2. monotonicity, if X ⊆ Y, then c(X) ⊆ c(Y);
3. idempotency, c(c(X)) = c(X).

Note: s∘t and t∘s are closure operators when s and t are the mappings in a Galois connection.

Concept: In the Galois connection of the (T, I, D) context, a concept is a pair (X, Y), X ⊆ T, Y ⊆ I, that satisfies s(X) = Y and t(Y) = X. X is called the extent and Y the intent of the concept (X, Y).

Context: A triple (T, I, D) where T and I are sets and D ⊆ T × I. The elements of T are called objects, and the elements of I are called attributes. For any t ∈ T and i ∈ I, we note tDi when t is related to i, that is, (t, i) ∈ D.

Frequent Itemset: An itemset with support higher than a predefined threshold, denoted minsup.

Galois Connection: Let (T, I, D) be a context. Then the mappings

s: ℘(T) → ℘(I), s(X) = { i ∈ I | (∀ t ∈ X) tDi }
t: ℘(I) → ℘(T), t(Y) = { t ∈ T | (∀ i ∈ Y) tDi }

define a Galois connection between ℘(T) and ℘(I), the power sets of T and I, respectively.

Itemset: A set of items in a Boolean database D, I = {i1, i2, ..., in}.

Itemset Support: The ratio between the number of transactions in D comprising all the items in I and the total number of transactions in D (support(I) = |{Ti ∈ D | (∀ ij ∈ I) ij ∈ Ti }| / |D|).

Pseudo-Intent: The set X is a pseudo-intent if X ≠ c(X), where c is a closure operator, and for all pseudo-intents Q ⊂ X, c(Q) ⊆ X.

Cluster Analysis in Fitting Mixtures of Curves


Tom Burr
Los Alamos National Laboratory, USA

INTRODUCTION

One data mining activity is cluster analysis, of which there are several types. One type deserving special attention is clustering that arises due to a mixture of curves. A mixture distribution is a combination of two or more distributions. For example, a bimodal distribution could be a mix with 30% of the values generated from one unimodal distribution and 70% of the values generated from a second unimodal distribution. The special type of mixture we consider here is a mixture of curves in a two-dimensional scatter plot. Imagine a collection of hundreds or thousands of scatter plots, each containing a few hundred points, including background noise, but also containing from zero to four or five bands of points, each having a curved shape. In a recent application (Burr et al., 2001), each curved band of points was a potential thunderstorm event (see Figure 1), as observed from a distant satellite, and the goal was to cluster the points into groups associated with thunderstorm events. Each curve has its own shape, length, and location, with varying degrees of curve overlap, point density, and noise magnitude. The scatter plots of points from curves having small noise resemble a smooth curve with very little vertical variation from the curve, but there can be a wide range in noise magnitude so that some events have large vertical variation from the center of the band. In this context, each curve is a cluster and the challenge is to use only the observations to estimate how many curves comprise the mixture, plus their shapes and locations. To achieve that goal, the human eye could train a classifier by providing cluster labels to all points in example scatter plots. Each point either would belong to a curved region or to a catch-all noise category, and a specialized cluster analysis would be used to develop an approach for labeling (clustering) the points generated according to the same mechanism in future scatter plots.

BACKGROUND

Two key features that distinguish various types of clustering approaches are the assumed mechanism for how the data is generated and the dimension of the data. The data-generation mechanism includes deterministic and stochastic components and often involves deterministic mean shifts between clusters in high dimensions. But there are other settings for cluster analysis. The particular one discussed here (see Figure 1) is a mixture of curves, where any notion of a cluster mean would be quite different from that in more typical clustering applications. Furthermore, although finding clusters in a two-dimensional scatter plot seems less challenging than in higher dimensions (the trained human eye is likely to perform as well as any machine-automated method, although the eye would be slower), complications include overlapping clusters; varying noise magnitude; varying feature and noise density; varying feature shape, location, and length; and varying types of noise (scene-wide and event-specific). Any one of these complications would justify treating the fitting of curve mixtures as an important special case of cluster analysis. Although, as in pattern recognition, the methods discussed in the following require training scatter plots with points labeled according to their cluster memberships, we regard this as cluster analysis rather than pattern recognition, because all scatter plots have from zero to four or five clusters whose shape, length, location, and extent of overlap with other clusters vary among scatter plots. The training data can be used both to train clustering methods and then to judge their quality.
Fitting mixtures of curves is an important special case that has received relatively little attention to date. Fitting mixtures of probability distributions dates to Titterington et al. (1985), and several model-based clustering schemes have been developed (Banfield & Raftery, 1993; Bensmail et al., 1997; Dasgupta & Raftery, 1998), along with associated theory (Leroux, 1992). However, these models assume that the mixture is a mixture of probability distributions (often Gaussian, which can be long and thin, ellipsoidal, or more circular) rather than curves. More recently, methods for mixtures of curves have been introduced, including a mixture of principal curves model (Stanford & Raftery, 2000), a mixture of regression models (Gaffney & Smyth, 2003; Hurn, Justel, & Robert, 2003; Turner, 2000), and mixtures of local regression models (i.e., smooth curves obtained using splines or nonparametric kernel smoothers, for example).
[Figure 1. Four mixture examples containing (a) one, (b) one, (c) two, and (d) zero thunderstorm events plus background noise, shown as x-y scatter plots. The label 1 is for the first thunderstorm in the scene, 2 for the second, and so forth, and the highest integer label is reserved for the catch-all noise class. Therefore, in (d), because the highest integer is 1, there is no thunderstorm present (the mixture is all noise).]

MAIN THRUST

Several methods have been proposed for fitting mixtures of curves. In method 1 (Burr et al., 2001), first use density estimation to reject the background noise points, such as those labeled as 2 in Figure 1a. For example, each point in the scatter plot has a distance to its kth nearest neighbor, which can be used as a local density estimate (Silverman, 1986) to reject noise points. Next, use a distance measure that favors long, thin clusters (e.g., let the distance between clusters be the minimum distance between any point in the first cluster and any point in the second cluster), together with standard hierarchical clustering, to identify at least the central portion of each cluster. Alternatively, model-based clustering favoring long, thin Gaussian shapes (Banfield & Raftery, 1993) or the straight-line fitting methods in Murtagh and Raftery (1984) or Campbell et al. (1997) are effective for finding the central portion of each cluster. A curve fitted to this central portion can be extrapolated and then used to accept other points as members of the cluster. Because hierarchical clustering cannot accommodate overlapping clusters, this method assumes that the central portions of each cluster are non-overlapping. Points away from the central portion of one cluster that lie close to the curve fitted to the central portion of the cluster can overlap with points from another cluster. The noise points are identified initially as those having low local density (away from the central portion of any cluster) but, during the extrapolation, can be judged to be a cluster member if they lie near the extrapolated curve. To increase robustness, method 1 can be applied twice, each time using slightly different inputs (such as the decision threshold for the initial noise rejection and the criteria for accepting points into a cluster that are close to the extrapolated region of the cluster's curve). Then, only clusters that are identified both times are accepted.
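The noise-rejection step of method 1 can be sketched as follows; the threshold rule (a multiple of the median kth-neighbour distance) and the toy points are illustrative choices, not those of Burr et al. (2001).

```python
# Sketch of density-based noise rejection: a point's distance to its kth
# nearest neighbour serves as a (inverse) local density estimate, and points
# with unusually large distances are flagged as background noise.
import math

def kth_neighbour_distance(points, k):
    out = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(dists[k - 1])
    return out

def flag_noise(points, k=3, factor=2.0):
    d = kth_neighbour_distance(points, k)
    threshold = factor * sorted(d)[len(d) // 2]   # factor times the median
    return [di > threshold for di in d]

# A dense band of points along a curve plus one isolated noise point.
points = [(x / 10.0, (x / 10.0) ** 2) for x in range(10)] + [(5.0, 0.5)]
print(flag_noise(points))   # last entry True for the isolated point
```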
Method 2 uses the minimized integrated squared error (ISE, or L2 distance) (Scott, 2002; Scott & Szewczyk, 2002) and appears to be a good approach for fitting mixture models, including mixtures of regression models, as is our focus here. Qualitatively, the minimum L2 distance method tries to find the largest portion of the data that matches the model. In our context, at each stage, the model is all the points belonging to a single curve plus everything else. Therefore, we first seek cluster 1 having the most points, regard the remaining points as noise, remove the cluster, then repeat the procedure in search of feature 2, and so on, until a stop criterion is reached. It also should be possible to estimate the number of components in the mixture in the first evaluation of the data, but that approach has not yet been attempted. Scott (2002) has shown that in the parametric setting with model f(x|θ), we estimate θ using

θ̂ = arg min_θ ∫ [f(x|θ) − f(x|θ0)]² dx,

where the true parameter θ0 is unknown. It follows that a reasonable estimator minimizing the parametric ISE criterion is

θ̂_L2E = arg min_θ [ ∫ f(x|θ)² dx − (2/n) Σ_{i=1}^{n} f(xi|θ) ].

This assumes that the correct parametric family is used; the concept can be extended to include the case in which the assumed parametric form is incorrect in order to achieve robustness.
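For a single Gaussian component the integral term of the L2E criterion has the closed form 1/(2σ√π), so the estimator can be sketched with a crude grid search. The data and the search grid below are illustrative, and this is not Scott's implementation.

```python
# Sketch of the parametric minimum-ISE (L2E) criterion for one Gaussian
# component N(mu, sigma): minimize 1/(2*sigma*sqrt(pi)) - (2/n)*sum_i f(x_i).
# A crude grid search on toy data (a cluster near 5 plus two outliers).
import math

def l2e_criterion(data, mu, sigma):
    n = len(data)
    pdf = lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return 1.0 / (2.0 * sigma * math.sqrt(math.pi)) - (2.0 / n) * sum(pdf(x) for x in data)

data = [4.8, 4.9, 5.0, 5.1, 5.2, 5.0, 4.95, 5.05, 20.0, 25.0]
candidates = [(m / 10.0, s / 10.0) for m in range(0, 300) for s in range(1, 30)]
mu_hat, sigma_hat = min(candidates, key=lambda p: l2e_criterion(data, *p))
print(mu_hat, sigma_hat)   # near the dominant cluster, largely ignoring the outliers
```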
Method 3 (principal curve clustering with noise) was developed by Stanford and Raftery (2000) to locate principal curves in noisy spatial point process data. Principal curves were introduced by Hastie and Stuetzle (1989). A principal curve is a smooth curvilinear summary of p-dimensional data. It is a nonlinear generalization of the first principal component line that uses a local averaging method. Stanford and Raftery (2000) developed an algorithm that first uses hierarchical principal curve clustering (HPCC, which is a hierarchical and agglomerative clustering method) and next uses iterative relocation (reassigning points to new clusters) based on the classification estimation-maximization (CEM) algorithm. The probability model includes a principal curve probability model for the feature clusters and a homogeneous Poisson process model for the noise cluster. More specifically, let X denote the set of
observations, x1, x2, ..., xn, and let C be a partition consisting of clusters C0, C1, ..., CK, where cluster Cj contains nj points. The noise cluster is C0. Feature points are assumed to be distributed uniformly along the true underlying feature, so that projections onto the feature's principal curve are randomly drawn from a uniform U(0, λj) distribution, where λj is the length of the jth curve. An approximation to the probability for 0, 1, ..., 5 clusters is available from the Bayesian Information Criterion (BIC), which is defined as BIC = 2 log(L(X|θ)) − M log(n), where L is the likelihood of the data X, and M is the number of fitted parameters, so M = K(DF + 2) + K + 1. For each of K features, we fit two parameters (σj and λj defined below) and a curve having DF degrees of freedom; there are K mixing proportions (πj defined below), and the estimate of scene area is used to estimate the noise density. The likelihood L satisfies

L(X|θ) = ∏_{i=1}^{n} L(xi|θ), where L(xi|θ) = Σ_{j=0}^{K} πj L(xi|θ, xi ∈ Cj)

is the mixture likelihood (πj is the probability that point i belongs to feature j), and

L(xi|θ, xi ∈ Cj) = (1/λj) (1/(√(2π) σj)) exp(−||xi − f(λij)||² / (2σj²))

for the feature clusters (||xi − f(λij)|| is the Euclidean distance from xi to its projection point f(λij) on curve j), and L(xi|θ, xi ∈ C0) = 1/Area for the noise cluster. Space will not permit a complete description of the HPCC-CEM method, but briefly, the HPCC steps are as follows:

(1) Make an initial estimate of noise points and remove them;
(2) form an initial clustering with at least seven points in each cluster;
(3) fit a principal curve to each cluster; and
(4) calculate a clustering criterion V = VAbout + aVAlong, where VAbout measures the orthogonal distances to the curve (residual error sum of squares), and VAlong measures the variance in arc length distances between projection points on the curve.

Minimizing V (the sum is over all clusters) will lead to clusters with regularly spaced points along the curve and tightly grouped around it. Large values of a will cause the method to avoid clusters with gaps, and small values of a favor thinner clusters. Clustering (merging clusters) continues until V stops decreasing. The flexibility provided by such a clustering criterion (avoid or allow gaps and avoid or favor thin clusters) is potentially extremely useful, and method 3 is currently the only published method to include it.
Method 4 (Turner, 2000) uses the well-known EM (estimation-maximization) algorithm (Dempster, Laird & Rubin, 1977) to handle a mixture of regression models. Therefore, it is similar to method 3 in that a mixture model is specified, but it differs in that the curve is fit using parametric regression rather than principal curves, so

L(xi|θ, xi ∈ Cj) = fij = (1/σj) φ((yi − xiβj)/σj),

where φ is the standard Gaussian density. Also, Turner's implementation did not introduce a clustering criterion, but it did attempt to estimate the number of clusters as follows. Introduce an indicator variable zi of which component of the mixture generated observation yi, and iteratively maximize

Q = Σ_{i=1}^{n} Σ_{k=1}^{K} γik ln(fik)

with respect to θ, where θ is the complete parameter set (πj, βj, σj) for each class. The γik satisfy γik = πk fik / Σ_{j=1}^{K} πj fij. Then Q is maximized with respect to θ by weighted regression of yi on x1, ..., xn with weights γik, and each σk² is given by σk² = Σ_{i=1}^{n} γik (yi − xiβk)² / Σ_{i=1}^{n} γik. In practice, the components of θ come from the value of θ in the previous iteration, and πk = (1/n) Σ_{i=1}^{n} γik. A difficulty in choosing the number of components in the mixture, each with unknown mixing probability, is that the likelihood ratio statistic has an unknown distribution. Therefore, Turner (2000) implemented a bootstrap strategy to choose between 1 and 2 components, between 2 and 3, and so forth. The strategy to choose between K and K + 1 components is (a) calculate the log-likelihood ratio statistic Q for a model having K and for a model having K + 1 components; (b) simulate data from the fitted K-component model; (c) fit the K and K + 1 component models to the simulated data to obtain the corresponding statistic Q*; and (d) compute the p-value for Q as p = (1/n) Σ_{i=1}^{n} I(Q ≤ Q*), where the indicator I(Q ≤ Q*) = 1 if Q ≤ Q* and 0 otherwise.
To summarize, method 1 avoids the mathematics of mixture fitting and seeks high-density regions to use as a basis for extrapolation. It also checks the robustness of its solution by repeating the fitting using different input parameters and comparing results. Method 2 uses a likelihood for one cluster and everything else, and then removes the highest-density cluster and repeats the procedure. Methods 3 and 4 both include formal mixture fitting (known to be difficult), with method 3 assuming the curve is well modeled as a principal curve and method 4 assuming the curve is well fit using parametric regression. Only method 3 allows the user to favor or penalize clusters with gaps. All methods can be tuned to accept only thin (i.e., small residual variance) clusters.
FUTURE TRENDS

Although we focus here on curves in two dimensions, clearly there are analogous features in higher dimensions, such as principal curves in higher dimensions. Fitting mixtures in two dimensions is fairly straightforward, but performance is rarely as good as desired. More choices for probability distributions are needed; for example, the noise in Burr et al. (2001) was a non-homogeneous Poisson process, which might be better handled with adaptively chosen regions of the scene. Also, very little has been published regarding the measure of difficulty of the curve-mixture problem, although Hurn, Justel, and Robert (2003) provided a numerical Bayesian approach for mixtures involving switching regression. The term switching regression describes the situation in which the likelihood is nearly unchanged when some of the cluster labels are switched. For example, imagine that clusters 1 and 2 were slightly closer in Figure 1c, and then consider changes to the likelihood if we switched some 2 and 1 labels. Clearly, if curves are not well separated, we should not expect high clustering and classification success. In the case where groups are separated by mean shifts, it is clear how to define group separation. In the case where groups are curves that overlap to varying degrees, one could envision a few options for defining group separation. For example, the minimum distance between a point in one cluster and a point in another cluster could define the cluster separation and, therefore, also the measure of difficulty. Depending on how we define the optimum, it is possible that performance can approach the theoretical optimum. We define performance first by determining whether a cluster was found, and then by reporting the false positive and false negative rates for each found cluster.
Another area of valuable research would be to accommodate mixtures of local regression models (i.e., smooth curves obtained using splines or nonparametric kernel smoothers) (Green & Silverman, 1994). Because local curve smoothers such as splines are extremely successful in the case where the mixture is known to contain only one curve, it is likely that the flexibility provided by local regression (parametric or nonparametric) would be desirable in the broader context of fitting a mixture of curves.
Existing software is fairly accessible for the methods described, assuming users have access to a statistical programming language such as S-PLUS (2003). Executable code to accommodate a user-specified input format would, of course, also be a welcome future contribution, but it is now reasonable to empirically compare each method in each application of fitting mixtures of curves.

CONCLUSION

Fitting mixtures of curves with noise in two dimensions is a specialized type of cluster analysis. It is a manageable but fairly challenging and unique cluster analysis task. Four options have been described briefly. Example performance results for these methods, all applied to the same data for one application, are available (Burr et al., 2001). Results for methods 2, 3, and 4 also are available in their respective references (on different data sets), and results for a numerical Bayesian approach using a mixture of regressions, with attention to the case having nearly overlapping clusters, have also been published recently (Hurn, Justel & Robert, 2003).

REFERENCES

Banfield, J., & Raftery, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.

Bensmail, H., Celeux, G., Raftery, A., & Robert, C. (1997). Inference in model-based cluster analysis. Statistics and Computing, 7, 1-10.

Burr, T., Jacobson, A., & Mielke, A. (2001). Identifying storms in noisy radio frequency data via satellite: An application of density estimation and cluster analysis. Proceedings of US Army Conference on Applied Statistics, Santa Fe, New Mexico.

Campbell, J., Fraley, C., Murtagh, F., & Raftery, A. (1997). Linear flaw detection in woven textiles using model-based clustering. Pattern Recognition Letters, 18, 1539-1548.

Dasgupta, A., & Raftery, A. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93(441), 294-302.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood for incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Gaffney, S., & Smyth, P. (2003). Curve clustering with random effects regression mixtures. Proceedings of the 9th International Conference on AI and Statistics, Florida.

Green, P., & Silverman, B. (1994). Nonparametric regression and generalized linear models. London: Chapman and Hall.

Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84, 502-516.
Hurn, M., Justel, A., & Robert, C. (2003). Estimating mixtures of regressions. Journal of Computational and Graphical Statistics, 12, 55-74.

Leroux, B. (1992). Consistent estimation of a mixing distribution. The Annals of Statistics, 20, 1350-1360.

Murtagh, F., & Raftery, A. (1984). Fitting straight lines to point patterns. Pattern Recognition, 17, 479-483.

Scott, D. (2002). Parametric statistical modeling by minimum integrated square error. Technometrics, 43(3), 274-285.

Scott, D., & Szewczyk, W. (2002). From kernels to mixtures. Technometrics, 43(3), 323-335.

Silverman, B. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.

S-PLUS Statistical Programming Language. (2003). Seattle, WA: Insightful Corp.

Stanford, D., & Raftery, A. (2000). Finding curvilinear features in spatial point patterns: Principal curve clustering with noise. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6), 601-609.

Titterington, D., Smith, A., & Makov, U. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.

Turner, T. (2000). Estimating the propagation rate of a viral infection of potato plants via mixtures of regressions. Applied Statistics, 49(3), 371-384.

KEY TERMS

Bayesian Information Criterion: An approximation to the Bayes factor, which can be used to estimate the Bayesian posterior probability of a specified model.

Bootstrap: A resampling scheme in which surrogate data is generated by resampling the original data or sampling from a model that was fit to the original data.

Cluster Analysis: Dividing objects into groups using varying assumptions regarding the number of groups and the deterministic and stochastic mechanisms that generate the observed values.

Estimation-Maximization Algorithm: An algorithm for computing maximum likelihood estimates from incomplete data. In the case of fitting mixtures, the group labels are the missing data.

Mixture of Distributions: A combination of two or more distributions in which observations are generated from distribution i with probability pi, where Σ pi = 1.

Poisson Process: A stochastic process for generating observations in which the number of observations in a region (a region in space or time, for example) is distributed as a Poisson random variable.

Principal Curve: A smooth, curvilinear summary of p-dimensional data. It is a nonlinear generalization of the first principal component line that uses a local average of p-dimensional data.

Probability Density Function: A function that can be summed (for discrete-valued random variables) or integrated (for interval-valued random variables) to give the probability of observing values in a specified set.

Probability Density Function Estimate: An estimate of the probability density function. One example is the histogram for densities that depend on one variable (or multivariate histograms for multivariate densities). However, the histogram has known deficiencies involving the arbitrary choice of bin width and locations. Therefore, the preferred density function estimator, which is a smoother estimator that uses local weighted sums with weights determined by a smoothing parameter, is free from bin width and location artifacts.

Scatter Plot: One of the most common types of plots, also known as an x-y plot, in which the first component of a two-dimensional observation is displayed in the horizontal dimension and the second component is displayed in the vertical dimension.

Clustering Analysis and Algorithms


Xiangji Huang
York University, Canada

INTRODUCTION

Clustering is the process of grouping a collection of objects (usually represented as points in a multidimensional space) into classes of similar objects. Cluster analysis is a very important tool in data analysis. It is a set of methodologies for automatic classification of a collection of patterns into clusters based on similarity. Intuitively, patterns within the same cluster are more similar to each other than patterns belonging to a different cluster. It is important to understand the difference between clustering (unsupervised classification) and supervised classification.
Cluster analysis has wide applications in data mining, information retrieval, biology, medicine, marketing, and image segmentation. With the help of clustering algorithms, a user is able to understand natural clusters or structures underlying a data set. For example, clustering can help marketers discover distinct groups and characterize customer groups based on purchasing patterns in business. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
Typical pattern clustering activity involves the following steps: (1) pattern representation (including feature extraction and/or selection), (2) definition of a pattern proximity measure appropriate to the data domain, (3) clustering, (4) data abstraction, and (5) assessment of output.

BACKGROUND

General references regarding clustering include Hartigan (1975), Jain and Dubes (1988), Kaufman and Rousseeuw (1990), Mirkin (1996), Jain, Murty, and Flynn (1999), and Ghosh (2002). A good introduction to contemporary data-mining clustering techniques can be found in Han and Kamber (2001). Early clustering methods before the 90s, such as k-means (Hartigan, 1975) and PAM and CLARA (Kaufman & Rousseeuw, 1990), are generally suitable for small data sets. CLARANS (Ng & Han, 1994) made improvements to CLARA in quality and scalability based on randomized search. After CLARANS, many algorithms were proposed to deal with large data sets, such as BIRCH (Zhang, Ramakrishnan, & Livny, 1996), CURE (Guha, Rastogi, & Shim, 1998), Squashing (DuMouchel, Volinsky, Johnson, Cortes, & Pregibon, 1999) and Data Bubbles (Breunig, Kriegel, Kröger, & Sander, 2001).

MAIN THRUST

There exist a large number of clustering algorithms in the literature. In general, major clustering algorithms can be classified into the following categories.

Hierarchical Clustering

Hierarchical clustering builds a cluster hierarchy, or a tree of clusters also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows the exploration of data on different levels of granularity. Hierarchical clustering can be further classified into agglomerative (bottom-up) and divisive (top-down) hierarchical clustering. An agglomerative clustering starts with one-point (singleton) clusters and recursively merges two or more of the most appropriate clusters. A divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (for example, the requested number k of clusters) is achieved. Advantages of hierarchical clustering include (a) embedded flexibility regarding the level of granularity, (b) ease of handling any form of similarity or distance, and (c) applicability to any attribute types. Disadvantages of hierarchical clustering are (a) vagueness of termination criteria, and (b) the fact that most hierarchical algorithms do not revisit once-constructed clusters with the purpose of improving them.
generally suitable for small data sets. CLARANS (Ng & duced the hierarchical agglomerative clustering algo-
Han, 1994) made improvements to CLARA in quality rithm called CURE (Clustering Using Representatives).
and scalability based on randomized search. After This algorithm has a number of novel features of general
CLARANS, many algorithms were proposed to deal with significance. It takes special care with outliers and with

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Clustering Analysis and Algorithms

Although CURE works with numerical attributes (particularly low-dimensional spatial data), the algorithm ROCK, developed by the same researchers (Guha, Rastogi, & Shim, 1999), targets hierarchical agglomerative clustering for categorical attributes.

Partitioning Clustering

Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters are formed to optimize a partitioning criterion, often called a similarity function, such as distance, so that the objects within a cluster are similar, whereas the objects of different clusters are dissimilar in terms of the database attributes.

Partitioning clustering algorithms have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitioning algorithm is the choice of the number of desired output clusters. A seminal paper (Dubes, 1987) provides guidance on this key design decision. The partitioning techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (defined over all the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all the runs is used as the output clustering. The most well-known and commonly used partitioning algorithms are k-means, k-medoids, and their variations.

K-Means Method

The k-means algorithm (Hartigan, 1975) is by far the most popular clustering tool used in scientific and industrial applications. It proceeds as follows. First, it randomly selects k objects, each of which initially represents a cluster mean or centre. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the squared-error criterion is used, defined as

$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left| p - m_i \right|^2$

where E is the sum of square-error for all objects in the database, p is a point in space representing a given object, and $m_i$ is the mean of cluster $C_i$ (both p and $m_i$ are multidimensional).

K-Medoids Method

In the k-medoids algorithm, a cluster is represented by one of its points. Instead of taking the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in a cluster. The basic strategy of the k-medoids clustering algorithms is to find k clusters in n objects by first arbitrarily finding a representative object (the medoid) for each cluster. Each remaining object is clustered with the medoid to which it is the most similar. The strategy then iteratively replaces one of the medoids by one of the nonmedoids as long as the quality of the resulting clustering is improved. This quality is estimated by using a cost function that measures the average dissimilarity between an object and the medoid of its cluster. It is important to understand that k-means is a greedy algorithm, but k-medoids is not.

Density-Based Clustering

Heuristic clustering algorithms (such as partitioning methods) work well for finding spherical-shaped clusters in databases that are not very large. To find clusters with complex shapes and for clustering very large data sets, partitioning-based algorithms need to be extended. Most partitioning-based algorithms cluster objects based on the distance between objects. Such methods can find only spherical shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. To discover clusters with arbitrary shape, density-based clustering algorithms have been developed. These algorithms typically regard clusters as dense regions of objects in the data space that are separated by regions of low density.

The general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold. That is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) is a typical density-based algorithm that grows clusters according to a density threshold. OPTICS (Ankerst, Breuning, Kriegel, & Sander, 1999) is a density-based algorithm that computes an augmented cluster ordering for automatic and interactive cluster analysis.
DENCLUE (Hinneburg & Keim, 1998) is another clustering algorithm based on a set of density distribution functions. It differs from partition-based algorithms not only by accepting arbitrary shape clusters but also by how it handles noise.

Grid-Based Clustering

Grid-based algorithms quantize the object space into a finite number of cells that form a grid structure on which all the operations for clustering are performed. To some extent, the grid-based methodology reflects a technical point of view. The category is eclectic: It contains both partitioning and hierarchical algorithms. The main advantage of this method is its fast processing time, which is typically independent of the number of data objects, yet dependent on only the number of cells in each dimension in the quantized space.

Some typical examples of the grid-based algorithms include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid and density-based approach for clustering in a high-dimensional data space. The algorithm STING (Wang, Yang, & Muntz, 1997) works with numerical attributes (spatial data) and is designed to facilitate region-oriented queries. In doing so, STING constructs data summaries in a way similar to BIRCH. However, it assembles statistics in a hierarchical tree of nodes that are grid-cells. The algorithm WaveCluster (Sheikholeslami, Chatterjee, & Zhang, 1998) works with numerical attributes and has an advanced multiresolution. It is also known for some outstanding properties such as (a) a high quality of clusters, (b) the ability to work well in relatively high-dimensional spatial data, (c) the successful handling of outliers, and (d) O(N) complexity. WaveCluster, which applies wavelet transforms to filter the data, is based on ideas of signal processing. The algorithm CLIQUE (Agrawal, Gehrke, Gunopulos, & Raghavan, 1998) for numerical attributes is fundamental in subspace clustering. It combines the ideas of density-based clustering, grid-based clustering, and the induction through dimensions similar to the Apriori algorithm in association rule learning.

Model-Based Clustering

Model-based clustering algorithms attempt to optimize the fit between the given data and some mathematical model. For example, clustering algorithms based on probability models offer a principled alternative to heuristic-based algorithms. In particular, the model-based approach assumes that the data are generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The Gaussian mixture model has been shown to be a powerful tool for many applications (Banfield & Raftery, 1993). With the underlying probability model, the problems of determining the number of clusters and of choosing an appropriate clustering algorithm become probabilistic model choice problems (Dasgupta & Raftery, 1998). This provides a great advantage over heuristic clustering algorithms, for which no established method to determine the number of clusters or the best clustering algorithm exists. Model-based clustering follows two major approaches: a probabilistic approach or a neural network approach.

Probabilistic Approach

In the probabilistic approach, data are considered to be samples independently drawn from a mixture model of several probability distributions (McLachlan & Basford, 1988). The main assumption is that data points are generated by (1) randomly picking a model j with probability $p_j$ and (2) drawing a point x from the corresponding distribution. The area around the mean of each distribution constitutes a natural cluster. We associate the cluster with the corresponding distribution's parameters such as mean, variance, and so forth. Each data point carries not only its observable attributes, but also a hidden cluster ID. Each point x is assumed to belong to one and only one cluster.

Probabilistic clustering has some important features. For example, it (a) can be modified to handle records of complex structure, (b) can be stopped and resumed with consecutive batches of data, and (c) results in an easily interpretable cluster system. Because the mixture model has a clear probabilistic foundation, the determination of the most suitable number of clusters k becomes a more tractable task. From a data-mining perspective, an excessive parameter set causes overfitting, but from a probabilistic perspective, the number of parameters can be addressed within the Bayesian framework. An important property of probabilistic clustering is that the mixture model can be naturally generalized to clustering heterogeneous data. However, statistical mixture models often require a quadratic space, and the EM algorithm converges relatively slowly, making scalability an issue.

Artificial Neural Networks Approach

Artificial neural networks (ANNs) (Hertz, Krogh, & Palmer, 1991) are inspired by biological neural networks.
ANNs have been used extensively over the past four decades for both classification and clustering (Jain & Mao, 1994). The ANN approach to clustering has two prominent methods: competitive learning and self-organizing feature maps. Both involve competing neural units. Some of the features of ANNs that are important in pattern clustering are that they (a) process numerical vectors and so require patterns to be represented with quantitative features only, (b) are inherently parallel and distributed processing architectures, and (c) may learn their interconnection weights adaptively. The neural network approach to clustering tries to emulate actual brain processing. Further research is needed to make it readily applicable to very large databases, due to long processing times and the intricacies of complex data.

Other Clustering Techniques

Traditionally, each pattern belongs to one and only one cluster. Hence, the clusters resulting from this kind of clustering are disjoint. Fuzzy clustering extends this notion to associate each object with every cluster using a membership function (Zadeh, 1965). Another approach is constraint-based clustering, introduced in Tung, Ng, Lakshmanan, and Han (2001). This approach has important applications in clustering two-dimensional spatial data in the presence of obstacles. Another approach used in clustering analysis is the Genetic Algorithm (Goldberg, 1989). An example is the GGA (Genetically Guided Algorithm) for fuzzy and hard k-means (Hall, Ozyurt, & Bezdek, 1999).

FUTURE TRENDS

Choosing a clustering algorithm for a particular problem can be a daunting task. One major challenge in using a clustering algorithm on a specific problem lies not in performing the clustering itself, but rather in choosing the algorithm and the values of the associated parameters. Clustering algorithms also face problems of scalability, both in terms of computing time and memory requirements. Despite the ongoing exponential increases in the power of computers, scalability still remains a major issue in many clustering applications. In commercial data-mining applications, the quantity of the data to be clustered can far exceed the main memory capacity of the computer, making both time and space efficiency critical; this issue is addressed by clustering systems in the database community such as BIRCH.

This leads to the following set of continuing research in clustering, and in particular data mining: (1) Extend clustering algorithms to handle very large databases, such as real-life terabyte data sets. (2) Gracefully eliminate the need for a priori assumptions about the data. (3) Use good sampling and data compression methods to improve efficiency and speed up clustering algorithms. (4) Cluster extremely large and high-dimensional data.

CONCLUSION

This article describes five major approaches to clustering, in addition to some other clustering algorithms. Each has both positive and negative aspects, and each is suitable for different types of data and different assumptions about the cluster structure of the input. Clustering is a process of grouping data items based on a measure of similarity. Clustering is also a subjective process; the same set of data items often needs to be partitioned differently for different applications. This subjectivity makes the process of clustering difficult, because a single algorithm or approach is not adequate to solve every clustering problem. However, clustering is an interesting, useful, and challenging problem. It has great potential in applications such as object representation, image segmentation, information filtering and retrieval, and analyzing gene expression data.

REFERENCES

Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. Proceedings of the ACM SIGMOD Conference (pp. 94-105), USA.

Ankerst, M., Breuning, M., Kriegel, H., & Sander, J. (1999). OPTICS: Ordering points to identify clustering structure. Proceedings of the ACM SIGMOD Conference (pp. 49-60), USA.

Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.

Breuning, M., Kriegel, H., Kröger, H., & Sander, J. (2001). Data bubbles: Quality preserving performance boosting for hierarchical clustering. Proceedings of the ACM SIGMOD Conference, USA.

Dasgupta, A., & Raftery, A. E. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93, 294-302.

Dubes, R. C. (1987). How many clusters are best? An experiment. Pattern Recognition, 20, 645-663.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., & Pregibon, D. (1999). Squashing flat files flatter. Proceedings of the ACM SIGKDD Conference (pp. 6-15), USA.

Ester, M., Kriegel, H., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clustering in large spatial databases with noise. Proceedings of the ACM SIGKDD Conference (pp. 226-231), USA.

Ghosh, J. (2002). Scalable clustering methods for data mining. In N. Ye (Ed.), Handbook of data mining. Lawrence Erlbaum.

Goldberg, D. (1989). Genetic algorithm in search, optimization and machine learning. Addison-Wesley.

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD Conference (pp. 73-84), USA.

Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. Proceedings of the 15th International Conference on Data Engineering (pp. 512-521), Australia.

Hall, L. O., Ozyurt, B., & Bezdek, J. C. (1999). Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation, 3(2), 103-112.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Hartigan, J. (1975). Clustering algorithms. New York: Wiley.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison Wesley Longman.

Hinneburg, A., & Keim, D. (1998). An efficient approach to clustering large multimedia databases with noise. Proceedings of the ACM SIGMOD Conference (pp. 58-65), USA.

Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall.

Jain, A., & Mao, J. (1994). Neural networks and pattern recognition. In J. M. Zurada, R. J. Marks, & C. J. Robinson (Eds.), Computational intelligence: Imitating life (pp. 194-212).

Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley.

Lu, S. Y., & Fu, K. S. (1978). A sentence-to-sentence clustering procedure for pattern analysis. IEEE Transactions on Systems, Man, and Cybernetics, 8, 381-389.

McLachlan, G. J., & Basford, K. D. (1988). Mixture models: Inference and application to clustering. New York: Dekker.

Mirkin, B. (1996). Mathematical classification and clustering. Kluwer Academic.

Ng, R., & Han, J. (1994). Efficient and effective clustering method for spatial data mining. Proceedings of the 20th Conference on Very Large Data Bases (pp. 144-155), Chile.

Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proceedings of the 24th Conference on Very Large Data Bases (pp. 428-439), USA.

Tung, A. K. H., Hou, J., & Han, J. (2001). Spatial clustering in the presence of obstacles. Proceedings of the 17th International Conference on Data Engineering (pp. 359-367), Germany.

Tung, A. K. H., Ng, R. T., Lakshmanan, L. V. S., & Han, J. (2001). Constraint-based clustering in large databases. Proceedings of the Eighth ICDT, London.

Wang, W., Yang, J., & Muntz, R. (1997). STING: A statistical information grid approach to spatial data mining. Proceedings of the 23rd Conference on Very Large Data Bases (pp. 186-195), Greece.

Zadeh, L. A. (1965). Fuzzy sets. Information Control, 8, 338-353.

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD Conference (pp. 103-114), Canada.

KEY TERMS

Apriori Algorithm: An efficient association rule mining algorithm developed by Agrawal, in 1993. Apriori employs a breadth-first search and uses a hash tree structure to count candidate item sets efficiently. The algorithm generates candidate item sets of length k from item sets of length k-1. Then, the patterns that have an infrequent subpattern are pruned. Following that, the whole transaction database is scanned to determine frequent item sets among the candidates. For determining
frequent items in a fast manner, the algorithm uses a hash tree to store candidate item sets.

Association Rule: A rule in the form of "if this, then that." It states a statistical correlation between the occurrence of certain attributes in a database.

Customer Relationship Management: The process by which companies manage their interactions with customers.

Data Mining: The process of efficient discovery of actionable and valuable patterns from large databases.

Feature Selection: The process of identifying the most effective subset of the original features to use in data analysis, such as clustering.

Overfitting: The effect on data analysis, data mining, and biological learning of training too closely on limited available data and building models that do not generalize well to new unseen data.

Supervised Classification: Given a collection of labeled patterns, the problem in supervised classification is to label a newly encountered but unlabeled pattern. Typically, the given labeled patterns are used to learn the descriptions of classes that in turn are used to label a new pattern.
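As a concrete companion to the K-Means Method section of this article, the following is a minimal sketch of the procedure and of the squared-error criterion E. The data, the value of k, and all names are illustrative assumptions for this example, not part of the original article.

import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means: random initial centres, assign, recompute means."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each object to the cluster with the nearest mean.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centres = np.array([points[labels == i].mean(axis=0)
                                if np.any(labels == i) else centres[i]
                                for i in range(k)])
        if np.allclose(new_centres, centres):
            break  # the criterion function has converged
        centres = new_centres
    # Squared-error criterion: E = sum_i sum_{p in C_i} |p - m_i|^2
    error = sum(np.sum((points[labels == i] - centres[i]) ** 2) for i in range(k))
    return labels, centres, error

# Illustrative two-dimensional data with three loose groups.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 4, 8)])
labels, centres, error = kmeans(data, k=3)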
Clustering in the Identification of Space Models

Maribel Yasmina Santos
University of Minho, Portugal

Adriano Moreira
University of Minho, Portugal

Sofia Carneiro
University of Minho, Portugal

INTRODUCTION

Clustering is the process of grouping a set of objects into clusters so that objects within a cluster have high similarity with each other, but are as dissimilar as possible to objects in other clusters. Dissimilarities are measured based on the attribute values describing the objects (Han & Kamber, 2001).

Clustering, as a data mining technique (Cios, Pedrycz, & Swiniarski, 1998; Groth, 2000), has been widely used to find groups of customers with similar behavior or groups of items that are bought together, allowing the identification of the clients' profile (Berry & Linoff, 2000). This article presents another use of clustering, namely in the creation of Space Models.

Space Models represent divisions of the geographic space in which the several geographic regions are grouped according to their similarities with respect to a specific indicator (values of an attribute). Space Models represent natural divisions of the geographic space based on some geo-referenced data.

This article addresses the development of a clustering algorithm for the creation of Space Models: STICH (Space Models Identification Through Hierarchical Clustering). The Space Models identified, integrating several clusters, point out particularities of the analyzed data, namely the exhibition of clusters with outliers, regions whose behavior is strongly different from the other analyzed regions.

In the work described in this article, some assumptions were adopted for the creation of Space Models through a clustering process, namely:

• Space Models must be created by looking at the data values available and no constraints must be imposed for their identification.
• Space Models can include clusters of different shapes and sizes.
• Space Models are independent of specific domain knowledge, like the specification of the final number of clusters.

The following sections, in outline, include: (i) an overview of clustering, its methods and techniques; (ii) the STICH algorithm, its assumptions, its implementation and the results for a sample dataset; (iii) future trends; and (iv) a conclusion with some comments about the proposed algorithm.

BACKGROUND

Clustering (Grabmeier, 2002; Jain, Murty, & Flynn, 1999; Zaït & Messatfa, 1997) is a discovery process (Fayyad & Stolorz, 1997) that identifies homogeneous groups of segments in a dataset.

Han and Kamber (2001) state that clustering is a challenging field of research integrating a set of special requirements. Some of the typical requirements of clustering in data mining are:

• Scalability, in order to allow the analysis of large datasets, since clustering on a sample of a database may lead to biased results.
• Discovery of clusters with arbitrary shapes, since clusters can be of any shape. Some existing clustering algorithms identify clusters that tend to be spherical, with similar size and density.
• Minimal domain knowledge, since the clustering results can be quite sensitive to the input parameters, like the number of clusters required by many clustering algorithms.
• Ability to deal with noisy data, avoiding the identification of clusters that were negatively influenced by outliers or erroneous data.
• Insensitivity to the order of input records, since there are clustering algorithms that are influenced
by the order in which the available records are analyzed.

Two of the well-known types of clustering algorithms are based on partitioning and hierarchical methods. These methods and some of the corresponding algorithms are presented in the following subsections.

Clustering Methods

Partitioning Methods

A partitioning method constructs a partition of n objects into k clusters where k ≤ n. Given k, the partitioning method creates an initial partitioning and then, using an iterative relocation technique, it attempts to improve the partitioning by moving objects from one cluster to another. The clusters are formed to optimize an objective-partitioning criterion, such as the distance between the objects. A good partitioning must aggregate objects such that objects in the same cluster are similar to each other, whereas objects in different clusters are very different (Han & Kamber, 2001).

Two of the well-known partitioning clustering algorithms are the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and the k-medoid algorithm, where each cluster is represented by one of the objects located near the centre of the cluster.

Hierarchical Methods

Hierarchical methods perform a hierarchical composition of a given set of data objects. It can be done bottom-up (Agglomerative Hierarchical methods) or top-down (Divisive Hierarchical methods). Agglomerative methods start with each object forming a separate cluster. Then they perform repeated merges of clusters or groups close to one another until all the groups are merged into one cluster or until some pre-defined threshold is reached (Han & Kamber, 2001). These algorithms are based on the inter-object distances and on finding the nearest neighbor objects. Divisive methods start with all the objects in a single cluster and, in each successive iteration, the clusters are divided into smaller clusters until each cluster contains only one object or a termination condition is verified.

The next subsection presents two examples of clustering algorithms associated with partitioning and hierarchical methods respectively.

Clustering Techniques

The K-Means Algorithm

Partitioning-based clustering algorithms such as k-means attempt to break data into a set of k clusters (Karypis, Han, & Kumar, 1999).

The k-means algorithm takes as input a parameter k that represents the number of clusters in which the n objects of a dataset will be partitioned. The obtained division tries to maximize the Intracluster similarity (a measurement of the similarity between the objects inside a cluster) and minimize the Intercluster similarity (a measurement of the similarity between different clusters). That is to say, a high similarity between the objects inside a cluster and a low similarity between objects in different clusters. This similarity is measured looking at the centers of gravity (centroids) of the clusters, which are calculated as the mean value of the objects inside them.

Given the input parameter k, the k-means algorithm works as follows (MacQueen, 1967):

1. Randomly select k objects, each of which initially represents the cluster centre or the cluster mean.
2. Assign each of the remaining objects to the cluster to which it is the most similar, based on the distance between the object and the cluster mean.
3. Compute the new mean (centroid) of each cluster.

After the first iteration, each cluster is represented by the mean calculated in step 3. This process is repeated until the criterion function converges. The squared-error criterion is often used, which is defined as:

$E = \sum_{i=1}^{k} \sum_{j=1}^{l} \left| o_j - m_i \right|^2, \quad o_j \in C_i \qquad (1)$

where E is the sum of the square-error for the objects in the data set, l is the number of objects in a given cluster, $o_j$ represents an object, and $m_i$ is the mean value of the cluster $C_i$. This criterion intends to make the resulting k clusters as compact and as separate as possible (Han & Kamber, 2001).

The k-means algorithm is applied when the mean of a cluster can be obtained, which makes it not suitable for the analysis of categorical attributes [1]. One of the disadvantages that can be pointed out to this method is the necessity for users to specify k in advance. This method is also not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes (Han & Kamber,
2001). To overcome some of the drawbacks of the k-means algorithm, several improvements and new developments have been made. See for example Likas, Vlassis, and Verbeek (2003) and Estivill-Castro and Yang (2004).

The K-Nearest Neighbor Algorithm

Hierarchical clustering algorithms generate a set of clusters with a single cluster at the top (integrating all data objects) and single-point clusters at the bottom, in which each cluster is formed by one data object (Karypis, Han, & Kumar, 1999).

Agglomerative hierarchical algorithms, like k-nearest neighbor, start with each data object in a separate cluster. Each step of the clustering algorithm merges the k records that are most similar into a single cluster.

The k-nearest neighbor algorithm [2] uses a graph to represent the links between the several data objects. This graph allows the identification of the k objects most similar to a given data item. Each node of the k-nearest neighbor graph represents a data item. An edge exists between two nodes p and q if q is among the k most similar objects of p, or if p is among the k most similar objects of q (Figure 1).

[Figure 1. K-nearest neighbor algorithm: (a) dataset; (b) 2-nearest neighbor graph.]

MAIN THRUST

The Space Models Identification Through Hierarchical Clustering (STICH) algorithm presented in this article is based on the k-nearest neighbor algorithm. However, the principles defined for STICH try to overcome some of the limitations of the k-nearest neighbor approach, namely the necessity to define a value for the input parameter k. This value imposes restrictions on the maximum number of members that a given cluster can have. The number of members of the obtained clusters is k+1 (except in the case of a tie in the neighbors' value). In some applications there are no objective reasons for the choice of a value for k. The k-means algorithm was also analyzed, mainly because of its simplicity, and to allow the comparison of results with STICH.

The Algorithm

STICH uses an iterative process, in which no input parameters are required. Another characteristic of STICH is that it produces several usable Space Models, one at the end of each iteration of the clustering process.

As a hierarchical clustering algorithm with an agglomerative approach, STICH starts by assigning each object in the dataset to a different cluster. The clustering process begins with as many clusters as objects in the dataset, and ends with all the objects grouped into the same cluster.

The approach undertaken in STICH is as follows:

1. For the dataset, calculate the Similarity Matrix of the objects, which is the matrix of the Euclidean distances between every pair of objects in the dataset.
2. For each object, identify its minimum distance to the dataset (the value that represents the distance between the object and its 1-nearest neighbor).
3. Calculate the median [3] of all the minimum distances identified in the previous step.
4. For each object, identify its k-nearest neighbors, selecting the objects that have a distance value less than or equal to the median calculated in step 3. The number of objects selected, k, may vary from one object to another.
5. For each object, calculate the average c of its k-nearest neighbors.
6. For each object, verify in which clusters it appears as one of the k-nearest neighbors and then assign this object to the cluster in which it appears with the minimum k-nearest neighbor average c. In case of a tie, when an object appears in two clusters with equal average c, the object is assigned to the first cluster in which it appears in the Similarity Matrix.
7. For each new cluster, calculate its centroid.

This process is iteratively repeated until all the objects are grouped into the same cluster. Each iteration of STICH produces a new Space Model, with a decreasing number of clusters, which can be used for different purposes.
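To make steps 1-7 concrete, the following is a minimal, illustrative sketch of a single STICH iteration over one-dimensional indicator values. It is a simplified reading of the published procedure (names, the empty-neighborhood fallback, and the tie handling are assumptions), not the authors' implementation.

import numpy as np

def stich_iteration(centroids):
    """One STICH iteration over the current cluster centroids (1-D values).

    Returns a list of clusters, each a list of indices into `centroids`.
    """
    c = np.asarray(centroids, dtype=float)
    n = len(c)
    # Step 1: similarity matrix of Euclidean distances.
    dist = np.abs(c[:, None] - c[None, :])
    np.fill_diagonal(dist, np.inf)
    # Steps 2-3: median of the 1-nearest-neighbor distances.
    median = np.median(dist.min(axis=1))
    # Step 4: k-nearest neighbors = objects within the median distance.
    neighbours = [np.where(dist[i] <= median)[0] for i in range(n)]
    # Step 5: average of each object's neighbors.
    averages = [c[nb].mean() if len(nb) else c[i] for i, nb in enumerate(neighbours)]
    # Step 6: assign each object to the candidate cluster (an object whose
    # neighborhood contains it) with the smallest neighbor average; ties go
    # to the first candidate, roughly following the Similarity Matrix order.
    assignment = {}
    for obj in range(n):
        candidates = [i for i in range(n) if obj in neighbours[i]] or [obj]
        assignment[obj] = min(candidates, key=lambda i: averages[i])
    clusters = {}
    for obj, seed in assignment.items():
        clusters.setdefault(seed, []).append(obj)
    # Step 7: the caller recomputes the centroid (e.g. mean) of each cluster
    # before the next iteration.
    return list(clusters.values())

values = [0.0019, 0.0021, 0.0032, 0.0036, 0.0048, 0.0231]  # illustrative indicator values
print(stich_iteration(values))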
For each model, a quality metric is calculated after each iteration of the STICH algorithm. This metric, the ModelQuality, is defined as:

$\text{ModelQuality} = \text{Intracluster} - \text{Intercluster} \qquad (2)$

and is based on the difference between the Intracluster and Intercluster similarities.

The Intracluster indicator is calculated as the sum of all distances between the several objects in a given cluster (l represents the total number of objects in a cluster) and the mean value ($m_i$) of the cluster ($C_i$) in which the object ($o_j$) resides. The total number of clusters identified in each iteration is represented by t. The Intracluster indicator is calculated as follows:

$\text{Intracluster} = \sum_{i=1}^{t} \sum_{j=1}^{l} \left| o_j - m_i \right|, \quad o_j \in C_i \qquad (3)$

The Intercluster indicator is calculated as the sum of all distances existing between the centers of all the clusters identified in a given iteration. The Intercluster indicator is calculated as follows:

$\text{Intercluster} = \sum_{i=1}^{t} \sum_{j=1, j \neq i}^{t} \left| m_i - m_j \right| \qquad (4)$

The ModelQuality metric can be used by the user in the selection of the appropriate Space Model for a given task. However, it is at the minimum value of this metric that we find the STICH outliers model, in case outliers exist in the dataset (Figure 2). In the beginning of the clustering process the Intracluster similarity value is low, because of the high similarity between the objects inside the clusters, and the Intercluster similarity value is high, because of the low similarity between the several clusters. As the process proceeds, the Intracluster value increases and the Intercluster value decreases. The minimum difference between these two values means that the objects are as separate as possible, considering a low number of clusters. After this point, the clustering process forces the aggregation of all the objects into the same cluster (the stop criterion of agglomerative hierarchical clustering methods), within a few more iterations.

At this minimum point (Figure 2), reached at some iteration of the clustering process, the resulting Space Model is the one where the outliers are pointed out, that is, where the model isolates in different clusters the regions that are very different from all other regions.

[Figure 2. Intracluster similarity vs. Intercluster similarity: the Intracluster, Intercluster, and Difference curves plotted against the iteration number, with the minimum of the difference marked.]

The Implementation

STICH was implemented in Visual Basic for Applications (VBA) and integrated in the Geographic Information System (GIS) ArcView 8.2 using ArcObjects [4]. For the creation of Space Models, STICH considers two processes: the clustering of geographic regions based on a selected indicator, and the creation of a new space geometry where a new polygon is created for each resulting cluster by dissolving the borders of its members (the regions inside it). By using ArcGIS, both processes can be implemented in the same platform.

The Results

The Space Model presented afterward was obtained analyzing an indicator available in the EPSILON database. EPSILON (Environmental Policy via Sustainability Indi-
cators on a European-wide NUTS III Level) is a research project funded by the European Union through the Information Society Technologies program. STICH is a deliverable of this research project, which contributes to the better understanding of the European Environmental Quality and Quality of Life, by delivering a tool aimed to generate environmental sustainability indices at NUTS [5] III level.

In the following example, an indicator collected in the EPSILON project is analyzed, namely the concentration in the air of Heavy Metals - Lead (the data is available in Table 1 for the 15 European countries included in the EPSILON database).

Using the k-means and STICH algorithms, adopting the value k=3 in order to allow the comparison between the clusters achieved by them, the results obtained are systematized in Table 2. Analyzing these results it is possible to see that the clusters generated with the k-means algorithm, using the implementation available in the Clementine Data Mining System v8.0, are not as homogeneous as the clusters obtained with STICH. The k-means approach integrates the 0.0231413 value, an outlier [6] present in the dataset, with values like 0.00197921, leaving out of this cluster (Cluster 1) values that are closer to the latter. STICH obtains clusters that are more homogeneous and that separate values that are very different from the others.

Figure 3 shows the Space Model created by STICH as a result of the clustering process. This model, with three clusters, shows the spatial distribution of the concentration of Lead in the air, for the 15 analyzed countries.

[Figure 3. Space model: Analysis of the Heavy Metals - Lead attribute.]

Table 1. Data available in the EPSILON database for the Heavy Metals - Lead attribute

Country       Indicator
Sweden        0.00197921
Finland       0.00207107
Ireland       0.0032313
Denmark       0.0035931
France        0.00482136
England       0.00512893
Austria       0.00560728
Luxemburg     0.00638133
Spain         0.00797498
Netherlands   0.00879874
Germany       0.0091856
Belgium       0.00931672
Greece        0.0114614
Portugal      0.0124667
Italy         0.0231413

Table 2. Results obtained for the Heavy Metals - Lead attribute

k-means:
Cluster 1 = {0.00197921, 0.0035931, 0.00482136, 0.00512893, 0.00560728, 0.00638133, 0.00797498, 0.00879874, 0.0091856, 0.00931672, 0.0114614, 0.0124667, 0.0231413}
Cluster 2 = {0.00207107}
Cluster 3 = {0.0032313}

STICH:
Cluster 1 = {0.00197921, 0.00207107, 0.0032313, 0.0035931, 0.00482136, 0.00512893, 0.00560728, 0.00638133}
Cluster 2 = {0.00797498, 0.00879874, 0.0091856, 0.00931672, 0.0114614, 0.0124667}
Cluster 3 = {0.0231413}
As advantages of the creation of Space Models, we point out the definition of new space geometries, which can be reused whenever needed, and the certainty that the new divisions of the space were obtained from characteristics present in the data, and not imposed by human constraints as in administrative subdivisions.

FUTURE TRENDS

As already mentioned, clustering has been widely used in several domain applications to find segments in data, like groups of clients with similar behavior. In the location-based services domain, clustering can also be used to find groups of mobile users with the same interests or with similar mobility patterns. In this context, the STICH algorithm can also be used to create Space Models as divisions of the geographic space where each region represents an area with certain attributes useful to characterize the situation of a mobile user. Examples of such regions are urban or rural areas, commercial or leisure zones, or zones in a town that are not safe. When a mobile user is detected to be inside one of these regions, the attributes of that region can be used to identify the appropriate services and information to provide to him. These attributes can also be combined with other contextual data, such as the time of the day or the user preferences, to enhance the adaptability of a context-aware application (Salber, Dey, & Abowd, 1999).

CONCLUSION

This article presented STICH, a hierarchical clustering algorithm that allows the identification of Space Models. These models are characterized by the integration of groups of regions that are similar to each other. Clusters of outliers are also formed by STICH, enabling the identification of regions that are very different from the other regions present in the dataset.

As advantages of STICH, besides the creation of new divisions of the space, we point out:

• The discovery of clusters with arbitrary shapes and sizes.
• The avoidance of any previous domain knowledge, as for example the definition of a value for k.
• The ability to deal with outlier values, since when they are present in the dataset, the approach undertaken allows their identification.

REFERENCES

Berry, M., & Linoff, G. (2000). Mastering data mining: The art and science of customer relationship management. New York: John Wiley and Sons, Inc.

Böhm, C., & Krebs, F. (2004). The k-nearest neighbor join: Turbo charging the KDD process. Knowledge and Information Systems, 6(6).

Cios, K., Pedrycz, W., & Swiniarski, R. (1998). Data mining methods for knowledge discovery. Boston: Kluwer Academic Publishers.

Estivill-Castro, V., & Yang, J. (2004). Fast and robust general purpose clustering algorithms. Data Mining and Knowledge Discovery, 8(2), 127-150.

Fayyad, U., & Stolorz, P. (1997). Data mining and KDD: Promise and challenges. Future Generation Computer Systems, 13(2), 99-115.

Grabmeier, J. (2002). Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery, 6(4), 303-360.

Groth, R. (2000). Data mining: Building competitive advantage. Prentice Hall PTR.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. CA: Morgan Kaufmann Publishers.

Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Karypis, G., Han, E.-H., & Kumar, V. (1999). Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer, 32(8), 68-75.

Laurikkala, J., Juhola, M., & Kentala, E. (2000). Informal identification of outliers in medical data. In Proceedings of Intelligent Data Analysis in Medicine and Pharmacology, Workshop at the 14th European Conference on Artificial Intelligence (pp. 20-24). Berlin.

Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36(2), 451-461.

MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability (pp. 281-297). Berkeley, CA: University of California Press.

Modha, D.S., & Spangler, W.S. (2003). Feature weighting in k-Means clustering. Machine Learning, 52(3), 217-237.
Salber, D., Dey, A.K., & Abowd, G.D. (1999). The context toolkit: Aiding the development of context-enabled applications. In Proceedings of the 1999 Conference on Human Factors in Computing Systems (CHI 99) (pp. 434-441). Pittsburgh.

Zaït, M., & Messatfa, H. (1997). A comparative study of clustering methods. Future Generation Computer Systems, 13(2), 149-159.

KEY TERMS

Cluster: Group of objects of a dataset that are similar with respect to specific metrics (values of the attributes used to identify the similarity between the objects).

Clustering: Process of identifying homogeneous groups of clusters in a dataset.

Data Mining: A discovery process that aims at the identification of patterns hidden in the analyzed dataset.

Hierarchical Clustering: A clustering method characterized by the successive aggregation (agglomerative hierarchical methods) or disaggregation (divisive hierarchical methods) of objects in order to find clusters in a dataset.

Intercluster Similarity: A measurement of the similarity between the clusters identified in a particular clustering process. These clusters must be as dissimilar as possible.

Intracluster Similarity: A measurement of the similarity between the objects inside a cluster. Objects in a cluster must be as similar as possible.

Partitioning Clustering: A clustering method characterized by the division of the initial dataset in order to find clusters that maximize the similarity between the objects inside the clusters.

Space Model: A geometry of the geographic space obtained by the identification of geographic regions with similar behavior looking at a specific metric.

ENDNOTES

1. For datasets with multiple, heterogeneous feature spaces see Modha and Spangler (2003).
2. The k-nearest neighbor principles are also used for classification, a data mining task (Böhm & Krebs, 2004).
3. The median is used instead of the average, since the average can be negatively influenced by outliers.
4. http://arcobjectsonline.esri.com
5. Nomenclature of Territorial Units for Statistics.
6. Following the Interquartile Range definition (Laurikkala, Juhola, & Kentala, 2000) for the identification of outliers, it can be confirmed that the value 0.0231413 is the only value that represents a possible outlier in the analyzed dataset (in this definition, outliers are values more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile).
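As a compact illustration of the Intracluster and Intercluster similarity indicators defined in this article (equations 2-4), the following sketch computes the ModelQuality of a given grouping of one-dimensional indicator values. The grouping shown is an illustrative, rounded example, not a result from the article.

import numpy as np

def model_quality(clusters):
    """ModelQuality = Intracluster - Intercluster for a list of 1-D clusters."""
    means = [np.mean(c) for c in clusters]
    intra = sum(abs(v - m) for c, m in zip(clusters, means) for v in c)
    inter = sum(abs(mi - mj) for i, mi in enumerate(means)
                             for j, mj in enumerate(means) if i != j)
    return intra - inter

# Illustrative grouping of rounded Lead indicator values into three clusters.
clusters = [[0.0019, 0.0021, 0.0032], [0.0080, 0.0088, 0.0092], [0.0231]]
print(model_quality(clusters))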
Clustering of Time Series Data

Anne Denton
North Dakota State University, USA

INTRODUCTION

Time series data is of interest to most science and engineering disciplines and analysis techniques have been developed for hundreds of years. There have, however, in recent years been new developments in data mining techniques, such as frequent pattern mining, which take a different perspective of data. Traditional techniques were not meant for such pattern-oriented approaches. There is, as a result, a significant need for research that extends traditional time-series analysis, in particular clustering, to the requirements of the new data mining algorithms.

BACKGROUND

Time series clustering is an important component in the application of data mining techniques to time series data (Roddick & Spiliopoulou, 2002) and is founded on the following research areas:

• Data Mining: Besides the traditional topics of classification and clustering, data mining addresses new goals, such as frequent pattern mining, association rule mining, outlier analysis, and data exploration.
• Time Series Data: Traditional goals include forecasting, trend analysis, pattern recognition, filter design, compression, Fourier analysis, and chaotic time series analysis. More recently frequent pattern techniques, indexing, clustering, classification, and outlier analysis have gained in importance.
• Clustering: Data partitioning techniques such as k-means have the goal of identifying objects that are representative of the entire data set. Density-based clustering techniques rather focus on a description of clusters, and some algorithms identify the most common object. Hierarchical techniques define clusters at arbitrary levels of granularity.
• Data Streams: Many applications, such as communication networks, produce a stream of data. For real-valued attributes such a stream is amenable to time series data mining techniques.

Time series clustering draws from all of these areas. It builds on a wide range of clustering techniques that have been developed for other data, and adapts them while critically assessing their limitations in the time series setting.

MAIN THRUST

Many specialized tasks have been defined on time series data. This chapter addresses one of the most universal data mining tasks, clustering, and highlights the special aspects of applying clustering to time series data. Clustering techniques overlap with frequent pattern mining techniques, since both try to identify typical representatives.

Clustering Time Series

Clustering of any kind of data requires the definition of a similarity or distance measure. A time series of length n can be viewed as a vector in an n-dimensional vector space. One of the best-known distance measures, Euclidean distance, is frequently used in time series clustering. The Euclidean distance measure is a special case of an Lp norm. Lp norms may fail to capture similarity well when being applied to raw time series data because differences in the average value and average derivative affect the total distance. The problem is typically addressed by subtracting the mean and dividing the resulting vector by its L2 norm, or by working with normalized derivatives of the data (Gavrilov et al., 2000). Several specialized distance measures have been used for time series clustering, such as dynamic time warping, DTW (Berndt & Clifford, 1996), and longest common subsequence similarity, LCSS (Vlachos, Gunopulos, & Kollios, 2002).

Time series clustering can be performed on whole sequences or on subsequences. For clustering of whole sequences, high dimensionality is often a problem. Dimensionality reduction may be achieved through Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), and Principal Component Analysis (PCA), as some of the most commonly used techniques. DFT (Agrawal, Faloutsos, & Swami, 1993) and DWT have the goal of eliminating high-frequency components that are typically due to noise. Specialized models have been
introduced that ignore some information in a targeted way FUTURE TRENDS


(Jin, Lu, & Shi 2002). Others are based on models for C
specific data such as socioeconomic data (Kalpakis, Gada, Storage grows exponentially at rates faster than Moores
& Puttagunta, 2001). law for microprocessors. A natural consequence is that
A large number of clustering techniques have been old data will be kept when new data arrives leading to a
developed, and for a variety of purposes (Halkidi, massive increase in the availability of the time-depen-
Batistakis, & Vazirgiannis, 2001). Partition-based tech- dent data. Data mining techniques will increasingly have
niques are among the most commonly used ones for time series data. The k-means algorithm, which is based on a greedy search, has recently been generalized to a wide range of distance measures (Banerjee et al., 2004).

Clustering Subsequences of a Time Series

A variety of data mining tasks require clustering of subsequences of one or more time series as a preprocessing step, such as Association Rule Mining (Das et al., 1998), outlier analysis and classification. Partition-based clustering techniques have been used for this purpose in analogy to vector quantization (Gersho & Gray, 1992) that has been developed for signal compression. It has, however, been shown that when a large number of subsequences are clustered, the resulting cluster centers are very similar for different time series (Keogh, Lin, & Truppel, 2003). The problem can be resolved by using a clustering algorithm that is robust to noise (Denton, 2004), which adapts kernel-density-based clustering (Hinneburg & Keim, 2003) to time series data. Partition-based techniques aim at finding representatives for all objects. Cluster centers in kernel-density-based clustering, in contrast, are sequences that are in the vicinity of the largest number of similar objects. The goal of kernel-density-based techniques is thereby similar to frequent-pattern mining techniques such as Motif-finding algorithms (Patel et al., 2002).

Related Problems

One time series clustering problem of particular practical relevance is clustering of gene expression data (Eisen, Spellman, Brown, & Botstein, 1998). Gene expression is typically measured at several points in time (time course experiment) that may not be equally spaced. Hierarchical clustering techniques are commonly used. Density-based clustering has recently been applied to this problem (Jiang, Pei, & Zhang, 2003).
Time series with categorical data constitute a further related topic. Examples are log files and sequences of operating system commands. Some clustering algorithms in this setting borrow from frequent sequence mining algorithms (Vaarandi, 2003).

FUTURE TRENDS

to consider the time dimension as an integral part of other techniques. Much of the current effort in the data mining community is directed at data that has a more complex structure than the simple tabular format initially covered in machine learning (Džeroski & Lavrač, 2001). Examples, besides time series data, include data in relational form, such as graph- and tree-structured data, and sequences. When addressing new settings it will be of major importance to not only generalize existing techniques and make them more broadly applicable but to also critically assess problems that may appear in the generalization process.

CONCLUSION

Despite the maturity of both clustering and time series analysis, time series clustering is an active and fascinating research topic. New data mining applications are constantly being developed and require new types of clustering results. Clustering techniques from different areas of data mining have to be adapted to the time series context. Noise is a particularly serious problem for time series data, thereby adding challenges to the clustering process. Considering the general importance of time series data, it can be expected that time series clustering will remain an active topic for years to come.

REFERENCES

Banerjee, A., Merugu, S., Dhillon, I., & Ghosh, J. (2004, April). Clustering with Bregman divergences. In Proceedings SIAM International Conference on Data Mining, Lake Buena Vista, FL.

Berndt, D.J., & Clifford, J. (1996). Finding patterns in time series: A dynamic programming approach. In Advances in knowledge discovery and data mining (pp. 229-248). Menlo Park, CA: AAAI Press.

Das, G., & Gunopulos, D. (2003). Time series similarity and indexing. In N. Ye (Ed.), The handbook of data mining (pp. 279-304). Mahwah, NJ: Lawrence Erlbaum Associates.


Das, G., Lin, K.-I., Mannila, H., Renganathan, G., & Smyth, P. (1998, Sept). Rule discovery from time series. In Proceedings IEEE Int. Conf. on Data Mining, Rio de Janeiro, Brazil.

Denton, A. (2004, August). Density-based clustering of time series subsequences. In Proceedings The Third Workshop on Mining Temporal and Sequential Data (TDM 04), in conjunction with The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA.

Džeroski, S., & Lavrač, N. (2001). Relational data mining. Berlin: Springer.

Jiang, D., Pei, J., & Zhang, A. (2003, March). DHC: A density-based hierarchical clustering method for time series gene expression data. In Proceedings 3rd IEEE Symposium on Bioinformatics and Bioengineering (BIBE03), Washington, D.C.

Eisen, M.B., Spellman, P.T., Brown, P.O., & Botstein, D. (1998, December). Cluster analysis and display of genome-wide expression patterns. In Proceedings of the National Academy of Science USA, 95(25) (pp. 14863-8).

Gavrilov, M., Anguelov, D., Indyk, P., & Motwani, R. (2000). Mining the stock market (extended abstract): Which measure is best? In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 487-496), Boston, MA.

Gersho, A., & Gray, R.M. (1992). Vector quantization and signal compression. Boston, MA: Kluwer Academic Publishers.

Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Intelligent Information Systems Journal, 17(2-3), 107-145.

Hinneburg, A., & Keim, D.A. (2003, November). A general approach to clustering in large databases with noise. Knowledge Information Systems, 5(4), 387-415.

Kalpakis, K., Gada, D., & Puttagunta, V. (2001). Distance measures for effective clustering of ARIMA time-series. In Proceedings IEEE International Conference on Data Mining (pp. 273-280), San Jose, CA.

Keogh, E.J., Lin, J., & Truppel, W. (2003, December). Clustering of time series subsequences is meaningless: Implications for previous and future research. In Proceedings IEEE International Conference on Data Mining (pp. 115-122), Melbourne, FL.

Jin, X., Lu, Y., & Shi, C. (2002). Similarity measure based on partial information of time series. In Proceedings Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 544-549). Edmonton, AB, Canada.

Patel, P., Keogh, E., Lin, J., & Lonardi, S. (2002). Mining motifs in massive time series databases. In Proceedings 2002 IEEE International Conference on Data Mining, Maebashi City, Japan.

Reif, F. (1965). Fundamentals of statistical and thermal physics. New York: McGraw-Hill.

Roddick, J.F., & Spiliopoulou, M. (2002). A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4), 750-767.

Vlachos, M., Gunopoulos, D., & Kollios, G. (2002, February). Discovering similar multidimensional trajectories. In Proceedings 18th International Conference on Data Engineering (ICDE02), San Jose, CA.

Vaarandi, R. (2003). A data clustering algorithm for mining patterns from event logs. In Proceedings 2003 IEEE Workshop on IP Operations and Management, Kansas City, MO.

KEY TERMS

Dynamic Time Warping (DTW): Sequences are allowed to be extended by repeating individual time series elements, such as replacing the sequence X={x1,x2,x3} by X={x1,x2,x2,x3}. The distance between two sequences under dynamic time warping is the minimum distance that can be achieved by extending both sequences independently.

Kernel-Density Estimation (KDE): Consider the vector space in which the data points are embedded. The influence of each data point is modeled through a kernel function. The total density is calculated as the sum of kernel functions for each data point.

Longest Common Subsequence Similarity (LCSS): Sequences are compared based on the assumption that elements may be dropped. For example, a sequence X={x1,x2,x3} may be replaced by X={x1,x3}. Similarity between two time series is calculated as the maximum number of matching time series elements that can be achieved if elements are dropped independently from both sequences. Matches in real-valued data are defined as lying within some predefined tolerance.

Partition-Based Clustering: The data set is partitioned into k clusters, and cluster centers are defined


based on the elements of each cluster. An objective function is defined that measures the quality of clustering based on the distance of all data points to the center of the cluster to which they belong. The objective function is minimized.

Principal Component Analysis (PCA): The projection of the data set to a hyperplane that preserves the maximum amount of variation. Mathematically, PCA is equivalent to singular value decomposition on the covariance matrix of the data.

Random Walk: A sequence of random steps in an n-dimensional space, where each step is of fixed or randomly chosen length. In a random walk time series, time is advanced for each step and the time series element is derived using the prescription of a 1-dimensional random walk of randomly chosen step length.

Sliding Window: A time series of length n has (n-w+1) subsequences of length w. An algorithm that operates on all subsequences sequentially is referred to as a sliding window algorithm.

Time Series: Sequence of real numbers, collected at equally spaced points in time. Each number corresponds to the value of an observed quantity.

Vector Quantization: A signal compression technique in which an n-dimensional space is mapped to a finite set of vectors. Each vector is called a codeword and the collection of all codewords a codebook. The codebook is typically designed using Linde-Buzo-Gray (LBG) quantization, which is very similar to k-means clustering.
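To make several of the preceding definitions concrete, the short Python sketch below generates a random-walk time series, extracts its sliding-window subsequences, and groups the normalized subsequences with a plain k-means (partition-based) procedure, the same family of techniques used in vector quantization. It is an illustration written for this entry only, not code from the cited works; all function names and parameter values are invented here, and, as discussed above, naively clustering all sliding-window subsequences can yield near-identical cluster centers (Keogh, Lin, & Truppel, 2003), so the sketch should be read as a baseline rather than a recommendation.

```python
# Minimal sketch: random-walk series, sliding-window subsequences,
# and partition-based (k-means style) clustering of the windows.
import numpy as np

def sliding_windows(series, w):
    """Return the (n - w + 1) subsequences of length w as rows of a matrix."""
    n = len(series)
    return np.array([series[i:i + w] for i in range(n - w + 1)])

def kmeans(data, k, iters=50, seed=0):
    """Plain Lloyd-style k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(iters):
        # assignment step: nearest center by Euclidean distance
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each center from its members
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # random-walk time series: cumulative sum of random steps
    series = np.cumsum(rng.normal(size=500))
    windows = sliding_windows(series, w=32)
    # normalize each subsequence so clustering compares shapes, not levels
    windows = (windows - windows.mean(axis=1, keepdims=True)) / (
        windows.std(axis=1, keepdims=True) + 1e-12)
    centers, labels = kmeans(windows, k=5)
    print("cluster sizes:", np.bincount(labels, minlength=5))
```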


Clustering Techniques
Sheng Ma
IBM T.J. Watson Research Center, USA

Tao Li
Florida International University, USA

INTRODUCTION Criterion/Objective Function: What are the objec-


tive functions or criteria that the clustering solu-
Clustering data into sensible groupings as a fundamental tions should aim to optimize? Typical examples
and effective tool for efficient data organization, summa- include entropy, maximum likelihood, and within-
rization, understanding, and learning has been the sub- class or between-class distance (Li, Ma & Ogihara, 2004a).
ject of active research in several fields, such as statistics Optimization Procedure: What is the optimization
(Hartigan, 1975; Jain & Dubes, 1988), machine learning procedure for finding the solutions? A clustering
(Dempster, Laird & Rubin, 1977), information theory (Linde, problem is known to be NP-complete (Brucker, 1977),
Buzo & Gray, 1980), databases (Guha, Rastogi & Shim, and many approximation procedures have been de-
1998; Zhang, Ramakrishnan & Livny, 1996), and veloped. For instance, Expectation-Maximization-
bioinformatics (Cheng & Church, 2000) from various per- (EM) type algorithms have been used widely to find
spectives and with various approaches and focuses. local minima of optimization.
From an application perspective, clustering techniques Cluster Validation and Interpretation: Cluster vali-
have been employed in a wide variety of applications, dation evaluates the clustering results and judges
such as customer segregation, hierarchal document orga- the cluster structures. Interpretation often is neces-
nization, image segmentation, microarray data analysis, sary for applications. Since there is no label informa-
and psychology experiments. tion, clusters are sometimes justified by ad hoc
Intuitively, the clustering problem can be described as methods (such as exploratory analysis), based on
follows: Let W be a set of n entities, finding a partition of specific application areas.
W into groups, such that the entities within each group are
similar to each other, while entities belonging to different For a given clustering problem, the five components
groups are dissimilar. The entities usually are described are tightly coupled. The formal model is induced from the
by a set of measurements (attributes). Clustering does not physical representation of the data; the formal model,
use category information that labels the objects with prior along with the objective function, determines the cluster-
identifiers. The absence of label information distinguishes ing capability, and the optimization procedure decides
cluster analysis from classification and indicates that the how efficiently and how effectively the clustering results
goals of clustering are just finding a hidden structure or can be obtained. The choice of the optimization procedure
compact representation of data instead of discriminating depends on the first three components. Validation of
future data into categories. cluster structures is a way of verifying assumptions on
data generation and of evaluating the optimization procedure.

BACKGROUND
MAIN THRUST
Generally, clustering problems are determined by five
basic components: We review some of the current clustering techniques in
this section. Figure 1 gives a summary of clustering
Data Representation: What is the (physical) repre- techniques. The following further discusses traditional
sentation of the given dataset? What kind of at- clustering techniques, spectral-based analysis, model-
tributes (e.g., numerical, categorical or ordinal) are based clustering, and co-clustering.
there? Traditional clustering techniques focus on one-sided
Data Generation: The formal model for describing clustering, and they can be classified as partitional, hier-
the generation of the dataset. For example, Gaussian archical, density-based, and grid-based (Han & Kamber,
mixture model is a model for data generation. 2000). Partitional clustering attempts to directly decom-


Figure 1. Summary of clustering techniques



pose the dataset into disjoint classes, such that the data embedded in a low-dimensional space. Model-based clus-
points in a class are nearer to one another than the data tering attempts to learn generative models, by which the
points in other classes. Hierarchical clustering proceeds cluster structure is determined, from the data. Tishby,
successively by building a tree of clusters. Density- Pereira, and Bialek (1999) and Slonim and Tishby (2000)
based clustering is grouping the neighboring points of a developed information bottleneck formulation, in which,
dataset into classes based on density conditions. Grid- given the empirical joint distribution of two variables, one
based clustering quantizes the data space into a finite variable is compressed so that the mutual information
number of cells that form a grid-structure and then per- about the other is preserved as much as possible. Other
forms clustering on the grid structure. Most of these recent developments of clustering techniques include
algorithms use distance functions as objective criteria ensemble clustering, support vector clustering, matrix
and are not effective in high-dimensional spaces. factorization, high-dimensional data clustering, distrib-
As an example, we take a closer look at K-means uted clustering, and so forth.
algorithms. The typical K-means type algorithm is a widely- Another interesting development is co-clustering,
used partition-based clustering approach. Basically, it which conducts simultaneous, iterative clustering of both
first chooses a set of K data points as initial cluster data points and their attributes (features) through utiliz-
representatives (e.g., centers) and then performs an itera- ing the canonical duality contained in the point-by-at-
tive process that alternates between assigning the data tribute data representation. The idea of co-clustering of
points to clusters, based on their distances to the cluster data points and attributes dates back to Anderberg (1973)
representatives, and updating the cluster representa- and Nishisato (1980). Govaert (1985) researches simulta-
tives, based on new cluster assignments. The iterative neous block clustering of the rows and columns of the
optimization procedure of K-means algorithm is a special contingency table. The idea of co-clustering also has
form of EM-type procedure. The K-means type algorithm been applied to cluster gene expression and experiments
treats each attribute equally and computes the distances (Cheng & Church, 2000). Dhillon (2001) presents a co-
between data points and cluster representatives to deter- clustering algorithm for documents and words using
mine cluster memberships. bipartite graph formulation and a spectral heuristic. Re-
A lot of algorithms have been developed recently to cently, Dhillon, et al. (2003) proposed an information-
address the efficiency and performance issues presented theoretic co-clustering method for a two-dimensional
in traditional clustering algorithms. Spectral analysis has contingency table. By viewing the non-negative contin-
been shown to tightly relate to clustering task. Spectral gency table as a joint probability distribution between
clustering (Ng, Jordan & Weiss, 2001; Weiss, 1999), two discrete random variables, the optimal co-clustering
closely related to the latent semantics index (LSI), uses then maximizes the mutual information between the clus-
selected eigenvectors of the data affinity matrix to obtain tered random variables. Li and Ma (2004) recently devel-
a data representation that easily can be clustered or oped Iterative Feature and Data (IFD) clustering by rep-


resenting the data generation with data and feature coef- merical. The presence of complex types also makes
ficients. IFD enables an iterative co-clustering procedure the cluster validation and interpretation difficult.
for both data and feature assignments. However, unlike
previous co-clustering approaches, IFD performs cluster- More challenges also include clustering with multiple
ing using the mutually reinforcing optimization procedure criteria (where clustering problems often require optimi-
that has a proven convergence property. IFD only handles zation over more than one criterion), clustering relation
data with binary features. Li, Ma, and Ogihara (2004b) data (where data is represented with multiple relation
further extended the idea for general data. tables), and distributed clustering (where data sets are
geographically distributed across multiple sites).

FUTURE TRENDS
CONCLUSION
Although clustering has been studied for many years,
many issues, such as cluster validation, still need more Clustering is a classical topic to segment and group
investigation. In addition, new challenges, such as similar data objects. Many algorithms have been devel-
scalability, high-dimensionality, and complex data types, oped in the past. Looking ahead, many challenges drawn
have been brought by the ever-increasing growth of infor- from real-world applications will drive the search for
mation exposure and data collection. efficient algorithms that are able to handle heteroge-
neous data in order to process a large volume and to scale
Scalability and Efficiency: With the collection of to deal with a large number of dimensions.
huge amounts of data, clustering faces the problems
of scalability in terms of both computation time and
memory requirements. To resolve the scalability is- REFERENCES
sues, methods such as incremental and streaming
approaches, sufficient statistics for data summary, Aggarwal, C. et al. (1999). Fast algorithms for projected
and sampling techniques have been developed. clustering. Proceedings of ACM SIGMOD Conference.
Curse of Dimensionality: Another challenge is the
high dimensionality of data. It has been shown that Anderberg, M.-R. (1973). Cluster analysis for applica-
in a high dimensional space, the distance between tions. Academic Press Inc.
every pair of points is almost the same for a wide Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U.
variety of data distributions and distance functions (1999). When is nearest neighbor meaningful? Proceed-
(Beyer et al., 1999). Hence, most algorithms do not ings of the International Conference on Database
work efficiently in high-dimensional spaces, due to Theory.
the curse of dimensionality. Many feature selection
techniques have been applied to reduce the dimen- Brucker, P. (1977). On the complexity of clustering prob-
sionality of the space. However, as demonstrated in lems. In R. Henn, B. Korte, & W. Oletti (Eds.), Optimiza-
Aggarwal, et al. (1999), in many case, the correlations tion and operations research (pp. 45-54). New York:
among the dimensions often are specific to data Springer-Verlag.
locality; in other words, some data points are corre-
lated with a given set of features, and others are Cheng, Y., & Church, G.M. (2000). Bi-clustering of ex-
correlated with respect to different features. As pression data. Proceedings of the Eighth International
pointed out in Hastie, Tibshirani, and Friedman (2001) Conference on Intelligent Systems for Molecular Biol-
and Domeniconi, Gunopulos, and Ma (2004), all ogy (ISMB).
methods that overcome the dimensionality problems Dempster, A., Laird, N., & Rubin, D. (1977). Maximum
use a metric for measuring neighborhoods, which is likelihood from incomplete data via the EM algorithm.
often implicit and/or adaptive. Journal of the Royal Statistical Society, 39, 1-38.
Complex Data Types: The problem of clustering
becomes more challenging when the data contains Dhillon, I. (2001). Co-clustering documents and words
complex types (e.g., when the attributes contain using bipartite spectral graph partitioning. Technical
both categorical and numerical values). There are no Report 2001-05. Austin CS Dept.
inherent distance measures between data values. Dhillon, I.S., Mallela, S., & Modha, S.-S. (2003). Informa-
This is often the case in many applications where tion-theoretic co-clustering. Proceedings of ACM
data are described by a set of descriptive or pres- SIGKDD.
ence/absence attributes, many of which are not nu-


Domeniconi, C., Gunopulos, D., & Ma, S. (2004). Within- Tishby, N., Pereira, F.-C., & Bialek, W. (1999). The infor-
cluster adaptive metric for clustering. Proceedings of the mation bottleneck method. Proceedings of the 37th An- C
SIAM International Conference on Data Mining. nual Allerton Conference on Communication, Control
and Computing.
Govaert, G. (1985). Simultaneous clustering of rows and
columns. Control and Cybernetics, 437-458. Weiss, Y. (1999). Segmentation using eigenvectors: A
unifying view. Proceedings of ICCV (2).
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient
clustering algorithm for large databases. Proceedings of Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH:
the ACM SIGMOD Conference. An efficient data clustering method for very large data-
bases. Proceedings of the ACM SIGMOD Conference.
Han, J., & Kamber, M. (2000). Data mining: Concepts and
techniques. Morgan Kaufmann Publishers.
Hartigan, J. (1975). Clustering algorithms. Wiley.
KEY TERMS

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Cluster: A set of entities that are similar between
elements of statistical learning: Data mining, inference, themselves and dissimilar to entities from other clusters.
prediction. Springer.
Clustering: The process of dividing the data into
Jain, A.K., & Dubes, R.C. (1988). Algorithms for cluster- clusters.
ing data. Upper Saddle River, NJ: Prentice Hall.
Cluster Validation: Evaluates the clustering results
Li, T., & Ma, S. (2004). IFD: Iterative feature and data and judges the cluster structures.
clustering. Proceedings of the SIAM International Con-
ference on Data Mining. Co-Clustering: Performs simultaneous clustering of
both points and their attributes by way of utilizing the
Li, T., Ma, S., & Ogihara, M. (2004a). Entropy-based canonical duality contained in the point-by-attribute data
criterion in categorical clustering. Proceedings of the representation.
International Conference on Machine Learning (ICML 2004).
Curse of Dimensionality: This expression is due to
Li, T., Ma, S., & Ogihara, M. (2004b). Document clustering via Bellman; in statistics, it relates to the fact that the conver-
adaptive subspace iteration. Proceedings of the ACM SIGIR. gence of any estimator to the true value of a smooth
function defined on a space of high dimension is very
Linde, Y., Buzo, A., & Gray, R.M. (1980). An algorithm for slow. It has been used in various scenarios to refer to the
vector quantization design. IEEE Transactions on Com- fact that the complexity of learning grows significantly
munications, 28(1), 84-95. with the dimensions.
Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral Spectral Clustering: The collection of techniques
clustering: Analysis and an algorithm. Advances in Neu- that performs clustering tasks using eigenvectors of
ral Information Processing Systems, 14. matrices derived from the data.
Nishisato, S. (1980). Analysis of categorical data: Dual Subspace Clustering: An extension of traditional
scaling and its applications. Toronto: University of clustering techniques that seeks to find clusters in differ-
Toronto Press. ent subspaces within a given dataset.
Slonim, N., & Tishby, N. (2000). Document clustering
using word clusters via the information bottleneck method.
Proceedings of the ACM SIGIR.


Clustering Techniques for Outlier Detection


Frank Klawonn
University of Applied Sciences Braunschweig/Wolfenbuettel, Germany

Frank Rehm
German Aerospace Center, Germany

INTRODUCTION partitioning a data set into groups or clusters of data,


where data assigned to the same cluster are similar and
For many applications in knowledge discovery in data- data from different clusters are dissimilar. With this
bases, finding outliers, which are rare events, is of partitioning concept in mind, an important aspect of
importance. Outliers are observations that deviate sig- typical applications of cluster analysis is the identifica-
nificantly from the rest of the data, so they seem to have tion of the number of clusters in a data set. However,
been generated by another process (Hawkins, 1980). when we are interested in identifying outliers, the exact
Such outlier objects often contain information about an number of clusters is irrelevant (Georgieva & Klawonn).
untypical behaviour of the system. The idea of whether one prototype covers two or more
However, outliers bias the results of many data- data clusters or whether two or more prototypes com-
mining methods such as the mean value, the standard pete for the same data cluster is not important as long as
deviation, or the positions of the prototypes of k-means the actual outliers are identified and assigned to a proper
clustering (Estivill-Castro & Yang, 2004; Keller, 2000). cluster. The number of prototypes used for clustering
Therefore, before further analysis or processing of data depends, of course, on the number of expected clusters
is carried out with more sophisticated data-mining tech- but also on the distance measure respectively the shape
niques, identifying outliers is a crucial step. Usually, of the expected clusters. Because this information is
data objects are considered as outliers when they occur usually not available, the Euclidean distance measure is
in a region of extremely low data density. often recommended with rather copious prototypes.
Many clustering techniques that deal with noisy data One of the most referred statistical tests for outlier
and can identify outliers, such as possibilistic cluster- detection is the Grubbs test (Grubbs, 1969). This test is
ing (PCM) (Krishnapuram & Keller, 1993, 1996) and used to detect outliers in a univariate data set. Grubbs
noise clustering (NC) (Dave, 1991; Dave & test detects one outlier at a time. This outlier is removed
Krishnapuram, 1997), need good initializations or suf- from the data set, and the test is iterated until no outliers
fer from lack of adaptability to different cluster sizes. are detected.
Distance-based approaches (Knorr & Ng, 1998; Knorr,
Ng, & Tucakov, 2000) have a global view on the data set.
These algorithms can hardly treat data sets that contain MAIN THRUST
regions with different data density (Breuning, Kriegel,
Ng, & Sander, 2000). The detection of outliers that we propose in this work is
In this work, we present an approach that combines a a modified version of the one proposed in Santos-
fuzzy clustering algorithm (Hppner, Klawonn, Kruse, Pereira and Pires (2002) and is composed of two differ-
& Runkler, 1999) or any other prototype-based cluster- ent techniques. In the first step we partition the data set
ing algorithm with statistical distribution-based outlier with the fuzzy c-means clustering algorithm so the fea-
detection. ture space is approximated with an adequate number of
prototypes. The prototypes will be placed in the centre
of regions with a high density of feature vectors. Be-
BACKGROUND cause outliers are far away from the typical data, they
influence the placing of the prototypes.
Prototype-based clustering algorithms approximate a After partitioning the data, only the feature vectors
feature space by means of an appropriate number of belonging to each single cluster are considered for the
prototype vectors, where each vector is located in the detection of outliers. For each attribute of the feature
centre of the group of data (the cluster) that belongs to vectors of the considered cluster, the mean value and the
the respective prototype. Clustering usually aims at standard deviation has to be calculated. For the vector


Figure 1. Outlier detection with different numbers of prototypes



(a) 1 prototype (b) 3 prototypes (c) 10 prototypes

with the largest distance1 to the mean vector, which is FUTURE TRENDS
assumed to be an outlier, the value of the z-transforma-
tion for each of its components is compared to a critical Figure 1 shows that the algorithm can identify outliers
value. If one of these values is higher than the respective in multivariate data in a stable way. With only a few
critical value, then this vector is declared an outlier. One parameters, the solution can be adapted to different
can use the Mahalanobis distance as in Santos-Pereira requirements concerning the specific definition of an
and Pires (2002), but because simple clustering tech- outlier. With the choice of the number of prototypes, it
niques such as the fuzzy c-means algorithm tend to spheri- is possible to influence the result in a way that with lots
cal clusters, we apply a modified version of Grubbs test, of prototypes, even smaller data groups can be found. To
not assuming correlated attributes within a cluster. avoid overfitting the data, it makes sense in certain
The critical value is a parameter that must be set for cases to eliminate very small clusters. However, finding
each attribute depending on the specific definition of an out the proper number of prototypes should be of inter-
outlier. One typical criterion can be the maximum num- est for further investigations.
ber of outliers with respect to the amount of data In the case of using a fuzzy clustering algorithm such
(Klawonn, 2004). Eventually, large critical values lead as FCM (Bezdek, 1981) to partition the data, it is
to smaller numbers of outliers, and small critical values possible to assign a feature vector to different proto-
lead to very compact clusters. Note that the critical type vectors. In that way, you can consolidate whether a
value is set for each attribute separately. This leads to an certain feature vector is an outlier if the algorithm
axes-parallel view of the data, which in cases of axes- decides for each single cluster that the corresponding
parallel clusters leads to a better outlier detection than feature vector is an outlier.
the (hyper)spherical view of the data. FCM provides membership degrees for each feature
If an outlier is found, the feature vector has to be vector to every cluster. One approach could be to assign
removed from the data set. With the new data set, the a feature vector to the corresponding clusters with the
mean value and the standard deviation have to be calcu- two highest membership degrees. The feature vector is
lated again for each attribute. With the vector that has considered as an outlier if the algorithm makes the same
the largest distance to the new centre vector, the outlier decision in both clusters. In cases where the algorithm
test will be repeated by checking the critical values. This gives no definite answers, the feature vector can be
procedure will be repeated until no outlier is found. The labeled and processed by further analysis.
other clusters are treated in the same way.
Figure 1 shows the results of the proposed algo-
rithm. The crosses in this figure are feature vectors, CONCLUSION
which are recognized as outliers. As expected, only a
few points are declared as outliers when approximating In this article, we describe a method to detect outliers in
the feature space with only one prototype. The proto- multivariate data. Because information about the num-
type will be placed in the centre of all feature vectors. ber and shape of clusters is often not known in advance,
Hence, only points on the edges are defined as outliers. it is necessary to have a method that is relatively robust
Comparing the solutions with 3 and 10 prototypes, you with respect to these parameters. To obtain a stable
can determine that both solutions are almost identical. algorithm, we combined approved clustering techniques,
Even in the border regions, were two prototypes com- including the FCM or k-means, with a statistical method
peting for some data points, the algorithm would rarely to detect outliers. Because the complexity of the pre-
identify these points as outliers, which they intuitively sented algorithm is linear in the number of points, it can
are not. be applied to large data sets.


REFERENCES Krishnapuram, R., & Keller, J. M. (1996). The possibilistic


C-means algorithm: Insights and recommendations. IEEE
Bezdek, J. C. (1981). Pattern recognition with fuzzy Transactions on Fuzzy Systems, 4(3), 385-393.
objective function algorithms. New York: Plenum Press. Santos-Pereira, C. M., & Pires, A. M. (2002). Detection of
Breunig, M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). outliers in multivariate data: A method based on cluster-
LOF: Identifying density-based local outliers. Proceed- ing and robust estimators. Proceedings of the 15th
ings of the ACM SIGMOD International Conference on Symposium on Computational Statistics (pp. 291-296),
Management of Data (pp. 93-104). Germany.

Dave, R. N. (1991). Characterization and detection of noise


in clustering. Pattern Recognition Letters, 12, 657-664.
KEY TERMS
Dave, R. N., & Krishnapuram, R. (1997). Robust clustering
methods: A unified view. IEEE Transactions on Fuzzy Cluster Analysis: To partition a given data set into
Systems, 5(2), 270-293. clusters where data assigned to the same cluster are
Estivill-Castro, V., & Yang, J. (2004). Fast and robust similar and data from different clusters are dissimilar.
general purpose clustering algorithms. Data Mining Cluster Prototypes (Centres): Clusters in objec-
and Knowledge Discovery, 8, 127-150. tive function-based clustering are represented by pro-
Georgieva, O., & Klawonn, F. A cluster identification totypes that define how the distance of a data object to
algorithm based on noise clustering. Manuscript sub- the corresponding cluster is computed. In the simplest
mitted for publication. case, a single vector represents the cluster, and the
distance to the cluster is the Euclidean distance be-
Grubbs, F. (1969). Procedures for detecting outlying tween cluster centre and data object.
observations in samples. Technometrics, 11(1), 1-21.
Fuzzy Clustering: Cluster analysis where a data
Hawkins, D. (1980). Identification of outliers. London: object can have membership degrees to different clus-
Chapman and Hall. ters. Usually it is assumed that the membership degrees
of a data object to all clusters add up to one, so a
Hppner, F., Klawonn, F., Kruse, R., & Runkler, T. (1999). membership degree can also be interpreted as the prob-
Fuzzy cluster analysis. Chichester, UK: Wiley. ability that the data object belongs to the correspond-
Keller, A. (2000). Fuzzy clustering with outliers. Pro- ing cluster.
ceedings of the 19th International Conference of the Noise Clustering: An additional noise cluster is
North American Fuzzy Information Processing Society, induced in objective function-based clustering to col-
USA. lect the noise data or outliers. All data objects are
Klawonn, F. (2004). Noise clustering with a fixed fraction assumed to have a fixed (large) distance to the noise
of noise. In A. Lotfi & J. M. Garibaldi (Eds.), Applications cluster, so only data that is far away from all the other
and science in soft computing. Berlin, Germany: Springer. clusters will be assigned to the cluster.

Knorr, E. M., & Ng, R. T. (1998). Algorithms for mining Outliers: Observations in a sample, so far sepa-
distance-based outliers in large datasets. Proceedings rated in value from the remainder as to suggest that they
of the 24th International Conference on Very Large are generated by another process or are the result of an
Data Bases (pp. 392-403). error in measurement.

Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance- Overfitting: The phenomenon that a learning algo-
based outliers: Algorithms and applications. VLDB Jour- rithm adapts so well to a training set that the random
nal, 8(3-4), 237-253. disturbances in the training set are included in the
model as being meaningful. Consequently, as these
Krishnapuram, R., & Keller, J. M. (1993). A possibilistic disturbances do not reflect the underlying distribution,
approach to clustering. IEEE Transactions on Fuzzy the performance on the test set, with its own but defini-
Systems, 1(2), 98-110. tively other disturbances, will suffer from techniques
that tend to fit well to the training set.


Possibilistic Clustering: The Possibilistic c-means (PCM) family of clustering algorithms is designed to alleviate the noise problem by relaxing the constraint on memberships used in probabilistic fuzzy clustering.

Z-Transformation: With the z-transformation, one can transform the values of any variable into values of the standard normal distribution.

ENDNOTE

1. Different distance measures can be used to determine the vector with the largest distance to the centre vector. For elliptical non-axes-parallel clusters, the Mahalanobis distance leads to good results. If no information about the shape of the clusters is available, the Euclidean distance is commonly used.
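The outlier-detection procedure described in this entry can be summarized in a few lines of Python. The sketch below is only an illustration, not the authors' implementation: it uses crisp k-means in place of the fuzzy c-means partitioning step, a single fixed critical value for all attributes instead of a Grubbs-test quantile, and invented function and variable names.

```python
# Illustrative sketch of the cluster-based outlier detection described above:
# partition the data with a prototype-based clusterer, then, within each
# cluster, repeatedly test the point farthest from the cluster mean by
# comparing its z-transformed attribute values against a critical value.
import numpy as np

def detect_outliers(data, n_clusters=3, critical=2.5, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=n_clusters, replace=False)]
    for _ in range(iters):                       # plain k-means partitioning
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)

    outliers = np.zeros(len(data), dtype=bool)
    for j in range(n_clusters):
        idx = np.where(labels == j)[0].tolist()   # indices of this cluster
        while len(idx) > 2:
            cluster = data[idx]
            mean = cluster.mean(axis=0)
            std = cluster.std(axis=0) + 1e-12
            # candidate: vector with the largest distance to the cluster mean
            far = int(np.argmax(np.linalg.norm(cluster - mean, axis=1)))
            z = np.abs((cluster[far] - mean) / std)  # per-attribute z-transformation
            if np.any(z > critical):                 # axes-parallel outlier test
                outliers[idx.pop(far)] = True        # remove the outlier and re-test
            else:
                break
    return outliers

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    blob1 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
    blob2 = rng.normal(loc=8.0, scale=1.0, size=(100, 2))
    extremes = np.array([[0.0, 9.0], [15.0, -2.0]])
    x = np.vstack([blob1, blob2, extremes])
    print("flagged as outliers:", np.where(detect_outliers(x, n_clusters=2))[0])
```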


Combining Induction Methods with the


Multimethod Approach
Mitja Lenič
University of Maribor, FERI, Slovenia

Peter Kokol
University of Maribor, FERI, Slovenia

Petra Povalej
University of Maribor, FERI, Slovenia

Milan Zorman
University of Maribor, FERI, Slovenia

INTRODUCTION (already solved cases) whose classes are known. In-


stances are typically represented as attribute-value vec-
The aggressive rate of growth of disk storage and, thus, tors. Learning input consists of a set of vectors/in-
the ability to store enormous quantities of data have far stances, each belonging to a known class, and the output
outpaced our ability to process and utilize that. This consists of a mapping from attribute values to classes.
challenge has produced a phenomenon called data This mapping hypothesis should accurately classify both
tombsdata is deposited to merely rest in peace, never the learning instances and the new unseen instances. The
to be accessed again. But the growing appreciation that hypothesis hopefully represents generalized knowledge
data tombs represent missed opportunities in cases that is interesting for domain experts. The fundamental
supporting scientific discovering, business exploita- theory of learning is presented by Valiant (1984) and
tion, or complex decision making has awakened the Auer (1995).
growing commercial interest in knowledge discovery
and data-mining techniques. That, in order, has stimu- Single-Method Approaches
lated new interest in the automatic knowledge induction
from cases stored in large databasesa very important When comparing single approaches with different knowl-
class of techniques in the data-mining field. With the edge representations and different learning algorithms,
variety of environments, it is almost impossible to there is no clear winner. Each method has its own advan-
develop a single-induction method that would fit all tages and some inherent limitations. Decision trees
possible requirements. Thereafter, we constructed a (Quinlan, 1993), for example, are easily understandable
new so-called multi-method approach, trying out some by a human and can be used even without a computer, but
original solutions. they have difficulties expressing complex nonlinear prob-
lems. On the other hand, connectivistic approaches that
simulate cognitive abilities of the brain can extract com-
BACKGROUND plex relations, but solutions are not easily understandable
to humans (only numbers of weights), and, therefore, as
Through time, different approaches have evolved, such such, they are not directly usable for data mining. Evolu-
as symbolic approaches, computational learning theory, tionary approaches to knowledge extraction are also a
neural networks, and so forth. In our case, we focus on good alternative, because they are not inherently limited
an induction process to find a way to extract generalized to a local solution (Goldberg, 1989) but are
knowledge from observed cases (instances). That is computationally expensive. There are many other ap-
accomplished by using inductive inference that is the proaches, like representation of the knowledge with rules,
process of moving from concrete instances to general rough sets, case-based reasoning, support vector ma-
model(s), where the goal is to learn how to extract chines, different fuzzy methodologies, and ensemble
knowledge from objects by analyzing a set of instances methods (Dietterich, 2000).


Hybrid Approaches MULTI-METHOD APPROACH


Hybrid approaches rest on the assumption that only the Multi-method approach was introduced in Leni and Kokol
synergetic combination of single models can unleash (2002). While studying other approaches, we were in-
their full power. Each of the single methods has its spired by the idea of hybrid approaches and evolutionary
advantages but also inherent limitations and disadvan- algorithms. Both approaches are very promising in
tages, which must be taken into account when using a achieving the goal to improve the quality of knowledge
particular method. For example, symbolic methods usu- extraction and are not inherently limited to sub-optimal
ally represent the knowledge in human readable form, solutions. We also noticed that almost all attempts to
and the connectivistc methods perform better in classi- combine different methods use the loose coupling ap-
fication of unseen objects and are less affected by the proach. Of course, loose coupling is easier to imple-
noise in data as are symbolic methods. Therefore, the ment, but methods work almost independently of each
logical step is to combine both worlds to overcome the other, and, therefore, a lot of luck is needed to make
disadvantages and limitations of a single one. them work as a team.
In general, the hybrids can be divided according to Opposed to the conventional hybrids described in
the flow of knowledge into four categories (Iglesias, the previous section, our idea is to dynamically combine
1996): and apply different methods in no predefined order to
the same problem or decomposition of the problem. The
Sequential Hybrid (Chain Processing): The out- main concern of the mutli-method approach is to find a
put of one method is an input to another method. way to enable a dynamic combination of methods to the
For example, the neural net is trained with the somehow quasi-unified knowledge representation. In
training set to reduce noise. multiple, equally-qualitative solutions, like evolution-
Parallel Hybrid (Co-Processing): Different ary algorithms (EA), each solution is obtained using an
methods are used to extract knowledge. In the next application of different methods with different param-
phase, some arbitration mechanism should be used eters. Therefore, we introduce a population composed
to generate appropriate results. of individuals/solutions that have the common goal to
External Hybrid (Meta Processing): One improve their classification abilities on a given environ-
method uses another external one. For example, ment/problem. We also enable the coexistence of dif-
meta decision trees (Todorovski & Dzeroski, 2000) ferent types of knowledge representation in the same
that use neural nets in decision nodes to improve population. The most common knowledge representa-
the classification results. tion models have to be standardized and strictly typed to
Embedded Hybrid (Sub-Processing): One support the applicability of different methods on indi-
method is embedded in another. That is the most viduals. Each induction method implementation uses its
powerful hybrid, but the least modular one, be- own internal knowledge representation that is not com-
cause usually the methods are coupled tightly. patible with other methods that use the same type of
knowledge. A typical example is WEKA, which uses at
The hybrid systems are commonly static in structure least four different knowledge representations for de-
and cannot change the order of how single methods are cision trees. Standardization, in general, brings greater
applied. To be able to use embedded hybrids of different modularity and interchangeability, but it has the follow-
internal knowledge representation, it is commonly re- ing disadvantage: already existing methods cannot be
quired to transform one method representation into directly integrated and have to be adjusted to the stan-
another. Some transformations are trivial, especially dardized representation.
when converting from symbolic approaches. The prob- Initial population of extracted knowledge is gener-
lem is when the knowledge is not so clearly presented, ated using different methods. In each generation, differ-
like in a case of the neural network (McGarry, Wermter ent operations appropriate for individual knowledge are
& MacIntyre, 2001; Zorman, Kokol & Podgorelec, applied to improve existing and create new intelligent
2000). The knowledge representation issue is very im- systems. That enables incremental refinement of ex-
portant in the multi-method approach, and we solved it tracted knowledge, with different views on a given prob-
in the original manner. lem. The main problem is how to combine methods that


Figure 1. An example of a decision tree induced using reused in other methods, we introduced methods on the
multi-method approach. Each node is induced with basis of operators. Therefore, we introduced the opera-
appropriate method (GAgenetic algorithm, ID3, Gini, tion on an individual as a function that transforms one or
Chi-square, J-measure, SVM, neural network, etc.) more individuals into a single individual. Operation can
be a part of one or more methods, like a pruning operator,
a boosting operator, and so forth. An operator-based
view provides the ability simply to add new operations
Gini ID3, GA to the framework (Figure 2).
Usually, methods are composed of operations that
GA GA can be reused in other methods. We introduced the
operation on an individual that is a function, which
ID3
transforms one or more individuals into a single indi-

vidual. Operation can be part of one or more methods,
like a pruning operator, a boosting operator, and so
use different knowledge representations (e.g., neural net- forth. The transformation to another knowledge repre-
works and decision trees). In such cases, we have two sentation also is introduced on the individual operator
alternatives: (1) to convert one knowledge representation level. Therefore, the transition from one knowledge
to another using different already known methods or (2) to representation to another is presented as a method. An
combine both knowledge representations in a single intel- operator-based view provides us with the ability simply
ligent system. In both cases, knowledge transmutation is to add new operations to the framework (Figure 2).
executed (Cox & Ram, 1999). In the first case, conversion Representation with individual operations facili-
between different knowledge representations must be tates is an effective and modular way to represent the
implemented, which is usually not perfect, and some parts result as a single individual, but, in general, the result of
of the knowledge can be lost. But on the other hand, it can operation also can be a population of individuals (e.g.,
provide a different view and a good starting point in mutation operation in EA is defined on an individual
hypothesis search space. level and on the population level). The single method is
The second approach, which is based on combining composed of population operations that use individual
knowledge, requires some cut-points where knowledge operations and is introduced as a strategy in the frame-
representations can be merged. For example, in a deci- work that improves individuals in a population. A single
sion tree, such cut-points are internal nodes, where the method is composed out of population operations that
condition in an internal node can be replaced by another use individual operations and is introduced as a strategy
intelligent system (e.g., support vector machine [SVM]). in the framework that improves individuals in a popula-
The same idea also can be applied in decision leafs tion (Figure 2). Population operators can be general-
(Figure 1). ized with higher order functions and thereby reused in
different methods.
Operators
Meta-Level Control
Using the idea of the multi-method approach, we de-
An important concern of the multi-method framework
signed a framework that operates on a population of
is how to provide the meta-level services to manage the
extracted knowledge representationsindividuals. Since
available resources and the application of the methods.
methods usually are composed of operations that can be
We extended the quest for knowledge into another
dimension; that is, the quest for the best application
order of methods. The problem that arises is how to
Figure 2. Multi-method framework (framework support and methods, built from individual operators, population operators, and strategies acting on a population)

control the quality of resulting individuals and how to intervene in the case of bad results. Due to different knowledge representations, solutions cannot be compared trivially to each other, and the assessment of which method is better is hard to imagine. Individuals in a population cannot be evaluated explicitly, and the explicit fitness function cannot be calculated easily. Therefore, the comparison of individuals and the assessment of the quality of the whole population cannot be given. Even if some criteria to calculate fitness


function could be found, it probably would be very time-consuming and computational-intensive. Therefore, the idea of classical evolutionary algorithms controlling the quality of population and guidance to the objective cannot be applied.

Figure 3. Hypothesis separation using GA (hypotheses h1 and h2; separating hypothesis h3)
To achieve self-adaptive behavior of the evolution-
ary algorithm, the strategy parameters have to be coded
directly into the chromosome (Thrun & Pratt, 1998).
But in our approach, the meta-level strategy does not
know about the structure of the chromosome, and not all
methods. On other hand, there are many different methods
of the methods use the EA approach to produce a solu-
for decision-tree induction that all generate knowledge in
tion. Therefore, for meta-level chromosomes, the pa-
the form of tree. The most popular method is greedy
rameters of the method and its individuals are taken.
heuristic induction of a decision tree, which produces a
When dealing with self-adapting population with no
single decision tree with respect to the purity measure
explicit evaluation/fitness function, there is also an
(heuristic) of each split in a decision tree. Altering purity
issue of the best or most promising individual (progar
measure may produce totally different results with differ-
et al. 2000). But, of course, the question of how to
ent aspects (hypotheses) for a given problem. Hypoth-
control the population size or increase selection pres-
esis induction is done with the use of evolutionary algo-
sure must be answered effectively. Our solution was to
rithms (EA). When designing EA, algorithm operators for
classify population operators into three categories:
mutation, crossover and selection have to be carefully
operators for reducing, operators for maintaining, and
chosen (Podgorelec et al., 2001). Combining the EA ap-
operators for increasing the population size.
proach with heuristic approaches dramatically reduces
hypothesis search space.
There is another issue when combining two methods
CONCRETE COMBINATION using another classifier for separation. For example, in
TECHNIQUES Figure 3, there are two hypotheses, h1 and h2, that could
be perfectly separately induced using an existing set of
The multi-method approach searches for solutions in methods. Lets suppose that there is no method that is
huge (infinite) search space and exploits the acquisition able to acquire both hypotheses in a single hypothesis.
technique of each integrated method. Population of Therefore, we need a separation of the problem using
individual solutions represents a different aspect of another hypothesis, h3, which has no special meaning to
extracted knowledge. Transmutation from one to an- induce a successful composite hypothesis.
other knowledge representation introduces new aspects.
We can draw parallels between the multi-method ap- Problem Adaptation
proach and the scientific discovery. In real life, based on
the observed phenomena, various hypotheses are con- In many domains, we encounter data with very unbal-
structed. Different scientific communities draw differ- anced class distribution. That is especially true for
ent conclusions (hypotheses) consistent with collected applications in the medical domain, where most of the
data. For example, there are many theories about the instances are regular and only a small percent are as-
creation of the universe, but the current, widely ac- signed to an irregular class. Therefore, for most of the
cepted theory is the theory of the big bang. During the classifiers that want to achieve high accuracy and low
following phase, scientists discuss their theories, knowl- complexity, it is most rational to classify all new in-
edge is exchanged, new aspects are encountered, and stances into a majority class. But that feature is not
data collected is reevaluated. In that manner, existing desired, because we want to extract knowledge (espe-
hypotheses are improved, and new better hypotheses are cially when we want to explain a decision-making pro-
constructed. cess) and determine reasons for separation of classes.
To cope with the presented problem, we introduced an
Decision Trees instance reweighting method that works in a similar
manner, boosting but on a different level. Instances that
Decision trees are very appropriate to use as glue be- are rarely correctly classified gain importance. Fitness
tween different methods. In general, condition nodes criteria of individuals take importance into account and
contain classifier (usually simple attribute comparison) force competition among individual induction methods.
that enables quick and easy integration of different Of course, there is a danger of over-adapting to the


noise, but in that case, overall classification ability would be decreased, and other induced classifiers can perform better classification (self adaptation). We achieve a similar effect in boosting by concentrating on hard learnable instances and not dismissing already extracted knowledge.

EXPERIMENTAL RESULTS

Our current data-mining research is performed mainly in the medical domain; therefore, a knowledge representation in a human understandable form is very important, so we are focused on decision-tree induction. To make an objective assessment of our method, a comparison of extracted knowledge used for classification was made with reference methods C4.5, C5/See5 without boosting, C5/See5 with boosting, and genetic algorithm for decision-tree construction (Podgorelec et al., 2001). The following quantitative measures were used:

accuracy = (num. of correctly classified objects) / (num. of all objects)

accuracy_c = (num. of correctly classified objects in class c) / (num. of all objects in class c)

average class accuracy = (Σ_i accuracy_i) / (num. of classes)

We decided to use average class accuracy instead of sensitivity and specificity that are usually used when dealing with medical databases. Experiments have been made with seven real-world databases from the field of medicine. For that reason, we have selected only symbolic knowledge from a whole population of resulting solutions.
A detailed description of databases can be found in Lenič and Kokol (2002). Other databases have been downloaded from the online repository of machine-learning datasets maintained at UCI. We compared two variations of our multi-method approach (MultiVeDec) with four conventional approaches; namely, C4.5 (Quinlan, 1993), C5/See5, Boosted C5, and genetic algorithm (Podgorelec & Kokol, 2001). The results are presented in Table 1. Gray marked fields represent the best method on a specific database.

FUTURE TRENDS

As confirmed in many application domains, methods that use only single approaches often can lead to local optima and do not necessarily provide the big picture about the problem. By applying different methods, induction power of combined methods can supersede single methods. Of course, there is no single way to weave methods together. Our approach emphasizes modularity based on knowledge sharing. Implicit induction method knowledge is shared with others via produced hypothesis. Synergy of methods also can be improved by weaving implicit knowledge of induction method learning algorithms, which requires very tight coupling of two or many induction methods.

CONCLUSION

The success of the multi-method approach can be explained with the fact that some methods converge to local optima. With the combination of multiple methods (operators) in different order, a better local (and hopefully global) solution can be found. Static hybrid

Table 1. A comparison of the multimethod approach to conventional approaches (each cell gives accuracy on test set / average class accuracy on test set)

Database      | C4.5        | C5          | C5 Boost    | Genetically induced decision trees | SVM(RBF)    | Multimethod without problem adaption | Multimethod
av            | 76.7 / 44.5 | 81.4 / 48.6 | 83.7 / 50.0 | 90.7 / 71.4 | 83.7 / 55.7 | 93.0 / 78.7 | 90.0 / 71.4
breastcancer  | 96.4 / 94.4 | 95.3 / 95.0 | 96.6 / 96.7 | 96.6 / 95.5 | 96.5 / 97.1 | 96.6 / 95.1 | 96.9 / 97.8
heartdisease  | 74.3 / 74.4 | 79.2 / 80.7 | 82.2 / 81.1 | 78.2 / 80.2 | 39.6 / 42.7 | 78.2 / 78.3 | 83.1 / 82.8
Hepatitis     | 80.8 / 58.7 | 82.7 / 59.8 | 82.7 / 59.8 | 86.5 / 62.1 | 76.9 / 62.5 | 88.5 / 75.2 | 88.5 / 75.2
mracid        | 94.1 / 75.0 | 100.0 / 100.0 | 100.0 / 100.0 | 94.1 / 75.0 | 94.1 / 75.0 | 100.0 / 100.0 | 100.0 / 100.0
Lipids        | 60.0 / 58.6 | 66.2 / 67.6 | 71.2 / 64.9 | 76.3 / 67.6 | 25.0 / 58.9 | 75.0 / 73.3 | 75.0 / 73.3
Mvp           | 91.5 / 65.5 | 91.5 / 65.5 | 92.3 / 70.0 | 91.5 / 65.5 | 36.1 / 38.2 | 93.8 / 84.3 | 93.8 / 84.3

systems usually work sequentially or in parallel on the fixed structure and order, performing whole tasks. On the other hand, the multi-method approach works simultaneously with several methods on a single task (i.e., some parts are induced with different classical heuristics, some parts with hybrid methods, and still other parts with evolutionary programming). The presented multi-method approach enables a quick and modular way to integrate different methods into an existing system and enables the simultaneous application of several methods. It also enables partial application of method operations to improve and recombine aspects and has no limitation to the order and number of applied methods.

REFERENCES

Auer, P., Holte, R.C., & Cohen, W.W. (1995). Theory and applications of agnostic PAC-learning with small decision trees. Proceedings of the 12th International Conference on Machine Learning.

Cox, M.T., & Ram, A. (1999). Introspective multistrategy learning: On the construction of learning strategies. Artificial Intelligence, 112, 1-55.

Dietterich, T.G. (2000). Ensemble methods in machine learning. Proceedings of the First International Workshop on Multiple Classifier Systems.

Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison Wesley.

Iglesias, C.J. (1996). The role of hybrid systems in intelligent data management: The case of fuzzy/neural hybrids. Control Engineering Practice, 4(6), 839-845.

Lenič, M., & Kokol, P. (2002). Combining classifiers with multimethod approach. Proceedings of the Second International Conference on Hybrid Intelligent Systems, Soft Computing Systems: Design, Management and Applications, Frontiers in Artificial Intelligence and Applications, Amsterdam.

McGarry, K., Wermter, S., & MacIntyre, J. (2001). The extraction and comparison of knowledge from local function networks. International Journal of Computational Intelligence and Applications, 1(4), 369-382.

Podgorelec, V., & Kokol, P. (2001). Evolutionary decision forests: Decision making with multiple evolutionary constructed decision trees. Problems in applied mathematics and computational intelligence. World Scientific and Engineering Society Press, 97-103.

Podgorelec, V., Kokol, P., Yamamoto, R., Masuda, G., & Sakamoto, N. (2001). Knowledge discovery with genetically induced decision trees. Proceedings of the International ICSC Congress on Computational Intelligence: Methods and Applications (CIMA2001).

Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.

Šprogar, M., et al. (2000). Vector decision trees. Intelligent Data Analysis, 4(3,4), 305-321.

Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.

Todorovski, L., & Dzeroski, S. (2000). Combining multiple models with meta decision trees. Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery.

Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.

Vapnik, V.N. (1995). The nature of statistical learning theory. New York: Springer Verlag.

Wolpert, D.H., & Macready, W.G. (1995). No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe, NM.

Zorman, M., Kokol, P., & Podgorelec, V. (2000). Medical decision making supported by hybrid decision trees. Proceedings of ISA2000.

KEY TERMS

Boosting: Creation of an ensemble of hypotheses to convert a weak learner into a strong one by modifying the expected instance distribution.

Induction Method: Process of learning, from cases or instances, resulting in a general hypothesis of a hidden concept in data.

Method Level Operator: Partial operation of the induction/knowledge transformation level of a specific induction method.

Multi-Method Approach: Investigation of a research question using a variety of research methods, each of which may contain inherent limitations, with the expectation that combining multiple methods may produce convergent evidence.

Population Level Operator: Usually parameterized operation that applies method level operators (the parameter) to part of or the whole evolving population.

Transmutator: Knowledge transformation operator that modifies learner knowledge by exploring learner experience.

Comprehensibility of Data Mining Algorithms


Zhi-Hua Zhou
Nanjing University, China

INTRODUCTION

Data mining attempts to identify valid, novel, potentially useful, and ultimately understandable patterns from huge volumes of data. The mined patterns must be ultimately understandable because the purpose of data mining is to aid decision-making. If the decision-makers cannot understand what a mined pattern means, then the pattern cannot be used well. Since most decision-makers are not data mining experts, ideally, the patterns should be in a style comprehensible to common people. So, comprehensibility of data mining algorithms, that is, the ability of a data mining algorithm to produce patterns understandable to human beings, is an important factor.

BACKGROUND

A data mining algorithm is usually inherently associated with some representations for the patterns it mines. Therefore, an important aspect of a data mining algorithm is the comprehensibility of the representations it forms, that is, whether or not the algorithm encodes the patterns it mines in such a way that they can be inspected and understood by human beings. Actually, such an importance was argued by a machine learning pioneer many years ago (Michalski, 1983):

The results of computer induction should be symbolic descriptions of given entities, semantically and structurally similar to those a human expert might produce observing the same entities. Components of these descriptions should be comprehensible as single chunks of information, directly interpretable in natural language, and should relate quantitative and qualitative concepts in an integrated fashion.

Craven and Shavlik (1995) have indicated a number of concrete reasons why the comprehensibility of machine learning algorithms is very important. With slight modification, these reasons are also applicable to data mining algorithms.

• Validation: If the designers and end-users of a data mining algorithm are to be confident in the performance of the algorithm, they must understand how it arrives at its decisions.

• Discovery: Data mining algorithms may play an important role in the process of scientific discovery. An algorithm may discover salient features and relationships in the input data whose importance was not previously recognized. If the patterns mined by the algorithm are comprehensible, then these discoveries can be made accessible to human review.

• Explanation: In some domains, it is desirable to be able to explain the actions a data mining algorithm suggests taking for individual input patterns. If the mined patterns are understandable in such a domain, then explanations of the suggested actions on a particular case can be garnered.

• Improving performance: The feature representation used for a data mining task can have a significant impact on how well an algorithm is able to mine. Mined patterns that can be understood and analyzed may provide insight into devising better feature representations.

• Refinement: Data mining algorithms can be used to refine approximately-correct domain theories. In order to complete the theory-refinement process, it is important to be able to express, in a comprehensible manner, the changes that have been imparted to the theory during mining.

MAIN THRUST

It is evident that data mining algorithms with good comprehensibility are very desirable. Unfortunately, most data mining algorithms are not very comprehensible, and therefore their comprehensibility has to be enhanced by extra mechanisms. Since there are many different data mining tasks and corresponding data mining algorithms, it is difficult for such a short article to cover all of them. So, the following discussions are restricted to the comprehensibility of classification algorithms, but some of the essence is also applicable to other kinds of data mining algorithms.

Some classification algorithms are deemed comprehensible because the patterns they mine are expressed in an explicit way. Representatives are decision tree algorithms that encode the mined patterns in the form of a

decision tree which can be easily inspected. Some other classification algorithms are deemed incomprehensible because the patterns they mine are expressed in an implicit way. Representatives are artificial neural networks that encode the mined patterns in real-valued connection weights. Actually, many methods have been developed to improve the comprehensibility of incomprehensible classification algorithms, especially for artificial neural networks.

The main scheme for improving the comprehensibility of artificial neural networks is rule extraction, that is, extracting symbolic rules from trained artificial neural networks. It originates from Gallant's work on connectionist expert systems (Gallant, 1983). Good reviews can be found in (Andrews, Diederich, & Tickle, 1995; Tickle, Andrews, Golea, & Diederich, 1998). Roughly speaking, current rule extraction algorithms can be categorized into four categories, namely the decompositional, pedagogical, eclectic, or compositional algorithms. Each category is illustrated with an example below.

The decompositional algorithms extract rules from each unit in an artificial neural network and then aggregate. A representative is the RX algorithm (Setiono, 1997), which prunes the network and discretizes outputs of hidden units for reducing the computational complexity of examining the network. If a hidden unit has many connections, then it is split into several output units and some new hidden units are introduced to construct a subnetwork, so that the rule extraction process is iteratively executed. The RX algorithm is summarized in Table 1.

The pedagogical algorithms regard the trained artificial neural network as opaque and aim to extract rules that map inputs directly into outputs. A representative is the TREPAN algorithm (Craven & Shavlik, 1996), which regards the rule extraction process as an inductive learning problem and uses oracle queries to induce an ID2-of-3 decision tree that approximates the concept represented by a given network. The pseudo-code of this algorithm is shown in Table 2.

The eclectic algorithms incorporate elements of both the decompositional and pedagogical ones. A representative is the DEDEC algorithm (Tickle, Orlowski, & Diederich, 1996), which extracts a set of rules to reflect the functional dependencies between the inputs and the outputs of the artificial neural networks. Fig. 1 shows its working routine.

The compositional algorithms are not strictly decompositional because they do not extract rules from individual units with subsequent aggregation to form a global relationship, nor do they fit into the eclectic category because there is no aspect that fits the pedagogical profile. Algorithms belonging to this category are mainly designed for extracting deterministic finite-state automata (DFA) from recurrent artificial neural networks. A representative is the algorithm proposed by Omlin and Giles (1996), which exploits the phenomenon that the outputs of the recurrent state units tend to cluster; if each cluster is regarded as a state of a DFA, then the relationship between different outputs can be used to set up the transitions between different states. For example, assuming there are two recurrent state units s0 and s1, and their outputs appear as nine clusters, then the working style of the algorithm is shown in Fig. 2.

During the past years, powerful classification algorithms have been developed in the ensemble learning area. An ensemble of classifiers works through training multiple classifiers and then combining their predictions, which is usually much more accurate than a single classifier (Dietterich, 2002). However, since the classification is made by a collection of classifiers, the comprehensibility of an ensemble is poor even when its component classifiers are comprehensible.

A pedagogical algorithm has been proposed by Zhou, Jiang, and Chen (2003) to improve the comprehensibility of ensembles of artificial neural networks, which utilizes the trained ensemble to generate instances and then extracts symbolic rules from them. The success of this

Table 1. The RX algorithm

1. Train and prune the artificial neural network.


2. Discretize the activation values of the hidden units by clustering.
3. Generate rules that describe the network outputs using the discretized activation values.
4. For each hidden unit:
1) If the number of input connections is less than an upper bound, then extract rules to describe
the activation values in terms of the inputs.
2) Else form a subnetwork:
(a) Set the number of output units equal to the number of discrete activation values.
Treat each discrete activation value as a target output.
(b) Set the number of input units equal to the inputs connected to the hidden units.
(c) Introduce a new hidden layer.
(d) Apply RX to this subnetwork.
5. Generate rules that relate the inputs and the outputs by merging rules generated in Steps 3 and 4.
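Step 2 of Table 1 clusters each hidden unit's activation values into a small number of discrete levels. The sketch below shows one simple way this could be done (an illustration written for this entry, not Setiono's implementation; the function name and the epsilon threshold are invented): values within epsilon of the running cluster centre share one level.

```python
def discretize_activations(activations, epsilon=0.1):
    """Greedily cluster sorted 1-D activation values; values within
    epsilon of the current cluster centre share one discrete level."""
    levels = []    # representative (mean) value of each cluster
    members = []   # activation values assigned to each cluster
    for a in sorted(activations):
        if levels and abs(a - levels[-1]) <= epsilon:
            members[-1].append(a)
            levels[-1] = sum(members[-1]) / len(members[-1])  # update centre
        else:
            members.append([a])
            levels.append(a)
    # map every original activation to the centre of its cluster
    def level_of(a):
        return min(levels, key=lambda c: abs(c - a))
    return levels, [level_of(a) for a in activations]

levels, discrete = discretize_activations([0.02, 0.05, 0.48, 0.51, 0.97])
# levels ~ [0.035, 0.495, 0.97]; rules are then written over these levels
```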

Table 2. The TREPAN algorithm

TREPAN(training_examples, features)
    Queue ← {}
    for each example E ∈ training_examples
        E.label ← ORACLE(E)
    initialize the root of the tree, T, as a leaf node
    put <T, training_examples, {}> into Queue
    while Queue ≠ {} and size(T) < tree_size_limit
        remove node N from head of Queue
        examples_N ← example set stored with N
        constraints_N ← constraint set stored with N
        use features to build set of candidate splits
        use examples_N and calls to ORACLE(constraints_N) to evaluate splits
        S ← best binary split
        search for best MOFN split, S', using S as a seed
        make N an internal node with split S'
        for each outcome, s, of S'
            make C, a new child node of N
            constraints_C ← constraints_N ∪ {S' = s}
            use calls to ORACLE(constraints_C) to determine if C should remain a leaf,
            otherwise
                examples_C ← members of examples_N with outcome s on split S'
                put <C, examples_C, constraints_C> into Queue
    return T
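The splits TREPAN searches for are MOFN (m-of-n) expressions, which fire when at least m of n boolean antecedents hold. A minimal illustrative check (the helper name is invented) is:

```python
def m_of_n_fires(m, antecedents):
    """An m-of-n expression fires when at least m of its n boolean
    antecedents are true."""
    return sum(bool(a) for a in antecedents) >= m

# 2-of-{a, b, c} is equivalent to (a and b) or (a and c) or (b and c)
print(m_of_n_fires(2, [True, False, True]))   # True
```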

Figure 1. Working routine of the DEDEC algorithm: ANN training, followed by weight vector analysis (analyze the weights to obtain a rank of the inputs according to their relative importance in predicting the output), followed by iterative FD/rule extraction (search for functional dependencies between the important inputs and the output, then extract the corresponding rules).

Figure 2. The working style of Omlin and Giles's algorithm: three scatter plots of the outputs of the recurrent state units s0 and s1 (clustered into nine regions), showing (a) all the possible transitions from state 1, (b) all the possible transitions from state 2, and (c) all the possible transitions from states 3 and 4.

algorithm suggests that research on improving the comprehensibility of artificial neural networks can give illumination to the improvement of the comprehensibility of other complicated classification algorithms.

Recently, Zhou and Jiang (2003) proposed to combine ensemble learning and rule induction algorithms to obtain accurate and comprehensible classifiers. Their algorithm uses an ensemble of artificial neural networks as a data preprocessing mechanism for the induction of symbolic rules. Later, they (Zhou & Jiang, 2004) presented a new decision tree algorithm and showed that when the ensemble is significantly more accurate than the decision tree directly grown from the original training set, and the original training set has not fully captured the target distribution, using an ensemble as the preprocessing mechanism is beneficial. These works suggest the twice-learning paradigm to develop accurate and comprehensible classifiers, that is, using coupled classifiers where one classifier is devoted to the accuracy while the other is devoted to the comprehensibility.

FUTURE TRENDS

It was supposed that an algorithm which could produce explicitly expressed patterns is comprehensible. However, such a supposition might not be as valid as it appears to be. For example, is a decision tree containing hundreds of leaves comprehensible or not? A quantitative answer might be more feasible than a qualitative one. Thus, a quantitative measure of comprehensibility is needed. Such a measure can also help solve a long-standing problem, that is, how to compare the comprehensibility of different algorithms.

Since rule extraction is an important scheme for improving the comprehensibility of complicated data mining algorithms, frameworks for evaluating the quality of extracted rules are important. Actually, the FACC (Fidelity, Accuracy, Comprehensibility, Consistency) framework proposed by Andrews, Diederich, and Tickle (1995) has been used for almost a decade; it contains two important criteria, i.e., fidelity and accuracy. Recently, Zhou (2004) identified the fidelity-accuracy dilemma, which indicates that in some cases pursuing high fidelity and high accuracy simultaneously is impossible. Therefore, new evaluation frameworks have to be developed and employed, and the ACC (eliminating Fidelity from FACC) framework suggested by Zhou (2004) might be a good candidate.

Most current rule extraction algorithms suffer from high computational complexity. For example, in decompositional algorithms, if all the possible relationships between the connection weights and units in a trained artificial neural network are considered, then combinatorial explosion is inevitable for even moderate-sized networks. Although many mechanisms such as pruning have been employed to reduce the computational complexity, the efficiency of most current algorithms is not good enough. In order to work well in real-world applications, effective algorithms with better efficiency are needed.

Until now, almost all works on improving the comprehensibility of complicated algorithms rely on rule extraction. Although symbolic rules are relatively easy for human beings to understand, they are not the only comprehensible style that could be exploited. For example, visualization may provide good insight into a pattern. However, although there are a few works (Frank & Hall, 2004; Melnik, 2002) utilizing visualization techniques to improve the comprehensibility of data mining algorithms, few works attempt to exploit rule extraction and visualization together, which is evidently well worth exploring.

Previous research on comprehensibility has mainly focused on classification algorithms. Recently, some works on improving the comprehensibility of complicated regression algorithms have been presented (Saito & Nakano, 2002; Setiono, Leow, & Zurada, 2002). Since complicated algorithms exist extensively in data mining, more scenarios besides classification should be considered.

CONCLUSION

This short article briefly discusses comprehensibility issues in data mining. Although there is still a long way to go before patterns that can be understood by common people can be produced in any data mining task, endeavors on improving the comprehensibility of complicated algorithms have paved a promising way. It can be anticipated that experiences and lessons learned from this research might give illumination on how to design data mining algorithms whose comprehensibility is good enough not to need further improvement. Only when comprehensibility is not a problem can the fruits of data mining be fully enjoyed.

REFERENCES

Andrews, R., Diederich, J., & Tickle, A.B. (1995). Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8(6), 373-389.

Craven, M.W., & Shavlik, J.W. (1995). Extracting comprehensible concept representations from trained neural networks. In Working Notes of the IJCAI'95 Workshop on Comprehensibility in Machine Learning (pp. 61-75), Montreal, Canada.

Craven, M.W., & Shavlik, J.W. (1996). Extracting tree-structured representations of trained networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems, 8 (pp. 24-30). Cambridge, MA: MIT Press.

Dietterich, T.G. (2002). Ensemble learning. In M.A. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed.). Cambridge, MA: MIT Press.

Frank, E., & Hall, M. (2003). Visualizing class probability estimation. In N. Lavrač, D. Gamberger, H. Blockeel, & L. Todorovski (Eds.), Lecture Notes in Artificial Intelligence, 2838 (pp. 168-179). Berlin: Springer.

Gallant, S.I. (1983). Connectionist expert systems. Communications of the ACM, 31(2), 152-169.

Melnik, O. (2002). Decision region connectivity analysis: A method for analyzing high-dimensional classifiers. Machine Learning, 48(1-3), 321-351.

Michalski, R. (1983). A theory and methodology of inductive learning. Artificial Intelligence, 20(2), 111-161.

Omlin, C.W., & Giles, C.L. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1), 41-52.

Saito, K., & Nakano, R. (2002). Extracting regression rules from neural networks. Neural Networks, 15(10), 1279-1288.

Setiono, R. (1997). Extracting rules from neural networks by pruning and hidden-unit splitting. Neural Computation, 9(1), 205-225.

Setiono, R., Leow, W.K., & Zurada, J.M. (2002). Extraction of rules from artificial neural networks for nonlinear regression. IEEE Transactions on Neural Networks, 13(3), 564-577.

Tickle, A.B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6), 1057-1067.

Tickle, A.B., Orlowski, M., & Diederich, J. (1996). DEDEC: A methodology for extracting rules from trained artificial neural networks. In Proceedings of the AISB'96 Workshop on Rule Extraction from Trained Neural Networks (pp. 90-102), Brighton, UK.

Zhou, Z.-H. (2004). Rule extraction: Using neural networks or for neural networks? Journal of Computer Science and Technology, 19(2), 249-253.

Zhou, Z.-H., & Jiang, Y. (2003). Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble. IEEE Transactions on Information Technology in Biomedicine, 7(1), 37-42.

Zhou, Z.-H., & Jiang, Y. (2004). NeC4.5: Neural ensemble based C4.5. IEEE Transactions on Knowledge and Data Engineering, 16(6), 770-773.

Zhou, Z.-H., Jiang, Y., & Chen, S.-F. (2003). Extracting symbolic rules from trained neural network ensembles. AI Communications, 16(1), 3-15.

KEY TERMS

Accuracy: The measure of how well a pattern can generalize. In classification, it is usually defined as the percentage of examples that are correctly classified.

Artificial Neural Networks: A system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or units.

Comprehensibility: The understandability of a pattern to human beings; the ability of a data mining algorithm to produce patterns understandable to human beings.

Decision Tree: A flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf represents a class or class distribution.

Ensemble Learning: A machine learning paradigm using multiple learners to solve a problem.

Fidelity: The measure of how well the rules extracted from a complicated model mimic the behavior of that model.

MOFN Expression: A boolean expression consisting of an integer threshold m and n boolean antecedents, which is fired when at least m antecedents are fired. For example, the MOFN expression 2-of-{a, b, c} is logically equivalent to (a ∧ b) ∨ (a ∧ c) ∨ (b ∧ c).

Rule Extraction: Given a complicated model such as an artificial neural network and the data used to train it, produce a symbolic description of the model.

Symbolic Rule: A pattern explicitly comprising an antecedent and a consequent, usually in the form of IF ... THEN ... .

Twice-Learning: A machine learning paradigm using coupled learners to achieve two aspects of advantages. In its original form, two classifiers are coupled together, where one classifier is devoted to the accuracy while the other is devoted to the comprehensibility.

Computation of OLAP Cubes


Amin A. Abdulghani
Quantiva, USA

INTRODUCTION

The focus of Online Analytical Processing (OLAP) is to provide a platform for analyzing data (e.g., sales data) with multiple dimensions (e.g., product, location, time) and multiple measures (e.g., total sales or total cost). OLAP operations then allow viewing of this data from a number of perspectives. For analysis, the object or data structure of primary interest in OLAP is a cube.

BACKGROUND

An n-dimensional cube is defined as a group of k-dimensional (k<=n) cuboids arranged by the dimensions of the data. A cell represents an association of a measure m (e.g., total sales) with a member of every dimension (e.g., product='toys', location='NJ', year='2003'). By definition, a cell in an n-dimensional cube can have fewer dimensions than n. The dimensions not present in the cell are aggregated over all possible members. For example, you can have a two-dimensional (2-D) cell, C1(product='toys', year='2003'). Here, the implicit value for the dimension location is '*', and the measure m (e.g., total sales) is aggregated over all locations. A cuboid is a group-by of a subset of dimensions, obtained by aggregating all tuples on these dimensions. In an n-dimensional cube, a cuboid is a base cuboid if it has exactly n dimensions. If the number of dimensions is fewer than n, then it is an aggregate cuboid. Any of the standard aggregate functions, such as count, total, average, minimum, or maximum, can be used for aggregating. Figure 1 shows an example of a three-dimensional (3-D) cube, and Figure 2 shows an aggregate 2-D cuboid.

In theory, no special operators or SQL extensions are required to take a set of records in the database and generate all the cells for the cube. Rather, the SQL group-by and union operators can be used in conjunction with d sorts of the dataset to produce all cuboids. However, such an approach would be very inefficient, given the obvious interrelationships between the various group-bys produced.
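As a concrete (and deliberately naive) illustration of that point, the sketch below, which is not from the original article, enumerates every cuboid of a small relation by grouping on each subset of the dimensions; the relation and names used are invented for the example.

```python
from itertools import combinations
from collections import defaultdict

# toy relation: (product, location, year) plus the total-sales measure
rows = [
    ("toys",      "NJ", 2003, 120.0),
    ("toys",      "NY", 2003,  80.0),
    ("clothes",   "NJ", 2002, 200.0),
    ("cosmetics", "PA", 2001,  60.0),
]
dimensions = ("product", "location", "year")

def cuboid(rows, dims_kept):
    """Aggregate (SUM) the measure over all dimensions not in dims_kept."""
    cells = defaultdict(float)
    for *values, measure in rows:
        key = tuple(values[dimensions.index(d)] for d in dims_kept)
        cells[key] += measure
    return dict(cells)

# the full cube: one cuboid per subset of the dimensions (2^n group-bys)
cube = {dims: cuboid(rows, dims)
        for k in range(len(dimensions) + 1)
        for dims in combinations(dimensions, k)}

print(cube[("product",)])   # e.g. {('toys',): 200.0, ('clothes',): 200.0, ...}
print(cube[()])             # the fully aggregated 'all' cell
```

Every cuboid here re-scans the base relation, which is exactly the inefficiency that the algorithms described next try to avoid.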
MAIN THRUST

I now describe the essence of the major methods for the computation of OLAP cubes.

Figure 1. A 3-D cube that consists of 1-D, 2-D, and 3-D cuboids (axes: product = {toys, clothes, cosmetics}, location = {NJ, NY, PA}, year = {2001, 2002, 2003}; measure: total sales)

Figure 2. An example 2-D cuboid on (product, year) for the 3-D cube in Figure 1 (location='*'); total sales needs to be aggregated (e.g., SUM)
Top-Down Computation

In a seminal paper, Gray, Bosworth, Layman, and Pirahesh (1996) proposed the data cube operator as a means of simplifying the process of data cube construction. The algorithm presented forms the basis of the top-down approach. In the approach, the aggregation functions are categorized into three classes:

• Distributive: An aggregate function F is called distributive if there exists a function g such that the value of F for an n-dimensional cell can be computed by applying g to the value of F in an (n + 1)-dimensional cell. Examples of such functions include SUM and COUNT. For example, the COUNT() of an n-dimensional cell can be computed by applying SUM to the value of COUNT() in an (n + 1)-dimensional cell.

• Algebraic: An aggregate function F is algebraic if F of an n-dimensional cell can be computed by using a constant number of aggregates of the (n + 1)-dimensional cell. An example is the AVERAGE() function. The AVERAGE() of an n-dimensional cell can be computed by taking the sum and count of the (n + 1)-dimensional cell and then dividing the SUM by the COUNT to produce the global average.

• Holistic: An aggregate function F is called holistic if the value of F for an n-dimensional cell cannot be computed from a constant number of aggregates of the (n + 1)-dimensional cell. Median and mode are examples of holistic functions.
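To make the distinction concrete, here is a minimal sketch (not from the article; the record layout is invented) showing how SUM and COUNT roll up distributively from a finer cuboid, and how AVERAGE is obtained algebraically from the (sum, count) pair rather than from the averages themselves.

```python
from collections import defaultdict

# finer cuboid on (product, location): (sum of sales, count of rows)
finer = {
    ("toys", "NJ"):    (120.0, 3),
    ("toys", "NY"):    ( 80.0, 2),
    ("clothes", "NJ"): (200.0, 4),
}

# roll up to the coarser cuboid on (product,) by aggregating out location
coarser = defaultdict(lambda: (0.0, 0))
for (product, _location), (s, c) in finer.items():
    total_s, total_c = coarser[product]
    coarser[product] = (total_s + s, total_c + c)   # SUM of SUMs, SUM of COUNTs

# AVERAGE is algebraic: derived from the two distributive aggregates
averages = {product: s / c for product, (s, c) in coarser.items()}
print(averages)   # {'toys': 40.0, 'clothes': 50.0}
```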

The top-down cube computation works with distributive or algebraic functions. These functions have the property that more detailed aggregates (i.e., more dimensions) can be used to compute less detailed aggregates. This property induces a partial ordering (i.e., a lattice) on all the group-bys of the cube. A group-by is called a child of some parent group-by if the parent can be used to compute the child (and no intermediate group-bys exist between the parent and child). Figure 3 depicts a sample lattice where A, B, C, and D are dimensions, nodes represent group-bys, and the edges show the parent-child relationship.

The basic idea for top-down cube construction is to start by computing the base cuboid (the group-by for which no cube dimensions are aggregated). A single pass is made over the data, a record is examined, and the appropriate base cell is incremented. The remaining group-bys are computed by aggregating over an already computed finer-grade group-by. If a group-by can be computed from one or more possible parent group-bys, then the algorithm uses the parent smallest in size. For example, for computing the cube ABCD, the algorithm starts out by computing the cuboid for ABCD. Then, using ABCD, it computes the cuboids for ABC, ABD, and BCD. The algorithm then repeats itself by computing the 2-D cuboids, AB, BC, AD, and BD. Note that the 2-D cuboids can be computed from multiple parents. For example, AB can be computed from ABC or ABD. The algorithm selects the smaller group-by (the group-by with the fewest number of cells). An example top-down cube computation is shown in Figure 4.

Variants of this approach optimize on additional costs. The best-known methods are PipeSort and PipeHash (Agarwal, Agrawal, Deshpande, Gupta, Naughton, Ramakrishnan, et al., 1996). The basic idea of both algorithms is that a minimum spanning tree should be generated from the original lattice such that the cost of traversing edges will be minimized. The optimizations for the costs that these algorithms include are as follows:

• Cache-results: This optimization aims at ensuring that the result of a group-by is cached (in memory) so other group-bys can use it in the future.

• Amortize-scans: This optimization amortizes the cost of a disk read by computing the maximum possible number of group-bys together in memory.

• Share-sorts: For a sort-based algorithm, this aims at sharing sorting cost across multiple group-bys.

Figure 3. Cube lattice on dimensions A, B, C, and D (from the base group-by ABCD down through ABC, ABD, ACD, BCD, the 2-D and 1-D group-bys, to 'all')

Figure 4. Top-down cube computation over the same lattice, proceeding from ABCD downward

• Share-partitions: For a hash-based algorithm, when the hash table is too large to fit in memory, data are partitioned and aggregated to fit in memory. This can be achieved by sharing this cost across multiple group-bys.

Figure 5. Bottom-up cube computation (the lattice of Figure 3 traversed from the bottom: all; A, B, C, D; the 2-D and 3-D group-bys; up to ABCD)

Both PipeSort and PipeHash are examples of ROLAP (Relational OLAP) algorithms, as they operate on multidimensional tables. An alternative is MOLAP (Multidimensional OLAP), where the approach is to store the cube directly as a multidimensional array. If the data are in relational form, then they are loaded into arrays, and the cube is computed from the arrays. The advantage is that no tuples are needed for comparison purposes; all operations are performed by using array indices.

A good example of a MOLAP-based cubing algorithm is the MultiWay algorithm (Zhao, Deshpande, & Naughton, 1997). Here, to make efficient use of memory, a single large d-dimensional array is divided into smaller d-dimensional arrays, called chunks. It is not necessary to keep all chunks in memory at the same time; only part of the group-by arrays is needed for a given instance. The algorithm starts at the base cuboid with a single chunk. The results are passed to immediate lower-level cuboids of the top-down cube computation tree. After the chunk processing is complete, the next chunk is loaded. The process then repeats itself. An advantage of the MultiWay algorithm is that it also allows for shared computation, where intermediate aggregate values are reused for the computation of successive descendant cuboids. The disadvantage of MOLAP-based algorithms is that they are not scalable for a large number of dimensions and sparse data.

Bottom-Up Computation

Computing and storing full cubes may be overkill. Consider, for example, a dataset with 10 dimensions, each with a domain size of 9 (i.e., each can have nine distinct values). Computing the full cube for this dataset would require computing and storing 10^10 cells. Not all cuboids in the full cube may be of interest. For example, if the measure is total sales, and a user is interested in finding the cuboids that have high sales (greater than a given threshold), then computing cells with low total sales would be redundant. It would be more efficient here to only compute the cells that have total sales greater than the threshold. The group of cells that satisfies the given query is referred to as an iceberg cube. The query returning the group of cells is referred to as an iceberg query.

Iceberg queries are typically computed using a bottom-up approach (Beyer & Ramakrishnan, 1999). These methods work from the bottom of the cube lattice and then work their way up towards the cells with a larger number of dimensions. Figure 5 illustrates this approach. The approach is typically more efficient for high-dimensional, sparse datasets. The key observation to its success is in its ability to prune cells in the lattice that do not lead to useful answers. Suppose, for example, that the iceberg query is COUNT()>100. Also, suppose there is a cell X(A=a1, B=b1) with COUNT=75. Then, clearly, X does not satisfy the query. However, in addition, you can see that any cell X' in the cube lattice that includes the dimensions of X (e.g., A=a1, B=b1) will also not satisfy the query. Thus, you need not look at any such cell after having looked at cell X. This forms the basis of support pruning. The observation was first made in the context of frequent sets for the generation of association rules (Agrawal, Imielinski, & Swami, 1993).
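The following is a minimal sketch of this pruning idea (a simplified, BUC-like recursion written for illustration; it is not the algorithm of Beyer & Ramakrishnan, 1999, and the names are invented). It expands a cell one dimension at a time and stops expanding any cell whose COUNT already falls below the iceberg threshold, since COUNT can only shrink as more dimensions are fixed.

```python
def iceberg_count_cells(rows, dims, min_count, cell=None, start=0):
    """Return all cells (dicts of dim -> value) whose COUNT >= min_count.
    rows: list of dicts; dims: list of dimension names."""
    cell = cell or {}
    matching = [r for r in rows if all(r[d] == v for d, v in cell.items())]
    if len(matching) < min_count:      # monotonic: no specialization can recover
        return []
    result = [(dict(cell), len(matching))]
    for i in range(start, len(dims)):  # specialize on one more dimension
        d = dims[i]
        for value in sorted({r[d] for r in matching}):
            result += iceberg_count_cells(matching, dims, min_count,
                                          {**cell, d: value}, i + 1)
    return result

rows = [{"A": a, "B": b} for a in "xxy" for b in "uv"]
print(iceberg_count_cells(rows, ["A", "B"], min_count=2))
# cells with A=y and a fixed B are never enumerated: their parent already fails
```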
For more arbitrary queries, detecting whether a cell is prunable is more difficult. The fundamental property that allows pruning is called monotonicity of the query: Let D be a database and X ⊆ D be a cuboid. A query Q() is monotonic at X if the condition Q(X) is FALSE implies that Q(X') is FALSE for any X' ⊆ X. With this definition, a cell X in the cube would be pruned if the query Q is monotonic at X. However, determining whether a query Q is monotonic in terms of this definition is an NP-hard problem for many simple classes of queries (Imielinski, Khachiyan, & Abdulghani, 2002). To work around this problem, another notion of monotonicity, referred to as view monotonicity of a query, is introduced. Suppose you have cuboids defined on a set S of dimensions and measures. A view V on S is an assignment of values to the elements of the set. If the assignment holds for the dimensions and measures in a given cell X, then V is a view for X on the set S. So, for example, in a cube of rural buyers, if the average sale of bread is 15 with 20 people, then the view on the set {areaType, COUNT(), AVG(salesBread)} for the cube is {areaType=rural, COUNT()=20, AVG(salesBread)=15}. Extending the definition, a view on a query Q for X is an assignment of values for the set of dimension and measure attributes of the query expression.

A query Q() is view monotonic on a view V(Q,X) if for any cell X in any database D such that V is the view for X, the condition Q is FALSE for X implies Q is FALSE for all X' ⊆ X. An important property of view monotonicity is that the time and space required for checking it for a query depends on the number of terms in the query and not on the size of the database or the number of its attributes. Because most queries typically have few terms, it would be useful in many practical situations. Imielinski et al. (2002) present a method for checking view monotonicity for a query that includes constraints of type (Agg {<, >, =, !=} c), where c is a constant and Agg can be MIN, SUM, MAX, AVERAGE, COUNT, aggregates that are higher order moments about the origin, or aggregates that are an integral of a function on a single attribute.

Hybrid Approach

Typically, for low-dimension, low-cardinality, dense datasets, the top-down approach is more applicable than the bottom-up one. However, combining the two approaches leads to an even more efficient algorithm (Xin, Han, Li, & Wah, 2003). On the global computation order, the work presented uses the top-down approach. At a sublayer underneath, it exploits the potential of the bottom-up model. Consider the top-down computation tree in Figure 4. Notice that the dimension ABC is included for all the cuboids in the leftmost subtree. Similarly, all the cuboids in the second subtree include the dimensions AB. These common dimensions are termed the shared dimensions of the particular subtrees and enable bottom-up computation. The observation is that if a query is FALSE and (view-)monotonic on the cell defined by the shared dimensions, then the rest of the cells generated from this shared dimension are unneeded. The critical requirement is that for every cell X, the cell for the shared dimensions must be computed first. The advantage of such an approach is that it allows for shared computation as in the top-down approach as well as for pruning, when possible.

Other Approaches

Until now, I have considered computing the cuboids from the base data. Another commonly used approach is to materialize the results of a selected set of group-bys and evaluate all queries by using the materialized results. Harinarayan, Rajaraman, and Ullman (1996) describe an approach to materialize a limit of k group-bys. The first group-by to materialize always includes the top group-by, as none of the other group-bys can be used to answer queries for this group-by. The next group-by to materialize is included such that the benefit of including it in the set exceeds the benefit of any other nonmaterialized group-by. Let S be the set of materialized group-bys. The benefit of including a group-by v in the set S is the total savings achieved for computing the group-bys not included in S by using v, versus the cost of computing them through some group-by already in S. Gupta, Harinarayan, Rajaraman, and Ullman (1997) further extend this work to include indices in the cost.
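A minimal sketch of that greedy selection follows (written for this entry under simplifying assumptions, not the implementation of Harinarayan et al., 1996; the view sizes and the answers relation are invented). Each round adds the not-yet-materialized group-by whose total savings, measured against the current cheapest way of answering each group-by, is largest.

```python
def greedy_view_selection(sizes, answers, top, k):
    """Greedily pick k group-bys to materialize in addition to the top
    (base) group-by.  sizes: view -> number of cells; answers: view ->
    set of views that can be computed from it (including itself)."""
    materialized = {top}
    cost = {w: sizes[top] for w in sizes}     # cheapest current way to answer w

    def benefit(v):
        return sum(max(cost[w] - sizes[v], 0) for w in answers[v])

    for _ in range(k):
        candidates = [v for v in sizes if v not in materialized]
        best = max(candidates, key=benefit)
        materialized.add(best)
        for w in answers[best]:               # best may now answer w more cheaply
            cost[w] = min(cost[w], sizes[best])
    return materialized

sizes = {"ABCD": 100, "AB": 50, "A": 20, "B": 30, "all": 1}
answers = {"ABCD": {"ABCD", "AB", "A", "B", "all"},
           "AB": {"AB", "A", "B", "all"},
           "A": {"A", "all"}, "B": {"B", "all"}, "all": {"all"}}
print(greedy_view_selection(sizes, answers, top="ABCD", k=2))
# picks AB first (largest savings), then A
```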
The subset of the cuboids selected for materialization is referred to as a partial cube. There have been efficient approaches suggested in the literature for computing the partial cube. In one such approach, suggested by Dehne, Eavis, and Rau-Chaplin (2004), the cuboids are computed in a top-down fashion. The process starts with the original lattice or the PipeSort spanning tree of the original lattice, organizes the selected cuboids into a tree of minimal cost, and then further tries to reduce the cost by possibly adding intermediate nodes.

After a set of cuboids has been materialized, queries are evaluated by using the materialized results. Park, Kim, and Lee (2001) describe a typical approach. Here, the OLAP queries are answered in a three-step process. In the first step, it selects the materialized results that will be used for rewriting and identifies the part of the query (region) that the materialized result can answer. Next, query blocks are generated for these query regions. Finally, query blocks are integrated into a rewritten query.

FUTURE TRENDS

In this paper, I focus on the basic aspects of cube computation. The field is fairly recent, going back no more than 10 years. However, as the field is beginning to mature, issues are becoming better understood. Some of the issues that will get more attention in future work include:

• Advanced data structures for organizing the input tuples of the input cuboids (Han, Pei, Dong, & Wang, 2001; Xin et al., 2003).

• Making use of inherent properties of the dataset to reduce the computation of the data cubes. An example is the range cubing algorithm, which utilizes the correlation in the datasets to reduce the computation cost (Feng, Agrawal, Abbadi, & Metwally, 2004).

• Compressing the size of the data cube and storing it efficiently. A good example is a quotient cube, which partitions the cube into classes such that each cell in a class has the same aggregate value, and the lattice generated preserves the original cube's semantics (Lakshmanan, Pei, & Han, 2002).
Additionally, I believe that future work in this direction would attack the problem from multiple issues rather than a single focus. Sismanis, Deligiannakis, Roussopoulus, and Kotidis (2002) describe one such work, Dwarf. Its architecture integrates multiple features including the compressing of data cubes, a tunable parameter for controlling the amount of materialization, and indexing and support for incremental updates (which is important for the case when the underlying data are periodically updated).

CONCLUSION

This paper focuses on the methods for OLAP cube computation. All methods share the similarity that they make use of the ordering defined by the cube lattice to drive the computation. For the top-down approach, traversal occurs from the top of the lattice. This has the advantage that for the computation of successive descendant cuboids, intermediate node results are used. The bottom-up approach traverses the lattice in the reverse direction. The method can no longer rely on making use of the intermediate node results. Its advantage lies in the ability to prune cuboids in the lattice that do not lead to useful answers. The hybrid approach uses the combination of both methods, taking advantages of both.

The other major option for computing the cubes is to materialize only a subset of the cuboids and to evaluate queries by using this set. The advantage lies in storage costs, but additional issues, such as identification of the cuboids to materialize, algorithms for materializing these, and query rewrite algorithms, are raised.

REFERENCES

Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J. F., Ramakrishnan, R., et al. (1996). On the computation of multidimensional aggregates. Proceedings of the International Conference on Very Large Data Bases (pp. 506-521), India.

Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference (pp. 207-216), USA.

Beyer, K. S., & Ramakrishnan, R. (1999). Bottom-up computation of sparse and iceberg cubes. Proceedings of the ACM SIGMOD Conference (pp. 359-370), USA.

Dehne, F. K. H., Eavis, T., & Rau-Chaplin, A. (2004). Top-down computation of partial ROLAP data cubes. Proceedings of the Hawaii International Conference on System Sciences, USA.

Feng, Y., Agrawal, D., Abbadi, A. E., & Metwally, A. (2004). Range CUBE: Efficient cube computation by exploiting data correlation. Proceedings of the International Conference on Data Engineering (pp. 658-670), USA.

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. Proceedings of the International Conference on Data Engineering (pp. 152-159), USA.

Gupta, H., Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1997). Index selection for OLAP. Proceedings of the International Conference on Data Engineering (pp. 208-219), UK.

Han, J., Pei, J., Dong, G., & Wang, K. (2001). Efficient computation of iceberg cubes with complex measures. Proceedings of the ACM SIGMOD Conference (pp. 1-12), USA.

Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1996). Implementing data cubes efficiently. Proceedings of the ACM SIGMOD Conference (pp. 205-216), Canada.

Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). Cubegrades: Generalizing association rules. Journal of Data Mining and Knowledge Discovery, 6(3), 219-257.

Lakshmanan, L. V. S., Pei, J., & Han, J. (2002). Quotient cube: How to summarize the semantics of a data cube. Proceedings of the International Conference on Very Large Data Bases (pp. 778-789), China.

Park, C.-S., Kim, M. H., & Lee, Y.-J. (2001). Rewriting OLAP queries using materialized views and dimension hierarchies in data warehouses. Proceedings of the International Conference on Data Engineering, Germany.

Sismanis, Y., Deligiannakis, A., Roussopoulus, N., & Kotidis, Y. (2002). Dwarf: Shrinking the petacube. Proceedings of the ACM SIGMOD Conference (pp. 464-475), USA.

Xin, D., Han, J., Li, X., & Wah, B. W. (2003). Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. Proceedings of the International Conference on Very Large Data Bases (pp. 476-487), Germany.

Zhao, Y., Deshpande, P. M., & Naughton, J. F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. Proceedings of the ACM SIGMOD Conference (pp. 159-170), USA.

KEY TERMS

Algebraic Function: An aggregate function F is algebraic if F of an n-dimensional cuboid can be computed by using a constant number of aggregates of the (n + 1)-dimensional cuboid.

Bottom-Up Cube Computation: Cube construction that starts by computing from the bottom of the cube lattice and then working up toward the cells with a greater number of dimensions.

Cube Cell: Represents an association of a measure m with a member of every dimension.

Cuboid: A group-by of a subset of dimensions, obtained by aggregating all tuples on these dimensions.

Distributive Function: An aggregate function F is called distributive if there exists a function g such that the value of F for an n-dimensional cuboid can be computed by applying g to the value of F in the (n + 1)-dimensional cuboid.

Holistic Function: An aggregate function F is called holistic if the value of F for an n-dimensional cell cannot be computed from a constant number of aggregates of the (n + 1)-dimensional cell.

N-Dimensional Cube: A group of k-dimensional (k<=n) cuboids arranged by the dimensions of the data.

Partial Cube: The subset of the cuboids selected for materialization.

Sparse Cube: A cube is sparse if a high ratio of the cube's possible cells does not contain any measure value.

Top-Down Cube Computation: Cube construction that starts by computing the base cuboid and then iteratively computing the remaining cells by aggregating over already computed finer-grade cells in the lattice.

Concept Drift
Marcus A. Maloof
Georgetown University, USA

INTRODUCTION Research on the problem of concept drift has been


empirical and theoretical (see Kuh, Petsche, & Rivest,
Traditional approaches to data mining are based on an 1991; Mesterharm, 2003); however, the focus of this
assumption that the process that generated or is gener- article is on empirical approaches. In this regard, re-
ating a data stream is static. Although this assumption searchers have evaluated such approaches by using syn-
holds for many applications, it does not hold for many thetic data sets (for examples, see Maloof & Michalski,
others. Consider systems that build models for identify- 2004; Schlimmer & Grainger, 1986; Street & Kim,
ing important e-mail. Through interaction with and feed- 2001; Widmer & Kubat, 1996) and real data sets (for
back from a user, such a system might determine that examples, see Black & Hickey, 2002; Blum, 1997; Lane
particular e-mail addresses and certain words of the & Brodley, 1998), both small (Maloof & Michalski,
subject are useful for predicting the importance of e- 2004; Schlimmer & Grainger, 1986; Widmer & Kubat,
mail. However, when the user or the persons sending e- 1996) and large (Hulten, Spencer, & Domingos, 2001;
mail start other projects or take on additional responsi- Kolter & Maloof, 2003; Street & Kim, 2001; Wang,
bilities, what constitutes important e-mail will change. Fan, Yu, & Han, 2003).
That is, the concept of important e-mail will change or Systems for coping with concept drift fall into three
drift. Such a system must be able to adapt its model or broad categories: incremental, partial memory, and en-
concept description in response to this change. Coping semble. Incremental approaches use new instances to
with or tracking concept drift is important for other modify existing models. Partial-memory approaches
applications, such as market-basket analysis, intrusion maintain a subset of previously encountered instances,
detection, and intelligent user interfaces, to name a few. previously built and refined models, or both. When new
instances arrive, such systems use the new instances and
their store of past instances and models to build new
BACKGROUND models or refine existing ones. Finally, ensemble meth-
ods build and maintain multiple models to cope with
Concept drift may occur suddenly, what some call revo- concept drift. Naturally, systems designed for concept
lutionary drift. Or it may occur gradually, what some drift do not always fall neatly into only one of these
call evolutionary drift (Klenner & Hahn, 1994). Drift categories. For example, some partial-memory ap-
may occur at different time scales and at varying rates. proaches are also incremental (Maloof & Michalski,
Concepts may change and then reoccur, perhaps with 2004; Widmer & Kubat, 1996).
contextual cues (Widmer, 1997). For example, the con- A model or concept description is a generalized
cept of warm is different in the summer than in the representation of instances or training data. Such repre-
winter. A contextual cue or variable is perhaps the sentations are important for prediction and for better
season or the daily mean temperature. It helps identify understanding the data set from which they were built.
the appropriate model to use for determining if, for Researchers have used a variety of representations for
example, it is a warm day. Coping with concept drift drifting concepts, including the instances themselves,
requires an online approach, meaning that the system probabilities, linear equations, decision trees, and deci-
must mine a stream of data. If drift occurs slowly sion rules. Methods for inducing and building such
enough, distinguishing between real and virtual concept representations include instance-based learning, nave
drift may be difficult (Klinkenberg & Joachims, 2000). Bayes, support vector machines, C4.5, and AQ15, re-
The former occurs when concepts indeed change over spectively. See Hand, Mannila, and Smyth (2001) for
time; the latter occurs when performance drops during additional information.
the normal online process of building and refining a
model. Such drops could be due to differences in the
data in different parts of the stream. The richness of this MAIN THRUST
problem has led to an equally rich collection of ap-
proaches. Researchers have proposed a variety of approaches for
learning concepts that drift. However, evaluating such

approaches is a critical issue. In the next two sections, I unknown instance, the method uses the instances values
survey approaches for learning concepts that change to traverse the current tree from the root to a leaf node, C
over time and discuss issues of evaluating such ap- returning as the prediction the associated label.
proaches. The Concept Drift 3 system (Black & Hickey, 1999),
or CD3, uses batches of instances annotated with a time
Survey of Approaches for Concept Drift stamp of either current or new to build a decision tree.
When drift occurs, time becomes more relevant for
STAGGER (Schlimmer & Grainger, 1986) was the first prediction, so the time-stamp attribute will appear higher
system for coping with concept drift. Its model consists in the decision tree. After pruning, CD3 converts the
of nodes, corresponding to features and class labels, tree to rules by enumerating all paths containing a new
linked together with probabilistic arcs, representing the time stamp and then removing conditions involving the
strength of association between features and class labels. time stamp. CD3 predicts the class of the best matching
As STAGGER processes new instances, it increases or rule.
decreases probabilities, and it may add nodes and arcs. To Ensemble methods maintain a set of models and use
classify an unknown instance, STAGGER predicts the a voting procedure to yield a global prediction. Blums
most probable class. (1997) implementation of Weighted-Majority
Partial-memory approaches maintain a store of par- (Littlestone & Warmuth, 1994) uses as models histo-
tially built models, a portion of the previously encoun- ries of labels associated with pairs of features. If a
tered instances, or both. Such approaches vary in how models features are present in an instance, then it
they use such information for adjusting current models. predicts the most frequent label present in its history.
The FLORA Systems (Widmer & Kubat, 1996) maintain The method initializes each model with a weight of 1 and
a sequence of examples over a dynamically adjusted reduces a models weight if it predicts incorrectly. It
window of time. The Window Adjustment Heuristic predicts based on a weighted vote of the predictions of
(WAH) adjusts the size of this window in response to the models.
performance changes. Generally, if performance is de- The Streaming Ensemble Algorithm (Street & Kim,
creasing or poor, then the heuristic reduces the windows 2001) maintains a fixed-size collection of models, each
size; if it is increasing or acceptable, then it increases built from a fixed number of instances. When a new
the size. These systems also maintain a store of rules, batch of instances arrives, SEA builds a new model. If
including ones that are overly general, although these space exists in the collection, then it adds the new
are not used for prediction. As the systems process model. Otherwise, it replaces the worst performing
instances, they create new rules or refine existing ones. model with the new model, if one exists. SEA predicts
To classify an instance, the FLORA systems select the the majority vote of the predictions of the models in the
rule that best matches and return its class label. collection.
The AQ-PM systems maintain a set of examples over a window of time, but the systems select examples from the boundaries of rules, so they can retain examples that do not reoccur in the data stream. AQ-PM (Maloof & Michalski, 2000) builds new rules when new examples arrive, whereas AQ11-PM (Maloof & Michalski, 2004) refines existing rules. These systems maintain examples over a static window of time, but AQ11-PM+WAH (Maloof, 2003) incorporates Widmer and Kubat's (1996) WAH for dynamically sizing this window. Because these systems use rules, when classifying an instance, they return as their prediction the class label of the best matching rule.

The Concept-adapting Very Fast Decision Tree system (Hulten et al., 2001), or CVFDT, progressively grows a decision tree downward from the leaf nodes. It maintains frequency counts for attribute values by class and extends the tree when a statistical test indicates that a change has occurred. CVFDT also maintains at each node a list of alternate subtrees, which it swaps with the current subtree when it detects drift. To classify an instance, CVFDT uses its current tree.

The CD3 system (Black & Hickey, 1999) builds decision trees from instances labeled with a time-stamp attribute. If drift occurs, the time stamp becomes relevant for prediction, so the time-stamp attribute will appear higher in the decision tree. After pruning, CD3 converts the tree to rules by enumerating all paths containing a new time stamp and then removing conditions involving the time stamp. CD3 predicts the class of the best matching rule.

Ensemble methods maintain a set of models and use a voting procedure to yield a global prediction. Blum's (1997) implementation of Weighted-Majority (Littlestone & Warmuth, 1994) uses as models histories of labels associated with pairs of features. If a model's features are present in an instance, then it predicts the most frequent label present in its history. The method initializes each model with a weight of 1 and reduces a model's weight if it predicts incorrectly. It predicts based on a weighted vote of the predictions of the models.

The Streaming Ensemble Algorithm (Street & Kim, 2001), or SEA, maintains a fixed-size collection of models, each built from a fixed number of instances. When a new batch of instances arrives, SEA builds a new model. If space exists in the collection, then it adds the new model. Otherwise, it replaces the worst performing model with the new model, if one exists. SEA predicts the majority vote of the predictions of the models in the collection.

The Accuracy-Weighted Ensemble (Wang et al., 2003) also maintains a fixed-size collection of models, each built from a batch of instances. However, this method weights each classifier in the collection based on its performance on the most recent batch. When adding a new weighted model, if there is no space in the collection, then the method stores only the top weighted models. The method predicts based on a weighted-majority vote of the predictions of the models in the collection.


Dynamic Weighted Majority (Kolter & Maloof, 2003), or DWM, maintains a collection of weighted models but dynamically adds and removes models with changes in performance. Instead of building a single model with each batch, DWM uses new instances to refine all the models in the collection. Each time a model predicts incorrectly, DWM reduces its weight, and DWM removes a model from the collection if its weight falls below a threshold. Like the previous method, DWM predicts based on a weighted-majority vote of the predictions of the models, but if the global prediction (i.e., the weighted-majority vote) is incorrect for an instance, then DWM adds a new model to the collection.
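The following sketch illustrates the bookkeeping behind such a dynamically weighted ensemble. It is written in the spirit of DWM rather than as its exact implementation: the multiplicative penalty, the weight threshold, and the base-learner interface (learn_one, predict_one) are assumptions made for the example.

    # Illustrative dynamically weighted ensemble in the spirit of DWM.
    # `learner_factory` stands for any incremental base learner exposing an
    # assumed interface with learn_one(x, y) and predict_one(x).
    from collections import Counter

    class WeightedEnsemble:
        def __init__(self, learner_factory, beta=0.5, threshold=0.01):
            self.factory = learner_factory
            self.beta, self.threshold = beta, threshold
            self.members = [[learner_factory(), 1.0]]   # (model, weight) pairs

        def predict(self, x):
            votes = Counter()
            for model, weight in self.members:
                votes[model.predict_one(x)] += weight
            return votes.most_common(1)[0][0]

        def learn(self, x, y):
            global_prediction = self.predict(x)
            for entry in self.members:
                model, weight = entry
                if model.predict_one(x) != y:
                    entry[1] = weight * self.beta        # penalise wrong experts
            # Drop experts whose weight fell below the threshold.
            self.members = [m for m in self.members if m[1] >= self.threshold]
            # A wrong global prediction triggers the creation of a new expert.
            if global_prediction != y or not self.members:
                self.members.append([self.factory(), 1.0])
            for model, _ in self.members:
                model.learn_one(x, y)                    # refine every expert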
Evaluation

Evaluating systems for concept drift is a critical issue. Ideally, one wants to use a real-world data set, but finding such a data set in which concept drift is easy to identify is itself a challenge. After all, if it were easy to detect concept drift in large data sets, then the task of writing systems to cope with it would be trivial. Even if one establishes that drift is occurring in a data set, to conduct a proper evaluation, the phenomenon must produce a measurable effect in a method's performance. Moreover, the effect of concept drift on the method's performance must be greater than all other effects, such as that due to the variability from processing new examples of a target concept. As a result of these issues, the majority of evaluations have involved synthetic or artificial data sets.

A hallmark of synthetic data sets for concept drift is that they cycle through a series of target concepts. The first target concept persists for a period of time, then the second persists, and so on. For each time step of a period, one randomly generates a set of training examples, which the method uses to build or refine its models. The method evaluates the resulting model on the examples in a test set and computes a measure of performance, such as predictive accuracy (i.e., the percentage of the test examples the method predicted correctly). One generates test cases every time step or every time period. As the system processes examples, ideally, its performance on the test set improves at a certain rate and to a certain level of accuracy. Indeed, the slope and asymptote are critical characteristics of a method's performance.

Crucially, each time the target concept changes, one generates a new set of testing examples. Naturally, when the method applies the model built for the previous target concept to the examples of the new concept, the method's performance will be poor. However, as the method processes training examples of the new target concept, as before, performance will improve with some slope and to some asymptote.

By far, the most widely used synthetic data set is the so-called STAGGER Concepts. Originally used by Schlimmer and Granger (1986), it has been the centerpiece for many evaluations (e.g., Kolter & Maloof, 2003; Maloof & Michalski, 2000, 2004; Widmer, 1997; Widmer & Kubat, 1996). There are three attributes: size, color, and shape. Size can be small, medium, or large; color can be red, blue, or green; shape can be triangle, circle, or rectangle. There are three target concepts, and the presentation of examples lasts for 120 time steps. The target concept for the first 40 time steps is [size = small] & [color = red]. For the next 40, it is [color = green] ∨ [shape = circle]. For the final 40, it is [size = medium ∨ large]. At each time step, one generates a single training example and 100 test cases of the target concept. Naturally, the method updates its model by using the training example and evaluates it by using the testing examples, calculating accuracy. One presentation of the STAGGER Concepts is not sufficient for a proper evaluation, so researchers conduct multiple runs, averaging accuracy at each time step.
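A minimal sketch of this evaluation protocol is given below. The three target concepts and the 120-step schedule follow the description above; the tuple encoding of attributes, the number of runs, and the learner interface (learn_one, predict_one) are illustrative assumptions.

    # Illustrative generator and evaluation loop for the STAGGER Concepts.
    import random

    SIZES = ["small", "medium", "large"]
    COLORS = ["red", "blue", "green"]
    SHAPES = ["triangle", "circle", "rectangle"]

    def target(instance, step):
        size, color, shape = instance
        if step < 40:
            return size == "small" and color == "red"
        elif step < 80:
            return color == "green" or shape == "circle"
        return size in ("medium", "large")

    def random_instance(rng):
        return (rng.choice(SIZES), rng.choice(COLORS), rng.choice(SHAPES))

    def evaluate(learner, runs=50, steps=120, tests=100):
        accuracy = [0.0] * steps
        for run in range(runs):
            rng = random.Random(run)
            model = learner()
            for step in range(steps):
                x = random_instance(rng)
                model.learn_one(x, target(x, step))       # one training example
                test = [random_instance(rng) for _ in range(tests)]
                correct = sum(model.predict_one(t) == target(t, step) for t in test)
                accuracy[step] += correct / tests
        return [a / runs for a in accuracy]               # averaged per time step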
The STAGGER Concepts is clearly a small problem with only 27 possible examples, so researchers have recently proposed larger synthetic data sets involving concept drift (e.g., Hulten et al., 2001; Street & Kim, 2001; Wang et al., 2003). I do not have the space here to survey the details, similarities, and differences of these synthetic problems, but they are all based on the same ideas present in the STAGGER Concepts. For instance, researchers have used rotating (Hulten et al., 2001) and shifting (Street & Kim, 2001) hyperplanes as changing target concepts.

As noted previously, there have also been evaluations involving concept drift in real data sets. Blum (1997) used a calendar-scheduling task in which a user's preference for meetings changed over time. Lane and Brodley (1998) examined an intrusion-detection application, mining sequences of UNIX commands. Finally, Black and Hickey (2002) studied concept drift in the phone records of customers of British Telecom.

FUTURE TRENDS

Tracking concept drift is a rich problem that has led to a diverse set of approaches and evaluation methodologies, and there are many opportunities for further investigation. One is the development of better methods for evaluating systems that cope with evolutionary drift. I have already discussed the difficult nature of this problem: To evaluate how systems cope with drift, there must be a measurable effect. Slow or evolutionary drift may not produce an effect different enough from that of mining the sequence of instances. If noise is present, then it is even more difficult to distinguish among the variability due to instances, drift, and noise. Existing systems are probably capable of tracking slow concept drift, but we need evaluation methodologies for measuring how such drift affects performance.

I have made the case for the importance of strong evaluations, and in this regard, researchers need to better place their work in context with past efforts. Presently, researchers often choose not to use data sets from past studies, and create new ones for their investigation. Because new studies typically consist of new methods evaluated on new data sets, it is difficult to place new methods in context with previous work.


As a consequence, it is presently impossible to truly understand the strengths and weaknesses of existing methods, both old and new. This is not to say that researchers should not create or introduce new data sets. Indeed, I have already mentioned that the STAGGER Concepts is a small problem, so creating a new, larger one is required if, say, researchers want to establish how methods scale to larger data streams. Nonetheless, by first evaluating new methods on existing problems, researchers will be able to make stronger conclusions about the performance of their method, and the community will be able to better understand the contribution of the method.

Finally, we need to develop a systems theory of concept drift. Presently, we have little to guide the development of future methods or to guide the selection of a particular method for a new application. However, before developing such a theory, we will need to develop better methods for evolutionary concept drift, and we will need stronger evaluations that place new work in context with that of the past.

CONCLUSION

Tracking concept drift is important for many applications, from e-mail sorting to market-basket analysis. Concepts may change quickly or gradually; they may occur once or at regular intervals. The diversity and complexity of coping with concept drift has led researchers to propose an equally varied set of approaches. Some modify their models, some do so with partial memory of the past, and some rely on groups of models. Although there have been evaluations of these approaches involving real data sets, the majority have involved synthetic data sets, which give researchers great control in testing hypotheses. With stronger evaluations that place new systems in context with past work, we will be able to propose theories for systems that cope with concept drift, theories that we can test and that will lead to new systems for this critical problem for many applications.

REFERENCES

Black, M., & Hickey, R. (1999). Maintaining the performance of a learned classifier under concept drift. Intelligent Data Analysis, 3, 453-474.

Black, M., & Hickey, R. (2002). Classification of customer call data in the presence of concept drift and noise. In Lecture Notes in Computer Science: Vol. 2311. Software 2002: Computing in an imperfect world (pp. 74-87). New York: Springer.

Blum, A. (1997). Empirical support for Winnow and Weighted-Majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26, 5-23.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 97-106).

Klenner, M., & Hahn, U. (1994). Concept versioning: A methodology for tracking evolutionary concept drift in dynamic concept systems. Proceedings of the 11th European Conference on Artificial Intelligence, England (pp. 473-477).

Klinkenberg, R., & Joachims, T. (2000). Detecting concept drift with support vector machines. Proceedings of the 17th International Conference on Machine Learning (pp. 487-494), USA.

Kolter, J., & Maloof, M. (2003). Dynamic weighted majority: A new ensemble method for tracking concept drift. Proceedings of the Third IEEE International Conference on Data Mining (pp. 123-130), USA.

Kuh, A., Petsche, T., & Rivest, R. L. (1991). Learning time-varying concepts. In Advances in neural information processing systems 3 (pp. 183-189). San Francisco: Morgan Kaufmann.

Lane, T., & Brodley, C. (1998). Approaches to online learning and concept drift for user identification in computer security. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 259-263), USA.

Littlestone, N., & Warmuth, M. K. (1994). The Weighted Majority algorithm. Information and Computation, 108, 212-261.

Maloof, M. (2003). Incremental rule learning with partial instance memory for changing concepts. Proceedings of the International Joint Conference on Neural Networks (pp. 2764-2769), USA.

Maloof, M., & Michalski, R. (2000). Selecting examples for partial memory learning. Machine Learning, 41, 27-52.

Maloof, M., & Michalski, R. (2004). Incremental learning with partial instance memory. Artificial Intelligence, 154, 95-126.


Mesterharm, C. (2003). Tracking linear-threshold concepts with Winnow. Journal of Machine Learning Research, 4, 819-838.

Schlimmer, J., & Granger, R. (1986). Beyond incremental processing: Tracking concept drift. Proceedings of the Fifth National Conference on Artificial Intelligence (pp. 502-507), USA.

Street, W., & Kim, Y. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 377-382), USA.

Wang, H., Fan, W., Yu, P., & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 226-235), USA.

Widmer, G. (1997). Tracking context changes through meta-learning. Machine Learning, 27, 259-286.

Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23, 69-101.

KEY TERMS

Class Label: A label identifying the concept or class of an instance.

Concept Drift: A phenomenon in which the class labels of instances change over time.

Data Set: A set of instances of target concepts.

Data Stream: A data set distributed over time.

Instance: A set of attributes, their values, and a class label. Also called example, record, or case.

Model: A representation of a data set or stream that one can use for prediction or for better understanding the data from which it was derived. Also called hypothesis or concept description. Models include probabilities, linear equations, decision trees, decision rules, and the like.

Synthetic Data Set: A set of artificial target concepts.

Target Concept: The true, typically unknown model underlying a data set or data stream.


Condensed Representations for Data Mining


Jean-François Boulicaut
INSA de Lyon, France

INTRODUCTION

Condensed representations have been proposed in Mannila and Toivonen (1996) as a useful concept for the optimization of typical data-mining tasks. It appears as a key concept within the inductive database framework (Boulicaut et al., 1999; de Raedt, 2002; Imielinski & Mannila, 1996), and this article introduces this research domain, its achievements in the context of frequent itemset mining (FIM) from transactional data, and its future trends.

Within the inductive database framework, knowledge discovery processes are considered as querying processes. Inductive databases (IDBs) contain not only data, but also patterns. In an IDB, ordinary queries can be used to access and manipulate data, while inductive queries can be used to generate (mine), manipulate, and apply patterns. To motivate the need for condensed representations, let us start from the simple model proposed in Mannila and Toivonen (1997). Many data-mining tasks can be abstracted into the computation of a theory. Given a language L of patterns (e.g., itemsets), a database instance r (e.g., a transactional database), and a selection predicate q, which specifies whether a given pattern is interesting or not (e.g., the itemset is frequent in r), a data-mining task can be formalized as the computation of Th(L,q,r) = {φ ∈ L | q(φ,r) is true}. This also can be considered as the evaluation of the inductive query q. Notice that it specifies that every pattern that satisfies q has to be computed. This completeness assumption is quite common for local pattern discovery tasks but is generally not acceptable for more complex tasks (e.g., accuracy optimization for predictive model mining). The selection predicate q can be defined in terms of a Boolean expression over some primitive constraints (e.g., a minimal frequency constraint used in conjunction with a syntactic constraint, which enforces the presence or the absence of some subpatterns). Some of the primitive constraints generally refer to the behavior of a pattern in the data by using the so-called evaluation functions (e.g., frequency).

To support the whole knowledge discovery process, it is important to support the computation of many different but correlated theories.

It is well known that a generate-and-test approach that would enumerate the sentences of L and then test the selection predicate q is generally impossible. A huge effort has been made by data-mining researchers to make an active use of the primitive constraints occurring in q to achieve a tractable evaluation of useful mining queries. It is the domain of constraint-based mining (e.g., the seminal paper by Ng et al., 1998). In real applications, the computation of Th(L,q,r) can remain extremely expensive or even impossible, and the framework of condensed representations has been designed to cope with such a situation. The idea of ε-adequate representations was introduced in Mannila and Toivonen (1996) and Boulicaut and Bykowski (2000). Intuitively, they are alternative representations of the data that enable answering a class of queries (e.g., frequency queries for itemsets in transactional data) with a bounded precision. At a given precision ε, one can be interested in the smaller representations, which are then called concise or condensed representations. It means that a condensed representation for Th(L,q,r) is a collection C such that every pattern from Th(L,q,r) can be derived efficiently from C. In the database-mining context, where r might contain a huge volume of records, we assume that efficiently means without further access to the data. The following figure illustrates that we can compute Th(L,q,r) either directly (Arrow 1) or by means of a condensed representation (Arrow 2) followed by a regeneration phase (Arrow 3). We know several examples of condensed representations for which Phases 2 and 3 are much less expensive than Phase 1. We now introduce the background for understanding condensed representations in the well-studied context of FIM.

Figure 1. Computing the theory Th(L,q,r) from the data r either directly (Arrow 1) or by first extracting a condensed representation C for Th(L,q,r) (Arrow 2) and then regenerating the theory from C (Arrow 3)


BACKGROUND

In many cases (e.g., itemsets, inclusion dependencies, sequential patterns) and for a given selection predicate or constraint, the search space L is structured by an anti-monotonic specialization relation, which provides a lattice structure. For instance, in transactional data, when L is the power set of items, and the selection predicate enforces a minimal frequency, set inclusion is such an anti-monotonic specialization relation. Anti-monotonicity means that when a sentence does not satisfy q (e.g., an itemset is not frequent), then none of its specializations can satisfy q (e.g., none of its supersets are frequent). It becomes possible to prune huge parts of the search space, which cannot contain interesting sentences. This has been studied a lot within the learning as search framework (Mitchell, 1982), and the generic level-wise algorithm from Mannila and Toivonen (1997) has inspired many algorithmic developments. It computes Th(L,q,r) level-wise in the lattice by considering first the most general sentences (e.g., the singletons in the FIM problem). Then, it alternates candidate evaluation (e.g., frequency counting) and candidate generation (e.g., building larger itemsets from discovered frequent itemsets) phases. The algorithm stops when it cannot generate new candidates or, in other terms, when the most specific sentences have been found (e.g., all the maximal frequent itemsets). This collection of the most specific sentences is called a positive border in Mannila and Toivonen (1997), and it corresponds to the S set of a version space in Mitchell's terminology. The Apriori algorithm (Agrawal et al., 1996) is clearly the most famous instance of this level-wise algorithm.
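As an illustration of the level-wise scheme, the following sketch computes the frequent itemsets of a small transactional database. The data structures are simplifying assumptions, and no optimisation of the candidate generation or counting is attempted; it only mirrors the alternation of evaluation and generation phases described above.

    # Illustrative level-wise (Apriori-style) computation of Th(L, q, r)
    # for the minimal frequency constraint on itemsets.
    from itertools import combinations

    def frequent_itemsets(transactions, min_support):
        transactions = [frozenset(t) for t in transactions]
        def support(itemset):
            return sum(itemset <= t for t in transactions) / len(transactions)

        items = sorted({i for t in transactions for i in t})
        level = [frozenset([i]) for i in items]            # most general sentences
        theory = {}
        while level:
            # Candidate evaluation: count the frequency of each candidate.
            frequent = {c: support(c) for c in level if support(c) >= min_support}
            theory.update(frequent)
            if not frequent:
                break
            # Candidate generation: combine frequent itemsets of the current level.
            next_size = len(level[0]) + 1
            candidates = {a | b for a in frequent for b in frequent if len(a | b) == next_size}
            # Anti-monotonicity: prune candidates with an infrequent subset.
            level = [c for c in candidates
                     if all(frozenset(s) in frequent for s in combinations(c, next_size - 1))]
        return theory

    # Example: frequent_itemsets([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}], 0.5)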
The dual property of monotonicity is interesting, as well. A selection predicate is monotonic when its negation is anti-monotonic (i.e., when a sentence satisfies it, all its specializations satisfy it, as well). In the itemset pattern domain, the maximal frequency constraint or a syntactic constraint that enforces that a given item belongs to the itemsets are two monotonic constraints. Thanks to the duality of these definitions, a monotonic constraint gives rise to a border G, which contains the minimally general sentences w.r.t. the monotonic constraint (see the following figure).

Figure 2. The borders S and G in the search space ordered by the specialization relation

When the selection predicate is a conjunction of an anti-monotonic part and a monotonic part, the two borders define the solution set: solutions are between S and G, and (S,G) is a version space. For this conjunction case, several algorithms can be used (Bucila et al., 2002; Bonchi, 2003; de Raedt & Kramer, 2001; Jeudy & Boulicaut, 2002). When arbitrary Boolean combinations of anti-monotonic and monotonic constraints are used (e.g., disjunctions), the solution space is defined as a union of several version spaces (i.e., unions of couples of borders) (de Raedt et al., 2002).

Borders appear as a typical case of condensed representation. Assume that the collection of the maximal frequent itemsets in r is available (i.e., the S border for the minimal frequency constraint); this collection is generally several orders of magnitude smaller than the complete collection of the frequent itemsets in r, while all of them can be generated from S without any access to the data. However, in most of the applications of pattern discovery tasks, the user not only wants to get the interesting patterns, but also wants the results of some evaluation functions about these patterns. This is obvious for the FIM problem; these patterns are generally exploited in a post-processing step to derive more useful statements about the data (e.g., the popular frequent association rules that have a high enough confidence) (Agrawal et al., 1996). This can be done efficiently, if we compute not only the collection of frequent itemsets, but also their frequencies.

In fact, the semantics of an inductive query are better captured by extended theories, that is, collections like {(φ, e) ∈ L × E | q(φ,r) is true and e = f(φ,r)}, where e is the result of an evaluation function f in r with values in E. In our FIM problem, f denotes the frequency (e is a number in [0,1]) of an itemset in a transactional database r. The challenge of designing condensed representations for an extended theory ThE is then to identify subsets of ThE, from which it is possible to generate ThE either exactly or with an approximation on the evaluation functions.

MAIN RESULTS

We emphasize the main results concerning condensed representations for frequent itemsets, since this is the context for which it has been studied a lot.

Condensed Representations by Borders

It makes sense to use borders as condensed representations. For FIM, specific algorithms have been designed for computing directly the S border (Bayardo, 1998). Also, the algorithm in Kramer and de Raedt (2001) computes borders S and G and has been applied successfully to feature extraction in the domain of molecular fragment finding. In this case, a conjunction of a minimal frequency in one set of molecules (e.g., the active ones) and a maximal frequency in another set of molecules (e.g., the inactive ones) is used. This kind of research is related to the so-called emerging pattern discovery (Dong & Li, 1999).


Considering the extended theory for frequent itemsets, it is clear that, given the maximal frequent sets and their frequencies, we have an approximate condensed representation of the frequent itemsets. Without looking at the data, we can regenerate the whole collection of the frequent itemsets (subsets of the maximal ones), and we have a bounded error on their frequencies: when considering a subset of a maximal σ-frequent itemset, we know that its frequency is in [σ, 1]. Even though more precise bounds can be computed, this approximation is useless in practice. Indeed, when using borders, users have other applications in mind (e.g., feature construction).

The maximal frequent itemsets can be computed in cases where very large frequent itemsets hold, such that the regeneration process becomes impossible. Typically, when a maximal frequent itemset has a size of 30, it would lead to the regeneration of around 2^30, that is, about 10^9, frequent sets.
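A small sketch of this regeneration step is shown below. It assumes the border S is given as a list of maximal σ-frequent itemsets and simply enumerates their subsets, which is exactly the operation that blows up for very large maximal itemsets.

    # Illustrative regeneration of all frequent itemsets from the border S
    # (the maximal frequent itemsets); no access to the data r is needed.
    from itertools import combinations

    def regenerate_from_border(maximal_itemsets):
        frequent = set()
        for m in maximal_itemsets:
            for k in range(1, len(m) + 1):
                frequent.update(frozenset(c) for c in combinations(m, k))
        return frequent

    # Every regenerated itemset is only known to have a frequency in [sigma, 1],
    # where sigma is the minimal frequency threshold used to compute the border.
    # Example: regenerate_from_border([frozenset("abc"), frozenset("bd")])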

Exact Condensed Representations of Frequent Sets and Their Frequencies

Since Boulicaut and Bykowski (2000), a lot of attention has been put on exact condensed representations based on frequent closed sets. According to the classical framework of the Galois connection, the closure of an itemset X in r, closure(X,r), is the maximal superset of X which has the same frequency as X in r. Furthermore, a set X is closed if X = closure(X,r). Interestingly, the sets of itemsets that have the same closures constitute equivalence classes of itemsets that have the same frequencies, the maximal one in each equivalence class being a closed set (Bastide et al., 2000). Frequent closed sets are itemsets that are both closed and frequent. In dense and/or highly correlated data, we can have orders of magnitude fewer frequent closed sets than frequent itemsets. In other terms, it makes sense to materialize and, when possible, to look for the fastest computations of the frequent closed sets only. It is then easy to derive the frequencies of every frequent itemset without any access to the data. Many algorithms have been designed for computing frequent closed itemsets (Pasquier et al., 1999; Zaki & Hsiao, 2002). Empirical evaluations of many algorithms for the FIM problem are reported in Goethals and Zaki (2004). Frequent closed set mining is a real breakthrough to the computational complexity of FIM in difficult contexts.

A specificity of frequent closed set mining algorithms is the need for a characterization of (frequent) closed set generators. Interestingly, these generators constitute condensed representations, as well. An important characterization is the one of free sets (Boulicaut et al., 2000), which has been proposed independently under the name key patterns (Bastide et al., 2000). By definition, the closures of (frequent) free sets are (frequent) closed sets. Given the equivalence classes we quoted earlier, free sets are their minimal elements, and the freeness property can lead to efficient pruning, thanks to its anti-monotonicity. Computing the frequent free sets plus an extra collection of some non-free itemsets (part of the so-called negative border of the frequent free sets), it is possible to regenerate the whole collection of the frequent itemsets and their frequencies (Boulicaut et al., 2000; Boulicaut et al., 2003). On one hand, we often have many more frequent free sets than frequent closed sets but, on the other hand, they are smaller. The concept of freeness has been generalized for other exact condensed representations like the disjunction-free itemsets (Bykowski & Rigotti, 2001), the non-derivable itemsets (Calders & Goethals, 2002), and the minimal k-free representations of frequent sets (Calders & Goethals, 2003). Regeneration algorithms and translations between condensed representations have been studied, as well (Kryszkiewicz et al., 2004).
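The following sketch makes the closure operator concrete on a toy transactional database. It is a naive enumeration meant only to illustrate the definitions, not an efficient closed set miner such as those cited above.

    # Illustrative closure computation and frequent closed set extraction.
    from itertools import combinations

    def closure(itemset, transactions):
        covering = [t for t in transactions if itemset <= t]
        if not covering:
            return itemset
        closed = set(covering[0])
        for t in covering[1:]:
            closed &= t                      # items shared by all supporting rows
        return frozenset(closed)

    def frequent_closed_sets(transactions, min_support):
        transactions = [frozenset(t) for t in transactions]
        items = sorted({i for t in transactions for i in t})
        result = {}
        for k in range(1, len(items) + 1):
            for c in combinations(items, k):
                x = frozenset(c)
                supp = sum(x <= t for t in transactions) / len(transactions)
                if supp >= min_support and closure(x, transactions) == x:
                    result[x] = supp         # X is closed: X = closure(X, r)
        return result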
Approximate Condensed Representations for Frequent Sets and Their Frequencies

Looking for approximate condensed representations can be useful for very hard datasets. We mentioned that borders can be considered as approximate condensed representations but with a poor approximation of the needed frequencies. The idea is to be able to compute the frequent itemsets with less counting at the price of an acceptable approximation on their frequencies. One idea is to compute the frequent itemsets and their frequencies on a sample of the original data (Mannila & Toivonen, 1996), but bounding the error on the frequencies is hard. In Boulicaut et al. (2000), the concept of δ-free set was introduced. In fact, the free itemsets are a special case when δ = 0. When δ > 0, we have fewer δ-free sets than free sets, and, thus, the representation is more condensed. Interestingly, experimentations on real datasets have shown that the error in practice was much smaller than the theoretical bound (Boulicaut et al., 2000; Boulicaut et al., 2003). Another family of approximate condensed representations has been designed in Pei et al. (2002), where the idea is to consider the maximal frequent itemsets for different frequency thresholds and then approximate the frequency of any frequent set by means of the computed frequencies.
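To make δ-freeness concrete, the sketch below tests whether an itemset is δ-free on a toy database; it treats X as not δ-free as soon as removing a single item yields a subset whose absolute support differs from that of X by at most δ exceptions. This follows one common formulation of the definition and is only an illustrative assumption about the data representation.

    # Illustrative check of delta-freeness (delta = 0 yields the free/key sets).
    def absolute_support(itemset, transactions):
        return sum(itemset <= t for t in transactions)

    def is_delta_free(itemset, transactions, delta=0):
        x_count = absolute_support(itemset, transactions)
        for item in itemset:
            reduced = itemset - {item}
            # If the rule reduced => item holds with at most delta exceptions,
            # the itemset is not delta-free.
            if absolute_support(reduced, transactions) - x_count <= delta:
                return False
        return True

    # Example:
    # r = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("c")]
    # is_delta_free(frozenset("ab"), r)   # False: a => b holds exactly in r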


Multiple Uses of Condensed Representations

Another interesting achievement is that condensed representations of frequent itemsets are not only useful for FIM in difficult cases, but they also help to derive more meaningful patterns. In other terms, instead of a regeneration phase, which can be impossible due to the size of the collection of frequent itemsets, it is possible to use the condensed representations directly. Indeed, closed sets can be used to derive informative or non-redundant association rules. Also, frequent and valid association rules with a minimal left-hand side can be derived from δ-free sets, and these rules can be used, among others, for association-based classification. It is also possible to derive formal concepts in Wille's terminology and, thus, apply the so-called formal concept analysis.

FUTURE TRENDS

So far, we consider two promising directions of research that should provide new breakthroughs. First, it is useful to consider new mining tasks and design condensed representations for them. For instance, Casali et al. (2003) consider the computation of aggregates from data cubes, and Yan et al. (2003) address sequential pattern mining. An interesting open problem is to use condensed representations during model mining (e.g., classifiers). Then, merging condensed representations with constraint-based mining (with various constraints, including constraints that are neither anti-monotonic nor monotonic) seems to be a major issue. One challenging problem is to decide which kind of condensed representation has to be materialized for optimizing sequences of inductive queries (e.g., for interactive association rule mining); that is, the real context of knowledge discovery processes.

CONCLUSION

We introduced the key concept of condensed representations for the optimization of data-mining queries and, thus, the development of the inductive database framework. We have summarized an up-to-date view on borders. Then, referencing the work on the various condensed representations for frequent itemsets, we have pointed out the breakthrough in the computational complexity of the FIM tasks. This is important, because the applications of frequent itemsets go much further than the classical association rule-mining task. Finally, the concepts of condensed representations and ε-adequate representations are quite general and might be considered with success for many other pattern domains.

REFERENCES

Agrawal, R., et al. (1996). Fast discovery of association rules. In Advances in knowledge discovery and data mining (pp. 307-328). MIT Press.

Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., & Lakhal, L. (2000). Mining frequent patterns with counting inference. SIGKDD Explorations, 2(2), 66-75.

Bayardo, R. (1998). Efficiently mining long patterns from databases. Proceedings of the International Conference on Management of Data, Seattle, Washington.

Bonchi, F. (2003). Frequent pattern queries: Language and optimisations [doctoral thesis]. Pisa, Italy: University of Pisa.

Boulicaut, J.-F., & Bykowski, A. (2000). Frequent closures as a concise representation for binary data mining. Proceedings of Knowledge Discovery and Data Mining, Current Issues and New Applications, Kyoto, Japan.

Boulicaut, J.-F., Bykowski, A., & Rigotti, C. (2000). Approximation of frequency queries by means of free-sets. Proceedings of Principles of Data Mining and Knowledge Discovery, Lyon, France.

Boulicaut, J.-F., Bykowski, A., & Rigotti, C. (2003). Free-sets: A condensed representation of Boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery, 7(1), 5-22.

Boulicaut, J.-F., Klemettinen, M., & Mannila, H. (1999). Modeling KDD processes within the inductive database framework. Proceedings of Data Warehousing and Knowledge Discovery, Florence, Italy.

Bucila, C., Gehrke, J., Kifer, D., & White, W.M. (2003). DualMiner: A dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery, 7(3), 241-272.

Bykowski, A., & Rigotti, C. (2001). A condensed representation to find frequent patterns. Proceedings of the ACM Principles of Database Systems, Santa Barbara, California.

Calders, T., & Goethals, B. (2002). Mining all non-derivable frequent itemsets. Proceedings of the Principles of Data Mining and Knowledge Discovery, Helsinki, Finland.

Calders, T., & Goethals, B. (2003). Minimal k-free representations of frequent sets. Proceedings of the Principles of Data Mining and Knowledge Discovery, Dubrovnik, Croatia.


Casali, A., Cicchetti, R., & Lakhal, L. (2003). Cube lattices: A framework for multidimensional data mining. Proceedings of the SIAM International Conference on Data Mining, San Francisco, California.

de Raedt, L. (2002). A perspective on inductive databases. SIGKDD Explorations, 4(2), 66-77.

de Raedt, L., Jäger, M., Lee, S.D., & Mannila, H. (2002). A theory of inductive query answering. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan.

de Raedt, L., & Kramer, S. (2001). The levelwise version space algorithm and its application to molecular fragment finding. Proceedings of the International Joint Conference on Artificial Intelligence, Seattle, Washington.

Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Diego, California.

Goethals, B., & Zaki, M.J. (2004). Advances in frequent itemset mining implementations. SIGKDD Explorations, 6(1), 109-117.

Imielinski, T., & Mannila, H. (1996). A database perspective on knowledge discovery. Communications of the ACM, 39(11), 58-64.

Jeudy, B., & Boulicaut, J.-F. (2002). Optimization of association rule mining queries. Intelligent Data Analysis, 6(4), 341-357.

Kryszkiewicz, M., Rybinski, H., & Gajek, M. (2004). Dataless transitions between concise representations of frequent patterns. Intelligent Information Systems, 22(1), 41-70.

Mannila, H., & Toivonen, H. (1996). Multiple uses of frequent sets and condensed representations. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, Oregon.

Mannila, H., & Toivonen, H. (1997). Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3), 241-258.

Mitchell, T.M. (1982). Generalization as search. Artificial Intelligence, 18, 203-226.

Ng, R., Lakshmanan, L.V.S., Han, J., & Pang, A. (1998). Exploratory mining and pruning optimizations of constrained associations rules. Proceedings of the International Conference on Management of Data, Seattle, Washington.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999). Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1), 25-46.

Pei, J., Dong, G., Zou, W., & Han, J. (2002). On computing condensed frequent pattern bases. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan.

Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining closed sequential patterns in large databases. Proceedings of the SIAM International Conference on Data Mining, San Francisco, California.

Zaki, M.J., & Hsiao, C.J. (2002). CHARM: An efficient algorithm for closed itemset mining. Proceedings of the SIAM International Conference on Data Mining, Arlington, Texas.

KEY TERMS

Condensed Representations: Alternative representations of the data that preserve crucial information for being able to answer some kinds of queries. The most studied example concerns frequent sets and their frequencies. Their condensed representations can be several orders of magnitude smaller than the collection of the frequent itemsets.

Constraint-Based Data Mining: Concerns the active use of constraints that specify the interestingness of patterns. Technically, it needs strategies to push the constraints, or at least part of them, deeply into the data-mining algorithms.

Inductive Databases: An emerging research domain, where knowledge discovery processes are considered as querying processes. Inductive databases contain both data and patterns, or models, which hold in the data. They are queried by means of more or less ad-hoc query languages.

Pattern Domains: A pattern domain is the definition of a language of patterns, a collection of evaluation functions that provide properties of patterns in database instances, and the kinds of constraints that can be used to specify pattern interestingness.


Content-Based Image Retrieval


Timo R. Bretschneider
Nanyang Technological University, Singapore

Odej Kao
University of Paderborn, Germany

INTRODUCTION

Sensing and processing multimedia information is one of the basic traits of human beings: The audiovisual system registers and transports surrounding images and sounds. This complex recording system, complemented by the senses of touch, taste, and smell, enables perception and provides humans with data for analysing and interpreting the environment. Imitating this perception and the simulation of the processing was and still is one of the major leitmotifs of multimedia technology developments. The goal is to find a representation for every type of knowledge, which makes the reception and processing of information as easy as possible. The need to process given information, deliver it, and explain it to a certain audience exists in nearly all areas of day-to-day life: commerce, science, education, and entertainment (Smeulders, Worring, Santini, Gupta, & Jain, 2000).

The development of digital technologies and applications allowed the production of huge amounts of multimedia data. This information has to be systematically collected, registered, organised, and classified. Furthermore, search procedures, methods to formulate queries, and ways to visualise the results have to be provided. In early years, this task was tended to by existing database management systems (DBMS) with multimedia extensions. The basis for representing and modelling multimedia data is so-called binary large objects, which store images, video, and audio sequences without any formatting and analysis done by the system. Often, however, only a reference to the object is handled within the DBMS. For the utilisation of the stored multimedia data, user-defined functions (e.g., content analysis) access the actual data and integrate their results in the existing database. Hence, content-based retrieval becomes possible. A survey of existing retrieval systems was presented, for example, by Naphade and Huang (2002).

This article provides an overview of the complex relations and interactions among the different aspects of a content-based retrieval system, whereby the scope is purposely limited to images. The main issues of the data description, similarity expression, and access are addressed and illustrated for an actual system.

BACKGROUND

The concept of content-based retrieval is datacentric per se; that is, the design of a system has to reflect the characteristics of the data. Hence, neither an optimal solution that can span all kinds of multimedia data exists, nor is addressing the variety of data characteristics within one type even possible. However, there are parallels that lay the foundation, which then require tailor-made adaptation and specialisation. This section provides the general groundwork by pointing out the different types of the so-called metainformation, which describes the raw data:

Technical information refers to the details of the recording, conversion, and saving process (i.e., format and name of the stored media).

Extracted attributes are those that have been deduced by analysing the media content. They are usually called features and emphasise a certain aspect of the media. Simple features describe, for instance, statistical values of the contained information, while complex features and their weighted combinations attempt to describe the entire media content.

Knowledge-based information links the objects, people, scenarios, and so forth, detected in the media to entities in the real world.

World-oriented information encompasses information on the producer of the media, the date, location, and so forth. Manually added keywords fall especially into this group, making a primitive description and characterisation of the content possible.

As can be seen by this classification, technical and world-oriented information can be modelled straightforwardly in traditional database structures. Organising and searching can be done by using existing database functions. The utilisation of the extracted attributes and knowledge-based information is more complex in nature. Although most of the currently available DBMSs can be extended with multimedia add-ins, in many cases these are not sufficient, because they cannot describe the stored data to the required degree of retrieval accuracy.


However, only these two latter types of metainformation lift the system to an abstract level that allows the full exploitation of the content.

For an in-depth overview of content-based image retrieval techniques and systems, refer to Deb and Zhang (2004), Kalipsiz (2000), Smeulders et al. (2000), Vasconcelos and Kunt (2001), and Xiang and Huang (2000).

MAIN THRUST

The goal of multimedia retrieval is the selection of one or more images whose metainformation meets certain requirements or is similar to a given sample media instance. Searching the metainformation is usually based on a full-text search among the assigned keywords. Furthermore, content references, such as colour distributions in an image, or more complex information, such as wavelet coefficients, can be used. To solve the issue of having the desired search characteristic in the first place, most systems prefer to use a query with an example media item. The systems use this media as a starting point for the search and process it in the same manner as the other media objects when they were inserted in the database. The content is then analysed with the selected procedures, and the media is mapped to a vector consisting of (semi-)automatically extracted features. Hereafter, the raw data is only needed for display purposes, and all further processing focuses on analysing and comparing the representative vectors. The result of this comparison is a similarity ranking. The following interfaces can be used to specify a query in a multimedia database:

Browsing: Beginning with a predefined data set, the user can navigate in any desired direction by using a browser until a suitable media sample is found. This approach is often used when no suitable starting media is available.

Search with keywords: Technical and world-oriented data are represented by alphanumerical fields. These can be searched for a given keyword. Choosing these keywords is extraordinarily difficult for abstract structures such as textures, partially due to the subjectivity of the human perception.

Similarity search: The similarity search is based on comparing features extracted from the raw data. Most of these features do not exhibit immediate references to the image, making them highly abstract for users without special knowledge (Assfalg, Del Bimbo, & Pala, 2002; Brunelli & Mich, 2000). Depending on the availability and characteristic of the query medium, one differentiates between query by pictorial example, query by painting, selection from standards, and image montage.

All approaches have their individual advantages as well as disadvantages, and a suitable selection depends on the domain. For example, a fingerprint database is best realised by using the query-by-pictorial-example technique, but selection from standards is a suitable candidate for a comic strip database with a limited number of characters. However, the similarity search, in particular the query-by-pictorial-example approach, is one of the most powerful methods because it provides the greatest degree of flexibility. Thus, it determines the focus hereafter.

Many different methods for feature extraction were developed and can be classified by various criteria. Based on the point in time at which the features are extracted, a-priori and dynamically extracted features are distinguished. Although the first group is extracted during insertion of the corresponding media object in the database, the latter kind is generated at query time. The advantage of the dynamic feature extraction is that the user can define relevant elements in the sample image, and the remaining parts of the query image do not distract the actual search objective. Note that both approaches can be combined.

Regardless of the chosen approach, the actual features have to be extracted from the considered data. Examples for this step are histogram-based methods, calculation of statistical colour information (Mojsilovic, Hu, & Soljanin, 2002), contour descriptors (Berretti, Del Bimbo, & Pala, 2000), texture analysis (Gevers, 2002), and wavelet coefficient selection (Albuz, Kocalar, & Khokhar, 2001). The gained information, possibly from different algorithms, is combined in a so-called feature vector that is, by orders of magnitude, smaller than the raw data. This reduction in volume enables not only a suitable handling within the DBMS but also a higher level of abstraction. Therefore, it can often be utilised directly by semantic-based approaches (Djeraba, 2003; Fan, Luo, & Elmagarmid, 2004; Lu, Zhang, Liu, & Hu, 2003) and data-mining techniques (Datcu, Daschiel, & Pelizzari, 2003; Li & Narayanan, 2004).
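As a concrete example of such an a-priori extracted feature, the short sketch below computes a global colour histogram as the feature vector of an image. The use of the Pillow and NumPy libraries and the choice of 8 bins per channel are assumptions made for this illustration only.

    # Illustrative a-priori feature extraction: a global colour histogram.
    import numpy as np
    from PIL import Image

    def colour_histogram(path, bins=8):
        pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
        histogram, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                                      range=((0, 256),) * 3)
        feature = histogram.flatten()
        return feature / feature.sum()     # normalise so images of any size compare

    # The resulting 512-dimensional vector is stored as metainformation when the
    # image is inserted into the database and reused for every later query.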
The similarity of two multimedia objects in the content-based retrieval process is determined by comparing the representing feature vectors. Over the years, a large variety of metrics and similarity functions was developed for this purpose, whereby the best-known methods compute a multi-dimensional distance between the vectors: The smaller the distance, the higher the similarity of the corresponding media objects. Through the introduction of weights for the individual positions within the feature vectors, it is possible to emphasise and/or suppress desirable and undesirable query characteristics, respectively. In particular, the approach can help to particularise the query in iterative retrieval systems; that is, if the users select suitable and unsuitable retrievals, these are used by the system for adaptation in the next iteration (Jing, Li, Zhang, & Zhang, 2004).


However, the application of distance-based similarity measures is not undisputed, because the meaning of distance in a multidimensional vector space spanned by arbitrary features can be ambiguous. Proposed alternatives to the general problem include angular measures or the incorporation of the data statistics in the actual distance measure. Nevertheless, the application of even the simplest measures can be successful for all intents and purposes if they are tailor-made with respect to the actual database. Still, if this requirement is not met, the retrieval can nevertheless be sufficient as long as a sufficient number of matches for all the kinds of possible queries exists.
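A minimal sketch of such a weighted distance ranking is given below. The weighted Euclidean distance and the flat list of database entries are simplifying assumptions; an operational system would answer the query through an index structure instead of a linear scan.

    # Illustrative query-by-example ranking with a weighted distance measure.
    import numpy as np

    def weighted_distance(query, candidate, weights):
        difference = np.asarray(query) - np.asarray(candidate)
        return float(np.sqrt(np.sum(weights * difference ** 2)))

    def rank(query_vector, database, weights, top_k=10):
        # `database` is assumed to be a list of (image_id, feature_vector) pairs.
        scored = [(weighted_distance(query_vector, vec, weights), image_id)
                  for image_id, vec in database]
        return sorted(scored)[:top_k]       # smallest distance = most similar

    # Relevance feedback can raise the weights of vector positions that separate
    # the positively rated results from the rejected ones in the next iteration.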
Hitherto, the discussion on the evaluation of similarity among different feature vectors was limited to fairly simple structures; that is, the order and length of the vectors were clearly defined. However, these assumptions are insufficient when dealing with more complex feature extraction algorithms. An example is given by algorithms that result in an arbitrary number of features, solely depending on the content of the multimedia object. Thus, feature vectors of different lengths have to be compared. To circumvent this problem, most of the current systems limit the dimensionality to a fixed extent, either by the choice of extraction algorithms or through a reducing postprocessing. Instances of the latter approach are the principal component analysis (PCA) and vector quantisation (VQ). The retrieval results are acceptable, but in systems with a high insertion rate the overall performance suffers tremendously, either by constantly repeating the PCA or by dynamically optimising the codebook for the VQ, because a permanent adaptation is required. Meanwhile, a variety of alternative measures that can handle more complex feature vectors were developed, and one of the most prominent cases is the Earth Mover's Distance (EMD) (Rubner, Tomasi, & Guibas, 2000).
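For one-dimensional signatures with a unit ground distance between neighbouring bins, the EMD reduces to the accumulated difference of the two distributions, which the following sketch exploits. The general EMD between signatures of different lengths requires solving a transportation problem, for which this is only a simplified stand-in.

    # Illustrative Earth Mover's Distance for one-dimensional histograms with a
    # unit ground distance between neighbouring bins.
    import numpy as np

    def emd_1d(histogram_a, histogram_b):
        a = np.asarray(histogram_a, dtype=float)
        b = np.asarray(histogram_b, dtype=float)
        a, b = a / a.sum(), b / b.sum()      # compare distributions, not counts
        # The earth moved across each bin boundary is the running surplus of a over b.
        return float(np.abs(np.cumsum(a - b)).sum())

    # Example: emd_1d([0, 1, 0], [0, 0, 1]) == 1.0, whereas a bin-wise L1 distance
    # would not distinguish this shift from a move across many more bins.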
Selected features of an object, file, or other data structures are stored in indices, offering accelerated access. This fact implies that the construction and maintenance method of an index is of utmost importance for database efficiency. Data structures employed to support such queries are called multidimensional index structures. Well-known examples are the k-d-trees, grid files, R/R*-trees, SS/SR-trees, VP-trees, and VA-files. A general overview in the context of multimedia retrieval can be found in Lu (2002).

FUTURE TRENDS

Future trends in image retrieval are manifold, covering areas such as detailed search for image objects via consideration of semantic information from identified entities on the image, as well as the extension of the retrieval methods to special applications such as remote sensing. A sample application of the latter is given as an example in order to clarify the necessary adaptations.

The developments and applications of space-borne remote sensing platforms result in the production of huge amounts of image data. In particular, the step towards higher spatial and spectral resolution increases the obtained amount of data by orders of magnitude. To enable maintenance as well as retrieval, a large number of operational DBMS for remotely sensed imagery is available (Bretschneider, Cavet, & Kao, 2002; Datcu et al., 2003). However, due to the approach of basing the queries on world-oriented descriptions (e.g., location, scanner, and date), the systems are not always suitable for users with little expertise in remote sensing. Furthermore, content-related inquiries are not feasible in situations like the following scenario:

A certain region reveals symptoms of saltification, and a corresponding satellite image of the area was purchased. It is of interest to find other regions which suffered from the same phenomenon and which were successfully recovered or preserved, respectively. Thus, the applied strategies in these regions can help to develop effective counteractions for the specific case under investigation.

Generic content-based retrieval systems as introduced earlier in this article are not suitable, because their extracted features do not consider the special characteristics of the satellite imagery; that is, for these systems, all remotely sensed images exhibit a high degree of similarity that prohibits an appropriate differentiation. The Grid Retrieval System for Remotely Sensed Imagery, G(RS)2I, is a tailor-made retrieval system that consists of highly specialised feature extraction modules, corresponding similarity evaluation, and an adapted indexing technique under the umbrella of a web-accessible DBMS.

Most of the image content in terms of remote sensing is contained in the spectral characteristic of an obtained scene, whereby, in contrast to generic images, the number of colour bands can vary between a few and several hundred. For the description of such data, a ground cover classification approach is most suitable, because it is not only an abstract description of the content but is also easily linkable with the human understanding of observed features, for example, water surfaces, forests, and urban areas. For data that consists of multiple bands, this approach leads to a highly precise retrieval. Secondly, the spatial arrangement of the detected regions, as well as their textural composition, is retrieved as features. Last but not least, highly specialised extraction techniques analyse the data for specific features such as airports, rivers, and road networks.


Due to the large spatial coverage of a satellite scene (covering hundreds of kilometres and therefore containing highly varying landscapes), the extraction of several feature vectors at different positions within a scene is required, because a global descriptor provides only insufficient accuracy. To solve this issue, the G(RS)2I uses feature functions that describe the underlying data; that is, the multidimensional feature vectors are approximated by a hyper surface, which is modelled by radial basis functions (Bretschneider & Li, 2003). These analytically described surfaces enable powerful access approaches to the underlying data. Hence, the search for the best match of an extracted feature vector from a query image becomes an optimisation problem that is easily solvable, due to the explicit existence of the first derivative of the feature function.

For the measurement of similarity throughout the entire system, the G(RS)2I uses a modified version of the EMD (Li & Bretschneider, 2003), because the amount of content in a satellite scene, and therefore the length of the feature vector, is generally not predictable. This concept extends even to the indexing of the data, which is based on a VP-tree, and to the realisation of the iterative search engine. With respect to the latter aspect, the problem in content-based retrieval is that one feature vector often cannot describe the desired search characteristic precisely enough. Instead of adapting weights obtained through the user's feedback regarding the relevance of the previously retrieved data, the G(RS)2I is not limited to moving a single query point in the search space. The approach is to actually fuse the information content from the positively rated feature vectors by analysing the corresponding EMD flow matrix (Li & Bretschneider, 2003).

CONCLUSION

Content-based retrieval is beneficial for most types of data because it resembles the human approach of accessing the respective medium. In particular, this idea holds true for data that mankind can directly process. The actual conceptual and technical realisation of this natural process is a major challenge, because knowledge is fairly limited with respect to the way this is accomplished.
REFERENCES

Albuz, E., Kocalar, E., & Khokhar, A. A. (2001). Scalable color image indexing and retrieval using vector wavelets. IEEE Transactions on Knowledge and Data Engineering, 13(5), 851-861.

Assfalg, J., Del Bimbo, A., & Pala, P. (2002). Three-dimensional interfaces for querying by example in content-based image retrieval. IEEE Transactions on Visualization and Computer Graphics, 8(4), 305-318.

Berretti, S., Del Bimbo, A., & Pala, P. (2000). Retrieval by shape similarity with perceptual distance and effective indexing. IEEE Transactions on Multimedia, 2(4), 225-239.

Bretschneider, T., Cavet, R., & Kao, O. (2002). Retrieval of remotely sensed imagery using spectral information content. Proceedings of the Geoscience and Remote Sensing Symposium, 4 (pp. 2253-2256).

Bretschneider, T., & Li, Y. (2003). On the problems of locally defined content vectors in image databases for large images. Proceedings of the Pacific-Rim Conference on Multimedia, 3 (pp. 1604-1608).

Brunelli, R., & Mich, O. (2000). Image retrieval by examples. IEEE Transactions on Multimedia, 2(3), 164-171.

Datcu, M., Daschiel, H., & Pelizzari, A. (2003). Information mining in remote sensing image archives: System concepts. IEEE Transactions on Geoscience and Remote Sensing, 41(12), 2923-2936.

Deb, S., & Zhang, Y. (2004). An overview of content-based image retrieval techniques. Proceedings of the International Conference on Advanced Information Networking and Applications, 1 (pp. 59-64).

Djeraba, C. (2003). Association and content-based retrieval. IEEE Transactions on Knowledge and Data Engineering, 15(1), 118-135.

Fan, J., Luo, H., & Elmagarmid, A. K. (2004). Concept-oriented indexing of video databases: Toward semantic sensitive retrieval and browsing. IEEE Transactions on Image Processing, 13(7), 974-992.

Gevers, T. (2002). Image segmentation and similarity of color-texture objects. IEEE Transactions on Multimedia, 4(4), 509-516.

Jing, F., Li, M., Zhang, H.-J., & Zhang, B. (2004). Relevance feedback in region-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 14(5), 672-681.

Kalipsiz, O. (2000). Multimedia databases. Proceedings of the IEEE International Conference on Information Visualization (pp. 111-115).

Li, J., & Narayanan, R. M. (2004). Integrated spectral and spatial information mining in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 42(3), 673-685.


Li, Y., & Bretschneider, T. (2003). Supervised content-based satellite image retrieval using piecewise defined signature similarities. Proceedings of the Geoscience and Remote Sensing Symposium, 2 (pp. 734-736).

Lu, G.-J. (2002). Techniques and data structures for efficient multimedia retrieval based on similarity. IEEE Transactions on Multimedia, 4(3), 372-384.

Lu, Y., Zhang, H., Liu, W., & Hu, C. (2003). Joint semantics and feature based image retrieval using relevance feedback. IEEE Transactions on Multimedia, 5(3), 339-347.

Mojsilovic, A., Hu, H., & Soljanin, E. (2002). Extraction of perceptually important colors and similarity measurement for image matching, retrieval and analysis. IEEE Transactions on Image Processing, 11(11), 1238-1248.

Mojsilovic, A., Kovacevic, J., Hu, J.-Y., Safranek, R. J., & Ganapathy, S. K. (2000). Matching and retrieval based on the vocabulary and grammar of color patterns. IEEE Transactions on Image Processing, 9(1), 38-54.

Naphade, M. R., & Huang, T. S. (2002). Extracting semantics from audio-visual content: The final frontier in multimedia retrieval. IEEE Transactions on Neural Networks, 13(4), 793-810.

Rubner, Y., Tomasi, C., & Guibas, I. J. (2000). The earth mover's distance as a metric for image retrieval. Journal of Computer Vision, 40(2), 99-121.

Smeulders, A., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380.

Vasconcelos, N., & Kunt, M. (2001). Content-based retrieval from image databases: Current solutions and future directions. Proceedings of the IEEE International Conference on Image Processing, 3 (pp. 6-9).

Xiang, S. Z., & Huang, T. S. (2000). Image retrieval: Feature primitives, feature representation, and relevance feedback. Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries (pp. 10-14).

KEY TERMS

Content-Based Retrieval: The search for suitable objects in a database based on the content; often used to retrieve multimedia data.

Dynamic Feature Extraction: Analysis and description of the media content at the time of querying the database. The information is computed on demand and discarded after the query has been processed.

Feature Vector: Data that describes the content of the corresponding multimedia object. The elements of the feature vector represent the extracted descriptive information with respect to the utilised analysis.

Index Structures: Adapted data structures to accelerate the retrieval. The a-priori extracted features are organised in such a way that the comparisons can be focused to a certain area around the query.

Multimedia Database: A multimedia database system consists of a high-performance database management system and a database with a large storage capacity and supports and manages, in addition to alphanumerical data types, multimedia objects regarding storage, querying, and searching.

Query by Pictorial Example: The query is formulated by using a user-provided example for the desired retrieval. Both the query and stored media objects are analysed in the same way.

Similarity: Correspondence of two data objects of the same medium. The similarity is determined by comparing the corresponding feature vectors, for example, by a metric or distance function.


Continuous Auditing and Data Mining


Edward J. Garrity
Canisius College, USA

Joseph B. O'Donnell
Canisius College, USA

G. Lawrence Sanders
State University of New York at Buffalo, USA

INTRODUCTION

Investor confidence in the financial markets has been rocked by recent corporate frauds, and many in the investment community are searching for solutions. Meanwhile, due to changes in technology, organizations are increasingly able to produce financial reports on a real-time basis. Access to this timely information can help investors, shareholders, and other third parties, but only if this information is accurate and verifiable. Real-time financial reporting requires real-time or continuous auditing (CA) to ensure the integrity of the reported information. Continuous auditing is "a type of auditing which produces audit results simultaneously, or a short period of time after, the occurrence of relevant events" (Kogan, Sudit, & Vasarhelyi, 2003, p. 1). CA is facilitated by eXtensible Business Reporting Language (XBRL), which enables seamless transmission of company financial information to auditor data warehouses. Data mining of these warehouses provides opportunities for the auditor to determine financial trends and identify erroneous transactions.

BACKGROUND

Auditing is "a systematic process of objectively obtaining and evaluating evidence of assertions about economic actions and events to ascertain the correspondence between those assertions and established criteria and communicating the results to interested parties" (Konrath, 2002, p. 5). In CA, the collection of evidence is constant, and evaluation of the evidence occurs promptly after collection (Kogan et al., 2003).

Computerized Assisted Auditing Techniques (CAATs) are computer programs or software applications that are used to improve audit efficiency. CAATs offer great promise to improve audits but have not met expectations due to a lack of a common interface with IT systems (Liang et al., 2001, p. 131). Also, concurrent CAATs often require that special audit software modules be embedded at the EDP system design stage (Liang et al., 2001, p. 131). Many entities are reluctant to allow the implementation of embedded audit modules, which perform CA, due to concerns that these CAATs could adversely affect systems processing in areas such as reducing response times. Considering these difficulties, it is not surprising that continuous transaction monitoring tools are the second least used software by auditors (Daigle & Lampe, 2003).

These hurdles to CAAT usage are being minimized by the emergence of eXtensible Markup Language (XML) and eXtensible Business Reporting Language (XBRL), which minimize system interface issues. XML is a mark-up language that allows tagging of data to give the data meaning. XBRL is a variant of XML that is designed specifically for financial reporting and provides the capability of real-time online performance reporting. Both XML and XBRL enable the receiver of the data to seamlessly download information to the receiver's data warehouse.

According to David and Steinbart (2000), data warehouses improve audit quality and efficiency by reducing the time needed to access data and perform data analysis. Improved audit quality should lead to early detection, and possible prevention, of fraudulent financial reporting. Auditor data warehouses may also be used in financial fraud litigations in providing evidence to evaluate the legitimacy of transactions and the appropriateness of auditor actions in assessing transactions.

Data mining techniques are well suited to evaluate CA-generated warehouse data, but advances in audit tools are needed. Data mining and analysis software is the most commonly used audit software (Bierstaker, Burnaby, & Hass, 2003). Auditor data mining and analysis software typically includes low-level statistical tools and auditor-specific models like Benford's Law. Benford's Law holds that there is a naturally occurring pattern of values in the digits of a number (Nigrini, 2002). Significant variation from the expected number pattern may be due to erroneous or fraudulent transactions.
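As a simple, hypothetical illustration of how such a digit-pattern screen might look, the following sketch compares the leading-digit frequencies of a handful of invented transaction amounts with the proportions expected under Benford's Law, log10(1 + 1/d) for digit d. The amounts, the field layout, and the flagging threshold are assumptions made for illustration and are not taken from any particular audit package.

```python
import math
from collections import Counter

def benford_first_digit_test(amounts):
    """Compare observed leading-digit frequencies against Benford's Law.

    Benford's Law expects digit d (1-9) to lead with probability log10(1 + 1/d).
    Returns (digit, observed share, expected share) triples for inspection.
    """
    first_digits = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts if a]
    counts = Counter(first_digits)
    n = len(first_digits)
    report = []
    for d in range(1, 10):
        observed = counts.get(d, 0) / n
        expected = math.log10(1 + 1 / d)
        report.append((d, observed, expected))
    return report

# Hypothetical transaction amounts pulled from an audit data warehouse.
transactions = [1250.00, 982.10, 311.75, 47.20, 1890.00, 23.99, 760.40, 154.30]
for digit, obs, exp in benford_first_digit_test(transactions):
    flag = "CHECK" if abs(obs - exp) > 0.10 else ""   # illustrative threshold only
    print(f"digit {digit}: observed {obs:.2f}, expected {exp:.2f} {flag}")
```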


More robust auditing tools, using more sophisticated data mining methods, are needed for mining large databases and to help auditors meet auditing requirements. According to auditing standards (Statement on Auditing Standards 99), auditors should incorporate unpredictability in the procedures performed (Ramos, 2003). Otherwise, perpetrators of frauds may become familiar with common audit procedures and conceal fraud by placing it in areas that auditors are least likely to look.

MAIN THRUST

It is critical for the modern auditor to understand the nature of CA and the capabilities of different data mining methods in designing an effective audit approach. Toward this end, this paper addresses these issues through a discussion of CA and a comparison of data mining methods, and we also provide a potential CA and data mining architecture. Although it is beyond the scope of this paper to provide an in-depth technical discussion of the details of the proposed architecture, we hope this stimulates technical research in this area and provides a starting point for CA system designers.

Continuous Auditing (CA)

Audits involve three major components: audit planning, conducting the audit, and reporting on audit findings (Konrath, 2002). The CA approach can be used for the audit planning and conducting the audit phases. According to Pushkin (2003), CA is useful for the strategic audit planning component that addresses "the strategic risk of reaching an inappropriate conclusion by not integrating essential activities into the audit plan" (p. 27). Strategic information may be captured from the entity's Intranets and from the global Internet using intelligent agents (Pushkin, 2003, p. 28).

CA is also useful for performing the audit, or what Pushkin (2003) refers to as the tactical component of the audit. "Tactical activities are most often directed at obtaining transactional evidence as a basis on which to assess the validity of assertions embodied in account balances" (p. 27). For example, CA is useful in testing that entities comply with financial performance measures of debt covenants in loan agreements (Woodroof & Searcy, 2001).

CA requires prompt responses to high-risk transactions and the ability to identify financial trends from large volumes of data. Intelligent agents can be used to promptly identify and respond to erroneous transactions. Understanding the capabilities of data mining methods in identifying financial trends is useful in selecting an appropriate data mining approach for CA.

Comparing Methods of Data Mining

Data mining is a process by which one discovers previously unknown information from large sets of data. Data mining algorithms can be divided into three major groups: (1) mathematical-based methods, (2) logic-based methods, and (3) distance-based methods (Weiss & Indurkhya, 1998). Common examples from each major category are described below.

Mathematical-Based Methods

Neural Networks

An Artificial Neural Network (ANN) is a network of nodes modeled after a neuron or neural circuit. The neural network mimics the processing of the human brain. In a neural network, neurons are grouped into layers or slabs (Lam, 2004). An input layer consists of neurons that receive input from the external environment. The output layer communicates the results of the ANN to the user or external environment. The ANN may also consist of a number of intermediate (hidden) layers. The processing of an ANN starts with inputs being received by the input layer; upon being excited, the neurons fire and produce outputs to the other layers of the system. The nodes or neurons are interconnected, and a neuron will send signals, or fire, only if the signals it receives exceed a certain threshold value. The value of a node is a non-linear (usually logistic) function of the weighted sum of the values sent to it by the nodes that are connected to it (Spangler, May, & Vargas, 1999).

Programming a neural network to process a set of inputs and produce the desired output is a matter of designing the interactions among the neurons. This process consists of the following: (1) arranging neurons in various layers, (2) deciding the connections among neurons of different layers, as well as among the neurons within a layer, (3) determining the way a neuron receives input and produces output (e.g., the type of function used), and (4) determining the strength of connections within the network by selecting and using a training data set so that the ANN can determine the appropriate values of the connection weights (Lam, 2004).
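The following minimal sketch illustrates steps (1) through (4) for a tiny, already-trained feed-forward network with one hidden layer and a logistic activation. The layer sizes, connection weights, and input indicators are invented for illustration and are not drawn from the audit literature cited here.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases):
    """One layer: each node applies the logistic function to its weighted input sum."""
    return [logistic(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]

# Hypothetical, already-trained weights for a tiny 3-2-1 network that scores
# the fraud risk of a transaction from three normalised financial indicators.
hidden_w = [[0.8, -0.4, 0.3], [-0.6, 0.9, 0.2]]   # 2 hidden nodes x 3 inputs
hidden_b = [0.1, -0.2]
output_w = [[1.2, -0.7]]                          # 1 output node x 2 hidden values
output_b = [0.05]

indicators = [0.35, 0.80, 0.10]                   # e.g., ratio deviations scaled to [0, 1]
hidden = layer_output(indicators, hidden_w, hidden_b)
risk_score = layer_output(hidden, output_w, output_b)[0]
print(f"estimated fraud-risk score: {risk_score:.3f}")
```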
Prior neural network research has addressed the audit areas of risk assessment, errors and fraud, going-concern audit opinions, financial distress, and bankruptcy prediction (Lin, Hwang, & Becker, 2003). Research has identified successful uses of ANN; however, there are still many other issues to address. For instance, ANN is effective for analytical review procedures, although there is no clear guideline for the performance measures to use for this analysis (Koskivaara, 2004).

Neural network research has found differences between the patterns of quantitative and qualitative (textual) information from financial statements (Back, Toivenen, Vanharanta, & Visa, 2001). Auditors would benefit from advancement in identifying appropriate financial inputs for ANN and in developing models that integrate quantitative and textual information.

Discriminant Analysis

Linear discriminant analysis is similar to multiple regression analysis. The major difference is that discriminant analysis uses a nominal or ordinal dependent variable, whereas regression uses a continuous dependent variable. Discriminant analysis constructs classification functions of the form

Y = c1 a1 + c2 a2 + ... + cm am + c0

where ci is the coefficient for the case attribute ai and c0 is a constant, for each of the n categories (Spangler et al., 1999). One field in the set of data is assigned as the target (response or class, Y) variable, and the algorithm produces models for the target variable as a function of the other fields in the data set, which are identified as explanatory variables (features, cases) (Apte et al., 2003).
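A hypothetical sketch of how such classification functions can be applied once their coefficients have been estimated: each category's function is scored for a case, and the category with the highest score is chosen. The coefficients, attribute values, and category labels below are invented for illustration.

```python
# Illustrative only: applying already-estimated classification functions of the
# form Y = c0 + c1*a1 + ... + cm*am. In practice the coefficients would be
# estimated from a training sample rather than written by hand.

def classify(attributes, functions):
    """Score each category's classification function and pick the highest."""
    scores = {}
    for category, (constant, coefficients) in functions.items():
        scores[category] = constant + sum(c * a for c, a in zip(coefficients, attributes))
    return max(scores, key=scores.get), scores

functions = {
    "low risk":  (0.4, [1.2, -0.3, 0.8]),
    "high risk": (-0.9, [0.5, 2.1, -0.2]),
}
case = [0.6, 1.4, 0.2]    # attribute values a1..a3 for one audit case
label, scores = classify(case, functions)
print(label, scores)
```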
Distance-Based Methods

Clustering

Clustering is a data mining approach that partitions large sets of data objects into homogeneous groups. Clustering uses unsupervised classification, where little manual pre-screening of data is necessary; thus it is useful in situations where no predefined knowledge of categories is available (Maltseva et al., 2000). Generally, an object is described by a set of m attributes (features) and can be represented by an m-dimensional vector, called a pattern. Therefore, if a dataset contains n objects, it can be viewed as an n by m pattern matrix. Rows of the matrix correspond to patterns and columns correspond to features. Mathematical techniques are then employed to identify clusters, which can be described as regions of this space containing points that are close to each other (Maltseva et al., 2000).
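To make the pattern-matrix view concrete, the following is a compact k-means-style sketch that partitions a small, invented n-by-m pattern matrix into groups of nearby points. Real clustering methods differ in how they seed, score, and terminate, so this is only one simple instance of the idea.

```python
import math
import random

def kmeans(patterns, k, iterations=20, seed=0):
    """Partition an n-by-m pattern matrix into k groups of nearby points."""
    random.seed(seed)
    centroids = random.sample(patterns, k)
    assignment = [0] * len(patterns)
    for _ in range(iterations):
        # Assign each pattern to its closest centroid (Euclidean distance).
        for i, p in enumerate(patterns):
            assignment[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Recompute each centroid as the mean of its assigned patterns.
        for c in range(k):
            members = [p for i, p in enumerate(patterns) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical 6 x 2 pattern matrix (six objects, two features each).
patterns = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
print(kmeans(patterns, k=2))
```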

Logic-Based Methods

Tree and Rule Induction

The Tree and Rule Induction approach to data mining uses an algorithm (e.g., the ID3 algorithm) to induce a decision tree from a file of individual cases, where each case is described by a set of attributes and the class to which the case belongs (Spangler et al., 1999). Once the decision tree is generated, it is a simple matter to convert from the decision tree format to an equivalent rule-based view. A big advantage of Tree and Rule Induction is the ability to communicate and understand the information derived from the data mining. Tree and Rule Induction research has addressed the audit areas of bankruptcy, bank failure, and credit risk (Spangler et al., 1999).

A comparison of data mining approaches based on representational scheme and learning approach is presented in Table 1. Learning approaches are classified as either supervised, where induction rules are used in assigning observations to predefined classifications, or unsupervised, where both the categories and the classification rules are determined by the data mining method (Spangler et al., 1999). Representational scheme relates to how the data mining methods model relationships.

Table 1. Data mining approaches compared (see Endnote 1)

                         Decision Tree/      Neural            Discriminant      Clustering
                         Rule Induction      Networks          Analysis
  Method Type            Logic-based         Math-based        Math-based        Distance-based
  Learning Approach      Supervised          Supervised        Supervised        Unsupervised
  Representational       Decision nodes,     Functional        Functional        Functional
  Scheme                 rules               relationship      relationship      relationship
                                             between           between           between
                                             attributes        attributes        attributes
                                             and classes       and classes       and classes

Criteria and Issues in Selecting a Data Mining Method

Discriminant analysis and statistical techniques have enjoyed a long run of popularity in data mining. However, because of the increasing availability of large volumes of data, a trend toward increased use of search-based, non-parametric modeling techniques such as neural networks and rule-based or decision trees has been taking place (Apte et al., 2003). Ultimately, however, when evaluating and deciding on a data mining method, users must weigh and often compromise between predictive accuracy, level of understandability, and computational demand (Apte et al., 2003).


Indeed, when one examines issues such as scalability (i.e., how well the data mining method works regardless of the size of the data set), accuracy (i.e., how well the information extracted remains stable and constant beyond the boundaries of the data from which it was extracted, or trained), robustness (i.e., how well the data mining method works in a wide variety of domains), and interpretability (i.e., how well the data mining method provides understandable information and insight of value to the end user), it becomes clear that no data mining method currently excels in all areas (Apte et al., 2003).

The specific demands of CA raise a number of issues that relate to the selection of data mining methods and their appropriate application in this domain. The next section explores these issues in greater detail.

The CA Challenge: Difficulties and Issues

The demand for more timely communication of auditing information to business stakeholders requires auditors to find new ways to monitor, assemble, and analyze audit information. A number of continuous audit tools and techniques will need to be developed to enable auditors to assess risk, evaluate internal controls, extract data, download data for analytical review, and identify exceptions and unusual transactions.

One of the challenges in developing a CA capability is developing a technology infrastructure for extracting data with differing file formats and record structures from heterogeneous platforms (Rezaee et al., 2002). Another consideration in developing a technology infrastructure to gather transactions and other activity for auditing is the degree of automation employed. The degree of automation can vary depending on the audit system design, but at least three possibilities are:

1. Embedded audit modules, where audit programs are tightly integrated with application source code to constantly monitor and report on exceptional conditions (limited use of this approach due to potential adverse effects on entity processing; see the Background section for further discussion);
2. The automatic capture and transformation of data and storage in data warehouses, but still requiring auditor involvement in running queries to isolate exceptions and detect unusual patterns (automatic capture but with auditor intervention);
3. The automatic capture and transformation of data and storage in data warehouses and the integration of intelligent agents to modify data capture routines and exception reporting and trend spotting via multiple data mining methods (automatic, modified capture and modified data mining for trend analysis); a toy sketch of this style of exception monitoring follows the list.
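The sketch below gives a deliberately simplified flavour of the third possibility: a rule-driven routine that scans captured transactions and reports exceptions. The transaction fields, rules, and thresholds are invented for illustration and do not describe any particular CAAT product.

```python
# Hypothetical exception-monitoring rules applied to transactions after they
# have been captured into the audit data warehouse.

transactions = [
    {"id": 1, "amount": 9800.0, "account": "4010", "approved_by": "A. Smith"},
    {"id": 2, "amount": 125000.0, "account": "4010", "approved_by": None},
    {"id": 3, "amount": 50.0, "account": "9999", "approved_by": "B. Jones"},
]

rules = [
    ("missing approval", lambda t: t["approved_by"] is None),
    ("amount above limit", lambda t: t["amount"] > 100000),
    ("suspense account", lambda t: t["account"] == "9999"),
]

def flag_exceptions(transactions, rules):
    """Return (transaction id, rule name) pairs for every rule a transaction violates."""
    return [(t["id"], name) for t in transactions for name, test in rules if test(t)]

for tid, reason in flag_exceptions(transactions, rules):
    print(f"transaction {tid}: {reason}")
```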
Finally, the appropriate selection of a data mining method is critical for determining unusual trends and spotting fraudulent behavior. Important considerations include: (1) the size of the data set, (2) the accuracy of the given data mining method in the particular domain, and (3) the interpretability of the data mining output information.

Figure 1. Continuous auditing and data mining architecture (see Endnote 2)

(Figure: entity systems, comprising databases, applications, and transactions with routines to capture, transform, and transmit transaction data, feed the auditor's systems through XBRL and XML; the auditor's systems comprise environmental monitoring, an intelligent audit evaluator, an audit data warehouse with data marts, and an intelligent agent serving as data mining manager.)


Regardless of the particular context, in the area of CA a prime consideration is the frequent and (automatic) alteration or change of the data mining technique and the appropriate selection and changing of the selected auditing information to be reviewed. These are important considerations because the audited parties should not be able to anticipate the audit tests and thus create erroneous transactions that might go undetected (Ramos, 2003).

A Potential Architecture

In order to handle the issues identified above, a proposed architecture for the CA process is presented in Figure 1. Two important features of the architecture are: (1) the use of data warehousing and data marts, and (2) the use of intelligent agents to periodically modify the data extraction and data mining methods. The architecture involves the transfer of data from the entity's systems to the auditor's systems through the use of XBRL and XML. The auditor's systems include environmental monitoring of political, economic, and technological factors to address strategic audit issues, and data mining of transactions to address tactical audit issues.
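As a rough, hypothetical sketch of the entity-to-auditor transfer, the following parses a small XML fragment of tagged transactions with Python's standard xml.etree.ElementTree module and loads it into an in-memory SQLite table standing in for the audit data warehouse. The tag names, columns, and values are invented, and a real XBRL instance document is considerably richer than this.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML fragment of the kind an entity system might transmit.
xml_data = """
<transactions>
  <transaction id="1"><date>2005-06-01</date><account>4010</account><amount>1250.00</amount></transaction>
  <transaction id="2"><date>2005-06-01</date><account>4020</account><amount>98000.00</amount></transaction>
</transactions>
"""

db = sqlite3.connect(":memory:")            # stand-in for the audit data warehouse
db.execute("CREATE TABLE fact_transaction (id INTEGER, date TEXT, account TEXT, amount REAL)")

for t in ET.fromstring(xml_data).findall("transaction"):
    db.execute(
        "INSERT INTO fact_transaction VALUES (?, ?, ?, ?)",
        (int(t.get("id")), t.findtext("date"), t.findtext("account"), float(t.findtext("amount"))),
    )

print(db.execute("SELECT account, SUM(amount) FROM fact_transaction GROUP BY account").fetchall())
```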
FUTURE TRENDS

In the future, auditing and information systems researchers will need to identify additional tools and data mining methods that are most appropriate for this new application area. Emerging innovations in Web technology and the general public's demand for reliable performance reporting are expected to spur demand for the use of data mining for CA.

Expected growth in data mining for CA will provide opportunities and issues for auditors and researchers in several areas. Advancement in the areas of data mining of textual information will be useful to CA. Recognizing patterns in qualitative information provides a broader perspective and richer understanding of the entity's internal and external situation (Back et al., 2001). Also, for risk-based auditing, ANN model builders will need qualitative and quantitative factors that capture political, economic, and technological factors, as well as balanced scorecard metrics that signal the extent to which an entity is achieving its strategic objectives (Lin et al., 2003, p. 230).

CONCLUSION

The use of data mining for continuous auditing is poised to improve the effectiveness of audits, and ultimately the reliability of performance reporting. Auditors need to understand the capabilities of different data mining approaches to ensure effective continuous auditing. Looking ahead, researchers will need to further develop data mining approaches and tools for the expanding needs of the audit profession.

REFERENCES

Apte, C.V., Hong, S.J., Natarajan, R., Pednault, E.P.D., Tipu, F.A., & Weiss, S.M. (2003). Data-intensive analytics for predictive modeling. IBM Journal of Research and Development, 47(1), 17-23.

Back, B., Toivenen, J., Vanharanta, H., & Visa, A. (2001). Comparing numerical data and text information from annual reports using self-organizing maps. International Journal of Accounting Information Systems, 2(2001), 249-269.

Bierstaker, J.L., Burnaby, P., & Hass, S. (2003). Recent changes in internal auditors' use of technology. Internal Auditing, 18(4), 39-45.

David, J.S., & Steinbart, P.J. (2000). Data warehousing and data mining: Opportunities for internal auditors. Altamonte Springs, FL: The Institute of Internal Auditors Research Foundation.

Kogan, A., Sudit, E.F., & Vasarhelyi, M.A. (2003). Continuous online auditing: An evolution (pp. 1-25). Unpublished workpaper.

Konrath, L.F. (2002). Auditing: A risk analysis approach. Cincinnati, OH: South-Western.

Koskivaara, E. (2004). Artificial neural networks in analytical procedures. Managerial Auditing Journal, 19(2), 191-223.

Lam, M. (2004). Neural network techniques for financial performance prediction: Integrating fundamental and technical analysis. Decision Support Systems, 37(4), 567-581.

Liang, D., Fengyi, L., & Wu, S. (2001). Electronically auditing EDP systems with the support of emerging information technologies. International Journal of Accounting Information Systems, 2, 130-147.

Lin, J.W., Hwang, M.I., & Becker, J.D. (2003). A fuzzy neural network for assessing the risk of fraudulent financial reporting. Managerial Auditing Journal, 18(8), 657-665.

Maltseva, E., Pizzuti, C., & Talia, D. (2000). Indirect knowledge discovery by using singular value decomposition. In Data Mining II. Southampton, UK: WIT Press.

Nigrini, M.J. (2002). Analysis of digits and number patterns. In J.C. Robertson (Ed.), Fraud examination for managers and auditors (pp. 495-518). Austin, TX: Atex Austin, Inc.

Pushkin, A.B. (2003). Comprehensive continuous auditing: The strategic component. Internal Auditing, 18(1), 26-33.

Ramos, M. (2003). Auditors' responsibility for fraud detection. Journal of Accountancy, 195(1), 28-36.

Rezaee, Z., Sharbatoghlie, A., Elam, R., & McMickle, P.L. (2002). Continuous auditing: Building automated auditing capability. Auditing: A Journal of Practice & Theory, 21(1), 147-163.

Spangler, W.E., May, J.H., & Vargas, L.G. (1999). Choosing data mining methods for multiple classification: Representational and performance measurement implications for decision support. Journal of Management Information Systems, 16(1), 37-62.

Weiss, S.M., & Indurkhya, N. (1998). Predictive data mining. San Francisco, CA: Morgan Kaufmann.

Woodroof, J., & Searcy, D. (2001). Continuous audit model development and implementation within a debt covenant compliance domain. International Journal of Accounting Information Systems, 2, 169-191.

KEY TERMS

Artificial Neural Network (ANN): A network of nodes modeled after a neuron or neural circuit. The neural network mimics the processing of the human brain.

Auditing: Systematic process of objectively obtaining and evaluating evidence of assertions about economic actions and events to ascertain the correspondence between those assertions and established criteria and communicating the results to interested parties.

Clustering: Data mining approach that partitions large sets of data objects into homogeneous groups.

Computerized Assisted Auditing Techniques (CAATs): Software applications that are used to improve the efficiency of an audit.

Continuous Auditing: Type of auditing which produces audit results simultaneously, or a short period of time after, the occurrence of relevant events.

Discriminant Analysis: Statistical methodology used for classification that is based on the general regression model and uses a nominal or ordinal dependent variable.

eXtensible Business Reporting Language (XBRL): Mark-up language that allows for the tagging of data and is designed for performance reporting. It is a variant of XML.

eXtensible Markup Language (XML): Mark-up language that allows for the tagging of data in order to add meaning to the data.

Tree and Rule Induction: Data mining approach that uses an algorithm (e.g., the ID3 algorithm) to induce a decision tree from a file of individual cases, where each case is described by a set of attributes and the class to which the case belongs.

ENDNOTES

1. Adapted from Spangler et al. (1999).
2. Adapted from Rezaee et al. (2002).

Data Driven vs. Metric Driven Data Warehouse Design
John M. Artz
The George Washington University, USA

INTRODUCTION

Although data warehousing theory and technology have been around for well over a decade, they may well be the next hot technologies. How can it be that a technology sleeps for so long and then begins to move rapidly to the foreground? This question can have several answers. Perhaps the technology had not yet caught up to the theory, or computer technology 10 years ago did not have the capacity to deliver what the theory promised. Perhaps the ideas and the products were just ahead of their time. All these answers are true to some extent. But the real answer, I believe, is that data warehousing is in the process of undergoing a radical theoretical and paradigmatic shift, and that shift will reposition data warehousing to meet future demands.

BACKGROUND

Just recently I started teaching a new course in data warehousing. I have only taught it a few times so far, but I have already noticed that there are two distinct and largely incompatible views of the nature of a data warehouse. A prospective student, who had several years of industry experience in data warehousing but little theoretical insight, came by my office one day to find out more about the course. "Are you an Inmonite or a Kimballite?" she inquired, reducing the possibilities to the core issues. "Well, I suppose if you put it that way," I replied, "I would have to classify myself as a Kimballite." William Inmon (2000, 2002) and Ralph Kimball (1996, 1998, 2000) are the two most widely recognized authors in data warehousing and represent two competing positions on the nature of a data warehouse.

The issue that this student was trying to get at was whether or not I viewed the dimensional data model as the core concept in data warehousing. I do, of course, but there is, I believe, a lot more to the emerging competition between these alternative views of data warehouse design. One of these views, which I call the data-driven view of data warehouse design, begins with existing organizational data. These data have more than likely been produced by existing transaction processing systems. They are cleansed and summarized and are used to gain greater insight into the functioning of the organization. The analysis that can be done is a function of the data that were collected in the transaction processing systems. This was, perhaps, the original view of data warehousing and, as will be shown, much of the current research in data warehousing assumes this view.

The competing view, which I call the metric-driven view of data warehouse design, begins by identifying key business processes that need to be measured and tracked over time in order for the organization to function more efficiently. A dimensional model is designed to facilitate that measurement over time, and data are collected to populate that dimensional model. If existing organizational data can be used to populate that dimensional model, so much the better. But if not, the data need to be acquired somehow. The metric-driven view of data warehouse design, as will be shown, is superior both theoretically and philosophically. In addition, it dramatically changes the research program in data warehousing. The metric-driven and data-driven approaches to data warehouse design have also been referred to, respectively, as metric pull versus data push (Artz, 2003).

MAIN THRUST

Data-Driven Design

The classic view of data warehousing sees the data warehouse as an extension of decision support systems. Again, in a classic view, decision support systems sit atop management information systems and use data extracted from management information and transaction processing systems to support decisions within the organization. This view can be thought of as a data-driven view of data warehousing, because the exploitations that can be done in the data warehouse are driven by the data made available in the underlying operational information systems.

This data-driven model has several advantages. First, it is much more concrete. The data in the data warehouse are defined as an extension of existing data. Second, it is evolutionary. The data warehouse can be populated and exploited as new uses are found for existing data.
systems. They are cleansed and summarized and are nally, there is no question that summary data can be


Finally, there is no question that summary data can be derived, because the summaries are based upon existing data. However, it is not without flaws. First, the integration of multiple data sources may be difficult. These operational data sources may have been developed independently, and the semantics may not agree. It is difficult to resolve these conflicting semantics without a known end state to aim for. But the more damaging problem is epistemological. The summary data derived from the operational systems represent something, but the exact nature of that something may not be clear. Consequently, the meaning of the information that describes that something may also be unclear. This is related to the semantic disintegrity problem in relational databases. A user asks a question of the database and gets an answer, but it is not the answer to the question that the user asked. When the somethings that are represented in the database are not fully understood, then answers derived from the data warehouse are likely to be applied incorrectly to known somethings. Unfortunately, this also undermines data mining. Data mining helps people find hidden relationships in the data. But if the data do not represent something of interest in the world, then those relationships do not represent anything interesting, either.

Research problems in data warehousing currently reflect this data-driven view. Current research in data warehousing focuses on a) data extraction and integration, b) data aggregation and production of summary sets, c) query optimization, and d) update propagation (Jarke, Lenzerini, Vassiliou, & Vassiliadis, 2000). All these issues address the production of summary data based on operational data stores.

A Poverty of Epistemology

The primary flaw in data-driven data warehouse design is that it is based on an impoverished epistemology. Epistemology is that branch of philosophy concerned with theories of knowledge and the criteria for valid knowledge (Fetzer & Almeder, 1993; Palmer, 2001). That is to say, when you derive information from a data warehouse based on the data-driven approach, what does that information mean? How does it relate to the work of the organization? To see this issue, consider the following example. If I asked each student in a class of 30 for their ages, then summed those ages and divided by 30, I should have the average age of the class, assuming that everyone reported their age accurately. If I were to generate a list of 30 random numbers between 20 and 40 and took the average, that average would be the average of the numbers in that data set and would have nothing to do with the average age of the class. In between those two extremes are any number of options. I could guess the ages of students based on their looks. I could ask members of the class to guess the ages of other members. I could rank the students by age and then use the ranking number instead of age. The point is that each of these attempts is somewhere between the two extremes, and the validity of my data improves as I move closer to the first extreme. That is, I have measurements of a specific phenomenon, and those measurements are likely to represent that phenomenon faithfully. The epistemological problem in data-driven data warehouse design is that data is collected for one purpose and then used for another purpose. The strongest validity claim that can be made is that any information derived from this data is true about the data set, but its connection to the organization is tenuous. This not only creates problems with the data warehouse, but all subsequent data-mining discoveries are suspect also.

METRIC-DRIVEN DESIGN

The metric-driven approach to data warehouse design begins by defining key business processes that need to be measured and tracked in order to maintain or improve the efficiency and productivity of the organization. After these key business processes are defined, they are modeled in a dimensional data model, and then further analysis is done to determine how the dimensional model will be populated. Hopefully, much of the data can be derived from operational data stores, but the metrics are the driver, not the availability of data from operational data stores.

A relational database models the entities or objects of interest to an organization (Teorey, 1999). These objects of interest may include customers, products, employees, and the like. The entity model represents these things and the relationships between them. As occurrences of these entities enter or leave the organization, that addition or deletion is reflected in the database. As these entities change in state, somehow, those state changes are also reflected in the database. So, theoretically, at any point in time, the database faithfully represents the state of the organization. Queries can be submitted to the database, and the answers to those queries should, indeed, be the answers to those questions if they were asked and answered with respect to the organization.


A data warehouse, on the other hand, models the business processes in an organization to measure and track those processes over time. Processes may include sales, productivity, the effectiveness of promotions, and the like. The dimensional model contains facts that represent measurements over time of a key business process. It also contains dimensions that are attributes of these facts. The fact table can be thought of as the dependent variable in a statistical model, and the dimensions can be thought of as the independent variables. So the data warehouse becomes a longitudinal dataset tracking key business processes.
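A minimal sketch of this longitudinal, fact-and-dimension arrangement, using an in-memory SQLite database; the table names, columns, and rows are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, units INTEGER, revenue REAL);

INSERT INTO dim_date    VALUES (1, '2005-01', 2005), (2, '2005-02', 2005);
INSERT INTO dim_product VALUES (10, 'books'), (20, 'music');
INSERT INTO fact_sales  VALUES (1, 10, 120, 2400.0), (1, 20, 80, 960.0),
                               (2, 10, 150, 3000.0), (2, 20, 60, 720.0);
""")

# The fact (revenue) is analysed against the dimensions (month, category),
# much as a dependent variable is analysed against independent variables.
query = """
SELECT d.month, p.category, SUM(f.revenue)
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month, p.category
"""
for row in db.execute(query):
    print(row)
```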
abstract theory of aggregation so that the data aggrega-
A Parallel with Pre-Relational Days tion problem can be understood and advanced theoreti-
cally? and, finally, f) can we develop an abstract data
You can see certain parallels between the state of data language so that aggregations can be expressed math-
warehousing and the state of database prior to the rela- ematically by the user and realized by the machine?
tional model. The relational model was introduced in Initially, both data-driven and metric-driven de-
1970 by Codd but was not realized in a commercial signs appear to be legitimate competing paradigms for
product until the early 1980s (Date, 2004). At that time, data warehousing. The epistemological flaw in the data-
a large number of nonrelational database management driven approach is a little difficult to grasp, and the
systems existed. All these products handled data in dif- distinction that information derived from a data-
ferent ways, because they were software products devel- driven model is information about the data set, but
oped to handle the problem of storing and retrieving data. information derived from a metric-driven model is
They were not developed as implementations of a theo- information about the organization may also be a bit
retical model of data. When the first relational product elusive. However, the implications are enormous. The
came out, the world of databases changed almost over- data-driven model has little future in that it is founded
night. Every nonrelational product attempted, unsuc- on a model of data exploitation rather than a model of
cessfully, to claim that it was really a relational product data. The metric-driven model, on the other hand, is
(Codd, 1985). But no one believed the claims, and the likely to have some major impacts and implications.
nonrelational products lost their market share almost
immediately.
Similarly, a wide variety of data warehousing prod- FUTURE TRENDS
ucts are on the market today. Some are based on the
dimensional model, and some are not. The dimensional The Impact on White-Collar Work
model provides a basis for an underlying theory of data
that tracks processes over time rather than the current The data-driven view of data warehousing limits the
state of entities. Admittedly, this model of data needs future of data warehousing to the possibilities inherent
quite a bit of work, but the relational model did not come in summarizing large collections of old data without a
into dominance until it was coupled with entity theory, so specific purpose in mind. The metric-driven view of
the parallel still holds. We may never have an announce- data warehousing opens up vast new possibilities for
ment in data warehousing as dramatic as Codds paper in improving the efficiency and productivity of an organi-
relational theory. It is more likely that a theory of tem- zation by tracking the performance of key business
poral dimensional data will accumulate over time. How- processes. The introduction of quality management
ever, in order for data warehousing to become a major procedures in manufacturing a few decades ago dra-
force in the world of databases, an underlying theory of matically improved the efficiency and productivity of
data is needed and will eventually be developed. manufacturing processes, but such improvements have
not occurred in white-collar work.
The Implications for Research The reason that we have not seen such an improve-
ment in white-collar work is that we have not had
The implications for research in data warehousing are metrics to track the productivity of white-collar work-
rather profound. Current research focuses on issues ers. And even if we did have the metrics, we did not have
such as data extraction and integration, data aggregation a reasonable way to collect them and track them over
and summary sets, and query optimization and update time. The identification of measurable key business
propagation. All these problems are applied problems in processes and the modeling of those processes in a data
software development and do not advance our under- warehouse provides the opportunity to perform quality
standing of the theory of data. management and process improvement on white-collar
But a metric-driven approach to data warehouse de- work.
sign introduces some problems whose resolution can Subjecting white-collar work to the same rigorous
make a lasting contribution to data theory. Research definition as blue-collar work may seem daunting, and
problems in a metric-driven data warehouse include a) indeed that level of definition and specification will
How do we identify key business processes? b) How do not come easily. So what would motivate a business to


So what would motivate a business to do this? The answer is simple: Businesses will have to do this when the competitors in their industry do it. Whoever does this first will achieve such productivity gains that competitors will have to follow suit in order to compete. In the early 1970s, corporations were not revamping their internal procedures because computerized accounting systems were fun. They were revamping their internal procedures because they could not protect themselves from their competitors without the information for decision making and organizational control provided by their accounting information systems. A similar phenomenon is likely to drive data warehousing.

Dimensional Algebras

The relational model introduced Structured Query Language (SQL), an entirely new data language that allowed nontechnical people to access data in a database. SQL also provided a means of thinking about record selection and limited aggregation. Dimensional models can be exploited by a dimensional query language such as MDX (Spofford, 2001), but much greater advances are possible.

Research in data warehousing will likely yield some sort of dimensional algebra that will provide, at the same time, a mathematical means of describing data aggregation and correlation and a set of concepts for thinking about aggregation and correlation. To see how this could happen, think about how the relational model led us to think about the organization as a collection of entity types, or how statistical software made the concepts of correlation and regression much more concrete.

A Unified Theory Of Data

In the organization today, the database administrator and the statistician seem worlds apart. Of course, the statistician may have to extract some data from a relational database in order to do his or her analysis. And the statistician may engage in limited data modeling in designing a data set for analysis by using a statistical tool. The database administrator, on the other hand, will spend most of his or her time in designing, populating, and maintaining a database. A limited amount of time may be devoted to statistical thinking when counts, sums, or averages are derived from the database. But these two individuals will largely view themselves as participating in greatly differing disciplines.

With dimensional modeling, the gap between database theory and statistics begins to close. In dimensional modeling we have to begin thinking in terms of construct validity and temporal data. We need to think about correlations between dependent and independent variables. We begin to realize that the choice of data types (e.g., interval or ratio) will affect the types of analysis we can do on the data and will hence potentially limit the queries. So the database designer has to address concerns that have traditionally been the domain of the statistician. Similarly, the statistician cannot afford the luxury of constructing a data set for a single purpose or a single type of analysis. The data set must be rich enough to allow the statistician to find relationships that may not have been considered when the data set was being constructed. Variables must be included that may potentially have impact, may have impact at some times but not others, or may have impact in conjunction with other variables. So the statistician has to address concerns that have traditionally been the domain of the database designer.

What this points to is the fact that database design and statistical exploitation are just different ends of the same problem. After these two ends have been connected by data warehouse technology, a single theory of data must be developed to address the entire problem. This unified theory of data would include entity theory and measurement theory at one end and statistical exploitation at the other. The middle ground of this theory will show how decisions made in database design will affect the potential exploitations, so intelligent design decisions can be made that will allow full exploitation of the data to serve the organization's needs to model itself in data.

CONCLUSION

Data warehousing is undergoing a theoretical shift from a data-driven model to a metric-driven model. The metric-driven model rests on a much firmer epistemological foundation and promises a much richer and more productive future for data warehousing. It is easy to haze over the differences or significance between these two approaches today. The purpose of this article was to show the potentially dramatic, if somewhat speculative, implications of the metric-driven approach.

REFERENCES

Artz, J. (2003). Data push versus metric pull: Competing paradigms for data warehouse design and their implications. In M. Khosrow-Pour (Ed.), Information technology and organizations: Trends, issues, challenges and solutions. Hershey, PA: Idea Group Publishing.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.

Codd, E. F. (1985, October 14). Is your DBMS really relational? Computerworld.

Date, C. J. (2004). An introduction to database systems (8th ed.). Addison-Wesley.

Fetzer, J., & Almeder, F. (1993). Glossary of epistemology/philosophy of science. Paragon House.

Inmon, W. (2002). Building the data warehouse. Wiley.

Inmon, W. (2000). Exploration warehousing: Turning business information into business opportunity. Wiley.

Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (2000). Fundamentals of data warehouses. Springer-Verlag.

Kimball, R. (1996). The data warehouse toolkit. Wiley.

Kimball, R. (1998). The data warehouse lifecycle toolkit. Wiley.

Kimball, R., & Merz, R. (2000). The data webhouse toolkit. Wiley.

Palmer, D. (2001). Does the center hold? An introduction to Western philosophy. McGraw-Hill.

Spofford, G. (2001). MDX Solutions. Wiley.

Teorey, T. (1999). Database modeling & design. Morgan Kaufmann.

Thomsen, E. (2002). OLAP solutions. Wiley.

KEY TERMS

Data-Driven Design: A data warehouse design that begins with existing historical data and attempts to derive useful information regarding trends in the organization.

Data Warehouse: A repository of time-oriented organizational data used to track the performance of key business processes.

Dimensional Data Model: A data model that represents measurements of a process and the independent variables that may affect that process.

Epistemology: A branch of philosophy that attempts to determine the criteria for valid knowledge.

Key Business Process: A business process that can be clearly defined, is measurable, and is worthy of improvement.

Metric-Driven Design: A data warehouse design that begins with the definition of metrics that can be used to track key business processes.

Relational Data Model: A data model that represents entities in the form of data tables.


Data Management in Three-Dimensional Structures
Xiong Wang
California State University at Fullerton, USA

INTRODUCTION

Data management, in its general sense, refers to activities that involve the acquisition, storage, and retrieval of data. Traditionally, information retrieval is facilitated through queries, such as exact search, nearest neighbor search, range search, etc. In the last decade, data mining has emerged as one of the most dynamic fields in the frontier of data management. Data mining refers to the process of extracting useful knowledge from the data. Popular data mining techniques include association rule discovery, frequent pattern discovery, classification, and clustering. In this chapter, we discuss data management in a specific type of data, i.e., three-dimensional structures. While research on text and multimedia data management has attracted considerable attention and substantial progress has been made, data management in three-dimensional structures is still in its infancy (Castelli & Bergman, 2001; Paquet & Rioux, 1999). Data management in 3D structures raises several interesting problems:

1. Similarity search
2. Pattern discovery
3. Classification
4. Clustering

Given a database of 3D structures and a query 3D structure, similarity search looks for those structures in the database that match the query structure within a range of tolerable errors. The similarity could be defined in two different measurements. The first measurement compares the data structure with the query structure in their entirety, i.e., a point-to-point match. We will call this aggregate similarity search. The second measurement compares only the contours or shapes of the data structure with that of the query structure. This is generally referred to as shape-based similarity search. The range of tolerable errors specifies how close the match should be when the data structure is aligned with the query structure. Pattern discovery is concerned with similar substructures that occur in multiple structures. Classification and clustering, when applied to these domains, attempt to group together 3D structures with similar shapes or containing similar patterns.

BACKGROUND

Three-dimensional structures can be used to describe data in different domains. In biology and chemistry, for example, a molecule is represented as a 3D structure with connections. Each point is the center of an atom and the connections are bonds between atoms. In computer-aided design, an object is specified as a set of 3D vectors that describes the shape of the object (Veltkamp, 2001; Suzuki & Sugimoto, 2003). In computer vision, the shape of a 3D object can be captured by X-ray or ultrasonic scanning devices. The result is a set of 3D points (Hilaga, Shinagawa, Kohmura, & Kunii, 2001). In medical imaging, 3D images of tissues or tumors can be collected using magnetic resonance imaging or computer tomography (Akutsu, Arakawa, & Murase, 2002). With advances in the Internet, scanning devices, and storage, the World Wide Web is becoming a huge reservoir of all kinds of data. The 3D models available over the Internet have dramatically increased in the last two decades. Similarity search is a highly desirable technique in all these domains. Classification and clustering of biological data or chemical compounds have special significance. For example, traditionally proteins are classified into families according to their specific functions. However, recently, many approaches have been proposed to classify proteins according to their structures. Some of these approaches achieve very high accuracy when compared with their biological counterparts. Classification and clustering can also help build index structures in 3D model retrieval to speed up similarity search.

Currently, there is not a universal model or framework for the representation, storage, and retrieval of three-dimensional structures. Most of these data are stored in plain text files in some specific format. The format is different from application to application. Likewise, the existing techniques for information retrieval and data mining in three-dimensional structures take root in the areas of application. Two main areas of application are computer vision and scientific data mining, where computation-intensive techniques have been developed and are still in demand. We focus on data management in these two areas.


ADVANCES IN COMPUTER VISION

Shape-based recognition of 3D objects is a core problem in computer vision and has been studied for decades. There are roughly three categories of approaches: volume-based, feature-based, and interactive. For example, Keim (1999) proposed a geometric-based similarity search tree to deal with 3D volume-based search. He suggested using voxels to approximate the 3D objects. For each 3D object, a Minimum Surrounding Volume (MSV) and a Maximum Including Volume (MIV) are constructed using voxels. Similar objects are clustered in data pages, and the MSV and MIV approximations of the objects are stored in the directory pages. Another interesting scheme uses superellipsoids to approximate the shape of the volume. Superellipsoids are similar to ellipsoids except that the terms in the definition are raised to parameterized exponents, which are not necessarily integers. Indexing the superellipsoids is very difficult due to their different shapes and sizes.

Feature-based approaches have been developed extensively in the literature. In (Belongie, Malik, & Puzicha, 2001, 2002), Belongie and co-authors designed a descriptor, called the shape context, to characterize the distribution of the structure. For each point pi in the structure, the set of 3D vectors originating from pi to every other point in the structure is collected as the shape context. A histogram is constructed based on comparison between the shape context of the query shape and that of the data shape. Osada, Funkhouser, Chazelle, and Dobkin (2001) suggested using a similar descriptor, called the shape distribution. The shape distribution is a set of values that are calculated by a shape function, such as the Euclidean distance between two randomly selected points on the surface of a 3D object. A histogram is calculated based on the shape distributions of the two shapes under consideration. Kriegel et al. introduced a 3D shape similarity model that decomposes the 3D model into concentric shells and sectors around the center point (Ankerst, Kastenmüller, Kriegel, & Seidl, 1999). The histograms are determined by counting the number of points within each cell. The similarity is calculated using a quadratic form distance function. Korn, Sidiropoulos, Faloutsos, Siegel, and Protopapas (1998) proposed another descriptor, called the size distribution, that is similar to the pattern spectrum. They introduced a multi-scale distance function based on mathematical morphology. The distance function is integrated into the GEMINI framework to prune the search space. GEMINI is an index structure for high-dimensional feature spaces. Bespalov, Shokoufandeh, Regli, and Sun (2003) used Singular Value Decomposition to find suitable feature vectors. A data-level shape descriptor, called the spin image, was used in (Johnson & Hebert, 1999) to match surfaces represented as surface meshes. The system is capable of recognizing multiple objects in 3D scenes that contain clutter and occlusion. Saupe and Vranic (2001) suggested using rays cast from the center of mass of the object as feature vectors. The representation is compared using spherical harmonics and moments. All these descriptors are very high-dimensional, and indexing them is a well-known difficult problem, due to the curse of dimensionality. Furthermore, the approaches are not suitable for pattern discovery.
mum Including Volume (MIV) are constructed using voxels. Tal, & Ar, 2001). The algorithm sets the center of a
Similar objects are clustered in data pages and the MSV structure to the origin and calculates the normalized
and MIV approximations of the objects are stored in the moment value of each point. The normalized moment
directory pages. Another interesting scheme uses values are used to approximate the structure. Two struc-
superellipsoids to approximate the shape of the volume. tures are compared according to a weighted Euclidean
Superellipsoids are similar to ellipsoids except the terms distance. The weights are adapted based on the feedback
in the definition are raised to parameterized exponents from the user, using Singular Value Decomposition. Chui
which are not necessarily integers. Indexing the and Rangarajan (2000) developed an iterative optimiza-
superellipsoids is very difficult due to their different tion algorithm to estimate point correspondences and
shapes and sizes. image transformations, such as affine or thin plate splines.
Feature-based approaches have been developed ex- The cost function is the sum of Euclidean distances
tensively in the literature. In (Belongie, Malik, & Puzicha, between the transformed query shape and the trans-
2001, 2002), Belongie and co-authors designed a descrip- formed data shape. The distance function needs an initial
tor, called the shape context, to characterize the distribu- correspondence between the two structures. Interactive
tion of the structure. For each point pi in the structure, the refinement was also used in a medical image database
set of 3D vectors originating from pi to every other point system developed by Lehmann et al. (2004). A survey of
in the structure is collected as the shape context. A shape matching in computer vision can be found in
histogram is constructed based on comparison between (Veltkamp & Hagedoorn, 1999).
the shape context of the query shape and that of the data Shape-based similarity search in a large database of
shape. Osada, Funkhouser, Chazelle and Dobkin (2001) 3D objects is much more challenging and is still a young
suggested using a similar descriptor, called the shape research area. The most recent achievements are a search
distribution. The shape distribution is a set of values that engine developed at Princeton University (Funkhouser,
are calculated by a shape function, such as the Euclidean et al., 2003) and a search system 3DESS for 3D engineering
distance between two randomly selected points on the shapes built at Purdue University (Lou, Prabhakar, &
surface of a 3D object. A histogram is calculated based on Ramani, 2004). These techniques were concentrated on
the shape distributions of the two shapes under consid- computer vision and they did not compare the 3D models
eration. Kriegel et al. introduced a 3D shape similarity point-to-point. Many of them only search for 3D objects
model that decomposes the 3D model into concentric that look similar. The effectiveness of such techniques
shells and sectors around the center point (Ankerst, often depends on subjective perception.
Kastenmller, Kriegel, & Seidl, 1999). The histograms are
determined by counting the number of points within each
cell. The similarity is calculated using a quadratic form EMERGING TECHNIQUES IN
distance function. Korn, Sidiropoulos, Faloutsos, Siegel SCIENTIFIC DATA MINING
and Protopapas (1998) proposed another descriptor, called
size distribution that is similar to the pattern spectrum. In structural biology, detecting similarity in protein struc-
They introduced a multi-scale distance function based on tures has been a useful tool for discovering evolutionary
mathematical morphology. The distance function is inte- relationships, analyzing functional similarity, and pre-
grated to the GEMINI framework to prune the search dicting protein structures. The differences in proteins or
space. GEMINI is an index structure for high dimensional chemical compounds are very subtle. To discover func-
feature spaces. Bespalov, Shokoufandeh, Regli, & Sun tionally related proteins or chemical compounds, in many
(2003) used Singular Value Decomposition to find suitable cases we have to match them atom-to-atom. Aggregate
feature vectors. A data level shape descriptor, called the similarity search is known as an NP complete problem in
spin image, was used in (Johnson & Hebert, 1999) to match combinatorial pattern matching. Even though the signifi-
surfaces represented as surface meshes. The system is cance of the problem surfaced with applications to

229

TEAM LinG
Data Management in Three-Dimensional Structures

bioinformatics, few practical approaches have been devel- domain expert can choose an optimal value according to
oped. The latest results are a performance study based on the context. A new index structure, B+ Tree, was intro-
2D structures that was conducted by Gavrilov, Indyk, duced by Wang in 2002 to overcome these weaknesses
Motwani and Venkatasubramanian at Stanford (1999) and (Wang, 2002). The B+ Trees preserve precise informa-
a novel algorithm that was developed by Wang and col- tion of the data, allow pattern discovery with variable
laborators in (Wang & Wang, 2000). The algorithm discov- ranges of tolerable errors, and remove false matches
ers patterns in 3D data with no assumptions of edges. The totally. In an attempt to compare the shapes of 3D
research group headed by Karypis at University of Minne- structures, Wang also invented a new notion called the
sota studies pattern discovery in graphs, with vertices and a-surface to capture the surface of a 3D structure in
edges (Kuramochi & Karypis, 2002). variable details (Wang, 2001).
Pattern discovery is another highly desirable tech-
nique in these domains. A motif is a substructure in
proteins that has specific geometric arrangement and, in FUTURE TRENDS
many cases, is associated with a particular function, such
as DNA binding. Active sites are another type of patterns A universal model or framework for the representation,
in protein structures. They play an important role through storage, and retrieval of three-dimensional structures is
protein-protein and protein-ligand interaction, i.e. the bind- highly desirable. However, due to the different nature
ing process. In drug design, scientists try to determine the and applications in the two areas, a model or framework
binding sites of the target molecules and seek inhibitors that fits both is not feasible. In computer vision and 3D
that can be bound to it. For example in determining the model retrieval, the number of points in each object is
structure of HIV protease and looking for effective inhibi- huge and a single point does not impact perception
tors, over 120 of these structure determinations have been substantially. In fact, a point that does not get along with
done and at least two inhibitors of HIV protease are now other points is likely to be considered an outlier or noise.
being regularly used to treat AIDS. In the past two de- The number of objects increases tremendously everyday
cades, the number of 3D protein structures dramatically in different sources over the Internet. We are likely to see
increased. The Protein Data Bank maintains 26,800 entries a 3D model search engine much like the search engine for
as of August 2004. New structures are deposited every- text documents. A centralized database for 3D models is
day. Performing similarity search and pattern discovery in possible, but will be only a small portion of the available
such an enormous data set urgently demands highly effi- search space. The search engine will be capable to access
cient computational tools. Wang and coauthors (2002) data in different formats to answer the user query. Shape-
developed a framework for discovering frequently occur- based statistical approaches will remain a main stream.
ring patterns in 3D structures and applied the approach to On the other hand, in scientific data, each individual
scientific data mining. The algorithm is a variant of the point is very important. For example, it may represent the
geometric hashing technique invented for model based center of an atom. Each point may also be associated with
recognition in computer vision in 1988 (Lamdan & Wolfson, different properties. Thus blindly searching through
1988). Similar approaches were also introduced in different sources for data in different formats is not
(Verbitsky, Nussinov, & Wolfson, 1999). Since hash func- practical. Instead, we are likely to see large centralized
tions map floating point numbers to integers when calcu- data reservoir like The Protein Data Bank. For most of the
lating the hash bin addresses, they do not preserve precise applications, pattern discovery, classification, and clus-
information of the data. As a consequence, these ap- tering are most desirable techniques.
proaches are not suitable for similarity search that allows
variable ranges of tolerable errors. Furthermore, false
matches must be filtered out via a verification process. It CONCLUSION
is well known in the literature that geometric hashing is too
sensitive to noise in the data. Due to regularity of biologi- The human interest about three-dimensional structures
cal and chemical structures, dissimilarity is often very began even before Euclid. After centuries of investiga-
subtle. Inaccuracy introduced by the scanning devices tion and learning, it became clear that although we have
adds noise to the data (Ankerst, Kastenmller, & Kriegel, a better perception of 3D objects compared with our
Seidl, 1999). It is extremely difficult to choose a fixed range comprehension of text and multimedia data our tech-
of tolerable errors, especially, when the data are collected niques in the representation, storage, search, and dis-
by different domain experts, using different equipments, covery of 3D structures are lagged far behind. In com-
such as in the case of Protein Data Bank (Berman, et al., puter vision, 3D object recognition is a well know difficult
2000; Westbrook, et al., 2002). It is critical that the range problem. With advances in computational power and
of tolerable errors be set to a tunable parameter, so that the storage devices, maintaining an enormous amount of 3D

230

TEAM LinG
Data Management in Three-Dimensional Structures

structures is not only feasible but also inexpensive. How- Funkhouser, T.A., Min, P. Kazhdan, M.M., Chen, J.,
ever, few effective approaches have been developed to Halderman, A., Dobkin, D.P., et al. (2003). A search engine ,
facilitate similarity search and pattern discovery in a very for 3D models. ACM Transactions on Graphics, 22(1), 83-105.
large database of 3D structures. Applications of 3D struc-
tures to bioinformatics pose even more intricate challenges, Gavrilov, M., Indyk, P., Motwani, R., &
due to the nature that these structures often need to be Venkatasubramanian, S. (1999). Geometric pattern match-
compared point-to-point. Fortunately, research in these ing: A performance study. Proc. of the Fifteenth Annual
topics has attracted more and more attention from computer Symposium on Computational Geometry (pp. 79-85),
scientists in different areas, such as computer vision, data- Miami Beach, Florida.
base systems, artificial intelligence, etc. It can be foreseen Hilaga, M., Y. Shinagawa, Y., T. Kohmura, T., & T. L. Kunii
that substantial progress will be achieved and novel tech- T.L. (2001). Topology matching for fully automatic simi-
niques will emerge in the near future. larity estimation of 3D shapes. SIGGRAPH, 203-212.
Johnson, A.E., & Hebert, M. (1999). Using spin images for
REFERENCES efficient object recognition in cluttered 3D scenes. IEEE
Transactions on Pattern Analysis and Machine Intelli-
Akutsu, T., Arakawa, K., & Murase H. (2002). Shape from gence, 21(5), 433-449.
contour using adaptive image selection. Systems and Keim, D.A. (1999). Efficient geometry-based similarity
Computers in Japan, 33(11), 50-60. search of 3D spatial databases. Proc. of ACM SIGMOD
Ankerst, M., Kastenmller, G., Kriegel, H-P., & Seidl, T. International Conference on Management of Data (pp.
(1999). Nearest neighbor classification in 3D protein da- 419-430), Philadephia, Pennsylvania.
tabases. Proc. of the 7th International Conference on Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., &
Intelligent Systems for Molecular Biology (pp. 34-43), Protopapas, Z. (1998). Fast and effective retrieval of
Heidelberg, Germany. medical tumor shapes. IEEE Transactions on Knowledge
Belongie, S. Malik, J., & Puzicha, J. (2001). Matching and Data Engineering, 10(6), 889-904.
shapes. Proc. of the Eighth International Conference on Kuramochi, M., & Karypis, G. (2002). Discovering geo-
Computer Vision (pp. 454-463), Los Alamitos, California. metric frequent subgraphs. Proc. of the 2002 IEEE Inter-
Belongie, S., Malik, J., & Puzicha, J. (2002). Shape match- national Conference on Data Mining (pp. 258-265),
ing and object recognition using shape contexts. IEEE Maebashi, Japan.
Transactions on Pattern Analysis and Machine Intelli- Lamdan, Y., & Wolfson, H. (1988). Geometric hashing: A
gence, 24(4), 509-522. general and efficient model-based recognition scheme.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, Proc. of International Conference on Computer Vision
T.N., Weissig, H., et al. (2000). The protein data bank. (pp. 237-249).
Nucleic Acids Research, 28(1), 235-242. Lehmann, T.M., Plodowski, B., Spitzer, K., Wein, B.B.,
Bespalov, D., Shokoufandeh, A., Regli, W.C., & Sun, W. Ney, H., & Seidl, T. (2004). Extended query refinement for
(2003). Scale-space representation of 3D models and to- content-based access to large medical image databases.
pological matching. ACM Symposium on Solid Modeling Proc. of SPIE Medical Imaging (pp. 90-98).
and Applications (pp. 208-215). Lou, K., Prabhakar, S., & Ramani, K. (2004). Content-
Castelli, V., & Bergman, L. (2001). Image databases: based three-dimensional engineering shape search. Proc.
Search and retrieval of digital imagery. John Wiley & Sons. of the 2004 IEEE International Conference on Data
Engineering (pp. 754-765), Boston.
Chui, H., & Rangarajan A. (2000). A new algorithm for non-
rigid point matching. Proc. of the IEEE Conference on Osada, R., Funkhouser, T., Chazelle, B., & Dobkin, D.
Computer Vision and Pattern Recognition (pp. 44-51), (2001). Matching 3D models with shape distributions.
Hilton Head Island, South Carolina. Proc. of the International Conference on Shape Model-
ing and Applications, Genova, Italy.
Elad, M., Tal, A., & Ar, S. (2001), Content based retrieval
of VRML objects An iterative and interactive approach. Paquet, E., & Rioux, M. (1999). Crawling, indexing and
Proc. of the Eurographics Workshop on Multimedia (pp. retrieval of three-dimensional data on the web in the
97-108), Manchester, UK. framework of MPEG-7. VISUAL, 179-18.

231

TEAM LinG
Data Management in Three-Dimensional Structures

Saupe, D., & Vranic, D. V. (2001). 3D model retrieval with KEY TERMS
spherical harmonics and moments. DAGM-Symposium
2001 (pp. 392-397). B+ Tree: An index structure that decomposes a 3D
Suzuki, M.T., & Sugimoto, Y.Y. (2003). A search method structure into point-triplets and indexes the triplets in a
to find partially similar triangular faces from 3D polygonal three-dimensional B + tree.
models. Modeling and Simulation, 323-328. -Surface: The surface of a 3D structure that is
Veltkamp, R. C. (2001). Shape matching: Similarity mea- constructed by rolling a solid ball with radius along the
sures and algorithms. IEEE Shape Modeling Interna- contour and extracting every point the solid ball touches.
tional, 188-197. Aggregate Similarity Search: The search operation
Veltkamp, R. C., & Hagedoorn, M. (1999). State-of-the-art in 3D structures that matches the structures point-to-
in shape matching. Technical Report UU-CS-1999-27, point in their entirety.
Utrecht. Feature Vector: A vector in which every dimension
Verbitsky, G., Nussinov, R., & Wolfson, H. J. (1999). represents a property of a 3D structure. A good feature
Flexible structural comparison allowing hinge bending vector captures similarity and dissimilarity of 3D struc-
and swiveling motions. Proteins, 34, 232-254. tures.

Wang, X. (2001). -Surface and its application to mining Histogram: The original term refers to a bar graph that
protein data. Proc. of the IEEE International Conference represents a distribution. In information retrieval, it also
on Data Mining (pp. 659-662), San Jose, California. refers to a weighted vector that describes the properties
of an object or the comparison of properties between two
Wang, X. (2002). B+ tree: Indexing 3D point sets for objects. Like a feature vector a good histogram captures
pattern discovery. Proc. of the IEEE International Con- similarity and dissimilarity of the objects of interest.
ference on Data Mining (pp. 701-704), Maebashi, Japan.
Shape-Based Similarity Search: The search opera-
Wang X., & Wang, J.T.L. (2000). Fast similarity search in tion in 3D structures that matches only the surfaces of the
three-dimensional structure databases. Journal of Chemi- structures without referring to the points inside the sur-
cal Information and Computer Sciences, 40(2), 442-451. faces.
Wang, X., Wang, J.T.L., Shasha, D., Shapiro, B.A., Superimposition: The process of matching two 3D
Rigoutsos, I., & Zhang, K. (2002). Finding patterns in structures by alignment through rigid translations and
three dimensional graphs: Algorithms and applications to rotations.
scientific data mining. IEEE Transaction on Knowledge
and Data Engineering, 14(4), 731-749. The Curse of Dimensionality: The original term refers
to the exponential growth of hyper-volume as a function
Westbrook, J., Feng, Z., Jain, S., Bhat, T. N., Thanki, N., of dimensionality. In information retrieval, it refers to the
Ravichandran, V. et al. (2002). The protein data bank: phenomenon that the performance of an index structure
Unifying the archive. Nucleic Acids Research, 30(1), 245- for nearest neighbor search and -range search deterio-
248. rates rapidly due to the growth of hyper-volume.

232

TEAM LinG
233

Data Mining and Decision Support for ,


Business and Science
Auroop R. Ganguly
Oak Ridge National Laboratory, USA

Amar Gupta
University of Arizona, USA

Shiraj Khan
University of South Florida, USA

INTRODUCTION maker is not how one can get more information or design
better information systems but what to make of the infor-
Analytical Information Technologies mation and systems already in place. The challenge is to
be able to utilize the available information, to gain a better
Information by itself is no longer perceived as an asset. understanding of the past, and to predict or influence the
Billions of business transactions are recorded in enter- future through better decision making. Researchers in
prise-scale data warehouses every day. Acquisition, stor- data mining technologies (DMT) and decision support
age, and management of business information are com- systems (DSS) are responding to this challenge. Broadly
monplace and often automated. Recent advances in re- defined, data mining (DM) relies on scalable statistics,
mote or other sensor technologies have led to the devel- artificial intelligence, machine learning, or knowledge
opment of scientific data repositories. Database tech- discovery in databases (KDD). DSS utilize available infor-
nologies, ranging from relational systems to extensions mation and DMT to provide a decision-making tool usu-
like spatial, temporal, time series, text, or media, as ally relying on human-computer interaction. Together,
well as specialized tools like geographical information DMT and DSS represent the spectrum of analytical infor-
systems (GIS) or online analytical processing (OLAP), mation technologies (AIT) and provide a unifying plat-
have transformed the design of enterprise-scale busi- form for an optimal combination of data dictated and
ness or large scientific applications. The question in- human-driven analytics.
creasingly faced by the scientific or business decision-

Table 1. Analytical information technologies Table 2. Application examples

Data-Mining Technologies Science and Engineering


Association, correlation, clustering, Bio-Informatics
classification, regression, database Genomics
knowledge discovery
Hydrology, Hydrometeorology
Signal and image processing, nonlinear
systems analysis, time series and spatial Weather Prediction
statistics, time and frequency domain Climate Change Science
analysis Remote Sensing
Expert systems, case-based reasoning, Smart Infrastructures
system dynamics Sensor Technologies
Econometrics, management science Land Use, Urban Planning
Decision Support Systems Materials Science
Automated analysis and modeling Business and Economics
o Operations research Financial Planning
o Data assimilation, estimation, and Risk Analysis
tracking
Supply Chain Planning
Human computer interaction
Marketing Plans
o Multidimensional OLAP and
spreadsheets Text and Video Mining
o Allocation and consolidation Handwriting/Speech Recognition
engine, alerts Image and Pattern Recognition
o Business workflows and data Long-Range Economic Planning
sharing Homeland Security

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Data Mining and Decision Support for Business and Science

BACKGROUND Business solution providers and IT vendors, on the


other hand, have focused primarily on scalability, pro-
Tables 1 and 2 describe the state of the art in DMT and DSS cess automation and workflows, and the ability to com-
for science and business, and provide examples of their bine results from relatively simple analytics with judg-
applications. ments from human experts. For example, e-business
Researchers and practitioners have reviewed the applications in the areas of supply-chain planning, fi-
state of the art in analytic technologies for business nancial analysis, and business forecasting traditionally
(Apte et al., 2002; Kohavi et al., 2002; Linden & Fenn, 2003) rely on decision-support systems with embedded data
or science (Han et al., 2002), as well as data mining methods, mining, operations research and OLAP technologies,
software, and standards (Fayyad & Uthurusamy, 2002; business intelligence (BI), and reporting tools, as well
Ganguly, 2002a; Grossman et al., 2002; Hand et al., 2001; as an easy-to-use GUI (graphical user interface) and
Smyth et al., 2002) and decision support systems (Carlsson extensible business workflows (Geoffrion & Krishnan,
& Turban, 2002; Shim et al., 2002). 2003). These applications can be custom built by utiliz-
ing software tools or are available as prepackaged e-
business application suites from large vendors like SAP,
MAIN THRUST PeopleSoft, and Oracle, as well as best of breed and
specialized applications from smaller vendors like
Seibel and i2 . A recent report by the market research
Scientific and Business Applications firm Gartner (Linden & Fenn, 2003) summarizes the
relative maturity and current industry perception of
Rapid advances in information and sensor technologies advanced analytics. For reasons ranging from excessive
(IT and ST) along with the availability of large-scale (IT) vendor hype to misperceptions among end users
scientific and business data repositories or database caused by inadequate quantitative background, the busi-
management technologies, combined with breakthroughs ness community is barely beginning to realize the value
in computing technologies, computational methods, and of data-dictated predictive, analytical, or simulation
processing speeds, have opened the floodgates to data- models. However, there are notable exceptions to this
dictated models and pattern matching (Fayyad & trend (Agosta et al., 2003; Apte et al., 2002; Geoffrion
Uthurusamy, 2002; Hand et al., 2001). The use of so- & Krishnan, 2003; Kohavi et al., 2002; Wang & Jain, 2003;
phisticated and computationally-intensive analytical Yurkiewicz, 2003).
methods is expected to become even more common-
place with recent research breakthroughs in computa- Solutions Utilizing DMT and DSS
tional methods and their commercialization by leading
vendors (Bradley et al., 2002; Grossman et al., 2002; For a scientist or an engineer, as well as for a business
Smyth et al., 2002). manager or management scientist, DMT and DSS are
Scientists and engineers have developed innovative tools used for developing domain-specific applications.
methodologies for extracting correlations and associa- These applications might combine knowledge about the
tions, dimensionality reduction, clustering or classifi- specific scientific or business domain (e.g., through the
cation, regression, and predictive modeling tools based use of physically-based or conceptual-scientific mod-
on expert systems and case-based reasoning, as well as els, business best practices and known constraints, etc.)
decision support systems for batch or real-time analy- with data-dictated or decision-making tools like DSS
sis. They have utilized tools from areas like traditional and DMT. Within the context of these applications,
statistics, signal processing, and artificial intelligence, DMT and DSS can aid in the discovery of novel patterns,
as well as emerging fields like data mining, machine development of predictive or descriptive models, miti-
learning, operations research, systems analysis, and gation of natural or manmade hazards, preservation of
nonlinear dynamics. Innovative models and newly dis- civil societies and infrastructures, improvement in the
covered patterns in complex, nonlinear, and stochastic quality and span of life as well as in economic prosperity
systems encompassing the natural and human environ- and well being, and development of natural and built
ments have demonstrated the effectiveness of these environments in a sustainable fashion. Disparate appli-
approaches. However, applications that can utilize these cations utilizing DMT and DSS tools tend to have inter-
tools in the context of scientific databases in a scalable esting similarities. Examples of current best practices
fashion have only begun to emerge (Curtarolo et al., in the context of business and scientific applications are
2003; Ganguly, 2002b; Grossman & Mazzucco, 2002; provided next.
Grossman et al., 2001; Han et al., 2002; Kamath et al., Business forecasting, planning, and decision sup-
2002; Thompson et al., 2002). port applications (Carlsson & Turban, 2002; Shim et al.,

234

TEAM LinG
Data Mining and Decision Support for Business and Science

2002; Wang & Jain, 2003; Yurkiewicz, 2003) usually need for high value-added jobs (e.g., after a Pareto classifica-
to read data from a variety of sources like online transac- tion) or for exceptional situations (e.g., large prediction ,
tional processing (OLTP) systems, historical data ware- variance or significant risks). In addition, certain research
houses and data marts, syndicated data vendors, legacy studies have indicated that judgmental overrides may not
systems, or public domain sources like the Internet, as well improve upon the results of automated predictive models
as in the form of real-time or incremental data entry from on the (longer-term) average.
external or internal collaborators, expert consultants, plan- Scientists and engineers traditionally have utilized
ners, decision makers, and/or executives. Data from dis- advanced quantitative approaches for making sense of
parate sources usually are mapped to a predefined com- observations and experimental results, formulating theo-
mon data model and incorporated through extraction, trans- ries and hypotheses, and designing experiments. For
formation, and loading (ETL) tools. End users are provided users of statistical and numerical approaches in these
GUI-based access to define application contexts and set- domains, DMT often seems like the proverbial old wine
tings, structure business workflows and planning cycles in new bottles. However, innovative use of DMT in-
and format data models for visualization, judgmental up- cludes the development of algorithms, systems, and
dates, or analytical and predictive modeling. The param- practices that can not only apply novel methodologies,
eters of the embedded data mining models might be preset, but also can scale to large scientific data repositories
calculated dynamically based on data or user inputs, or (Connover et al., 2003; Graves, 2003; Han et al., 2002; He
specified by a power user. The results of the data mining et al., 2003; Ramachandran et al., 2003). While scientific
models can be utilized automatically for optimization and and business data mining have a lot in common, the
recommendation systems and/or can be used to serve as incorporation of domain knowledge is probably more
baselines for planners and decision makers. Tools like BI, critical in scientific applications. When appropriately
Reports, and OLAP (Hammer, 2003) are utilized to help combined with domain specific knowledge about the
planners and decision makers visualize key metrics and physics or the data sources/uncertainties, DMT ap-
predictive modeling results, as well as to utilize alert proaches have the potential to revolutionize the pro-
mechanisms and selection tools to manage by exception or cesses of scientific discovery, verification, and predic-
by objectives. Judgmental updates at various levels of tion (Han et al., 2002; Karypis, 2002). This potential has
aggregation and their reconciliation, collaboration among been demonstrated by recent applications in diverse
internal experts and external trading partners, as well as areas like remote sensing (Hinke et al., 2000), material
managerial review processes and adherence to corporate sciences (Curtarolo et al., 2003), bioinformatics (Graves,
directives, are aided by allocation and consolidation 2003), and the earth sciences (Ganguly, 2002b; Kamath et
engines, tools for simulation, ad hoc and predefined al., 2002; Potter et al., 2003; Thompson et al., 2002; see
reports, user-defined business workflows, audit trails http://datamining.itsc.uah.edu/adam/). Besides physi-
with comments and reason codes, and flexible informa- cal and data-dictated methods, human-computer interac-
tion transfer and data-handling capabilities. tion retains a significant role in real-world scientific
Emerging technologies include the use of automated decision making. This necessitates the use of DSS, where
DMT for aiding traditional DSS tasks; for example, the the results of DMT can be combined with expert judg-
use of data mining to zero down on the cause of aggregate ment and techniques from simulation, operations re-
exceptions in multidimensional OLAP cubes. The end search (OR), and other DSS tools. The Reviews of
results of the planning process are usually published in a Geophysics (American Geophysical Union, 1995) pro-
pre-defined placeholder (e.g., a relational database table), vides a slightly dated discussion on the use of data
which, in turn, can be accessed by execution systems or assimilation, estimation, and OR, as well as DSS, (http:/
other planning applications. The use of elaborate mecha- /www.agu.org/journals/rg/rg9504S/contents.html# hy-
nisms for user-driven analysis and judgmental or col- drology). Examples of decision support systems and
laborative decisions, as opposed to reliance on auto- tools in scientific and engineering applications also
mated DMT, remains a guiding principle for the current can be found in dedicated journals like Decision Sup-
genre of business-planning applications. The value of port Systems or Journal of Decision Systems (see
collaborative decision making and global visibility of Vol. 8, Number 2, 1998, and the latest issues), as well
information is near axiomatic for business applications. as in journals or Web sites dealing with scientific and
However, researchers need to design better DMT applica- engineering topics (McCuistion & Birk, 2002) (see the
tions that can utilize available information from disparate NASA air traffic control Web site at http://
sources through advanced analytics and account for spe- www.asc.nasa.gov/aatt/dst.html; NASA research Web
cific domain knowledge, constraints, or bottlenecks. Valu- sites for a global carbon DSS http://geo.arc.nasa.gov/
able and/or scarce human resources can be conserved by website/cquestwebsite/index.html; and institutes like
automating routine tasks and by reserving expert resources

235

TEAM LinG
Data Mining and Decision Support for Business and Science

MIT Lincoln Laboratories at http://www.ll.mit.edu/ South Florida and the second author was a member of the
AviationWeather/index2.html). faculty at the MIT Sloan School of Management.

FUTURE TREND REFERENCES

Business applications have focused on DSS with embed- Agosta, L., Orlov, L.M., & Hudso R. (2003). The future
ded and scalable implementations of relatively straight- of data mining: Predictive analytics. Forrester Brief.
forward DMT. Scientific applications have focused tradi-
tionally on advanced DMT in prototype applications with Apte, C., Liu, B., Pednault, E.P.D., & Smyth, P. (2002,
sample data. Researchers and practitioners of the future August). Business applications of data mining. Commu-
need to utilize advanced DMT for business applications nications of the ACM, 45(8), 49-53.
and scalable DMT, and DSS for scientists and engineers. Bradley, P. et al. (2002). Scaling mining algorithms to large
This provides a perfect opportunity for innovative and databases. Communications of the ACM, 45(8), 38-43.
multi-disciplinary collaborations.
Carlsson, C., & Turban, E. (2002). DSS: Directions for the next
decade. Decision Support Systems, 33(2), 105-110.
CONCLUSION Conover, H. et al. (2003). Data mining on the TeraGrid.
Proceedings of the Supercomputing Conference, Phoe-
The power of information technologies has been uti- nix, Arizona.
lized to acquire, manage, store, retrieve, and represent
data in information repositories, and to share, report, Curtarolo, S. et al. (2003). Predicting crystal structures
process, collaborate on, and move data in scientific and with data mining of quantum calculations. Physics Review
business applications. Database management and data Letters, 91(13).
warehousing technologies have matured significantly Fayyad, U., & Uthurusamy, R. (2002, August). Evolving
over the years. Tools for building custom and packaged data mining into solutions for insights. Communica-
applications, including, but not limited to, workflow tions of the ACM, 45(8), 28-31.
technologies, Web servers, and GUI-based data entry
and viewing forms, are steadily maturing. There is a Ganguly, A.R. (2002a). Software reviewData mining
clear and present need to exploit the available data and components. ORMS Today, 29(5), 56-59.
technologies to develop the next generation of scien-
tific and business applications, which can combine data- Ganguly, A.R. (2002b). A hybrid approach to improving
dictated methods with domain-specific knowledge. Ana- rainfall forecasts. Computers in Science and Engi-
lytical information technologies, which include DMT neering, 4(4), 14-21.
and DSS, are particularly suited for these tasks. These Geoffrion, A.M., & Krishnan, R. (Eds.). (2003). E-
technologies can facilitate both automated (data-dic- business and management scienceMutual impacts
tated) and human expert-driven knowledge discovery (Parts 1 and 2). Management Science, 49(10-11).
and predictive analytics, and can also be made to utilize
the results of models and simulations that are based on Graves, S.J. (2003). Data mining on a bioinformatics
process physics or business insights. If DMT and DSS grid. Proceedings of the SURA BioGrid Workshop,
were to be defined broadly, a broad statement can per- Raleigh, North Carolina.
haps be made that, while business applications have Grossman, R. et al. (2001). Data mining for scientific and
scalable but straightforward DMT embedded within DSS, engineering applications. Kluwer.
scientific applications have utilized advanced DMT but
focused less on scalability and DSS. Multidisciplinary Grossman, R.L., Hornick, M.F., & Meyer, G. (2002). Data
research and development efforts are needed in the mining standards initiative. Communications of the ACM,
future for maximal utilization of analytical information 45(8), 59-61.
technologies in the context of these applications.
Grossman, R.L., & Mazzucco, M. (2002). DataSpace: A data
Web for the exploratory analysis and mining of data.
Computers in Science and Engineering, 4(4), 44-51.
ACKNOWLEDGMENTS
Hammer, J. (Ed.). (2003). Advances in online analytical
Most of the work was completed while the first author was processing. Data & Knowledge Engineering, 45(2), 127-
on a visiting faculty appointment at the University of 256.

236

TEAM LinG
Data Mining and Decision Support for Business and Science

Han, J. et al. (2002). Emerging scientific applications in data Wang, G.C.S., & Jain, C.L. (2003). Regression analysis:
mining. Communications of the ACM, 45(8), 54-58. Modeling and forecasting. Institute of Business Fore- ,
casting.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of
data mining. Cambridge, MA: MIT Press. Yurkiewicz, J. (2003). Forecasting software survey: Pre-
dicting which product is right for you. ORMS Today.
He, Y. et al. (2003). Framework for mining and analysis of
space science data. Proceedings of the SIAM Interna-
tional Conference on Data Mining, San Francisco, Cali-
fornia.
KEY TERMS
Hinke, T., Rushing, J., Ranganath, H.S., & Graves, S.J.
(2000). Techniques and experience in mining remotely Analytical Information Technologies (AIT): Informa-
sensed satellite data. Artificial Intelligence Review, tion technologies that facilitate tasks like predictive mod-
14(6), 503-531. eling, data assimilation, planning, or decision making
through automated data-driven methods, numerical solu-
Kamath, C. et al. (2002). Classifying of bent-double galaxies. tions of physical or dynamical systems, human-computer
Computers in Science and Engineering, 4(4), 52-60. interaction, or a combination. AIT includes DMT, DSS, BI,
Karypis, G. (2002). Guest editors introduction: Data mining. OLAP, GIS, and other supporting tools and technologies.
Computers in Science and Engineering, 4(4), 12-13. Business and Scientific Applications: End-user mod-
Kohavi, R., Rothleder, N.J., & Simoudis, E. (2002). Emerg- ules that are capable of utilizing AIT along with domain-
ing trends in business analytics. Communications of the specific knowledge (e.g., business insights or constraints,
ACM, 45(8), 45-48. process physics, engineering know-how). Applications
can be custom-built or pre-packaged and are often distin-
Linden, A., & Fenn, J. (2003). Hype cycle for advanced guished form other information technologies by their
analytics, 2003. Gartner Strategic Analysis Report. cognizance of the specific domains for which they are
McCuistion, J.D., & Birk, R. (2002). From observations designed. This can entail the incorporation of domain-
to decision support: The new paradigm for satellite data. specific insights or models, as well as pre-defined infor-
NASA Technical Report. Retrieved from http:// mation and process flows.
www.iaanet.org/symp/berlin/IAA-B4-0102.pdf Business Intelligence (BI): Broad set of tools and
Potter, C. et al. (2003). Global teleconnections of ocean technologies that facilitate management of business
climate to terrestrial carbon flux. Journal of Geophysical knowledge, performance, and strategy through auto-
Research, American Geophysical Union, 108(D17), 4556. mated analytics or human-computer interaction.

Ramachandran, R. et al. (2003). Flexible framework for Data Assimilation: Statistical and other automated
mining meteorological data. Proceedings of the Ameri- methods for parameter estimation, followed by predic-
can Meteorological Societys (AMS) 19th International tion and tracking.
Conference on Interactive Information Processing Data Mining Technologies (DMT): Broadly de-
Systems (IIPS) Meteorology, Oceanography, and Hy- fined, these include all types of data-dictated analytical
drology, Long Beach, California. tools and technologies that can detect generic and inter-
Shim, J. et al. (2002). Past, present, and future of decision esting patterns, scale (or can be made to scale) to large
support technology. Decision Support Systems, 33(2), data volumes, and help in automated knowledge discov-
111-126. ery or prediction tasks. These include determining asso-
ciations and correlations, clustering, classifying, and
Smyth, P., Pregibon, D., & Faloutsos, C. (2002). Data- regressing, as well as developing predictive or forecast-
driven evolution of data mining algorithms. Communica- ing models. The specific tools used can range from
tions of the ACM, 45(8), 33-37. traditional or emerging statistics and signal or image
processing, to machine learning, artificial intelligence,
Thompson, D.S. et al. (2002). Physics-based feature min- and knowledge discovery from large databases, as well
ing for large data exploration. Computers in Science and as econometrics, management science, and tools for
Engineering, 4(4), 22-30. modeling and predicting the evolutions of nonlinear dy-
namical and stochastic systems.

237

TEAM LinG
Data Mining and Decision Support for Business and Science

Decision Support Systems (DSS): Broadly defined, ses, as well as presentation, allocation, and consolidation
these include technologies that facilitate decision making. of information along multiple dimensions (e.g., product,
These can embed DMT and utilize these through auto- location, and time). These technologies are well suited for
mated batch processes and/or user-driven simulations or management by exceptions or objectives, as well as auto-
what-if scenario planning. The tools for decision support mated or judgmental decision making.
include analytical or automated approaches like data as-
similation and operations research, as well as tools that Operations Research (OR): Mathematical and con-
help the human experts or decision makers manage by straint programming and other techniques for mathemati-
objectives or by exception, like OLAP or GIS. cally or computationally determining optimal solutions
for objective functions in the presence of constraints.
Geographical Information Systems (GIS): Tools that
rely on data management technologies to manage, pro- Predictive Modeling: The process through which
cess, and present geospatial data, which, in turn, can vary mathematical or numerical technologies are utilized to
with time. understand or reconstruct past behavior and predict
expected behavior in the future. Commonly utilized
Online Analytical Processing (OLAP): Broad set of tools include statistics, data mining, and operations
technologies that facilitate drill-down or aggregate analy- research, as well as numerical or analytical methodolo-
gies that rely on domain-knowledge.

238

TEAM LinG
239

Data Mining and Warehousing in Pharma ,


Industry
Andrew Kusiak
The University of Iowa, USA

Shital C. Shah
The University of Iowa, USA

INTRODUCTION The vision for pharmaceutical industry is captured in


Figure 1. It has become apparent that data will play a
Most processes in pharmaceutical industry are data central role in drug discovery and delivery to an individual
driven. Companys ability to capture the data and making patient. We are about to see pharmaceutical companies
use of it will grow in significance and may become the delivering drugs, developing test kits (including genetic
main factor differentiating the industry. Basic concepts tests), and computer programs to deliver the best drug to
of data mining, data warehousing, and data modeling are the patient. The current data mining (DM) tools are ca-
introduced. These new data-driven concepts lead to a pable of issuing customized prescriptions, providing most
paradigm shift in pharmaceutical industry. effective drugs, and dosages with minimal adverse ef-
fects. It has become practical to think of designing,
producing, and delivering drugs intended for an indi-
BACKGROUND vidual patient (or a small population of patients). Here,
analogy of the one-of-a-kind production of today versus
The volume of data in pharmaceutical industry has been mass production is worthy attention. Many industries are
growing at an unprecedented rate. For example, a attempting to produce customized products. The pharma-
microarray (equivalent of one test) may provide thou- ceutical industry may soon become a new addition to this
sands of data points. These huge datasets offer chal- collection of industries.
lenges and opportunities. The primary challenges are
data storage, management, and knowledge discovery.
The pharmaceutical industry is one of the most data- MAIN THRUST
driven industries and getting value out of the data is key
to its success. Thus processing the data with novel tools Paradigm Shift
and methods for extracting useful knowledge for speedy
drug discovery and optimal matching of the drugs with Predicting implies knowing in advance, which in a busi-
the patients may well become the main factor differen- ness environment translates into competitive advantage.
tiating the pharmaceutical companies.
The main sources of pharmaceutical data are pa-
Figure 1. The future of pharmaceutical industry
tients, genetic tests, regulatory sources, pharmaceuti-
cal literature, and so on. Data and information collected
Pharma
from different sources, agencies, and clinics, has been company

traditionally used for narrow reporting (Berndt et al.,


2001). Massive clinical trials may lead to errors such as Data

protocol violations, data integrity, data formats, trans-


fer errors, and so on. Traditionally the analysis in phar- Drugs Test kits
maceutical industry [from prediction of drug stability Computer
(King et al., 1984) to drug discovery techniques (Tye, program

2004)] is performed using the population based statis-


tical techniques. To minimize possible errors and in-
crease confidence in predictions there is a need for a Today Tomorrow

comprehensive methodology for data collection, stor- Customized Customized


prescription medication
age, and analysis. Automation of the pharmaceutical data
over its lifecycle seems to be an obvious alternative.

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Data Mining and Warehousing in Pharma Industry

A future event can be predicted in two major ways: C. Neural network (Mitchell, 1997)
population-based and individual-based. The population- D. Support vector machines
based prediction says, for example, a drug A has been E. Decision tree algorithms [C4.5 (Quinlan, 1992)]
effective in treating 80% of patients in the population P F. Decision rule algorithms [Rough set algorithms
as symbolically illustrated in Figure 2. Of course, any (Pawlak, 1991)]
patient would like to belong to the 80% rather than the 20% G. Association rule algorithms
category before the drug A is administered. Statistics and H. Learning classifier systems
other tools have been widely used in support of the I. Inductive learning algorithms
population-paradigm, among others in medicine and phar- J. Text learning algorithms
maceutical industry.
The individual-based approach supported by numer- Each class containing numerous algorithms, for ex-
ous DM algorithms emphasizes an individual patient ample, there are more than 100 implementations of the
rather than the population (Kusiak et al., 2005). One of decision tree algorithm (class E).
many decision-making scenarios is illustrated in Figure 3,
where the original population P of patients has been Data Warehouse Design
partitioned into two segments 1 and 2. The decisions for
each patient in Segment 1 are made with high confidence; A warehouse has to be designed to meet users require-
say 99%, while the decisions for Segment 2 are predicted ments. DM, online analytical processing (OLAP), and
with lower confidence. reporting are the top items on the list of requirements.
It is quite possible that Segment 2 patients would Systems design methodologies, and tools can be used to
seek an alternative drug or a treatment. There are differ- facilitate the requirements capture. Examples of meth-
ent ways of using DM algorithms. They cover the range odologies for analysis of data warehouse (DW) require-
between the population and individual-based paradigms. ments include AND/OR graphs and the house of quality
The exiting DM algorithms can be grouped into the (Kusiak, 2000).
following basic ten classes (Kusiak, 2001): The architecture of a typical DW embedded in a
pharmaceutical environment is shown in Figure 4. The
A. Classical statistical methods (e.g., linear, qua- pharmaceutical data is extracted from numerous sources
dratic, and logistic discriminant analyses) and preprocessed to minimize inconsistencies. Also, data
B. Modern statistical techniques (e.g., projection transformation will capture intricate solution spaces to
pursuit classification, density estimation, k-near- improve knowledge discovery. The cleaned and trans-
est neighbor, Bayes algorithm) formed data is loaded and refreshed directly into a DW
or data marts. A data mart might be a precursor to the
full-fledged DW or function as a specialized DW. A
Figure 2. Population-based paradigm
special purpose (exploratory) data mart or a DW might
be created for exploratory data analysis and research.
80% P cured The warehouse and data marts serve various applications
that justify the development and maintenance cost in
Drug A Population P Confidence = 95%
this data storage technology. The range of services that
20% P not cured could be developed off a DW could be expanded beyond
OLAP and DM into almost all pharmaceutical business
areas, including interactions with federal agencies and
Figure 3. Individual-based paradigm using data other businesses.
mining tools
Data Flow Analysis
70% P cured
Population A task that may parallel the capture of requirements for
Confidence = 99%
Segment 1
a DW involves analysis of data flow. A warehouse is to
15% P not cured integrate various streams of data that have to be identi-
Drug A fied. The information analysts and users need to feel
10% P cured comfortable with the data flow methodology selected
Population
for capturing the data flow logic. At minimum this data
Confidence = 85%
Segment 2 flow modeling exercise should increase efficiency of
5% P not cured the data handling and management. An example to a
methodology that can be used to model data flow is the

240

TEAM LinG
Data Mining and Warehousing in Pharma Industry

Figure 4. Data warehouse architecture


,
Data marts

Data mining

Data
Transformation Datya
Data
Servive
and Loading
Raw Data
OLAP

Data
Exploratory
warehouse
data
Information
warehouse
Data and
Sources Knowledge
Services

Integrated Definition (Kusiak, 2000) process modeling Decision 1 = (Score_9_months - Score_3_months)/


methodology, IDEF3. (Number of months)
Decision 2 = Slope of the scores over time
Preprocessing For Data Warehousing Decision 3 = Average score over time

Before a viable and accurate DW is built, there is a need


Decision 4 = Percentage improvement between
to solve specific problems associated with data manage- first and last reading
ment. The collection process, coding schemes and stan- Decision 5 = Score based on group means with first
dards may lead to errors, which must be fixed. Protocols and last reading over time
for data collection and transfer should concentrate, among
other things, on feature related unwanted characters, The clinical trials performed over an extended period
illegal values, out of range feature values, non-uniform of time lead to temporal data. There is a need to incorpo-
primary keys, and so on. rate additional parameters in the temporal dataset. These
Clinical trials are costly and there are various methods parameters can be a trend variable, time, or a transformed
to reducing the number of participating subjects. Selecting parameter. Table 1 provides an example of temporal
the top and bottom quartile patients (score/criterion based) decision handling by employing five different decision
for both placebo and drug subjects can significantly re- measures. Case 2 is a unique example as decision 3 and
duce the trial magnitude. Thus special consideration is 5 imply that there is no improvement, contrary to the
required to handle data for smaller number of subjects. The actual fact. Thus a new meta-decision measure can be
outcomes are often measured with subjective scores that compiled based on the above five decisions.
vary from subject to subject and also for each subject. Handling noisy data, including genetic data, is a
Handling these subjective outcomes is a challenging prob- challenge (Table 2). There are various models that can
lem for developing practical solutions. An example of compress, identify and reduce noisy genetic data such as
subjective outcomes measured over a nine-month period clustering and feature reduction methods (Shah & Kusiak,
is shown in Table 1. 2004; Dudoit et al., 2000).

Table 1. Example of subjective and temporal outcomes

Subjective Score Possible Decision Measures


ID 3_months 6_months 9_months Decision 1 Decision 2 Decision 3 Decision 4 Decision 5
1 70.6 78.41 97.3 2.97 4.4496 82.1 112.09 -2.98
2 30.01 62.78 77.29 5.25 7.8805 56.7 171.41 -4.88
3 61.12 86.55 26.77 -3.82 -5.7244 58.15 101.38 3.2
4 74.05 44.44 55.67 -2.04 -3.0637 58.05 81.16 1.54
5 75.28 81.95 41.2 -3.79 -5.6812 66.14 92.89 3.12

241

TEAM LinG
Data Mining and Warehousing in Pharma Industry

Table 2. Noisy genetic data 2000). The snowflake schema is a variation of star schema
in which the dimensional tables are organized into a
ID Treatment SNP1 SNP2 SNP3 SNP4 SNP5 Decision
hierarchy by normalizing the tables. Another important
1 Drug A/T C/T A/C G/T C/T Improved
issue while designing the DW especially for pharmaceu-
2 Drug A/T T/T A/A G/G T/T Improved
3 Drug A/T C/T A/C G/T T/T Not_Improved
tical industry is the security issues, that is, avoiding
4 Drug A/A C/C A/A T/T C/T Not_Improved unauthorized access.
5 Drug A/T C/T A/C G/T C/T Not_Improved
6 Placebo A/T C/T C/C G/T T/T Improved Data Modeling
7 Placebo A/A C/T A/C G/G C/T Improved
8 Placebo A/T C/C A/C G/T C/T Improved The data stored in DW is relatively error free and com-
9 Placebo A/T C/T A/C G/T C/T Not_Improved pact. This data contains hidden information, which needs
10 Placebo T/T T/T C/C T/T C/C Not_Improved
to be extracted. It forms a staging ground for knowledge
discovery, and decision-making (Elmasri & Navathe,
Data Warehouse Characteristics 2000). The desired functionality of DW access tools is
as follows:
A DW is a subject-oriented, integrated, nonvolatile, time-
variant collection of data in support of users (Elmasri & Tabular reporting and information mapping
Navathe, 2000). The prominent feature of the DW is to Complex queries and sophisticated criteria search
support large volume of data and relatively small number Ranking
of users with relatively long interactions, high perfor- Multivariable and time series analysis
mance levels, multidimensional view, usability, manage- Data visualization, graphing, charting, and pivoting
ability, and flexible reporting. It must efficiently extract Complex textual search
and process data. Advanced statistical analysis
Data cuboids in the DW are multidimensional con- Trend discovery and analysis
structs reflecting the relationships in the underlying Pattern and associations discovery
data. A three-dimensional cuboid is called a data cube
and is illustrated in Figure 5. The data cube represents There are three main streams for data analysis namely,
three dimensions, namely cancer type, medication, and OLAP, statistics, and DM. OLAP supports various view-
time factor. The data cubes can be views through multiple points at different levels of abstraction, which helps in
orientations using a technique called pivoting (Elmasri & data visualization and analysis. Population-based statis-
Navathe, 2000). Thus the data cube in can be pivoted to tical analysis of the data can be performed through various
form a separate medication and time factor table for each tools ranging from simple regression to complex multivari-
cancer type. This data structure allows for roll-up and ate analysis. DM offers tools (decision trees, decision
drill-down capabilities. Roll-up moves up the hierarchy, rules, support vector machines, neural networks, associa-
for example grouping the cancer type dimension to present tion rules, and so on) for discovery of new knowledge by
a single medication and time factor view. Similarly drill- converting hidden information into business models. It
down helps in finer grained view, for example, the medi- provides tools for identifying valid, novel, potentially
cations can be broken down into cardiac, respiratory, and useful, and ultimately understandable patterns from data
skin medications. and constructs high confidence predictions for individu-
The DW is normally designed with star schema or the als (Fayyad et al., 1997). This may represent valuable
snowflake schema. The star schema is designed with a knowledge that might lead to medical discoveries, for
fact table (tuples arranged in one per recorded fact) and example, certain ranges of parameter values leading to
single table for each dimension (Elmasri & Navathe, longer survival time.
[Figure 5. Data cube: a three-dimensional cube with dimensions for time, cancer type, and medication]

Thus OLAP and DM complement each other in the analysis and provide different but much needed functionality for data understanding, visualization, and making individualized decisions. Data modeling provides analytical reasoning for decision-making to solve specific patient and business issues.

Applications
[Figure 6. Pharmaceutical and medical applications: application areas (prediction, identification, classification, optimization) mapped to parameter selection and identification, drug discovery, adverse effects, outcome predictions, business information, individualized treatments, and customized protocols]

DM, OLAP, and decision-making algorithms will lead to outcome predictions, identification of significant patterns and parameters, patient-specific decisions, and ultimately process/goal optimization. Prediction, identification, classification, and optimization are key to drug discovery, adverse effect analysis, outcome predictions, and individualized protocols (Figure 6).
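As a concrete illustration of the prediction and classification applications named above, the sketch below trains a decision tree to predict treatment outcome from SNP genotypes. It is not taken from the article; the records, SNP columns, and outcome labels are hypothetical and only mirror the format of the sample data shown earlier:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical patient records in the style of the sample genotype table.
records = pd.DataFrame({
    "treatment": ["Drug", "Drug", "Placebo", "Placebo", "Drug", "Placebo"],
    "SNP1": ["A/T", "A/A", "A/T", "T/T", "A/A", "A/T"],
    "SNP2": ["C/T", "C/C", "C/T", "T/T", "C/T", "C/C"],
    "outcome": ["Improved", "Improved", "Not_Improved", "Not_Improved",
                "Improved", "Not_Improved"],
})

# One-hot encode the categorical genotype and treatment attributes.
X = pd.get_dummies(records.drop(columns="outcome"))
y = records["outcome"]

# A shallow tree keeps the induced rules interpretable.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(model, feature_names=list(X.columns)))

# Score a new, hypothetical patient record.
new_patient = pd.get_dummies(
    pd.DataFrame([{"treatment": "Drug", "SNP1": "A/T", "SNP2": "C/T"}])
).reindex(columns=X.columns, fill_value=0)
print(model.predict(new_patient))
```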
The issues highlighted in this article are illustrated with various medical informatics research projects. DM has led to identifying predictive parameters, formulating individualized treatment protocols, categorizing model patients, and developing a decision-making model for predictions of survival for dialysis patients (Kusiak et al., 2005). A gene/SNP selection approach (Shah & Kusiak, 2004) was developed using weighted decision trees and a genetic algorithm. The high quality significant gene/SNP subset can be targeted for drug development and validation. An intelligent DM system determined the wellness score for infants with hypoplastic left heart syndrome based on 73 physiologic, laboratory, and nurse-assessed parameters (Kusiak et al., 2003). For cancer-related analysis, a patient acceptance model can be developed to predict the treatment type, drug toxicity, and the length of disease-free status after the treatment. The concepts discussed in this article can be applied across all medical topics, including phenotypic and genotypic data.

Other examples are Merck gene sequencing (Eckman et al., 1998), epidemiological and clinical toxicology (Helma et al., 2000), prediction of rodent carcinogenicity bioassays (Bahler et al., 2000), gene expression level analysis (Dudoit et al., 2000), predicting the risk of coronary artery disease (Tham et al., 2003), and the VA health care system (Smith & Joseph, 2003).

FUTURE TRENDS

The future of data mining, warehousing, and modeling in the pharmaceutical industry offers numerous challenges. The first one is the scale of the data, measured by the number of features. New scalable schemes for parameter selection and knowledge discovery will be developed. There is a need to develop new tools for rapid evaluation of parameter relevancy. Dynamic clinical studies would provide an additional facet to interact and intervene, resulting in clean, error-free, and necessary data collection and analysis.

CONCLUSION

Data warehousing, data modeling, data mining, OLAP, and decision-making algorithms will ultimately lead to targeted drug discovery and individualized treatments with minimum adverse effects. The success of the pharmaceutical industry will largely depend on following the course outlined by the new data paradigm.

REFERENCES

Bahler, D., Stone, B., Wellington, C., & Bristol, D.W. (2000). Symbolic, neural, and Bayesian machine learning models for predicting carcinogenicity of chemical compounds. Journal of Chemical Information and Computer Sciences, 40(4), 906-914.

Berndt, D.J., Fisher, J.W., Hevner, R.A., & Studnicki, J. (2001). Healthcare data warehousing and quality assurance. IEEE Computer, 34(12), 56-65.

Dudoit, S., Fridlyand, J., & Speed, T.P. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data. Technical Report 576. Berkeley, CA: Department of Statistics, University of California.

Eckman, B.A., Aaronson, J.S., Borkowski, J.A., Bailey, W.J., Elliston, K.O., Williamson, A.R., & Blevins, R.A. (1998). The Merck gene index browser: An extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics, 14, 2-13.

Elmasri, R., & Navathe, S.B. (2000). Fundamentals of database systems. New York: Addison-Wesley.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1997). Advances in knowledge discovery and data mining. Cambridge, MA: MIT Press.

Helma, C., Gottmann, E., & Kramer, S. (2000). Knowledge discovery and data mining in toxicology. Statistical Methods in Medical Research, 9(4), 329-358.

King, S.Y., Kung, M.S., & Fung, H.L. (1984). Statistical prediction of drug stability based on nonlinear parameter estimation. Journal of Pharmaceutical Sciences, 73(5), 657-662.

Kusiak, A. (2000). Computational intelligence in design and manufacturing. New York: John Wiley.

Kusiak, A. (2001). Feature transformation methods in data mining. IEEE Transactions on Electronics Packaging Manufacturing, 24(3), 214-221.

Kusiak, A., Caldarone, C.A., Kelleher, M.D., Lamb, F.S., Persoon, T.J., Gan, Y., & Burns, A. (2003, April). Mining temporal data sets: Hypoplastic left heart syndrome case study. In Proceedings of the SPIE Conference on Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, 5098 (pp. 93-101). Bellingham, WA: SPIE.

Kusiak, A., Dixon, B., & Shah, S. (2005). Predicting survival time for kidney dialysis patients: A data mining approach. Computers in Biology and Medicine, 35(4), 311-327.

Kusiak, A., Kern, J.A., Kernstine, K.H., & Tseng, T.L. (2000). Autonomous decision-making: A data mining approach. IEEE Transactions on Information Technology in Biomedicine, 4(4), 274-284.

Mitchell, T.M. (1997). Machine learning. New York: McGraw-Hill.

Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Boston, MA: Kluwer.

Quinlan, R. (1992). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Shah, S.C., & Kusiak, A. (2004). Data mining and genetic algorithm based gene/SNP selection. Artificial Intelligence in Medicine, 31(3), 183-196.

Smith, M.W., & Joseph, G.J. (2003). Pharmacy data in the VA health care system. Medical Care Research and Review: MCRR, 60(3 Suppl), 92S-123S.

Tham, C.K., Heng, C.K., & Chin, W.C. (2003). Predicting risk of coronary artery disease from DNA microarray-based genotyping using neural networks and other statistical analysis tools. Journal of Bioinformatics and Computational Biology, 1(3), 521-539.

Tye, H. (2004). Application of statistical design of experiments methods in drug discovery. Drug Discovery Today, 9(11), 485-491.

KEY TERMS

Adverse Effects: Any untoward medical occurrences that may be life threatening and require in-patient hospitalization.

Clustering: Clustering algorithms discover similarities and differences among groups of items. They divide a dataset so that patients with similar content are in the same group, and groups are as different as possible from each other.

Customized Protocols: A specific set of treatment parameters and their values that are unique to an individual. The customized protocols are derived from discovered knowledge patterns.

Data Visualization: The method or end result of transforming numeric and textual information into a graphic format. Visualizations are used to explore large quantities of data holistically in order to understand trends or principles.

Decision Trees: A decision-tree algorithm creates rules based on decision trees or sets of if-then statements to maximize interpretability.

Drug Discovery: A research process that identifies molecules with desired biological effects so as to develop new therapeutic drugs.

Feature Reduction Methods: The goal of feature reduction methods is to identify the minimum set of non-redundant features (e.g., SNPs, genes) that are useful in classification.

Knowledge Discovery: Knowledge discovery in databases is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns/models in data.

Neural Networks: Neural networks are a set of simple units (neurons) that receive a number of real-valued inputs, which are processed through the network to produce a real-valued output.

OLAP: Online analytical processing is a category of software tools that provides analysis of data stored in a database.
Data Mining for Damage Detection in Engineering Structures

Ramdev Kanapady
University of Minnesota, USA

Aleksandar Lazarevic
University of Minnesota, USA

INTRODUCTION

The process of implementing and maintaining a structural health monitoring system consists of operational evaluation, data processing, damage detection, and life prediction of structures. This process involves the observation of a structure over a period of time using continuous or periodic monitoring of spaced measurements, the extraction of features from these measurements, and the analysis of these features to determine the current state of health of the system. Such health monitoring systems are common for bridge structures, and many examples are cited in (Maalej et al., 2002).

The phenomenon of damage in structures includes localized softening or cracks in a certain neighborhood of a structural component due to high operational loads, or the presence of flaws due to manufacturing defects. Methods that detect damage in the structure are useful for non-destructive evaluations that typically are employed in agile manufacturing and rapid prototyping systems. In addition, there are a number of structures, such as turbine blades, suspension bridges, skyscrapers, aircraft structures, and various structures deployed in space, for which structural integrity is of paramount concern (Figure 1). With the increasing demand for safety and reliability of aerospace, mechanical, and civilian structures, damage detection techniques become critical to reliable prediction of damage in these structural systems.

The most currently used damage detection methods are manual, such as the tap test, visual inspection, or specially localized measurement techniques (Doherty, 1997). These techniques require that the location of the damage be on the surface of the structure. In addition, the location of the damage has to be known a priori, and these locations have to be readily accessible. This makes the current maintenance procedure of large structural systems very time consuming and expensive due to its heavy reliance on human labor.

[Figure 1. Damage detection is part of a structural health monitoring system]

BACKGROUND

The damage in structures and structural systems is defined through comparison of two different states of the system; the first one is the initial undamaged state, and the second one is the damaged state. Due to the changes in the properties of the structure or quantities derived from these properties, the process of damage detection eventually reduces to a form of pattern recognition/data mining problem. In addition, emerging continuous monitoring of an instrumented structural system often results in the accumulation of a large amount of data that need to be processed, analyzed, and interpreted for feature extraction in an efficient damage detection system. However, the rate of accumulating such data sets far outstrips the ability to analyze them manually. More specifically, there is often information hidden in the data that cannot be discovered manually. As a result, there is a need to develop an intelligent data processing component that can significantly improve the current damage detection systems. An immediate alternative is to design data mining techniques that can enable real-time prediction and identification of damage for newly available test data, once a sufficiently accurate model is developed.
In recent years, various data mining techniques, such as artificial neural networks (ANNs) (Anderson, Lemoine & Ambur, 2003; Lazarevic et al., 2004; Ni, Wang & Ko, 2002; Sandhu et al., 2001; Yun & Bahng, 2000; Zhao, Ivan & DeWolf, 1998), support vector machines (SVMs) (Mita & Hagiwara, 2003), and decision trees (Sandhu et al., 2001), have been applied successfully to structural damage detection problems, thus showing that they can be potentially useful for such a class of problems. This success can be attributed to numerous disciplines integrated with data mining, such as pattern recognition, machine learning, and statistics. In addition, it is well known that data mining techniques can effectively handle noisy, partially incomplete, and faulty data, which is particularly useful, since in damage detection applications, measured data are expected to be incomplete, noisy, and corrupted.

The intent of this paper is to provide a survey of emerging data mining techniques for damage detection in structures. Although the field of damage detection is very broad and consists of a vast literature that is not based on data mining techniques, this survey will be focused predominantly on data mining techniques for damage detection based on changes in properties of the structure. However, a large amount of literature applicable to fault detection and diagnosis in application-specific systems, such as rotating machinery, is not within the scope of this paper.

CATEGORIZATION OF STRUCTURAL DAMAGE

The damage in structures can be classified as linear or nonlinear. Damage is considered as linear if the undamaged linear-elastic structure remains linear-elastic after damage. However, if the initially linear-elastic structure behaves in a nonlinear manner after the damage initiation, then the damage is considered as nonlinear. However, it is possible that the damage is linear at the damage initiation phase, but after prolonged growth in time, it may become nonlinear. For example, loose connections between the structures at the joints or the joints that rattle (Sekhar, 2003) are considered non-linear damages. Examples of such non-linear damage detection systems are described in Adams and Farrar (2002) and Kerschen and Golinval (2004).

Most of the modal data in the literature are proposed for the linear case. They are based on the following three levels of damage identification: (1) Recognition: a qualitative indication that damage might be present in the structure; (2) Localization: information about the probable location of the damage in the structure; and (3) Assessment: an estimate of the extent of severity of the damage in the structure. Such linear damage detection techniques can be found in Yun and Bahng (2000), Ni, et al. (2002), and Lazarevic, et al. (2004).

CLASSIFICATION OF DAMAGE DETECTION TECHNIQUES

We provide several different criteria for classification of damage detection techniques based on data mining.

In the first classification, damage detection techniques can be categorized into continuous (Keller & Ray, 2003) and periodic (Patsias & Staszewski, 2002) damage detection systems. Continuous techniques usually employ an integrated approach that consists of a data acquisition process, feature extraction from large amounts of data collected from real-time sensors, and a damage detection process. In periodic techniques, the feature extraction process is optional, since the amount of data that needs to be processed is not large and does not necessarily require data mining techniques for feature extraction.

In the second classification, we distinguish between application-based and application-independent techniques. Application-based techniques are generally applicable to a specific structural system, and they typically assume that the monitored structure responds in some predetermined manner that can be accurately modeled by (i) numerical techniques such as finite element (Sandhu et al., 2001) or boundary element analysis (Anderson, Lemoine & Ambur, 2003) and/or (ii) the behavior of the response of the structures based on physics-based models (Keller & Ray, 2003). Most of the damage detection techniques that exist in the literature belong to the application-based approach, where the minimization of the residue between the experimental and the analytical model is built into the system. Often, this type of data is not available and can render application-based methods impractical for certain applications, particularly for structures that are designed and commissioned without such models. On the other hand, application-independent techniques do not depend on a specific structure, and they are generally applicable to any structural system. However, the literature on these techniques is very sparse, and the research in this area is at a very nascent stage (Bernal & Gunes, 2000; Zang, Friswell & Imregun, 2004).

In the third classification, damage detection techniques are split into signature-based and non-signature-based methods. Signature-based techniques extensively use signatures of known damages in the given structure that are provided by human experts. These techniques commonly fall into the category of recognition of damage detection, which only provides the qualitative indication that damage might be present in the structure (Friswell, Penny & Wilson, 1994) and, to a certain extent, the localization of the damage (Friswell, Penny & Garvey, 1997). Non-signature methods are not based on signatures of known damages, and they not only recognize but also localize and assess the extent of damage. Most of the damage detection techniques in the literature fall into this category (Lazarevic et al., 2004; Ni et al., 2002; Yun & Bahng, 2000).
In the fourth classification, damage detection techniques are classified into local (Sekhar, 2003; Wang, 2003) and global (Fritzen & Bohle, 2001) techniques. Typically, the damage is initiated in a small region of the structure, and, hence, it can be considered as a local phenomenon. One could employ local or global damage detection features that are derived from local or global response or properties of the structure. Although local features can detect the damage effectively, these features are very difficult to obtain from the experimental data, such as higher natural frequencies and mode shapes of the structure (Ni et al., 2002). In addition, since the vicinity of damage is not known a priori, the global methods that can employ only global damage detection features, such as lower natural frequencies of the structure (Lazarevic et al., 2004), are preferred.

Finally, damage detection techniques can be classified as traditional and emerging data mining techniques. Traditional analytical techniques employ mathematical models to approximate the relationships between specific damage conditions and changes in the structural response or dynamic properties. Such relationships can be computed by solving a class of so-called inverse problems. The major drawbacks of the existing approaches are as follows: (i) the more sophisticated methods involve computationally cumbersome system solvers, which are typically solved by singular value decomposition techniques, non-negative least-squares techniques, bounded variable least-squares techniques, and so forth; and (ii) all computationally intensive procedures need to be repeated for any newly available measured test data for a given structure. A brief survey of these methods can be found in Doebling, et al. (1996). On the other hand, data mining techniques consist of application techniques to model an explicit inverse relation between damage detection features and damage by minimization of the residue between the experimental and the analytical model at the training level. For example, the damage detection features could be natural frequencies, mode shapes, mode curvatures, and so forth. It should be noted that data mining techniques also are applied to detect features in large amounts of measurement data. In the next few sections, we will provide a short description of several types of data mining algorithms used for damage detection.

Classification

Data mining techniques based on classification were successfully applied to identify the damage in structures. For example, decision trees have been applied to detect damage in an electrical transmission tower (Sandhu et al., 2001). It has been found that in this approach, decision trees can be easily understood, while many interesting rules about the structural damage were found. In addition, a method using support vector machines (SVMs) has been proposed to detect local damages in a building structure (Mita & Hagiwara, 2003). The method is verified to have the capability to identify not only the location of damage, but also the magnitude of damage with satisfactory accuracy, employing modal frequency patterns as damage features.
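A minimal sketch of such a classifier is given below. It is not the implementation from the cited studies: the baseline frequencies, frequency-shift patterns, and damage labels are synthetic assumptions standing in for modal data that would normally come from measurements or a model of the monitored structure:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_class = 50

# Hypothetical first five natural frequencies (Hz) of the undamaged structure.
baseline = np.array([3.1, 7.9, 12.4, 18.6, 25.0])

# Each (hypothetical) damage location is assumed to shift the frequencies
# according to its own pattern.
shift_patterns = {"undamaged": np.zeros(5),
                  "story_1":   np.array([-0.20, -0.05, -0.10, -0.02, -0.01]),
                  "story_2":   np.array([-0.05, -0.25, -0.02, -0.12, -0.03])}

X, y = [], []
for label, shift in shift_patterns.items():
    samples = baseline + shift + rng.normal(0.0, 0.02, size=(n_per_class, 5))
    X.append(samples)
    y += [label] * n_per_class
X = np.vstack(X)

# Standardize the modal frequency features and fit a kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
model.fit(X, y)

# Classify a newly measured frequency pattern.
print(model.predict([baseline + shift_patterns["story_2"]]))
```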
Pattern Recognition

Pattern recognition techniques also are applied to damage detection by various researchers. For example, statistical pattern recognition has been applied to damage detection employing relatively few measurements of modal data collected from three scale-model reinforced concrete bridges (Haritos & Owen, 2004), but the method was only able to indicate that damage had occurred. In addition, independent component analysis, a multivariate statistical method also known as proper orthogonal decomposition, has been applied to damage detection problems on time history data to capture the essential pattern of the measured vibration data (Zang, Friswell & Imregun, 2004).

Neural Networks

Prediction-based techniques such as neural networks (Lazarevic et al., 2004; Ni et al., 2002; Zhao et al., 1998) have been successfully applied to detect the existence, location, and quantification of the damage in the structure employing modal data. Neural networks have been extremely popular in recent years due to their capabilities as universal approximators.

In damage detection approaches based on neural networks, the damage location and severity are simultaneously identified using a one-stage scheme, also called the direct method (Zhao et al., 1998), where the neural network is trained with different damage levels at each possible damage location. However, these studies were restricted to very small models with a small number of target variables (order of 10), and the development of a predictive model that could correctly identify the location and severity of damage in practical large-scale complex structures using this direct approach was a considerable challenge. Increased geometric complexity of the structure caused an increase in the number of target variables, thus resulting in data sets with a large number of target variables.
Since the number of prediction models that needs to be built for each continuous target variable increases, the number of training data records required for effective training of neural networks also increases, thus requiring more computational time for training neural networks, but also more time for data generation, since each damage state (data record) requires an eigen solver to generate the natural frequencies and mode shapes of the structure. The earlier direct approach, employed by numerous researchers, required the prediction of the material property, namely the Young's modulus of elasticity, considering all the elements in the domain individually or simultaneously. However, this approach does not scale to situations in which thousands of elements are present in the complex geometry of the structure or when multiple elements in the structure have been damaged simultaneously.

To reduce the size of the system under consideration, several substructure-based approaches have been proposed (Sandhu et al., 2002; Yun & Bahng, 2000). These approaches partition the structure into logical substructures and then predict the existence of the damage in each of them. However, pinpointing the location of damage and the extent of the damage is not resolved completely in these approaches. Recently, these issues have been addressed in two hierarchical approaches (Lazarevic et al., 2004; Ni et al., 2002). In the former, neural networks are hierarchically trained using one-level damage samples to first locate the position of the damage, and then the network is retrained by an incremental weight update method using additional samples corresponding to different damage degrees, but only at the location identified at the first stage. The input attributes of the neural networks are designed to depend only on damage location, and they consisted of several natural frequencies and a few incomplete modal vectors. Since measuring mode shapes is difficult, global methods based only on natural frequencies are highly preferred. However, employing natural frequencies as features traditionally has many drawbacks (e.g., two symmetric damage locations cannot be distinguished using only natural frequencies). To overcome these drawbacks, Lazarevic, et al. (2004) proposed hierarchical and localized clustering approaches based only on natural frequencies as features, where symmetrical damage locations as well as spatial characteristics of structural systems are integrated in building the model.
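The direct method discussed above can be sketched as follows. This is not the setup of the cited studies: the sensitivity matrix is a synthetic stand-in for an eigen solver run on a finite element model, and all damage states, element counts, and frequencies are hypothetical:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n_elements, n_freqs, n_samples = 4, 6, 400

# Random damage states: fractional stiffness reduction for each element.
damage = rng.uniform(0.0, 0.5, size=(n_samples, n_elements))

# Surrogate for the eigen solver: each damage pattern lowers the baseline
# frequencies through a fixed sensitivity matrix, plus measurement noise.
baseline = np.linspace(5.0, 40.0, n_freqs)
sensitivity = rng.uniform(0.5, 2.0, size=(n_elements, n_freqs))
frequencies = baseline - damage @ sensitivity + rng.normal(0, 0.05, (n_samples, n_freqs))

# Train the network to invert a frequency pattern back to per-element damage.
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=1)
net.fit(frequencies, damage)

# Predict the damage state for a newly "measured" frequency vector.
test_damage = np.array([0.0, 0.3, 0.0, 0.1])
test_freqs = baseline - test_damage @ sensitivity
print(net.predict([test_freqs]).round(2))
```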
Other Techniques

Other data mining based approaches also have been applied to different problems in structural health monitoring. For example, outlier-based analysis techniques (Worden, Manson & Fieller, 2000) have been used to detect the existence of damage; wavelet-based approaches (Wang, 2003) have been used to detect damage features; and a combination of independent component analysis and artificial neural networks (Zang, Friswell & Imregun, 2004) has been applied successfully to detect damages in structures.

FUTURE TRENDS

Damage detection is increasingly becoming an indispensable and integral component of any comprehensive structural health monitoring program for mechanical and large-scale civilian, aerospace, and space structures. Although a variety of techniques have been developed for detecting damages, there are still a number of research issues concerning the prediction performance and efficiency of the techniques that need to be addressed (Auwerarer & Peeters, 2003; De Boe & Golinval, 2001).

CONCLUSION

In this paper, a survey of emerging data mining techniques for damage detection in structures is provided. This survey reveals that the existing data mining techniques are based predominantly on changes in properties of the structure to classify, localize, and predict the extent of damage.

REFERENCES

Adams, D., & Farrar, C. (2002). Classifying linear and nonlinear structural damage using frequency domain ARX models. Structural Health Monitoring, 1(2), 185-201.

Anderson, T., Lemoine, G., & Ambur, D. (2003). An artificial neural network based damage detection scheme for electrically conductive composite structures. Proceedings of the 44th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, Norfolk, Virginia.

Auwerarer, H., & Peeters, B. (2003). International research projects on structural health monitoring: An overview. Structural Health Monitoring, 2(4), 341-358.

Bernal, D., & Gunes, B. (2000). Extraction of system matrices from state-space realizations. Proceedings of the 14th Engineering Mechanics Conference, Austin, Texas.

De Boe, P., & Golinval, J.-C. (2001). Damage localization using principal component analysis of distributed sensor array. In F.K. Chang (Ed.), Structural health monitoring: The demands and challenges (pp. 860-861). Boca Raton, FL: CRC Press.

Doebling, S., Farrar, C., Prime, M., & Shevitz, D. (1996). Damage identification and health monitoring of structural systems from changes in their vibration characteristics: A literature review [Report LA-12767-MS]. Los Alamos National Laboratory, Los Alamos, NM.
Doherty, J. (1987). Nondestructive evaluation. In A.S. Kobayashi (Ed.), Handbook on experimental mechanics (ch. 12). Society of Experimental Mechanics, Inc., Englewood Cliffs, NJ.

Friswell, M., Penny, J., & Garvey, S. (1997). Parameter subset selection in damage location. Inverse Problems in Engineering, 5(3), 189-215.

Friswell, M., Penny, J., & Wilson, D. (1994). Using vibration data and statistical measures to locate damage in structures. Modal Analysis: The International Journal of Analytical and Experimental Modal Analysis, 9(4), 239-254.

Fritzen, C., & Bohle, K. (2001). Vibration based global damage identification: A tool for rapid evaluation of structural safety. In F.K. Chang (Ed.), Structural health monitoring: The demands and challenges (pp. 849-859). Boca Raton, FL: CRC Press.

Haritos, N., & Owen, J.S. (2004). The use of vibration data for damage detection in bridges: A comparison of system identification and pattern recognition approaches. Structural Health Monitoring, 3(2), 141-163.

Keller, E., & Ray, A. (2003). Real-time health monitoring of mechanical structures. Structural Health Monitoring, 2(3), 191-203.

Kerschen, G., & Golinval, J.-C. (2004). Feature extraction using auto-associative neural networks. Smart Materials and Structures, 13, 211-219.

Khoo, L., Mantena, P., & Jadhav, P. (2004). Structural damage assessment using vibration modal analysis. Structural Health Monitoring, 3(2), 177-194.

Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2003a). Localized prediction of continuous target variables using hierarchical clustering. Proceedings of the Third IEEE International Conference on Data Mining, Florida.

Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2003b). Damage prediction in structural mechanics using hierarchical localized clustering-based approach. Proceedings of Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, Orlando, Florida.

Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2004). Effective localized regression for damage detection in large complex mechanical structures. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington.

Maalej, M., Karasaridis, A., Pantazopoulou, S., & Hatzinakos, D. (2002). Structural health monitoring of smart structures. Smart Materials and Structures, 11, 581-589.

Mita, A., & Hagiwara, H. (2003). Damage diagnosis of a building structure using support vector machine and modal frequency patterns. Proceedings of SPIE, 5057, San Diego, CA.

Ni, Y., Wang, B., & Ko, J. (2002). Constructing input vectors to neural networks for structural damage identification. Smart Materials and Structures, 11, 825-833.

Patsias, S., & Staszewski, W.J. (2002). Damage detection using optical measurements and wavelets. Structural Health Monitoring, 1(1), 5-22.

Sandhu, S., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2001). Damage prediction and estimation in structural mechanics based on data mining. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining/Fourth Workshop on Mining Scientific Datasets, San Francisco, California.

Sandhu, S., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2002). A sub-structuring approach via data mining for damage prediction and estimation in complex structures. Proceedings of the SIAM International Conference on Data Mining, Arlington, Virginia.

Sekhar, S. (2003). Identification of a crack in rotor system using a model-based wavelet approach. Structural Health Monitoring, 2(4), 293-308.

Wang, W. (2003). An evaluation of some emerging techniques for gear fault detection. Structural Health Monitoring, 2(3), 225-242.

Worden, K., Manson, G., & Fieller, N. (2000). Damage detection using outlier analysis. Journal of Sound and Vibration, 229(3), 647-667.

Yun, C., & Bahng, E.Y. (2000). Sub-structural identification using neural networks. Computers & Structures, 77, 41-52.

Zang, C., Friswell, M.I., & Imregun, M. (2004). Structural damage detection using independent component analysis. Structural Health Monitoring, 3(1), 69-83.

Zhao, J., Ivan, J., & DeWolf, J. (1998). Structural damage detection using artificial neural networks. Journal of Infrastructure Systems, 4(3), 93-101.
KEY TERMS

Boundary Element Method: Numerical method to solve differential equations with boundary/initial conditions over the surface of a domain.

Finite Element Method: Numerical method to solve differential equations with boundary/initial conditions over a domain.

Modal Properties: Natural frequencies, mode shapes, and mode curvatures constitute the modal properties.

Mode Shapes: Eigenvectors associated with the natural frequencies of the structure.

Natural Frequency: Eigenvalues of the mass and stiffness matrix system of the structure.

Smart Structure: A structure with a structurally integrated fiber optic sensing system.

Structures and Structural System: The word structure has been employed loosely in this paper. The structure refers to the continuum material, whereas the structural system consists of structures that are connected at joints.
Data Mining for Intrusion Detection

Aleksandar Lazarevic
University of Minnesota, USA

INTRODUCTION

Today computers control power, oil and gas delivery, communication systems, transportation networks, banking and financial services, and various other infrastructure services critical to the functioning of our society. However, as the cost of information processing and Internet accessibility falls, more and more organizations are becoming vulnerable to a wide variety of cyber threats. According to a recent survey by CERT/CC (Computer Emergency Response Team/Coordination Center), the rate of cyber attacks has been more than doubling every year in recent times (Figure 1). In addition, the severity and sophistication of the attacks are also growing. For example, the Slammer/Sapphire Worm was the fastest computer worm in history. As it began spreading throughout the Internet, it doubled in size every 8.5 seconds and infected at least 75,000 hosts, causing network outages and unforeseen consequences such as canceled airline flights, interference with elections, and ATM failures (Moore, 2003).

It has become increasingly important to make our information systems, especially those used for critical functions in the military and commercial sectors, resistant to and tolerant of such attacks. The conventional approach for securing computer systems is to design security mechanisms, such as firewalls, authentication mechanisms, and Virtual Private Networks (VPN), that create a protective shield around them. However, such security mechanisms almost always have inevitable vulnerabilities and they are usually not sufficient to ensure complete security of the infrastructure and to ward off attacks that are continually being adapted to exploit the systems' weaknesses, often caused by careless design and implementation flaws. This has created the need for security technology that can monitor systems and identify computer attacks. This component is called intrusion detection and is complementary to conventional security mechanisms. This article provides an overview of the current status of research in intrusion detection based on data mining.

[Figure 1. Growth rate of cyber incidents reported to Computer Emergency Response Team/Coordination Center (CERT/CC): number of cyber incidents (0 to 120,000) per year, 1990-2003]

BACKGROUND

Intrusion detection includes identifying a set of malicious actions that compromise the integrity, confidentiality, and availability of information resources. An Intrusion Detection System (IDS) can be defined as a combination of software and/or hardware components that monitors computer systems and raises an alarm when an intrusion happens.

Traditional intrusion detection systems are based on extensive knowledge of signatures of known attacks. However, the signature database has to be manually revised for each new type of intrusion that is discovered. In addition, signature-based methods cannot detect emerging cyber threats, since by their very nature these threats are launched using previously unknown attacks. Finally, very often there is substantial latency in deployment of newly created signatures. All these limitations have led to an increasing interest in intrusion detection techniques based upon data mining.

The tremendous increase of novel cyber attacks has made data mining based intrusion detection techniques extremely useful in their detection. Data mining techniques for intrusion detection generally fall into one of three categories: misuse detection, anomaly detection, and summarization of monitored data.

MAIN THRUST
Before applying data mining techniques to the problem of intrusion detection, the data has to be collected. Different types of data can be collected about information systems (e.g., tcpdump and netflow data for network intrusion detection, syslogs or system calls for host intrusion detection). However, such collected data is often available in a raw format and needs to be processed in order to be used in data mining techniques. For example, in the MADAM ID project (Lee, 2000, 2001) at Columbia University, association rules and frequent episodes were extracted from network connection records to construct three groups of features: (i) content-based features that describe intrinsic characteristics of a network connection (e.g., number of packets, acknowledgments, data bytes from source to destination), (ii) time-based traffic features that compute the number of connections in some recent time interval (e.g., last few seconds), and (iii) connection-based features that compute the number of connections from a specific source to a specific destination in the last N connections (e.g., N = 1000).
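A time-based traffic feature of the kind just described can be computed with a few lines of code. The sketch below is illustrative only (the connection records are hypothetical); it counts, for each connection, how many connections the same source IP has made in the preceding two seconds:

```python
import pandas as pd

# Hypothetical connection records; "time" is seconds since the start of capture.
conns = pd.DataFrame({
    "time":     [0.0, 0.5, 0.9, 1.2, 2.8, 3.0, 3.1],
    "src_ip":   ["10.0.0.5", "10.0.0.5", "10.0.0.7", "10.0.0.5",
                 "10.0.0.5", "10.0.0.7", "10.0.0.5"],
    "dst_port": [80, 80, 22, 443, 80, 22, 80],
})

window = 2.0  # length of the trailing time interval, in seconds
counts = []
for _, row in conns.iterrows():
    # Connections from the same source IP inside the trailing window,
    # including the current connection itself.
    mask = ((conns["src_ip"] == row["src_ip"])
            & (conns["time"] > row["time"] - window)
            & (conns["time"] <= row["time"]))
    counts.append(int(mask.sum()))

conns["conns_last_2s_same_src"] = counts
print(conns)
```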
When the feature construction step is complete, the obtained features may be used in any data mining technique.

Misuse Detection

In misuse detection based on data mining, each instance in a data set is labeled as normal or attack/intrusion and a learning algorithm is trained over the labeled data. These techniques are able to automatically retrain intrusion detection models on different input data that include new types of attacks, as long as they have been labeled appropriately. Unlike signature-based intrusion detection systems, data mining based misuse detection models are created automatically, and can be more sophisticated and precise than manually created signatures. In spite of the fact that misuse detection models have a high degree of accuracy in detecting known attacks and their variations, their obvious drawback is the inability to detect attacks whose instances have not yet been observed. In addition, labeling data instances as normal or intrusive may require enormous time for many human experts.

Since standard data mining techniques are not directly applicable to the problem of intrusion detection due to dealing with skewed class distribution (attacks/intrusions correspond to a class of interest that is much smaller, i.e., rarer, than the class representing normal behavior) and learning from data streams (attacks/intrusions very often represent a sequence of events), a number of researchers have developed specially designed data mining algorithms that are suitable for intrusion detection. Research in misuse detection has focused mainly on classification of network intrusions using various standard data mining algorithms (Barbara, 2001; Ghosh, 1999; Lee, 2001; Sinclair, 1999), rare class predictive models (Joshi, 2001), and association rules (Barbara, 2001; Lee, 2000; Manganaris, 2000).

MADAM ID (Lee, 2000, 2001) was one of the first projects that applied data mining techniques to the intrusion detection problem. In addition to the standard features that were available directly from the network traffic (e.g., duration, start time, service), three groups of constructed features were also used by the RIPPER algorithm to learn intrusion detection rules from the DARPA 1998 data set (Lippmann, 1999). Other classification algorithms that have been applied to the intrusion detection problem include standard decision trees (Bloedorn, 2001; Sinclair, 1999), modified nearest neighbor algorithms (Ye, 2001b), fuzzy association rules (Bridges, 2000), neural networks (Dao, 2002; Lippmann, 2000a), naive Bayes classifiers (Schultz, 2001), genetic algorithms (Bridges, 2000), genetic programming (Mukkamala, 2003a), and so on. Most of these approaches attempt to directly apply standard techniques to publicly available intrusion detection data sets (Lippmann, 1999, 2000b), assuming that the labels for normal and intrusive behavior are already known. Since this is not a realistic assumption, misuse detection based on data mining has not been very successful in practice.
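The following sketch shows misuse detection in this style: a classifier trained on labeled connection records, with class weighting as one simple way of dealing with the skewed class distribution mentioned above. It is illustrative only; the features, values, and labels are hypothetical rather than taken from any of the cited systems:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Hypothetical labeled connection records (content- and traffic-based features).
data = pd.DataFrame({
    "duration":        [0.1, 12.0, 0.2, 0.1, 300.0, 0.3, 0.1, 0.2, 15.0, 0.1] * 20,
    "src_bytes":       [200, 5000, 0, 150, 45000, 0, 180, 0, 3000, 160] * 20,
    "num_connections": [2, 3, 80, 1, 4, 95, 2, 70, 5, 1] * 20,
    "label": ["normal", "normal", "attack", "normal", "normal",
              "attack", "normal", "attack", "normal", "normal"] * 20,
})

X = data.drop(columns="label")
y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight="balanced" compensates for the rarity of the attack class.
model = DecisionTreeClassifier(class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```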
Anomaly Detection

Anomaly detection creates profiles of normal legitimate computer activity (e.g., normal behavior of users, hosts, or network connections) using different techniques and then uses a variety of measures to detect deviations from the defined normal behavior as potential anomalies. Anomaly detection models often learn from a set of normal (attack-free) data, but this also requires cleaning the data from attacks and labeling only normal data records. Nevertheless, other anomaly detection techniques detect anomalous behavior without using any knowledge about the training data. Such models typically assume that the data records that do not belong to the majority behavior correspond to anomalies.

The major benefit of anomaly detection algorithms is their ability to potentially recognize unforeseen and emerging cyber attacks. However, their major limitation is a potentially high false alarm rate, since deviations detected by anomaly detection algorithms may not necessarily represent actual attacks, but new or unusual, but still legitimate, network behavior.

Anomaly detection algorithms can be classified into several groups: (i) statistical methods; (ii) rule-based methods; (iii) distance-based methods; (iv) profiling methods; and (v) model-based approaches (Lazarevic, 2004). Although anomaly detection algorithms are quite diverse in nature, and thus may fit into more than one proposed category, most of them employ certain data mining or artificial intelligence techniques.
• Statistical methods: Statistical methods monitor the user or system behavior by measuring certain variables over time (e.g., login and logout time of each session). The basic models keep averages of these variables and detect whether thresholds are exceeded based on the standard deviation of the variable. More advanced statistical models compute profiles of long-term and short-term user activities by employing different techniques, such as the Kolmogorov-Smirnov test (Cabrera, 2000), chi-square (χ²) statistics (Ye, 2001a), probabilistic modeling (Yamanishi, 2000), and likelihood of data distributions (Eskin, 2000).

• Distance-based methods: Most statistical approaches have limitations when detecting outliers in higher dimensional spaces, since it becomes increasingly difficult and inaccurate to estimate the multidimensional distributions of the data points. Distance-based approaches attempt to overcome limitations of statistical outlier detection approaches and they detect outliers by computing distances among points. Several distance-based outlier detection algorithms that have been recently proposed for detecting anomalies in network traffic (Lazarevic, 2003) are based on computing the full dimensional distances of points from one another using all the available features, and on computing the densities of local neighborhoods. Values of categorical features are converted into the frequencies of their occurrences and they are further considered as continuous ones. MINDS (Minnesota Intrusion Detection System) (Ertoz, 2004) employs outlier detection algorithms to assign an anomaly score to each network connection. A human analyst then has to look at only the most anomalous connections to determine if they are actual attacks or other interesting behavior. Experiments on live network traffic have shown that MINDS is able to routinely detect various suspicious behavior (e.g., policy violations), worms, as well as various scanning activities.

In addition, several clustering-based techniques, such as fixed-width and canopy clustering (Eskin, 2002), have been used to detect network intrusions in the DARPA 1998 data sets as small clusters when compared to the large ones that corresponded to the normal behavior. In another interesting approach (Fan, 2001), artificial anomalies in the network intrusion detection data are generated around the edges of the sparsely populated data regions, thus forcing the learning algorithm to discover the specific boundaries that distinguish these regions from the rest of the data.

• Rule-based systems: Rule-based systems were used in earlier anomaly detection based IDSs to characterize the normal behavior of users, networks and/or computer systems by a set of rules. Examples of such rule-based IDSs include ComputerWatch (Dowell, 1990) and Wisdom & Sense (Liepins, 1992).

• Profiling methods: In profiling methods, profiles of normal behavior are built for different types of network traffic, users, programs, and so on, and deviations from them are considered as intrusions. Profiling methods vary greatly, ranging from different data mining techniques to various heuristic-based approaches.

For example, ADAM (Audit Data and Mining) (Barbara, 2001) is a hybrid anomaly detector trained on both attack-free traffic and traffic with labeled attacks. The system uses a combination of association rule mining and classification to discover novel attacks in tcpdump data by using the pseudo-Bayes estimator. The recently reported IDDM system (Abraham, 2001) represents an off-line IDS, where the intrusions are detected only when sufficient amounts of data are collected and analyzed. The IDDM system describes profiles of network data at different times, identifies any large deviations between these data descriptions, and produces alarms in such cases. PHAD (packet header anomaly detection) (Mahoney, 2002) monitors network packet headers and builds profiles for 33 different fields from these headers by observing attack-free traffic and building contiguous clusters for the values observed for each field. ALAD (application layer anomaly detection) (Mahoney, 2002) uses the same method for calculating the anomaly scores as PHAD, but it monitors TCP data and builds TCP streams when the destination port is smaller than 1024.

Finally, there have also been several recently proposed commercial products that use profiling-based anomaly detection techniques. For example, Antura from System Detection (System Detection, 2003) uses data mining based user profiling, while Mazu Profiler from Mazu Networks (Mazu Networks, 2003) and Peakflow X from Arbor Networks (Arbor Networks, 2003) use rate-based and connection profiling anomaly detection schemes.
• Model-based approaches: Many researchers have used different types of data mining models, such as replicator neural networks (Hawkins, 2002) or unsupervised support vector machines (Eskin, 2002; Lazarevic, 2003), to characterize the normal behavior of the monitored system. In the model-based approaches, anomalies are detected as deviations from the model that represents the normal behavior. A replicator four-layer feed-forward neural network (RNN) reconstructs input variables at the output layer during the training phase, and then uses the reconstruction error of individual data points as a measure of outlyingness, while unsupervised support vector machines attempt to find a small region where most of the data lies, label these data points as normal behavior, and then detect deviations from the learned models as potential intrusions. In addition, standard neural networks (NN) were also used in intrusion detection problems to learn a normal profile. For example, NNs were often used to model the normal behavior of individual users (Ryan, 1997), to build profiles of software behavior (Ghosh, 1999), or to profile network packets and queue statistics (Lee, 2000).
profile network packets and queue statistics (Lee, Abraham, T. (2001). IDDM: Intrusion detection using
2000). data mining techniques. Australia Technical Report
DSTO-GD-0286. DSTO Electronics and Surveillance
Summarization Research Laboratory, Department of Defense.

Summarization techniques use frequent itemsets or as- Arbor Networks. (2003). Intelligent network manage-
sociation rules to characterize normal and anomalous ment with peakflow traffic. Retrieved from http://
behavior in the monitored computer systems. For ex- www.arbornetworks.com/products_sp.php
ample, association patterns generated at different times Barbara, D., Wu, N., & Jajodia, S. (2001). Detecting
were used to study significant changes in the network novel network intrusions using Bayes estimators. In
traffic characteristics at different periods of time (Lee, Proceedings of the First SIAM Conference on Data Min-
2001). Association pattern analysis has also been shown ing, Chicago, IL.
to be beneficial in constructing profiles of normal
network traffic behavior (Manganaris, 1999). MINDS Bloedorn, E., Christiansen, A., Hill, W., Skorupka, C.,
(Ertoz, 2004) uses association patterns to provide high- Talbot, L., & Tivel, J. (2001). Data mining for network
level summary of network connections that are ranked intrusion detection: How to get started. MITRE Tech-
highly anomalous in the anomaly detection module. nical Report. Retrieved from www.mitre.org/work/
These summaries allow a human analyst to examine a tech_papers/tech_papers_01/bloedorn_datamining
large number of anomalous connections quickly and to
provide templates from which signatures of novel at- Bridges, S., & Vaughn, R. (2000). Fuzzy data mining and
tacks can be built for augmenting the database of signa- genetic algorithms applied to intrusion detection. In
ture-based intrusion detection systems. Proceedings of the 23rd National Information Sys-
tems Security Conference, Baltimore, MD.
Cabrera, J., Ravichandran, B., & Mehra, R. (2000). Statis-
FUTURE TRENDS tical traffic modeling for network intrusion detection. In
The Proceedings of 8th International Symposium on
Intrusion detection techniques have improved dramati- Modeling, Analysis and Simulation of Computer and
cally over time, especially in the past few years. IDS Telecommunication System.
technology is developing rapidly and its near-term fu-
ture is very promising. Data mining techniques for Dao, V., & Vemuri, R. (2002). Computer network intrusion
intrusion detection increasingly become an indispens- detection: A comparison of neural networks methods.
able and integral component of any comprehensive en- Differential Equations and Dynamical Systems, Special
terprise security program, since they successfully Issue on Neural Networks.
complement traditional security mechanisms.

254

TEAM LinG
Data Mining for Intrusion Detection

Dowell, C., & Ramstedt, P. (1990). The Computerwatch Lee, W., & Stolfo, S.J. (2000). A framework for construct-
data reduction tool. In Proceedings of the 13th Na- ing features and models for intrusion detection systems. ,
tional Computer Security Conference, Washington, DC. ACM Transactions on Information and System Security,
3(4), 227-261.
Ertoz, L., Eilertson, E., Lazarevic, A., Tan, P., Srivastava,
J., Kumar, V., & Dokas, P. (2004). The MINDS: Minne- Lee, W., Stolfo, S.J., & Mok, K. (2001). Adaptive intrusion
sota intrusion detection system. In A. Joshi, H. Kargupta, detection: A data mining approach. Artificial Intelli-
K. Sivakumar, & Y. Yesha (Eds.), Next generation data gence Review, 14, 533-567.
mining. Boston: Kluwer Academic Publishers.
Liepins, G., & Vaccaro, H. (1992). Intrusion detection: Its role
Eskin, E. (2000). Anomaly detection over noisy data using and validation. Computers and Security, 347-355.
learned probability distributions. In Proceedings of the
Data Mining for Intrusion Detection

KEY TERMS

Anomaly Detection: Analysis strategy that identifies intrusions as unusual behavior that differs from the normal behavior of the monitored system.

Intrusion: Malicious, externally induced, operational fault in the computer system.

Intrusion Detection: Identifying a set of malicious actions that compromise the integrity, confidentiality, and availability of information resources.

Misuse Detection: Analysis strategy that looks for events or sets of events that match a predefined pattern of a known attack.

Signature-Based Intrusion Detection: Analysis strategy where monitored events are matched against a database of attack signatures to detect intrusions.

Tcpdump: Computer network debugging and security tool that allows the user to intercept and display TCP/IP packets being transmitted over a network to which the computer is attached.

Worms: Self-replicating programs that aggressively spread through a network by taking advantage of automatic packet sending and receiving features found on many computers.
Data Mining in Diabetes Diagnosis and Detection

Indranil Bose
The University of Hong Kong, Hong Kong
INTRODUCTION

Diabetes is a disease worrying hundreds of millions of people around the world. In the USA, the population of diabetic patients is about 15.7 million (Breault et al., 2002). It is reported that the direct and indirect cost of diabetes in the USA is $132 billion (Diabetes Facts, 2004). Since there is no method that is able to eradicate diabetes, doctors are striving for ways to fight the disease. Researchers are trying to link the cause of diabetes with patients' lifestyles, inheritance information, age, and so forth in order to get to the root of the problem. Due to the prevalence of a large number of responsible factors and the availability of historical data, data mining tools have been used to generate inference rules on the cause and effect of diabetes as well as to help in knowledge discovery in this area. The goal of this chapter is to explain the different steps involved in mining diabetes data and to show, using case studies, how data mining has been carried out for detection and diagnosis of diabetes in Hong Kong, USA, Poland, and Singapore.

BACKGROUND

Diabetes is a severe metabolic disorder marked by high blood glucose level, excessive urination, and persistent thirst, caused by lack of insulin actions. There are usually three forms of diabetes: Type 1, Type 2, and gestational. It is believed that diabetes is a particularly opportune disease for data mining technology for a number of reasons (Breault, 2001):

• There are many diabetic databases with historic patient information.
• New knowledge about treatment of diabetes can help save money.
• Diabetes can produce terrible complications like blindness, kidney failure, and so forth, so physicians need to know how to identify potential cases quickly.

The availability of historical data on diabetes naturally leads to the application of data mining techniques to discover interesting patterns (Apte et al., 2002; Hsu et al., 2000). The objective is to find rules that help to understand diabetes, facilitate early detection of diabetes, and discover how diabetes may be associated with different segments of the population.

The data mining process for diagnosis of diabetes can be divided into five steps, though the underlying principles and techniques used for data mining diabetic databases may differ for different projects in different countries. Following is a brief description of the five steps.

Step 1: Data Cleaning

Before carrying out data mining on the diabetic patient database, the data should be cleaned. Errors such as missing values, typographical errors, or wrong information are contained in the patient records, and, worse still, many records are duplicate records. Two approaches can be used to clean the data, namely the standardized format schema and the sorted neighborhood method (Hsu et al., 2000). By generating the standardized format schema, a user defines mappings among attributes in different formats, and each of the database files is modified into this standardized format. The next task is to look for duplicate records that have to be removed from the standardized format using the sorted neighborhood method. Under this scheme, the database is sorted on one or more fields that uniquely identify each record, and the chosen fields of the records are compared within a sliding window. When duplicates are detected, the user is called in to verify them, and the duplicates are removed from the database.
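The sorted neighborhood idea can be illustrated with a short sketch. The snippet below is only a toy illustration of the comparison step, assuming a pandas DataFrame and made-up field names (patient_id, last_name, dob); it is not the cleaning system used in any of the cited projects.

```python
# Sketch of the sorted neighborhood method for flagging candidate duplicates.
# Column names and the exact-match rule are illustrative assumptions only.
import pandas as pd

def candidate_duplicates(df, key_cols, window=5):
    """Sort on the key fields, then compare each record with its neighbors
    inside a sliding window; matching pairs are returned for manual review."""
    sorted_df = df.sort_values(key_cols).reset_index(drop=True)
    pairs = []
    for i in range(len(sorted_df)):
        for j in range(i + 1, min(i + window, len(sorted_df))):
            a, b = sorted_df.iloc[i], sorted_df.iloc[j]
            if all(a[c] == b[c] for c in key_cols):  # simple exact-match rule
                pairs.append((a["patient_id"], b["patient_id"]))
    return pairs

records = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "last_name":  ["Chan", "Chan", "Lee"],
    "dob":        ["1950-03-01", "1950-03-01", "1962-07-15"],
})
print(candidate_duplicates(records, ["last_name", "dob"]))  # [(101, 102)]
```

In practice the matching rule would usually be fuzzier (for example, edit distance on names), and every flagged pair would still be verified by a user before removal, as described above.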

Step 2: Data Preparation

The importance of the data preparation step cannot be overstated, since the success of data analysis depends on it (Breault et al., 2002). A fundamental issue is whether to use a relational database comprising multiple tables or a flat file that is best suited for data mining (Breault, 2002). Generally, either of the following two methods is adopted. In the first method, a set of flat files is used to represent all the data, then data mining tools are used separately on each file, and the results obtained from the flat files are linked together in some fashion. An alternative method is to find a data reduction technique that allows fields in a complex database to be transformed into vector scores instead of considering them separately. For example, for an Australian study, the 26 items recorded for a patient from a diabetes database were converted to a four-component vector <HgbA1c, Eye, Lipids, Microalbumin> by assigning each item to a component with an associated weight (determined by domain experts) that indicated how strongly the item related to the vector, and significant data reduction was obtained.
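As a concrete illustration of the vector-score idea, the sketch below collapses raw per-patient items into a small weighted component vector. The item names and weights are invented for demonstration and are not those chosen by the domain experts in the Australian study.

```python
# Illustrative reduction of many recorded items into a weighted vector
# <HgbA1c, Eye, Lipids, Microalbumin>; items and weights are made up.
item_weights = {
    "HgbA1c":       {"hba1c_test_done": 0.7, "hba1c_value_recorded": 0.3},
    "Eye":          {"retinal_exam": 1.0},
    "Lipids":       {"cholesterol_test": 0.6, "triglyceride_test": 0.4},
    "Microalbumin": {"urine_albumin_test": 1.0},
}

def to_vector(patient_items):
    """Collapse raw item indicators (0/1) into one score per component."""
    return {component: sum(w * patient_items.get(item, 0) for item, w in items.items())
            for component, items in item_weights.items()}

patient = {"hba1c_test_done": 1, "retinal_exam": 0, "cholesterol_test": 1}
print(to_vector(patient))
# {'HgbA1c': 0.7, 'Eye': 0.0, 'Lipids': 0.6, 'Microalbumin': 0.0}
```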
Step 3: Data Analysis

Data mining tools are used to convert the data stored in databases into useful knowledge. Different software programs using different models are run on the same data to provide different results. Sometimes, the result may be unexpected, but many of the rules and causal relationships discovered can conform to the trends. A software package that is commonly used for data mining is Classification and Regression Tree (CART). CART recursively partitions the input variable space to maximize purity in the terminal tree nodes (Breault et al., 2002).
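The recursive partitioning that CART performs can be sketched with any decision-tree learner. The fragment below uses scikit-learn's DecisionTreeClassifier on a tiny fabricated table (age, family history, BMI); it is meant only to show the mechanics, not to reproduce any of the studies discussed later.

```python
# Minimal CART-style classification sketch; feature names and the in-line
# data are fabricated for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[45, 1, 31.2],   # age, family_history (0/1), bmi
     [62, 0, 24.5],
     [38, 1, 29.8],
     [55, 0, 22.1]]
y = [1, 0, 1, 0]      # 1 = diabetic, 0 = non-diabetic

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "family_history", "bmi"]))
```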
Step 4: Knowledge Evaluation

The rules obtained from data mining may not be meaningful or true. The experts have to evaluate the knowledge before using it. For example, multiple random samples should be used to evaluate the data mining tools used to ensure that results are not just by chance (Breault, 2001). For diabetes, similar data mining studies in other geographic and cultural locations are needed to prove any results that are suspected to be specific to one region or segment of population.

Step 5: Knowledge Usage

After knowledge is extracted as a result of data mining, the developers have to address the issue of cost effectiveness of applying that knowledge for detection of diabetes. As a result of Step 4, the data miners may obtain a costly solution that can help a small group of people. However, this may not prove helpful for diagnosing diabetes for the entire population. It is the effective usage of the knowledge gathered from data mining for real-life application that will mark the success of the endeavor.

MAIN THRUST

Hong Kong Case: Diabetes Registry and Statistical Analysis

In Hong Kong, around 10% of the population is suffering from diabetes, of which 5% belong to Type 1 and 95% belong to the Type 2 form of diabetes. The prevalence of Type 2 diabetes is increasing at an alarming rate among Chinese, and its development is believed to involve the interplay between genetic and environmental factors. In view of this, the major hospitals have established their own diabetes registry for knowledge discovery. For instance, the diabetes clinic at the Prince of Wales Hospital of Hong Kong has adopted a structured diabetes care protocol and has created a diabetes registry based on a modified Europe DIAB-CARE format (Apte et al., 2002) since 1995.

In September 2000, a diabetes data mining study was conducted to investigate the patterns of diabetes and their relationships with clinical characteristics in Hong Kong Chinese patients with late-onset (over age 34) Type 2 diabetes (Lee et al., 2000). This study involved 2,310 patients selected from a hospital clinic-based diabetes registry. A statistical analysis tool, the Statistical Package for Social Sciences (SPSS), was used for conducting the t-test, Mann-Whitney U test, analysis of variance, and chi-square test. Many useful results were generated, which can be found in the two tables on page 1366 of the paper by Lee, et al. (2000). For example, it was found that the patients, irrespective of their sex, were more likely to have a diabetic mother than a diabetic father. Also, female patients with a diabetic mother were found to have higher levels of plasma total cholesterol compared to those having a diabetic father. In two-group comparisons, there was also evidence that the male patients with a diabetic father had higher body mass index (BMI) values than the male patients with a diabetic mother. It also was shown that both maternal and paternal factors may be responsible for the development of Type 2 diabetes in the Chinese population. All these rules greatly improved the physicians' understanding of diabetes.
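Tests of association like those run in SPSS can be reproduced in a few lines. The sketch below applies a chi-square test to a small contingency table of parental diabetes history by patient sex; the counts are invented and do not come from the Lee et al. (2000) study.

```python
# Chi-square test of association on a 2x2 contingency table; counts are invented.
from scipy.stats import chi2_contingency

#            diabetic mother   diabetic father
table = [[620, 310],   # female patients
         [540, 360]]   # male patients

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p:.4f}")
```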
United States Case: Classification and Regression Trees (CART)

Diabetes is a major health problem in the United States. According to current statistics, about 5.9% of the population, or 16 million people in the USA, have diabetes. It is estimated that the percentage will rise to 8.9% by 2025. In the United States, there is a long history of diabetic registries and databases with systematically collected patient information.
A large-scale research effort was carried out by applying data mining techniques to a diabetic data warehouse from an integrated health care system in the New Orleans area with 30,383 diabetic patients (Breault et al., 2002).

To prepare the data tables, structured query language (SQL) statements were executed on the data warehouse to form the flat file as input to data mining software. To do data mining, CART software was used with a binary target variable and ten predictors: age, sex, emergency department visits, office visits, comorbidity index, dyslipidemia, hypertension, cardiovascular disease, retinopathy, and end-stage renal disease. The outcome showed that the most important variable associated with bad glycemic control is younger age, not the comorbidity index or whether patients have related diseases. The total classification error (40.5%) was substantial.

PIMA Indian Case: Machine Learning Mining Approach

Another special group in the USA that deserves a separate mention because of the importance of the findings is the PIMA Indian case. With a very high rate of diabetes for Pima Indians, the Pima Indian Diabetes Database (PIDD) with 768 diabetes patients has been established by the National Institutes of Health.

There have been many studies applying data mining techniques to the PIDD. Some well-known examples of data mining techniques used include the multi-stream dependency detection (MSDD) algorithm, Bayesian neural networks, and multiplier-free feed-forward neural networks. Although the cited examples use somewhat different subgroups of the PIDD, accuracy for predicting diabetes ranges from 66% to 81%. It is interesting to see that with a wide variety of prediction tools available, efficiency and accuracy can be improved greatly in diabetes data mining. Recently, several modified data mining techniques have been used successfully on the same database. This includes the usage of decision trees with an augmented splitting criterion (Buja & Lee, 2001). The use of the fuzzy Naïve Bayes method yielded a best-case accuracy of 76.95% (Tang et al., 2002). The application of shunting inhibitory neural networks, where the neurons can act as adaptive non-linear filters, resulted in an accuracy of over 80% and performed better than multi-layer perceptrons, in general (Arulampalam & Bouzerdoum, 2001).
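A typical experiment on the PIDD can be sketched in a few lines. The fragment below fits a Gaussian Naïve Bayes classifier to a held-out split; the CSV file name and the assumption of eight predictor columns followed by an outcome column reflect the common public distribution of the dataset, not the exact setups of the cited studies.

```python
# Sketch of a train/test experiment on the Pima Indian Diabetes Database.
# File name and column layout are assumptions about the usual CSV version.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

pidd = pd.read_csv("pima-indians-diabetes.csv")      # hypothetical local copy
X, y = pidd.iloc[:, :-1], pidd.iloc[:, -1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)
print("hold-out accuracy:", accuracy_score(y_te, model.predict(X_te)))
```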
Poland Case: Rough Sets Technique

In Poland, more than 1.5 million people suffer from diabetes, and more than 41,000 have Type 1 diabetes (Medtronic Limited, 2002). Rough sets have been applied to mine data in a Polish diabetes database using the ROSETTA software (Øhrn, 1997). The rough sets approach investigates structural relationships in the data rather than probability distributions and produces decision tables rather than trees.

Figure 1. Image of ROSETTA software on training and test data sets

A recent study from a Polish medical school used a dataset of 107 patients aged five to 22 who were suffering from insulin-dependent diabetes. In this study, it was found that the minimal subsets of attributes that are efficient for rule making included age of disease diagnosis, microalbuminuria (yes/no), and disease duration in years. With the use of rough sets techniques, decision rules were generated to predict microalbuminuria. The best predictor was age <7, predicting no microalbuminuria at 83.3% accuracy.

Singapore Case: Data Cleansing and Interaction With Domain Experts

Around 10% of the population in Singapore is diabetic. In the diabetes data mining exercise conducted in Singapore, the objective was to find rules that could be used by physicians to understand more about diabetes and to find special patterns about a particular patient population (Hsu et al., 2000). In order to deal with noisy data, a semi-automatic data cleaning system was utilized to reconcile format differences among tables with user mapping input. A sorted neighborhood method was used to remove duplicate records.

In a particular case at the National University of Singapore, the researchers mined the data via a classification with association rule tool, using thresholds of 1% support and 50% confidence. They generated 700 rules. The physicians were overwhelmed with the large number and also wanted to know causal connections rather than associations. It was realized that post-processing was needed in order to make the results usable for the physicians.

The tree method was employed to generate general rules giving the underlying trends in the data that the physicians already knew and exception rules giving deviations to these trends. The physicians found the exception rules especially helpful in understanding how sub-population trends differ from the main population. With the help of the physicians' domain knowledge, the data mining process was optimized by reducing the size of data and hypothesis space and by removing unnecessary query operations (Owrang, 2000).
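Mining association rules with explicit support and confidence thresholds, as in the Singapore exercise, can be sketched with the mlxtend library (assuming its apriori/association_rules interface). The transactions below are invented toy records, and the 1% support / 50% confidence thresholds simply mirror the figures quoted above.

```python
# Association-rule sketch with support/confidence thresholds; data is invented.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["obese", "hypertension", "type2_diabetes"],
                ["obese", "type2_diabetes"],
                ["smoker", "hypertension"],
                ["obese", "hypertension", "type2_diabetes"]]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

frequent = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

In practice the resulting rule set would then be post-processed, for example into general and exception rules, before being shown to physicians.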
FUTURE TRENDS

An important step in using data mining for diagnosing diabetes is the creation of a centralized database that can store most diabetic patients' health data. The larger the data pool, the more accurate the results from data mining are likely to be. The experiences and the many challenges faced in building a data warehouse for a not-for-profit organization called Christiana Care in the state of Delaware, USA, are described by Ewen, et al., and it is reported that the creation of the data warehouse was responsible for a gain in operating revenue in 1997 (Ewen et al., 1998). Though the task is not easy, for successful data mining, it is almost imperative to create data warehouses for sharing of diabetes-related information among physicians across the world. The major obstacle is that only a small number of diabetic patients in this world have access to shared care, while many others remain undiagnosed, untreated, or suboptimally treated (Chan, 2000). This is a major challenge that has to be overcome in the future.

The case studies clearly indicate that there is no single method that proves to be effective for mining diabetes databases. Many tools and techniques, such as statistical methods, machine learning algorithms, neural networks, Bayesian neural sets, and rough sets, have been employed to study and figure out useful rules and patterns to help physicians combat the deadly disease. There is further need in the future to use other techniques like case-based reasoning for assessment of risk of complications for individual diabetes patients (Armengol et al., 2004; Montani et al., 2003) or hybrid techniques like rule induction, using simulated annealing for discovering associations between observations made of patients on their first visit and early mortality (Richards et al., 2001).

Moreover, besides doing data mining in diabetes databases, it is also important that the same work be carried out for studying the complications of diabetes and understanding its relationship to other diseases. By applying different data-mining techniques and tools, more knowledge about associations between diabetes and other diseases has to be dug out in order to improve public health in the future. Though Type 1 and Type 2 diabetes have been studied frequently, more effort is needed in the future to study the diagnosis and detection of gestational diabetes.

CONCLUSION

The occurrence of diabetes is increasing at an alarming rate all over the world, and its development is believed to involve the interplay among many unknown and mysterious reasons such as genetic and environmental factors. In view of this, data-mining technology can play a very important role in analyzing existing diabetes databases and identifying useful rules that help diabetes prevention and control.

It is worthwhile to recognize that successful application of data-mining technologies to diabetes prevention and control requires: (1) preparing a comprehensive diabetes database for input into data mining software to avoid garbage in and garbage out; this will require data cleansing and transformations from a relational data warehouse to a data mining data table that is useable by data mining tools; (2) selecting and skillfully applying the appropriate data-mining software and techniques; (3) intelligently sifting through the software output to prioritize the areas that will provide the most cost savings or outcomes improvement; and (4) interacting with domain experts to select the best-fit rules and patterns that optimize the efficiency and effectiveness of the data-mining process. In this paper, we have studied how the above steps are applied for mining diabetes databases in general, and, in particular, we have discussed the various methods adopted by researchers in different countries for diagnosis of diabetes.

REFERENCES

Apte, C., Liu, B., Pednault, E.P.D., & Smyth, P. (2002). Business applications of data mining. Communications of the ACM, 45(8), 49-53.

Armengol, E., Palaudaries, A., & Plaza, E. (2004). Individual prognosis of diabetes long-term risks: A CBR approach. Technical Report IIIA.

Arulampalam, G., & Bouzerdoum, A. (2001). Application of shunting inhibitory artificial neural networks to medical diagnosis. Proceedings of the Seventh Australian and New Zealand Intelligent Information Systems Conference.
Breault, J.L. (2001). Data mining diabetic databases: Are rough sets a useful addition? [Electronic version]. Computing Science and Statistics, 33.

Breault, J.L. (2002). Mathematical challenges of variable transformations in data mining diabetic data warehouses. Retrieved from http://www.ipam.ucla.edu/publications/sdm2002/sdm2002_jbreault_poster.pdf

Breault, J.L., Goodall, C.R., & Fos, P.J. (2002). Data mining a diabetic data warehouse. Artificial Intelligence in Medicine, 26, 37-54.

Buja, A., & Lee, Y-S. (2001). Data mining criteria for tree-based regression and classification. San Francisco, CA: KDD 2001.

Chan, J.C.N. (2000). Heterogeneity of diabetes mellitus in the Hong Kong Chinese population. Hong Kong Medical Journal, 6(1), 77-84.

Diabetes Facts. (2004). Retrieved from http://diabetes.mdmercy.com/about_diabetes/facts.html

Ewen, E.F. et al. (1998). Data warehousing in an integrated health system: Building the business case. Washington, D.C.: DOLAP 98.

Hsu, W., Lee, M.L., Liu, B., & Ling, T.W. (2000). Exploration mining in diabetic patients databases: Findings and conclusions. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, Massachusetts.

Lee, S.C. et al. (2000). Diabetes in Hong Kong Chinese: Evidence for familial clustering and parental effects. Diabetes Care, 23, 1365-1368.

Medtronic Limited. (2002). Diabetes overview. Retrieved from http://www.medtronic.com/UK/health/diabetes/diabetes_overview.html

Montani, S. et al. (2003). Integrating model-based decision support in a multi-modal reasoning systems for managing type 1 diabetic patients. Artificial Intelligence in Medicine, 29, 131-151.

Øhrn, A., & Komorowski, J. (1997). ROSETTA: A rough set toolkit for analysis of data. Proceedings of the Joint Conference of Information Sciences, Durham, North Carolina.

Owrang, M.M. (2000). Using domain knowledge to optimize the knowledge discovery process in databases. International Journal of Intelligent Systems, 15, 45-60.

Richards, G., Rayward-Smith, V.J., Sonksen, P.H., Carey, S., & Weng, C. (2001). Data mining for indicators of early mortality in a database of clinical records. Artificial Intelligence in Medicine, 22, 215-231.

Tang, Y., Pan, W., Qiu, X., & Xu, Y. (2002). The identification of fuzzy weighted classification system incorporated with fuzzy Naïve Bayes from data. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics.

KEY TERMS

Body Mass Index (BMI): A measure of mass of an individual that is calculated as weight divided by height squared.

Data Warehouse: A database, frequently very large, that can access all of a company's information. It contains data about how the warehouse is organized, where the information can be found, and any connections between existing data.

Gestational Diabetes: This form of diabetes develops in 2% to 5% of all pregnancies but disappears when a pregnancy is over.

Neural Networks: Computer processors or software based on the human brain's mesh-like neuron structure. Neural networks can learn to recognize patterns and programs to solve related problems on their own.

Rough Sets: A method of representation of uncertainty in the membership of a set. It is related to fuzzy sets and is a popular data mining technique in medicine and finance.

Type 1 Diabetes: This insulin-dependent diabetes mellitus includes risk factors that are less well defined. Autoimmune, genetic, and environmental factors are involved in the development of this type of diabetes.

Type 2 Diabetes: This non-insulin-dependent diabetes mellitus accounts for about 90% to 95% of all diagnosed cases of diabetes. Risk factors include older age, obesity, family history, prior history of gestational diabetes, impaired glucose tolerance, physical inactivity, and race/ethnicity.
Data Mining in Human Resources


Marvin D. Troutt
Kent State University, USA

Lori K. Long
Kent State University, USA

INTRODUCTION

In this paper, we briefly review and update our earlier work (Long & Troutt, 2003) on the topic of data mining in the human resources area. To gain efficiency, many organizations have turned to technology to automate many HR processes (Hendrickson, 2003). As a result of this automation, HR professionals are able to make more informed strategic HR decisions (Bussler & Davis, 2002). While HR professionals may no longer need to manage the manual processing of data, they should not abandon their ties to data collected on and about the organization's employees. Using HR data in decision-making provides a firm with the opportunity to make more informed strategic decisions. If a firm can extract useful or unique information on the behavior and potential of their people from HR data, they can contribute to the firm's strategic planning process. The challenge is identifying useful information in vast human resources databases that are the result of the automation of HR-related transaction processing.

Data mining is essentially the extracting of knowledge based on patterns of data in very large databases and is an analytical technique that may become a valuable tool for HR professionals. Organizations that employ thousands of employees and track employment-related information might find valuable information patterns contained within their databases to provide insights in such areas as employee retention and compensation planning. To develop an understanding of the potential of data mining HR information in a firm, we will identify opportunities as well as concerns in applying data mining techniques to HR Information Systems.

BACKGROUND

In this section, we review existing work on the topic. The human resource information systems (HRIS) of most organizations today feature relational database systems that allow data to be stored in separate files that can be linked by common elements such as name or identification number. The relational database provides organizations with the ability to keep a virtually limitless amount of data on employees. It also allows organizations to access the data in a variety of ways. For example, a firm can retrieve data on a particular employee, or they can retrieve data on a certain group of employees through conducting a search based on a specific parameter such as job classification. The development of relational databases in organizations, along with advances in storage technology, has resulted in organizations collecting a large amount of data on employees.

While these calculations are helpful to quantify the value of some HR practices, the bottom-line impact of HR practices is not always so clear. One can evaluate the cost per hire, but does that information provide any reference to the value of that hire? Should a greater value be assessed to an employee who stays with the organization for an extended period of time? Most data analysis retrieved from HRIS does not provide an opportunity to seek out additional relationships beyond those that the system was originally designed to identify.

Traditional data analysis methods often involve manual work and interpretation of data that is slow, expensive and highly subjective (Fayyad, Piatsky-Shapiro, & Smyth, 1996a). For example, if an HR professional is interested in analyzing the cost of turnover, they might have to extract data from several different sources such as accounting records, termination reports and personnel hiring records. That data is then combined, reconciled and evaluated. This process creates many opportunities for errors. As business databases have grown in size, the traditional approach has grown more impractical.

Data mining has been used successfully in many functional areas such as finance and marketing. HRIS applications in many organizations provide an as yet unexplored opportunity to apply data mining techniques (Patterson & Lindsay, 2003). While most applications provide opportunities to generate ad-hoc or standardized reports from specific sets of data, the relationships between the data sets are rarely explored. It is this type of relationship that data mining seeks to discover.

The blind application of data mining techniques can easily lead to the discovery of meaningless and invalid patterns. If one searches long enough in any data set, it is likely possible to find patterns that appear to hold but are not necessarily statistically significant or useful (Fayyad et al., 1996a).
There has not been any specific exploration of applying these techniques to human resource applications; however, there are some guidelines in the process that are transferable to an HRIS. Feelders, Daniels and Holsheimer (2000) outline six important steps in the data mining process: (1) problem definition, (2) acquisition of background knowledge, (3) selection of data, (4) pre-processing of data, (5) analysis and interpretation, and (6) reporting and use. At each of these steps, we will look at important considerations as they relate to data mining human resources databases. Further, we will examine some specific legal and ethical considerations of data mining in the HR context.

The formulation of the questions to be explored is an important aspect of the data mining process. As mentioned earlier, with enough searching or application of sufficiently many techniques, one might be able to find useless or ungeneralizable patterns in almost any set of data. Therefore, the effectiveness of a data mining project is improved through establishing some general outlines of inquiry prior to starting the project. To this extent, data mining and the more traditional statistical studies are similar. Thus, careful attention to the scientific method and sound research methods are to be followed. A widely respected source of guidelines on research methods is the book by Kerlinger and Lee (2000).

A certain level of expertise is necessary to carefully evaluate questions posed in a data mining project. Obviously, a requirement is data mining and statistical expertise, but one must also have some intimate understanding of the data that is available, along with its business context. Furthermore, some subject matter expertise is needed to determine useful questions, select relevant data and interpret results (Feelders et al., 2000). For example, a firm with interest in evaluating the success of an affirmative action program needs to understand the Equal Employment Opportunity (EEO) classification system to know what data is relevant.

Another important consideration in the process of developing a question to look at is the role of causality (Feelders et al., 2000). A subject matter expert's involvement is important in interpreting the results of the data analysis. For example, a firm might find a pattern indicating a relationship between high compensation levels and extended length of service. The question then becomes, do employees stay with the company longer because they receive high compensation? Or do employees receive higher compensation if they stay longer with the company? An expert in the area can take the relationship discovered and build upon it with additional information available in the organization to help understand the cause and effect of the specific relationship identified.

Selecting and preparing the data is the next step in the data mining process. Some organizations have independent Human Resource Information Systems that feature multiple databases that are not connected to each other. This type of system is sometimes selected to offer greater flexibility to remote organizational locations or sub-groups with unique information needs (Anthony et al., 1996). The possible inconsistency of the design of the databases could make data mining difficult when multiple databases exist. Data warehousing can prevent this problem, and an organization may need to create a data warehouse before they begin a data-mining project. The advantage gained in first developing the data warehouse or mart is that most of the data editing is effectively done in advance.

Another challenge in mining data is dealing with the issues of missing or noisy data. Data quality may be insufficient if data is collected without any specific analysis in mind (Feelders et al., 2000). This is especially true for human resource information. Typically, when HR data is collected, the purpose is some kind of administrative need such as payroll processing. The need of data for the required transaction is the only consideration in the type of data to collect. Future analysis needs and the value in the data collected are not usually considered. Missing data may also be a problem, especially if the system administrator does not have control over data input. Many organizations have taken advantage of web-based technology to allow for employee input and updating of their own data (Hendrickson, 2003). Employees may choose not to enter certain types of data, resulting in missing data. However, a data warehouse or datamart may help to prevent or systematize the handling of many of these problems.

There are many types of algorithms in use in data mining. The choice of the algorithm depends on the intended use of the extracted knowledge (Brodley, Lane, & Stough, 1999). The goals of data mining can be broken down into two main categories. Some applications seek to verify the hypothesis formulated by the user. The other main goal is the discovery or uncovering of new patterns systematically (Fayyad et al., 1996a). Within discovery, the data can be used to either predict future behavior or describe patterns in an understandable form. A complete discussion of data mining techniques is beyond the scope of this paper. However, the following techniques have the potential to be applicable for data mining of human resources information.
Clustering and classification are examples of data mining techniques borrowed from classical statistical methods that can help describe patterns in information. Clustering seeks to identify a small set of exhaustive and mutually exclusive categories to describe the data that is present (Fayyad et al., 1996a). This might be a useful application to human resource data if you were trying to identify a certain set of employees with consistent attributes. For example, an employer may want to find out what the main categories of top performers are that employees fall into, with an eye towards tailoring various programs to the groups or further study of such groups. One category may be more or less appropriate for one type of training program. A difficulty with clustering techniques is that no normative techniques are known that specify the correct number of clusters that should be formed. In addition, there exist many different logics that may be followed in forming the clusters. Therefore, the art of the analyst is critical.
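A minimal clustering sketch is shown below, grouping employees on a few numeric attributes with k-means; the attribute names, the values, and the choice of three clusters are illustrative assumptions, not a recommended HR segmentation.

```python
# Clustering sketch: group employees by numeric attributes with k-means.
# Attributes, values and k=3 are illustrative only.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

employees = pd.DataFrame({
    "tenure_years":      [1, 12, 3, 15, 2, 8],
    "performance_score": [3.2, 4.5, 3.8, 4.9, 2.9, 4.1],
    "training_hours":    [10, 40, 15, 55, 8, 30],
})
scaled = StandardScaler().fit_transform(employees)
employees["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(employees)
```

As the text notes, nothing in the algorithm dictates the right number of clusters; the analyst still has to judge whether the resulting groups are meaningful.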
Similarly, classification is a data mining technique that maps a data item into one of several pre-defined classes (Fayyad et al., 1996a). Classification may be useful in human resources to classify trends of movement through the organization for certain sets of successful employees. A company is at an advantage when recruiting if it can point out some realistic career paths for new employees. Being able to support those career paths with information reflecting employee success can make this a strong resource for those charged with hiring in an organization.

Decision Tree Analysis, also called tree or hierarchical partitioning, is a somewhat related technique but follows a very different logic and can be rendered somewhat more automatic. Here, a variable is chosen first in such a way as to maximize the difference or contrast formed by splitting the data into two groups. One group consists of all observations having a value higher than a certain value of the variable, such as the mean. Then the complement, namely those lower than that value, becomes the other group. Then each half can be subjected to successive further splits, with possibly different variables becoming important to different halves. For example, employees might first be split into two groups above and below average tenure with the firm. Then the statistics of the two groups can be compared and contrasted to gain insights about employee turnover factors. A further split of the lower tenure group, say based on gender, may help prioritize those most likely to need special programs for retention. Thus, clusters or categories can be formed by binary cuts, a kind of divide and conquer approach. In addition, the order of variables can be chosen differently to make the technique more flexible. For each group formed, summary statistics can be presented and compared. This technique is a rather pure form of data mining and can be performed in the absence of specific questions or issues. It might be applied as a way of seeking interesting questions about a very large datamart.
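The binary-cut idea from the tenure example can be sketched directly with pandas; the column names and values below are fabricated for illustration.

```python
# Sketch of a binary cut on average tenure followed by group comparisons.
# The employee records are fabricated for illustration.
import pandas as pd

hr = pd.DataFrame({
    "tenure_years": [1, 12, 3, 15, 2, 8],
    "salary":       [48000, 71000, 52000, 83000, 45000, 64000],
    "gender":       ["F", "M", "F", "M", "M", "F"],
})
hr["tenure_group"] = hr["tenure_years"].ge(hr["tenure_years"].mean()).map(
    {True: "above_avg", False: "below_avg"})

# Compare the two halves, then split the low-tenure half again by gender.
print(hr.groupby("tenure_group")["salary"].describe())
print(hr[hr["tenure_group"] == "below_avg"].groupby("gender")["salary"].mean())
```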
Regression and related models, also borrowed from classical statistics, permit estimating a linear function of independent variables that best explains or predicts a given dependent variable. Since this technique is generally well known, we will not dwell on the details here. However, data warehouses and datamarts may be so large that direct use of all available observations is impractical for regression and similar studies. Thus, random sampling may be necessary to use regression analysis. Various nonlinear regression techniques are also available in commercial statistical packages and can be used in a similar way for data mining. Recently, a new model fitting technique was proposed in Troutt, et al. (2001). In this approach the objective is to explain the highest or lowest performers, respectively, as a function of one or more independent variables.
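The sample-then-fit workflow can be sketched as below; the synthetic datamart, the 5,000-record sample size, and the chosen predictors are all illustrative assumptions.

```python
# Regression sketch: draw a random sample from a large datamart, then fit
# a linear model; the data is synthetic and the variable names are invented.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100_000
datamart = pd.DataFrame({
    "tenure_years":   rng.uniform(0, 20, n),
    "training_hours": rng.uniform(0, 60, n),
})
datamart["salary"] = (40_000 + 1_500 * datamart["tenure_years"]
                      + 200 * datamart["training_hours"]
                      + rng.normal(0, 3_000, n))

sample = datamart.sample(n=5_000, random_state=0)   # random sampling step
model = LinearRegression().fit(sample[["tenure_years", "training_hours"]],
                               sample["salary"])
print(model.intercept_, model.coef_)
```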
The final step in the process emphasizes the value of the use of the information. The information extracted must be consolidated and resolved with previous information and then shared and acted upon (Fayyad, Piatsky-Shapiro, Smyth, & Uthurusamy, 1996b). Too often, organizations go through the effort and expense of collecting and analyzing data without any idea of how to use the information retrieved. Applying data mining techniques to an HRIS can help support the justification of the investment in the system. Therefore, the firm should have some expected use for the information retrieved in the process.

One use of human resource related information is to support decision-making in the organization. The results obtained from data mining may be used for a full range of decision-making steps. It can be used to provide information to support a decision, or can be fully integrated into an end-user application (Feelders et al., 2000). For example, a firm might be able to set up decision rules regarding employees based on the results of data mining. They might be able to determine when an employee is promotion eligible or when a certain work group should be eligible for additional company benefits.

Organizational leaders must be aware of legislation concerning legal and privacy issues when making decisions about using personal data collected from individuals in organizations (Hubbard, Forcht, & Thomas, 1998). By their nature, systems that collect employee information run the risk of invading the privacy of employees by allowing access to the information to others within the organization. Although there is no explicit constitutional right to privacy, certain amendments and federal laws have relevance to this issue, as they provide protection for employees from invasion of privacy and defamation (Fisher, Schoenfeldt, & Shaw, 1999). Further, recent epidemics of identity theft create a need for organizations to monitor access to employee information (Carlson, 2004). Organizations can protect themselves from these employee concerns by having solid business reasons for any data collected from employees and ensuring access to this data is restricted.

There are also some potential legal issues if a firm uses inappropriate information extracted from data mining to make employment-related decisions. Even if a manager has an understanding of current laws, they could still face challenges as laws and regulations constantly change (Ledvinka & Scarpello, 1992). An extreme example that may violate equal opportunity laws is a decision to hire only females in a job classification because the data mining uncovered that females were consistently more successful.

One research study found that an employee's ability to authorize disclosure of personal information affected their perceptions of fairness and invasion of privacy (Eddy & Stone-Romero, 1999). Therefore, it is recommended that firms notify employees upon hire that the information they provide may be used in data analyses. Another recommendation is to establish a committee or review board to monitor any activities relating to analysis of personal information (Osborn, 1978). This committee can review any proposed research and ensure compliance with any relevant employment or privacy laws. Often the employees' reaction to the use of their information is based upon their perceptions. If there is a perception that the company is analyzing the data to take negative actions against employees, employees are more apt to object to the use. However, if the employer takes the time to notify employees and obtain their permission, the perception of negativity may be removed. Even if there may be no legal consequences, employee confidence is something that employers need to maintain.

UPDATES AND ISSUES

As of this writing, there are still surprisingly few reported industry experiences with DM in HR. Of course, the newness of the area will require a time lag. In addition, many HR groups will not yet have the requisite experts and information systems capabilities. However, there are several forces at work that should improve the situation in the near future.

First, DM for the human resources area is rapidly becoming of interest to industry associations and software vendors. The Human Resources Benchmarking Association (http://www.hrba.org/roundtable.pdf) now provides links related to DM. In fact, this association is affiliated with and linked to the Data Mining Benchmarking Association. The former offers a variety of services related to DM, such as:

• Consortium studies with costs divided
• HR benchmarking efforts
• Data collection and database access
• Benchmarking studies of important data mining processes

Similarly, Dynamic Health Strategies (DHS, http://www.dhsgroup.com/) is an association that concentrates on group health benefit issues. It combines the use of proprietary technology, audit discipline, analytical software, and bio-statistical evaluation techniques to assess and improve the quality and performance of existing health care providers, health plans, and health systems. DHS performs analysis services for self-insured corporations, universities, government entities and group health management. The process allows DHS to pinpoint specific measures that enable clients to reduce costs, reduce healthcare risk exposure and improve quality of care. Group health analysis allows the monitoring of cost, quality and utilization of health care prior to problems becoming major issues. Clients and consultants are then able to apply the findings and recommendations to realize benefits such as: targeting of specific health and wellness issues, monitoring of utilization and cost over time, assessment of effectiveness and efficiency of health and wellness initiatives, design of health plan benefits to meet specific needs, and application of risk models and cost projections for budgeting and financial strategies.

Evidently, the group health insurance benefits area is spearheading interest in DM. We may postulate several reasons for this. First, this area represents one of considerable importance in terms of financial impacts. Next, the existence of the Health Insurance Portability and Accountability Act of 1996 (HIPAA) (http://www.hipaadvisory.com/) created a requirement for administrative simplification, compliance, and reporting within healthcare entities. Third, the National Committee for Quality Assurance (NCQA) (http://www.ncqa.org/index.asp) has established a quality assurance system called the Health Plan Employer Data and Information Set (HEDIS). Information systems for reporting related to HIPAA and HEDIS serve as a ready data source for DM.

Software vendors have been developing new DM tools at an accelerating pace. At least one vendor, Lawson (http://www.bitpipe.com/), has developed a product specialized to provide HR reporting, monitoring, and decision support. The Defense Software Collaborators (DACS, http://www.dacs.dtic.mil/) website has collected links to a very large number of DM tools and vendors. Many of these purport to be designed for the nontechnical business user.
FUTURE TRENDS

The increasing interest in DM by HR-related associations will likely continue and should provide a strong impetus and guidance for individual member firms. These associations make it possible for member firms to pool their databases. Such pooled information gives members statistical leverage in that their own data may be insufficient for particular studies. However, when their particular data are viewed in the context of the larger database, Bayesian methods might be brought to bear to make better inferences that may not be reliable with smaller data sets (Bickell & Doksum, 1977).

In addition to more firm-specific experience studies, evaluation studies and comparisons are needed for the various DM tools becoming available. Potential adopters need assessments of both the effectiveness and ease of use.

CONCLUSION

The use of HR data beyond administrative purposes can provide the basis for a competitive advantage by allowing organizations to strategically analyze one of their most important assets, their employees. Organizations must be able to transform the data they have collected into useful information. Data mining provides an attractive opportunity that has not yet been adequately exploited. The application of DM techniques to HR requires organizational expertise and work to prepare the system for mining. In particular, a datamart for HR is a useful first step. With proper preparation and consideration, HR databases together with data mining create an opportunity for organizations to develop their competitive advantage through using that information for strategic decision-making. A number of current influences promise to increase the interest of firms in the application of DM to the Human Resources area. Trade association and software vendor activities should also facilitate increased acceptance and willingness to adopt DM.

REFERENCES

Bickell, P.J., & Doksum, K.A. (1977). Mathematical statistics: Basic ideas and selected topics. San Francisco: Holden Day, Inc.

Brodley, C.E., Lane, T., & Stough, T.M. (1999). Knowledge discovery and data mining. American Scientist, 87, 54-61.

Bussler, L., & Davis, E. (2002). Information systems: The quiet revolution in human resource management. Journal of Computer Information Systems, 17-20.

Carlson, L. (2004). Employers offering identity theft protection. Employee Benefit News, 18(3), 50-51.

Eddy, E.R., Stone, D.L., & Stone-Romero, E.F. (1999). The effects of information management policies on reactions to human resource information systems: An integration of privacy and procedural justice perspectives. Personnel Psychology, 52(2), 335-358.

Fayyad, U.M., Piatsky-Shapiro, G., & Smyth, P. (1996a). From data mining to knowledge discovery in databases. AI Magazine, (7), 37-54.

Fayyad, U.M., Piatsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996b). Advances in knowledge discovery and data mining. California: American Association for Artificial Intelligence.

Feelders, A., Daniels, H., & Holsheimer, M. (2000). Methodological and practical aspects of data mining. Information & Management, (37), 271-281.

Fisher, C.D., Schoenfeldt, L.F., & Shaw, J.B. (1999). Human resource management. Boston: Houghton Mifflin Company.

Hendrickson, A. (2003). Human resource information systems: Backbone technology of contemporary human resources. Journal of Labor Research, 24(3), 381-394.

Hubbard, J.C., Forcht, K.A., & Thomas, D.S. (1998). Human resource information systems: An overview of current ethical and legal issues. Journal of Business Ethics, (17), 1319-1323.

Kerlinger, F.N., & Lee, H.B. (2000). Foundations of behavioral research (3rd ed.). Orlando: Harcourt, Inc.

Ledvinka, J., & Scarpello, V.G. (1992). Federal regulation of personnel and human resource management. Belmont: Wadsworth Publishing Company.

Long, L.K., & Troutt, M.D. (2003). Data mining human resource information systems. In J. Wang (Ed.), Data mining: Opportunities and challenges (pp. 366-381). Hershey: Idea Group Publishing.

Osborn, J.L. (1978). Personal information: Privacy at the workplace. New York: AMACOM.

Patterson, B., & Lindsey, S. (2003). Mining the gold: Gain competitive advantage through HR data analysis. HR Magazine, 48(9), 131-136.

SAS Institute Inc. (2001). John Deere harvests HR records with SAS. Retrieved from http://www.sas.com/news/success/johndeere.html

Townsend, A., & Hendrickson, A. (1996). Recasting HRIS as an information resource. HR Magazine, 41(2), 91-96.
Troutt, M.D., Hu, M., Shanker, M., & Acar, W. (2003). Frontier versus ordinary regression models for data mining. In P.C. Pendharker (Ed.), Managing data mining technologies in organizations: Techniques and applications (pp. 21-31). Hershey: Idea Group Publishing.

KEY TERMS

Benchmarking: To identify the "Best in Class" of business processes, which might then be implemented or adapted for use by other businesses.

Enterprise Resource Planning System: An integrated software system processing data from a variety of functional areas such as finance, operations, sales, human resources and supply-chain management.

Equal Employment Opportunity Classification: A job classification system set forth by the Equal Employment Opportunity Commission (EEOC) for demographic reporting requirements.

HEDIS: Health Plan Employer Data and Information Set, a quality assurance system established by the National Committee for Quality Assurance (NCQA).

HIPAA: The Health Insurance Portability and Accountability Act of 1996.

Human Resource Information System: An integrated system used to gather and store information regarding an organization's employees.

Subject Matter Expert: A person who is knowledgeable about the skills and abilities required for a specific domain such as Human Resources.
Data Mining in the Federal Government


Les Pang
National Defense University, USA

INTRODUCTION

Data mining has been a successful approach for improving the level of business intelligence and knowledge management throughout an organization. This article identifies lessons learned from data mining projects within the federal government, including military services. These lessons learned were derived from the following project experiences:

• Defense Medical Logistics Support System Data Warehouse Program
• Department of Defense (DoD) Defense Financial and Accounting Service (DFAS) Operation Mongoose
• DoD Computerized Executive Information System (CEIS)
• Department of Transportation (DOT) Executive Reporting Framework System
• Federal Aviation Administration (FAA) Aircraft Accident Data Mining Project
• General Accounting Office (GAO) Data Mining of DoD Purchase and Travel Card Programs
• U.S. Coast Guard Executive Information System
• Veterans Administration (VA) Demographics System

BACKGROUND

Data mining involves analyzing diverse data sources in order to identify relationships, trends, deviations and other relevant information that would be valuable to an organization. This approach typically examines large single databases or linked databases that are dispersed throughout an organization. Pattern recognition technologies and statistical and mathematical techniques are often used to perform data mining. By utilizing this approach, an organization can gain a new level of corporate knowledge that can be used to address its business requirements.

Many agencies in the federal government have applied a data mining strategy with significant success. This chapter aims to identify the lessons gained as a result of these many data mining implementations within the federal sector. Based on a thorough literature review, these lessons were uncovered and selected by the author as being critical factors which led toward the success of the real-world data mining projects. Also, some of these lessons reflect novel and imaginative practices.

MAIN THRUST

Each lesson learned (indicated in boldface) is listed below. Following each practice is a description of the illustrative project or projects (indicated in italics) that support the lesson learned.

Avoid the Privacy Trap

DoD Computerized Executive Information System: Patients as well as the system developers indicate their concern for protecting the privacy of individuals; their medical records need safeguards. "Any kind of large database like that where you talk about personal info raises red flags," said Alex Fowler, a spokesman for the Electronic Frontier Foundation. "There are all kinds of questions raised about who accesses that info or protects it and how somebody fixes mistakes" (Hamblen, 1998).

Proper security safeguards need to be implemented to protect the privacy of those in the mined databases. Vigilant measures are needed to ensure that only authorized individuals have the capability of accessing, viewing and analyzing the data. Efforts should also be made to protect the data through encryption and identity management controls.

Evidence of the public's high concern for privacy was the demise of the Pentagon's $54 million Terrorist Information Awareness (originally, Total Information Awareness) effort, the program in which government computers were to be used to scan an enormous array of databases for clues and patterns related to criminal or terrorist activity. To the dismay of privacy advocates, many government agencies are still mining numerous databases (General Accounting Office, 2004; Gillmor, 2004). "Data mining can be a useful tool for the government, but safeguards should be put in place to ensure that information is not abused," stated the chief privacy officer for the Department of Homeland Security (Sullivan, 2004).
Congressional concerns on privacy are so high that the body is looking at introducing legislation that would require agencies to report to Congress on data mining activities to support homeland security purposes (Miller, 2004).

Steer Clear of the "Guns Drawn" Mentality if Data Mining Unearths a Discovery

DoD Defense Finance & Accounting Services Operation Mongoose was a program aimed to discover billing errors and fraud through data mining. About 2.5 million financial transactions were searched to locate inaccurate charges. This approach detected data patterns that might indicate improper use. Examples include purchases made on weekends and holidays, entertainment expenses, highly frequent purchases, multiple purchases from a single vendor, and other transactions that do not match with the agency's past purchasing patterns. It turned up a cluster of 345 cardholders (out of 400,000) who had made suspicious purchases.

However, the process needs some fine-tuning. As an example, buying golf equipment appeared suspicious until it was learned that a manager of a military recreation center had the authority to buy the equipment. Also, a casino-related expense turned out to be a commonplace hotel bill. Nevertheless, the data mining results have shown sufficient potential that data mining will become a standard part of the Department's efforts to curb fraud.
only one employee understands each database well
Create a Business Case Based on Case enough to extract information. In addition, field offices
are organized geographically but budgets are drawn up
Histories to Justify Costs by programs that operate nationwide. Therefore, there is
a disconnect between organizational structure and ap-
FAA Aircraft Accident Data Mining Project involved propriations structure. The Coast Guard successfully
the Federal Aviation Administration hiring MITRE Cor- used their data mining program to overcome these is-
poration to identify approaches it can use to mine vol- sues (Ferris, 2000).
umes of aircraft accident data to detect clues about their DOT Executive Reporting Framework (ERF) aims
causes and how those clues could help avert future to provide complete, reliable and timely information in
crashes (Bloedorn, 2000). One significant data mining an environment that allows for the cross-cutting identi-
finding was that planes with instrument displays that can fication, analysis, discussion and resolution of issues.
be viewed without requiring a pilot to look away from The ERF also manages grants and operations data. Grants
the windshield were damaged a smaller amount in run- data shows the taxes and fees that DOT distributes to the
way accidents than planes without this feature. states highway and bridge construction, airport devel-
On the other hand, the government is careful about opment and transit systems. Operations data covers
committing significant funds to data mining projects. payroll, administrative expenses, travel, training and
One of the problems is how do you prove that you kept other operations cost. The ERF system accesses data
the plane from falling out of the sky, said Trish Carbone, from various financial and programmatic systems in use
a technology manager at MITRE. It is difficult to justify by operating administrators.
data mining costs and relate it to benefits (Matthews, Before 1993 there was no financial analysis system
2000). to compare the departments budget with congressional
One way to justify data mining program is to look at appropriations. There was also no system to track per-
past successes in data mining. Historically, fraud detec- formance against the budget or how they were doing
tion has been the highest payoff in data mining, but other with the budget. The ERF system changed this by track-


the budget and providing the ability to correct any areas that have been overplanned. Using ERF, adjustments were made within the quarter so the agency did not go over budget. ERF is being extended to manage budget projections, development and formulation. It can be used as a proactive tool, which allows an agency to project ahead more dynamically. The system has improved financial accountability in the agency (Ferris, 1999).

Give Users Continual Computer-Based Training

DoD Medical Logistics Support System (DMLSS) built a data warehouse with front-end decision support/data mining tools to help manage the growing costs of health care, enhance health care delivery in peacetime and promote wartime readiness and sustainability. DMLSS is responsible for the supply of medical equipment and medicine worldwide for DoD medical care facilities. The system received recognition for reducing the inventory in its medical depot system by 80 percent and reducing supply request response time from 71 to 15 days (Government Computer News, 2001). One major challenge faced by the agency was the difficulty in keeping up with the training of users because of the constant turnover of military personnel. It was determined that there is a need to provide quality computer-based training on a continuous basis (Olsen, 1997).

Provide the Right Blend of Technology, Human Capital Expertise and Data Security Measures

General Accounting Office used data mining to identify numerous instances of illegal purchases of goods and services from restaurants, grocery stores, casinos, toy stores, clothing retailers, electronics stores, gentlemen's clubs, brothels, auto dealers and gasoline service stations. This was all part of their effort to audit and investigate federal government purchase and travel card and related programs (General Accounting Office, 2003).
Data mining goes beyond using the most effective technology and tools. There must be well-trained individuals involved who know about the process, procedures and culture of the system being investigated. They need to understand the capabilities and limitations of data mining concepts and tools. In addition, these individuals must recognize the data security issues associated with the use of large, complex and detailed databases.

FUTURE TRENDS

Despite the privacy concerns, data mining continues to offer much potential in identifying waste and abuse, potential terrorist and criminal activity, and clues to improve efficiency and effectiveness within organizations. This approach will become more pervasive because of its integration with online analytical tools, the improved ease of use of data mining tools and the appearance of novel visualization techniques for reporting results. Another trend is the emergence of a new branch of data mining called text mining to help improve the efficiency of searching on the Web. This approach transforms textual data into a useable format that facilitates classifying documents, finds explicit relationships or associations among documents, and clusters documents into categories (SAS, 2004).

CONCLUSION

These lessons learned may or may not fit in all environments due to cultural, social and financial considerations. However, the careful review and selection of relevant lessons learned could result in addressing the required goals of the organization by improving the level of corporate knowledge.
A decision maker needs to think outside the box and move away from the traditional approaches to successfully implement and manage their programs. Data mining poses a challenging but highly effective approach to improve business intelligence within one's domain.

REFERENCES

Bloedorn, E. (2000). Data mining for aviation safety. MITRE Publications.

Executive Office of the President, Office of Management and Budget. (2002). President's Management Agenda. Fiscal Year 2002.

Ferris, N. (1999). 9 hot trends for '99. Government Executive.

Ferris, N. (2000). Information is power. Government Executive.

General Accounting Office. (2003). Data mining: Results and challenges for government program audits and investigations. GAO-03-591T.


General Accounting Office. (2004). Data mining: Federal efforts cover a wide range of uses. GAO-04-584.

Gillmor, D. (2004). Data mining by government rampant. eJournal.

Government Computer News. (2001). Ten agencies honored for innovative projects. Government Computer News.

Hamblen, M. (1998). Pentagon to deploy huge medical data warehouse. Computer World.

Matthews, W. (2000). Digging digital gold. Federal Computer Week.

Miller, J. (2004). Lawmakers renew push for data-mining law. Government Computer News.

Olsen, F. (1997). Health record project hits pay dirt. Government Computer News.

SAS. (2004). SAS Text Miner.

Schwartz, A. (2000). Making the Web safe. Federal Computer Week.

Sullivan, A. (2004). U.S. government still data mining. Reuters.

KEY TERMS

Computer-Based Training: A recent approach involving the use of microcomputers, optical disks such as compact disks, and/or the Internet to address an organization's training needs.

Executive Information System: An application designed for top executives that often features a dashboard interface, drill-down capabilities and trend analysis.

Federal Government: The national government of the United States, established by the Constitution, which consists of the executive, legislative, and judicial branches. The head of the executive branch is the President of the United States. The legislative branch consists of the United States Congress, and the Supreme Court of the United States is the head of the judicial branch.

Legacy System: Typically, a database management system in which an organization has invested considerable time and money and that resides on a mainframe or minicomputer.

Logistics Support System: A computer package that assists in planning and deploying the movement and maintenance of forces in the military. The package can deal with the design and development, acquisition, storage, movement, distribution, maintenance, evacuation and disposition of materiel; movement, evacuation, and hospitalization of personnel; acquisition, construction, maintenance, operation and disposition of facilities; and acquisition or furnishing of services.

Purchase Cards: Credit cards used in the federal government by authorized government officials for small purchases, usually under $2,500.

Travel Cards: Credit cards issued to federal employees to pay for costs incurred on official business travel.

NOTE

The views expressed in this article are those of the author and do not reflect the official policy or position of the National Defense University, the Department of Defense or the U.S. Government.


Data Mining in the Soft Computing Paradigm


Pradip Kumar Bala
IBAT, Deemed University, India

Shamik Sural
Indian Institute of Technology, Kharagpur, India

Rabindra Nath Banerjee


Indian Institute of Technology, Kharagpur, India

INTRODUCTION

Data mining is a set of tools, techniques and methods that can be used to find new, hidden or unexpected patterns from a large volume of data typically stored in a data warehouse. Results obtained from data mining help an organization in more effective individual and group decision-making. Regardless of the specific technique, data mining methods can be classified by the function they perform or by their class of application.
Association rule is a type of data mining that correlates one set of items or events with another set of items or events. It employs association or linkage analysis, searching transactions from operational systems for interesting patterns with a high probability of repetition. Classification techniques include mining processes intended to discover rules that define whether an item or event belongs to a particular predefined subset or class of data. This category of techniques is probably the most broadly applicable to different types of business problems. In some cases, it is difficult to define the parameters of a class of data to be analyzed. When parameters are elusive, clustering methods can be used to create partitions so that all members of each set are similar according to a specified set of metrics. Summarization describes a set of data in compact form. Regression techniques are used to predict a continuous value. The regression can be linear or non-linear, with one predictor variable or with more than one predictor variable, known as multiple regression.
Soft computing, which includes the application of fuzzy logic, neural networks, rough sets and genetic algorithms, is an emerging area in data mining. By studying combinations of variables and how different combinations affect data sets, we develop a neural network, a non-linear predictive model that learns. Machine learning techniques, such as genetic algorithms and fuzzy logic, can derive meaning from complicated and imprecise data. They can extract patterns from and detect trends within data that are far too complex to be noticed by either humans or more conventional automated analysis techniques. Because of this ability, neural computing and machine learning technologies demonstrate broad applicability in the world of data mining and, thus, to a wide variety of complex business problems. A rough set is the approximation of an imprecise and uncertain set by a pair of precise concepts, called the lower and upper approximations.
Each soft computing technique addresses problems in its domain using a distinct methodology. However, they are not substitutes for each other. In fact, these soft computing tools work in a cooperative manner rather than being competitive. This has led to the development of hybridization of soft computing tools for data mining applications (Mitra et al., 2002). It should, however, be kept in mind that soft computing techniques have been traditionally developed to handle small data sets. Extending the soft computing paradigm for processing large volumes of data is itself a challenging task. In the next section, we give a brief background of the various soft computing techniques.

BACKGROUND

Fuzzy rules offer an attractive trade-off between the need for accuracy and compactness on one hand, and scalability on the other, when reasoning systems within a particular knowledge domain become quite complex. Fuzzy rules generalize the concept of categorization because, by definition, the same object can belong to multiple sets with different degrees of membership. In this sense, fuzzy logic eliminates the problems associated with borderline cases: where, for example, a degree of membership of 0.9 may cause a rule to fire but a value of 0.899 may not. The net result is that fuzzy systems tend to provide greater accuracy than traditional rule-based systems when continuous variables are involved.
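To make the borderline-case point concrete, the following minimal Python sketch shows a graded membership function and a rule weighted by it; the attribute, the trapezoid breakpoints and the ages used here are illustrative assumptions rather than values from any cited system.

```python
def trapezoid_membership(x, a, b, c, d):
    """Degree to which x belongs to a fuzzy set with a trapezoidal shape:
    0 outside [a, d], 1 on [b, c], linearly increasing/decreasing in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical fuzzy set "middle-aged" defined over an age attribute.
middle_aged = lambda age: trapezoid_membership(age, 30, 40, 50, 60)

for age in (29, 35, 45, 55, 61):
    mu = middle_aged(age)
    print(f"age={age}: membership={mu:.2f}")

# A crisp rule with a cut-off at age 40 treats 39 and 40 very differently;
# the fuzzy membership changes gradually, so a rule weighted by mu
# degrades smoothly near the boundary instead of firing or not firing.
```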
Neural network, which draws its inspiration from neuroscience, attempts to mirror the way a human brain works in recognizing patterns by developing mathematical structures with the ability to learn. An artificial neural network (ANN) learns through training. These are simple computer-based programs whose primary function is to construct models of a problem space based on trial and error. The process of training a neural net to associate certain input patterns with correct output responses involves the use of repetitive examples and feedback, much like the training of a human being.
Rough set theory finds application in studying imprecision, vagueness, and uncertainty in data analysis and is based on the establishment of equivalence classes within given training data. A rough set gives an approximation of a vague concept by two precise concepts, called the lower and upper approximations. These two approximations are a classification of the domain of interest into disjoint categories. The lower approximation is a description of the domain objects known with certainty to belong to the subset of interest, and the upper approximation is a description of the objects that may possibly belong to the subset.
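The lower and upper approximations can be computed directly from the equivalence classes induced by a set of condition attributes. The sketch below illustrates this on a made-up decision table; the attributes, objects and target concept are assumptions chosen only for illustration.

```python
from collections import defaultdict

def rough_approximations(objects, condition_attrs, target_set):
    """Lower and upper approximations of target_set with respect to the
    indiscernibility relation induced by condition_attrs.
    objects: dict id -> dict of attribute values; target_set: set of ids."""
    # Objects with identical values on the condition attributes are
    # indiscernible and form one equivalence class.
    classes = defaultdict(set)
    for oid, attrs in objects.items():
        key = tuple(attrs[a] for a in condition_attrs)
        classes[key].add(oid)

    lower, upper = set(), set()
    for eq_class in classes.values():
        if eq_class <= target_set:      # certainly inside the concept
            lower |= eq_class
        if eq_class & target_set:       # possibly inside the concept
            upper |= eq_class
    return lower, upper

# Toy decision table (hypothetical attribute values and target concept).
objects = {
    1: {"temp": "high", "cough": "yes"},
    2: {"temp": "high", "cough": "yes"},
    3: {"temp": "low",  "cough": "no"},
    4: {"temp": "low",  "cough": "yes"},
}
flu_cases = {1, 4}
low, up = rough_approximations(objects, ["temp", "cough"], flu_cases)
print("lower:", low)   # {4}: only object 4's class lies wholly in the concept
print("upper:", up)    # {1, 2, 4}: classes that overlap the concept at all
```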
Genetic algorithms (GAs) are computational models used in efficient and global search methods for optimality in problem solving. These search algorithms are based on the mechanics of natural genetics combined with Darwin's theory of survival of the fittest and are particularly suitable for solving complex optimization problems as well as applications that require adaptive problem-solving strategies. In data mining, GA finds application in hypothesis testing and refinement.
With this background, we next present how soft computing techniques can be applied to specific data mining problems.

MAIN THRUST

Association Rule Mining: Following are some of the applications of soft computing tools in association rule mining.

Fuzzy Logic: A generalized association rule may involve binary, quantitative or categorical data and hierarchical relations. In quantitative or categorical association rule mining, irrespective of the methodology used, sharp boundaries remain a problem, which under-estimates or over-emphasizes the elements near the boundaries. This may, therefore, lead to an inaccurate representation of semantics. To deal with the problem, fuzzy sets and fuzzy items, usually in the form of labels or linguistic terms, are used and defined on the domains (Chien et al., 2001). In the fuzzy framework, conventional notions of support and confidence can be extended as well. The partial belongingness of an item in a subset is taken into account while computing the degree of support and the degree of confidence. The measures are similar in spirit to the count operator used for fuzzy cardinality (a minimal sketch of such measures follows these association rule items). Subsequently, with these extended measures incorporated, several mining algorithms have been developed (Gyenesei, 2000; Gyenesei & Teuhola, 2001; Shu et al., 2001). Instead of dividing quantitative attributes into fixed intervals, linguistic terms can be used to represent the regularities and exceptions the way in which humans perceive reality. Chen et al. (2002) have developed an algorithm for fuzzy association rules that deals with partitioning quantitative data domains. Wei & Chen (1999) extended generalized association rules with fuzzy taxonomies, by which partial belongings could be incorporated. Furthermore, a recent effort has been made which incorporates linguistic hedges on existing fuzzy taxonomies (Chen et al., 1999; Chen et al., 2002a). Several fuzzy extensions have been made to interestingness measures. A measure called Interestingness Degree has been proposed, which can be seen as the increase in probability of an event Y caused by the occurrence of another event X. Attempts have been made to introduce thresholds for filtering databases in dealing with very low membership degrees (Hullermeier, 2001).

Genetic Algorithm: Min et al. (2001) have used a GA-based data mining approach in e-commerce to find association rules of IF-THEN form for adopters and non-adopters of e-purchasing. Association rules of the IF-THEN form can also be mined with a high degree of accuracy and coverage (Lopes et al., 1999).
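As a rough illustration of such extended measures, the sketch below computes fuzzy support and confidence by aggregating membership degrees, using the minimum as the combination operator and an average of degrees (the sigma-count idea). The exact formulations in the cited algorithms differ, and the transactions shown are hypothetical.

```python
def fuzzy_support(transactions, items):
    """Fuzzy support of an itemset: average over all transactions of the
    combined membership degree (here, min) of the items in the itemset.
    transactions: list of dicts mapping fuzzy item -> membership in [0, 1]."""
    total = 0.0
    for t in transactions:
        total += min(t.get(i, 0.0) for i in items)
    return total / len(transactions)

def fuzzy_confidence(transactions, antecedent, consequent):
    """Fuzzy confidence of the rule antecedent => consequent."""
    supp_both = fuzzy_support(transactions, antecedent + consequent)
    supp_ante = fuzzy_support(transactions, antecedent)
    return supp_both / supp_ante if supp_ante > 0 else 0.0

# Hypothetical fuzzy transactions: degrees to which each purchase is "high".
transactions = [
    {"milk_high": 0.9, "bread_high": 0.7},
    {"milk_high": 0.4, "bread_high": 0.8},
    {"milk_high": 0.0, "bread_high": 0.3},
]
print(fuzzy_support(transactions, ["milk_high", "bread_high"]))      # ~0.37
print(fuzzy_confidence(transactions, ["milk_high"], ["bread_high"]))  # ~0.85
```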
Clustering: Following are some of the applications of soft computing tools in clustering.

Fuzzy Logic: A fuzzy clustering algorithm makes an attempt to group prospects into categories based on their identifying characteristics. For example, for prospective customers of any business, the key attributes can include geographic data, psychographic data and others. Clusters expressed in linguistic terms can be easily handled using fuzzy sets. Using fuzzy sets, we can also find dependencies between data expressed in a qualitative format. Use of fuzzy logic can help avoid searching for less important, trivial or meaningless patterns in databases. Fuzzy clustering algorithms have been developed for mining telecommunications customer and prospect databases to gain customer information for deciding a marketing strategy (Russell et al., 1999).

Neural Network: Self Organizing Map (SOM) is one of the most widely used unsupervised neural network models that employ competitive learning steps.
An important data mining task is organizing data points with single or multiple dimensions into their natural clusters. Kohonen et al. (2000) have demonstrated the applicability of the self-organizing map, where large data sets can be partitioned in stages. A two-phase method can be used where, in the first phase, the step-wise strategy of SOM is used, and then in the second phase, the resulting prototypes of the SOM are clustered by an agglomerative clustering method or by k-means clustering (Vesanto et al., 2000). A dimension-independent algorithm has been developed which allows hierarchical clustering of SOMs based on a spread factor. This can be used as a controlling measure for generating maps with different dimensionality (Alahakoon et al., 2000).

Classification and Rule Extraction: Following are some of the applications of soft computing tools in classification and rule extraction.

Neural Network: For the purpose of classification or rule extraction, ANN is used in the supervised learning paradigm. The most common supervised learning paradigm is error back-propagation. Here, a neural network receives an input example, generates a guess, and compares that guess with the expected result. The error between the guess and the desired result is fed back to improve the guess in an iterative manner. In this sense, the ANN is being supervised by the feedback, which shows the network where it made mistakes and how the correct result should look. The most common form of the back-propagation algorithm uses a sum-of-squared-errors approach to generate an aggregate measure of the difference error (a minimal numerical sketch of this training loop follows these classification items). A general framework for classification rule mining, called NEUCRUM (NEUral Classification RUle Mining), has been developed. It has two components: one is a specific neural classifier named FANNC, and the other is a novel rule extraction approach named STARE (Zhou et al., 2000).

Rough Set: Piasta (1999) has presented an approach called the ProbRough system to analyze business databases based on rule induction. The ProbRough system can induce decision rules from databases with a very high number of objects and attributes. Based on rough set theory, another approach to the selection of attributes for the construction of decision trees has been developed (Wei, 2003). According to Han & Kamber (2003), rough set theory can be applied for classification to discover structural relationships in imprecise or noisy data. It can be applied to discrete-valued attributes and hence, continuous-valued attributes must be discretized prior to its use. A classifier can be trained using a rough set learning algorithm for rule extraction in IF-THEN form from a decision table.
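The following is a minimal numerical sketch of the error back-propagation training loop described above: a one-hidden-layer network trained by gradient descent on a sum-of-squared-errors objective. The toy data, layer sizes and learning rate are assumptions for illustration and are not taken from NEUCRUM or any other cited system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class problem (XOR-like): inputs X and 0/1 targets y.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 units; weights initialized at random.
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # Forward pass: the network's current "guess".
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error between the guess and the expected result (squared-error objective).
    err = out - y
    # Backward pass: propagate the error and adjust the weights.
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(np.round(out, 2))  # predictions approach the targets after training
```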


Neuro-Fuzzy Computing: Neuro-fuzzy computing combines strong features of the neural network and fuzzy approaches. To deal with numerical and linguistic data and granular knowledge, a granular neural network can be designed (Zhang et al., 2000). High-level granular knowledge in the form of rules is generated by compressing the low-level granular data. An algorithm has been developed to mine mixed fuzzy rules involving both numeric and categorical attributes. Fuzzy set-driven computational techniques for data mining have been discussed by Pedrycz (1998), establishing the relationship between data mining and fuzzy modeling.

Other Hybrid Approaches: A hybrid prediction system based on a neural network, with its learning based on memory-based reasoning, can be designed for classification. It can also learn the dynamic behavior of a system over a period of time. It has been established by experimentation that a hybrid system has high potential in solving data mining problems (Shin et al., 2000). The concepts of fuzzy logic and rough sets can be applied in a Multi-layer Perceptron (MLP) neural network to extract rules from crude domain knowledge. The appropriate number of hidden nodes is automatically determined, and the dependency factors are used in the initial weight encoding.

Other Data Mining Applications: The soft computing paradigm has also been extended to some other types of data mining, as discussed below.

Genetic algorithms have been used in regression analysis. One basic assumption in traditional regression models is that there is no interaction amongst the attributes. To learn non-linear multi-regression from a set of training data, an adaptive GA can be used. A genetic algorithm can handle attribute interactions efficiently. GA has been used to discover interesting rules in a dependency-modeling task (Noda et al., 1999). The system developed by Shin et al. (2000) for classification, as discussed above, can be applied for regression analysis as well.
Fuzzy functional dependencies are extensions of classical functional dependencies, aimed at dealing with fuzziness in databases and reflecting the semantics that close values of a collection of certain items are dependent on close values of a collection of different items. Generally, fuzzy functional dependencies have different forms depending on the different aspects of integrating fuzzy logic into classical functional dependencies. Fuzzy inference generalizes both imprecise and precise inference. An attempt has been made by Yang & Singhal (2001) to develop a framework linking fuzzy functional dependencies and fuzzy association rules in a closer manner.
Discovering relationships among time series is an interesting application, since time series patterns reflect the evolution of changes in item values with sequential factors like time. The value of each time series item is viewed as a pattern over time, and the similarity between any two patterns is measured by pattern matching. Chen et al. (2001) have presented a method based on Dynamic Time Warping (DTW) to discover pattern associations.
Summarization is one of the major components of data mining. Lee & Kim (1997) have proposed an interactive top-down summary discovery process which utilizes fuzzy ISA hierarchies as domain knowledge. They have defined a generalized tuple as a representational form of a database summary including fuzzy concepts. By virtue of fuzzy ISA hierarchies, where fuzzy ISA relationships common in actual domains are naturally expressed, the discovery process comes up with more accurate database summaries. They have also presented an informativeness measure, based on Shannon's information theory, for distinguishing generalized tuples that deliver more information to users.
A GA-based approach can be used for discovering temporal trends by synthesizing Bayesian networks (Novobilski & Kamangar, 2002). Bonaventura et al. (2003) have developed a hybrid model for the prediction of the linguistic origin of surnames. It is a neural network module combining the results provided both by the lexical rule module and by the statistical module and is used to compute the evidence for the classes.

FUTURE DIRECTIONS

In this paper, we have discussed various soft computing methods used in data mining. Our focus has been on four primary soft computing techniques, namely, fuzzy logic, neural network, rough set and genetic algorithm. Although these techniques have not yet attained maturity to the extent the conventional data mining techniques have, it is expected that these techniques will mature well enough to be dealt with as independent areas of data mining very soon.

CONCLUSION

Hybridization of fuzzy, neural and genetic algorithms in solving data mining problems seems to be an upcoming area in the field of data mining. However, more stress needs to be given to improving the efficiency of these soft computing techniques when applied to data mining problems. Processing terabytes of data with a reasonable time response is a primary requirement of any kind of data mining algorithm. Since soft computing techniques typically require more processing than traditional techniques, it remains to be seen how well they can be adapted successfully to the interesting and challenging field of data mining.

REFERENCES

Alahakoon, D., Halgamuge, S.K., & Srinivasan, B. (2000). Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions on Neural Networks, 11(3), 601-614.

Bonaventura, P., Marco, G., Marco, M., Franco, S., & Sheng, J. (2003). A hybrid model for the prediction of the linguistic origin of surnames. IEEE Transactions on Knowledge and Data Engineering, 15(3), 760-763.

Chen, G.Q., Wei, Q., & Kerre, E.E. (1999). Fuzzy data mining: Discovery of fuzzy generalized association rules. In Recent Research Issues on Management of Fuzziness in Databases. Berlin: Springer-Verlag.

Chen, G.Q., Yan, P., & Kerre, E.E. (2002a). Mining fuzzy implication-based association rules in quantitative databases. In International FLINS Conference on Computational Intelligent Systems for Applied Research, Belgium.

Chen, G.Q., Wei, Q., Liu, D., & Wets, G. (2002). Simple association rules (SAR) and the SAR-based rule discovery. Journal of Computer & Industrial Engineering, 43, 721-733.

Chen, G.Q., Wei, Q., & Zhang, H. (2001). Discovering similar time-series patterns with fuzzy clustering and DTW methods. In International Fuzzy Systems Association Conference, Vancouver, Canada.

Chien, B.C., Lin, Z.L., & Hong, T.P. (2001). An efficient clustering algorithm for mining fuzzy quantitative association rules. In Ninth International Fuzzy Systems Association World Congress (pp. 1306-1311), Vancouver, Canada.

Gyenesei, A. (2000). A fuzzy approach for mining quantitative association rules. TUCS Technical Reports 336. Department of Computer Science, University of Turku, Finland.

Gyenesei, A., & Teuhola, J. (2001). Interestingness measures for fuzzy association rules. In Principles and Practice of Knowledge Discovery in Databases, Freiburg, Germany.


Han, J., & Kamber, M. (2003). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.

Hullermeier, E. (2001). Fuzzy association rules: Semantics issues and quality measures. In Lecture Notes in Computer Science 2206 (pp. 380-391). Berlin & Heidelberg: Springer.

Kohonen, T., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3), 574-585.

Lee, D.H., & Kim, M.H. (1997). Database summarization using fuzzy ISA hierarchies. IEEE Transactions on Systems, Man and Cybernetics Part B: Cybernetics, 27(1).

Lopes, C., Pacheco, M., Vellasco, M., & Passos, E. (1999). Rule-Evolver: An evolutionary approach for data mining. In Seventh International Workshop on Rough Sets, Fuzzy Sets, Data Mining and Granular-Soft Computing (pp. 458-462), Yamaguchi, Japan.

Min, H., Smolinski, T., & Boratyn, G.A. (2001). A genetic algorithm-based data mining approach to profiling the adopters and non-adopters of e-purchasing. In Third International Conference on Information Reuse and Integration, Las Vegas, USA.

Mitra, S., Pal, S.K., & Mitra, P. (2002). Data mining in soft computing framework: A survey. IEEE Transactions on Neural Networks, 13, 3-14.

Noda, E., Freitas, A.A., & Lopes, H.S. (1999). Discovering interesting prediction rules with a genetic algorithm. In IEEE Congress on Evolutionary Computing (pp. 1322-1329).

Novobilski, A., & Kamangar, F. (2002). A genetic algorithm based approach for discovering temporal trends using Bayesian networks. In Sixth World Conference on Systemics, Cybernetics, and Informatics.

Pedrycz, W. (1998). Fuzzy set technology in knowledge discovery. Fuzzy Sets and Systems, 98, 279-290.

Piasta, Z. (1999). Analyzing business databases with the ProbRough rule induction system. In Workshop on Data Mining in Economics, Marketing and Finance (pp. 22-29), Chania, Greece.

Russell, S., & Lodwick, W. (1999). Fuzzy clustering in data mining for telco database marketing campaigns. In North Atlantic Fuzzy Information Processing Symposium (pp. 720-726), New York.

Shin, C.K., Tak Yun, U., Kang Kim, H., & Chan Park, S. (2000). A hybrid approach of neural network and memory-based learning to data mining. IEEE Transactions on Neural Networks, 11(3), 637-646.

Shu, J.Y., Tsang, E.C.C., Daniel, & Yeung, S. (2001). Query fuzzy association rules in relational database. In Ninth International Fuzzy Systems Association World Congress, Vancouver, Canada.

Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586-600.

Wei, J.M. (2003). Rough set based approach to selection of node. International Journal of Computational Cognition, 1(2).

Wei, Q., & Chen, G.Q. (1999). Mining generalized association rules with fuzzy taxonomic structures. In Eighteenth International Conference of North Atlantic Fuzzy Information Processing Systems (pp. 477-481), New York, NY, USA.

Yang, Y., & Singhal, M. (2001). Fuzzy functional dependencies and fuzzy association rules. In First International Conference on Data Warehouse and Knowledge Discovery (pp. 229-240), Florence, Italy.

Zhang, Y.Q., Fraser, M.D., Gagliano, R.A., & Kandel, A. (2000). Granular neural networks for numerical-linguistic data fusion and knowledge discovery. IEEE Transactions on Neural Networks, 11, 658-667.

Zhou, Z.H., Yuan, J., & Chen, S.F. (2000). A general neural framework for classification rule mining. International Journal of Computers, Systems, and Signals, 1(2), 154-168.

KEY TERMS

Data Mining: A set of tools, techniques and methods used to find new, hidden or unexpected patterns from a large collection of data typically stored in a data warehouse.

Fuzzy Set: A set that captures the different degrees of belongingness of different objects in the universe instead of a sharp demarcation between objects that belong to a set and those that do not.

Genetic Algorithm: A search and optimization technique that uses the concept of survival of genetic materials over various generations of populations, much like the theory of natural evolution.

Hybrid Technique: A combination of two or more soft computing techniques used for data mining. Examples are neuro-fuzzy, neuro-genetic, and so on.

Neural Network: A connectionist model that can be trained in supervised or unsupervised mode for learning patterns in data.


Rough Set: A method of modeling impreciseness and vagueness in data through two sets representing the upper bound and lower bound of the data set.

Soft Computing: A collection of methods and techniques, such as fuzzy sets, neural networks, rough sets and genetic algorithms, for solving complex real-world problems.


Data Mining Medical Digital Libraries


Colleen Cunningham
Drexel University, USA

Xiaohua Hu
Drexel University, USA

INTRODUCTION

Given the exponential growth rate of medical data and the accompanying biomedical literature, more than 10,000 documents per week (Leroy et al., 2003), it has become increasingly necessary to apply data mining techniques to medical digital libraries in order to assess a more complete view of genes, their biological functions and diseases. Data mining techniques, as applied to digital libraries, are also known as text mining.

BACKGROUND

Text mining is the process of analyzing unstructured text in order to discover information and knowledge that are typically difficult to retrieve. In general, text mining involves three broad areas: Information Retrieval (IR), Natural Language Processing (NLP) and Information Extraction (IE). Each of these areas is defined as follows:

Natural Language Processing: a discipline that deals with various aspects of automatically processing written and spoken language.
Information Retrieval: a discipline that deals with finding documents that meet a set of specific requirements.
Information Extraction: a sub-field of NLP that addresses finding specific entities and facts in unstructured text.

MAIN THRUST

The current state of text mining in digital libraries is provided in order to facilitate continued research, which subsequently can be used to develop large-scale text mining systems. Specifically, an overview of the process, recent research efforts and practical uses of mining digital libraries, future trends and conclusions are presented.

Text Mining Process

Text mining can be viewed as a modular process that involves two modules: an information retrieval module and an information extraction module. Figure 1 presents the relationship between the modules and the relationships between the phases within the information retrieval module. The former module involves using NLP techniques to pre-process the written language and using techniques for document categorization in order to find relevant documents. The latter module involves finding specific and relevant facts within text. NLP consists of three distinct phases: (1) tokenization, (2) parts of speech (PoS) tagging and (3) parsing. In the tokenization step, the text is decomposed into its subparts, which are subsequently tagged during the second phase with the part of speech that each token represents (e.g., noun, verb, adjective, etc.). It should be noted that generating the rules for PoS tagging is a very manual and labor-intensive task. Typically, the parsing phase utilizes shallow parsing in order to group syntactically related words together, because full parsing is both less efficient (i.e., very slow) and less accurate (Shatkay & Feldman, 2003). Once the documents have been pre-processed, they can be categorized.
There are two approaches to document categorization: Knowledge Engineering (KE) and Machine Learning (ML). Knowledge Engineering requires the user to manually define rules, which can consequently be used to categorize documents into specific pre-defined categories. Clearly, one of the drawbacks of KE is the time that it would take a person (or group of people) to manually construct and maintain the rules. ML, on the other hand, uses a set of training documents to learn the rules for classifying documents. Specific ML techniques that have successfully been used to categorize text documents include, but are not limited to, Decision Trees, Artificial Neural Networks, Nearest Neighbor and Support Vector Machines (SVM) (Stapley et al., 2002). Once the documents have been categorized, documents that satisfy specific search criteria can be retrieved.


Figure 1. Overview of text mining process. [Diagram: within the information retrieval module, NLP pre-processing (tokenization phase, part-of-speech phase, parsing phase) feeds document categorization techniques (knowledge engineering or machine learning, e.g., decision trees, artificial neural networks, nearest neighbor, support vector machines) and document retrieval techniques (Boolean or vector), producing retrieved documents that are passed to the information extraction module.]

There are several techniques for retrieving documents that satisfy specific search criteria. The Boolean approach returns documents that contain the terms (or phrases) contained in the search criteria, whereas the vector approach returns documents based upon the term frequency-inverse document frequency (TF x IDF) weights of the term vectors that represent the documents. Variations of clustering and clustering ensemble algorithms (Iliopoulous et al., 2001; Hu, 2004), classification algorithms (Marcotte et al., 2001) and co-occurrence vectors (Stephens et al., 2001) have been successfully used to retrieve related documents. An important point to mention is that the terms used to represent the search criteria, as well as the terms used to represent the documents, are critical to successfully and accurately returning related documents. However, terms often have multiple meanings (i.e., polysemy) and multiple terms can have the same meaning (i.e., synonyms). This represents one of the current issues in text mining, which will be discussed in the next section.
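The vector (TF x IDF) retrieval idea can be sketched in a few lines; the toy "abstracts", the query and the cosine ranking below are illustrative assumptions, not a description of any cited system.

```python
import math
from collections import Counter

docs = {  # hypothetical toy abstracts
    "d1": "gene expression regulates protein binding",
    "d2": "protein protein interaction in yeast",
    "d3": "clinical trial of a new migraine drug",
}
query = "protein interaction"

def tfidf_vectors(texts):
    """Build TF x IDF weight vectors for each text."""
    tokenized = {name: t.lower().split() for name, t in texts.items()}
    n_docs = len(tokenized)
    df = Counter(w for toks in tokenized.values() for w in set(toks))
    vectors = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        vectors[name] = {w: (tf[w] / len(toks)) * math.log(n_docs / df[w])
                         for w in tf}
    return vectors

def cosine(u, v):
    dot = sum(u.get(w, 0.0) * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Weight the query within the same collection, then rank documents by
# cosine similarity between query vector and document vectors.
all_vecs = tfidf_vectors({**docs, "query": query})
query_vec = all_vecs.pop("query")
ranking = sorted(all_vecs, key=lambda d: cosine(query_vec, all_vecs[d]), reverse=True)
print(ranking)  # documents sharing informative query terms rank first
```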
The last part of the text mining process is information extraction, of which the most popular technique is co-occurrence (Blaschke & Valencia, 2002; Jenssen et al., 2001). There are two disadvantages to this approach, each of which creates opportunities for further research. First, this approach depends upon assumptions regarding sentence structure, entity names, and so on that do not always hold true (Pearson, 2001). Furthermore, this approach relies heavily on the completeness of the list of gene names and synonyms. Table 1 summarizes the modular process of text mining.
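A minimal sketch of the co-occurrence idea follows: gene symbols found in the same abstract are counted as candidate relations. The gene list and abstracts are invented for illustration, and the naive token matching highlights exactly the dependence on complete name and synonym lists noted above.

```python
from itertools import combinations
from collections import Counter

# Hypothetical curated list of gene symbols (real systems also need synonyms).
gene_list = {"TP53", "BRCA1", "MDM2", "EGFR"}

abstracts = [
    "TP53 is stabilized when MDM2 is inhibited",
    "BRCA1 and TP53 mutations co-occur in some tumors",
    "EGFR signaling was unaffected",
]

pair_counts = Counter()
for text in abstracts:
    # Naive entity matching: a token counts as a gene if it is on the list.
    genes_found = {tok.strip(".,") for tok in text.split()} & gene_list
    # Every pair of genes mentioned in the same abstract co-occurs once.
    for pair in combinations(sorted(genes_found), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common())
# [(('MDM2', 'TP53'), 1), (('BRCA1', 'TP53'), 1)] -- candidate relations,
# with no guarantee that the sentences actually assert an interaction.
```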


Research to Address Issues in Mining Digital Libraries

The issues in mining digital libraries, specifically medical digital libraries, include scalability, ambiguous English and biomedical terms, non-standard terms and structure, and inconsistencies between medical repositories (Shatkay & Feldman, 2003). Most of the current text mining research focuses on automating information extraction (Shatkay & Feldman, 2003). The scalability of text mining approaches is of concern because of the rapid rate of growth of the literature. As such, while most of the existing methods have been applied to relatively small sample sets, there has been an increase in the number of studies focused on scaling techniques to apply to large collections (Pustejovsky et al., 2002; Jenssen et al., 2002). One exception to this is the study by Jenssen et al. (2001), in which the authors used a predefined list of genes to retrieve all related abstracts from PubMed that contained the genes on the predefined list.

Table 1. General text mining process

General Step: IR
General Purpose: Identify and retrieve relevant docs
General Issues: aliases, synonyms & homonyms
General Solutions: Controlled vocabularies (ontologies)

General Step: IE
General Purpose: Find specific and relevant facts within text [e.g., find specific gene entities (entity extraction) and relationships between specific genes (relationship extraction)]
General Issues: aliases, synonyms & homonyms
General Solutions: Controlled vocabularies (ontologies) [e.g., gene terms and synonyms: LocusLink (Pruitt & Maglott, 2001), SwissProt (Boeckmann et al., 2003), HUGO (HUGO, 2003), National Library of Medicine's MeSH (NLM, 2003)]

Since mining digital libraries relies heavily on the ability to accurately identify terms, the issues of ambiguous terms, special jargon and the lack of naming conventions are not trivial. This is particularly true in the case of digital libraries, where the issue is further compounded by non-standard terms. In fact, a lot of effort has been dedicated to building ontologies to be used in conjunction with text mining techniques (Boeckmann et al., 2003; HUGO, 2003; Liu et al., 2001; NLM, 2003; Oliver et al., 2002; Pruitt & Maglott, 2001; Pustejovsky et al., 2002). Manually building and maintaining ontologies, however, is a time-consuming effort. In light of that, there have been several efforts to find ways of automatically extracting terms to incorporate into and build ontologies (Nenadic et al., 2002; Ono et al., 2001). The ontologies are subsequently used to match terms. For instance, Nenadic et al. (2002) developed the Tagged Information Management System (TIMS), which is an XML-based knowledge acquisition system that uses ontology for information extraction over large collections.

Uses of Text Mining in Medical Digital Libraries

There are many uses for mining medical digital libraries that range from generating hypotheses (Srinivasan, 2004) to discovering protein associations (Fu et al., 2003). For instance, Srinivasan (2004) developed MeSH-based text mining methods that generate hypotheses by identifying potentially interesting terms related to specific input. Further examples include, but are not limited to: uncovering uses for thalidomide (Weeber et al., 2003), discovering functional connections between genes (Chaussabel & Sher, 2002) and identifying viruses that could be used as biological weapons (Swanson et al., 2001). Table 2 summarizes some of the recent uses of text mining in medical digital libraries.

Table 2. Some uses of text mining medical digital libraries

Application | Technique | Supporting Literature
Building gene networks | Co-occurrence | Jenssen et al., 2001; Stapley & Benoit, 2000; Adamic et al., 2002
Discovering protein associations | Unsupervised cluster learning and vector classification | Fu et al., 2003
Discovering gene interactions | Co-occurrence | Stephens et al., 2001; Chaussabel & Sher, 2002
Discovering uses for thalidomide | Mapping phrases to UMLS concepts | Weeber et al., 2001
Extracting and combining relations | Rule-based parser and co-occurrence | Leroy et al., 2003
Generating hypotheses | Incorporating ontologies (e.g., mapping terms to MeSH) | Srinivasan, 2004; Weeber et al., 2003
Identifying biological virus weapons | | Swanson et al., 2001

FUTURE TRENDS

The large volume of genomic data and the accompanying literature resulting from the Human Genome Project is expected to continue to grow. As such, there will be a continued need for research to develop scalable and effective data mining techniques that can be used to analyze the growing wealth of biomedical data. Additionally, given the importance of gene names in the context of mining biomedical literature and the fact that there are a number of medical sources that use different naming conventions and structures, research


to further develop ontologies will play an important part in mining medical digital libraries. Finally, it is worth mentioning that there has been some effort to link the unstructured text documents within medical digital libraries with their related structured data in data repositories.

CONCLUSION

Given the practical applications of mining digital libraries and the continued growth of available data, mining digital libraries will continue to be an important area that will help researchers and practitioners gain invaluable and undiscovered insights into genes, their relationships, biological functions, diseases and possible therapeutic treatments.

REFERENCES

Blaschke, C., & Valencia, A. (2002). The frame-based module of the SUISEKI information extraction system. IEEE Intelligent Systems, Special Issue on Intelligent Systems in Biology, 17(2), 14-20.

Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., & Schneider, M. (2003). The SWISS-PROT Protein Knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1), 365-370.

Chaussabel, D., & Sher, A. (2002). Mining microarray expression data by literature profiling. Genome Biology, 3(10), research0055.1-0055.16.

De Bruijn, B., & Martin, J. (2002). Getting to the core of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67, 7-18.

Fu, Y., Mostafa, J., & Seki, K. (2003). Protein association discovery in biomedical literature. In Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 113-115).

Hu, X. (2004). Integration of cluster ensemble and text summarization for gene expression. In Proceedings of the 2004 IEEE Symposium of Bioinformatics and Bioengineering.

HUGO. (2003). HUGO (The Human Genome Organization) Gene Nomenclature Committee. Retrieved from http://www.gene.ucl.ac.uk/nomenclature

Iliopolous, I., Enright, A.J., & Ouzounis, C.A. (2001). Textquest: Document clustering of Medline abstracts for concept discovery in molecular biology. In Proceedings of the Pacific Symposium on Biocomputing (PSB) (pp. 384-395).

Jenssen, T.K., Laegrid, A., Komorowski, J., & Hovig, E. (2001). A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28(1), 21-28.

Leroy, G., Chen, H., Martinez, J.D., Eggers, S., Flasey, R.R., Kislin, K.L., Huang, Z., Li, J., Xu, J., McDonald, D.M., & Ng, G. (2003). Genescene: Biomedical text and data mining. In Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 116-118).

Liu, H., Lussier, Y.A., & Friedman, C. (2001). Disambiguating ambiguous biomedical terms in biomedical narrative text: An unsupervised method. Journal of Biomedical Informatics, 34(4), 249-261.

Marcotte, E.M., Xenarios, I., & Eisenberg, D. (2001). Mining literature for protein-protein interactions. Bioinformatics, 17(4), 359-363.

NLM. (2003). MeSH: Medical Subject Headings. Retrieved from http://www.nlm.nih.gov/mesh/

Oliver, D.E., Rubin, D.L., Stuart, J.M., Hewett, M., Klein, T.E., & Altman, R.B. (2002). Ontology development for a pharmacogenetics knowledge base. Proceedings of Pacific Symposium on Biocomputing (PSB)-2002, 7, 65-76.

Pearson, H. (2001). Biology's name game. Nature, 411(6838), 631-632.

Pruitt, K.D., & Maglott, D.R. (2001). RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Research, 29(1), 137-140.

Pustejovsky, J., Castano, J., Zhang, J., Kotecki, M., & Cochran, B. (2002). Robust relational parsing over biomedical literature: Extracting inhibit relations. Proceedings of Pacific Symposium on Biocomputing (PSB)-2002, 7, 362-373.

Shatkay, H., & Feldman, R. (2003). Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 10(6), 821-855.

Srinivasan, P. (2004). Text mining: Generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology, 55(5), 396-413.

Stapley, B.J., Kelley, L.A., & Sternberg, M.J. (2002). Predicting the sub-cellular location of proteins from text using support vector machines. Proceedings of the Pacific Symposium on Biocomputing (PSB), 7, 374-385.
Textquest: Document clustering of Medline abstracts for


Stephens, M., Palakal, M., Mukhopadhyay, S., Raje, R., & Mostafa, J. (2001). Detecting gene relations from Medline abstracts. In Proceedings of the Pacific Symposium on Biocomputing (PSB) (pp. 483-496).

Swanson, D.R., Smalheiser, N.R., & Bookstein, A. (2001). Information discovery from complementary literatures: Categorizing viruses as potential weapons. Journal of the American Society for Information Science, 52(10), 797-812.

Weeber, M., Klein, H., Berg, L., & Vos, R. (2001). Using concepts in literature-based discovery: Simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. Journal of the American Society for Information Science, 52(7), 548-557.

Weeber, M., Vos, R., Klein, H., de Jong-Van den Berg, L.T.W., Aronson, A., & Molema, G. (2003). Generating hypotheses by discovering implicit associations in the literature: A case report for new potential therapeutic uses for thalidomide. Journal of the American Medical Informatics Association, 10(3), 252-259.

KEY TERMS

Bibliomining: Data mining applied to digital libraries to discover patterns in large collections.

Bioinformatics: Data mining applied to medical digital libraries.

Clustering: An algorithm that takes a dataset and groups the objects such that objects within the same cluster have a high similarity to each other but are dissimilar to objects in other clusters.

Information Extraction: A sub-field of NLP that addresses finding specific entities and facts in unstructured text.

Information Retrieval: A discipline that deals with finding documents that meet a set of specific requirements.

Machine Learning: Artificial intelligence methods that use a dataset to allow the computer to learn models that fit the data.

Natural Language Processing: A discipline that deals with various aspects of automatically processing written and spoken language.

Supervised Learning: A machine learning technique that requires a set of training data, which consists of known inputs and a priori desired outputs (e.g., classification labels), that can subsequently be used for either prediction or classification tasks.

Unsupervised Learning: A machine learning technique which is used to create a model based upon a dataset; however, unlike supervised learning, the desired output is not known a priori.


Data Mining Methods for Microarray Data Analysis
Lei Yu
Arizona State University, USA

Huan Liu
Arizona State University, USA

INTRODUCTION

The advent of gene expression microarray technology enables the simultaneous measurement of expression levels for thousands or tens of thousands of genes in a single experiment (Schena et al., 1995). Analysis of gene expression microarray data presents unprecedented opportunities and challenges for data mining in areas such as gene clustering (Eisen et al., 1998; Tamayo et al., 1999), sample clustering and class discovery (Alon et al., 1999; Golub et al., 1999), sample class prediction (Golub et al., 1999; Wu et al., 2003), and gene selection (Xing, Jordan, & Karp, 2001; Yu & Liu, 2004). This article introduces the basic concepts of gene expression microarray data and describes relevant data-mining tasks. It briefly reviews the state-of-the-art methods for each data-mining task and identifies emerging challenges and future research directions in microarray data analysis.

BACKGROUND AND MOTIVATION

The rapid advances of gene expression microarray technology have provided scientists, for the first time, the opportunity of observing complex relationships among various genes in a genome by simultaneously measuring the expression levels of thousands of genes in massive experiments. In order to extract biologically meaningful insights from the plethora of data generated from microarray experiments, advanced data analysis techniques are in demand. Data-mining methods, which discover patterns, statistical or predictive models, and relationships among massive data, are effective tools for microarray data analysis.
Gene expression microarrays are silicon chips that simultaneously measure the expression levels of thousands of genes. The description of technologies for constructing these chips and measuring gene expression levels is beyond the scope of this article (refer to Draghici, 2003, for an introduction). Each expression level of a specific gene among thousands of genes measured in an experiment is eventually recorded as a numerical value. Expression levels of the same set of genes under study are normally accumulated through multiple experiments on different samples (or the same sample under different conditions) and recorded in a data matrix. In data mining, data are often stored in the form of a matrix, of which each column is described by a feature or attribute and each row consists of feature values and forms an instance, also called a record or data point, in a multidimensional space defined by the features. Figure 1 illustrates two ways of representing microarray data in matrix form. In Figure 1a, each feature is a sample (S) and each instance is a gene (G). Each gene's expression levels are measured across all the samples (or conditions), so f_ij is the measurement of the expression level of the ith gene for the jth sample, where i = 1, ..., n and j = 1, ..., m. In Figure 1b, the data matrix is the transpose of the one in Figure 1a, in which features are genes and instances are samples. Sometimes, data in Figure 1b may have class labels c_i for each instance, represented in the last column. The class labels can be different types of diseases or phenotypes of the underlying samples. A typical microarray data set may contain thousands of genes but only a small number of samples (often less than 100). The number of samples is likely to remain small, at least for the near future, due to the expense of collecting microarray samples (Dougherty, 2001).
the expression levels of thousands of genes in massive (Dougherty, 2001).
experiments. In order to extract biologically meaning- The two different forms of data shown in Figure 1
ful insights from a plethora of data generated from have different data-mining tasks. When instances are
microarray experiments, advanced data analysis tech- genes (Figure 1a), gene clustering can be performed to
niques are in demand. Data-mining methods, which dis- find similarly expressed genes across many samples.
cover patterns, statistical or predictive models, and When instances are samples (Figure 1b), three different
relationships among massive data, are effective tools tasks can be performed: sample clustering, which in-
for microarray data analysis. volves grouping similar samples together to discover
Gene expression microarrays are silicon chips that classes or subclasses of samples; sample class predic-
simultaneously measure the expression levels of thou- tion, which involves predicting diseases or phenotypes
sands of genes. The description of technologies for of novel samples based on patterns learned from train-
constructing these chips and measuring gene expres- ing samples with known class labels; and gene selection,
sion levels is beyond the scope of this article (refer to which involves selecting a small number of genes from
Draghici, 2003, for an introduction). Each expression thousands of genes to reduce the dimensionality of the
level of a specific gene among thousands of genes data and improve the performance of classification and
measured in an experiment is eventually recorded as a clustering methods.
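As a purely illustrative sketch of these two orientations (the expression values, gene and sample names, and class labels below are invented, and the pandas library is assumed to be available), the same toy matrix can be held in either form and transposed between them:

    import pandas as pd

    # Toy expression values: rows are genes, columns are samples (the form of Figure 1a).
    genes_by_samples = pd.DataFrame(
        [[2.1, 0.4, 1.7],
         [0.9, 1.2, 0.8],
         [3.5, 2.9, 3.1]],
        index=["G1", "G2", "G3"],     # instances = genes
        columns=["S1", "S2", "S3"],   # features = samples
    )

    # Transposing gives the form of Figure 1b: instances = samples, features = genes.
    samples_by_genes = genes_by_samples.T

    # Class labels (e.g., disease vs. normal) attach to samples, one per row of Figure 1b.
    samples_by_genes["class"] = ["tumor", "normal", "tumor"]
    print(samples_by_genes)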


Figure 1. Microarray data matrix: (a) instances are genes and features are samples; (b) instances are samples and features are genes, optionally with a class-label column

MAJOR LINES OF RESEARCH AND DEVELOPMENT

In this part, we briefly review methods for each of the data-mining tasks identified earlier: gene clustering, sample clustering, sample class prediction, and gene selection. We discuss gene clustering and sample clustering together, for these two tasks are common; however, they are applied on microarray data from different directions.

Clustering

Clustering is a process of grouping similar samples, objects, or instances into clusters. Many clustering methods exist (for a review, see Jain, Murty, & Flynn, 1999; Parson, Ehtesham, & Liu, 2004). They can be applied to microarray data analysis for clustering genes or samples. In this article, we present three groups of frequently used clustering methods.

The first use of hierarchical clustering for gene clustering is described in Eisen et al. (1998). Each instance forms a cluster in the beginning, and the two most similar clusters are merged until all instances are in one single cluster. The clustering result takes the form of a tree structure, called a dendrogram, which can be broken at different levels by using domain knowledge. Tree structures are easy to understand and can reveal close relationships among resulting clusters, but they do not provide a unique partition among all the instances, because different ways to determine a basic level in the dendrogram can result in different clustering results.

Unlike hierarchical clustering methods, partition-based clustering methods divide the whole data into a fixed number of clusters. Examples are K-means (Herwig, et al., 1999), self-organizing maps (Tamayo, et al., 1999), and graph-based partitioning (Xu, Y., Olman, & Xu, D., 2002). K-means methods often require specification of the number of clusters, K, and the selection of K instances as the initial clusters. All instances are then partitioned into the K clusters, optimizing some objective function (e.g., inner-cluster similarity) by assigning each instance to the most similar cluster, which is determined by the distance between the instance and the mean of each cluster in the current iteration. Self-organizing maps (SOMs) are variations of K-means methods and require specification of the initial topology of K nodes to construct the map. In graph-based partitioning methods, a Minimum Spanning Tree (MST) is often constructed, and the clusters are generated by deleting the MST edges with the largest lengths. Graph-based partitioning methods do not heavily depend on the regularity of the geometric shape of cluster boundaries, as K-means and SOMs do.

Traditional clustering methods require that each instance belong to a single cluster, even though some instances may be only slightly relevant for the biological significance of their assigned clusters. Fuzzy C-means (Dembele & Kastner, 2003) applies a fuzzy partitioning method that assigns cluster membership values to instances; this process is called fuzzy clustering. It links each instance to all clusters via a real-valued vector of indexes. The value of each index lies between 0 and 1, where a value close to 1 indicates a strong association to the corresponding cluster, while a value close to 0 indicates no association. The vector of indexes thus defines the membership of an instance with respect to the various clusters.
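The following minimal sketch illustrates these two main flavors of clustering on toy data, using generic SciPy/scikit-learn routines rather than the specific procedures of the cited studies; the library calls, similarity measure, and numbers of clusters are all assumptions of the example.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    expr = rng.normal(size=(100, 8))   # toy matrix: 100 genes x 8 samples

    # Hierarchical (agglomerative) clustering of genes (Figure 1a): the linkage
    # matrix encodes the dendrogram, which is then cut at a chosen level.
    links = linkage(expr, method="average", metric="correlation")
    gene_clusters = fcluster(links, t=5, criterion="maxclust")

    # Partition-based clustering (K-means) of samples (Figure 1b), with the
    # number of clusters K fixed in advance.
    sample_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expr.T)
    print(gene_clusters[:10], sample_clusters)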
Sample Class Prediction


Apart from clustering methods, which do not require a priori knowledge about the classes of available instances, a classification method requires training instances with labeled classes, learns patterns that discriminate between various classes, and, ideally, correctly predicts the classes of unseen instances. Many classification methods can be applied to predict diseases or phenotypes of novel samples from microarray data. We present four commonly used methods in this section.

Linear discriminant analysis (LDA): For an m × n gene expression matrix X (m is the number of samples, and n is the number of genes), LDA seeks linear combinations, xa, of sample vectors xi = (xi1, ..., xin) with large ratios of between-class to within-class sums of squares. In other words, it tries to maximize the ratio aᵀBa/aᵀWa, where B and W denote, respectively, the n × n matrices of between-class and within-class sums of squares (Dudoit, Fridlyand, & Speed, 2000).

Nearest neighbor (NN) usually does not learn during the training phase. Only when it is required to classify a new sample does NN search the data to find the nearest neighbor for the new sample, using the class label of the nearest neighbor to predict the class label of the new sample. K-NN makes the prediction for a new sample based on the most common class label among the K training samples most similar to the new sample. Examples can be found in Pomeroy et al. (2002).

Decision trees classify samples by building a tree-like structure. Specifically, they recursively split samples into two child branches based on the values of a selected feature, starting with all the samples. Each leaf node of the tree is pure in terms of classes, and the resulting partition corresponds to a classifier. By limiting the number of consecutive branches, they can produce more generalized classifiers. Different forms of trees exist. In Wu et al. (2003), classification and regression trees are applied for sample classification.

Support vector machines (SVMs) have also been shown effective in sample classification (Brown, et al., 2000). They try to separate a set of training samples of two different classes with a hyperplane in an n-dimensional space defined by n features (genes). If no separating hyperplane exists in the original space, a kernel function is used to map the samples into a higher dimensional space where a separating hyperplane exists. Complex kernel functions that provide nonlinear mappings result in nonlinear classifiers. SVMs avoid overfitting by selecting a hyperplane that is maximally distant from the training samples of the two different classes, called the maximum margin separating hyperplane, from among the many hyperplanes that can separate the two classes.
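As a hedged illustration of how such classifiers are typically applied to sample class prediction, the sketch below runs generic scikit-learn k-NN and linear SVM classifiers on toy data with leave-one-out cross-validation, a common choice when the number of samples is small; the data, parameter values, and library calls are assumptions of the example, not the setups used in the cited studies.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 500))     # toy data: 40 samples x 500 genes
    y = rng.integers(0, 2, size=40)    # binary class labels (e.g., tumor vs. normal)

    # K-NN: essentially no training phase; a prediction uses the K most similar samples.
    knn = KNeighborsClassifier(n_neighbors=3)

    # Linear-kernel SVM: a maximum-margin separating hyperplane in gene space.
    svm = SVC(kernel="linear", C=1.0)

    # With few samples, leave-one-out cross-validation is a common way to
    # estimate prediction accuracy.
    for name, clf in [("k-NN", knn), ("SVM", svm)]:
        accuracy = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
        print(name, round(accuracy, 2))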
Gene Selection

The nature of relatively high dimensionality but a small sample size of data in sample classification and clustering can cause the problem of the curse of dimensionality and overfitting of the training data (Dougherty, 2001). Therefore, selecting a small number of discriminative genes from thousands of genes is essential for successful sample classification and clustering. Feature selection methods (for a review, see Blum & Langley, 1997; Liu & Motoda, 1998) can be applied in microarray data analysis for gene selection.

Among gene selection methods, earlier methods often evaluate genes in isolation without considering gene-to-gene correlation. They rank genes according to their individual relevance or discriminative power to the targeted classes and select top-ranked genes. Some methods based on statistical tests or information gain have been employed in Golub et al. (1999) and Model (2001). However, a number of studies (Ding & Peng, 2003; Xiong, Fang, & Zhao, 2001) point out that simply combining a highly ranked gene with another highly ranked gene often does not form a good gene set, because some highly correlated genes could be redundant. Removing redundant genes among selected ones can achieve a better representation of the characteristics of the targeted classes and lead to improved classification accuracy. Methods that handle gene redundancy based on pair-wise correlation analysis among genes can be found in Ding and Peng (2003); Xing, Jordan, and Karp (2001); and Yu and Liu (2004). A gene selection method for unlabeled samples is also proposed and shown effective for sample clustering in Xing and Karp (2001).
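The rank-then-filter idea can be sketched as follows; this is only a simplified illustration (the t-statistic ranking, correlation threshold, and greedy filter are choices made for the example), not the specific algorithms of Golub et al. (1999), Ding and Peng (2003), or Yu and Liu (2004).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    X = rng.normal(size=(40, 500))     # toy data: 40 samples x 500 genes
    y = rng.integers(0, 2, size=40)    # class labels

    # Step 1: rank genes individually by a two-sample t statistic.
    t_scores = np.abs(stats.ttest_ind(X[y == 0], X[y == 1], axis=0).statistic)
    ranked = np.argsort(t_scores)[::-1]

    # Step 2: greedily keep a top-ranked gene only if it is not highly correlated
    # with an already-selected gene (a crude pair-wise redundancy filter).
    selected, threshold = [], 0.8
    for g in ranked:
        if all(abs(np.corrcoef(X[:, g], X[:, s])[0, 1]) < threshold for s in selected):
            selected.append(g)
        if len(selected) == 20:
            break
    print(selected)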

FUTURE TRENDS

Traditional data-mining methods are often designed for data where the number of instances is significantly larger than the number of features. In microarray data analysis, however, the number of features (genes) is huge, and the number of instances (samples) is relatively small, for tasks of sample clustering or classification. This unique characteristic of microarray data presents a challenge to the scalability of current data-mining methods to high dimensionality. In addition, the relative shortage of instances in the context of high dimensionality often causes many methods to overfit the training data. Therefore, besides improving current data-mining methods, substantial research efforts are needed to come up with new methods specifically designed for microarray data.

CONCLUSION


Gene expression microarrays are a revolutionary technology with great potential to provide accurate medical diagnostics, develop cures for diseases, and produce a detailed genome-wide molecular portrait of cellular states (Piatetsky-Shapiro & Tamayo, 2003). Data-mining methods are effective tools to turn massive raw data from microarray experiments into biologically important insights. In this article, we provide a brief introduction to microarray data analysis and a concise review of various data-mining methods for microarray data.

REFERENCES

Alon, U. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Acad. Sci., 96 (pp. 6745-6750), USA.

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245-271.

Brown, M. et al. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Acad. Sci., 97 (pp. 262-267), USA.

Dembele, D., & Kastner, P. (2003). Fuzzy C-means method for clustering microarray data. Bioinformatics, 19(8), 973-980.

Ding, C., & Peng, H. (2003). Minimum redundancy feature selection from microarray gene expression data. Proceedings of the Computational Systems Bioinformatics Conference (pp. 523-529).

Dougherty, E. R. (2001). Small sample issue for microarray-based classification. Functional Genomics, 2, 28-34.

Draghici, S. (2003). Data analysis tools for DNA microarrays. Chapman & Hall/CRC.

Dudoit, S., Fridlyand, J., & Speed, T. P. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data (Tech. Rep. No. 576). Berkeley, CA: University of California at Berkeley, Department of Statistics.

Eisen, M. (1998). Clustering analysis and display of genome-wide expression patterns. Proceedings of the National Acad. Sci., 95 (pp. 14863-14868), USA.

Golub, T. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.

Herwig, R. (1999). Large-scale clustering of cDNA fingerprints. Genomics, 66, 249-256.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Boston: Kluwer Academic.

Model, F. (2001). Feature selection for DNA methylation based cancer classification. Bioinformatics, 17, 154-164.

Parson, L., Ehtesham, H., & Liu, H. (2004). Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations, 6(1), 90-105.

Piatetsky-Shapiro, G., & Tamayo, P. (2003). Microarray data mining: Facing the challenges. SIGKDD Explorations, 5(2), 1-5.

Pomeroy, S. L. (2002). Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature, 415, 436-442.

Schena, M. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467-470.

Tamayo, P. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Acad. Sci., 96 (pp. 2907-2912), USA.

Wu, B. (2003). Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 1636-1643.

Xing, E., Jordan, M., & Karp, R. (2001). Feature selection for high-dimensional genomic microarray data. Proceedings of the 18th International Conference on Machine Learning (pp. 601-608).

Xing, E., & Karp, R. (2001). CLIFF: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. Bioinformatics, 17, S306-S315.

Xiong, M., Fang, Z., & Zhao, J. (2001). Biomarker identification by feature wrappers. Genome Research, 11, 1878-1887.

Xu, Y., Olman, V., & Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees. Bioinformatics, 18(4), 536-545.

Yu, L., & Liu, H. (2004). Redundancy based feature selection for microarray data. Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.


KEY TERMS

Classification: The process of predicting the classes of unseen instances based on patterns learned from available instances with predefined classes.

Clustering: The process of grouping instances into clusters so that instances are similar to one another within a cluster but dissimilar to instances in other clusters.

Data Mining: The application of analytical methods and tools to data for the purpose of discovering patterns, statistical or predictive models, and relationships among massive data.

Feature Selection: A process of choosing an optimal subset of features from original features, according to a certain criterion.

Gene: A hereditary unit consisting of a sequence of DNA that contains all the information necessary to produce a molecule that performs some biological function.

Gene Expression Microarrays: Silicon chips that simultaneously measure the expression levels of thousands of genes.

Genome: All the genetic information or hereditary material possessed by an organism.


Data Mining with Cubegrades


Amin A. Abdulghani
Quantiva, USA

INTRODUCTION

Much interest has been expressed in database mining by using association rules (Agrawal, Imielinski, & Swami, 1993). In this article, I provide a different view of association rules, which are referred to as cubegrades (Imielinski, Khachiyan, & Abdulghani, 2002).

An example of a typical association rule states that, say, in 23% of supermarket transactions (so-called market basket data) customers who buy bread and butter also buy cereal (that percentage is called confidence) and that in 10% of all transactions, customers buy bread and butter (this percentage is called support). Bread and butter represent the body of the rule, and cereal constitutes the consequent of the rule. This statement is typically represented as a probabilistic rule. But association rules can also be viewed as statements about how the cell representing the body of the rule is affected by specializing it with the addition of an extra constraint expressed by the rule's consequent. Indeed, the confidence of an association rule can be viewed as the ratio of the support drop, when the cell corresponding to the body of a rule (in this case, the cell of transactions including bread and butter) is augmented with its consequent (in this case, cereal). This interpretation gives association rules a dynamic flavor reflected in a hypothetical change of support affected by specializing the body cell to a cell whose description is a union of body and consequent descriptors. For example, the earlier association rule can be interpreted as saying that the count of transactions including bread and butter drops to 23% of the original when restricted (rolled down) to the transactions including bread, butter, and cereal. In other words, this rule states how the count of transactions supporting buyers of bread and butter is affected by buying cereal as well.

With such interpretation in mind, a much more general view of association rules can be taken, when support (count) can be replaced by an arbitrary measure or aggregate, and the specialization operation can be substituted with a different delta operation. Cubegrades capture this generalization. Conceptually, this is very similar to the notion of gradients used in calculus. By definition, the gradient of a function between the domain points x1 and x2 measures the ratio of the delta change in the function value over the delta change between the points. For a given point x and function f(), it can be interpreted as a statement of how a change in the value of x (Δx) affects a change in the value of the function (Δf(x)).

From another viewpoint, cubegrades can also be considered as defining a primitive for cubes. An n-dimensional cube is a group of k-dimensional (k<=n) cuboids arranged by the dimensions of the data. A cell represents an association of a measure m (e.g., total sales) with a member of every dimension. The scope of interest in Online Analytical Processing (OLAP) is to evaluate one or more measure values of the cells in the cube. Cubegrades allow a broader, more dynamic view. In addition to evaluating the measure values in a cell, they evaluate how the measure values change or are affected in response to a change in the dimensions of a cell. Traditionally, OLAP has had operators such as drill-downs and roll-ups defined, but the cubegrade operator differs from them as it returns a value measuring the effect of the operation. Additional operators have been proposed to evaluate/measure cell interestingness (Sarawagi, 2000; Sarawagi, Agrawal, & Megiddo, 1998). For example, Sarawagi et al. compute an anticipated value for a cell by using the neighborhood values, and a cell is considered an exception if its value is significantly different from its anticipated value. The difference is that cubegrades perform a direct cell-to-cell comparison.
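To make this reading of confidence as a relative drop in COUNT concrete, here is a small, purely illustrative Python sketch; the transactions and item names are invented for the example.

    # Toy market-basket transactions (invented for illustration).
    transactions = [
        {"bread", "butter", "cereal"},
        {"bread", "butter"},
        {"bread", "cereal"},
        {"milk"},
        {"bread", "butter", "milk"},
    ]

    def count(itemset):
        """COUNT measure of the cell whose description is the given itemset."""
        return sum(itemset <= t for t in transactions)

    body = {"bread", "butter"}
    drilled_down = body | {"cereal"}        # the body cell specialized by the consequent

    support = count(body) / len(transactions)
    confidence = count(drilled_down) / count(body)   # the relative drop in COUNT
    print(support, confidence)              # 0.6 and ~0.33 on this toy data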
BACKGROUND

An association or propositional rule can be defined in terms of cube cells. It can be defined as a quadruple (body, consequent, support, confidence) where body and consequent are cells over disjoint sets of attributes, support is the number of records satisfying the body, and confidence is the ratio of the number of records that satisfy the body and the consequent to the number of records that satisfy just the body. You can also consider an association rule as a statement about a relative change of a measure, COUNT, when specializing or drilling down the cell denoted by the body to the cell denoted by the body + consequent. The confidence of the rule measures how the consequent affects the support when drilling down the body. These association rules can be generalized in two ways:


• By allowing relative changes in other measures, instead of just confidence, to be returned as part of the rule.
• By allowing cell modifications to be able to occur in different directions instead of just specializations (or drill-downs).

These generalized cell modifications are denoted as cubegrades. A cubegrade expresses how a change in the structure of a given cell affects a set of predefined measures. The original cell being modified is referred to as the source, and the modified cell as the target.

More formally, a cubegrade is a 5-tuple (source, target, measures, value, delta-value) where

• source and target are cube cells
• measures is the set of measures that are evaluated both in the source as well as in the target
• value is a function, value: measures → R, that evaluates measure m ∈ measures in the source
• delta-value is also a function, delta-value: measures → R, that computes the ratio of the value of m ∈ measures in the target versus the source

A cubegrade can visually be represented as a rule form:

Source → target, [measures, value, delta-value]

Define a descriptor to be an attribute-value pair of the form dimension = value if the dimension is a discrete attribute, or dimension = [lo, hi] if the attribute is a dimension attribute. The cubegrades are distinguished as three types:

• Specializations: A cubegrade is a specialization if the set of descriptors of the target is a superset of those in the source. Within the context of OLAP, the target cell is termed a drill-down of the source.
• Generalizations: A cubegrade is a generalization if the set of descriptors of the target cell is a subset of those in the source. Here, in OLAP, the target cell is termed a roll-up of the source.
• Mutations: A cubegrade is a mutation if the target and source cells have the same set of attributes but differ on the descriptor values (they are union compatible, so to speak, as the term has been used in relational algebra).

Figure 1 illustrates the operations of these cubegrades.

Figure 1. Cubegrade: specialization, generalization, and mutation. (The figure shows a source cell A=a1, B=b1, C=c1 [Count=100, Avg(M1)=20] with a generalization to A=a1, B=b1 [Count=150, Avg(M1)=25], a mutation to A=a1, B=b1, C=c2 [Count=70, Avg(M1)=30], and a specialization to A=a1, B=b1, C=c1, D=d1 [Count=50, Avg(M1)=18].)

Following, I illustrate some specific examples to explain the use of these cubegrades:

• (Specialization Cubegrade) The average age of buyers who purchase $20 to $30 worth of milk monthly drops by 10% among buyers who also buy cereal:
(salesMilk=[$20,$30]) → (salesMilk=[$20,$30], salesCereal=[$1,$5])
[AVG(Age), AVG(Age) = 23, DeltaAVG(Age) = 90%]

• (Mutation Cubegrade) The average amount spent on milk drops by 30% when moving from suburban buyers to urban buyers:
(areaType=suburban) → (areaType=urban)
[AVG(salesMilk), AVG(salesMilk) = $12.40, DeltaAVG(salesMilk) = 70%]
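As a minimal sketch of what evaluating a single cubegrade involves, the following pandas code computes the value and delta-value of a mutation pair on toy data; the table, column names, and helper function are invented for the example, and this is a direct application of the definition rather than the cube-wide generation algorithm discussed in the next section.

    import pandas as pd

    # Toy record-level data; dimension and measure names are illustrative only.
    df = pd.DataFrame({
        "areaType":  ["suburban", "suburban", "urban", "urban", "urban"],
        "salesMilk": [14.0, 16.0, 9.0, 11.0, 13.0],
    })

    def cell(data, **descriptors):
        """Select the records belonging to a cube cell given its descriptors."""
        mask = pd.Series(True, index=data.index)
        for dim, value in descriptors.items():
            mask &= data[dim] == value
        return data[mask]

    source = cell(df, areaType="suburban")
    target = cell(df, areaType="urban")     # a mutation: same attribute, different value

    value = source["salesMilk"].mean()                 # AVG(salesMilk) in the source
    delta_value = target["salesMilk"].mean() / value   # ratio of target to source
    print(round(value, 2), round(delta_value, 2))      # 15.0 and ~0.73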
[ AVG(salesMilk), AVG(salesMilk) = $12.40,
DeltaAVG(salesMilk)= 70%]
MAIN THRUST

Similar to association rules (Agrawal & Srikant, 1994), the generation of cubegrades can be divided into two phases: (a) generation of significant cells (rather than frequent sets) satisfying the source cell conditions and (b) computation of cubegrades from the source (instead of computing association rules from frequent sets) satisfying the joint conditions between source and target and the target conditions.

The first task is similar to the computation of iceberg cube queries (Beyer & Ramakrishnan, 1999; Han, Pei, Dong, & Wang, 2001; Xin, Han, Li, & Wah, 2003). The fundamental property that allows for pruning in these computations is called monotonicity of the query:


Let D be a database and X ⊆ D be a cell. A query Q is monotonic at X if the condition Q(X) is FALSE implies that Q(X′) is FALSE for any X′ ⊆ X.

However, as described by Imielinski et al. (2002), determining whether a query Q is monotonic in terms of this definition is an NP-hard problem for many simple classes of queries. To work around this problem, the authors introduced another notion of monotonicity, referred to as view monotonicity of a query. Suppose you have cuboids defined on a set S of dimensions and measures. A view V on S is an assignment of values to the elements of the set. If the assignment holds for the dimensions and measures in a given cell X, then V is a view for X on the set S. So, for example, if in a cell of rural buyers, the average sales of bread for 20 buyers is 15, then the view on the set {areaType, COUNT(), AVG(salesBread)} for the cell is {areaType=rural, COUNT()=20, AVG(salesBread)=15}. Extending the definition, a view on a query is an assignment of values for the set of dimension and measure attributes of the query expression. A query Q() is view monotonic on view V if, for any cell X in any database D such that V is the view for X, the condition Q is FALSE for X implies that Q is FALSE for all X′ ⊆ X. An important property of view monotonicity is that the time and space required for checking it for a query depends on the number of terms in the query, not on the size of the database or the number of its attributes. Because most queries typically have few terms, it would be useful in many practical situations. The method presented can be used for checking view monotonicity for queries that include constraints of type (Agg {<, >, =, !=} c), where c is a constant and Agg can be MIN, SUM, MAX, AVERAGE, COUNT, an aggregate that is a higher order moment about the origin, or an aggregate that is an integral of a function on a single attribute.

Consider a hypothetical query asking for cubes with 1000 or more buyers and with total milk sales less than $50,000. In addition, the average milk sales per customer should be between $20 and $50, with maximum sales greater than $75. This query can be expressed as follows:

    COUNT(*) >= 1000 and AVG(salesMilk) >= 20 and AVG(salesMilk) < 50 and
    MAX(salesMilk) >= 75 and SUM(salesMilk) < 50K

Suppose, while performing bottom-up cube computation, you have a cell C with the following view V: (Count = 1200; AVG(salesMilk) = 50; MAX(salesMilk) = 80; MIN(salesMilk) = 30; SUM(salesMilk) = 60000). Using the method for checking view monotonicity, it can be shown that some cell C′ of C can exist (though this subcell is not guaranteed to exist in this database) with 1000 <= count < 1075 for which this query can be satisfied. Thus, this query cannot be pruned on the cell. However, if the view for C is (Count = 1200; AVG(salesMilk) = 57; MAX(salesMilk) = 80; MIN(salesMilk) = 30; SUM(salesMilk) = 68400), then it can be shown that there cannot exist any subcell C′ of C in any database for which the original query can be satisfied. Thus, the query can be pruned on cell C.
that is an integral of a function on a single attribute. following steps:
Consider a hypothetical query asking for cubes with
1000 or more buyers and with total milk sales less than Perform for the set of identified source cells the
$50,000. In addition, the average milk sales per customer lowest common delta elimination such that the
should be between $20 and $50, with maximum sales greater resulting target condition does not exclude any
than $75. This query can be expressed as follows: possible target cells.
Perform a bottom-up iceberg query for the target
COUNT(*)>=1000 and AVG(salesMilk)>=20 and cells based on the target condition. Define
AVG(salesMilk)<50 and LiveSet(T) of a target cell T as the candidate set of
MAX(salesMilk)>=75 and SUM(saleMilk)<50K source cells that can possibly match or join with the
target. A target cell, T, may identify a source, S, in
Suppose, while performing bottom-up cube computa- its LiveSet to be prunable based on its monotonic-
tion, you have a cell C with the following view V (Count ity and thus removable from its LiveSet. In such a
=1200; AVG(salesMilk) = 50; MAX(salesMilk) = 80; case, all descendants of T would also not include
MIN(salesMilk) = 30; SUM(salesMilk) =60000). Using the S in their LiveSets.
method for checking view monotonicity, it can be shown Perform a join of the target and each of its LiveSets
that some cell C of C can exist (though this subcell is not source cells and for the resulting cubegrade; check
guaranteed to exist in this database) with 1000<= count whether it satisfies the join criteria and the delta-
<1075 for which this query can be satisfied. Thus, this value condition for the query.
query cannot be pruned on the cell.


In a typical scenario, it is not expected that users would be asking for cubegrades per se. Rather, it may be more likely that they pose a query on how a given delta change affects a set of cells, which cells are affected by a given delta change, or what delta changes affect a set of cells in a prespecified manner.

Further, more complicated sets of applications can be implemented by using cubegrades (Abdulghani, 2001). For example, one may be interested in finding cells that remain stable and are not significantly affected by generalization, specialization, or mutation. An illustration for such a situation would be to find cells that remain stable on the blood pressure measure and are not affected by a different specialization on age or area demographics. Another application could be to find effective factors, a set of specialization, generalization, or mutation descriptors that are effective in changing a measure value m by a significant ratio. For example, one may want to find effective factors in decreasing a cholesterol level across a set of selected cells.

FUTURE TRENDS

A major challenge for cubegrade processing is its computational complexity. Potentially, an exponential number of source/target cells can be generated. A positive development in this direction is the work done on Quotient Cubes (Lakshmanan, Pei, & Han, 2002). This work provides a method for partitioning and compressing the cells of a data cube into equivalence classes such that the resulting classes have cells covering the same set of tuples and preserving the cube's semantic roll-up/drill-down. In this context, we can reduce the number of cells generated for the source and target. Further, the pairings for the cubegrade can be reduced by restricting the source and target to different classes.

Another related challenge for cubegrades is to identify the set of interesting cubegrades (cubegrades that are somewhat surprising). Insights into this problem can be obtained from similar work done in the context of association rules (Bayardo & Agrawal, 1999; Liu, Ma, & Yu, 2001), the main difference being that for cubegrades, measures (possibly a combination of them) other than COUNT are involved, whereas for association rules, the interestingness functions are based on the count function. In addition, cubegrades have cell modification in multiple directions, but association rules are restricted to specializations.

As cubegrades are better understood, wider applications of the concept to various domains are expected to be seen. Association rules have been applied with success to such areas as intrusion detection (Lee, Stolfo, & Mok, 1998) and microarray data (Tuzhilin & Adomavicius, 2002). Applying the cubegrade paradigm in such domains provides an opportunity for richer mining results.

CONCLUSION

In this article, I look at a generalization of association rules referred to as cubegrades. These generalizations include allowing the evaluation of relative changes in other measures, rather than just confidence, to be returned, as well as allowing cell modifications to occur in different directions. The additional directions that I consider here include generalizations, which modify cells towards the more general cell with fewer descriptors, and mutations, which modify the descriptors of a subset of the attributes in the original cell definition with the others remaining the same. The paradigm allows you to ask queries that were not possible through association rules. The downside is that it comes with the price of relatively increased computation/storage costs that need to be tackled with innovative methods.

REFERENCES

Abdulghani, A. (2001). Cubegrades-generalization of association rules to mine large datasets. Doctoral dissertation, Rutgers University, New Brunswick, NJ. Dissertation Abstracts International, DAI-B 62/10, UMI Number 3027950.

Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference (pp. 207-216), USA.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the International Conference on Very Large Data Bases (pp. 487-499), Chile.

Bayardo, R., & Agrawal, R. (1999). Mining the most interesting rules. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 145-154), USA.

Beyer, K. S., & Ramakrishnan, R. (1999). Bottom-up computation of sparse and iceberg CUBEs. Proceedings of the ACM SIGMOD Conference (pp. 359-370), USA.

Dong, G., Han, J., Lam, J. M., Pei, J., & Wang, K. (2001). Mining multi-dimensional constrained gradients in data cubes. Proceedings of the International Conference on Very Large Data Bases (pp. 321-330), Italy.


Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. Proceedings of the International Conference on Data Engineering (pp. 152-159), USA.

Han, J., Pei, J., Dong, G., & Wang, K. (2001). Efficient computation of iceberg cubes with complex measures. Proceedings of the ACM SIGMOD Conference (pp. 1-12), USA.

Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). Cubegrades: Generalizing association rules. Journal of Data Mining and Knowledge Discovery, 6(3), 219-257.

Lakshmanan, L. V. S., Pei, J., & Han, J. (2002). Quotient cube: How to summarize the semantics of a data cube. Proceedings of the International Conference on Very Large Data Bases (pp. 778-789), China.

Lee, W., Stolfo, S. J., & Mok, K. W. (1998). Mining audit data to build intrusion detection models. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 66-72), USA.

Liu, B., Ma, Y., & Yu, S. P. (2001). Discovering unexpected information from your competitors' Web sites. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 144-153), USA.

Sarawagi, S. (2000). User-adaptive exploration of multidimensional data. Proceedings of the International Conference on Very Large Data Bases (pp. 307-316).

Sarawagi, S., Agrawal, R., & Megiddo, N. (1998). Discovery-driven exploration of OLAP data cubes. Proceedings of the International Conference on Extending Database Technology (pp. 168-182), Spain.

Tuzhilin, A., & Adomavicius, G. (2002). Handling very large numbers of association rules in the analysis of microarray data. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 396-404), Canada.

Xin, D., Han, J., Li, X., & Wah, B. W. (2003). Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. Proceedings of the International Conference on Very Large Data Bases (pp. 476-487), Germany.

KEY TERMS

Cubegrade: A cubegrade is a 5-tuple (source, target, measures, value, delta-value) where
• source and target are cells
• measures is the set of measures that are evaluated both in the source as well as in the target
• value is a function, value: measures → R, that evaluates measure m ∈ measures in the source
• delta-value is a function, delta-value: measures → R, that computes the ratio of the value of m ∈ measures in the target versus the source

Drill-Down: A cube operation that allows users to navigate from summarized cells to more detailed cells.

Generalizations: A cubegrade is a generalization if the set of descriptors of the target cell is a subset of the set of attribute-value pairs of the source cell.

Iceberg Cubes: The set of cells in a cube that satisfies an iceberg query.

Iceberg Query: A query on top of a cube that asks for aggregates above a certain threshold.

Mutations: A cubegrade is a mutation if the target and source cells have the same set of attributes but differ on the values.

Query Monotonicity: A query Q() is monotonic at a cell X if the condition Q(X) is FALSE implies that Q(X′) is FALSE for any cell X′ ⊆ X.

Roll-Up: A cube operation that allows users to aggregate from detailed cells to summarized cells.

Specialization: A cubegrade is a specialization if the set of attribute-value pairs of the target cell is a superset of the set of attribute-value pairs of the source cell.

View: A view V on S is an assignment of values to the elements of the set. If the assignment holds for the dimensions and measures in a given cell X, then V is a view for X on the set S.

View Monotonicity: A query Q is view monotonic on view V if, for any cell X in any database D such that V is the view for X for query Q, the condition Q is FALSE for X implies that Q is FALSE for all X′ ⊆ X.


Data Mining with Incomplete Data


Hai Wang
Saint Mary's University, Canada

Shouhong Wang
University of Massachusetts Dartmouth, USA

INTRODUCTION

Survey is one of the common data acquisition methods for data mining (Brin, Rastogi & Shim, 2003). In data mining one can rarely find a survey data set that contains complete entries of each observation for all of the variables. Commonly, surveys and questionnaires are often only partially completed by respondents. The possible reasons for incomplete data could be numerous, including negligence, deliberate avoidance for privacy, ambiguity of the survey question, and aversion. The extent of damage of missing data is unknown when it is virtually impossible to return the survey or questionnaires to the data source for completion, but it is one of the most important parts of knowledge for data mining to discover. In fact, missing data is an important debatable issue in the knowledge engineering field (Tseng, Wang, & Lee, 2003).

In mining a survey database with incomplete data, patterns of the missing data as well as the potential impacts of these missing data on the mining results constitute valuable knowledge. For instance, a data miner often wishes to know how reliable a data mining result is, if only the complete data entries are used; when and why certain types of values are often missing; what variables are correlated in terms of having missing values at the same time; what reason for incomplete data is likely; etc. These valuable pieces of knowledge can be discovered only after the missing part of the data set is fully explored.

BACKGROUND

There have been three traditional approaches to handling missing data in statistical analysis and data mining. One of the convenient solutions to incomplete data is to eliminate from the data set those records that have missing values (Little & Rubin, 2002). This, however, ignores potentially useful information in those records. In cases where the proportion of missing data is large, the data mining conclusions drawn from the screened data set are more likely misleading.

Another simple approach of dealing with missing data is to use a generic "unknown" for all missing data items. However, this approach does not provide much information that might be useful for interpretation of missing data.

The third solution to dealing with missing data is to estimate the missing value in the data item. In the case of time series data, interpolation based on two adjacent data points that are observed is possible. In general cases, one may use some expected value in the data item based on statistical measures (Dempster, Laird, & Rubin, 1997). However, data in data mining are commonly of the types of ranking, category, multiple choices, and binary. Interpolation and use of an expected value for a particular missing data variable in these cases are generally inadequate. More importantly, a meaningful treatment of missing data shall always be independent of the problem being investigated (Batista & Monard, 2003).

More recently, there have been mathematical methods for finding the salient correlation structure, or aggregate conceptual directions, of a data set with missing data (Aggarwal & Parthasarathy, 2001; Parthasarathy & Aggarwal, 2003). These methods make themselves distinct from the traditional approaches of treating missing data by focusing on the collective effects of the missing data instead of individual missing values. However, these statistical models are data-driven, instead of problem-domain-driven. In fact, a particular data mining task is often related to its specific problem domain, and a single generic conceptual construction algorithm is insufficient to handle a variety of data mining tasks.

MAIN THRUST

There have been two primary approaches to data mining with incomplete data: conceptual construction and enhanced data mining.

Conceptual Construction with Incomplete Data

Conceptual construction with incomplete data reveals the patterns of the missing data as well as the potential impacts of these missing data on the mining results based only on the complete data. Conceptual construction on incomplete data is a knowledge development process. To construct new concepts on incomplete data, the data miner needs to identify a particular problem as a base for the construction. According to Wang and Wang (2004), conceptual construction is carried out through two phases. First, data mining techniques (e.g., cluster analysis) are applied to the data set with complete data to reveal the unsuspected patterns of the data, and the problem is then articulated by the data miner. Second, the incomplete data with missing values related to the problem are used to construct new concepts. In this phase, the data miner evaluates the impacts of missing data on the identification of the problem and develops knowledge related to the problem. For example, suppose a data miner is investigating the profile of the consumers who are interested in a particular product. Using the complete data, the data miner has found that variable i (e.g., income) is an important factor of the consumers' purchasing behavior. To further verify and improve the data mining result, the data miner must develop new knowledge through mining the incomplete data. Four typical concepts as results of knowledge discovery in data mining with incomplete data are described as follows (a small computational sketch of these indices appears after the list):

(1) Reliability: The reliability concept reveals the scope of the missing data in terms of the problem identified based only on complete data. For instance, in the above example, to develop the reliability concept, the data miner can define the index VM(i)/VC(i), where VM(i) is the number of missing values in variable i, and VC(i) is the number of samples used for the problem identification in variable i. Accordingly, the higher VM(i)/VC(i) is, the lower the reliability of the factor would be.

(2) Hiding: The concept of hiding reveals how likely an observation with a certain range of values in one variable is to have a missing value in another variable. For instance, in the above example, the data miner can define the index VM(i) | x(j) ∈ (a,b), where VM(i) is the number of missing values in variable i, x(j) is the occurrence of variable j (e.g., education years), and (a,b) is the range of x(j), and use this index to disclose the hiding relationships between variables i and j, say, more than two thousand records have missing values in the variable income given the value of education years ranging from 13 to 19.

(3) Complementing: The concept of complementing reveals what variables are more likely to have missing values at the same time; that is, the correlation of missing values related to the problem being investigated. For instance, in the above example, the data miner can define the index VM(i,j)/VM(i), where VM(i,j) is the number of missing values in both variables i and j, and VM(i) is the number of missing values in variable i. This concept discloses the correlation of two variables in terms of missing values. The higher the value VM(i,j)/VM(i) is, the stronger the correlation of missing values would be.

(4) Conditional Effects: The concept of conditional effects reveals the potential changes to the understanding of the problem caused by the missing values. To develop the concept of conditional effects, the data miner assumes different possible values for the missing values, and then observes the possible changes of the nature of the problem. For instance, in the above example, the data miner can define the index ΔP | z(i) = k, where ΔP is the change of the size of the target consumer group perceived by the data miner, z(i) represents all missing values of variable i, and k is the possible value variable i might have for the survey. Typically, k = {max, min, p}, where max is the maximal value of the scale, min is the minimal value of the scale, and p is the random variable with the same distribution function as the values in the complete data. By setting different possible values of k for the missing values, the data miner is able to observe the change of the size of the consumer group and redefine the problem.
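The following toy pandas sketch shows how the reliability, hiding, and complementing indices above might be tallied from a survey table with missing entries; the column names, values, and ranges are invented for the example, and this is only an illustration of the indices, not code from the cited work.

    import numpy as np
    import pandas as pd

    # Toy survey data with missing values (NaN); columns are illustrative only.
    survey = pd.DataFrame({
        "income":    [55_000, np.nan, 72_000, np.nan, 41_000, np.nan],
        "education": [12, 16, 18, 14, np.nan, 15],
    })

    # Reliability index VM(i)/VC(i): missing count versus complete count for a variable.
    vm_income = survey["income"].isna().sum()
    vc_income = survey["income"].notna().sum()
    reliability_index = vm_income / vc_income

    # Hiding index VM(income) | education in (13, 19): how often income is missing
    # among records whose education falls in the given range.
    in_range = survey["education"].between(13, 19)
    hiding_index = survey.loc[in_range, "income"].isna().sum()

    # Complementing index VM(i,j)/VM(i): missing together versus missing in i.
    both_missing = (survey["income"].isna() & survey["education"].isna()).sum()
    complementing_index = both_missing / vm_income

    print(reliability_index, hiding_index, complementing_index)

A conditional-effects analysis would then rerun the mining step after substituting max, min, or randomly drawn values for the missing entries and compare the resulting pictures of the problem.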
Enhanced Data Mining with Incomplete Data

The second primary approach to data mining with incomplete data is enhanced data mining, in which incomplete data are fully utilized. Enhanced data mining is carried out through two phases. In the first phase, observations with missing data are transformed into fuzzy observations. Since missing values make the observation fuzzy, according to fuzzy set theory (Zadeh, 1978), an observation with missing values can be transformed into fuzzy patterns that are equivalent to the observation. For instance, suppose there is an observation A = X(x1, x2, ..., xc, ..., xm), where xc is the variable with the missing value, and xc ∈ {r1, r2, ..., rp}, where rj (j = 1, 2, ..., p) is a possible occurrence of xc. Let μj = Pj(xc = rj) be the fuzzy membership (or possibility) that xc belongs to rj (j = 1, 2, ..., p), with Σj μj = 1. Then μj[X | (xc = rj)] (j = 1, 2, ..., p) are fuzzy patterns that are the equivalence of the observation A.
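As a purely illustrative sketch of this transformation (the attribute, its possible values, and the way the memberships are estimated below are assumptions of the example), an observation with one missing categorical value can be expanded into weighted fuzzy patterns like this:

    from collections import Counter

    # Toy survey records; "income_band" is a categorical variable that may be missing.
    complete = [
        {"age": 34, "income_band": "low"},
        {"age": 45, "income_band": "medium"},
        {"age": 29, "income_band": "medium"},
        {"age": 51, "income_band": "high"},
    ]
    incomplete = {"age": 40, "income_band": None}   # observation A with x_c missing

    # Estimate the memberships mu_j from the distribution of the complete data,
    # so that they sum to 1 over the possible occurrences r_j.
    counts = Counter(record["income_band"] for record in complete)
    total = sum(counts.values())

    # Each fuzzy pattern is the observation with x_c = r_j, weighted by mu_j.
    fuzzy_patterns = [
        ({**incomplete, "income_band": value}, count / total)
        for value, count in counts.items()
    ]
    for pattern, membership in fuzzy_patterns:
        print(membership, pattern)

These weighted patterns, together with the complete observations, would then feed the second phase described next.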
In the second phase of enhanced data mining, all fuzzy patterns, along with the complete data, are used for data mining using tools such as self-organizing maps (SOM) (Deboeck & Kohonen, 1998; Kohonen, 1989; Vesanto & Alhoniemi, 2000) and other types of neural networks (Wang, 2000, 2002).


These tools used for enhanced data mining are different from the original ones in that they are capable of retaining information of fuzzy membership for each fuzzy pattern. Wang (2003) has developed a SOM-based enhanced data mining model to utilize all fuzzy patterns and the complete data for knowledge discovery. Using this model, the data miner is allowed to compare the SOM based on complete data and the fuzzy SOM based on all incomplete data to perceive covert patterns of the data set. It also allows the data miner to conduct what-if trials by including different portions of the incomplete data to disclose more accurate facts. Wang (2005) has developed a Hopfield neural network based model (Hopfield & Tank, 1986) for data mining with incomplete survey data. The enhanced data mining method utilizes more information provided by fuzzy patterns, and thus makes the data mining results more accurate. More importantly, it produces rich information about the uncertainty (or risk) of the data mining results.

FUTURE TRENDS

Research into data mining with incomplete data is still in its infancy. The literature on this issue is still scarce. Nevertheless, the data miner's endeavor for knowing what we do not know will accelerate research in this area. More theories and techniques of data mining with incomplete data will be developed in the near future, followed by comprehensive comparisons of these theories and techniques. These theories and techniques will be built on the combination of statistical models, neural networks, and computational algorithms. Data management systems on large-scale database systems for data mining with incomplete data will be available for data mining practitioners. In the long term, techniques of dealing with missing data will become a prerequisite of any data mining instrument.

CONCLUSION

Generally, knowledge discovery starts with the original problem identification. Yet the validation of the problem identified is typically beyond the database and generic algorithms themselves. During the knowledge discovery process, new concepts must be constructed through demystifying the data. Traditionally, incomplete data in data mining are often mistreated. As explained in this chapter, data with missing values must be taken into account in the knowledge discovery process.

There have been two major non-traditional approaches that can be effectively used for data mining with incomplete data. One approach is conceptual construction on incomplete data. It provides effective techniques for knowledge development so that the data miner is allowed to interpret the data mining results based on the particular problem domain and his/her perception of the missing data. The other approach is fuzzy transformation. According to this approach, observations with missing values are transformed into fuzzy patterns based on fuzzy set theory. These fuzzy patterns along with observations with complete data are then used for data mining through, for example, data visualization and classification.

The inclusion of incomplete data for data mining would provide more information for the decision maker in identifying problems, verifying and improving the data mining results derived from observations with complete data only.

REFERENCES

Aggarwal, C.C., & Parthasarathy, S. (2001). Mining massively incomplete data sets by conceptual reconstruction. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 227-232). New York: ACM Press.

Batista, G., & Monard, M. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5/6), 519-533.

Brin, S., Rastogi, R., & Shim, K. (2003). Mining optimized gain rules for numeric attributes. IEEE Transactions on Knowledge & Data Engineering, 15(2), 324-338.

Deboeck, G., & Kohonen, T. (1998). Visual explorations in finance with self-organizing maps. London, UK: Springer-Verlag.

Dempster, A.P., Laird, N.M., & Rubin, D.B. (1997). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1), 1-38.

Hopfield, J.J., & Tank, D.W. (1986). Computing with neural circuits. Science, 233, 625-633.

Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.

Little, R.J.A., & Rubin, D.B. (2002). Statistical analysis with missing data (2nd ed.). New York: John Wiley and Sons.

Parthasarathy, S., & Aggarwal, C.C. (2003). On the use of conceptual reconstruction for mining massively incomplete data sets. IEEE Computer Society, 15(6), 1512-1521.


Tseng, S., Wang, K., & Lee, C. (2003). A pre-processing method to deal with missing values by integrating clustering and regression techniques. Applied Artificial Intelligence, 17(5/6), 535-544.

Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586-600.

Wang, S. (2000). Neural networks. In M. Zeleny (Ed.), IEBM handbook of IT in business (pp. 382-391). London, UK: International Thomson Business Press.

Wang, S. (2002). Nonlinear pattern hypothesis generation for data mining. Data & Knowledge Engineering, 40(3), 273-283.

Wang, S. (2003). Application of self-organizing maps for data mining with incomplete data sets. Neural Computing & Applications, 12(1), 42-48.

Wang, S. (2005). Classification with incomplete survey data: A Hopfield neural network approach. Computers & Operations Research, 32(10), 2583-2594.

Wang, S., & Wang, H. (2004). Conceptual construction on incomplete survey data. Data and Knowledge Engineering, 49(3), 311-323.

Zadeh, L.A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3-28.

KEY TERMS

Aggregate Conceptual Direction: Aggregate conceptual direction describes the trend in the data along which most of the variance occurs, taking the missing data into account.

Conceptual Construction with Incomplete Data: Conceptual construction with incomplete data is a knowledge development process that reveals the patterns of the missing data as well as the potential impacts of these missing data on the mining results based only on the complete data.

Enhanced Data Mining with Incomplete Data: Data mining that utilizes incomplete data through fuzzy transformation.

Fuzzy Transformation: The process of transforming an observation with missing values into fuzzy patterns that are equivalent to the observation based on fuzzy set theory.

Hopfield Neural Network: A neural network with a single layer of nodes that have binary inputs and outputs. The output of each node is fed back to all other nodes simultaneously, and each of the nodes forms a weighted sum of inputs and passes the output result through a nonlinearity function. It applies a supervised learning algorithm, and the learning process continues until a stable state is reached.

Incomplete Data: The data set for data mining contains some data entries with missing values. For instance, when surveys and questionnaires are partially completed by respondents, the entire response data becomes incomplete data.

Neural Network: A set of computer hardware and/or software that attempts to emulate the information processing patterns of the biological brain. A neural network consists of four main components:

(1) Processing units (or neurons), each of which has a certain output level at any point in time.
(2) Weighted interconnections between the various processing units, which determine how the output of one unit leads to input for another unit.
(3) An activation rule, which acts on the set of inputs at a processing unit to produce a new output.
(4) A learning rule that specifies how to adjust the weights for a given input/output pair.

Self-Organizing Map (SOM): A two-layer neural network that maps high-dimensional data onto a low-dimensional grid of neurons through an unsupervised learning or competitive learning process. It allows the data miner to view the clusters on the output maps.


Data Quality in Cooperative Information Systems
Carlo Marchetti
Università di Roma La Sapienza, Italy

Massimo Mecella
Università di Roma La Sapienza, Italy

Monica Scannapieco
Università di Roma La Sapienza, Italy

Antonino Virgillito
Università di Roma La Sapienza, Italy

INTRODUCTION

A Cooperative Information System (CIS) is a large-scale information system that interconnects various systems of different and autonomous organizations, geographically distributed and sharing common objectives (De Michelis et al., 1997). Among the different resources that are shared by organizations, data are fundamental; in real world scenarios, organization A may not request data from organization B, if it does not trust B's data (i.e., if A does not know that the quality of the data that B can provide is high). As an example, in an e-government scenario in which public administrations cooperate in order to fulfill service requests from citizens and enterprises (Batini & Mecella, 2001), administrations very often prefer asking citizens for data rather than from other administrations that have stored the same data, because the quality of such data is not known. Therefore, lack of cooperation may occur due to lack of quality certification.

Uncertified quality also can cause a deterioration of the data quality inside single organizations. If organizations exchange data without knowing their actual quality, it may happen that data of low quality will spread all over the CIS. On the other hand, CISs are characterized by high data replication (i.e., different copies of the same data are stored by different organizations). From a data quality perspective, this is a great opportunity; improvement actions can be carried out on the basis of comparisons among different copies in order to select the most appropriate one or to reconcile available copies, thus producing a new improved copy to be notified to all interested organizations.

In this article, we describe possible solutions to data quality problems in CISs that have been implemented within the DaQuinCIS architecture. The description of the architecture will allow us to understand the challenges posed by data quality management in CISs. Moreover, the presentation of the DaQuinCIS services will provide examples of techniques that can be used to face data quality challenges and will show future research directions that should be investigated.

BACKGROUND

Data quality traditionally has been investigated in the context of single information systems. Only recently, there is growing attention toward data quality issues in the context of multiple and heterogeneous information systems (Berti-Equille, 2003; Bertolazzi & Scannapieco, 2001; Naumann et al., 1999). In cooperative scenarios, the main data quality issues regard (Bertolazzi & Scannapieco, 2001):

• assessment of the quality of the data owned by each organization; and
• methods and techniques for exchanging and improving quality information.

For the assessment issue, some of the results already achieved for traditional systems can be borrowed. In the statistical area, a lot of work has been done since the late 1960s. Record linkage techniques have been proposed, most of them based on the Fellegi and Sunter (1969) model. Also, edit imputation methods based on the Fellegi and Holt (1976) model have been provided in the same area. Instead, in the database area, record matching techniques (Hernandez & Stolfo, 1998) and data cleaning tools (Galhardas et al., 2000) have been proposed as a contribution to data quality assessment.


In Winkler (2004), a survey of data quality assessment techniques is provided, covering both statistical techniques and data cleaning solutions.

When considering the issue of exchanging data and the associated quality, a model to export both data and quality data needs to be defined. Some conceptual models to associate quality information with data have been proposed, which include an extension of the entity-relationship model (Wang et al., 1993) and a data warehouse conceptual model with quality features described through the description logic formalism (Jarke et al., 1995). Both models are for a specific purpose: the former to introduce quality elements in relational database design; the latter to introduce quality elements in data warehouse design. In Mihaila et al. (2000), the problem of the quality of Web-available information has been addressed in order to select data with high quality coming from distinct sources; every source has to evaluate some pre-defined data quality parameters and to make their values available through the exposition of metadata.

Furthermore, exchanging data in CISs poses important problems that have been addressed by the data integration literature. Data integration is the problem of combining data residing at different sources and providing the user with a unified view of these data (Lenzerini, 2002). As described in Yan et al. (1999), when performing data integration, two different types of conflicts may arise: semantic conflicts, due to heterogeneous source models; and instance-level conflicts, which occur when sources record inconsistent values on the same objects. The data quality broker described in the following is a system solving instance-level conflicts. Other notable examples of data integration systems within the same category are AURORA (Yan et al., 1999) and the system described in Sattler et al. (2003). AURORA supports conflict-tolerant queries (i.e., it provides a dynamic mechanism to resolve conflicts by means of defined conflict resolution functions). The system described in Sattler et al. (2003) addresses both semantic and instance-level conflicts. The proposed solution is based on a multi-database query language, called FraQL, which is an extension of SQL with conflict resolution mechanisms. A system that also takes into account metadata for instance-level conflict resolution is described in Fan et al. (2001). Such a system adopts the ideas of the context interchange framework (Bressan et al., 1997); therefore, context-dependent and context-independent conflicts are distinguished, and, according to this very specific direction, conversion rules are discovered on pairs of systems.

Finally, among the techniques explicitly proposed to perform query answering in CISs, we cite Naumann et al. (1999), in which an algorithm to perform query planning based on the evaluation of data sources' qualities, specific queries' qualities, and query results' qualities is described.

MAIN THRUST

In current government and business scenarios, organizations start cooperating in order to offer services to their customers and partners. Organizations that cooperate have business links (i.e., relationships, exchanged documents, resources, knowledge, etc.) connecting each other. Specifically, organizations exploit business services (e.g., they exchange data or require services to be carried out) on the basis of business links, and, therefore, the network of organizations and business links constitutes a cooperative business system. As an example, a supply chain, in which some enterprises offer basic products and some others assemble them in order to deliver final products to customers, is a cooperative business system. As another example, a set of public administrations, which need to exchange information about citizens and their health state in order to provide social aids, is a cooperative business system derived from the Italian e-government scenario (Batini & Mecella, 2001).

A cooperative business system exists independently of the presence of a software infrastructure supporting electronic data exchange and service provisioning. Indeed, cooperative information systems are software systems supporting cooperative business systems; in the remainder of this article, the following definition of CIS is considered:

A cooperative information system is formed by a set of organizations that cooperate through a communication infrastructure N, which provides software services to organizations as well as reliable connectivity. Each organization is connected to N through a gateway G, on which software services offered by the organization to other organizations are deployed. A user is a software or human entity residing within an organization and using the cooperative system.

Several CISs are characterized by a high degree of data replication in different organizations; for example, in an e-government scenario, the personal data of a citizen are stored by almost all administrations. But in such scenarios, the different organizations can provide the same data with different quality levels; thus, any user of the data would appreciate being able to exploit the data with the highest quality level among those provided. Therefore, only the highest quality data should be returned to the user, limiting the dissemination of low quality data. Moreover, the comparison of the gathered data values might be used to enforce a general improvement of data quality in all organizations.


Figure 1. An architecture for enhancing data quality in cooperative information systems
[Figure: each organization (Org1, Org2, ..., Orgn) exposes a cooperative gateway in front of its internal back-end systems; the gateways host the Quality Factory (QF), the Quality Notification Service (QNS), and the Data Quality Broker (DQB), all connected through the communication infrastructure, to which a centralized Rating Service is attached.]
In the context of the DaQuinCIS project¹, we are proposing an architecture for the management of data quality in CISs; this architecture allows the diffusion of data and related quality and exploits data replication to improve the overall quality of cooperative data. The interested reader can find a detailed description of the design and implementation of the architecture in Scannapieco et al. (2004). Each organization offers services to other organizations on its own cooperative gateway, and also specific services to its internal back-end systems. Therefore, cooperative gateways interface both internally and externally through services. Moreover, the communication infrastructure offers some specific services. Services are all identical and peer (i.e., they are instances of the same software artifacts and act both as servers and clients of the other peers, depending on the specific activities to be carried out). The overall architecture is depicted in Figure 1. Organizations export data and quality data according to a common model referred to as the Data and Data Quality (D2Q) model. It includes the definitions of (i) constructs to represent data, (ii) a common set of data quality properties, (iii) constructs to represent them, and (iv) the association between data and quality data.

In order to produce data and quality data according to the D2Q model, each organization deploys on its cooperative gateway a quality factory service that is responsible for evaluating the quality of its own data. The design of the quality factory has been addressed in Cappiello et al. (2003).

The Data Quality Broker poses, on behalf of a requesting user, a data request over other cooperating organizations, also specifying a set of quality requirements that the desired data have to satisfy; this is referred to as the quality brokering function. Different copies of the same data received as responses to the request are reconciled, and a best-quality value is selected and proposed to organizations, which can choose to discard their data and adopt higher quality ones; this is referred to as the quality improvement function. If the requirements specified in the request cannot be satisfied, then the broker initiates a negotiation with the user, who can optionally weaken the constraints on the desired data. The data quality broker is, in essence, a data integration system (Lenzerini, 2002) deployed as a peer-to-peer system, which allows one to pose a quality-enhanced query over a global schema and to select data satisfying such requirements.

The Quality Notification Service is a publish/subscribe engine used as a quality message bus between services and/or organizations. More specifically, it allows quality-based subscriptions, so that users are notified of changes in the quality of data. For example, an organization may want to be notified if the quality of some data it uses degrades below a certain threshold, or when high quality data become available. The quality notification service is also deployed as a peer-to-peer system.

The Rating Service associates trust values with each data source in the CIS. These values are used to determine the reliability of the quality evaluation performed by organizations. The rating service is a centralized service, to be provided by a third-party organization. The interested reader can find further details on the rating service design and implementation in De Santis et al. (2003).
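
As a rough illustration of the brokering and improvement functions, the following Python sketch shows how a broker-like component might pick the best copy of a replicated record from the quality metadata attached to each copy. It is not the DaQuinCIS implementation; the quality dimensions, weights, thresholds, and records are invented for the example.

# Illustrative sketch (not the DaQuinCIS implementation): selecting the
# best-quality copy of a replicated record from several organizations.
# Each copy carries quality metadata, here a completeness and a currency
# score in [0, 1]; the weights and minimum thresholds are hypothetical.

def broker_select(copies, min_quality=None, weights=(0.5, 0.5)):
    """Return the copy with the highest weighted quality score, or None
    if no copy satisfies the requested minimum quality levels."""
    w_completeness, w_currency = weights

    def score(copy):
        return (w_completeness * copy["quality"]["completeness"] +
                w_currency * copy["quality"]["currency"])

    candidates = copies
    if min_quality is not None:
        candidates = [c for c in copies
                      if all(c["quality"][dim] >= threshold
                             for dim, threshold in min_quality.items())]
    if not candidates:
        return None   # the broker would now negotiate weaker constraints with the user
    return max(candidates, key=score)

# Two organizations export the same citizen record with different quality.
copies = [
    {"org": "Org1", "data": {"name": "J. Smith", "address": "123 E. Main St."},
     "quality": {"completeness": 0.7, "currency": 0.9}},
    {"org": "Org2", "data": {"name": "John K. Smith", "address": "123 East Main Street"},
     "quality": {"completeness": 0.95, "currency": 0.8}},
]
best = broker_select(copies, min_quality={"completeness": 0.8})
print(best["org"])   # Org2; its copy could then be proposed to the other organizations
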


FUTURE TRENDS

The complete development of a framework for data quality management in CISs requires the solution of further issues. An important aspect concerns the techniques to be used for quality dimension measurement. In both the statistical and machine learning areas, some techniques could be usefully exploited. The general idea is to have quality values estimated with a certain probability instead of a deterministic quality evaluation. In this way, the task of assigning quality values to each data value could be considerably simplified.

The data quality broker covers some aspects of quality-driven query processing in CISs. Nevertheless, there is still the need to investigate instance reconciliation techniques; when quality values are not attached to exported data, how is it possible to select between two conflicting instances of the same data? Current data integration systems simply do not provide any answer in such cases, but a looser semantics for query answering is needed in order to make data integration systems actually work in real scenarios where errors and conflicts are present.

Some further open issues concern trusting data sources. Data ownership is an important one. From a data quality perspective, assigning responsibility for data helps to engage improvement actions as well as to trust the sources providing the data. In some cases, laws help to assign responsibilities for data typologies, but this is not always possible. Models and techniques that allow trusting such sources with respect to the data they provide are an open and important issue, especially when data sources interact in open and dynamic environments like peer-to-peer systems.

CONCLUSION

Managing data quality in CISs requires solving problems from many research areas of computer science, such as databases, software engineering, distributed computing, security, and information systems. This implies that the proposal of integrated solutions is very challenging. In this article, an architecture to support data quality management in CISs has been described; such an architecture consists of modules that provide some solutions to principal data quality problems in CISs, namely, quality-driven query answering, data quality access, data quality maintenance, and trust management. The architecture is suitable in all contexts in which data stored by different and distributed sources are overlapping and affected by data errors. Among such contexts, the architecture has been validated in the Italian e-government scenario, in which all public administrations store data about citizens that need to be corrected and reconciled.

REFERENCES

Batini, C., & Mecella, M. (2001). Enabling Italian e-government through a cooperative architecture. IEEE Computer, 34(2), 40-45.

Berti-Equille, L. (2003). Quality-extended query processing for distributed processing. Proceedings of the ICDT'03 International Workshop on Data Quality in Cooperative Information Systems (DQCIS'03), Siena, Italy.

Bertolazzi, P., & Scannapieco, M. (2001). Introducing data quality in a cooperative context. Proceedings of the 6th International Conference on Information Quality (ICIQ'01), Boston, Massachusetts.

Bressan, S. et al. (1997). The context interchange mediator prototype. Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona.

Cappiello, C. et al. (2003). Data quality assurance in cooperative information systems: A multi-dimension quality certificate. Proceedings of the ICDT'03 International Workshop on Data Quality in Cooperative Information Systems (DQCIS'03), Siena, Italy.

De Michelis, G. et al. (1997). Cooperative information systems: A manifesto. In M.P. Papazoglou & G. Schlageter (Eds.), Cooperative information systems: Trends & directions. London: Academic Press.

De Santis, L., Scannapieco, M., & Catarci, T. (2003). Trusting data quality in cooperative information systems. Proceedings of the 11th International Conference on Cooperative Information Systems (CoopIS'03), Catania, Italy.

Fan, W., Lu, H., Madnick, S., & Cheung, D. (2001). Discovering and reconciling value conflicts for numerical data integration. Information Systems, 26(8), 635-656.

Fellegi, I., & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.

Fellegi, I., & Sunter, A. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183-1210.

Galhardas, H. et al. (2000). An extensible framework for data cleaning. Proceedings of the 16th International Conference on Data Engineering (ICDE 2000), San Diego, California.

Hernandez, M., & Stolfo, S. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2), 9-37.

Jarke, M., Lenzerini, M., Vassiliou, V., & Vassiliadis, P. (Eds.). (1995). Fundamentals of data warehouses. Berlin, Heidelberg, Germany: Springer Verlag.


Lenzerini, M. (2002). Data integration: A theoretical perspective. Proceedings of the 21st ACM Symposium on Principles of Database Systems (PODS 2002), Madison, Wisconsin.

Mihaila, G., Raschid, L., & Vidal, M. (2000). Using quality of data metadata for source selection and ranking. Proceedings of the Third International Workshop on the Web and Databases (WebDB'00), Dallas, Texas.

Naumann, F., Leser, U., & Freytag, J. (1999). Quality-driven integration of heterogenous information systems. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB'99), Edinburgh, Scotland.

Redman, T. (1996). Data quality for the information age. Norwood, MA: Artech House.

Sattler, K., Conrad, S., & Saake, G. (2003). Interactive example-driven integration and reconciliation for accessing database integration. Information Systems, 28(5), 393-413.

Scannapieco, M. et al. (2004). The DaQuinCIS architecture: A platform for exchanging and improving data quality in cooperative information systems. Information Systems, 29(7), 551-582.

Wand, Y., & Wang, R. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), 86-95.

Wang, R. (1998). A product perspective on total data quality management. Communications of the ACM, 41(2), 58-65.

Wang, R., Kon, H., & Madnick, S. (1993). Data quality requirements: Analysis and modeling. Proceedings of the 9th International Conference on Data Engineering (ICDE '93), Vienna, Austria.

Winkler, W.E. (2004). Methods for evaluating and creating data quality. Information Systems, 29(7), 531-550.

Yan, L., & Ozsu, T. (1999). Conflict-tolerant queries in AURORA. Proceedings of the Fourth International Conference on Cooperative Information Systems (CoopIS'99), Edinburgh, Scotland.

KEY TERMS

Cooperative Information Systems: Set of geographically distributed information systems that cooperate on the basis of shared objectives and goals.

Data Integration System: System that presents data distributed over heterogeneous data sources according to a unified view. A data integration system allows processing of queries over such a unified view by gathering results from the various data sources.

Data Quality: Data have a good quality if they are fit for use. Data quality is measured in terms of many dimensions or characteristics, including accuracy, completeness, consistency, and currency of electronic data.

Data Schema: Collection of data types and relationships described according to a particular type language.

Data Schema Instance: Collection of data values that are valid (i.e., conform) with respect to a data schema.

Metadata: Data providing information (e.g., quality, provenance, etc.) about application data. Metadata are associated with data according to specific data models.

Peer-to-Peer Systems: Distributed systems in which each node can be both client and server with respect to service and data provision.

Publish and Subscribe: Distributed communication paradigm according to which subscribers are notified about events that have been published by publishers and that are of interest to them.

ENDNOTE

1. DaQuinCIS - Methodologies and Tools for Data Quality inside Cooperative Information Systems (http://www.dis.uniroma1.it/~dq) is a joint research project carried out by DIS - Università di Roma "La Sapienza", DISCo - Università di Milano Bicocca, and DEI - Politecnico di Milano.


Data Quality in Data Warehouses


William E. Winkler
U.S. Bureau of the Census, USA

INTRODUCTION

Fayyad and Uthurusamy (2002) have stated that the majority of the work (representing months or years) in creating a data warehouse is in cleaning up duplicates and resolving other anomalies. This article provides an overview of two methods for improving quality. The first is data cleaning for finding duplicates within files or across files. The second is edit/imputation for maintaining business rules and for filling in missing data. The fastest data-cleaning methods are suitable for files with hundreds of millions of records (Winkler, 1999b, 2003b). The fastest edit/imputation methods are suitable for files with millions of records (Winkler, 1999a, 2004b).

BACKGROUND

When data from several sources are successfully combined in a data warehouse, many new analyses can be done that might not be done on individual files. If duplicates are present within a file or across a set of files, then the duplicates might be identified. Data cleaning or record linkage uses name, address, and other information, such as income ranges, type of industry, and medical treatment category, to determine whether two or more records should be associated with the same entity. Related types of files might be combined. In the health area, a file of medical treatments and related information might be combined with a national death index. Sets of files from medical centers and health organizations might be combined over a period of years to evaluate the health of individuals and discover new effects of different types of treatments. Linking files is an alternative to exceptionally expensive follow-up studies.

The uses of the data are affected by lack of quality due to the duplication of records and missing or erroneous values of variables. Duplication can waste money and yield error. If a hospital has a patient incorrectly represented in two different accounts, then the hospital might repeatedly bill the patient. Duplicate records may inflate the numbers and amounts in overdue-billing categories. If the quantitative amounts associated with some accounts are missing, then the totals may be biased low. If values associated with variables such as billing amounts are erroneous because they do not satisfy edit or business rules, then totals may be biased low or high. Imputation rules can supply replacement values for erroneous or missing values that are consistent with the edit rules and preserve joint probability distributions. Files without error can be effectively data mined.

MAIN THRUST

This section provides an overview of data cleaning and of statistical data editing and imputation. The cleanup and homogenization of the files are preprocessing steps prior to data mining.

Data Cleaning

Data cleaning is also referred to as record linkage or object identification. Record linkage was introduced by Newcombe, Kennedy, Axford, and James (1959) and given a formal mathematical framework by Fellegi and Sunter (1969). Notation is needed. Two files, A and B, are matched. The idea is to classify pairs in a product space, A x B, from two files A and B into M, the set of true matches, and U, the set of true nonmatches. Fellegi and Sunter considered ratios of conditional probabilities of the form

R = P(γ | M) / P(γ | U)     (1)

where γ is an arbitrary agreement pattern in a comparison space Γ. For instance, γ might consist of eight patterns representing simple agreement or disagreement on the largest name component, street name, and street number. Alternatively, each γ might additionally account for the relative frequency with which specific values of name components such as "Smith", "Zabrinsky", "AAA", and "Capitol" occur. Ratio R, or any monotonely increasing function of it, such as the natural log, is referred to as a matching weight (or score).

The decision rule is given by the following statements:

If R > Tμ, then designate the pair as a match.
If Tλ < R < Tμ, then designate the pair as a possible match and hold it for clerical review.     (2)
If R < Tλ, then designate the pair as a nonmatch.

The cutoff thresholds Tμ and Tλ are determined by a priori error bounds on false matches and false nonmatches. Rule 2 agrees with intuition. If γ consists primarily of agreements, then γ would intuitively be more likely to occur among matches than nonmatches, and Ratio 1 would be large. On the other hand, if γ consists primarily of disagreements, then Ratio 1 would be small. Rule 2 partitions the set Γ into three disjoint subregions. The region Tλ < R < Tμ is referred to as the no-decision region or clerical-review region. In some situations, resources are available to review pairs clerically.
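
The following Python sketch illustrates how Ratio 1 and decision rule 2 can be applied under a simple conditional-independence model. The agreement probabilities among matches and nonmatches and the two cutoffs are invented for illustration; in practice they are estimated from training data or with the EM algorithm, as discussed below.

# A minimal sketch of Fellegi-Sunter scoring under conditional independence.
# The m- and u-probabilities (agreement probabilities among matches M and
# nonmatches U) and the cutoffs are hypothetical values.
from math import log2

FIELDS = ["last_name", "street_name", "house_number"]
M_PROB = {"last_name": 0.95, "street_name": 0.90, "house_number": 0.85}
U_PROB = {"last_name": 0.01, "street_name": 0.05, "house_number": 0.10}
T_UPPER, T_LOWER = 6.0, 0.0     # hypothetical cutoffs on the log2 weight

def agreement_pattern(rec_a, rec_b):
    """Crude field-by-field comparison; real systems use string comparators."""
    return {f: rec_a[f].strip().lower() == rec_b[f].strip().lower() for f in FIELDS}

def matching_weight(gamma):
    """Sum of log2 likelihood ratios over the agreement pattern gamma."""
    w = 0.0
    for f, agrees in gamma.items():
        if agrees:
            w += log2(M_PROB[f] / U_PROB[f])
        else:
            w += log2((1 - M_PROB[f]) / (1 - U_PROB[f]))
    return w

def classify(rec_a, rec_b):
    w = matching_weight(agreement_pattern(rec_a, rec_b))
    if w > T_UPPER:
        return "match", w
    if w < T_LOWER:
        return "nonmatch", w
    return "clerical review", w

a = {"last_name": "Smith", "street_name": "Main", "house_number": "123"}
b = {"last_name": "Smith", "street_name": "Main", "house_number": "125"}
print(classify(a, b))    # agreement on two of the three fields yields a match here
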


Linkages can be error prone in the absence of strong or unique identifiers, such as a verified social security number that identifies an individual record or entity. Weak identifiers such as name, address, and other nonuniquely identifying information are used. The combination of weak identifiers can determine whether a pair of records represents the same entity. If errors or differences exist in the representations of names and addresses, then many duplicates can erroneously be added to a warehouse. For instance, the name of a business may be "John K Smith and Company" in one file and "J. K. Smith, Inc." in another file. Without additional corroborating information such as addresses, it is difficult to determine whether the two names correspond to the same entity. With three addresses such as "123 E. Main Street", "123 East Main St.", and "P.O. Box 456" and the two names, the linkage can still be quite difficult. With suitable preprocessing methods, it may be possible to represent the names in forms in which the different components can be compared. To use addresses of the forms "123 E. Main Street" and "P.O. Box 456", it may be necessary to use an auxiliary file or expensive follow-up that indicates that the addresses have at some time been associated with the same entity.

If individual fields have a minor typographical error, then string comparators that account for such errors can allow effective comparisons (Winkler, 1995, 2004b; Cohen, Ravikumar, & Fienberg, 2003). Individual fields might be first name, last name, and street name, which are delineated by standardization software. Rule-based methods of standardization are available in commercial software for addresses and in other software for names (Winkler, 1995, 1999b). The probabilities in Equations 1 and 2 are referred to as matching parameters. If training data consisting of matched and unmatched pairs are available, then a supervised method requiring training data can be used for estimation of the matching parameters. Optimal matching parameters can sometimes be estimated via unsupervised learning methods, such as the EM algorithm. The parameters are known to vary significantly across files (Winkler, 1999b). They can even vary significantly across similar files representing an urban area and an adjacent suburban area. If two files each contain 1,000 or more records, then bringing together all pairs from the two files is impractical due to the small number of potential matches within the total set of pairs. Blocking is the method of considering only pairs that agree exactly (character by character) on subsets of fields. For instance, a set of blocking criteria may be to consider only pairs that agree on the U.S. Postal zip code and the first character of the last name. Additional blocking passes may be needed to obtain matching pairs that are missed by earlier blocking passes (Newcombe et al., 1959; Hernandez & Stolfo, 1995; Winkler, 2004a).
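
A small sketch of blocking follows. The blocking key (zip code plus the first letter of the last name) mirrors the example above; the records and field names are invented.

# Only record pairs that agree exactly on a blocking key are compared,
# instead of forming the full cross product of the two files.
from collections import defaultdict

def blocking_key(rec):
    return (rec["zip"], rec["last_name"][:1].upper())

def candidate_pairs(file_a, file_b):
    index = defaultdict(list)
    for rec in file_b:
        index[blocking_key(rec)].append(rec)
    for rec_a in file_a:
        for rec_b in index.get(blocking_key(rec_a), []):
            yield rec_a, rec_b

file_a = [{"last_name": "Smith", "zip": "17033"},
          {"last_name": "Jones", "zip": "07102"}]
file_b = [{"last_name": "Smyth", "zip": "17033"},
          {"last_name": "Jones", "zip": "07102"},
          {"last_name": "Brown", "zip": "17033"}]

pairs = list(candidate_pairs(file_a, file_b))
print(len(pairs))   # 2 candidate pairs instead of the 6 in the full cross product
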


Statistical Data Editing and Imputation

Correcting inconsistent information and filling in missing information needs to be efficient and cost effective. For single fields, edits are straightforward. A lookup table may yield correct diagnostic or zip codes. For multiple fields, an edit might require that an individual younger than 15 years of age must have a marital status of unmarried. If a record fails this edit, then a subsequent procedure would need to change either the age or the marital status.

Editing has been done extensively in statistical agencies since the 1950s. Early work was clerical. Later, computer programs applied if-then-else rules with logic similar to the clerical review. The main disadvantage was that edits that did not fail initially for a record could fail as the values in the fields associated with edit failures were changed. Fellegi and Holt (1976) provided a theoretical model. In providing their model, they had three goals:

1. The data in each record should be made to satisfy all edits by changing the fewest possible variables (fields).
2. Imputation rules should derive automatically from edit rules.
3. When imputation is necessary, it should maintain the joint distribution of variables.

Fellegi and Holt (1976; Theorem 1) proved that implicit edits are needed for solving the problem of Goal 1. Implicit edits are those that can be logically derived from explicitly defined edits. Implicit edits provide information about edits that do not fail initially for a record but may fail as the values in fields that are associated with failing edits are changed. The following example illustrates some of the computational issues. An edit can be considered as a set of points. Let edit E = {married & age ≤ 15}. Let r be a data record. Then r ∈ E => r fails the edit. This formulation is equivalent to "If age ≤ 15, then not married." If a record r fails a set of edits, then one field in each of the failing edits must be changed. An implicit edit E3 can be implied from two explicitly defined edits E1 and E2; i.e., E1 and E2 => E3.

E1 = {age ≤ 15, married, . }
E2 = { . , not married, spouse}
E3 = {age ≤ 15, . , spouse}

The edits restrict the fields age, marital status, and relationship to head of household. Implicit edit E3 is derived from E1 and E2. If E3 fails for a record r = {age ≤ 15, not married, spouse}, then necessarily either E1 or E2 fails. Assume that the implicit edit E3 is unobserved. If edit E2 fails for record r, then one possible correction is to change the marital status field in record r to married in order to obtain a new record r1. Record r1 does not fail for E2 but now fails for E1. The additional information from edit E3 assures that record r satisfies all the edits after changing one additional field. For larger data situations with more edits and more fields, the number of possibilities increases at a very high exponential rate.
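
The sketch below expresses these edits in Python and reproduces the behavior just described. It only checks which edits a record falls into (and therefore fails); it does not perform the full Fellegi-Holt error localization, which also needs the implicit edits and an optimization step to choose the fewest fields to change.

# An edit is modeled as a set of restrictions on field values; a record that
# satisfies all of an edit's restrictions "enters" the edit and fails it.

def fails(edit, record):
    """True if the record falls inside the edit region (i.e., fails the edit)."""
    return all(record[field] in allowed for field, allowed in edit.items())

# Explicit edits E1, E2 and the implicit edit E3 derived from them.
E1 = {"age": range(0, 16), "marital_status": {"married"}}
E2 = {"marital_status": {"not married"}, "relationship": {"spouse"}}
E3 = {"age": range(0, 16), "relationship": {"spouse"}}

r = {"age": 14, "marital_status": "not married", "relationship": "spouse"}
print([fails(e, r) for e in (E1, E2, E3)])   # [False, True, True]

# Changing only marital_status (the field flagged by E2) is not enough:
r1 = dict(r, marital_status="married")
print([fails(e, r1) for e in (E1, E2, E3)])  # [True, False, True] -- E1 now fails
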
In data warehouse situations, the ease of implementing the ideas of Fellegi and Holt (1976) by using generalized edit software is dramatic. An analyst who has knowledge of the edit situations might put together the edit tables in a relatively short time. Kovar and Winkler (1996) compared two edit systems on economic data. Both were installed and run in less than one day. In many business situations, only a few simple edit rules might be needed. In their books on data quality, Redman (1996), English (1999), and Loshin (2001) have described editing files to assure that business rules are satisfied. The authors have not noted the difficulties of applying hardcoded if-then-else rules and the relative ease of applying Fellegi and Holt's methods.

FUTURE TRENDS

There are several trends in data cleaning. First, some research considers better search strategies to compensate for typographical error. Winkler (2003b, 2004a) applies efficient blocking strategies for bringing together pairs in a situation where one file and its indexes are held in memory. This allows the matching of a moderate size file of 100 million records (which will reside in 4 GB of memory) against large administrative files of upwards of 4 billion records. This type of record linkage requires only one pass against both files, whereas conventional record linkage requires many sorts, matching passes of both files, and large amounts of disk space. Chaudhuri, Gamjam, Ganti, and Motwani (2003) provide a method of indexing that significantly improves over brute-force methods, in which all pairs in two files are compared. Second, some research deals with better string comparators for comparing fields having typographical errors. Cohen et al. (2003) have methods that improve over the basic Jaro-Winkler methods (Winkler, 2004b). Third, another research area investigates methods to standardize and parse general names and address fields into components that can be more easily compared. Borkar, Deshmukh, and Sarawagi (2001), Churches, Christen, Lu, and Zhu (2002), and Agichtein and Ganti (2004) have Hidden Markov methods that work as well as or better than some of the rule-based methods. Although the Hidden Markov methods require training data, Churches et al. provide methods for quickly creating additional training data for new applications. The additional training data supplement a core set of generic training data. The training data consists of free-form records and the corresponding records that have been processed into components. Fourth, because training data are rarely available and optimal matching parameters are needed, some research investigates unsupervised learning methods that do not require training data. Ravikumar and Cohen (2004) have unsupervised learning methods that improve over the basic EM methods of Winkler (1995, 1999b, 2003b). Their methods are even competitive with supervised learning methods. Fifth, other research considers methods of (nearly) automatic error rate estimation with little or no training data. Winkler (2002) considers methods that use unlabeled data and very small subsets of labeled data (training data). Sixth, some research considers methods for adjusting statistical and other types of analyses for matching error. Lahiri and Larsen (2004) have methods for improving the accuracy of statistical analyses in the presence of matching error. Their methods extend ideas introduced by Scheuren and Winkler (1993, 1997).

There are three trends for edit/imputation research. The first trend comprises faster ways of determining the minimum number of variables containing values that contradict the edit rules. De Waal (2003a, 2003b, 2003c) applies Fourier-Motzkin and cardinality constrained Chernikova algorithms that allow direct bounding of the number of computational paths. Riera-Ledesma and Salazar-Gonzalez (2004) apply clever heuristics in the setup of the problems that allow direct integer programming methods to perform much faster. The second trend is to apply machine learning methods. Nordbotten (1995) applies neural nets to the basic editing problem. Di Zio, Scanu, Coppola, Luzi, and Ponti (2004) apply Bayesian networks to the imputation problem. The advantage of these approaches is that they do not require the detailed rule elicitation of other edit approaches; they depend on representative training data.


The training data consists of unedited records and the corresponding edited records after review by subject matter specialists. The third trend comprises methods that preserve both statistical distributions and edit constraints in the preprocessed data. Winkler (2003a) connects the generalized imputation methods of Little and Rubin (2002) with the generalized edit methods of Winkler (1999a, 2003b). The potential advantage is that the statistical properties needed for data mining may be preserved.

CONCLUSION

To data mine effectively, data need to be preprocessed in a variety of steps that include removing duplicates, performing statistical data editing and imputation, and doing other cleanup and regularization of the data. If moderate errors exist in the data, data mining may waste computational and analytic resources with little gain in knowledge.

REFERENCES

Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. ACM Special Interest Group on Knowledge Discovery and Data Mining (pp. 20-29).

Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. ACM SIGMOD 2001 (pp. 175-186).

Chaudhuri, S., Gamjam, K., Ganti, V., & Motwani, R. (2003). Robust and efficient match for on-line data cleaning. Proceedings of the ACM SIGMOD Conferences, 2003 (pp. 313-324).

Churches, T., Christen, P., Lu, J., & Zhu, J. X. (2002). Preparation of name and address data for record linkage using hidden Markov models. BioMed Central Medical Informatics and Decision Making, 2(9). Retrieved from http://www.biomedcentral.com/1472-6947/2/9/

Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003). A comparison of string metrics for matching names and addresses. Proceedings of the International Joint Conference on Artificial Intelligence, Mexico.

De Waal, T. (2003a). Solving the error localization problem by means of vertex generation. Survey Methodology, 29(1), 71-79.

De Waal, T. (2003b). A fast and simple algorithm for automatic editing of mixed data. Journal of Official Statistics, 19(4), 383-402.

De Waal, T. (2003c). Processing of erroneous and unsafe data. Rotterdam: ERIM Research in Management.

Di Zio, M., Scanu, M., Coppola, L., Luzi, O., & Ponti, A. (2004). Bayesian networks for imputation. Journal of the Royal Statistical Society, A, 167(2), 309-322.

English, L. P. (1999). Improving data warehouse and business information quality: Methods for reducing costs and increasing profits. New York: Wiley.

Fayyad, U., & Uthurusamy, R. (2002). Evolving data mining into solutions for insights. Communications of the Association of Computing Machinery, 45(8), 28-31.

Fellegi, I. P., & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.

Fellegi, I. P., & Sunter, A. B. (1969). A theory of record linkage. Journal of the American Statistical Association, 64, 1183-1210.

Hernandez, M., & Stolfo, S. J. (1995). The merge/purge problem for large databases. ACM SIGMOD 1995 (pp. 127-138).

Kovar, J. G., & Winkler, W. E. (1996). Editing economic data. Proceedings of the Section on Survey Research Methods, American Statistical Association (pp. 81-87). Retrieved from http://www.census.gov/srd/www/byyear.html

Lahiri, P. A., & Larsen, M. D. (2004). Regression analysis with linked data. Journal of the American Statistical Association.

Little, R. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New York: Wiley.

Loshin, D. (2001). Enterprise knowledge management: The data quality approach. San Diego: Morgan Kaufman.

Newcombe, H. B., Kennedy, J. M., Axford, S. J., & James, A. P. (1959). Automatic linkage of vital records. Science, 130, 954-959.

Nordbotten, S. (1995). Editing statistical records by neural networks. Journal of Official Statistics, 11, 393-414.

Ravikumar, P., & Cohen, W. W. (2004). A hierarchical graphical model for record linkage. Proceedings of the Conference on Uncertainty in Artificial Intelligence, USA. Retrieved from http://www.cs.cmu.edu/~wcohen


Redman, T. C. (1996). Data quality in the information age. Boston, MA: Artech.

Riera-Ledesma, J., & Salazar-Gonzalez, J.-J. (2004). A branch-and-cut algorithm for the error localization problem in data cleaning (Tech. Rep.). Tenerife, Spain: Universidad de la Laguna.

Scheuren, F., & Winkler, W. E. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 19, 39-58.

Scheuren, F., & Winkler, W. E. (1997). Regression analysis of data files that are computer matched, II. Survey Methodology, 23, 157-165.

Winkler, W. E. (1995). Matching and record linkage. In B. G. Cox et al. (Eds.), Business survey methods (pp. 355-384). New York: Wiley.

Winkler, W. E. (1999a). The state of statistical data editing. In Statistical data editing (pp. 169-187). Rome: ISTAT.

Winkler, W. E. (1999b). The state of record linkage and current research problems. Proceedings of the Survey Methods Section, Statistical Society of Canada (pp. 73-80).

Winkler, W. E. (2002). Record linkage and Bayesian networks. Proceedings of the Section on Survey Research Methods, American Statistical Association. Retrieved from http://www.census.gov/srd/www/byyear.html

Winkler, W. E. (2003a). A contingency table model for imputing data satisfying analytic constraints. Proceedings of the Section on Survey Research Methods, American Statistical Association. Retrieved from http://www.census.gov/srd/www/byyear.html

Winkler, W. E. (2003b). Data cleaning methods. Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, USA. Retrieved from http://csaa.byu.edu/kdd03cleaning.html

Winkler, W. E. (2004a). Approximate string comparator search strategies for very large administrative lists. Proceedings of the Section on Survey Research Methods, American Statistical Association.

Winkler, W. E. (2004b). Methods for evaluating and creating data quality. Information Systems, 29(7), 531-550.

KEY TERMS

Data Cleaning: The methodology of identifying duplicates in a single file or across a set of files by using a name, address, and other information.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, such as classification, prediction, estimation, or affinity grouping.

Edit Restraints: Logical restraints such as business rules that assure that an employee's listed salary in a job category is not too high or too low, or that certain contradictory conditions, such as a male hysterectomy, do not occur.

Imputation: The method of filling in missing data that sometimes preserves statistical distributions and satisfies edit restraints.

Preprocessed Data: In preparation for data mining, data that have been through preprocessing such as data cleaning or edit/imputation.

Rule Induction: The process of learning, from cases or instances, if-then rule relationships consisting of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

Training Data: A representative subset of records for which the truth of classifications and relationships is known and that can be used for rule induction in machine learning models.


Data Reduction and Compression in Database Systems

Alexander Thomasian
New Jersey Institute of Technology, USA

INTRODUCTION

Data compression is storing data such that it requires less space than usual. Data compression has been effectively used in storing data in a compressed form on magnetic tapes, disks, and even main memory. In many cases, updated data cannot be stored in place when they are not compressible to the same or a smaller size. Compression also reduces the bandwidth requirements in transmitting (program) code, data, text, images, speech, audio, and video. The transmission may be from main memory to the CPU and its caches, from tape and disk into main memory, or over local, metropolitan, and wide area networks. When data compression is used, transmission time improves or, conversely, the required transmission bandwidth is reduced. Two excellent texts on this topic are Sayood (2002) and Witten, Bell, and Moffat (1999).

Huffman encoding is a popular data compression method. It substitutes variable-length codes for the symbols of an alphabet that would otherwise require k bits per symbol, so that frequent symbols are represented with fewer than k bits and less common symbols with more than k bits. A significant saving in space is possible when the distribution is highly skewed, for example, the Zipf distribution, because the average number of bits needed to represent symbols is then smaller than k bits. Arithmetic coding is a more sophisticated technique that represents a string with an appropriate fraction. Lempel-Ziv coding substitutes a character string by an index to a dictionary or by a previous occurrence and the string length. Variations of these algorithms, separately or in combination, are used in many applications.
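
As a brief illustration, the following Python sketch builds a Huffman code for a skewed four-symbol alphabet. The frequencies are invented, and the resulting average code length falls below the two bits per symbol that a fixed-length code would need.

# A minimal Huffman coding sketch for a skewed symbol distribution.
import heapq

def huffman_codes(freq):
    # heap entries: [total weight, tiebreak counter, [symbol, code], ...]
    heap = [[weight, i, [symbol, ""]] for i, (symbol, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]      # prefix codes in the lighter subtree with 0
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]      # and the heavier subtree with 1
        heapq.heappush(heap, [lo[0] + hi[0], counter] + lo[2:] + hi[2:])
        counter += 1
    return dict(heap[0][2:])

freq = {"a": 0.70, "b": 0.15, "c": 0.10, "d": 0.05}   # Zipf-like skew
codes = huffman_codes(freq)
avg_bits = sum(freq[s] * len(codes[s]) for s in freq)
print(codes)      # a 1-bit code for 'a', longer codes for the rarer symbols
print(avg_bits)   # about 1.45 bits/symbol versus 2 bits for a fixed-length code
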
Compression can be lossy or lossless. Lossy data compression is utilized in cases where some data loss can be tolerated; for example, a restored compressed image may not be discernibly different from the original. Lossless data compression, which restores compressed data to their original values, is absolutely necessary in some applications. Quantization is a lossy data compression method, which represents data more coarsely than the original signal.

General purpose data compression is applicable to data warehouses and databases. For example, a variation of the lossless Lempel-Ziv method has been applied to DB2 records. This is accomplished by analyzing the records in a relational table and building dictionaries, which are then used in compressing the data. Because relational tables are organized as database pages, each page on disk (and in main memory) will hold compressed data, so the number of records per page is doubled. Data is uncompressed on demand, with or without hardware assistance.

Data reduction operates at a higher level of abstraction than data compression, although data compression methods can be used in conjunction with data reduction. An example is quantizing the data in a matrix, which has been dimensionally reduced via the SVD method, as I describe in the following sections. This concludes the discussion of data compression; the remainder of this article deals with data reduction.

MAIN THRUST

Recent interest in data reduction resulted in the New Jersey Data Reduction Report (Barbara et al., 1997), which classifies data reduction methods into parametric and nonparametric. Histograms, clustering, and indexing structures are examples of nonparametric methods. Another classification is direct versus transform-based methods. Singular value decomposition (SVD) and discrete wavelet transforms are parametric transform-based methods. In the following few sections, I briefly introduce the aforementioned methods.

SVD

SVD is applicable to a two-dimensional M×N matrix X, where M is the number of objects, and there are N features per object. For example, M may represent the number of customers, and the columns represent the amount they spend on each of the N products. M may be in the millions, while N is in the hundreds or even thousands.

According to SVD, we have the decomposition X = USV^t, where U is another M×N matrix, S is a diagonal N×N matrix of singular values s_n, 1 ≤ n ≤ N, and V holds the eigenvectors of the principal components. Alternatively, the covariance matrix C, the product of the transpose of X times X divided by M, can be decomposed as C = VΛV^t, where Λ is a diagonal matrix of eigenvalues λ_n = s_n^2/M.


We assume, without loss of generality, that the eigenvalues are in nonincreasing order, so that the transformation of coordinates into the principal components, Y = XV, yields columns in decreasing order of their energy or variance. There is a reduction in the rank of the matrix when some eigenvalues are equal to zero. If we retain the first p columns of Y, the Normalized Mean Square Error (NMSE) is equal to the sum of the eigenvalues of the discarded columns divided by the trace of the matrix (the sum of the eigenvalues or diagonal elements of C, which remains invariant). A significant reduction in the number of columns can be attained at a relatively small NMSE, as shown in numerous studies (see Korn, Jagadish, & Faloutsos, 1997).
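
The following numpy sketch, on a purely synthetic data matrix, illustrates the computation and the NMSE obtained when only the first p principal components are retained.

# SVD/PCA-based dimensionality reduction and its NMSE on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
M, N, p = 1000, 8, 3
X = rng.standard_normal((M, 3)) @ rng.standard_normal((3, N))   # low-rank structure
X += 0.1 * rng.standard_normal((M, N))                          # plus some noise

C = (X.T @ X) / M                        # covariance-like matrix (data not centered here)
eigvals, V = np.linalg.eigh(C)           # eigenvalues come back in ascending order
order = np.argsort(eigvals)[::-1]        # sort into nonincreasing order
eigvals, V = eigvals[order], V[:, order]

Y = X @ V                                # rotate into principal-component coordinates
nmse = eigvals[p:].sum() / eigvals.sum() # energy in the discarded N - p columns
print(round(nmse, 4))                    # small NMSE: p columns carry most of the variance

X_approx = Y[:, :p] @ V[:, :p].T         # reconstruction from the retained columns
print(round(np.mean((X - X_approx) ** 2) / np.mean(X ** 2), 4))  # matches the NMSE
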
Higher dimensional data, as in the case of data warehouses, for example, (product, customer, date) → (dollars), can be reduced to two dimensions by appropriate transformations, for example, with products as rows and (customer, date) combinations as columns.

The columns of dataset X may not be globally correlated; for example, high-income customers buy expensive items and low-income customers buy economy items, so that the items bought by these two groups of customers are disjoint. Higher data compression (for a given NMSE) can be attained by first clustering the data, using an off-the-shelf clustering method such as k-means (Dunham, 2003), and then applying SVD to the clusters (Castelli, Thomasian, & Li, 2003). More sophisticated clustering methods, which generate elliptical clusters, may yield higher dimensionality reduction. An SVD-friendly clustering method, which generates clusters amenable to dimensionality reduction, is proposed in Chakrabarti and Mehrotra (2000).

K-nearest-neighbor (k-NN) queries can be carried out with respect to a dataset, which has been subjected to SVD, by first transforming the query point to the appropriate coordinates by using the principal components. In the case of multiple clusters, we first need to determine the cluster to which the query point belongs. In the case of the k-means clustering method, the query point belongs to the cluster with the closest centroid. After determining the k nearest neighbors in the primary cluster, I need to determine if other clusters are to be searched. A cluster is searched if the hypersphere centered on the query point, with the k nearest neighbors inside it, intersects with the hypersphere of that cluster. This step is repeated until no more intersections exist.

Multidimensional scaling (MS) is another method for dimensionality reduction (Kruskal & Wish, 1978). Given the pair-wise distances or dissimilarities among a set of objects, the goal of MS is to represent them in k dimensions so that their distances are preserved. A stress function, which is the sum of squares of the differences between the distances of points in k dimensions and the original distances, is used to represent the goodness of the fit. The value of k should be selected to be as small as possible, while stress is maintained at an appropriately low level. A fast, approximate alternative is FASTMAP, whose goal is to find a k-dimensional space that matches the distances of an N×N matrix for N points (Faloutsos & Lin, 1995).

WAVELETS

According to Fourier's theorem, a continuous function can be expressed as the sum of sinusoidal functions. A discrete signal with n points can be expressed by the n coefficients of a Discrete Fourier Transform (DFT). According to Parseval's theorem, the energy in the time and frequency domains are equal (Faloutsos, 1996).

The DFT consists of the sum of sine and cosine functions. I am interested in transforms that can capture a vector with as few coefficients as possible. The Discrete Cosine Transform (DCT) achieves better energy concentration than the DFT and also solves the frequency-leak problem that plagues the DFT (Agrawal, Faloutsos, & Swami, 1993).

The Discrete Wavelet Transform (DWT) is also related to the DFT but achieves better lossy data compression. The Haar transform is a simple wavelet transform that operates on a time sequence and computes the sum and difference of its halves, recursively. The DWT can be applied to signals with multiple dimensions, one dimension at a time (Press, Teukolsky, Vetterling, & Flannery, 1996). To illustrate how a one-dimensional wavelet transform works, consider an image with four pixels having the following values: [9, 7, 3, 5] (Stollnitz, Derose, & Salesin, 1996). We obtain a lower resolution image by substituting pairs of pixel values with their average: [8, 4]. Information is lost due to down sampling. The original pixels can be recovered by storing the detail coefficients, given as 1 = 9 - 8 and -1 = 3 - 4, that is, [1, -1]. Another averaging and detailing step yields [6] and [2]. The wavelet transform of the original image is then [6, 2, 1, -1]. In fact, for normalization purposes, the last two coefficients have to be divided by the square root of 2. Wavelet compression is attained by not retaining all the coefficients.
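
The same example can be written as a few lines of Python; the helper functions below implement the unnormalized averaging/differencing steps and their inverse.

# The [9, 7, 3, 5] example as code: repeated averaging and differencing
# yields the Haar transform [6, 2, 1, -1], and the inverse recovers the signal.
def haar(signal):
    coeffs = []
    while len(signal) > 1:
        averages = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
        details = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
        coeffs = details + coeffs          # keep the finer details at the end
        signal = averages
    return signal + coeffs

def inverse_haar(coeffs):
    signal = coeffs[:1]
    rest = coeffs[1:]
    while rest:
        details, rest = rest[:len(signal)], rest[len(signal):]
        signal = [v for avg, d in zip(signal, details) for v in (avg + d, avg - d)]
    return signal

print(haar([9, 7, 3, 5]))          # [6.0, 2.0, 1.0, -1.0]
print(inverse_haar([6, 2, 1, -1])) # [9, 7, 3, 5]
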
As far as data compression in data warehousing is concerned, a k-d DWT can be applied to a k-d data cube to obtain a compressed approximation by saving a fraction of the strongest coefficients. An approximate computation of multidimensional aggregates for sparse data using wavelets is reported in Vitter and Wang (1999).


REGRESSION, MULTIREGRESSION, AND LOG-LINEAR MODELS

Linear regression in its simplest form, given two vectors x and y, uses the least squares formula to obtain the approximation Y = a + bX. In case y is a polynomial in x, for example, Y = a + bX + cX^2 + dX^3, the transformation X_i = x^i can be used to obtain a linear model, so that it can be solved by using the least squares method. The case Y = bX^n can be handled by first taking logarithms of both sides. In effect, a set of points can be represented by a compact formula. Most cases have multiple columns of data, so that multiregression can be used.

Log-linear modeling arises in the context of discrete multivariate analysis and seeks approximations to discrete multidimensional probability distributions. For example, a probability distribution over several variables is factored into a product of the joint probabilities of subsets of the variables; this is where data compression is achieved.
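
As a short illustration, the numpy sketch below fits a cubic polynomial by least squares on transformed inputs and recovers the parameters of a power law after taking logarithms; the data points are synthetic.

# Least-squares fit of Y = a + bX + cX^2 + dX^3 and a log-transformed power-law fit.
import numpy as np

x = np.linspace(1.0, 5.0, 20)
y = 2.0 + 0.5 * x - 0.3 * x**2 + 0.1 * x**3

A = np.column_stack([np.ones_like(x), x, x**2, x**3])   # the transformation X_i = x**i
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coeffs, 3))      # [ 2.   0.5 -0.3  0.1]; the points reduce to four numbers

y2 = 3.0 * x**1.7               # Y = bX**n becomes linear after taking logarithms
slope, intercept = np.polyfit(np.log(x), np.log(y2), 1)
print(round(np.exp(intercept), 3), round(slope, 3))     # b = 3.0, n = 1.7
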
HISTOGRAMS

Data compression in the form of histograms is a well-known database technique. A histogram on a column or attribute can assist the relational query optimizer in estimating the selectivity of that attribute. For example, if the column has a nonclustered index, the histogram can determine that using the index would be more costly because the selectivity of the attribute is low (Ramakrishnan & Gehrke, 2003). Histograms are categorized into equidepth, equiwidth, and compressed categories; in the last case, separate buckets are assigned to the most frequent attribute values.
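
The following sketch contrasts equiwidth and equidepth bucket boundaries on a synthetic, skewed column and derives a coarse selectivity estimate from the equidepth histogram; the bucket count and data are invented for illustration.

# Equiwidth vs. equidepth bucket boundaries for a skewed column.
import numpy as np

rng = np.random.default_rng(1)
salaries = rng.zipf(2.0, size=10_000) * 1_000        # highly skewed values
salaries = salaries[salaries <= 200_000]

num_buckets = 5
equiwidth_edges = np.linspace(salaries.min(), salaries.max(), num_buckets + 1)
equidepth_edges = np.quantile(salaries, np.linspace(0, 1, num_buckets + 1))

print(np.round(equiwidth_edges))   # equal-width ranges; most rows fall in the first bucket
print(np.round(equidepth_edges))   # unequal ranges; each bucket holds about the same number of rows

# A very rough selectivity estimate for "salary <= 2000": the fraction of
# equidepth buckets whose upper edge does not exceed 2000.
selectivity = np.searchsorted(equidepth_edges, 2_000) / num_buckets
print(selectivity)
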
INDEX TREES

Index trees are used in facilitating access to large


datasets, for example, relational tables, which is usu-
CLUSTERING ally done based on a single attribute, such as age, or two
attributes, age and salary, by two separate one-dimen-
Clustering partitions records into multiple groups or sional index structures. B+ trees are the most popular
clusters, such that (a) records in one cluster are similar indexing structures for a single dimension, with each
to each other and (b) records in different clusters are node splitting the range of values assigned to it n + 1
dissimilar to each other (Ramakrishnan & Gehrke, 2003). ways, where n is the node fanout. The nodes of the B+
A comprehensive review of basic clustering methods from the viewpoint of database and data-mining applications is given in Dunham (2003). There is emphasis on clustering methods for very large datasets, which cannot be held in main memory. It is important to keep the number of passes over the disk-resident data for clustering to a minimum.

INDEX TREES

Index trees are used in facilitating access to large datasets, for example, relational tables, which is usually done based on a single attribute, such as age, or two attributes, age and salary, by two separate one-dimensional index structures. B+ trees are the most popular indexing structures for a single dimension, with each node splitting the range of values assigned to it n + 1 ways, where n is the node fanout. The nodes of the B+ tree correspond to 4 or 8 KB database pages and have an occupancy of at least 50%. B+ trees are self-organizing and remain balanced as a result of insertions and deletions. The larger the value of n, the lower the tree. Typical B+ trees have three to four levels; the top two levels fit in main memory. To determine employees who are 31 to 35 years old and earn $90,000 to $120,000 annually is then tantamount to a set intersection of the records qualified by the two B+ trees.
R-trees, developed in the mid-1980s, deal with multiple dimensions simultaneously. Over the years, numerous multidimensional indexing methods (MDIMs) have been proposed (Gaede & Guenther, 1998), although their usage in operational database systems remains rather limited. MDIMs have been classified into data-partitioning and space-partitioning methods. In the former case the partitioning is carried out based on insertions and deletions of multidimensional points in the index. Examples are R-trees and their variations. Space-partitioning methods, such as quad-trees and k-d-b trees, recursively partition the space globally when local overflows occur. As a result, space-partitioning methods may not be balanced. I will discuss only data-partitioning methods in the remainder of this article.

MDIMs tend to be viable for a limited number of dimensions, and as the number of dimensions increases, the dimensionality curse sets in (Faloutsos, 1996). In effect, the index loses its effectiveness, because a rather large fraction of the pages constituting the index are touched as part of query processing.

Indexes need some extensions to provide summary data for data reduction. They can be considered as hierarchical histograms, but there is no easy way to extract this information from an index.

SAMPLING

Sampling achieves data compression by selecting an appropriate subset of a large dataset to which a (relational) query is applied. Sampling can be categorized as follows (Han & Kamber, 2001): (a) simple random sample without replacement, (b) simple random sample with replacement, (c) cluster sample, and (d) stratified sample (examples follow).

Consider a large university, which maintains student records in a relational table. Two columns of this table are of interest: year of study (i.e., freshman, sophomore, junior, senior) and the grade point average (GPA). The average GPA by year of study can be specified succinctly as an SQL GROUP BY query. Assuming that neither column is indexed, instead of scanning the table to obtain the average GPAs, the system may randomly select records from the table to carry out this task. Online query processing is possible in this context; that is, the system starts displaying to the user the average GPAs by year, based on the samples obtained so far, together with an error bound. The running of the query can be stopped as soon as the user is satisfied (Hellerstein, Haas, & Wang, 1997). When the data is not ordered by one of the attributes under consideration (for example, it is ordered alphabetically by student last names), all the records in a page can be used in sampling to reduce the number of disk accesses. This is an example of a cluster sample. Stratified sampling could be attained if the records were indexed according to the year.
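The sketch below (my own illustration; the student table is synthetic) approximates the GROUP BY averages from a simple random sample without replacement instead of a full scan.

```python
import random
from collections import defaultdict

def approximate_group_average(records, sample_size, seed=0):
    """Estimate the average GPA per year of study from a simple random
    sample (without replacement) instead of scanning the whole table."""
    sample = random.Random(seed).sample(records, sample_size)
    totals, counts = defaultdict(float), defaultdict(int)
    for year, gpa in sample:
        totals[year] += gpa
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in totals}

# Made-up student table: (year_of_study, gpa).
random.seed(1)
students = [(random.choice(["freshman", "sophomore", "junior", "senior"]),
             round(random.uniform(2.0, 4.0), 2)) for _ in range(100_000)]

print(approximate_group_average(students, sample_size=2_000))
```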
FUTURE TRENDS

With rapid increases in the volume of data being held for data mining and data warehousing, lossy yet accurate data compression methods are required. This is also important from the viewpoint of collecting data from remote sources with low bandwidth transmission capability. The new field of data streams (Golab & Ozsu, 2003) uses summarization methods, such as transmitting averages rather than detailed values.

CONCLUSION

I have provided a summary of data reduction methods applicable to data warehousing and databases in general. I have also discussed data compression. Appropriate references are given for further study.

ACKNOWLEDGMENT

Supported by NSF through Grant 0105485 in Computer Systems Architecture.

REFERENCES

Agrawal, A., Faloutsos, C., & Swami, A. (1993). Efficient similarity search in sequence databases. Proceedings of the Foundations of Data Organization and Algorithms Conference (pp. 69-84), USA.

Barbara, D., et al. (1997). The New Jersey data reduction report. Data Engineering Bulletin, 20(4), 3-42.

Castelli, V., Thomasian, A., & Li, C. S. (2003). CSVD: Clustering and singular value decomposition for approximate similarity search in high dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 15(3), 671-685.

Chakrabarti, K., & Mehrotra, S. (2000). Local dimensionality reduction: A new approach to indexing high-dimensional spaces. Proceedings of the 26th International Conference on Very Large Data Bases (pp. 89-100), Egypt.

Dunham, M. H. (2003). Data mining: Introductory and advanced topics. Prentice-Hall.
Faloutsos, C. (1996). Searching multimedia databases by content. Kluwer Academic Publishers.

Faloutsos, C., & Lin, K. I. (1995). Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Proceedings of the ACM SIGMOD International Conference (pp. 163-174), USA.

Gaede, V., & Guenther, O. (1998). Multidimensional indexing methods. ACM Computing Surveys, 30(2), 170-231.

Golab, L., & Ozsu, M. T. (2003). Data stream management issues: A survey. ACM SIGMOD Record, 32(2), 5-14.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Hellerstein, J. M., Haas, P. J., & Wang, H. J. (1997). Online aggregation. Proceedings of the ACM SIGMOD International Conference (pp. 171-182), USA.

Korn, F., Jagadish, H., & Faloutsos, C. (1997). Efficiently supporting ad hoc queries in large datasets of time sequences. Proceedings of the ACM SIGMOD International Conference (pp. 289-300), USA.

Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills, CA: Sage Publications.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1996). Numerical recipes in C: The art of scientific computing. Cambridge University Press.

Ramakrishnan, K., & Gehrke, J. (2003). Database management systems (3rd ed.). McGraw-Hill.

Sayood, K. (2002). Introduction to data compression (2nd ed.). Elsevier.

Stollnitz, E. J., Derose, T. D., & Salesin, D. H. (1996). Wavelets for computer graphics: Theory and applications. Prentice-Hall.

Vetterli, M., & Kovacevic, J. (1995). Wavelets and subband coding. Prentice-Hall.

Vitter, J. S., & Wang, M. (1999). An approximate computation of multidimensional aggregates of sparse data using wavelets. Proceedings of the ACM SIGMOD International Conference (pp. 193-204), USA.

Witten, I. H., Bell, T., & Moffat, A. (1999). Managing gigabytes: Compressing and indexing documents and images (2nd ed.). Morgan Kaufmann.

KEY TERMS

Clustering: The process of grouping objects based on their similarity and dissimilarity. Similar objects should be in the same cluster, which is different from the cluster for dissimilar objects.

Histogram: A data structure that summarizes one or more attributes or columns of a relational DBMS to assist the query optimizer.

Index Tree: Partitions the space in a single dimension or in multiple dimensions for efficient access to the subset of the data that is of interest.

Karhunen-Loeve Transform (KLT): Utilizes principal component analysis or singular value decomposition to minimize the distance error introduced for a given level of dimensionality reduction.

Principal Component Analysis (PCA): Computes the eigenvectors for the principal components and uses them to transform a matrix X into a matrix Y whose columns are aligned with the principal components. Dimensionality is reduced by discarding the columns in Y with the least variance or energy.

Sampling: A technique for selecting units from a population so that, by studying the sample, the results may fairly be generalized back to the population.

Singular Value Decomposition (SVD): Attains the same goal as PCA by decomposing the X matrix into a U matrix, a diagonal matrix of eigenvalues, and a matrix of eigenvectors that are the same as those obtained by PCA.

Wavelet Transform: A method to transform data so that it can be represented compactly.

Data Warehouse Back-End Tools


Alkis Simitsis
National Technical University of Athens, Greece

Dimitri Theodoratos
New Jersey Institute of Technology, USA

INTRODUCTION

The back-end tools of a data warehouse are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. They are known under the general term extraction, transformation, and loading (ETL) tools. In all the phases of an ETL process (extraction and exportation, transformation and cleaning, and loading), individual issues arise and, along with the problems and constraints that concern the overall ETL process, make its lifecycle a very complex task.

Figure 1. Abstract architecture of a data warehouse

BACKGROUND
A Data Warehouse (DW) is a collection of technologies aimed at enabling the knowledge worker (executive, manager, analyst, etc.) to make better and faster decisions. Data warehouses typically are divided into the front-end part, concerning end users who access the data warehouse with decision-support tools, and the back-stage part, where the collection, integration, cleaning, and transformation of data takes place in order to populate the warehouse. The architecture of a data warehouse exhibits various layers of data in which data from one layer are derived from data of the previous layer (Figure 1). The processes that take part in the back stage of the data warehouse are data intensive, complex, and costly (Vassiliadis, 2000). Several reports mention that most of these processes are constructed through an in-house development procedure that can consume up to 70% of the resources for a data warehouse project (Gartner, 2003).

In order to facilitate and manage the data warehouse operational processes, commercial tools exist in the market under the general title Extraction-Transformation-Loading (ETL) tools. To give a general idea of the functionality of these tools, we mention their most prominent tasks, which include (a) the identification of relevant information at the source side; (b) the extraction of this information; (c) the customization and integration of the information coming from multiple sources into a common format; (d) the cleaning of the resulting data set on the basis of database and business rules; and (e) the propagation of the data to the data warehouse and/or data marts. In the sequel, we will adopt the general acronym ETL for all kinds of in-house or commercial tools and all the aforementioned categories of tasks.

In Figure 2, we abstractly describe the general framework for ETL processes. On the left side, we can observe the original data providers (sources). Typically, data providers are relational databases and files. The data from these sources are extracted by extraction routines, which provide either complete snapshots or differentials of the data sources. Then, these data are propagated to the Data Staging Area (DSA), where they are transformed and cleaned before being loaded to the data warehouse. Intermediate results, again in the form of (mostly) files or relational tables, are part of the data staging area. The data warehouse is depicted in the right part of Figure 2 and comprises the target data stores (i.e., fact tables for the storage of information and dimension tables with the description and the multidimensional, roll-up hierarchies of the stored facts). The loading of the central warehouse is performed from the loading activities depicted right before the data warehouse data store.
Figure 2. The environment of extract-transformation-load processes (Sources, Extract, Data Staging Area (DSA), Transform & Clean, Load, DW)

State of the Art

In the past, there have been research efforts toward the design and optimization of ETL tasks. We mention three research prototypes: (a) the AJAX system (Galhardas et al., 2000); (b) the Potter's Wheel system (Raman & Hellerstein, 2001); and (c) ARKTOS II (Arktos II, 2004). The first two prototypes are based on algebras, which we find mostly tailored for the case of homogenizing Web data; the latter concerns the modeling and the optimization of ETL processes in a customizable and extensible manner.

An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in Jarke et al. (2000). Rundensteiner (1999) offers a discussion of various aspects of data transformations. Sarawagi (2000) offers a similar collection of papers in the field of data cleaning, including a survey (Rahm & Do, 2000) that provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions to specific problems (Monge, 2000; Borkar et al., 2000). In a related but different context, we would like to mention the IBIS tool (Calì et al., 2003). IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Moreover, there is a variety of ETL tools in the market. Simitsis (2003) lists the ETL tools available at the time this article was written.

MAIN THRUST

In this section, we briefly review the problems and constraints that concern the overall ETL process, as well as the individual issues that arise separately in each phase of an ETL process (extraction and exportation, transformation and cleaning, and loading). Simitsis (2004) offers a detailed study of the problems described in this article and presents a framework toward the modeling and the optimization of ETL processes.

Scalzo (2003) mentions that 90% of the problems in data warehouses arise from the nightly batch cycles that load the data. At this stage, the administrators have to deal with problems like (a) efficient data loading and (b) concurrent job mixture and dependencies. Moreover, ETL processes have global time constraints, including the time they must be initiated and their completion deadlines. In fact, in most cases, there is a tight time window in the night that can be exploited for the refreshment of the data warehouse, since the source system is off-line or not heavily used during this period. Other general problems include the scheduling of the overall process, the finding of the right execution order for dependent jobs and job sets on the existing hardware for the permitted time schedule, and the maintenance of the information in the data warehouse.

Phase I: Extraction and Transportation

During the ETL process, a first task that must be performed is the extraction of the relevant information that has to be propagated further to the warehouse (Theodoratos et al., 2001). In order to minimize the overall processing time, this involves only a fraction of the source data that has changed since the previous execution of the ETL process, mainly concerning the newly inserted and possibly updated records. Usually, change detection is performed physically by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). Efficient algorithms exist for this task, like the snapshot differential algorithms presented in Labio and Garcia-Molina (1996). Another technique is log sniffing (i.e., the scanning of the log file in order to reconstruct the changes performed since the last scan). In rare cases, change detection can be facilitated by the use of triggers. However, this solution is technically impossible for many of the sources that are legacy systems or plain flat files. In numerous other cases, where relational systems are used at the source side, the usage of triggers also is prohibitive due to the performance degradation that their usage incurs and to the need to intervene in the structure of the database. Moreover, another crucial issue concerns the transportation of data after the extraction, where tasks like FTP, encryption-decryption, compression-decompression, and so forth can possibly take place.
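The following sketch (my own illustration, not the algorithm of Labio and Garcia-Molina) conveys the idea of snapshot comparison: two snapshots keyed by the source's primary key are diffed into inserted, updated, and deleted rows; the sample rows are invented.

```python
def snapshot_diff(previous, current):
    """Compare two snapshots of a source table (dicts keyed by the primary
    key) and report the rows inserted, updated, or deleted since the
    previous extraction."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {k: current[k] for k in current.keys() & previous.keys()
               if current[k] != previous[k]}
    return inserted, updated, deleted

prev_snapshot = {1: ("red", 10.0), 2: ("green", 12.5), 3: ("blue", 7.0)}
curr_snapshot = {1: ("red", 10.0), 2: ("green", 13.0), 4: ("yellow", 5.0)}

ins, upd, dele = snapshot_diff(prev_snapshot, curr_snapshot)
print("inserted:", ins)
print("updated: ", upd)
print("deleted: ", dele)
```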
Phase II: Transformation and Cleaning

It is possible to determine typical tasks that take place during the transformation and cleaning phase of an ETL process. Rahm and Do (2000) further detail this phase in the following tasks: (a) data analysis; (b) definition of transformation workflow and mapping rules; (c) verification; (d) transformation; and (e) backflow of cleaned data. In terms of the transformation tasks, we distinguish two main classes of problems (Lenzerini, 2002): (a) conflicts and problems at the schema level (e.g., naming and structural conflicts) and (b) data-level transformations (i.e., at the instance level).

The integration and transformation programs perform a wide variety of functions, such as reformatting data, recalculating data, modifying key structures of data, adding an element of time to data warehouse data, identifying default values of data, supplying logic to choose between multiple sources of data, summarizing data, merging data from multiple sources, and so forth.

In the sequel, we present four common ETL transformation cases as examples: (a) semantic normalization and denormalization; (b) surrogate key assignment; (c) slowly changing dimensions; and (d) string problems. The research prototypes presented in the previous section and several commercial tools have already made some progress in tackling problems like these four. Still, their presentation in this article aspires to make the reader understand that the whole process should be discriminated from the way we have resolved integration issues until now.

Semantic Normalization and Denormalization

It is common for source data to be long denormalized records, possibly involving more than 100 attributes. This is due to the fact that bad database design or classical COBOL tactics led to the gathering of all the necessary information for an application in a single table/file. Frequently, data warehouses also are highly denormalized in order to answer certain queries more quickly. But sometimes, it is imperative to normalize somehow the input data. Consider, for example, a table of the form R(KEY, TAX, DISCOUNT, PRICE), which we would like to transform to a table of the form R'(KEY, CODE, AMOUNT). For example, the input tuple t[key, 30, 60, 70] is transformed into the tuples t1[key, 1, 30], t2[key, 2, 60], and t3[key, 3, 70]. The transformation of information organized in rows to information organized in columns is called rotation or denormalization, since frequently the derived values (e.g., the total income, in our case) also are stored as columns, functionally dependent on other attributes. Occasionally, it is possible to apply the reverse transformation in order to normalize denormalized data before they are loaded to the data warehouse.
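A minimal sketch of the two directions of this rotation follows (my own illustration; the code numbers 1, 2, 3 stand for TAX, DISCOUNT, and PRICE, as in the example above).

```python
CODES = {"TAX": 1, "DISCOUNT": 2, "PRICE": 3}

def normalize(row):
    """R(KEY, TAX, DISCOUNT, PRICE) -> rows of R'(KEY, CODE, AMOUNT)."""
    key, tax, discount, price = row
    values = {"TAX": tax, "DISCOUNT": discount, "PRICE": price}
    return [(key, CODES[col], values[col]) for col in ("TAX", "DISCOUNT", "PRICE")]

def denormalize(rows):
    """The reverse rotation: R'(KEY, CODE, AMOUNT) -> R(KEY, TAX, DISCOUNT, PRICE)."""
    by_code = {code: amount for _, code, amount in rows}
    key = rows[0][0]
    return (key, by_code[1], by_code[2], by_code[3])

print(normalize(("k1", 30, 60, 70)))   # [('k1', 1, 30), ('k1', 2, 60), ('k1', 3, 70)]
print(denormalize(normalize(("k1", 30, 60, 70))))
```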

Surrogate Keys

In a data warehouse project, we usually replace the keys of the production systems with a uniform key, which we call a surrogate key (Kimball et al., 1998). The basic reasons for this replacement are performance and semantic homogeneity. Performance is affected by the fact that textual attributes are not the best candidates for indexed keys and need to be replaced by integer keys. More importantly, semantic homogeneity causes reconciliation problems, since different production systems might use different keys for the same object (synonyms) or the same key for different objects (homonyms), resulting in the need for a global replacement of these values in the data warehouse. Observe row (20, green) in table Src_1 of Figure 3. This row has a synonym conflict with row (10, green) in table Src_2, since they represent the same real-world entity with different IDs, and a homonym conflict with row (20, yellow) in table Src_2 (over attribute ID). The production key ID is replaced by a surrogate key through a lookup table of the form Lookup(SourceID, Source, SurrogateKey). The Source column of this table is required, because there can be synonyms in the different sources, which are mapped to different objects in the data warehouse (e.g., value 10 in tables Src_1 and Src_2). At the end of this process, the data warehouse table DW has globally unique, reconciled keys.

Figure 3. Surrogate key assignment
Src_1 (ID, COLOR): (10, red), (20, green)
Src_2 (ID, COLOR): (10, green), (20, yellow)
DW (ID, COLOR): (100, red), (200, green), (300, yellow)
Lookup (SourceID, Source, SurrogateKey): (10, Src_1, 100), (20, Src_1, 200), (10, Src_2, 200), (20, Src_2, 300)
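A minimal sketch of the key replacement step follows (my own illustration). The lookup table is taken as given here; in practice it is maintained by the reconciliation step that maps synonyms across sources to the same surrogate key, as Figure 3 shows for (20, Src_1) and (10, Src_2).

```python
from itertools import count

# Lookup(SourceID, Source, SurrogateKey), as in Figure 3.
lookup = {
    (10, "Src_1"): 100,
    (20, "Src_1"): 200,
    (10, "Src_2"): 200,   # same real-world entity as (20, "Src_1")
    (20, "Src_2"): 300,
}
new_keys = count(start=400, step=100)   # surrogates for unseen production keys

def assign_surrogate_keys(rows, source):
    """Rewrite (production_id, attribute...) rows with globally unique,
    reconciled surrogate keys taken from the lookup table."""
    out = []
    for production_id, *attrs in rows:
        key = (production_id, source)
        if key not in lookup:            # unseen key: allocate a new surrogate
            lookup[key] = next(new_keys)
        out.append((lookup[key], *attrs))
    return out

print(assign_surrogate_keys([(10, "red"), (20, "green")], "Src_1"))    # [(100, 'red'), (200, 'green')]
print(assign_surrogate_keys([(10, "green"), (20, "yellow")], "Src_2")) # [(200, 'green'), (300, 'yellow')]
```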
Slowly Changing Dimensions

Factual data are not the only data that change. The dimension values change at the sources, too, and there are several policies to apply for their propagation to the data warehouse. Kimball et al. (1998) present three policies for dealing with this problem: overwrite (Type 1); create a new dimensional record (Type 2); push down the changed value into an old attribute (Type 3).

For Type 1 processing, we only need to issue appropriate update commands in order to overwrite attribute values in existing dimension table records. This policy can be used whenever we do not want to track history in dimension changes and if we want to correct erroneous values. In this case, we use an old version of the dimension data D_old as they were received from the source and their current version D_new. We discriminate the new and updated rows through the respective operators. The new rows are assigned a new surrogate key through a function application. The updated rows are assigned a surrogate key, which is the same as the one that their previous version had already been assigned. Then, we can join the updated rows with their old versions from the target table, which subsequently will be deleted, and project only the attributes with the new values.

In the Type 2 policy, we copy the previous version of the dimension record and create a new one with a new surrogate key. If there is no previous version of the dimension record, we create a new one from scratch; otherwise, we keep them both. This policy can be used whenever we want to track the history of dimension changes.

Finally, Type 3 processing is also very simple, since, again, we only have to issue update commands to existing dimension records. For each attribute A of the dimension table that is checked for updates, we need to have an extra attribute called old_A. Each time we spot a new value for A, we write the current A value to the old_A field and then write the new value to attribute A. In this way, we can have both new and old values present in the same dimension record.
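The following sketch illustrates the Type 2 policy (my own illustration; the valid_from/valid_to columns and the customer dimension are assumptions added to make versioning visible, not details from the article).

```python
from datetime import date

def scd_type2_upsert(dimension, natural_key, new_attrs, next_surrogate, today=None):
    """Type 2 handling: close the current version of the dimension record
    (if any) and append a new version with a fresh surrogate key, so the
    history of changes is preserved."""
    today = today or date.today()
    current = next((row for row in dimension
                    if row["natural_key"] == natural_key and row["valid_to"] is None), None)
    if current is not None:
        if all(current[k] == v for k, v in new_attrs.items()):
            return current["surrogate_key"]          # nothing changed
        current["valid_to"] = today                  # close the old version
    new_row = {"surrogate_key": next_surrogate, "natural_key": natural_key,
               "valid_from": today, "valid_to": None, **new_attrs}
    dimension.append(new_row)
    return next_surrogate

customer_dim = []
scd_type2_upsert(customer_dim, "C042", {"city": "Athens"}, next_surrogate=1)
scd_type2_upsert(customer_dim, "C042", {"city": "Newark"}, next_surrogate=2)
for row in customer_dim:
    print(row)
```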
to refresh the warehouse with as fresh of data as
String Problems

A major challenge in ETL processes is the cleaning and the homogenization of string data (e.g., data that stands for addresses, acronyms, names, etc.). Usually, the approaches for the solution of this problem include the application of regular expressions for the normalization of string data to a set of reference values.
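A small sketch of this approach follows (my own illustration; the abbreviation table and addresses are invented).

```python
import re

# Map common street-type abbreviations to a single reference value.
STREET_TYPES = {r"\bst\.?\b": "Street", r"\bave?\.?\b": "Avenue", r"\brd\.?\b": "Road"}

def normalize_address(raw):
    """Normalize a free-form address string to reference values: collapse
    whitespace, fix capitalization, and expand abbreviations."""
    text = re.sub(r"\s+", " ", raw.strip()).title()
    for pattern, replacement in STREET_TYPES.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

for addr in ["12  main st", "12 Main St.", "12 MAIN STREET"]:
    print(normalize_address(addr))      # all three map to "12 Main Street"
```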
extraction of data from several sources, their cleansing,
Phase III: Loading

The final loading of the data warehouse has its own technical challenges. A major problem is the ability to discriminate between new and existing data at loading time. This problem arises when a set of records has to be classified into (a) the new rows that need to be appended to the warehouse and (b) rows that already exist in the data warehouse, but whose value has changed and must be updated (e.g., with an UPDATE command). Modern ETL tools already provide mechanisms for this problem, mostly through language predicates. Also, simple SQL commands are not sufficient, since the open-loop-fetch technique, where records are inserted one by one, is extremely slow for the vast volume of data to be loaded in the warehouse. An extra problem is the simultaneous usage of the rollback segments and log files during the loading process. The option to turn them off contains some risk in the case of a loading failure. So far, the best technique seems to be the usage of the batch loading tools offered by most RDBMSs, which avoid these problems. Other techniques that facilitate the loading task involve the creation of tables at the same time as the creation of the respective indexes, the minimization of inter-process wait states, and the maximization of concurrent CPU usage.

FUTURE TRENDS

In terms of financial growth, a recent study (Giga Information Group, 2002) reports that the ETL market reached a size of $667 million for year 2001; still, the growth rate reached a rather low 11% (as compared to a rate of 60% growth for year 2000). More recent studies (Giga Information Group, 2002; Gartner, 2003) regard the ETL issue as a research challenge and pinpoint several topics for future work:

• Integration of ETL with XML adapters; EAI (Enterprise Application Integration) tools (e.g., MQ-Series); customized data quality tools; and the move toward parallel processing of the ETL workflows.
• Active ETL (Adzic & Fiore, 2003), meaning the need to refresh the warehouse with data as fresh as possible (ideally, online).
• Extension of the ETL mechanisms for non-traditional data, like XML/HTML, spatial, and biomedical data.

CONCLUSION

ETL tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. In all the phases of an ETL process (extraction and exportation, transformation and cleaning, and loading), individual issues arise and, along with the problems and constraints that concern the overall ETL process, make its lifecycle a very troublesome task. The key factors underlying the main problems of ETL workflows are (a) vastness of the data volumes; (b) quality problems, since data are not always clean and have to be cleansed; (c) performance, since the whole process has to take place within a specific time window; and (d) evolution of the sources and the data warehouse, which can eventually lead even to daily maintenance operations. Although the state of the art in the field of both research and commercial ETL tools includes some signs of progress, much work remains to be done before we can claim that this problem is resolved. In our opinion, there are several issues that are technologically open and that present interesting topics of research for the future in the field of data integration in data warehouse environments.
REFERENCES

Adzic, J., & Fiore, V. (2003). Data warehouse population platform. Proceedings of 5th International Workshop on the Design and Management of Data Warehouses (DMDW), Berlin, Germany.

Arktos II. (2004). A framework for modeling and managing ETL processes. Retrieved from http://www.dblab.ece.ntua.gr/~asimi

Borkar, V., Deshmuk, K., & Sarawagi, S. (2000). Automatically extracting structure from free text addresses. Bulletin of the Technical Committee on Data Engineering, 23(4).

Calì, A., et al. (2003). IBIS: Semantic data integration at work. Proceedings of the 15th CAiSE.

Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000). Ajax: An extensible data cleaning tool. Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, Texas.

Gartner. (2003). ETL magic quadrant update: Market pressure increases. Retrieved from http://www.gartner.com/reprints/informatica/112769.html

Giga Information Group. (2002). Market overview update: ETL. Technical Report RPA-032002-00021.

Inmon, W.-H. (1996). Building the data warehouse. New York: John Wiley & Sons, Inc.

Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (Eds.). (2000). Fundamentals of data warehouses. Springer-Verlag.

Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998). The data warehouse lifecycle toolkit: Expert methods for designing, developing, and deploying data warehouses. New York: John Wiley & Sons.

Labio, W., & Garcia-Molina, H. (1996). Efficient snapshot differential algorithms for data warehousing. Proceedings of 22nd International Conference on Very Large Data Bases (VLDB), Bombay, India.

Lenzerini, M. (2002). Data integration: A theoretical perspective. Proceedings of 21st Symposium on Principles of Database Systems (PODS), Wisconsin.

Monge, A. (2000). Matching algorithms within a duplicate detection system. Bulletin of the Technical Committee on Data Engineering, 23(4).

Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4).

Raman, V., & Hellerstein, J. (2001). Potter's Wheel: An interactive data cleaning system. Proceedings of 27th International Conference on Very Large Data Bases (VLDB), Rome, Italy.

Rundensteiner, E. (Ed.). (1999). Special issue on data transformations. Bulletin of the Technical Committee on Data Engineering, 22(1).

Sarawagi, S. (2000). Special issue on data cleaning. Bulletin of the Technical Committee on Data Engineering, 23(4).

Scalzo, B. (2003). Oracle DBA guide to data warehousing and star schemas. Upper Saddle River, NJ: Prentice Hall.

Simitsis, A. (2003). List of ETL tools. Retrieved from http://www.dbnet.ece.ntua.gr/~asimi/ETLTools.htm

Simitsis, A. (2004). Modeling and managing extraction-transformation-loading (ETL) processes in data warehouse environments [doctoral thesis]. National Technical University of Athens, Greece.

Theodoratos, D., Ligoudistianos, S., & Sellis, T. (2001). View selection for designing the global data warehouse. Data & Knowledge Engineering, 39(3), 219-240.

Vassiliadis, P. (2000). Gulliver in the land of data warehousing: Practical experiences and observations of a researcher. Proceedings of 2nd International Workshop on Design and Management of Data Warehouses (DMDW), Sweden.

KEY TERMS

Data Mart: A logical subset of the complete data warehouse. We often view the data mart as the restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.

Data Staging Area (DSA): An auxiliary area of volatile data employed for the purpose of data transformation, reconciliation, and cleaning before the final loading of the data warehouse.

Data Warehouse: A subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data (Inmon, 1996).

ETL: Extract, transform, and load (ETL) are data warehousing functions that involve extracting data from outside sources, transforming them to fit business needs, and ultimately loading them into the data warehouse. ETL is an important part of data warehousing, as it is the way data actually get loaded into the warehouse.

Online Analytical Processing (OLAP): The general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting that is exemplified by a number of OLAP vendors.

Source System: An operational system of record whose function is to capture the transactions of the business. A source system is often called a legacy system in a mainframe environment.

Target System: The physical machine on which the data warehouse is organized and stored for direct querying by end users, report writers, and other applications. A target system is often called a presentation server.

Data Warehouse Performance


Beixin (Betsy) Lin
Montclair State University, USA

Yu Hong
BearingPoint Inc., USA

Zu-Hsu Lee
Montclair State University, USA

INTRODUCTION

A data warehouse is a large electronic repository of information that is generated and updated in a structured manner by an enterprise over time to aid business intelligence and to support decision making. Data stored in a data warehouse is non-volatile and time variant and is organized by subjects in a manner to support decision making (Inmon, Rudin, Buss, & Sousa, 1998). Data warehousing has been increasingly adopted by enterprises as the backbone technology for business intelligence reporting, and query performance has become the key to the successful implementation of data warehouses. According to a survey of 358 businesses on reporting and end-user query tools, conducted by Appfluent Technology, data warehouse performance significantly affects the Return on Investment (ROI) on Business Intelligence (BI) systems and directly impacts the bottom line of the systems (Appfluent Technology, 2002). Even though in some circumstances it is very difficult to measure the benefits of BI projects in terms of ROI or dollar figures, management teams are still eager to have a single version of the truth, better information for strategic and tactical decision making, and more efficient business processes by using BI solutions (Eckerson, 2003).

Dramatic increases in data volumes over time and the mixed quality of data can adversely affect the performance of a data warehouse. Some data may become outdated over time and can be mixed with data that are still valid for decision making. In addition, data are often collected to meet potential requirements, but may never be used. Data warehouses also contain external data (e.g., demographic, psychographic, etc.) to support a variety of predictive data mining activities. All these factors contribute to the massive growth of data volume. As a result, even a simple query may become burdensome to process and cause overflowing system indices (Inmon, Rudin, Buss & Sousa, 2001). Thus, exploring the techniques of performance tuning becomes an important subject in data warehouse management.

BACKGROUND

There are inherent differences between a traditional database system and a data warehouse system, though to a certain extent, all databases are similarly designed to serve a basic administrative purpose, e.g., to deliver a quick response to transactional data processes such as entry, update, query, and retrieval. For many conventional databases, this objective has been achieved by online transactional processing (OLTP) systems (e.g., Oracle Corp, 2004; Winter & Auerbach, 2004). In contrast, data warehouses deal with a huge volume of data that are more historical in nature. Moreover, data warehouse designs are strongly organized for decision making by subject matter rather than by defined access or system privileges. As a result, a dimension model is usually adopted in a data warehouse to meet these needs, whereas an Entity-Relationship model is commonly used in an OLTP system. Due to these differences, an OLTP query usually requires much shorter processing time than a data warehouse query (Raden, 2003). Performance enhancement techniques are, therefore, especially critical in the arena of data warehousing.

Despite the differences, these two types of database systems share some common characteristics. Some techniques used in a data warehouse to achieve better performance are similar to those used in OLTP, while some are developed only in relation to data warehousing. For example, as in an OLTP system, an index is also used in a data warehouse system, though a data warehouse might have different kinds of indexing mechanisms based on its granularity. Partitioning is a technique that can be used in data warehouse systems as well (Silberstein, Eacrett, Mayer, & Lo, 2003).

On the other hand, some techniques are developed specifically to improve the performance of data warehouses. For example, aggregates can be built to provide a quick response time for summary information (e.g., Eacrett, 2003; Silberstein, 2003). Query parallelism can be implemented to speed up the query when data are queried from several tables (Silberstein et al., 2003). Caching and query statistics are unique to data warehouses, since the statistics help to build a smart cache for better performance. Also, pre-calculated reports are useful to certain groups of users who are only interested in seeing static reports (Eacrett, 2003). Periodic data compression and archiving helps to cleanse the data warehouse environment. Keeping only the necessary data online will allow faster access (e.g., Kimball, 1996).
MAIN THRUST

As discussed earlier, performance issues play a crucial role in a data warehouse environment. This article describes ways to design, build, and manage data warehouses for optimum performance. The techniques of tuning and refining the data warehouse discussed below have been developed in recent years to reduce operating and maintenance costs and to substantially improve the performance of new and existing data warehouses.

Performance Optimization at the Data Model Design Stage

Adopting a good data model design is a proactive way to enhance future performance. In the data warehouse design phase, the following factors should be taken into consideration.

• Granularity: Granularity is the main issue that needs to be investigated carefully before the data warehouse is built. For example, does the report need the data at the level of stock keeping units (SKUs), or just at the brand level? These are sizing questions that should be asked of business users before designing the model. Since a data warehouse is a decision support system, rather than a transactional system, the level of detail required is usually not as deep as in the latter. For instance, a data warehouse does not need data at the document level, such as sales orders and purchase orders, which are usually needed in a transactional system. In such a case, data should be summarized before they are loaded into the system. Defining the data that are needed, no more and no less, will determine the performance in the future. In some cases, the Operational Data Stores (ODS) will be a good place to store the most detailed granular-level data, and those data can be provided on a jump-query basis.
• Cardinality: Cardinality means the number of possible entries of the table. By collecting business requirements, the cardinality of the table can be decided. Given a table's cardinality, an appropriate indexing method can then be chosen.
• Dimensional Models: Most data warehouse designs use dimensional models, such as star schema, snowflake, and starflake. A star schema is a dimensional model with fully denormalized hierarchies, whereas a snowflake schema is a dimensional model with fully normalized hierarchies. A starflake schema represents a combination of a star schema and a snowflake schema (e.g., Moody & Kortink, 2003). Data warehouse architects should consider the pros and cons of each dimensional model before making a choice.

Aggregates

Aggregates are subsets of the fact table data (Eacrett, 2003). The data from the fact table are summarized into aggregates and stored physically in a different table than the fact table. Aggregates can significantly increase the performance of an OLAP query, since the query will read fewer data from the aggregates than from the fact table. Database read time is the major factor in query execution time; being able to reduce the database read time helps query performance a great deal, since fewer data are being read. However, the disadvantage of using aggregates is its loading performance. The data loaded into the fact table have to be rolled up to the aggregates, which means any newly updated records will have to be updated in the aggregates as well, to keep the data in the aggregates consistent with those in the fact table. Keeping the data as current as possible has presented a real challenge to data warehousing (e.g., Bruckner & Tjoa, 2002). The ratio of the database records transferred to the database records read is a good indicator of whether or not to use the aggregate technique. In practice, if the ratio is 1/10 or less, building aggregates will definitely help performance (Silberstein, 2003).
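As a minimal sketch (my own illustration using an in-memory SQLite database; the fact table and its columns are invented), an aggregate is simply a pre-summarized table that answers summary queries from far fewer rows. Note that it must be refreshed whenever new fact rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (day TEXT, year INTEGER, region TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('2004-01-03', 2004, 'EMEA', 120.0),
        ('2004-01-04', 2004, 'EMEA',  80.0),
        ('2004-02-11', 2004, 'AMER', 200.0),
        ('2005-03-07', 2005, 'AMER', 150.0);

    -- The aggregate: fact data rolled up to (year, region).
    CREATE TABLE sales_agg_year_region AS
        SELECT year, region, SUM(amount) AS total_amount, COUNT(*) AS row_count
        FROM sales_fact
        GROUP BY year, region;
""")

-- = None  # (placeholder removed)
```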
Database Partitioning

Logical partitioning means using criteria such as year, planned/actual data, or business regions to partition the database into smaller data sets. After logical partitioning, a database view is created to include all the partitioned tables. In this case, no extra storage is needed, and each partitioned table will be smaller, which accelerates the query (Silberstein, Eacrett, Mayer & Lo, 2003). Take a multinational company as an example. It is better to put the data from different countries into different data targets, such as cubes or data marts, than to put the data from all the countries into one data target. With logical partitioning (splitting the data into smaller cubes), a query can read the smaller cubes instead of one large cube, and several parallel processes can read the small cubes at the same time. Another benefit of logical partitioning is that each partitioned cube is less complex to load and easier to administer. Physical partitioning can also reach the same goal as logical partitioning. Physical partitioning means that the database table is cut into smaller chunks of data. The partitioning is transparent to the user. The partitioning will allow parallel processing of the query, and each parallel process will read a smaller set of data separately.

Query Parallelism

In business intelligence reporting, the query is usually very complex, and it might require the OLAP engine to read the data from different sources. In such a case, the technique of query parallelism will significantly improve the query performance. Query parallelism is an approach to split the query into smaller sub-queries and to allow parallel processing of the sub-queries. By using this approach, each sub-process takes a shorter time, which reduces the risk of system hogging by a single long-running process. In contrast, sequential processing definitely requires a longer time to process (Silberstein, Eacrett, Mayer & Lo, 2003).
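A toy sketch of the idea follows (my own illustration; the per-year partitions stand in for partitioned tables or cubes, and real systems would of course parallelize inside the database engine).

```python
from concurrent.futures import ThreadPoolExecutor

# Logical partitions of the fact data (e.g., one list per year).
partitions = {
    2003: [120.0, 80.0, 45.0],
    2004: [200.0, 310.0],
    2005: [150.0, 90.0, 60.0, 30.0],
}

def run_subquery(year):
    """One sub-query: aggregate a single partition."""
    return year, sum(partitions[year])

# Split the big query into per-partition sub-queries, run them in parallel,
# and merge the partial results into the final answer.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_totals = dict(pool.map(run_subquery, partitions))
print(partial_totals)
print("grand total:", sum(partial_totals.values()))
```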

Database Indexing

In a relational database, indexing is a well-known technique for reducing database read time. By the same token, in a data warehouse dimensional model, the use of indices on the fact table, dimension tables, and master data tables will improve the database read time.

• The Fact Table Index: By default the fact table will have the primary index on all the dimension keys. However, a secondary index can also be built to fit a different query design (e.g., McDonald, Wilmsmeier, Dixon, & Inmon, 2002). Unlike a primary index, which includes all the dimension keys, the secondary index can be built to include only some dimension keys to improve the performance. By having the right index, the query read time can be dramatically reduced.
• The Dimension Table Index: In a dimensional model, the size of the dimension table is a deciding factor affecting query performance. Thus, the index of the dimension table is important to decrease the master data read time, and thus to improve filtering and drill-down. Depending on the cardinality of the dimension table, different index methods will be adopted to build the index. For a low cardinality dimension table, a bit-map index is usually adopted. In contrast, for a high cardinality dimension table, the B-tree index should be used (e.g., McDonald, Wilmsmeier, Dixon & Inmon, 2002).
• The Master Data Table Index: Since the data warehouse commonly uses dimensional models, the query SQL plan always starts from reading the master data table. Using indices on the master data table will significantly enhance the query performance.

Caching Technology

Caching technology will play a critical role in query performance as the memory size and speed of CPUs increase. After a query runs once, the result is stored in the memory cache. Subsequently, similar query runs will get the data directly from memory rather than by accessing the database again. In an enterprise server, a certain block of memory is allocated for caching use. The query results and the navigation status can then be stored in a highly compressed format in that memory block. For example, a background user can be set up during the off-peak time to mimic the way a real user runs the query. By doing this, a cache is generated that can be used by real business users.
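A minimal sketch of query-result caching and off-peak cache warming follows (my own illustration; the query function and its half-second "database read" are simulated).

```python
import time
from functools import lru_cache

@lru_cache(maxsize=256)
def run_query(year, region):
    """Stand-in for an expensive OLAP query; the first call pays the full
    cost, repeated calls with the same parameters are served from memory."""
    time.sleep(0.5)                         # simulate database read time
    return f"result for {year}/{region}"

start = time.perf_counter()
run_query(2004, "EMEA")                     # cold: hits the "database"
cold = time.perf_counter() - start

start = time.perf_counter()
run_query(2004, "EMEA")                     # warm: answered from the cache
warm = time.perf_counter() - start
print(f"cold: {cold:.3f}s  warm: {warm:.6f}s")

# Cache warming: a background job can execute the most common queries
# during off-peak hours so that real users hit the cache.
for year, region in [(2004, "EMEA"), (2004, "AMER"), (2005, "EMEA")]:
    run_query(year, region)
```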
Pre-Calculated Reports

Pre-calculation is one of the techniques by which the administrator can distribute the workload to off-peak hours and have the result sets ready for faster access (Eacrett, 2003). There are several benefits of using pre-calculated reports. The user will have a faster response time, since no calculation needs to take place on the fly. Also, the system workload is balanced and shifted to off-peak hours. Lastly, the reports can be available offline.

Use Statistics to Further Tune the System

In a real-world data warehouse system, OLAP statistics data are collected by the system. The statistics provide such information as what the most-used queries are and how the data are selected. The statistics can help further tune the system. For example, examining the descriptive statistics of the queries will reveal the most commonly used drill-down dimensions, as well as the combinations of the dimensions. The OLAP statistics will also indicate what the major time component is out of the total query time: it could be database read time or OLAP calculation time. Based on these data, one can build aggregates or offline reports to increase the query performance.

Data Compression

Over time, the dimension table might contain some unnecessary entries or redundant data. For example, when some data have been deleted from the fact table, the corresponding dimension keys will not be used any more. These keys need to be deleted from the dimension table in order to have a faster query time, since the query execution plan always starts from the dimension table. A smaller dimension table will certainly help the performance.

The data in the fact table need to be compressed as well. From time to time, some entries in the fact table might contain just all zeros and need to be removed. Also, entries with the same dimension key value should be compressed to reduce the size of the fact table.

Kimball (1996) pointed out that data compression might not be a high priority for an OLTP system, since the compression will slow down the transaction process. However, compression can improve data warehouse performance significantly. Therefore, in a dimensional model, the data in the dimension table and fact table need to be compressed periodically.

Data Archiving

In a real data warehouse environment, data are being loaded daily or weekly. The volume of the production data will become huge over time. More production data will definitely lengthen the query run time. Sometimes, there is too much unnecessary data sitting in the system. These data may be outdated and not important to the analysis. In such cases, the data need to be periodically archived to offline storage so that they can be removed from the production system (e.g., Uhle, 2003). For example, three years of production data are usually sufficient for a decision support system. Archived data can still be pulled back into the system if such a need arises.

FUTURE TRENDS

Although data warehouse applications have been evolving to a relatively mature stage, several challenges remain in developing a high-performing data warehouse environment. The major challenges are:

• How to optimize the access to volumes of data based on system workload and user expectations (e.g., Inmon, Rudin, Buss & Sousa, 1998)? Corporate information systems rely more and more on data warehouses that provide one entry point for access to all information for all users. Given this, and because of varying utilization demands, the system should be tuned to reduce the system workload during peak user periods.
• How to design a better OLAP statistics recording system to measure the performance of the data warehouse? As mentioned earlier, query OLAP statistics data are important to identify the trend and the utilization of the queries by different users. Based on the statistics results, several techniques could be used to tune up the performance, such as aggregates, indices, and caching.
• How to maintain data integrity and keep good data warehouse performance at the same time? Information quality and ownership within the data warehouse play an important role in the delivery of accurate processed data results (Ma, Chou, & Yen, 2000). Data validation is thus a key step towards improving data reliability and quality. Accelerating the validation process presents a challenge to the performance of the data warehouse.

CONCLUSION

Information technologies have been evolving quickly in recent decades. Engineers have made great strides in developing and improving databases and data warehousing in many aspects, such as hardware; operating systems; user interface and data extraction tools; data mining tools; security; and database structure and management systems.

Many success stories in developing data warehousing applications have been published (Beitler & Leary, 1997; Grim & Thorton, 1997). An undertaking of the scale of creating an effective data warehouse, however, is costly and not risk-free. Weak sponsorship and management support, insufficient funding, inadequate user involvement, and organizational politics have been found to be common factors contributing to failure (Watson, Gerard, Gonzalez, Haywood & Fenton, 1999). Therefore, data warehouse architects need to first clearly identify the information useful for the specific business process and goals of the organization. The next step is to identify important factors that impact the performance of the data warehousing implementation in the organization. Finally, all the business objectives and factors are incorporated into the design of the data warehouse system and data models, in order to achieve high-performing data warehousing and business intelligence.

REFERENCES

Appfluent Technology. (2002, December 2). A study on reporting and business intelligence application usage. Retrieved from http://searchsap.techtarget.com/whitepaperPage/0,293857,sid21_gci866993,00.html


Beitler, S. S., & Leary, R. (1997). Sears' EPIC transformation: Converting from mainframe legacy systems to On-Line Analytical Processing (OLAP). Journal of Data Warehousing, 2, 5-16.

Bruckner, R. M., & Tjoa, A. M. (2002). Capturing delays and valid times in data warehouses: Towards timely consistent analyses. Journal of Intelligent Information Systems, 19(2), 169-190.

Eacrett, M. (2003). Hitchhiker's guide to SAP business information warehouse performance tuning. SAP White Paper.

Eckerson, W. (2003). BI StatShots. Journal of Data Warehousing, 8(4), 64.

Grim, R., & Thorton, P. A. (1997). A customer for life: The warehouse-MCI approach. Journal of Data Warehousing, 2, 73-79.

Inmon, W. H., Imhoff, C., & Sousa, R. (2001). Corporate information factory. New York: John Wiley & Sons, Inc.

Inmon, W. H., Rudin, K., Buss, C. K., & Sousa, R. (1998). Data warehouse performance. New York: John Wiley & Sons, Inc.

Kimball, R. (1996). The data warehouse toolkit. New York: John Wiley & Sons, Inc.

Ma, C., Chou, D. C., & Yen, D. C. (2000). Data warehousing, technology assessment and management. Industrial Management & Data Systems, 100, 125.

McDonald, K., Wilmsmeier, A., Dixon, D. C., & Inmon, W. H. (2002, August). Mastering the SAP business information warehouse. Hoboken, NJ: John Wiley & Sons, Inc.

Moody, D., & Kortink, M. A. R. (2003). From ER models to dimensional models, part II: Advanced design issues. Journal of Data Warehousing, 8, 20-29.

Oracle Corp. (2004). Largest transaction processing db on Unix runs Oracle database. Online Product News, 23(2).

Peterson, S. (1994). Stars: A pattern language for query optimized schema. Sequent Computer Systems, Inc. White Paper.

Raden, N. (2003). Real time: Get real, part II. Intelligent Enterprise, 6(11), 16.

Silberstein, R. (2003). Know how network: SAP BW performance monitoring with BW statistics. SAP White Paper.

Silberstein, R., Eacrett, M., Mayer, O., & Lo, A. (2003). SAP BW performance tuning. SAP White Paper.

Uhle, R. (2003). Data aging with mySAP business intelligence. SAP White Paper.

Watson, H., Gerard, J., Gonzalez, L. E., Haywood, M. E., & Fenton, D. (1999). Data warehousing failures: Case studies and findings. Journal of Data Warehousing, 4, 44-55.

Winter, R., & Auerbach, K. (2004). Contents under pressure: Scalability challenges for large databases. Intelligent Enterprise, 7(7), 18-25.

KEY TERMS

Cache: A region of a computer's memory that stores recently or frequently accessed data so that the time needed for repeated access to the same data can decrease.

Granularity: The level of detail or complexity at which an information resource is described.

Indexing: In data storage and retrieval, the creation and use of a list that inventories and cross-references data. In database operations, a method to find data more efficiently by indexing on primary key fields of the database tables.

ODS (Operational Data Stores): A system with the capability of continuous background update that keeps up with individual transactional changes in operational systems, versus a data warehouse that applies a large load of updates on an intermittent basis.

OLAP (Online Analytical Processing): A category of software tools for collecting, presenting, delivering, processing, and managing multidimensional data (i.e., data that has been aggregated into various categories or dimensions) in order to provide analytical insights for business management.

OLTP (Online Transaction Processing): A standard, normalized database structure designed for transactions in which inserts, updates, and deletes must be fast.

Service Management: The strategic discipline for identifying, establishing, and maintaining IT services to support the organization's business goals at an appropriate cost.

SQL (Structured Query Language): A standard interactive programming language used to communicate with relational databases in order to retrieve, update, and manage data.

Data Warehousing and Mining in Supply Chains


Richard Mathieu
Saint Louis University, USA

Reuven R. Levary
Saint Louis University, USA

INTRODUCTION

Every finished product has gone through a series of transformations. The process begins when manufacturers purchase the raw materials that will be transformed into the components of the product. The parts are then supplied to a manufacturer, who assembles them into the finished product and ships the completed item to the consumer. The transformation process includes numerous activities (Levary, 2000). Among them are:

• Designing the product
• Designing the manufacturing process
• Determining which component parts should be produced in house and which should be purchased from suppliers
• Forecasting customer demand
• Contracting with external suppliers for raw materials or component parts
• Purchasing raw materials or component parts from suppliers
• Establishing distribution channels for raw materials and component parts from suppliers to manufacturer
• Establishing distribution channels to the suppliers of raw materials and component parts
• Establishing distribution channels from the manufacturer to the wholesalers and from wholesalers to the final customers
• Manufacturing the component parts
• Transporting the component parts to the manufacturer of the final product
• Manufacturing and assembling the final product
• Transporting the final product to the wholesalers, retailers, and final customer

Each individual activity generates various data items that must be stored, analyzed, protected, and transmitted to various units along a supply chain.

A supply chain can be defined as a series of activities that are involved in the transformation of raw materials into a final product, which a customer then purchases (Levary, 2000). The flow of materials, component parts, and products is moving downstream (i.e., from the initial supply sources to the end customers). The flow of information regarding the demand for the product and orders to suppliers is moving upstream, while the flow of information regarding product availability, shipment schedules, and invoices is moving downstream. For each organization in the supply chain, its customer is the subsequent organization in the supply chain, and its subcontractor is the prior organization in the chain.

BACKGROUND

Supply chain data can be characterized as either transactional or analytical (Shapiro, 2001). All new data that are acquired, processed, and compiled into reports that are transmitted to various organizations along a supply chain are deemed transactional data (Davis & Spekman, 2004). Increasingly, transactional supply chain data are processed and stored in enterprise resource planning systems, and complementary data warehouses are developed to support decision-making processes (Chen, R., Chen, C., & Chang, 2003; Zeng, Chiang, & Yen, 2003). Organizations such as Home Depot, Lowe's, and Volkswagen have developed data warehouses and integrated data-mining methods that complement their supply chain management operations (Dignan, 2003a; Dignan, 2003b; Hofmann, 2004). Data that are used in descriptive and optimization models are considered analytical data (Shapiro, 2001). Descriptive models include various forecasting models, which are used to forecast demands along supply chains, and managerial accounting models, which are used to manage activities and costs. Optimization models are used to plan resources, capacities, inventories, and product flows along supply chains.

Data collected from consumers are the core data that affect all other data items along supply chains. Information collected from consumers at the point of sale includes data items regarding the sold product (e.g., type, quantity, sale price, and time of sale) as well as information about the consumer (e.g., consumer address and method of payment). These data items are analyzed often.
(Levary, 2000). The flow of materials, component parts, mining techniques are employed to determine the types of

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Data Warehousing and Mining in Supply Chains

items that are appealing to consumers. The items are MAIN THRUST
classified according to consumers socioeconomic back-
grounds and interests, the sale price that consumers are Data Aggregation in Supply Chains
willing to pay, and the location of the point of sale.
Data regarding the return of sold products are used to
identify potential problems with the products and their Large amounts of data are being accumulated and stored
uses. These data include information about product qual- by companies belonging to supply chains. Data aggrega-
ity, consumer disappointment with the product, and legal tion can improve the effectiveness of using the data for
consequences. Data-mining techniques can be used to operational, tactical, and strategic planning models. The
identify patterns in returns so that retailers can better concept of data aggregation in manufacturing firms is
determine which type of product to order in the future and called group technology (GT). Nonmanufacturing firms
from which supplier it should be purchased. Retailers are are also aggregating data regarding products, suppliers,
also interested in collecting data regarding competitors customers, and markets.
sales so that they can better promote their own product
and establish a competitive advantage. Group Technology
Data related to political and economic conditions in
supplier countries are of interest to retailers. Data-mining Group technology is a concept of grouping parts, re-
techniques can be used to identify political and economic sources, or data according to similar characteristics. By
patterns in countries. Information can help retailers choose grouping parts according to similarities in geometry,
suppliers who are situated in countries where the flow of design features, manufacturing features, materials used,
products and funds is expected to be stable for a reason- and/or tooling requirements, manufacturing efficiency
ably long period of time. can be enhanced, and productivity increased. Manufac-
Manufacturers collect data regarding a) particular turing efficiency is enhanced by
products and their manufacturing process, b) suppliers,
and c) the business environment. Data regarding the Performing similar activities at the same work center
product and the manufacturing process include the char- so that setup time can be reduced
acteristics of products and their component parts ob- Avoiding duplication of effort both in the design
tained from CAD/CAM systems, the quality of products and manufacture of parts
and their components, and trends in theresearch and Avoiding duplication of tools
development (R & D) of relevant technologies. Data- Automating information storage and retrieval
mining techniques can be applied to identify patterns in (Levary, 1993)
the defects of products, their components, or the manu-
facturing process. Data regarding suppliers include avail- Effective implementation of the GT concept necessi-
ability of raw materials, labor costs, labor skills, techno- tates the use of a classification and coding system. Such
logical capability, manufacturing capacity, and lead time a system codes the various attributes that identify simi-
of suppliers. Data related to qualified teleimmigrants (e.g., larities among parts. Each part is assigned a number or
engineers and computer software developers) is valuable alphanumeric code that uniquely identifies the parts
to many manufacturers. Data-mining techniques can be attributes or characteristics. A parts code must include
used to identify those teleimmigrants having unique knowl- both design and manufacturing attributes.
edge and experience. Data regarding the business envi- A classification and coding system must provide an
ronment of manufacturers include information about com- effective way of grouping parts into part families. All parts
petitors, potential legal consequences regarding a prod- in a given part family are similar in some aspect of design
uct or service, and both political and economic conditions or manufacture. A part may belong to more than one
in countries where the manufacturer has either facilities or family.
business partners. Data-mining techniques can be used A part code is typically composed of a large number
to identify possible liability concerning a product or of characters that allow for identification of all part at-
service as well as trends in political and economic condi- tributes. The larger the number of attributes included in a
tions in countries where the manufacturer has business part code, the more difficult the establishment of standard
interests. procedures for classifying and coding. Although numer-
Retailers, manufacturers, and suppliers are all inter- ous methods of classification and coding have been
ested in data regarding transportation companies. These developed, none has emerged as the standard method.
data include transportation capacity, prices, lead time, Because different manufacturers have different require-
and reliability for each mode of transportation. ments regarding the type and composition of parts codes,

324

TEAM LinG
Data Warehousing and Mining in Supply Chains

customized methods of classification and coding are gen- Products intended to be used or consumed in a
erally required. Some of the better known classification specific season of the year ,
and coding methods are listed by Groover (1987). Volume and speed of product movement
After a code is established for each part, the parts are The methods of the transportation of the products
grouped according to similarities and are assigned to part from the suppliers to the retailers
families. Each part family is designed to enhance manufac- The geographical location of suppliers
turing efficiency in a particular way. The information re- Method of transaction handling with suppliers; for
garding each part is arranged according to part families in example, EDI, Internet, off-line
a GT database. The GT database is designed in such a way
that users can efficiently retrieve desired information by As in the case of GT, retailing products may belong
using the appropriate code. to more than one family.
Consider part families that are based on similarities of
design features. A GT database enables design engineers Aggregation of Data Regarding
to search for existing part designs that have characteris- Customers of Finished Products
tics similar to those of a new part that is to be designed. The
search begins when the design engineer describes the To effectively market finished products to customers, it
main characteristics of the needed part with the help of a is helpful to aggregate customers with similar character-
partial code. The computer then searches the GT database istics into families. Examples of customers of finished
for all the items with the same code. The results of the product families include
search are listed on the computer screen, and the designer
can then select or modify an existing part design after Customers residing in a specific geographical re-
reviewing its specifications. Selected designs can easily gion
be retrieved. When design modifications are needed, the Customers belonging to a specific socioeconomic
file of the selected part is transferred to a CAD system. group
Such a system enables the design engineer to effectively Customers belonging to a specific age group
modify the parts characteristics in a short period of time. Customers having certain levels of education
In this way, efforts are not duplicated when designing Customers having similar product preferences
parts. Customers of the same gender
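A minimal relational sketch can make this retrieval concrete. The table and column names below (gt_part, part_code, and so on) are hypothetical, since the article does not prescribe a schema; the point is simply that a partial GT code supplied by the engineer can be matched against the codes stored for every part.

```sql
-- Hypothetical GT database table: one row per part, keyed by its GT code.
CREATE TABLE gt_part (
    part_code    VARCHAR(20) PRIMARY KEY,   -- encodes design + manufacturing attributes
    part_name    VARCHAR(60),
    material     VARCHAR(30),
    design_file  VARCHAR(100)               -- pointer to the CAD file for reuse
);

-- A design engineer supplies a partial code describing the main characteristics
-- of the needed part; prefix matching returns existing designs in that family.
SELECT part_code, part_name, material, design_file
FROM   gt_part
WHERE  part_code LIKE '14-23%'              -- '14-23' is an illustrative partial code
ORDER  BY part_code;
```

Any retrieved design can then be opened in the CAD system and modified, as described above.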
The creation of a GT database helps reduce redundancy in the purchasing of parts as well. The database enables manufacturers to identify similar parts produced by different companies. It also helps manufacturers to identify components that can serve more than a single function. In such ways, GT enables manufacturers to reduce both the number of parts and the number of suppliers. Manufacturers that can purchase large quantities of a few items rather than small quantities of many items are able to take advantage of quantity discounts.

Aggregation of Data Regarding Retailing Products

Retailers may carry thousands of products in their stores. To effectively manage the logistics of so many products, product aggregation is highly desirable. Products are aggregated into families that have some similar characteristics. Examples of product families include the following:

• Products belonging to the same supplier
• Products requiring special handling, transportation, or storage
• Products intended to be used or consumed by a specific group of customers
• Products intended to be used or consumed in a specific season of the year
• Volume and speed of product movement
• The methods of the transportation of the products from the suppliers to the retailers
• The geographical location of suppliers
• Method of transaction handling with suppliers; for example, EDI, Internet, off-line

As in the case of GT, retailing products may belong to more than one family.

Aggregation of Data Regarding Customers of Finished Products

To effectively market finished products to customers, it is helpful to aggregate customers with similar characteristics into families. Examples of customers of finished product families include

• Customers residing in a specific geographical region
• Customers belonging to a specific socioeconomic group
• Customers belonging to a specific age group
• Customers having certain levels of education
• Customers having similar product preferences
• Customers of the same gender
• Customers with the same household size

Similar to both GT and retailing products, customers of finished products may belong to more than one family.

FUTURE TRENDS

Supply Chain Decision Databases

The enterprise database systems that support supply chain management are repositories for large amounts of transaction-based data. These systems are said to be data rich but information poor. The tremendous amount of data that are collected and stored in large, distributed database systems has far exceeded the human ability for comprehension without analytic tools. Shapiro estimates that 80% of the data in a transactional database that supports supply chain management is irrelevant to decision making and that data aggregations and other analyses are needed to transform the other 20% into useful information (2001). Data warehousing and online analytical processing (OLAP) technologies combined with tools for data mining and knowledge discovery have allowed the creation of systems to support organizational decision making.
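As a hedged illustration of the kind of aggregation this implies, the query below summarizes raw order-line transactions into the small, decision-oriented slice a planner actually needs. The table and column names (order_line, ship_date, and so on) are assumptions for this sketch, not a schema taken from the article.

```sql
-- Roll detailed SCM transactions up to monthly volumes per product family
-- and supplier, the granularity typically used in planning models.
SELECT p.product_family,
       o.supplier_id,
       EXTRACT(YEAR  FROM o.ship_date) AS ship_year,
       EXTRACT(MONTH FROM o.ship_date) AS ship_month,
       SUM(o.quantity)                 AS units_shipped,
       SUM(o.quantity * o.unit_cost)   AS total_cost
FROM   order_line o
JOIN   product    p ON p.product_id = o.product_id
GROUP  BY p.product_family, o.supplier_id,
          EXTRACT(YEAR FROM o.ship_date), EXTRACT(MONTH FROM o.ship_date);
```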
The supply chain management (SCM) data warehouse must maintain a significant amount of data for decision making. Historical and current data are required from supply chain partners and from various functional areas within the firm in order to support decision making in regard to planning, sourcing, production, and product delivery. Supply chains are dynamic in nature. In a supply chain environment, it may be desirable to learn from an archived history of temporal data that often contains some information that is less than optimal. In particular, SCM environments are typically characterized by variable changes in product demand, supply levels, product attributes, machine characteristics, and production plans. As these characteristics change over time, so does the data in the data warehouses that support SCM decision making. We should note that Kimball and Ross (2002) use a supply value chain and a demand supply chain as the framework for developing the data model for all business data warehouses.

The data warehouses provide the foundation for decision support systems (DSS) for supply chain management. Analytical tools (simulation, optimization, and data mining) and presentation tools (geographic information systems and graphical user interface displays) are coupled with the input data provided by the data warehouse (Marakas, 2003). Simchi-Levi, D., Kaminsky, and Simchi-Levi, E. (2000) describe three DSS examples: logistics network design, supply chain planning and vehicle routing, and scheduling. Each DSS requires different data elements, has specific goals and constraints, and utilizes special graphical user interface (GUI) tools.

The Role of Radio Frequency Identification (RFID) in Supply Chains Data Warehousing

The emerging RFID technology will generate large amounts of data that need to be warehoused and mined. Radio frequency identification (RFID) is a wireless technology that identifies objects without having either contact or sight of them. RFID tags can be read despite environmentally difficult conditions such as fog, ice, snow, paint, and widely fluctuating temperatures. Optically read technologies, such as bar codes, cannot be used in such environments. RFID can also identify objects that are moving.

Passive RFID tags have no external power source. Rather, they have operating power generated from a reader device. The passive RFID tags are very small and inexpensive. Further, they have a virtually unlimited operational life. The characteristics of these passive RFID tags make them ideal for tracking materials through supply chains. Wal-Mart has required manufacturers, suppliers, distributors, and carriers to incorporate RFID tags into both products and operations. Other large retailers are following Wal-Mart's lead in requesting RFID tags to be installed in goods along their supply chain. The tags follow products from the point of manufacture to the store shelf. RFID technology will significantly increase the effectiveness of tracking materials along supply chains and will also substantially reduce the loss that retailers accrue from thefts. Nonetheless, civil liberty organizations are trying to stop RFID tagging of consumer goods, because this technology has the potential of affecting consumer privacy. RFID tags can be hidden inside objects without customer knowledge. So RFID tagging would make it possible for individuals to read the tags without the consumers even having knowledge of the tags' existence.

Sun Microsystems has designed RFID technology to reduce or eliminate drug counterfeiting in pharmaceutical supply chains (Jaques, 2004). This technology will make the copying of drugs extremely difficult and unprofitable. Delta Air Lines has successfully used RFID tags to track pieces of luggage from check-in to planes (Brewin, 2003). The luggage-tracking success rate of RFID was much better than that provided by bar code scanners.

Active RFID tags, unlike passive tags, have an internal battery. The tags have the ability to be rewritten and/or modified. The read/write capability of active RFID tags is useful in interactive applications such as tracking work in process or maintenance processes. Active RFID tags are larger and more expensive than passive RFID tags. Both the passive and active tags have a large, diverse spectrum of applications and have become the standard technologies for automated identification, data collection, and tracking. A vast amount of data will be recorded by RFID tags. The storage and analysis of this data will pose new challenges to the design, management, and maintenance of databases as well as to the development of data-mining techniques.
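To make the warehousing implication concrete, a minimal sketch of how such tag reads might be staged for analysis is shown below; the table and column names (rfid_read, reader_id, and so on) are illustrative assumptions rather than a standard taken from the article.

```sql
-- Each row records one read event captured by an RFID reader.
CREATE TABLE rfid_read (
    tag_id     VARCHAR(24) NOT NULL,   -- identifier carried by the tag
    reader_id  VARCHAR(12) NOT NULL,   -- which reader (dock door, shelf, truck)
    read_time  TIMESTAMP   NOT NULL,
    location   VARCHAR(40)
);

-- Reconstruct the movement history of a single tagged item through the chain.
SELECT tag_id, location, MIN(read_time) AS first_seen, MAX(read_time) AS last_seen
FROM   rfid_read
WHERE  tag_id = 'TAG-000123'           -- illustrative tag identifier
GROUP  BY tag_id, location
ORDER  BY first_seen;
```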
CONCLUSION

A large amount of data is likely to be gathered from the many activities along supply chains. This data must be warehoused and mined to identify patterns that can lead to better management and control of supply chains. The more RFID tags installed along supply chains, the easier data collection becomes. As the tags become more popular, the data collected by them will grow significantly. The increased popularity of the tags will bring with it new possibilities for data analysis as well as new warehousing and mining challenges.

REFERENCES

Brewin, B. (2003, December). Delta says radio frequency ID devices pass first bag. Computer World, 7. Retrieved March 29, 2005, from www.computerworld.com/mobiletopics/mobile/technology/story/0,10801,88446,00/

Chen, R., Chen, C., & Chang, C. (2003). A Web-based ERP data mining system for decision making. International Journal of Computer Applications in Technology, 17(3), 156-158.

Davis, E. W., & Spekman, R. E. (2004). Extended enterprise. Upper Saddle River, NJ: Prentice Hall.

Dignan, L. (2003a, June 16). Data depot. Baseline Magazine.

Dignan, L. (2003b, June 16). Lowe's big plan. Baseline Magazine.

Groover, M. P. (1987). Automation, production systems, and computer-integrated manufacturing. Englewood Cliffs, NJ: Prentice-Hall.

Hofmann, M. (2004, March). Best practices: VW revs its B2B engine. Optimize Magazine, 22-30.

Jaques, R. (2004, February 20). Sun pushes RFID drug technology. E-business/Technology/News. Retrieved March 29, 2005, from www.vnunet.com/News/1152921

Kimball, R., & Ross, M. (2002). The data warehouse toolkit (2nd ed.). New York: Wiley.

Levary, R. R. (1993). Group technology for enhancing the efficiency of engineering activities. European Journal of Engineering Education, 18(3), 277-283.

Levary, R. R. (2000, May-June). Better supply chains through information technology. Industrial Management, 42, 24-30.

Marakas, G. M. (2003). Modern data warehousing, mining and visualization. Upper Saddle River, NJ: Prentice-Hall.

Shapiro, J. F. (2001). Modeling the supply chain. Pacific Grove, CA: Duxbury.

Simchi-Levi, D., Kaminsky, P., & Simchi-Levi, E. (2000). Designing and managing the supply chain. Boston, MA: McGraw-Hill.

Zeng, Y., Chiang, R., & Yen, D. (2003). Enterprise integration with advanced information technologies: ERP and data warehousing. Information Management and Computer Security, 11(3), 115-122.

KEY TERMS

Analytical Data: All data that are obtained from optimization, forecasting, and decision support models.

Computer Aided Design (CAD): An interactive computer graphics system used for engineering design.

Computer Aided Manufacturing (CAM): The use of computers to improve both the effectiveness and efficiency of manufacturing activities.

Decision Support System (DSS): Computer-based systems designed to assist in managing activities and information in organizations.

Graphical User Interface (GUI): A software interface that relies on icons, bars, buttons, boxes, and other images to initiate computer-based tasks for users.

Group Technology (GT): The concept of grouping parts, resources, or data according to similar characteristics.

Online Analytical Processing (OLAP): A form of data presentation in which data are summarized, aggregated, deaggregated, and viewed in the frame of a table or cube.

Supply Chain: A series of activities that are involved in the transformation of raw materials into a final product that is purchased by a customer.

Transactional Data: All data that are acquired, processed, and compiled into reports, which are transmitted to various organizations along a supply chain.
Data Warehousing Search Engine

Hadrian Peter
University of the West Indies, Barbados

Charles Greenidge
University of the West Indies, Barbados

INTRODUCTION

Modern database systems have incorporated the use of DSS (Decision Support Systems) to augment their decision-making business function and to allow detailed analysis of off-line data by higher-level business managers (Agosta, 2000; Kimball, 1996).

The data warehouse is an environment that is readily tuned to maximize the efficiency of performing decision support functions. However, the advent of commercial uses of the Internet on a large scale has opened new possibilities for data capture and integration into the warehouse.

The data necessary for decision support can be divided roughly into two categories: internal data and external data. In this article, we focus on the need to augment external data capture from Internet sources and provide a tri-partite, high-level model, termed the Data Warehouse Search Engine (DWSE) model, to perform the same.

We acknowledge efforts that also are being made to retrieve internal information from Web sources by use of the Web warehouse, which stores the Web user's mouse and keyboard activities online, the so-called clickstream data. A number of Web warehouse initiatives have been proposed, including WHOWEDA (Madria et al., 1999).

To be clear, data warehouses have focused on large volumes of long-term historical data (a number of weeks, months, or years old), but the presence of the Internet with its data, which is short-lived and volatile, and improved automation and integration activities make the shorter time scales for the refresh cycle more attractive.

To attain the maximum benefits of the DWSE, significant technical contributions that surpass existing implementations must be encouraged from the wider database, information-retrieval, and search-engine technology communities. Our model recognizes the fact that a cross-disciplinary approach to research is the only way to guarantee further advancement in the area of decision support.

BACKGROUND

Data warehousing methodologies are concerned with the collection, organization, and analysis of data taken from several heterogeneous sources, all aimed at augmenting end-user business function (Berson & Smith, 1997; Inmon, 2003; Wixom & Watson, 2001).

Central to the use of heterogeneous data sources is the need to extract, clean, and load data from a variety of operational sources. Operational data not only need to be cleaned by removing bad records or invalid fields, but also typically must be put through a merge/purge process that removes redundancies and records that are inconsistent and lack integrity (Celko, 1995).

External data is key to business function and decision making, and includes sources of information such as newspapers, magazines, trade publications, personal contacts, and news releases. In the case where external data is being used in addition to data taken from disparate operational sources, this external data may require a similar cleaning/merge/purge process to be applied to guarantee consistency (Higgins, 2003).

The World Wide Web represents a large and growing source of external data but is notorious for the presence of bad, unauthorized, or otherwise irregular data (Brake, 1997; Pfaffenberger, 1996). Thus, the need for cleaning and integrity checking activities increases when the Web is being used to gather external data (Sander-Beuermann & Schomburg, 1998). The prevalence of viruses and worms on the Internet, and the problems with unsolicited e-mail (spam), show that communications coming across the Internet must be filtered. The cleaning process may include pre-programmed rules to adjust values, the use of sophisticated AI techniques, or simply mechanisms to spot anomalies that then can be fixed manually.
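A hedged sketch of one step in such a merge/purge pass is shown below; the staging table and its columns (stage_customer, load_time, and so on) are assumed for illustration and are not part of the article's model.

```sql
-- Keep a single record per customer key in the staging area, preferring the
-- most recently loaded row, and discard the duplicate rows.
DELETE FROM stage_customer
WHERE  load_time < (SELECT MAX(s2.load_time)
                    FROM   stage_customer s2
                    WHERE  s2.customer_key = stage_customer.customer_key);

-- Flag rows that fail a simple integrity rule so they can be fixed manually.
UPDATE stage_customer
SET    status = 'SUSPECT'
WHERE  postal_code IS NULL
   OR  LENGTH(TRIM(customer_name)) = 0;
```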
MAIN THRUST

Informed decision making in a data warehouse relies on suitable levels of granularity for internal data, where users can drill down from low to higher levels of aggregation. External data in the form of multimedia files, informal contacts, or news sources are often marginalized (Inmon, 2002; Kimball, Barquin & Edelstein, 1997) due to their unstructured nature.

Figure 1. Merits of internal/external data
Internal Data + External Data = Decision-Making Data
Trusted sources + Unproven sources = Informative
Known structure + Unknown structure = Challenging
Known frequency + Varying frequency = Wide in scope
Consistent + Unpredictable = Balanced

True decision making embraces as many pertinent sources of information as possible so that a holistic perspective of facts, trends, and individual pieces of data can be obtained. Increasingly, the growth of commerce and business on the Internet has meant that in addition to traditional modes of disseminating information, the Internet has become a forum for ready posting of information of all kinds (Hall, 1999).

The prevalence of so much information on the Internet means that it is potentially a superior source of external data for the data warehouse. Since such data typically originate from a variety of sources (sites), they have to undergo a merging and transformation process before they can be used in the data warehouse. In the case of internal data, which forms the core of the data warehouse, previous manual methods of applying ad-hoc tools and techniques to the data cleaning process are being replaced by more automated forms such as ETL (Extract, Transform, Load). Although current ETL standardized packages are expensive, they offer productivity gains in the long run (Earls, 2003; Songini, 2004).

The existence of the so-called invisible Web and ongoing efforts to gain access to these untapped sources suggest that the future of external data retrieval will enjoy the same interest as that shown in internal data (Inmon, 2002; Sherman & Price, 2003; Smith, 2001). The need for reliable and consistent external data provides the motivation for an intermediate layer between raw data gathered from the Internet and external data storage areas lying within the domain of the data warehouse (Agosta, 2000).

Until there is a maturing of widely available tools with the ability to access the invisible Web, there will be a continued reliance on information retrieval techniques, as contrasted with data retrieval techniques, to gather external data (van Rijsbergen, 1979). The need for three environments to be present to process external data from Web sources into the warehouse suggests a three-tier solution to this problem. Accordingly, we propose a tri-partite model called the Data Warehouse Search Engine Model (DWSE), which has an intermediate data extraction/cleaning layer, functionally called the Meta-Data Engine, sandwiched between the data warehouse and search engine environments.

The DWSE Model

The data warehouse and search engine environments serve two distinct and important roles at the current time, but there is scope to utilize the strengths of both in conjunction for maximum usefulness (Barquin & Edelstein, 1997; Sonnenreich & Macinta, 1998). Our proposed DWSE model seeks to allow cooperative links between data warehouse and search engine with an aim of satisfying external data requirements. The model consists of (1) data warehouse (DW), (2) meta-data engine (MDE), and (3) search engine (SE). The MDE is the component that provides a bridge over which information must pass from one environment to the other. The MDE enhances queries coming from the warehouse and also captures, merges, and formats information returned by the search engine (Devlin, 1998).

The new model, through the MDE, seeks to augment the operations of both by allowing external data to be collected for the business analyst, while improving the search engine searches through query modifications of queries emerging from the data warehouse.

The generalized process is as follows. A query originates in the warehouse environment and is modified by the MDE so that it is specific and free of nonsense words. A word that has a high occurrence in a text but conveys little specific information about the subject of the text is deemed to be a nonsense word. Typically, these words include pronouns, conjunctions, and proper names. This term is synonymous with noise word, as found in information retrieval texts (Belew, 2000). The modified query is transmitted to the search engine that performs its operations and retrieves its results documents. The documents returned are analyzed by the MDE, and information is prepared for return to the warehouse. Finally, the information relating to the answer to the query is returned to the warehouse environment. The entire process may take days or weeks, as both warehouse and search engine operate independently according to their own schedules.
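One way to picture this query-scrubbing step, purely as an assumed sketch (the article does not specify how the MDE is implemented), is a staging table of query terms filtered against a stop-word list before the query is sent to the search engine. The names query_term, noise_word, and query_id are hypothetical.

```sql
-- Terms of the outgoing warehouse query, one row per word (assumed staging table);
-- noise_word holds pronouns, conjunctions, and other low-content words.
DELETE FROM query_term
WHERE  LOWER(term) IN (SELECT word FROM noise_word);

-- What remains is the specific, content-bearing query passed to the search engine.
SELECT term
FROM   query_term
WHERE  query_id = 42            -- illustrative query identifier
ORDER  BY term_position;
```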
The design of both the search engine and the data warehouse is a skill-intensive and specialized task. Prudent judgment dictates that nothing should be done to add to the burden of building and maintaining these complex systems. History shows that many information technology (IT) projects fail when expediency overtakes deliberate design. In Figure 2, we see the cooperation among components of the DWSE.

Figure 2. The DWSE model (enterprise-wide sources are extracted/transformed and cleaned into the warehouse DBMS serving a DSS tool and end user on the internal-data side; the meta-data engine, with its meta-data storage and SE files, bridges the warehouse and the search engine, which reaches the Web)

Figure 3. Bridging the architectures: The meta-data engine (the meta-data engine bridges the data warehouse and a hybrid search engine connected to the Internet)

The DWSE model considers only the external data requirements of the data warehouse. The SE component performs actual searches but no analysis of documents. The indexing of the retrieved documents and other analysis is done by the MDE.

Meta-Data Engine Model

To cope with the radical differences between the SE and DW designs, we propose a meta-data engine to coordinate all activities. Typical commercial SEs are composed of a crawler (spider) and an indexer (mite). The indexer is used to codify results into a database for easy querying. Our approach reduces the complexity of the search engine by moving the indexing tasks to the meta-data engine. The meta-data engine seeks to form a bridge between the diverse SE and DW environments. The purpose of the meta-data engine is to facilitate an automatic information retrieval mode. Figure 3 illustrates the concept of the bridge between DW and SE environments, and Figure 4 outlines the main features of the MDE.

The following are some of the main functions of the meta-data engine:

• Capture and transform data arising from the SE (e.g., handle HTML, XML, .pdf, and other document formats).
• Index words retrieved during SE crawls.
• Manage the scheduling and control of tasks arising from the operation of both DW and SE.
• Provide a neutral platform so that SE and DW operational cycles and architectures cannot interfere with each other.
• Track meta-data that may be used to enhance the quality of queries and results in the future (e.g., monitor the use of domain-specific words/phrases such as jargon for a particular domain).

Figure 4. Characteristics of the meta-data engine (operate in AIR mode; process queries emerging from the data warehouse; provide a neutral platform; process meta-data)

Of major interest in Figure 4 is the automatic information retrieval (AIR) component. AIR seeks to highlight the existence of documents that are related to a given query (Belew, 2000) using a process that is different from data retrieval. Data retrieval (van Rijsbergen, 1979), as used in relation to database management systems (RDBMSs), requires an exact match of terms, uses an artificial query language (e.g., SQL), and uses deduction. Information retrieval, on the other hand (as is common for Internet searches), uses partial matching and a natural query language, and inductively produces relevant results (Spertus & Stein, 1999). This idea of relevance distinguishes the IR search from the normal data retrieval where a matching of terms is done.
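The contrast can be sketched in SQL terms (a deliberately simplified illustration; the document table and its columns are assumed): data retrieval demands an exact match, whereas an information-retrieval style search settles for partial matches that a real IR engine would then rank by relevance.

```sql
-- Data retrieval: exact matching of terms, deductive, all-or-nothing.
SELECT doc_id, title
FROM   document
WHERE  title = 'data warehouse refresh cycle';

-- Information-retrieval flavour: partial matching over the text; every document
-- containing the phrase (or, in a true IR engine, related terms) is a candidate,
-- to be ranked by relevance rather than filtered by strict equality.
SELECT doc_id, title
FROM   document
WHERE  LOWER(body) LIKE '%refresh cycle%';
```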
ANALYSIS OF MODEL

Strengths

The major strengths of this model are:

Independence

This model carries logical independence. There is a deliberate attempt to ensure that the best practices of each individual model are adhered to by insistence that there be logical independence in the design. This is very important, since many IT projects fail when good design is compromised.

This model carries physical independence. There is no need to overburden the capacities of existing warehouse projects by insistence on the simultaneous storage of vast quantities of external data. The model allows for the transfer of summarized results via network connections, while the physical machines may reside continents apart.

This model carries administrative independence. There is the implicit expectation that the three separate subsystems comprising the model can and will be under the control of three separate administrators. This is important because warehouse administration is a specialist area, as is search engine maintenance. By allowing specialists to focus on their respective disciplines, the chances for effective administration are enhanced.

Value-Added Approach

This model complements the concept of maximizing the utility of acquired data by adding value to this data even long after its day-to-day usefulness has expired. The ability to add pertinent external data to volumes of aging internal data extends the horizons for the usefulness of the internal data indefinitely.

Adaptability

The new model provides for scalability, as the logical/physical independence is not restricted to any one system or architecture. Databases in any sector of the model can be provided for independently without negatively impacting the other parts of the system. The new model also allows for flexibility by allowing development to take place in three languages. For example, when there is a query language for the warehouse, perl for the meta-data engine, and java for the search engine, the strengths of each can be utilized effectively.

The new model allows for security and integrity to be maintained. The warehouse need not expose itself to dangerous integrity questions or to increasingly malicious threats on the Internet. By isolating the sector that interfaces with the Internet from the sector that carries vital internal data, the prospects for good security are improved.

Relieves Information Overload

By seeking to remove, repackage, and reprocess external data, we are indirectly removing time-consuming filtering activities by employees and busy executives. There is a growing need to protect users from being overwhelmed by increasing volumes of data. In the future, we will rely on smart, accurate, automated systems to guide us through the realms of extraneous data to the cores of knowledge.

Potential Weaknesses

• Relies on fast-changing search engine technologies. Improvements or modifications to the way information is formatted and accessed on the Internet will impact on the search and retrieval process.
• Introduces the need for more system overhead and administrators. A warehouse administrator, search engine administrator, and information retrieval expert (meta-data engine) will have to be contracted. The systems themselves will have to be configured so that cooperation among them remains easy.
• Storage of large volumes of irrelevant information is a problem. The cost of maintaining and indexing unmanageable quantities of Internet data may discount the benefits derived from analysis of the external data. Figure 5 highlights possible drawbacks to using the proposed model.

Figure 5. Hidden dangers in DWSE model (large volumes of redundant/unusable data may be stored; end-user analysis may become skewed by wrong use of external data; the validity and accuracy of external data are not always known, so there are risks involved in using the external data)

FUTURE TRENDS

The future is notoriously unpredictable, yet the following trends can be seen readily for the next decade.

1. Maturing of the data warehouse tools and techniques (Agosta, 2000; Greenfield, 1996; Inmon, 2002; Schwartz, 2003).
2. Increases in the volumes of data being stored online and off-line.
3. Central importance of the Internet and its related technologies (Goodman, 2000).
4. Increasing sophistication and focus of Web sites.
5. Increased need to automate analysis and decision support on the growing volumes of data, both internal and external to the organization.
6. Adoption of specialized search engines (Raab, 1999).
7. Adoption of specialized sites (vortals) catering to a particular category of user (Rupley, 2000).
8. Ability to mine the invisible Web, increasing potential benefits from a data warehouse search engine model (Sherman & Price, 2003; Smith, 2001; Soudah, 2000; Sullivan, 2000).

The DWSE model is poised to gain prominence as practitioners realize the benefits of three distinct and architecturally independent layers. We are confident that with vital and open architectures and models of clarity, data warehousing efforts can be successful in the long term.

CONCLUSION

More research is needed to investigate how to enable the data warehouse to take advantage of the growing external data stores on the Internet. Comparison of results obtained using the three-tiered approach of the DWSE model, as compared with the traditional external data approaches, must be carried out. Unclear also is how the traditional Online Transaction Processing (OLTP) systems will funnel data toward this model. If operational systems start to retain significant quantities of external data, it may mitigate against this model, as more and more external data will be available in the warehouse through normal integration processes. On the other hand, the costs of dealing with such external data in the warehouse may be so prohibitive that an alternate path into the meta-data engine of the new model may be required.

In the case of the second outcome, a new model may be proposed with this process (OLTP external data to meta-data engine) as an integral part. Current data warehouse models already allow for the establishment of data marts and querying by external tools. If these new processes are considered desirable components of the data warehousing process, then a new data warehouse model could be proposed.

Relevance remains a thorny issue. The problem of providing relevant data without anyone to manually verify its relevance is challenging. Semantic Web developments promise to address questions of relevance on the Internet.

REFERENCES

Agosta, L. (2000). The essential guide to data warehousing. New Jersey: Prentice-Hall.

Barquin, R., & Edelstein, H. (Eds.). (1997). Planning and designing the data warehouse. New Jersey: Prentice-Hall.

Belew, R.K. (2000). Finding out about: A cognitive perspective on search engine technology and the WWW. New York: Cambridge University Press.

Berson, A., & Smith, S.J. (1997). Data warehousing, data mining and OLAP. New York: McGraw-Hill.

Brake, D. (1997). Lost in cyberspace [Electronic version]. New Scientist.

Celko, J. (1995). Don't warehouse dirty data. Datamation, 42-53.

Devlin, B. (1998). Meta-data: The warehouse atlas. DB2 Magazine, 3(1), 8-9.

Earls, A.R. (2003). ETL: Preparation is the best bet. Computerworld, 37(34), 25-27.

Goodman, A. (2000). Searching for a better way [Electronic version].

Greenfield, L. (1996). Don't let data warehousing gotchas getcha. Datamation, 76-77.

Hall, C. (1999). Enterprise information portals: Hot air or hot technology [Electronic version]. Business Intelligence Advisor, 111(11).

Higgins, K.J. (2003). Warehouse data earns its keep. Network Computing, 14(8), 111-115.

Inmon, W.H. (2002). Building the data warehouse. New York: John Wiley & Sons.
Inmon, W.H. (2003). The story so far. Computerworld, 37(15), 26-27.

Kimball, R. (1996). Dangerous preconceptions [Electronic version].

Kimball, R. (1997). A dimensional modeling manifesto [Electronic version]. DBMS Magazine.

Madria, S.K. et al. (1999). Research issues in Web data mining. Proceedings of Data Warehousing and Knowledge Discovery, First International Conference.

Pfaffenberger, B. (1996). Web search strategies. New York: MIS Press.

Raab, D.M. (1999). Enterprise information portals [Electronic version]. Relationship Marketing Report.

Rupley, S. (2000). From portals to vortals. PC Magazine.

Sander-Beuermann, W., & Schomburg, M. (1998). Internet information retrieval: The further development of meta-search engine technology. Proceedings of the Internet Summit, Internet Society, Geneva, Switzerland.

Schwartz, E. (2003). Data warehouses get active. InfoWorld, 25(48), 12-13.

Sherman, C., & Price, G. (2003). The invisible Web: Uncovering sources search engines can't see. Library Trends, 52(2), 282-299.

Smith, C.B. (2001). Getting to know the invisible Web. Library Journal, 126(11), 16-19.

Songini, M.L. (2004). ETL. Computerworld, 38(5), 23-24.

Sonnenreich, W., & Macinta, T. (1998). Web developer.com guide to search engines. New York: Wiley Computer Publishing.

Soudah, T. (2000). Search, and you shall find [Electronic version].

Spertus, E., & Stein, L.A. (1999). Squeal: A structured query language for the Web [Electronic version].

Sullivan, D. (2000). Invisible Web gets deeper [Electronic version]. The Search Engine Report.

Van Rijsbergen, C.J. (1979). Information retrieval [Electronic version]. In Finding out about [CD-ROM]. Richard Belew.

Wixom, B.H., & Watson, H.J. (2001). An empirical investigation of the factors affecting data warehousing success. MIS Quarterly, 25(1), 17-39.

KEY TERMS

Data Retrieval: Denotes the standardized database methods of matching a set of records, given a particular query (e.g., use of the SQL SELECT command on a database).

Decision Support System (DSS): An interactive arrangement of computerized tools tailored to retrieve and display data regarding business problems and queries.

ETL (Extract/Transform/Load): This term specifies a category of software that efficiently handles three essential components of the warehousing process. First, data must be extracted (removed from the originating system), then transformed (reformatted and cleaned), and third, loaded (copied/appended) into the data warehouse database system.

External Data: A broad term indicating data that is external to a particular company. Includes electronic and non-electronic formats.

Information Retrieval: Denotes the attempt to match a set of related documents to a given query using semantic considerations (e.g., library catalogue systems often employ information retrieval techniques).

Internal Data: Previously cleaned warehouse data that originated from the daily information processing systems of a company.

Invisible Web: Denotes those significant portions of the Internet where data is stored which are inaccessible to the major search engines. The invisible Web represents an often ignored/neglected source of potential online information.

Metadata: Data about data; in the data warehouse, it describes the contents of the data warehouse.
Data Warehousing Solutions for Reporting Problems

Juha Kontio
Turku Polytechnic, Finland

INTRODUCTION

Reporting is one of the basic processes in all organizations. It provides information for planning and decision making and, on the other hand, information for analyzing the correctness of the decisions made at the beginning of the process. Reporting is based on the data that the operational information systems contain. Reports can be produced directly from these operational databases, but an operational database is not organized in a way that naturally supports analysis. An alternative way is to organize the data in such a way that supports analysis easily. Typically, this method leads to the introduction of a data warehouse.

In the summer of 2002, a multiple case study research was launched in six Finnish organizations (see Table 1). The researchers studied the databases of these organizations and identified the trends in database exploitation. One of the main ideas was to study the diffusion of database innovations. In practice this meant that the researchers described the present database architecture and identified the future plans and present problems. The data for this research was mainly collected with semistructured interviews, and altogether, 54 interviews were arranged.

The research processed data of 44 different information systems. Most (40%) of the analyzed information systems were online transaction processing systems, such as order-entry systems. The second largest category (30%) comprised information systems relating to decision support and reporting. Only one pilot data warehouse was among these systems, but on the other hand, customized reporting systems were used, for example, in SOK, SSP, and OPTI. Reporting was commonly recognized as an area where interviewees were not satisfied and were hoping for improvements.

This article focuses on describing the reporting problems that the organizations are facing and explains how they can exploit a data warehouse to overcome these problems.

Table 1. The case organizations

Organization/Abbreviation used in this research | Line of business | Private/Public | Turnover 2002 (millions €) | Employees 2002
SOK Corporation/SOK | Co-operative society (main businesses: food and groceries, hardware) | Private | 2998 | 4645
Salon Seudun Puhelin, Ltd./SSP | Telecommunication | Private | 28 | 121
Statistics Finland/STAT | National statistics | Public | 52 | 1074
State Provincial Office of Western Finland/WEST | Regional administrative authority | Public | | 350
TS-Group, Ltd./TS | Printing services and communications (consolidated corporation) | Private | 69.7 | 2052
Optiroc, Ltd./OPTI | Building materials | Private | 149 | 388

BACKGROUND

The term data warehouse was first introduced as "a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions" (Inmon, 1992). A simpler definition says that a data warehouse is a store of enterprise data designed to facilitate management decision making (Kroenke, 2004). A data warehouse differs from traditional databases in many ways. Its structure is different than that of traditional databases, and different functionalities are required (Elmasri & Navathe, 2000). The aim is to integrate all corporate information into one repository, where the information is easily accessed, queried, analyzed, and used as a basis for the reports (Begg & Connolly, 2002). A data warehouse provides decision support to organizations with the help of analytical databases and Online Analytical Processing (OLAP) tools (Gorla, 2003). A data warehouse (see Figure 1) receives data from the operational databases on a regular basis, and new data is added to the existing data. The warehouse contains both detailed aggregated data and summarized data to speed up the queries. It is typically organized in smaller units called data marts, which support the specific analysis needs of a department or business unit (Bonifati, Cattaneo, Ceri, Fuggetta, & Paraboschi, 2001).

Figure 1. Data warehousing (operational databases feed the data warehouse through cleaning and reformatting; the warehouse holds detailed and summarized data, supplies data marts, and is queried by analytical tools that turn the data into valuable information for processes; other inputs enter through a filter)
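As a hedged illustration of how summarized data reaches such a data mart (the sales_fact, date_dim, and sales_mart_monthly names below are assumptions, not the schema of any case organization), a periodic load might aggregate the detailed warehouse rows as follows.

```sql
-- Periodic refresh of a reporting data mart: summarize detailed warehouse
-- rows into the monthly grain that most reports and analyses need.
INSERT INTO sales_mart_monthly (product_id, store_id, sales_month, units, revenue)
SELECT f.product_id,
       f.store_id,
       d.year_month,
       SUM(f.quantity),
       SUM(f.sales_amount)
FROM   sales_fact f
JOIN   date_dim   d ON d.date_key = f.date_key
GROUP  BY f.product_id, f.store_id, d.year_month;
```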
In the case organizations, the idea of the data warehouse has been discussed, but so far no data warehouses exist, although in one case, a data warehouse pilot is in use. The rationale for these discussions is that at the moment, the reporting and the analyzing possibilities are not serving the organizations very well. Actually, the interviewees identified many problems in reporting.

In the SOK Corporation, the interviewees complained that information is distributed in numerous information systems; thus, building a comprehensive view of the information is difficult. Another problem is in financial reporting. A financial report taken from different information systems gives different results, though they should be equal. A reason for this inequality is that the data is not harmonized and processed similarly. In the restaurant business of SOK Corporation, an essential piece of information is the sales figures of the products. It should be possible to analyze which, where, and how many products have been bought. In the whole SOK Corporation, analyzing different customers and their behavior in detail is, at the moment, impossible. The interviewees also mentioned that a common database containing all products of the co-operative society might help in reporting, but defining a common classification of the products will be a demanding task. Centralization of the data is also one topic that has been discussed in SOK Corporation, which has been justified with improvements in reporting.

In Salon Seudun Puhelin, Ltd., the interviewees mentioned that the major information system is somehow used inconsistently. Therefore, the data is not consistent and influences the reporting. This company has developed its own reporting application with Microsoft Access, but the program is not capable of managing files over 1 GB, which reduces the possibilities of using the system. According to the interviewees, this limit prevents, for example, the follow-up of daily sales. Another problem concerning the reporting system is that users are incapable of defining their own reports when their needs change. Analyzing customer data is also difficult, because collecting all customer data together is a very burdensome task. Therefore, Salon Seudun Puhelin, Ltd., has also discussed a data warehouse solution for three reasons: a) to get rid of the size limits, b) to provide a system for users where they can easily define new reports, and c) to gain more versatile analysis possibilities.

The State Provincial Office of Western Finland is a joint regional administrative authority of seven ministries. One of their yearly responsibilities is to evaluate the basic services in their region. In practice, this responsibility means that they gather and analyze a large amount of data. The first problem is that they have not used a special data management tool. The lack of an adequate tool for data management makes it difficult to do any time-series analysis, which many of the interviewees hoped for. Another problem is that the results should be easily distributed in the form of different reports, but at the moment, this is not the case.

In TS-Group, Ltd., a data warehouse pilot has been implemented. The pilot enables versatile reporting and, as the interviewees mentioned, this opportunity should not be lost. However, some reporting problems still exist. For example, the distribution and the format of the reports should be solved. In the department of financial management, the reporting system does not support the latest operating systems and, therefore, only some computers are capable of using the system.

In Optiroc, Ltd., reports are generated directly from the operational databases that are not designed for reporting purposes. One problem attendant on this design is that the reports run slowly. In principle, users should also be able to create their reports by themselves, but in reality, only a few of them are able to. Maybe this is one reason that the interviewees presented very critical comments on reporting. The interviewees mentioned as well that the implementation of a data warehouse system is strongly supported and is seen as a solution for problems in reporting. A data warehouse was also justified because, with it, the company could serve customers better and could produce customized reports. At the moment, customer reporting is analyzed to develop reporting and to define necessary tools.

MAIN THRUST

The case organizations have plans to start exploiting a data warehouse, but before the introduction of a data warehouse, plenty of design must be accomplished in all these cases. Designing a data warehouse requires quite different techniques than the design of an operational database (Golfarelli & Rizzi, 1998). Modeling a data model for a data warehouse is seen as one of the most critical phases in the development process (Bonifati et al., 2001). This data modeling has specific features that distinguish it from normal data modeling (Busborg, Christiansen, & Tryfona, 1999). At the beginning, the content of the operational information systems, the interconnections between them, and the equivalent entities should be understood (Blackwood, 2000). In practice, this entails studying the data models of the operational databases and developing an integrated schema to enhance the data interoperability (Bonifati et al., 2001). The data modeling of a data warehouse is called dimensionality modeling (DM) (Golfarelli & Rizzi, 1998; Begg & Connolly, 2002). Dimensional models were developed to support analytical tasks (Loukas & Spencer, 1999). Dimensionality modeling concentrates on facts and the properties of the facts and dimensions connected to facts (Busborg et al., 1999). Facts are numeric, quantitative data of the business, and dimensions describe different dimensions of the business (Bonifati et al., 2001). Fact tables contain all the business events to be analyzed, and dimension tables define how to analyze fact information (Loukas & Spencer, 1999). The result of the dimensionality modeling is typically presented in a star model or in a snowflake model (Begg & Connolly, 2002). Multidimensional schema (MDS) is a more generic term that collectively refers to both schemas (Martyn, 2004). When a star model is used, the fact tables are normalized, but dimension tables are not. When dimension tables are normalized too, the star model turns into a snowflake model (Bonifati et al., 2001).
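A minimal star-schema sketch in SQL illustrates the idea; the tables and columns (sales_fact, product_dim, and so on) are generic examples rather than a design from any of the case organizations.

```sql
-- Dimension tables: denormalized descriptions of how facts can be analyzed.
CREATE TABLE product_dim (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(60),
    product_group VARCHAR(40)            -- kept in the same table (star, not snowflake)
);

CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    year_month    VARCHAR(7),
    year_number   INTEGER
);

-- Fact table: one row per business event, with numeric measures
-- and foreign keys to the dimension tables.
CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES date_dim (date_key),
    product_key   INTEGER REFERENCES product_dim (product_key),
    store_key     INTEGER,
    quantity      INTEGER,
    sales_amount  DECIMAL(12,2)
);
```

Normalizing product_group out into its own table would turn this star schema into a snowflake schema.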
Data Warehousing Solutions for Reporting Problems

lems in reporting. A data warehouse was also justified because, with it, the company could serve customers better and could produce customized reports. At the moment, customer reporting is analyzed to develop reporting and to define necessary tools.

MAIN THRUST

The case organizations have plans to start exploiting a data warehouse, but before the introduction of a data warehouse, plenty of design must be accomplished in all these cases. Designing a data warehouse requires quite different techniques than the design of an operational database (Golfarelli & Rizzi, 1998). Modeling a data model for a data warehouse is seen as one of the most critical phases in the development process (Bonifati et al., 2001). This data modeling has specific features that distinguish it from normal data modeling (Busborg, Christiansen, & Tryfona, 1999). At the beginning, the content of the operational information systems, the interconnections between them, and the equivalent entities should be understood (Blackwood, 2000). In practice, this entails studying the data models of the operational databases and developing an integrated schema to enhance the data interoperability (Bonifati et al., 2001). The data modeling of a data warehouse is called dimensionality modeling (DM) (Golfarelli & Rizzi, 1998; Begg & Connolly, 2002). Dimensional models were developed to support analytical tasks (Loukas & Spencer, 1999). Dimensionality modeling concentrates on facts and the properties of the facts and dimensions connected to facts (Busborg et al., 1999). Facts are numeric, quantitative data of the business, and dimensions describe different dimensions of the business (Bonifati et al., 2001). Fact tables contain all the business events to be analyzed, and dimension tables define how to analyze fact information (Loukas & Spencer, 1999). The result of the dimensionality modeling is typically presented in a star model or in a snowflake model (Begg & Connolly, 2002). Multidimensional schema (MDS) is a more generic term that collectively refers to both schemas (Martyn, 2004). When a star model is used, the fact tables are normalized, but dimension tables are not. When dimension tables are normalized too, the star model turns into a snowflake model (Bonifati et al., 2001).
Ideally, an information system such as a data warehouse should be correct, fast, and friendly (Martyn, 2004). Correctness is especially important in data warehouses to ensure that decisions are based on accurate information. Actually, an estimated 30% to 50% of information in a typical database is either missing or incorrect (Blackwood, 2000). This idea emphasizes the need to pay attention to the quality of the source data in the operational databases (Finnegan & Sammon, 2000). One problem with MDS is that when the business environment changes, the evolution of multidimensional schemas is not as manageable as with normalized schemas (Martyn, 2004). It is also said that any architecture not based on third normalized form can cause the failure of a data warehouse project (Gardner, 1998). On the other hand, a dimensional model provides a better solution for a decision support application than a pure normalized relational model does (Loukas & Spencer, 1999). All of the above is actually related to efficiency, because a large amount of data is processed during analysis. Typically, this is a question about needed joins in the database level. Usually, a star schema is the most efficient design for a data warehouse, because the denormalized tables require fewer joins (Martyn, 2004). However, recent developments in storage technology, access methods (such as bitmap indexes), and query optimization indicate that the performance with the third normalized form should be tested before moving to multidimensional schemas (Martyn, 2004). From this 3NF schema, a natural step toward MDS is to use denormalization, which will support both efficiency and flexibility issues (Finnegan & Sammon, 2000). Still, it is possible to define necessary SQL views on top of the 3NF schema without denormalization (Martyn, 2004). Finally, during the design, the issues of physical and logical design should be separated; physical design is about performance, and logical design is about understandability (Kimball, 2001).
The OLAP tools, which are based on MDS views, access data warehouses for complex data analysis and decision support activities (Kambayashi, Kumar, Mohania, & Samtani, 2004). Typical uses of these tools include assessing the effectiveness of a marketing campaign, forecasting product sales, and planning capacity. The architecture of the underlying database of the data warehouse categorizes the different analysis tools (Begg & Connolly, 2002). Depending on the schema type, the terms Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP), and Hybrid OLAP (HOLAP) are used (Kroenke, 2004). ROLAP is a preferable choice when a) the information needs change frequently, b) the information should be as current as possible, and c) the users are sophisticated computer users (Gorla, 2003). The main differences between ROLAP and MOLAP are in the currency of data and in the data storage processing capacity. MOLAP populates its own structure of the original data when it is loaded from the operational databases (Dodds, Hasan, Hyland, & Veeraraghavan, 2000). In MOLAP, the data is stored in a special-purpose MDS (Begg & Connolly, 2002). ROLAP, on the other hand, analyzes the original data, and the

users can drill down to the unit data level (Dodds et al., 2000). ROLAP uses a meta-data layer to avoid the creation of an MDS. It typically utilizes the SQL extensions, such as CUBE and ROLLUP in Oracle DBMS (Begg & Connolly, 2002).

FUTURE TRENDS

In the future, the SOK Corporation needs to build a comprehensive view of the information. They can achieve this goal first by building an enterprise data model and then by modifying the existing information systems. From the reporting point of view, the operational data should be further modeled into a data warehouse solution. After these steps, the problems relating to reporting should be solved.
In order to improve the consistency of the data at Salon Seudun Puhelin, Ltd., most concerns need to be placed on the correctness of the data and the ways users work with the information systems. Until then, exploiting a data warehouse is not rational. For this company, it is reasonable to evaluate the true requirements of a data warehouse for reporting, because other solutions that are easier than starting a data warehouse project might be on hand.
In the State Provincial Office of Western Finland (WEST), the most important issue is acquiring a suitable tool for managing the collected data. After that step, the emphasis can shift to reporting, analyzing, and time series. Basically, the environment of WEST is ideal for exploiting a data warehouse. The processes and the functions are heavily dependent on the analysis and the developments in time series.
In TS-Group, Ltd., a data warehouse could solve the problems in reporting as well. Introducing a data warehouse would define the standard formats of the reports, which are currently causing a problem. At the same time, the distribution problems of the reports would be solved.
In Optiroc, Ltd., a data warehouse can solve the slowness of running reports and the difficulties in analysis. One reason for the slowness is that the reports run directly from the operational databases; moving the origin of the data to a data warehouse might offer improvements. Another problem deals with the analysis and the possibilities to define necessary reports on the fly. Introducing necessary OLAP tools and training the users sufficiently will solve this problem.
In Statistics Finland, the interviewees mentioned that reporting is not a problem. However, a data warehouse might offer extra value in data analysis. At the moment, though, data warehousing is not the most acute topic in information technology development in Statistics Finland.
The presented cases reflect the overall situation in different organizations quite well and might predict what will happen in the future. The cases show that organizations have clear problems in analyzing the operational data. They have developed their own applications to ease the reporting but are still living with inadequate solutions. Data warehousing has been identified as a possible solution for the reporting problems, and the future will show whether data warehouses diffuse in the organizations.

CONCLUSION

The case organizations have recognized the possibilities of a data warehouse to solve problems in reporting. Their initiatives are based on business requirements, which is a good starting point because to be successful, a data warehouse should be a business-driven initiative in partnership with the information technology department (Gardner, 1998; Finnegan & Sammon, 2000). However, the first step should be the analysis of the real needs of a data warehouse. At the same time, the present problems in reporting should be solved if possible.
To really support reporting, the operational data should be organized like that in a data warehouse. In practice, this means studying and analyzing the existing operational databases. As a result, a data model describing the necessary elements of the data warehouse should be achieved. A suggestion is to first produce an enterprise level data model to describe all the data and their interconnections. This enterprise level data model will be the basis for the data warehouse design. As the theory suggested at the beginning of this article, a normalized data model with SQL views should be produced and tested. This model can further be denormalized into a snowflake or a star model when performance requirements are not met with the normalized schema.
Producing a data model for the data warehouse is only one part of the process. In addition, special attention should be paid to designing data extraction from operational databases to the data warehouse. The enterprise level data model helps in understanding the data in these databases and thus eases the development of data extraction solutions. When the data warehouse is in use, automated tools are necessary to speed up the loading of the operational data into the data warehouse (Finnegan & Sammon, 2000). Before loading data into the data warehouse, the organizations should also analyze the correctness and the quality of the data in the operational databases.
Finally, as this article shows, a data warehouse is a relevant alternative for solving reporting problems that

the case organizations are currently facing. After the implementation of a data warehouse, organizations must ensure that the possible users of the system are educated in order to fully take advantage of the new possibilities that the data warehouse offers. Of course, organizations should remember that there is no quick jump to data warehouse exploitation.

REFERENCES

Begg, C., & Connolly, T. (2002). Database systems: A practical guide to design, implementation, and management. Addison-Wesley.

Blackwood, P. (2000). Eleven steps to success in data warehousing. Business Journal, 14(44), 26-27.

Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A., & Paraboschi, S. (2001). Designing data marts for data warehouses. ACM Transactions on Software Engineering and Methodology, 10(4), 452-483.

Busborg, F., Christiansen, J. G. B., & Tryfona, N. (1999). StarER: A conceptual model for data warehouse design. Proceedings of the ACM International Workshop on Data Warehousing and OLAP, USA.

Dodds, D., Hasan, H., Hyland, P., & Veeraraghavan, R. (2000). Approaches to the development of multidimensional databases: Lessons from four case studies. 31(3), 10-23. Retrieved from the ACM SIGMIS database.

Elmasri, R., & Navathe, S. B. (2000). Fundamentals of database systems. Reading, MA: Addison-Wesley.

Finnegan, P., & Sammon, D. (2000). The ten commandments of data warehousing. 31(4), 82-91. Retrieved from the ACM SIGMIS database.

Gardner, S. R. (1998). Building the data warehouse. Communications of the ACM, 41(9), 52-60.

Golfarelli, M., & Rizzi, S. (1998). A methodological framework for data warehouse design. Proceedings of the ACM International Workshop on Data Warehousing and OLAP, USA.

Gorla, N. (2003). Features to consider in a data warehousing system. Communications of the ACM, 46(11), 111-115.

Inmon, W. H. (1992). Building the data warehouse. New York: Wiley.

Kambayashi, Y., Kumar, V., Mohania, M., & Samtani, S. (2004). Recent advances and research problems in data warehousing. Lecture Notes in Computer Science, 1552, 81-92.

Kimball, R. (2001). A trio of interesting snowflakes. Intelligent Enterprise, 4, 30-32.

Kroenke, D. M. (2004). Database processing: Fundamentals, design and implementation. Upper Saddle River, NJ: Pearson Prentice Hall.

Loukas, T., & Spencer, T. (1999, October). From star to snowflake to ERD: Comparing data warehouse design approaches. Enterprise Systems.

Martyn, T. (2004). Reconsidering multi-dimensional schemas. ACM SIGMOD Record, 33(1), 83-88.

KEY TERMS

Data Extraction: A process in which data is transferred from operational databases to a data warehouse.

Dimensionality Modeling: A logical design technique that aims to present data in a standard, intuitive form that allows for high-performance access.

Normalization/Denormalization: Normalization is a technique for producing a set of relations with desirable properties, given the data requirements of an enterprise. Denormalization is a step backward in the normalization process, for example, to improve performance.

OLAP: The dynamic synthesis, analysis, and consolidation of large volumes of multidimensional data.

Snowflake Model: A variant of the star schema, in which dimension tables do not contain denormalized data.

Star Model: A logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data.
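As a minimal illustration of the star model, the snowflake model, and the view-based 3NF alternative discussed in this article, the following hypothetical SQL sketch uses invented table and column names (sales_fact, time_dim, product_dim, category_dim) rather than any schema from the case organizations; data types and syntax details vary by DBMS.

-- Star model: a central fact table surrounded by denormalized dimension tables.
CREATE TABLE time_dim (
    time_key   INTEGER PRIMARY KEY,
    day_date   DATE,
    month_name VARCHAR(20),
    year_no    INTEGER
);

CREATE TABLE product_dim (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_name VARCHAR(50)      -- denormalized: category stored in the dimension row
);

CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim (time_key),
    product_key  INTEGER REFERENCES product_dim (product_key),
    sales_amount DECIMAL(12,2),    -- numeric, quantitative facts
    units_sold   INTEGER
);

-- Snowflake model: the product dimension is normalized one step further;
-- product_dim would then carry a category_key instead of category_name.
CREATE TABLE category_dim (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

-- 3NF alternative: keep the schema normalized and expose a reporting view instead.
CREATE VIEW sales_report AS
SELECT t.year_no, t.month_name, p.product_name,
       SUM(f.sales_amount) AS sales_amount
FROM sales_fact f
JOIN time_dim t ON f.time_key = t.time_key
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY t.year_no, t.month_name, p.product_name;

Whether the denormalized star tables or the normalized tables with views perform better is exactly the kind of question Martyn (2004) suggests testing before committing to a multidimensional schema.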

Database Queries, Data Mining, and OLAP


Lutz Hamel
University of Rhode Island, USA

INTRODUCTION definition is routinely accomplished via the data defini-


tion language (DDL) facilities of SQL by specifying
Modern, commercially available relational database either a star or snowflake schema (Kimball, 1996).
systems now routinely include a cadre of data retrieval
and analysis tools. Here we shed some light on the
interrelationships between the most common tools and MAIN THRUST
components included in today's database systems: query
language engines, data mining components, and online The following sections contain the pair wise compari-
analytical processing (OLAP) tools. We do so by pair- sons between the tools and components considered in
wise juxtaposition, which will underscore their differ- this chapter.
ences and highlight their complementary value.
Database Queries vs. Data Mining

BACKGROUND Virtually all modern, commercial database systems are


based on the relational model formalized by Codd in the
Today's commercially available relational database sys- 1960s and 1970s (Codd, 1970) and the SQL language
tems now routinely include tools such as SQL database (Date, 2000) which allows the user to efficiently and
query engines, data mining components, and OLAP effectively manipulate a database. In this model a data-
(Craig, Vivona, & Bercovitch, 1999; Oracle, 2001; base table is a representation of a mathematical relation,
Scalzo, 2003; Seidman, 2001). These tools allow devel- that is, a set of items that share certain characteristics or
opers to construct high powered business intelligence attributes. Here, each table column represents an at-
(BI) applications which are not only able to retrieve tribute of the relation and each record in the table
records efficiently but also support sophisticated analy- represents a member of this relation. In relational data-
ses such as customer classification and market segmen- bases the tables are usually named after the kind of
tation. However, with powerful tools so tightly inte- relation they represent. Figure 1 is an example of a table
grated with the database technology, understanding the that represents the set or relation of all the customers of
differences between these tools and their comparative a particular store. In this case the store tracks the total
advantages and disadvantages becomes critical for ef- amount of money spent by its customers.
fective application development. From the practitioners Relational databases do not only allow for the cre-
point of view questions like the following often arise: ation of tables but also for the manipulation of the tables
and the data within them. The most fundamental opera-
Is running database queries against large tables tion on a database is the query. This operation enables
considered data mining? the user to retrieve data from database tables by assert-
Can data mining and OLAP be considered synony- ing that the retrieved data needs to fulfill certain crite-
mous? ria. As an example, consider the fact that the storeowner
Is OLAP simply a way to speed up certain SQL might be interested in finding out which customers
queries? spent more than $100 at the store. The following query

The issue is being complicated even further by the


Figure 1. A relational database table representing
fact that data analysis tools are often implemented in
customers of a store
terms of data retrieval functionality. Consider the data
mining models in the Microsoft SQL server which are
Id Name ZIP Sex Age Income Children Car Total
implemented through extensions to the SQL database Spent
query language (e.g., predict join) (Seidman, 2001) or 5 Peter 05566 M 35 $40,000 2 Mini $250.00
Van
the proposed SQL extensions to enable decision tree
classifiers (Sattler & Dunemann, 2001). OLAP cube 22 Maureen 04477 F 26 $55,000 0 Coupe $50.00

returns all the customers from the above customer table These rules are understandable because they summa-
that spent more than $100: rize hundreds, possibly thousands, of records in the
customer database and it would be difficult to glean this
SELECT * FROM CUSTOMER_TABLE WHERE information off the query result. The rules are also action-
TOTAL_SPENT > $100; able. Consider that the first rule tells the storeowner that
adults over the age of 35 that own a minivan are likely to
This query returns a list of all instances in the table spend more than $100. Having access to this information
where the value of the attribute Total Spent is larger allows the storeowner to adjust the inventory to cater to
than $100. As this example highlights, queries act as this segment of the population, assuming that this repre-
filters that allow the user to select instances from a sents a desirable cross-section of the customer base.
table based on certain attribute values. It does not matter Similar with the second rule, male customers that reside in
how large or small the database table is, a query will a certain ZIP code are likely to spend more than $100.
simply return all the instances from a table that satisfy Looking at census information for this particular ZIP
the attribute value constraints given in the query. This code, the storeowner could again adjust the store inven-
straightforward approach to retrieving data from a data- tory to also cater to this population segment, presumably
base has also a drawback. Assume for a moment that our increasing the attractiveness of the store and thereby
example store is a large store with tens of thousands of increasing sales.
customers (perhaps an online store). Firing the above As we have shown, the fundamental difference be-
query against the customer table in the database will tween database queries and data mining is the fact that in
most likely produce a result set containing a very large contrast to queries data mining does not return raw data
number of customers and not much can be learned from that satisfies certain constraints, but returns models of
this query except for the fact that a large number of the data in question. These models are attractive be-
customers spent more than $100 at the store. Our innate cause in general they represent understandable and ac-
analytical capabilities are quickly overwhelmed by large tionable items. Since no such modeling ever occurs in
volumes of data. database queries we do not consider running queries
This is where differences between querying a data- against database tables as data mining, it does not matter
base and mining a database surface. In contrast to a how large the tables are.
query, which simply returns the data that fulfills certain
constraints, data mining constructs models of the data in Database Queries vs. OLAP
question. The models can be viewed as high level sum-
maries of the underlying data and are in most cases more In a typical relational database queries are posed against
useful than the raw data, since in a business sense they a set of normalized database tables in order to retrieve
usually represent understandable and actionable items instances that fulfill certain constraints on their at-
(Berry & Linoff, 2004). Depending on the questions of tribute values (Date, 2000). The normalized tables are
interest, data mining models can take on very different usually associated with each other via primary/foreign
forms. They include decision trees and decision rules keys. For example, a normalized database of our store
for classification tasks, association rules for market with multiple store locations or sales units might look
basket analysis, as well as clustering for market seg- something like the database given in Figure 2. Here, PK
mentation among many other possible models. Good and FK indicate primary and foreign keys, respectively.
overviews of current data mining techniques and models From a user perspective it might be interesting to ask
can be found in Berry & Linoff (2004), Han & Kamber some of the following questions:
(2001), Hand, Mannila, & Smyth (2001), and Hastie,
Tibshirani, & Friedman (2001). How much did sales unit A earn in January?
To continue our store example, in contrast to a How much did sales unit B earn in February?
query, a data mining algorithm that constructs decision What was their combined sales amount for the
rules might return the following set of rules for custom- first quarter?
ers that spent more than $100 from the store database:
Even though it is possible to extract this information
IF AGE > 35 AND CAR = MINIVAN THEN TOTAL SPENT with standard SQL queries from our database, the nor-
> $100 malized nature of the database makes the formulation of
OR the appropriate SQL queries very difficult. Further-
IF SEX = M AND ZIP = 05566 THEN TOTAL SPENT > $100 more, the query process is likely to be slow due to the
fact that it must perform complex joins and multiple

Figure 2. Normalized database schema for a store dimensions to each other and specifies the measures that
(Source: Craig et al., 1999, Figure 3.2) are to be aggregated. Here the measures are dollar_total, ,
sales_tax, and shipping_charge. Figure 4 shows a
three-dimensional data cube pre-aggregated from the
star schema in Figure 3 (in this cube we ignored the
customer dimension, since it is difficult to illustrate four-
dimensional cubes). In the cube building process the
measures are aggregated along the smallest unit in each
dimension, giving rise to small pre-aggregated segments
in a cube.
Data cubes can be seen as a compact representation
of pre-computed query results1. Essentially, each seg-
ment in a data cube represents a pre-computed query
result to a particular query within a given star schema.
The efficiency of cube querying allows the user to
interactively move from one segment in the cube to
another enabling the inspection of query results in real
scans of entire database tables in order to compute the time. Cube querying also allows the user to group and
desired aggregates. ungroup segments, as well as project segments onto
By rearranging the database tables in a slightly differ- given dimensions. This corresponds to such OLAP
ent manner and using a process called pre-aggregation or operations as roll-ups, drill-downs, and slice-and-dice,
computing cubes the above questions can be answered respectively (Gray, Bosworth, Layman, & Pirahesh,
with much less computational power enabling a real time 1997). These specialized operations in turn provide
analysis of aggregate attribute values OLAP (Craig et answers to the kind of questions mentioned above.
al., 1999; Kimball, 1996; Scalzo, 2003). In order to As we have seen, OLAP is enabled by organizing a
enable OLAP, the database tables are usually arranged relational database in a way that allows for the pre-
into a star schema where the innermost table is called the aggregation of certain query results. The resulting data
fact table and the outer tables are called dimension tables. cubes hold the pre-aggregated results giving the user
Figure 3 shows a star schema representation of our store the ability to analyze these aggregated results in real
organized along the main dimensions of the store busi- time using specialized OLAP operations. In a larger
ness: customers, sales units, products, and time. context we can view OLAP as a methodology for the
The dimension tables give rise to the dimensions in organization of databases along the dimensions of a
the pre-aggregated data cubes. The fact table relates the business making the database more comprehensible to
the end user.

Data Mining vs. OLAP


Figure 3. Star schema for a store database (Source:
Craig et al., 1999, Figure 3.3) Is OLAP data mining? As we have seen, OLAP is en-
abled by a change to the data definition of a relational

Figure 4. A three-dimensional data cube
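To make the pre-aggregation idea concrete, the following hedged SQL sketch assumes a simplified star schema with invented names (sales_fact, sales_unit_dim, time_dim, dollar_total); the first query answers one of the earlier questions for a single segment, and the second uses the CUBE grouping in the spirit of Gray et al. (1997) to compute every (sales unit, month) segment in one statement. CUBE syntax support varies by DBMS.

-- One segment: how much did sales unit A earn in January?
SELECT SUM(f.dollar_total) AS dollar_total
FROM sales_fact f
JOIN sales_unit_dim u ON f.sales_unit_key = u.sales_unit_key
JOIN time_dim t ON f.time_key = t.time_key
WHERE u.unit_name = 'A'
  AND t.month_name = 'January';

-- Pre-aggregating all (sales unit, month) segments at once, roughly what
-- building a data cube does:
SELECT u.unit_name, t.month_name, SUM(f.dollar_total) AS dollar_total
FROM sales_fact f
JOIN sales_unit_dim u ON f.sales_unit_key = u.sales_unit_key
JOIN time_dim t ON f.time_key = t.time_key
GROUP BY CUBE (u.unit_name, t.month_name);

Each row of the second result corresponds to one pre-computed segment of the cube, including the roll-up rows in which one or both dimensions are aggregated away.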

database in such a way that it allows for the pre-compu- not seem surprising that all three tools are now routinely
tation of certain query results. OLAP itself is a way to look bundled.
at these pre-aggregated query results in real time. How-
ever, OLAP itself is still simply a way to evaluate queries,
which is different from building models of the data as in REFERENCES
data mining. Therefore, from a technical point of view we
cannot consider OLAP to be data mining. Where data Berry, M.J.A., & Linoff, G.S. (2004). Data mining tech-
mining tools model data and return actionable rules, niques: For marketing, sales, and customer relationship
OLAP allows users to compare and contrast measures management (2nd ed.). New York: John Wiley & Sons.
along business dimensions in real time.
It is interesting to note that recently a tight integra- Codd, E.F. (1970). A relational model of data for large
tion of data mining and OLAP has occurred. For ex- shared data banks. Communications of the ACM, 13(6),
ample, Microsoft SQL Server 2000 not only allows 377-387.
OLAP tools to access the data cubes but also enables its Craig, R.S., Vivona, J.A., & Bercovitch, D. (1999). Microsoft
data mining tools to mine data cubes (Seidman, 2001). data warehousing. New York: John Wiley & Sons.
Date, C.J. (2000). An introduction to database systems
FUTURE TRENDS (7th ed.). Reading, MA: Addison-Wesley.

Džeroski, S., & Lavrač, N. (2001). Relational data mining.


Perhaps the most important trend in the area of data
Berlin, New York: Springer.
mining and relational databases is the liberation of data
mining tools from the single table requirement. This Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1997).
new breed of data mining algorithms is able to take Data cube: A relational aggregation operator generalizing
advantage of the full relational structure of a relational group-by, cross-tab, and sub-totals. Data Mining and
database obviating the need of constructing a single Knowledge Discovery, 1(1), 29-53.
table that contains all the information to be used in the
data mining task (Deroski & Lavra , 2001). This allows for Han, J., & Kamber, M. (2001). Data mining: Concepts and
data mining tasks to be represented naturally in terms of techniques. San Francisco: Morgan Kaufmann Publishers.
the actual database structures, for example Yin, Han, Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles
Yang, & Yu (2004), and also allows for a natural and tight of data mining. Cambridge, MA: MIT Press.
integration of data mining tools with relational databases.
Hastie, T., Tibshirani, R., & Friedman, J.H. (2001). The
elements of statistical learning: Data mining, infer-
CONCLUSION ence, and prediction. New York: Springer.
Kimball, R. (1996). The data warehouse toolkit: Prac-
Modern, commercially available relational database tical techniques for building dimensional data ware-
systems now routinely include a cadre of data retrieval houses. New York: John Wiley & Sons.
and analysis tools. Here, we briefly described and con-
trasted the most often-bundled tools: SQL database Oracle. (2001). Oracle 9i Data Mining. Oracle White Paper.
query engines, data mining components, and OLAP tools. Pendse, N. (2001). Multidimensional data structures.
Contrary to many assertions in the literature and busi- Retrieved from http://www.olapreport.com/
ness press, performing queries on large tables or ma- MDStructures.htm
nipulating query data via OLAP tools is not considered
data mining due to the fact that no data modeling occurs Sattler, K., & Dunemann, O. (2001, November 5-10). SQL
in these tools. On the other hand, these three tools database primitives for decision tree classifiers. Paper pre-
complement each other and allow developers to pick the sented at the 10th International Conference on Information
tool that is right for their application: queries allow ad and Knowledge Management, Atlanta, Georgia.
hoc access to virtually any instance in a database; data
mining tools can generate high-level, actionable sum- Scalzo, B. (2003). Oracle DBA guide to data warehousing
maries of data residing in database tables; and OLAP and star schemas. Upper Saddle River, NJ: Prentice Hall
allows for real-time access to pre-aggregated measures PTR.
along important business dimensions. In this light it does

Seidman, C. (2001). Data mining with Microsoft SQL OLAP (Online Analytical Processing): A cat-
Server 2000 technical reference. Microsoft Press. egory of applications and technologies for collecting, ,
managing, processing and presenting multidimensional
Yin, X., Han, J., Yang, J., & Yu, P.S. (2004). CrossMine: Efficient data for analysis and management purposes. (Source:
classification across multiple database relations. Paper pre- http://www.olapreport.com/glossary.htm)
sented at the 20th International Conference on Data Engi-
neering (ICDE 2004), Boston, MA, USA. Query: This term generally refers to databases. A
query is used to retrieve database records that match
certain criteria. (Source: http://usa.visa.com/business/
merchants/online_trans_glossary.html)
KEY TERMS
SQL (Structured Query Language): SQL is a stan-
Business Intelligence: Business intelligence (BI) is a dardized programming language for defining, retriev-
broad category of technologies that allows for gathering, ing, and inserting data objects in relational databases.
storing, accessing and analyzing data to help business Star Schema: A database design that is based on a
users make better decisions. (Source: http:// central detail fact table linked to surrounding dimension
www.oranz.co.uk/glossary_text.htm) tables. Star schemas allow access to data using business
Data Cubes: Also known as OLAP cubes. Data stored terms and perspectives. (Source: http://www.ds.uillinois.
in a format that allows users to perform fast multi- edu/glossary.asp)
dimensional analysis across different points of view.
The data is often sourced from a data warehouse and
relates to a particular business function. (Source: http:/
/www.oranz.co.uk/glossary_text.htm)
ENDNOTE

Normalized Database: A database design that ar- 1


Another interpretation of data cubes is as an effec-
ranges data in such a way that it is held at its lowest level tive representation of multidimensional data along
avoiding redundant attributes, keys, and relationships. the main business dimensions (Pendse, 2001).
(Source: http://www.oranz.co.uk/glossary_text.htm)

Database Sampling for Data Mining


Patricia E.N. Lutu
University of Pretoria, South Africa

INTRODUCTION data set to be represented by a much smaller random


sample that can be processed much more efficiently.
In data mining, sampling may be used as a technique for
reducing the amount of data presented to a data mining
algorithm. Other strategies for data reduction include MAIN THRUST
dimension reduction, data compression, and
discretisation. For sampling, the aim is to draw, from a There are a number of key issues to be considered before
database, a random sample, which has the same character- obtaining a suitable random sample for a data mining task.
istics as the original database. This chapter looks at the It is essential to understand the strengths and weaknesses
sampling methods that are traditionally available from the of each sampling method. It is also essential to under-
area of statistics, how these methods have been adapted stand which sampling methods are more suitable to the
to database sampling in general and database sampling type of data to be processed and the data mining algorithm
for data mining in particular. to be employed. For research purposes, we need to look
at a variety of sampling methods used by statisticians, and
attempt to adapt them to sampling for data mining.
BACKGROUND
Some Basics of Statistical Sampling
Given the rate at which database/data warehouse sizes are Theory
growing, attempts at creating faster/more efficient algo-
rithms that can process massive data sets may eventually In statistics, the theory of sampling, also known as statis-
become futile exercises. Modern database and data ware- tical estimation or the representative method, deals with
house sizes are in the region of 10s or 100s of terabytes, the study of suitable methods of selecting a representa-
and sizes continue to grow. A query issued on such a tive sample of a population, in order to study or estimate
database/data warehouse could easily return several mil- values of specific characteristics of the population
lions of records. (Neyman, 1934). Since the characteristics being studied
While the costs of data storage continue to decrease, can only be estimated from the sample, confidence inter-
the analysis of data continues to be hard. This is the case vals are calculated to give the range of values within
for even traditionally simple problems requiring aggrega- which the actual value will fall, with a given probability.
tion, for example, the computation of a mean value for There are a number of sampling methods discussed in
some database attribute. In the case of data mining, the the literature, for example the book by Rao (Rao, 2000).
computation of very sophisticated functions, on very Some methods appear to be more suited to database
large numbers of database records, can take several hours, sampling than others. Simple random sampling (SRS),
or even days. For inductive algorithms, the problem of stratified random sampling, and cluster sampling are three
lengthy computations is compounded by the fact that such methods. Simple random sampling involves select-
many iterations are needed in order to measure the train- ing at random elements of the population, P, to be studied.
ing accuracy as well as the generalization accuracy. The method of selection may be either with replacement
There is plenty of evidence to suggest that, for induc- (SRSWR) or without replacement (SRSWOR). For very
tive data mining, the learning curve flattens after only a large populations, however, SRSWR and SRSWOR are
small percentage of the available data from a large data set equivalent. For simple random sampling, the probabilities
has been processed (Catlett, 1991; Kohavi, 1996; Provost of inclusion of the elements may or may not be uniform.
et al., 1999). The problem of overfitting (Dietterich, 1995) If the probabilities are not uniform then a weighted ran-
also dictates that the mining of massive data sets should dom sample is obtained.
be avoided. The sampling of databases has been studied The second method of sampling is stratified random
by researchers for some time. For data mining, sampling sampling. Here, before the samples are drawn, the popu-
should be used as a data reduction technique, allowing a lation P is divided into several strata, p1, p2,.. p k, and the

sample S is composed of k partial samples s1, s 2,.., sk, each records that still need to be chosen for the sample, and
drawn randomly, with replacement or not, from one of the RMsize = (N-t) is the number of records in the file still to ,
strata. Rao (2000) discusses several methods of allocating be processed. This sampling method is commonly referred
the number of sampled elements for each stratum. Bryant to as method S (Vitter, 1987).
et al. (1960) argue that, if the sample is allocated to the The reservoir sampling method (Fan et al., 1962;
strata in proportion to the number of elements in the Jones, 1962; Vitter, 1985, 1987) is a sequential sampling
strata, it is virtually certain that the stratified sample method over a finite population of database records, with
estimate will have a smaller variance than a simple random an unknown population size. Olken (1993) discuss its use
sample of the same size. The stratification of a sample may in sampling of database query outputs on the fly. This
be done according to one criterion. Most commonly technique produces a sample of size S, by initially placing
though, there are several alternative criteria that may be the first S records of the database/file/query in the reser-
used for stratification. When this is the case, the different voir. For each subsequent kth database record, that record
criteria may all be employed to achieve multi-way stratifi- is accepted with probability S/k. If accepted, it replaces a
cation. Neyman (1934) argues that there are situations randomly selected record in the reservoir.
when it is very difficult to use an individual unit as the unit Acceptance/Rejection sampling (A/R sampling) can
of sampling. For such situations, the sampling unit should be used to obtain weighted samples (Olken, 1993). For a
be a group of elements, and each stratum should be weighted random sample, the probabilities of inclusion of
composed of several groups. In comparison with strati- the elements of the population are not uniform. For data-
fied random sampling, where samples are selected from base sampling, the inclusion probability of a data record
each stratum, in cluster sampling a sample of clusters is is proportional to some weight calculated from the records
selected and observations/measurements are made on the attributes. Suppose that one database record rj is to be
clusters. Cluster sampling and stratification may be com- drawn from a file of n records with the probability of
bined (Rao, 2000). inclusion being proportional to the weight wj. This may be
done by generating a uniformly distributed random inte-
Database Sampling Methods ger 1 j n and then accepting the sampled record rj with
probability j = w j / w max, where wmax is the maximum
Database sampling has been practiced for many years for possible value for wj. The acceptance test is performed by
purposes of estimating aggregate query results, database generating another uniform random variate uj, 0 uj 1,
auditing, query optimization, and, obtaining samples for and accepting rj iff uj < j. If r j is rejected, the process is
further statistical processing (Olken, 1993). Static sam- repeated until some rj is accepted.
pling (Olken, 1993) and adaptive (dynamic) sampling
(Haas & Swami, 1992) are two alternatives for obtaining Stratified Sampling
samples for data mining tasks. In recent years, many
studies have been conducted in applying sampling to Density biased sampling (Palmer & Faloutsos, 2000) is a
inductive and non-inductive data mining (John & Lan- method that combines clustering and stratified sampling.
gley, 1996; Provost et al., 1999; Toivonen, 1996). In density biased sampling, the aim is to sample so that
within each cluster points are selected uniformly, the
Simple Random Sampling sample is density preserving, and the sample is biased by
cluster size. Density preserving in this context means that
Simple random sampling is by far, the simplest method of the expected sum of weights of the sampled points for
sampling a database. Simple random sampling may be each cluster is proportional to the clusters size. Since it
implemented using sequential random sampling or reser- would be infeasible to determine the clusters apriori,
voir sampling. For sequential random sampling, the prob- groups are used instead to represent all the regions in n-
lem is to draw a random sample of size n without replace- dimensional space. Sampling is then done to be density
ment, from a file containing N records. The simplest preserving for each group. The groups are formed by
sequential random sampling method is due to Fan et al. placing a d-dimensional grid over the data. In the d-
(1962) and Jones (1962). An independent uniform random dimensional grid, the d dimensions of each cell are labeled
variate [from the uniform interval (0,1)] is generated for either with a bin value for numeric attributes, or by a
each record in the file to determine whether the record discrete value for categorical attributes. The d-dimen-
should be included in the sample. If m records have sional grid defines the strata for multi-way stratified
already been chosen from among the first t records in sampling. A one-pass algorithm is used to perform the
the file, the (t+1)st record is chosen with probability weighted sampling, based on the reservoir algorithm.
(RQsize/RMsize), where RQsize = (n-m) is the number of

Adaptive Sampling suitable sample size, and the level accuracy that can be
tolerated. These issues are discussed below.
Lipton et al. (1990) use adaptive sampling, also known as
sequential sampling, for database sampling. In sequential Deciding on a Sampling Method
sampling, a decision is made after each sampled element
whether to continue sampling. Olken (1993) has observed The data to be sampled may be balanced, imbalanced,
that sequential sampling algorithms outperform conven- clustered or unclustered. These characteristics will af-
tional single-stage algorithms in terms of the number of fect the quality of the sample obtained. While simple
sample elements required, since they can adjust the sample random sampling is very easy to implement, it may pro-
size to the population parameters. Haas & Swami (1992) duce non-representative samples for data that is
have proved that sequential sampling uses the minimum imbalanced or clustered. On the other hand, stratified
sample size for the required accuracy. sampling, with a good choice of strata cells, can be used
John & Langley (1996) have proposed a method they to produce representative samples, regardless of the
call dynamic sampling, which combines database sam- characteristics of the data. Implementation of one-way
pling with the estimation of classifier accuracy. The method stratification should be straight forward, however, for
is most efficiently applied to classification algorithms that multi-way stratification, there are many considerations
are incremental, for example nave Bayes and artificial to be made. For example, in density-biased sampling, a d-
neural-network algorithms such as backpropagation. They dimensional grid is used. Suppose each dimension has n
define the concept of probably close enough (PCE), possible values (or bins). The multi-way stratification
which they use for determining when a sample size pro- will result in n d strata cells. For large d, this is a very large
vides an accuracy that is probably good enough. Good number of cells. When it is not easy to estimate the
enough in this context means that there is a small prob- sample size in advance, adaptive (or dynamic) sampling
ability that the mining algorithm could do better by using may be employed, if the data mining algorithm is incre-
the entire database. The smallest sample size n is chosen mental.
from a database of size N, so that: Pr(acc(N) - acc(n) >
ε) <= δ, where: acc(n) is the accuracy after processing a Determining the Representative
sample of size n, and ε is a parameter that describes what Sample Size
close enough means. The method works by gradually
increasing the sample size n until the PCE condition is For static sampling, the question must be asked: What
satisfied. is the size of a representative sample? A sample is
Provost, Jensen, & Oates (1999) use progressive sam- considered statistically valid if it is sufficiently similar to
pling, another form of adaptive sampling, and analyse its the database from where it is drawn (John & Langley,
efficiency relative to induction with all available examples. 1996). Univariate sampling may be used to test that each
The purpose of progressive sampling is to establish nmin, field in the sample comes from the same distribution as
the size of the smallest sufficient sample. They address the the parent database. For categorical fields, the chi-squared
issue of convergence, where convergence means that a test can be used to test the hypothesis that the sample
learning algorithm has reached its plateau of accuracy. In and the database come from the same distribution. For
order to detect convergence, they define the notion of a continuous-valued fields, a large-sample test can be
sampling schedule S as S = {no, n1, .., nk} where ni is an used to test the hypothesis that the sample and the
integer that specifies the size of the sample, and S is a database have the same mean. It must however be pointed
sequence of sample sizes to be provided to an inductive out that obtaining fixed size representative samples from
algorithm. They show that schedules in which n i increases a database is not a trivial task, and consultation with a
geometrically as: {no, a.n o, a2.no,., ak.no}, are asymptoti- statistician is recommended.
cally optimal. As one can see, progressive sampling is For inductive algorithms, the results from the theory
similar to the adaptive sampling method of John & Langley of probably approximately correct (PAC) learning have
(1996), except that a non-linear increment for the sample been suggested in the literature (Valiant, 1984; Haussler,
size is used. 1990). These have however been largely criticized for
overestimating the sample size (e.g., Haussler, 1990). For
incremental inductive algorithms, dynamic sampling (John
THE SAMPLING PROCESS & Langley, 1996; Provost et al., 1999) may be employed
to determine when a sufficient sample has been pro-
Several decisions need to be made when sampling a cessed. For association rule mining, the methods de-
database. One needs to decide on a sampling method, a scribed by Toivonen (1996) may be used to determine the
sample size.

Determining the Accuracy and reduce the amount of data presented to a data mining
Confidence of Results Obtained from algorithm. More than fifty years of research has produced ,
a variety of sampling algorithms for database tables and
Samples query result sets. The selection of a sampling method
should depend on the nature of the data as well as the
For the task of inductive data mining, suppose we have algorithm to be employed. Estimation of sample sizes for
estimated the classification error for a classifier con- static sampling is a tricky issue. More research in this area
structed from sample S, to be errors (h), as the proportion is needed in order to provide practical guidelines. Strati-
of the test examples that are misclassified. Statistical fied sampling would appear to be a versatile approach to
theory, based on the central limit theorem, enables us to sampling any type of data. However, more research is
conclude that, with approximately N% probability, the needed to address especially the issue of how to define
true error lies in the interval: the strata for sampling.

ErrorS(h) ± ZN √( ErrorS(h)(1 - ErrorS(h)) / n )
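Purely as an illustrative calculation (the numbers are invented), if a classifier misclassifies 100 of n = 1,000 test examples, then ErrorS(h) = 0.10 and the 95% interval, using ZN = 1.96 as noted below, is 0.10 ± 1.96 √(0.10 × 0.90 / 1000) ≈ 0.10 ± 0.019, that is, a true error somewhere between roughly 8.1% and 11.9%.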


REFERENCES
As an example, for a 95% probability, the value of ZN
is 1.96. Note that, ErrorS (h) is the mean error and (ErrorS Bryant, E.C., Hartley, H.O., & Jessen, R.J. (1960). Design
(h) (1- ErrorS (h))/n) is the variance of the error. Mitchell and estimation in two-way stratification. Journal of the
(1997) gives a detailed discussion of the estimation of American Statistical Association, 55, 105-124.
classifier accuracy and confidence intervals. For non
inductive data mining algorithms, there are known ways Catlett, J. (1991). Megainduction: A test flight. In Pro-
of estimating the error and confidence in the results ceedings of the Eighth Workshop on Machine Learning
obtained from samples. For association rule mining, for (pp. 596-599). San Mateo, CA: Morgan Kaufman.
example, Toivonen (1996) states that the lower bound for Dietterich, T. (1995). Overfitting and undercomputing in
the size of the sample, given an error bound and a machine learning. ACM Computing Surveys, 27(3), 326-
maximum probability , is given by: 327.

1 2 Fan, C., Muller, M., & Rezucha, I. (1962). Development of


| s | ln sampling plans by using sequential (item by item) selec-
2 2
tion techniques and digital computers. Journal of the
The value is the error in estimating the frequencies American Statistical Association, 57, 387-402.
of frequent item sets for some give set of attributes. Haas, P.J., & Swami, A.N. (1992). Sequential sampling
procedures for query size estimation. IBM Technical
Report RJ 8558. IBM Alamaden.
FUTURE TRENDS
Haussler, D. (1990). Probably approximately correct learn-
There two main thrusts in research on establishing the ing. National Conference on Artificial Intelligence.
sufficient sample size for a data mining task. The theoreti- John, G.H., & Langley, P. (1996). Static versus dynamic
cal approach, most notably the PAC framework and the sampling for data mining. In Proceedings of the Second
Vapnik-Chernonenkis (VC) dimension have produced re- International Conference on Knowledge Discovery in
sults that are largely criticized by data mining researchers Databases and Data Mining. AAAI/MIT Press.
and practitioners. However, research will continue in this
area, and may eventually yield practical guidelines. On the Jones, T. (1962). A note on sampling from tape files.
empirical front, researchers will continue to conduct simu- Communications of the ACM, 5, 343-343.
lations that will lead to generalizations on how to estab-
Kohavi, R. (1996). Scaling up the accuracy on nave-Bayes
lish sufficient sample sizes.
classifiers: A decision tree hybrid. In Proceedings of the
Second International Conference on Knowledge Dis-
covery and Data Mining.
CONCLUSIONS
Lipton, R., Naughton, J., & Schneider, D. (1990). Practical
Modern databases and data warehouses are massive, and selectivity estimation through adaptive sampling. In ACM
will continue to grow in size. It is essential for researchers SIGMOD International Conference on the Management
to investigate data reduction techniques that can greatly of Data (pp. 1-11).

Mitchell, T.M. (1997). Machine learning. McGraw-Hill. Dynamic Sampling (Adaptive Sampling): A method
of sampling where sampling and processing of data pro-
Neyman, J. (1934). On the two different aspects of the ceed in tandem. After processing each incremental part of
representative method: The method of stratified sampling the sample, a decision is made whether to continue sam-
and the method of purposive selection. Journal of the pling or not.
Royal Statistical Society, 97, 558-625.
Reservoir Sampling: A database sampling method that
Olken, F. (1993). Random sampling from databases. PhD implements uniform random sampling on a database table of
thesis. University of California at Berkeley. unknown size, or a query result set of unknown size.
Palmer C.R., & Faloutsos, C. (2000). Density biased sam- Sequential Random Sampling: A database sam-
pling: An improved method for data mining and cluster- pling method that implements uniform random sam-
ing. In Proceedings of the ACM SIGMOD Conference (pp. pling on a database table whose size is known.
82-92).
Simple Random Sampling: Simple random sampling
Provost, F., Jensen, D., & Oates, T. (1999). Efficient involves selecting at random elements of the population
progressive sampling. In Proceedings of the Fifth ACM to be studied. The sample S is obtained by selecting at
DIGKDD International Conference on Knowledge Dis- random single elements of the population P.
covery and Data Mining (pp. 23-32), San Diego, CA.
Simple Random Sampling with Replacement
Rao, P.S.R.S. (2000). Sampling methodologies with ap- (SRSWR): A method of simple random sampling where an
plications. FL: Chapman & Hall/CRC. element stands a chance of being selected more than once.
Toivonen, H. (1996). Sampling large databases for asso- Simple Random Sampling without Replacement
ciation rules. In Proceedings of the Twenty-Second (SRSWOR): A method of simple random sampling where
Conference on Very Large Databases (VLDDB96), each element stands a chance of being selected only
Mumbai India. once.
Valiant, L.G. (1984). A theory of the learnable. Communi- Static Sampling: A method of sampling where the
cations of the ACM, 27 (11), 1134-1142. whole sample is obtained before processing begins. The
Vitter, J.S. (1985). Random sampling with a reservoir. ACM user must specify the sample size.
Transactions on Mathematical Software, 11, 37-57. Stratified Sampling: For this method, before the
Vitter, J.S. (1987). An efficient method for sequential samples are drawn, the population P is divided into
random sampling. ACM Transactions on Mathematical several strata, p1, p2,.. pk, and the sample S is composed
Software, 13 (1), 58-67. of k partial samples s1, s 2,.., sk, each drawn randomly, with
replacement or not, from one of the strata.

KEY TERMS Uniform Random Sampling: A method of simple ran-


dom sampling where the probabilities of inclusion for
Cluster Sampling: In cluster sampling, a sample of each element are equal.
clusters is selected and observations/measurements are Weighted Random Sampling: A method of simple
made on the clusters. random sampling where the probabilities of inclusion for
Density-Biased Sampling: A database sampling each element are not equal.
method that combines clustering and stratified sampling.
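As a rough illustration of the simple random and stratified sampling methods defined above, the following hedged SQL sketches assume a hypothetical customer table with a region column used as the stratification criterion, and a DBMS-provided random() function; function names, LIMIT syntax, and TABLESAMPLE support differ across products.

-- Simple random sampling: accept each record with probability 0.01,
-- in the spirit of drawing an independent uniform variate per record.
SELECT *
FROM customer
WHERE random() < 0.01;

-- Fixed-size uniform random sample of 1,000 records (a common idiom).
SELECT *
FROM customer
ORDER BY random()
LIMIT 1000;

-- Proportional stratified sampling: roughly 1% of every region stratum.
SELECT *
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY c.region ORDER BY random()) AS rn,
           COUNT(*) OVER (PARTITION BY c.region) AS stratum_size
    FROM customer c
) s
WHERE rn <= CEIL(0.01 * stratum_size);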

DEA Evaluation of Performance of E-Business Initiatives
Yao Chen
University of Massachusetts Lowell, USA

Luvai Motiwalla
University of Massachusetts Lowell, USA

M. Riaz Khan
University of Massachusetts Lowell, USA

INTRODUCTION cally or physically via couriers. This enables organiza-


tions the flexibility to expand into different product lines
The Internet has experienced a phenomenal growth in and markets quickly, with low investments. Secondly,
attracting people and commerce activities over the last 24x7 availability, better communication with customers,
decadefrom a few thousand people in 1993 to 150+ and sharing of the organizational knowledge base allows
million in 1999, and about one billion by 2004 (Bingi et al., organizations to provide better customer service. This
2000). This growth has attracted a variety of organizations can translate to better customer retention rates as well as
initially to provide marketing information about their repeat orders. Finally, the rich interactive media and
products and services and, customer support, and later to database technology of the Web allows for unconstrained
conduct business transactions with customers or busi- awareness, visibility, and opportunity for an organization
ness partners on the Web. These electronic business (EB) to promote its products and services. This enhances
initiatives could be the implementation of intranet and/or organizations abilities to attract new customers, thereby
extranet applications like B2C, B2B, Web-CRM, Web- increasing their overall markets and profitability. Despite
marketing, and others, to take advantage of the Web- the recent dot-com failures, EB has made tremendous
based economic model, which offers opportunities for inroads in traditional corporations. Forrester Research in
internal efficiencies and external growth. its survey found 90% of the firms plan to conduct some e-
It has been recognized that the Internet economic commerce, business-to-consumer (B2C), or business-to-
model is more efficient at the transaction cost level and business (B2B), and predicts EB transactions to rise to
elimination of the middleman in the distribution channel, about $6.9 trillion by 2004. As a result, the management
and also can have a big impact on the market efficiency. has started to believe in the Internet because of its ability
Web-enabling business processes are particularly attrac- to attract and retain more customers, reduce sales and
tive in the new economy, where product life cycles are distribution overheads, and global access to markets with
short and efficient, while the market for products and an expectation of an increase in sales revenues, higher
services is global. Similarly, management of these compa- profits, and better returns for the stockholders (Choi &
nies expects a much better financial performance than Winston, 2000; Motiwalla & Khan, 2002; Steinfield &
their counterparts in the industry, which had not adopted Whitten, 1999; White, 1999).
these EB initiatives (Hoffman et. al., 1995; Wigand &
Benjamin, 1995).
EB allows organizations to expand their business BACKGROUND
reach. One of the key benefits of the Web is access to and
from global markets. The Web eliminates several geo- It is important that we use a comprehensive performance
graphical barriers for a corporation that wants to conduct evaluation tool to examine whether these EB initiatives
global commerce. While traditional commerce relied on have a positive impact on the financial performance.
value-added networks (VANs) or private networks, which Managers are often interested in evaluating how effi-
were expensive and provided limited connectivity (Pyle, ciently EB initiatives are with respect to multiple inputs
1996), the Web makes electronic commerce cheaper with and outputs. Single-measure gap analysis is often used as
extensive global connectivity. Corporations have been a fundamental method in performance evaluation and best
able to produce goods anywhere and deliver electroni- practice identification. It is extremely difficult to show

It is extremely difficult to show benchmarks where multiple measurements exist. It is rare that one single measure can suffice for the purpose of performance assessment. In our empirical study, there are multiple measures that characterize the performance of retail companies. This requires that the research tool used here have the flexibility to deal with changing production technology in the context of multiple performance measures. Data envelopment analysis (DEA) was originally developed to measure the relative efficiency of peer decision-making units (DMUs) in a multiple input-output setting. DEA has been proven to be an excellent methodology for performance evaluation and benchmarking (Zhu, 2003).

Based on Cooper, Seiford, and Zhu (2004), the specific reasons for using DEA are as follows. First, DEA is a data-oriented approach for evaluating the performance of a set of peer DMUs that convert multiple inputs into multiple outputs. In our case, the DMUs can be, for example, corporations that have launched EB activities; for each corporation, each year can be regarded as a DMU. Second, DEA is a methodology directed to frontiers rather than central tendencies. Instead of trying to fit a regression plane through the center of the data, as is done in statistical regression, one floats a piecewise linear surface to rest on top of the observations. Because of this approach, DEA proves particularly adept at uncovering relationships that remain hidden in other methodologies. Third, DEA does not require explicitly formulated assumptions of functional form as in linear and nonlinear regression models. This flexibility allows us to identify the multi-dimensional efficient frontier without the need for explicitly expressing the technology change and organizational knowledge.

In order to discriminate the performance among the efficient DMUs, a super-efficiency DEA model, in which the DMU under evaluation is excluded from the reference set, has been developed. However, the super-efficiency model has been restricted to the case of constant returns to scale (CRS), because the non-CRS super-efficiency DEA model can be infeasible (Seiford & Zhu, 1998, 1999; Zhu, 1996). It is difficult to precisely define infeasibility; as a result, one cannot rank the performance of a set of DMUs. In fact, an input-oriented super-efficiency DEA model measures the input super-efficiency when outputs are fixed at their current levels. Likewise, an output-oriented super-efficiency DEA model measures the output super-efficiency when inputs are fixed at their current levels. From the different uses of the super-efficiency concept, we see that super-efficiency can be interpreted as the degree of efficiency stability or the input saving/output surplus achieved by an efficient DMU. If super-efficiency is used as an efficiency stability measure, then infeasibility means that an efficient DMU's efficiency classification is stable to any input changes, if an input-oriented super-efficiency DEA model is used (or to any output changes, if an output-oriented super-efficiency DEA model is used). Therefore, we can use +∞ to represent the super-efficiency score (i.e., infeasibility means the highest super-efficiency). Chen (2004) shows that (i) if an efficient DMU does not possess any input super-efficiency (input saving), it must possess output super-efficiency (output surplus), and (ii) if an efficient DMU does not possess any output super-efficiency, it must possess input super-efficiency. We thus can use both input-oriented and output-oriented super-efficiency DEA models to fully characterize the super-efficiency.

Based on the above derivations, Chen et al. (2004) are able to rank the performance of a set of publicly held corporations in the retail industry over the period 1997-2000. Specifically, the objective of this study is to determine whether the financial data support the beneficial claims made in the popular literature that EB has boosted the bottom line.

MAIN THRUST

To present our DEA methodology, we assume that there are n DMUs to be evaluated. Each DMU consumes varying amounts of m different inputs to produce s different outputs. Specifically, DMU j consumes amount x_ij of input i and produces amount y_rj of output r. We assume that x_ij ≥ 0 and y_rj ≥ 0, and further assume that each DMU has at least one positive input and one positive output value. The input-oriented and output-oriented super-efficiency models whose frontier exhibits variable returns to scale (VRS) can be expressed as in Seiford and Zhu (1999); see Box 1, where x_io and y_ro are, respectively, the ith input and rth output of the DMU o under evaluation.

Let θ_o represent the score characterizing the super-efficiency in terms of input saving; we have

$$
\theta_o =
\begin{cases}
\theta_o^{VRS\text{-}super*} & \text{if the input-oriented super-efficiency model is feasible} \\
1 & \text{if the input-oriented super-efficiency model is infeasible}
\end{cases}
$$

Note that θ_o ≥ 1. If θ_o > 1, a specific efficient DMU o has input super-efficiency; if θ_o = 1, DMU o does not have input super-efficiency. Similarly, let τ_o represent the score characterizing the output super-efficiency; we have

$$
\tau_o =
\begin{cases}
\tau_o^{VRS\text{-}super*} & \text{if the output-oriented super-efficiency model is feasible} \\
1 & \text{if the output-oriented super-efficiency model is infeasible}
\end{cases}
$$


Box 1. The input-oriented (left) and output-oriented (right) VRS super-efficiency models (Seiford & Zhu, 1999)

$$
\begin{aligned}
\min \;& \theta_o^{VRS\text{-}super} \\
\text{s.t. } & \sum_{j=1,\, j \neq o}^{n} \lambda_j x_{ij} \le \theta_o^{VRS\text{-}super}\, x_{io}, \quad i = 1, 2, \ldots, m; \\
& \sum_{j=1,\, j \neq o}^{n} \lambda_j y_{rj} \ge y_{ro}, \quad r = 1, 2, \ldots, s; \\
& \sum_{j=1,\, j \neq o}^{n} \lambda_j = 1; \\
& \theta_o^{VRS\text{-}super} \ge 0; \quad \lambda_j \ge 0 \; (j \neq o).
\end{aligned}
$$

$$
\begin{aligned}
\max \;& \tau_o^{VRS\text{-}super} \\
\text{s.t. } & \sum_{j=1,\, j \neq o}^{n} \lambda_j x_{ij} \le x_{io}, \quad i = 1, 2, \ldots, m; \\
& \sum_{j=1,\, j \neq o}^{n} \lambda_j y_{rj} \ge \tau_o^{VRS\text{-}super}\, y_{ro}, \quad r = 1, 2, \ldots, s; \\
& \sum_{j=1,\, j \neq o}^{n} \lambda_j = 1; \\
& \tau_o^{VRS\text{-}super} \ge 0; \quad \lambda_j \ge 0 \; (j \neq o).
\end{aligned}
$$

Note that τ_o ≤ 1. If τ_o < 1, a specific efficient DMU o has output super-efficiency; if τ_o = 1, DMU o does not have output super-efficiency.

The DEA inputs and outputs can be developed from financial ratios. For example, the inputs can include (i) number of employees, (ii) inventory cost, (iii) total current assets, and (iv) cost of sales; and the outputs can include (i) revenue and (ii) net income. In our study, 75% of the EB companies are efficient, and 57% of the non-EB companies are efficient. Thus, in general, the EB companies have performed better. In terms of the super-efficiency DEA model, the EB companies demonstrate a better performance than the companies that have not yet adopted the EB initiatives (Chen et al., 2004).

FUTURE TRENDS

It has been recognized that the link between IT investment and firm performance is indirect. Future research should focus on the multi-stage performance of IT impacts on firm performance. For example, Chen and Zhu (2004) developed a preliminary DEA-based model to (i) characterize the indirect impact of IT on firm performance, (ii) identify the efficient frontier of two value-added stages related to IT investment and profit generation, and (iii) highlight firms that can be further analyzed for best-practice benchmarking.

CONCLUSION

The analysis of the retail industry data indicates that there is some evidence that the EB initiatives have had some degree of favorable impact on the financial performance of companies that have moved in this direction. Results indicate that the EB companies performed better in some measures than the non-EB companies included in the sample. Further, by contrasting the performance of EB companies against the non-EB companies, the findings confirm that the EB companies have benefited from the innovation and the strategic integration of the EB technologies.

REFERENCES

Bingi, P., Mir, A., & Khamalah, J. (2000). The challenges facing global e-commerce. Information Systems Management, 6(2), 26-34.
Charnes, A., Cooper, W.W., & Rhodes, E. (1978). Measuring the efficiency of decision making units. European Journal of Operational Research, 2, 429-444.
Chen, Y. (2004). Ranking efficient units in DEA. OMEGA, 32, 213-219.
Chen, Y., Motiwalla, L., & Khan, M.R. (2004). Using super-efficiency DEA to evaluate financial performance of e-business initiative in the retail industry. International Journal of Information Technology and Decision Making, 3(2), 337-351.
Chen, Y., & Zhu, J. (2004). Measuring information technology's indirect impact on firm performance. Information Technology & Management Journal, 5(1), 9-22.
Choi, S., & Winston, A. (2000). Benefits and requirements for interoperability in electronic marketplace. Technology in Society, 22, 33-44.
Cooper, W.W., Seiford, L.M., & Zhu, J. (2004). Handbook on data envelopment analysis. Boston: Kluwer Academic Publishers.


Hoffman, D., Novak, T., & Chatterjee, P. (1995). Commercial scenarios for the Web: Opportunities and challenges. Journal of Computer-Mediated Communications, 1(3).
Motiwalla, L., & Khan, M.R. (2002). Financial impact of e-business initiatives in the retail industry. Journal of Electronic Commerce in Organization, 1(1), 55-73.
Pyle, R. (1996). Commerce and the Internet. Communications of the ACM, 39(6), 23.
Seiford, L.M., & Zhu, J. (1998). Sensitivity analysis of DEA models for simultaneous changes in all the data. Journal of the Operational Research Society, 49, 1060-1071.
Seiford, L.M., & Zhu, J. (1999). Infeasibility of super-efficiency data envelopment analysis models. INFOR, 37, 174-187.
Steinfield, C., & Whitten, P. (1999). Community level socio-economic impacts of electronic commerce. Journal of Computer-Mediated Communications, 5(2).
White, G. (1999, December 3). How GM, Ford think Web can make a splash on the factory floor. Wall Street Journal, 1, 1.
Wigand, R., & Benjamin, R. (1995). Electronic commerce: Effects on electronic markets. Journal of Computer-Mediated Communication, 1(3).
Zhu, J. (1996). Robustness of the efficient DMUs in data envelopment analysis. European Journal of Operational Research, 90(3), 451-460.
Zhu, J. (2003). Quantitative models for performance evaluation and benchmarking: Data envelopment analysis with spreadsheets and DEA excel solver. Boston: Kluwer Academic Publishers.

KEY TERMS

Data Envelopment Analysis (DEA): A data-oriented mathematical programming approach that allows multiple performance measures in a single model.

Decision Making Unit (DMU): The subject under evaluation.

Efficient: Full efficiency is attained by any DMU if and only if none of its inputs or outputs can be improved without worsening some of its other inputs or outputs.

Electronic Business (EB): Business transactions involving exchange of goods and services with customers and/or business partners over the Internet.

Inputs/Outputs: Refer to the performance measures used in DEA evaluation. Inputs usually refer to the resources used, and outputs refer to the outcomes achieved by an organization or DMU.

Returns to Scale (RTS): RTS are considered to be increasing if a proportional increase in all the inputs results in a more than proportional increase in the single output. In DEA, the concept of returns to scale is extended to multiple inputs and multiple outputs situations.

Super-Efficiency: The input savings or output surpluses achieved by an efficient DMU.


Decision Tree Induction


Roberta Siciliano
University of Naples Federico II, Italy

Claudio Conversano
University of Cassino, Italy

INTRODUCTION

Decision Tree Induction (DTI) is an important step of the segmentation methodology. It can be viewed as a tool for the analysis of large datasets characterized by high dimensionality and nonstandard structure. Segmentation follows a nonparametric approach, since no hypotheses are made on the variable distribution. The resulting model has the structure of a tree graph. It is considered a supervised method, since a response criterion variable is explained by a set of predictors.

In particular, segmentation consists of partitioning the objects (also called cases, individuals, observations, etc.) into a number of subgroups (on the basis of a suitable partitioning of the modalities of the explanatory variables, the so-called predictors) in a recursive way, so that a tree structure is produced. Typically, the partitioning is into two subgroups, yielding binary trees, although ternary trees as well as r-way trees also can be built up. Two main targets can be achieved with tree structures (classification and regression trees), on the basis of the type of response variable, which can be categorical or numerical.

Tree-based methods are characterized by two main tasks: exploratory and decision. The first is to describe with the tree structure the dependence between the response and the predictors. The decision task is properly that of DTI, aiming to define a decision rule for unseen objects for estimating unknown response classes/values as well as validating the accuracy of the final results.

For example, trees often are considered in credit-scoring problems in order to describe and classify good and bad clients of a bank on the basis of socioeconomic indicators (e.g., age, working conditions, family status, etc.) and financial conditions (e.g., income, savings, payment methods, etc.). Conditional interactions describing the client profile can be detected by looking at the paths along the tree, when going from the top to the terminal nodes. Each internal node of the tree is assigned a partition (or a split, for a binary tree) of the predictor space, and each terminal node is assigned a label class/value of the response. As a result, each tree path, characterized by a sequence of predictor interactions, can be viewed as a production rule yielding a specific label class/value. The set of production rules constitutes the predictive learning of the response class/value of new objects, where only measurements of the predictors are known. As an example, a new client of a bank is classified as a good client or a bad one by dropping it down the tree according to the set of splits (binary questions) of a tree path, until a terminal node labeled by a specific response class is reached.

BACKGROUND

The appealing aspect for the segmentation user is that the final tree provides a comprehensive description of the phenomenon in different contexts of application, such as marketing, credit scoring, finance, medical diagnosis, and so forth.

Segmentation can be considered as an exploratory tool but also as a confirmatory nonparametric model. Exploration can be obtained by performing a recursive partitioning of the objects until a stopping rule defines the final structure to interpret. Confirmation is a different problem, requiring the definition of decision rules, usually obtained by performing a pruning procedure soon after a partitioning one. Important questions arise when using segmentation for predictive learning goals (Hastie et al., 2001; Zhang, 1999). The tree structure that fits the data and can be used for unseen objects cannot be the simple result of any partitioning algorithm. Two aspects should be jointly considered: the tree size (i.e., the number of terminal nodes) and the accuracy of the final decision rule evaluated by an error measure. In fact, a weak point of decision trees is the sensitivity of the classification/prediction rules, as measured by the size of the tree and its accuracy, to the type of dataset as well as to the pruning procedure. In other words, the ability of a decision tree to detect cases and take right decisions can be evaluated by a simple measure, but it also requires a specific induction procedure. Likewise, as in statistical inference, where the power of a testing procedure is judged with respect to changes of the alternative hypotheses, decision tree induction strongly


depends on both the hypotheses to verify and their alternatives. For instance, in classification trees, the number of response classes and the prior distribution of cases among the classes influence the quality of the final decision rule. In the credit-scoring example, an induction procedure using a sample of 80% good clients and 20% bad clients will likely provide reliable rules to identify good clients and unreliable rules to identify bad ones.

MAIN THRUST

Exploratory trees can be fruitfully used to investigate the data structure, but they cannot be used straightforwardly for induction purposes. The main reason is that exploratory trees are accurate and effective with respect to the training data used for growing the tree, but they might perform poorly when applied to classifying/predicting fresh cases that have not been used in the growing phase.

DTI Main Tasks

DTI definitely has an important purpose represented by understandability: the tree structure for induction needs to be simple and not large; this is a difficult task, since a predictor may reappear (even though in a restricted form) many times down a branch. At the same time, a further requirement is given by the identification issue: on one hand, terminal branches of the expanded tree reflect particular features of the training set, causing over-fitting; on the other hand, over-pruned trees necessarily do not allow identification of all the response classes/values (under-fitting).

Tree Model Building

Simplification method performance in terms of accuracy depends on the partitioning criterion used in the tree-growing procedure (Buntine & Niblett, 1992). Thus, exploratory trees become an important preliminary step for DTI. In tree model building, it is worth distinguishing between the optimality criterion for tree pruning (simplification method) and the criterion for selecting the best decision rule (decision rule selection). These criteria often use independent datasets (training set and test set). In addition, a validation set can be required to assess the quality of the final decision rule (Hand, 1997). In this respect, segmentation with pruning and assessment can be viewed as stages of any computational model-building process based on a supervised learning algorithm. Furthermore, growing the tree structure using a Fast Algorithm for Splitting Trees (FAST) (Mola & Siciliano, 1997) becomes a fundamental step to speed up the overall DTI procedure.

Tree Simplification: Pruning Algorithms

A further step is required for DTI, relying on the hypothesis of uncertainty in the data due to noise and residual variation. Simplifying trees is necessary to remove the most unreliable branches and improve understandability. Thus, the goal of simplification is inferential (i.e., to define the structural part of the tree and reduce its size while retaining its accuracy). Pruning methods consist in simplifying trees in order to remove the most unreliable branches and improve the accuracy of the rule for classifying fresh cases.

The pioneer approach to simplification was presented in the Automatic Interaction Detection (AID) of Morgan and Sonquist (1963). It was based on arresting the recursive partitioning procedure according to some stopping rule (pre-pruning). Alternative procedures consist in pruning algorithms working either from the bottom to the top of the tree (post-pruning) or vice versa (pre-pruning). CART (Breiman et al., 1984) introduced the idea of growing the totally expanded tree and then retrospectively removing some of the branches (post-pruning). This results in a set of optimally pruned trees for the selection of the final decision rule.

The main issue of pruning algorithms is the definition of a complexity measure that takes account of both the tree size and accuracy through a penalty parameter expressing the gain/cost of pruning tree branches. The training set is often used for pruning, whereas the test set is used for selecting the final decision rule. This is the case for both the error-complexity pruning of CART and the critical value pruning (Mingers, 1989). Nevertheless, some methods require only the training set. This is the case for the pessimistic error pruning and the error-based pruning (Quinlan, 1987, 1993), as well as the minimum error pruning (Cestnik & Bratko, 1991) and the CART cross-validation method. Instead, other methods use only the test set, such as the reduced error pruning (Quinlan, 1987). These latter pruning algorithms yield just one best pruned tree, which thus represents the final rule.

In DTI, accuracy refers to the predictive ability of the decision tree to classify/predict an independent set of test data. In classification trees, the error rate, measured by the number of incorrect classifications of the tree on test data, does not reflect accuracy of predictions for classes that are not equally likely, and those with few cases are usually badly predicted. As an alternative to the CART pruning, Cappelli, et al. (1998) provided a pruning algorithm based on the impurity-complexity measure to take account of the distribution of the cases over the classes.
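As an illustration of the grow-then-prune cycle discussed in this section, the sketch below uses a generic open-source tree implementation rather than the specific algorithms cited above: a large Gini-based tree is grown on a learning sample, a sequence of cost-complexity pruned subtrees is generated, and the subtree with the best performance on an independent test set is retained. The dataset is synthetic and stands in for a credit-scoring table.

```python
# Illustrative sketch (not the cited methods): grow a large Gini tree, then pick a
# cost-complexity pruned subtree using an independent test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a credit-scoring table (80% "good", 20% "bad" clients).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate complexity penalties along the post-pruning sequence of the grown tree.
path = DecisionTreeClassifier(criterion="gini", random_state=0) \
    .cost_complexity_pruning_path(X_train, y_train)

best = None
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    acc = tree.score(X_test, y_test)          # accuracy on the independent test set
    if best is None or acc > best[0]:
        best = (acc, alpha, tree.get_n_leaves())

print("test accuracy %.3f with ccp_alpha=%.5f and %d terminal nodes" % best)
```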


Decision Rule Selection

Given a set of optimally pruned trees, a simple method to choose the optimal decision rule consists in selecting the pruned tree producing the minimum misclassification rate on an independent test set (0-SE rule) or the most parsimonious pruned tree whose error rate on the test set is within one standard error of the minimum (1-SE rule) (Breiman et al., 1984). Once the sequence of pruned trees is obtained, a statistical testing pruning also could be performed to achieve the most reliable decision rule (Cappelli et al., 2002).

Tree Averaging and Compromise Rules

A more recent approach is to generate a set of candidate decision trees that are aggregated in a suitable way to provide a more accurate final rule. This strategy, which is an alternative to the selection of one final tree, consists in the definition of either a compromise rule or a consensus rule using a set of trees rather than a single one. One approach is known as tree averaging, which classifies a new object by averaging over a set of trees using a set of weights (Oliver & Hand, 1995). It requires the definition of the set of trees (since it is impractical to average over every possible pruned tree), the calculation of the weights, and an independent data set to classify. An alternative selection method for classification problems consists in summarizing the information of each tree by a table cross-classifying terminal nodes and response classes, so that evaluating the predictability power or the degree of homogeneity through a statistical index yields the best rule (Siciliano, 1998).

Ensemble Methods

Ensemble methods are based on a combination, expressed by a weighted or non-weighted aggregation, of single induction estimators able to improve the overall accuracy of any single induction method.

Bagging (Bootstrap Aggregating) (Breiman, 1996) is obtained by a bootstrap replication of the sample units of the training sample, each having the same probability to be included in the bootstrap sample, in order to generate single prediction/classification rules that are aggregated, providing a final decision rule consisting in either the average (for regression problems) or the modal class (for classification problems) of the single estimates. Definitions of bias and variance for a classifier as components of the test set error have been used by Breiman (1998). Unstable classifiers can have low bias on a large range of data sets; their problem is high variance.

Adaptive Boosting (AdaBoost) (Freund & Schapire, 1996; Schapire et al., 1998) adopts an iterative bootstrap replication of the sample units of the training sample such that at any iteration, misclassified/worse predicted cases have a higher probability to be included in the current bootstrap sample, and the final decision rule is obtained by majority voting.

Random Forest (Breiman, 1999) is instead an ensemble of unpruned trees obtained by introducing two bootstrap resampling schemes, one on the objects and another one on the predictors, such that an out-of-bag sample provides the estimation of the test set error and suitable measures of predictor importance are derived for the final interpretation.

FUTURE TRENDS

The use of trees for both exploratory and decision purposes will receive more and more attention in both the statistical and the machine-learning communities, with novel contributions highlighting the idea of decision rules as a preliminary or final tool for data analysis (Conversano et al., 2001).

Combining Trees With Other Statistical Tools

Some examples are given by the joint use of statistical modeling and tree structures for exploratory analysis as well as for confirmatory analysis. For the former, examples are given by the procedures called multi-budget trees and two-stage discriminant trees, which have been recently overviewed by Siciliano, Aria, and Conversano (2004). For the latter, an example is given by the stepwise model tree induction method (Malerba et al., 2004), which associates multiple regression models to the leaves of the tree in order to define a suitable data-driven procedure for tree induction.

Trees in Regression Smoothing

It is also worth mentioning the idea of using DTI for regression smoothing purposes. Conversano (2002) introduced a novel class of semiparametric models named Generalized Additive Multi-Mixture Models (GAM-MM) for performing statistical learning for both classification and regression tasks. These models are similar to generalized additive models and work using an iterative estimation algorithm evaluating the predictive power of suitable mixtures of smoothers/classifiers associated with each predictor entering the model.


Trees are part of the set of alternative smoothers or classifiers defining the estimation functions associated with each predictor, on the basis of bagged scoring measures taking into account the trade-off between estimation accuracy and model complexity.

Trees for Data Imputation and Data Validation

A different approach is based on the use of DTI to solve data quality problems. It is well known that the extreme flexibility of DTI allows handling missing data in an easy way when performing data modeling. But DTI does not permit performing missing data imputation in a straightforward manner when dealing with multivariate data. Conversano and Siciliano (2004) defined an incremental approach for missing data imputation based on decision trees. The basic idea is inspired by the principle of statistical learning for information retrieval; namely, observed data can be used to impute missing observations in an incremental manner, starting from the subgroups of observations presenting the lowest proportion of missingness (with respect to one set of variables) up to the subgroups presenting the maximum number of them (with respect to one set of variables). The imputation is made in each step of the algorithm by performing a segmentation of the complete cases (using as response the variable presenting missing values). The final decision rule is derived using either cross-validation or boosting, and it is used to impute missing values in a given subgroup.

CONCLUSION

In the last two decades, computational enhancements have highly contributed to the increase in popularity of tree structures. Partitioning algorithms and decision tree induction are two faces of one medal. While computational time consumption has been rapidly decreasing, the statistician is making use of more computationally intensive procedures to find an unbiased and accurate decision rule for new objects. Nevertheless, DTI cannot result in finding out just a number (the error rate) with its accuracy; trees are much more than a decision rule. Software enhancements through interactive guided user interfaces and custom routines should empower the final user of tree structures, making it possible to achieve further important results in the directions of interpretability, identification, and robustness.

REFERENCES

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Breiman, L. (1999). Random forest - random features [technical report]. Berkeley, CA: University of California.
Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Buntine, W., & Niblett, T. (1992). A further comparison of splitting rules for decision-tree induction. Machine Learning, 8, 75-85.
Cappelli, C., Mola, F., & Siciliano, R. (1998). An alternative pruning method based on the impurity-complexity measure. Proceedings in Computational Statistics: 13th Symposium of COMPSTAT. London: Springer.
Cappelli, C., Mola, F., & Siciliano, R. (2002). A statistical approach to growing a reliable honest tree. Computational Statistics and Data Analysis, 38, 285-299.
Cestnik, B., & Bratko, I. (1991). On estimating probabilities in tree pruning. Proceedings of the EWSL-91. Berlin: Springer.
Conversano, C. (2002). Bagged mixture of classifiers using model scoring criteria. Patterns Analysis & Applications, 5(4), 351-362.
Conversano, C., Mola, F., & Siciliano, R. (2001). Partitioning algorithms and combined model integration for data mining. Computational Statistics, 16, 323-339.
Conversano, C., & Siciliano, R. (2004). Incremental tree-based missing data imputation with lexicographic ordering. Proceedings of Interface 2003, Fairfax, USA.
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. Proceedings of Machine Learning, Thirteenth International Conference. Berlin: Springer.
Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, USA: MIT Press.
Hastie, T., Friedman, J.H., & Tibshirani, R. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer Verlag.
Malerba, D., Esposito, F., Ceci, M., & Appice, A. (2004). Top-down induction of model trees with regression and splitting nodes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 6.


Mingers, J. (1989a). An empirical comparison of selection measures for decision tree induction. Machine Learning, 3, 319-342.
Mingers, J. (1989b). An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4, 227-243.
Mola, F., & Siciliano, R. (1997). A fast splitting algorithm for classification trees. Statistics and Computing, 7, 209-216.
Morgan, J.N., & Sonquist, J.A. (1963). Problem in the analysis of survey data and a proposal. Journal of the American Statistical Association, 58, 415-434.
Oliver, J.J., & Hand, D.J. (1995). On pruning and averaging decision trees. Proceedings of Machine Learning, the 12th International Workshop. Berlin: Springer.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 86-106.
Quinlan, J.R. (1987). Simplifying decision trees. Internat. J. Man-Mach. Studies, 27, 221-234.
Schapire, R.E., Freund, Y., Bartlett, P., & Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1998.
Siciliano, R. (1998). Exploratory versus decision trees. In R. Payne & P. Green (Eds.), Proceedings in computational statistics (pp. 113-124). Physica-Verlag. London: Springer.
Siciliano, R., Aria, M., & Conversano, C. (2004). Tree harvest: Methods, software and applications. In J. Antoch (Ed.), COMPSTAT 2004 Proceedings (pp. 1807-1814). Berlin: Springer.
Zhang, H., & Singer, B. (1999). Recursive partitioning in the health sciences. New York: Springer Verlag.

KEY TERMS

Adaptive Boosting (AdaBoost): An iterative bootstrap replication of the sample units of the training sample such that at any iteration, misclassified/worse predicted cases have a higher probability to be included in the current bootstrap sample, and the final decision rule is obtained by majority voting.

Bagging (Bootstrap Aggregating): A bootstrap replication of the sample units of the training sample, each having the same probability to be included in the bootstrap sample, to generate single prediction/classification rules that, being aggregated, provide a final decision rule consisting in either the average (for regression problems) or the modal class (for classification problems) among the single estimates.

Classification Tree: An oriented tree structure obtained by a recursive partitioning of a sample of cases on the basis of a sequential partitioning of the predictor space so as to obtain internally homogeneous groups and externally heterogeneous groups of cases with respect to a categorical variable.

Decision Rule: The result of an induction procedure providing the final assignment of a response class/value to a new object for which only the predictor measurements are known. Such a rule can be drawn in the form of a decision tree.

Ensemble: A combination, typically a weighted or unweighted aggregation, of single induction estimators able to improve the overall accuracy of any single induction method.

Exploratory Tree: An oriented tree graph formed by internal nodes and terminal nodes, the former allowing the description of the conditional interaction paths between the response variable and the predictors, whereas the latter are labeled by a response class/value.

FAST (Fast Algorithm for Splitting Trees): A splitting procedure to grow a binary tree using a suitable mathematical property of the impurity proportional reduction measure to find the optimal split at each node without necessarily trying out all candidate splits.

Partitioning Tree Algorithm: A recursive algorithm to form disjoint and exhaustive subgroups of objects from a given group in order to build up a tree structure.

Production Rule: A tree path characterized by a sequence of predictor interactions yielding a specific label class/value of the response variable.

Pruning: A top-down or bottom-up selective algorithm to reduce the dimensionality of a tree structure in terms of the number of its terminal nodes.

Random Forest: An ensemble of unpruned trees obtained by introducing two bootstrap resampling schemes, one on the objects and another one on the predictors, such that an out-of-bag sample provides the estimation of the test set error, and suitable measures of predictor importance are derived for the final interpretation.


Regression Tree: An oriented tree structure obtained by a recursive partitioning of a sample on the basis of a sequential partitioning of the predictor space so as to obtain internally homogeneous groups and externally heterogeneous groups of cases with respect to a numerical response variable.


Diabetic Data Warehouses


Joseph L. Breault
Ochsner Clinic Foundation, USA

INTRODUCTION

The National Academy of Sciences convened in 1995 for a conference on massive data sets. The presentation on health care noted that "massive applies in several dimensions . . . the data themselves are massive, both in terms of the number of observations and also in terms of the variables . . . there are tens of thousands of indicator variables coded for each patient" (Goodall, 1995, paragraph 18). We multiply this by the number of patients in the United States, which is hundreds of millions.

Diabetic registries have existed for decades. Data-mining techniques have recently been applied to them in an attempt to predict diabetes development or high-risk cases, to find new ways to improve outcomes, and to detect provider outliers in quality of care or in billing services (Breault, 2001; He, Koesmarno, Van, & Huang, 2000; Hsu, Lee, Liu, & Ling, 2000; Kakarlapudi, Sawyer, & Staecker, 2003; Stepaniuk, 1999; Tafeit, Moller, Sudi, & Reibnegger, 2000).

Diabetes is a major health problem. The long history of diabetic registries makes it a realistic and valuable target for data mining.

BACKGROUND

In-depth examination of one such diabetic data warehouse developed a method of applying data-mining techniques to this type of database (Breault, Goodall, & Fos, 2002). There are unique data issues and analysis problems with medical transactional databases. The lessons learned will be applicable to any diabetic database and perhaps to broader medical databases.

Methods for translating a complex relational medical database with time series and sequencing information to a flat file suitable for data mining are challenging. We used the classification tree approach with a binary target variable. While many data mining methods (neural networks, logistic regression, etc.) could be used, classification trees have been noted to be appealing to physicians because much of medical diagnosis training operates in a fashion similar to classification trees.

MAIN THRUST

Three major challenges are reviewed here: a) understanding and converting the diabetic databases into a data-mining data table, b) the data mining, and c) utilizing results to assist clinicians and managers in improving the health of the population studied.

The Diabetic Database

The diabetic data warehouse we studied included 30,383 diabetic patients during a 42-month period, with hundreds of fields per patient.

Understanding the data requires awareness of its limitations. These data were obtained for purposes other than research. Clinicians will be aware that billing codes are not always precise, accurate, and comprehensive. However, the codes are widely used in outcomes modeling. Epidemiologists and clinicians will be aware that important predictors of diabetic outcomes are missing from the database, such as body mass index, family history of diabetes, time since the onset of diabetes, diet, and exercise habits. These variables were not electronically stored and would require going to the paper chart and patient interviews to obtain.

Developing the Data-Mining Data Table

The major challenge is transforming the data from the relational structure of the diabetic data warehouse with its multiple tables to a form suitable for data mining (Nadeau, Sullivan, Teorey, & Feldman, 2003). Data-mining algorithms are most often based on a single table, within which is a record for each individual, and the fields contain variable values specific to the individual. We call this the data-mining data table. The most portable format for the data-mining data table is a flat file, with one line for each individual record.

SQL statements on the data warehouse create the flat file output that the data-mining software then reads. The steps are as follows:

• Review each table of the relational database and select the fields to export.


• Determine the interactions between the tables in the relational database.
• Define the layout of the data-mining data table.
• Specify patient inclusion and exclusion criteria. What is the time interval? What are the minimum and maximum number of records (e.g., clinic visits or outcome measures) each patient must have to be included? What relevant fields can be missing and still include the individual in the data-mining data table?
• Extract data, including the stripping of patient identifiers to protect human subjects.
• Determine how to handle missing values (Duhamel, Nuttens, Devos, Picavet, & Beuscart, 2003).
• Perform sanity checks on the data-mining data table, for example, that the minimum and maximum of each variable make clinical sense.

Handling time series medical data is challenging for data-mining software. One example in our study is the HgbA1c, the key measure of glycemic control. This is closely related to clinical outcomes and complication rates in diabetes. Health care costs increase markedly with each 1% increase in baseline HgbA1c; patients with an HgbA1c of 10% versus 6% had a 36% increase in 3-year medical costs (Blonde, 2001). How should this time series variable be transformed from the relational database to a vector (column) in the data-mining data table? A given diabetic patient may have many of these HgbA1c results. We could pick the last one, the first, a median, or a mean value. Because the trend over time for this variable is important, we could choose the slope of its regression line over time. However, a linear function may be a good representation for some patients, but a very bad one for others that may be better represented by an upside-down U curve. This difficulty is a problem for most repeated laboratory tests. Some information will be lost in the creation of the data-mining data table.

We used the average HgbA1c for a given patient and excluded patients who did not have at least two HgbA1c results in the data warehouse. We repartitioned this average HgbA1c into a binary variable based on a meaningful clinical cut-point of 9.5%. Experts agree that an HgbA1c >9.5% is a bad outcome, or a medical quality error, no matter what the circumstances (American Medical Association, Joint Commission on Accreditation of Healthcare Organizations, & National Committee for Quality Assurance, 2001).

Our final data-mining data table had 15,902 patients (rows). Mean HgbA1c > 9.5% was the target variable, and the 10 predictors were age, sex, emergency department visits, office visits, comorbidity index, dyslipidemia, hypertension, cardiovascular disease, retinopathy, and end stage renal disease. All these patients had at least two HgbA1c tests and at least two office visits, the criteria we used for minimal continuity in this 42-month period.

Data-Mining Technique

We used the classification tree approach as standardized in the CART software by Salford Systems. As detailed in Hand, Mannila, and Smyth (2001), the principle behind all tree models is to recursively partition the input variable space to maximize purity in the terminal tree nodes. The partitioning split in any cell is done by searching each possible threshold for each variable to find the threshold split that leads to the greatest improvement in the purity score of the resultant nodes. Hence, this is a monothetic process, which may be a limitation of this method in some circumstances.

In CART's defaults, the Gini splitting criteria are used, although other methods are options. This could recursively continue to the point of perfect purity, which would sometimes mean only one patient in a terminal node. But overfitting of the data does not help in accurately classifying another data set. Therefore, we divide the data randomly into learning and test sets. The number of trees generated is halted or pruned back by how accurately the classification tree created from the learning set can predict classification in the test set. Cross-validation is another option for doing this, though in the CART software's defaults this is limited to n = 3,000. This could be changed higher to use our full data set, but some CART consultants note, "The n-fold cross-validation technique is designed to get the most out of datasets that are too small to accommodate a hold-out or test sample. Once you have 3,000 records or more, we recommend that a separate test set be used" (Timberlake-Consultants, 2001). The original CART creators recommended dividing the data into test and learning samples whenever there were more than 1,000 cases, with cross-validation being preferable in smaller data sets (Breiman, Friedman, Olshen, & Stone, 1984).

The 10 predictor variables were used with the binary target variable of the HgbA1c average (cut-point of 9.5%) in an attempt to find interesting patterns that may have management or clinical importance and are not already known.

The variables that are most important to classification in the optimal CART tree were age (100, where the most important variable is arbitrarily given a relative score of 100), number of office visits (51), comorbidity index (CMI) (44), cardiovascular disease (16), cholesterol problems (17), number of emergency room visits (7), and hypertension (0.6).

CART can be used for multiple purposes. Here we want to find clusters of deviance from glycemic control.
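A rough sketch of the two steps just described (building the flat data-mining data table with the binary target and growing a Gini-based tree on a learning/test split) is given below. The file and column names are hypothetical, a generic decision-tree library stands in for the CART software, and the parameters are illustrative rather than those used in the study.

```python
# Hypothetical sketch of the data-table construction and tree-growing steps;
# table and column names are invented, and a generic library stands in for CART.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

labs = pd.read_csv("hgba1c_results.csv")        # patient_id, hgba1c (one row per test)
preds = pd.read_csv("patient_predictors.csv")   # patient_id + the 10 predictor columns

# Keep patients with at least two HgbA1c results, then average and binarize at 9.5%.
counts = labs.groupby("patient_id")["hgba1c"].agg(["count", "mean"])
target = (counts.loc[counts["count"] >= 2, "mean"] > 9.5).astype(int).rename("bad_control")

table = preds.merge(target.reset_index(), on="patient_id")   # the flat data-mining data table
X = table.drop(columns=["patient_id", "bad_control"])
y = table["bad_control"]

# Random learning/test split, Gini splitting, and a simple control on tree size.
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=50, random_state=1)
tree.fit(X_learn, y_learn)
print("test-set accuracy:", tree.score(X_test, y_test))
print("relative importances:", dict(zip(X.columns, tree.feature_importances_)))
```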


With no analysis, the rate of bad glycemic control in the learning sample is 13.2%. We want to find nodes that have higher rates of bad glycemic control. The first split in node 1 of the tree is based on an age of 65.6 (CART sends all cases less than or equal to the cut-point to the left and greater than the cut-point to the right). In node 2 (<=65.6 years of age), we have bad glycemic control in 19.4%.

If we look at the tree to find terminal nodes (TN) where the percentage of all the patients in those nodes having a bad HgbA1c is greater than the 19.4% true of those <= 65.6 years of age, we identify 4 of the 10 TNs in the learning sample.

The purer the nodes we limit ourselves to, the smaller the percentage of the overall population with bad glycemic control we get. With no analysis in the learning set, we can capture all 1,052 patients with bad glycemic control but must target the entire group of 7,953. When we limit ourselves to the purer node 2, we capture 74% of those with bad glycemic control by targeting only 50% of the population. If we limit ourselves to TN1, we capture 49% of those with bad glycemic control by targeting only 27% of the population. If we use more complicated criteria by combining the 4 TNs with the worst glycemic control, we capture 54% of those with bad glycemic control by targeting only 30% of the population.

The classification errors in the learning and test samples are substantial, as a quarter of the bad glycemic control patients are missed in the CART analysis. CART is doing a good job with the 10 predictor variables it is given, but more accurate prediction requires additional variables not in our database.

Adjustment to defaults in CART can give better results, defined as capturing a larger percentage of those with bad glycemic control within a smaller percentage of the population. However, the complexity of the formulas to identify the population is difficult for managers to use. Not only must managers identify the persons, but they must get enough of a feel for what the population characteristics are to know what interventions are likely to be helpful. This is more intuitive for those who are younger than 55 or younger than 65 than it is for those who satisfy 0.451*(AGE) + 0.893*(CMI) <= 32.5576.

Results

From this CART analysis, the most important variable associated with a bad HgbA1c score is age less than 65. Those less than 65.6 years old are almost three times as likely to have bad glycemic control as those who are older. The odds ratio that someone is less than 65.6 years old if they have a bad HgbA1c (average reading > 9.5%) is 3.18 (95% CI: 2.87, 3.53) (Fos & Fine, 2000). Similarly, the odds ratio that someone is younger than 55.2 years rather than older than 65.6 if they have bad glycemic control is 4.11 (95% CI: 3.60, 4.69). This is surprising information to most clinicians. Similar findings were recently reported with vascular endpoints in diabetic patients (Miyaki, Takei, Watanabe, Nakashina, & Omae, 2002).

The case may be that those with the worst glycemic control die young and never make it to the older group. Although this is an interesting theoretical explanation, nevertheless, these numbers represent real patients that need help with their glycemic control now.

If we want to target diabetics with bad HgbA1c values, the odds of finding them are 3.2 times as high in diabetic patients younger than 65.6 years as in those who are older, and 4.1 times as high in those who are younger than 55 as in those over 65. This information is clinically important because the younger group has so many more years of life left to develop diabetic complications from bad glycemic control. This is especially helpful because it tells us which population to target interventions at even before we have the HgbA1c values to show us. Health maintenance organizations and public health workers may want to explore what educational interventions can be successfully directed to younger diabetics (younger than 65, especially younger than 55), who are much more likely to have bad glycemic control than the geriatric patients.

FUTURE TRENDS

Areas that need further work to fully utilize data mining in health care include time-series issues, sequencing information, data-squashing technologies, and a tight integration of domain expertise and data-mining skills.

We have already discussed time-series issues. This has been investigated but needs further exploration in health care data mining (Bellazzi, Larizza, Magni, Montani, & Stefanelli, 2000; Goodall, 1999; Tsien, 2000).

The sequence of various events may hold meaning important to a study. For example, a patient may have better glycemic control, manifested in improved HgbA1c values, especially when the patient had an office visit with a physician within a month of a previous HgbA1c. Perhaps this is meaningful information that implies that the proper sequence of physician visits relative to HgbA1c measurements is an important predictor of good outcomes. All this information is located in the relational database, but we must ferret it out by having in advance an idea that a sequence of this sort may be important and then searching for such associations. There may be many such sequences involving interactions between hospital, clinic, pharmacy,


and laboratory variables. Regrettably, no amount of data mining will be able to extract sequence associations where we have not thought to extract the prerequisite variables from the relational database into the data-mining data table. In the ideal data-mining scenario, software could interface directly with the relational database and extract all possibly meaningful sequences for us to review, and domain experts would then sort through the list. This issue has begun to get attention (Džeroski & Lavrač, 2001) and will need to be addressed in future health care data mining.

It has been shown that using a data-squashing algorithm to reduce a massive data set is more powerful and accurate than using a random sample (DuMouchel, Volinsky, Johnson, Cortes, & Pregibon, 1999). Squashing is a form of lossy compression that attempts to preserve statistical information (DuMouchel, 2001). These newer data-squashing techniques may be a better approach than random sampling in massive data sets. These techniques also protect human subjects' privacy.

Transactional health care data mining, exemplified in the diabetic data warehouses discussed previously, involves a number of tricky data transformations that require close collaboration between domain experts and data miners (Breault & Goodall, 2002). Even with ideal collaboration or overlapping expertise, we need to develop new ways to extract variables from relational databases containing time-series and sequencing information. Part of the answer lies in collaborative groups that can have additional insights. Part of the answer lies in the further development of data-mining tools that act directly on a relational database without transformation to explicit data arrays. In some circumstances, it may be useful to produce several data-mining data tables to independently data mine and then combine the results. This may be particularly useful when different granularities of attributes are glossed over by the use of a single data-mining data table.

CONCLUSION

Data mining is valuable in discovering novel associations in diabetic databases that can prove useful to clinicians and administrators. This may also be the case for many other health care problems.

REFERENCES

American Medical Association, Joint Commission on Accreditation of Healthcare Organizations, & National Committee for Quality Assurance. (2001). Coordinated performance measurement for the management of adult diabetes. Retrieved March 24, 2005, from http://www.ama-assn.org/ama/upload/mm/370/nr.pdf
Bellazzi, R., Larizza, C., Magni, P., Montani, S., & Stefanelli, M. (2000). Intelligent analysis of clinical time series: An application in the diabetes mellitus domain. Artificial Intelligence Med, 20(1), 37-57.
Blonde, L. (2001). Epidemiology, costs, consequences, and pathophysiology of type 2 diabetes: An American epidemic. Ochsner Journal, 3(3), 126-131.
Breault, J. L. (2001). Data mining diabetic databases: Are rough sets a useful addition? In E. Wegman, A. Braverman, A. Goodman, & P. Smyth (Eds.), Computing science and statistics (pp. 597-606). Interface Foundation of North America, Inc, 33, Fairfax Station, VA.
Breault, J. L., & Goodall, C. R. (2002, January). Mathematical challenges of variable transformations in data mining diabetic data warehouses. Paper presented at the Mathematical Challenges in Scientific Data Mining Conference, Los Angeles, CA.
Breault, J. L., Goodall, C. R., & Fos, P. J. (2002). Data mining a diabetic data warehouse. Artificial Intelligence Med, 26(1-2), 37-54.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International.
Duhamel, A., Nuttens, M.C., Devos, P., Picavet, M., & Beuscart, R. (2003). A preprocessing method for improving data mining techniques: Application to a large medical diabetes database. Stud Health Technol Inform, 95, 269-274.
DuMouchel, W. (2001). Data squashing: Constructing summary data sets. In E. Wegman, A. Braverman, A. Goodman, & P. Smyth (Eds.), Computing science and statistics. Interface Foundation of North America, Inc., Fairfax Station, VA.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., & Pregibon, D. (1999, August). Squashing flat files flatter. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA.
Džeroski, S., & Lavrač, N. (2001). Relational data mining. New York: Springer.
Fos, P. J., & Fine, D. J. (2000). Designing health care for populations: Applied epidemiology in health care administration. San Francisco: Jossey-Bass.
Goodall, C. (1995). Massive data sets in healthcare. In (Chair: Jon R. Kettenring), Massive data sets. Paper


sented at the meeting of the Committee on Applied and Timberlake-Consultants. (2001). CART frequently asked
Theoretical Statistics at the National Academy of Sciences, questions. Retrieved October 21, 2001, from http:// ,
National Research Council, Washington, DC. Retrieved www.timberlake.co.uk/software/cart/cartfaq1.htm#q23
March 29, 2005, from http://bob.nap.edu/html/massdata/
media/cgoodall-t.html Tsien, C. L. (2000). Event discovery in medical time-series
data. Proceedings of the AMIA Symposium, 858-862.
Goodall, C. R. (1999). Data mining of massive datasets in
healthcare. Journal of Computational and Graphical
Statistics, 8(3), 620-634.
KEY TERMS
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles
of data mining. Cambridge, MA: MIT Press. Co-Morbidity Index: A composite variable that
He, H., Koesmarno, Van, & Huang. (2000). Data mining gives a measure of how many other medical problems
in disease management: A diabetes case study. In R. someone has in addition to the one being studied.
Mizoguchi & J. K. Slaney (Eds.), Proceedings of the Data Mining Data Table: The flat file constructed
Sixth Pacific Rim International Conference on Artifi- from the relational database that is the actual table used
cial Intelligence: Topics in artificial intelligence (p. by the data-mining software.
799). New York: Springer.
Glycemic Control: Tells how well controlled are
Hsu, W. (2000, August). Exploration mining in diabetic the sugars of a diabetic patient. Usually measured by
patients databases: Findings and conclusions. Proceed- HgbA1c.
ings of the Sixth ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining, USA. HgbA1c: Blood test that measures the percent of
receptors on a red blood cell that are saturated with
Kakarlapudi, V., Sawyer, R., & Staecker, H. (2003). The glucose. This is translated into a measure of how sugars
effect of diabetes on sensorineural hearing loss. Otol have averaged over the last few months. Normal is less
Neurotol, 24(3), 382-386. than 6, depending on the laboratory standards.
Miyaki, K., (2002). Novel statistical classification Medical Transactional Database: The database
model of type 2 diabetes mellitus patients for tailor- created from the billing and required reporting transac-
made prevention using data mining algorithm. J tions of a medical practice. Clinical experience is some-
Epidemiol, 12(3), 243-248. times required to understand gaps and inadequacies in
Nadeau, T. P., (2003). Applying database technology to the collected data.
clinical and basic research bioinformatics projects. J Monothetic Process: In a classification tree, when
Integr Neurosci, 2(2), 201-217. data at each node are split on just one variable rather than
Stepaniuk, J. (1999, June). Rough set data mining of several variables.
diabetes data. In Z. Ras & A. Skowron (Eds.), Proceed- Recursive Partitioning: The method used to di-
ings of the 11th International Symposium onVol. Foun- vide data at each node of a classification tree. At the top
dations of intelligent systems (pp. 457-465). New York: node, every variable is examined at every possible value
Springer. to determine which variable split will produce the maxi-
Tafeit, E., Moller, Sudi, & Reibnegger. (2000). ROC mum and minimum amounts of the target variable in the
and CART analysis of subcutaneous adipose tissue to- daughter nodes. This is recursively done for each addi-
pography (SAT-Top) in type-2 diabetic women and tional node.
healthy females. American Journal of Human Biology,
12, 388-394.

Discovering an Effective Measure in Data Mining

Takao Ito
Ube National College of Technology, Japan

INTRODUCTION

One of the most important issues in data mining is to discover an implicit relationship between words in a large corpus and labels in a large database. The relationship between words and labels is often expressed as a function of distance measures. An effective measure would be useful not only for achieving high precision in data mining, but also for saving time in the data mining operation. In previous research, many measures for calculating the one-to-many relationship have been proposed, such as the complementary similarity measure, the mutual information, and the phi coefficient. Some research showed that the complementary similarity measure is the most effective. The author reviewed previous research related to the measures in one-to-many relationships and proposes, in this article, a new idea for obtaining an effective one, based on the heuristic approach.

BACKGROUND

Generally, the knowledge discovery in databases (KDD) process consists of six stages: data selection, cleaning, enrichment, coding, data mining, and reporting (Adriaans & Zantinge, 1996). Needless to say, data mining is the most important part of the KDD process. There are various techniques, such as statistical techniques, association rules, and query tools in a database, for different purposes in data mining (Agrawal, Mannila, Srikant, Toivonen & Verkamo, 1996; Berland & Charniak, 1999; Caraballo, 1999; Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996; Han & Kamber, 2001).

When two words or labels in a large database have some implicit relationship with each other, one of the different purposes is to find the two related words or labels effectively. In order to find relationships between words or labels in a large database, the author found the existence of at least six distance measures after reviewing previously conducted research.

The first one is the mutual information proposed by Church and Hanks (1990). The second one is the confidence proposed by Agrawal and Srikant (1995). The third one is the complementary similarity measure (CSM) presented by Hagita and Sawaki (1995). The fourth one is the dice coefficient. The fifth one is the Phi coefficient. The last two are both mentioned by Manning and Schutze (1999). The sixth one is the proposal measure (PM) suggested by Ishiduka, Yamamoto, and Umemura (2003). It is one of several new measures developed by them in their paper.

In order to evaluate these distance measures, formulas are required. Yamamoto and Umemura (2002) analyzed these measures and expressed them in terms of four parameters a, b, c, and d (Table 1).

Suppose that there are two words or labels, x and y, and they are associated together in a large database. The meanings of the parameters in these formulas are as follows:

a. The number of documents/records that have both x and y.
b. The number of documents/records that have x but not y.
c. The number of documents/records that do not have x but do have y.
d. The number of documents/records that have neither x nor y.
n. The total of the parameters a, b, c, and d.

Umemura (2002) pointed out the following in his paper: "Occurrence patterns of words in documents can be expressed as binary. When two vectors are similar, the two words corresponding to the vectors may have some implicit relationship with each other." Yamamoto and Umemura (2002) completed an experiment to test the validity of these indexes under Umemura's concept. The result of their experiment on distance measures without a noisy pattern can be seen in Figure 1 (Yamamoto & Umemura, 2002).

The experiment by Yamamoto and Umemura (2002) showed that the most effective measure is the CSM. They indicated in their paper: "All graphs showed that the most effective measure is the complementary similarity measure, and the next is the confidence and the third is asymmetrical average mutual information. And the least is the average mutual information" (Yamamoto & Umemura, 2002). They also completed their experiments with a noisy pattern and found the same result (Yamamoto & Umemura, 2002).


Table 1. Kinds of distance measures and their formulas

1. The mutual information: I(x1; y1) = log( a * n / ((a + b)(a + c)) )
2. The confidence: conf(Y | X) = a / (a + c)
3. The complementary similarity measure: Sc(F, T) = (ad - bc) / ((a + c)(b + d))^0.5
4. The dice coefficient: Sd(F, T) = 2a / ((a + b) + (a + c))
5. The Phi coefficient: Phi = (ad - bc) / ((a + b)(a + c)(b + d)(c + d))^0.5
6. The proposal measure: S(F, T) = a^2 * b / (1 + c)
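The following minimal Python sketch (not part of the original article) shows how the six measures in Table 1 can be computed from the contingency counts a, b, c, and d; the proposal measure is given as reconstructed above, and zero denominators or non-positive log arguments would need to be guarded in practice, as the article itself notes for the PM experiment.

    import math

    def distance_measures(a, b, c, d):
        # Compute the six measures of Table 1 from the four contingency counts.
        n = a + b + c + d                                         # total documents/records
        mutual_information = math.log(a * n / ((a + b) * (a + c)))
        confidence = a / (a + c)
        csm = (a * d - b * c) / math.sqrt((a + c) * (b + d))      # complementary similarity measure
        dice = 2 * a / ((a + b) + (a + c))
        phi = (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))
        pm = (a ** 2) * b / (1 + c)                               # proposal measure (as reconstructed)
        return mutual_information, confidence, csm, dice, phi, pm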

Figure 1. Result of the experiment of distance measures without noisy pattern

MAIN THRUST

How to select a distance measure is a very important issue, because it has a great influence on the result of data mining (Fayyad, Piatetsky-Shapiro & Smyth, 1996; Glymour, Madigan, Pregibon & Smyth, 1997). The author completed the following three kinds of experiments, based upon the heuristic approach, in order to discover an effective measure in this article (Aho, Kernighan & Weinberger, 1995).

RESULT OF THE THREE KINDS OF EXPERIMENTS

All three kinds of experiments were executed under the following conditions.

In order to discover an effective measure, the author selected actual data of place names, such as the name of a prefecture and the name of a city in Japan, from the articles of a nationally circulated newspaper, the Yomiuri. The reasons for choosing place names are as follows: first, there are one-to-many relationships between the name of a prefecture and the name of a city; second, the one-to-many relationship can be checked easily from maps and the telephone directory. Generally speaking, the first name, such as the name of a prefecture, contains other names, such as the names of cities. For instance, Fukuoka City is geographically located in Fukuoka Prefecture, and Kitakyushu City also is included in Fukuoka Prefecture, so there are one-to-many relationships between the name of the prefecture and the names of the cities. The distance measure would be calculated with a large database in the experiments. The experiments were executed as follows:

Step 1. Establish the database of the newspaper.
Step 2. Choose the prefecture names and city names from the database mentioned in step 1.
Step 3. Count the parameters a, b, c, and d from the newspaper articles.
Step 4. Calculate the distance measure adopted.
Step 5. Sort the results calculated in step 4 in descending order of the distance measure.
Step 6. List the top 2,000 from the result of step 5.
Step 7. Judge whether each one-to-many relationship is correct or not.
Step 8. List the top 1,000 as output data and count the number of correct relationships.
Step 9. Finally, an effective measure will be found from the resulting correct number. (A compact sketch of steps 4 through 8 is given just before Table 2.)

To uncover an effective measure, two methods should be considered. The first one is to test various combinations of each variable in a distance measure and to find the best combination, depending upon the result. The second one is to assume that the function is a stable one and that only part of the function can be varied.

The first one is the experiment with the PM. The total number of combinations of five variables is 3,125. The author calculated all combinations, except for the case when the denominator of the PM becomes zero. The result of the top 20 in the PM experiment, using a year's worth of articles from the Yomiuri in 1991, is calculated as follows.

In Table 2, the No. 1 function, a11c1, has the highest correct number; it means S(F, T) = (a * 1 * 1) / (c * 1 + 1) = a / (1 + c). The No. 11 function, 1a1c0, means S(F, T) = (1 * a * 1) / (c + 0) = a / c. The remaining functions in Table 2 can be read in the same way.

This result appears to be satisfactory. But to prove whether it is satisfactory or not, another experiment should be done. The author adopted the Phi coefficient and calculated its correct number. To compare with the result of the PM experiment, the author completed an experiment iterating the exponential index of the denominator in the Phi coefficient from 0 to 2, in 0.01 steps, based upon the idea of fractal dimension in complex theory, instead of the fixed exponential index of 0.5. The result of the top 20 in this experiment, using a year's worth of articles from the Yomiuri in 1991, just like the first experiment, can be seen in Table 3.

Compared with the result of the PM experiment mentioned previously, it is obvious that the number of correct relationships is less than that of the PM experiment; therefore, it is necessary to uncover a new, more effective measure. From the previous research done by Yamamoto and Umemura (2002), the author found that an effective measure is the CSM.

The author completed the third experiment using the CSM. The experiment began by iterating the exponential index of the denominator in the CSM from 0 to 1, in 0.01 steps, just like the second experiment.
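As a rough illustration (not the author's original program), the following Python fragment realises steps 4 through 8 of the procedure above; the list of candidate pairs with their contingency counts, the chosen measure, and the correctness oracle are all placeholders.

    def correct_number(pairs, score, is_correct, top_n=1000):
        # pairs: list of (prefecture, city, a, b, c, d) contingency records
        scored = [(score(a, b, c, d), pref, city)
                  for (pref, city, a, b, c, d) in pairs]
        scored.sort(key=lambda t: t[0], reverse=True)     # step 5: descending order of the measure
        top = scored[:top_n]                              # step 8: keep the top 1,000 candidates
        return sum(1 for _, pref, city in top if is_correct(pref, city))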

Table 2. Result of the top 20 in the PM experiment

No Function Correct Number No Function Correct Number


1 a11c1 789 11 1a1c0 778
2 a111c 789 12 1a10c 778
3 1a1c1 789 13 11acc 778
4 1a11c 789 14 11ac0 778
5 11ac1 789 15 11a0c 778
6 11a1c 789 16 aa1cc 715
7 a11cc 778 17 aa1c0 715
8 a11c0 778 18 aa10c 715
9 a110c 778 19 a1acc 715
10 1a1cc 778 20 a1ac0 715


Table 3. Result of the top 20 in the Phi coefficient experiment

No Exponential Index Correct Number No Exponential index Correct Number


1 0.26 499 11 0.20 465
2 0.25 497 12 0.31 464
3 0.27 495 13 0.19 457
4 0.24 493 14 0.32 445
5 0.28 491 15 0.18 442
6 0.23 488 16 0.17 431
7 0.29 480 17 0.16 422
8 0.22 478 18 0.15 414
9 0.21 472 19 0.14 410
10 0.30 470 20 0.13 403

Table 4. Result of the top 20 in the CSM experiment

1991 1992 1993 1994 1995 1996 1997


E.I. C.N. E.I. C.N. E.I. C.N. E.I. C.N. E.I. C.N. E.I. C.N. E.I. C.N.
0.73 850 0.75 923 0.79 883 0.81 820 0.85 854 0.77 832 0.77 843
0.72 848 0.76 922 0.80 879 0.82 816 0.84 843 0.78 828 0.78 842
0.71 846 0.74 920 0.77 879 0.8 814 0.83 836 0.76 826 0.76 837
0.70 845 0.73 920 0.81 878 0.76 804 0.86 834 0.75 819 0.79 829
0.74 842 0.72 917 0.78 878 0.75 800 0.82 820 0.74 818 0.75 824
0.63 841 0.77 909 0.82 875 0.79 798 0.88 807 0.71 818 0.73 818
0.69 840 0.71 904 0.76 874 0.77 798 0.87 805 0.73 812 0.74 811
0.68 839 0.78 902 0.75 867 0.78 795 0.89 803 0.70 803 0.72 811
0.65 837 0.79 899 0.74 864 0.74 785 0.81 799 0.72 800 0.71 801
0.64 837 0.70 898 0.73 859 0.73 769 0.90 792 0.69 800 0.80 794
0.62 837 0.69 893 0.72 850 0.72 761 0.80 787 0.79 798 0.81 792
0.66 835 0.80 891 0.71 833 0.83 751 0.91 777 0.68 788 0.83 791
0.67 831 0.81 884 0.83 830 0.71 741 0.79 766 0.67 770 0.82 790
0.75 828 0.68 884 0.70 819 0.70 734 0.92 762 0.80 767 0.70 789
0.61 828 0.82 883 0.69 808 0.84 712 0.93 758 0.66 761 0.84 781
0.76 824 0.83 882 0.83 801 0.69 711 0.78 742 0.81 755 0.85 780
0.60 818 0.84 872 0.68 790 0.85 706 0.94 737 0.65 749 0.86 778
0.77 815 0.67 870 0.85 783 0.86 691 0.77 730 0.82 741 0.87 776
0.58 814 0.66 853 0.86 775 0.68 690 0.76 709 0.83 734 0.69 769
0.82 813 0.85 852 0.67 769 0.87 676 0.95 701 0.64 734 0.68 761

Note: C.N. means the correct number, and E.I. means the exponential index.

Table 4 shows the result of the top 20 in this experiment, using a year's worth of articles from the Yomiuri for each year from 1991 to 1997.

It is obvious from these results that the CSM is more effective than the PM and the Phi coefficient. The relationship between the exponential index of the denominator in the CSM and the correct number, over the complete range of the index, can be seen in Figure 2. The most effective exponential index of the denominator in the CSM is from 0.73 to 0.85, not 0.5, as many researchers believe. It would therefore be hard to get the best result with the usual method of using the CSM.

Determine the Most Effective Exponential Index of the Denominator in the CSM

To discover the most effective exponential index of the denominator in the CSM, a calculation of the relationship between the exponential index and the total number of documents n was carried out, but no relationship was found. In fact, it is hard to consider that the exponential index would vary with differences in the number of documents.
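A small sketch of the third experiment, paraphrasing the procedure described above rather than reproducing the author's code: the exponential index e of the CSM denominator is swept from 0 to 1 in 0.01 steps and the correct number recorded for each value, reusing the correct_number helper sketched earlier.

    def csm_with_exponent(a, b, c, d, e):
        # complementary similarity measure with a variable exponential index e on the denominator
        return (a * d - b * c) / ((a + c) * (b + d)) ** e

    def sweep_exponent(pairs, is_correct):
        results = {}
        for step in range(101):                           # e = 0.00, 0.01, ..., 1.00
            e = step / 100.0
            score = lambda a, b, c, d, e=e: csm_with_exponent(a, b, c, d, e)
            results[e] = correct_number(pairs, score, is_correct)
        best = max(results, key=results.get)              # exponential index with the most correct pairs
        return best, results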


Figure 2. Relationships between the exponential indexes of the denominator in the CSM and the correct number (the correct number, 0 to 1,000, plotted against the exponential index of the denominator in the CSM, 0 to 1, with one curve for each year from 1991 to 1997)

The details of the four parameters a, b, c, and d behind Table 4 are listed in Table 5.

The author adopted the average of the cumulative exponential index to find the most effective exponential index of the denominator in the CSM. Based upon the results in Table 4, the averages of the cumulative exponential index of the top 20 were calculated, and those results are presented in Figure 3.

From the results in Figure 3, it is easy to understand that the exponential index converges to a constant value of about 0.78. Therefore, the exponential index could be fixed at a certain value, and it will not vary with the size of the document collection.

In this article, the author sets the exponential index at 0.78 to forecast the correct number. The gap between the result forecast by this revised method and the maximum result of the third experiment is illustrated in Figure 4.

FUTURE TRENDS

Many indexes have been developed for the discovery of an effective measure to determine an implicit relationship between words in a large corpus or labels in a large database. Ishiduka, Yamamoto, and Umemura (2003) referred to part of them in their research paper. Almost all of these approaches were developed with a variety of mathematical and statistical techniques, the conception of neural networks, and association rules. For many things in the world, part of the economic phenomena follow a normal distribution as referred to in statistics, but others do not, so the author developed the CSM measure from the idea of fractal dimension in this article. Conventional approaches may be inherently inaccurate, because usually they are based upon linear mathematics. The events of a concurrence pattern of correlated pair words may be explained better by nonlinear mathematics. Typical tools of nonlinear mathematics are complex theories, such as chaos theory, cellular automata, the percolation model, and fractal theory. It is not hard to predict that many more new measures will be developed in the near future, based upon complex theory.

CONCLUSION

Based upon previous research on distance measures, the author discovered an effective distance measure for one-to-many relationships with the heuristic approach. Three kinds of experiments were conducted, and it was confirmed that an effective measure is the CSM. In addition, it was discovered that the most effective exponential index of the denominator in the CSM is 0.78, not 0.50, as many researchers believe.

A great deal of work still needs to be carried out, and one part of it is the meaning of the 0.78. The meaning of the most effective exponential index of the denominator, 0.78, in the CSM should be explained and proved mathematically, although its validity has been evaluated in this article.


Table 5. Relationship between the maximum correct number and their parameters
Year Exponential Index Correct Number a b c d n
1991 0.73 850 2,284 46,242 3,930 1,329 53,785
1992 0.75 923 453 57,636 1,556 27 59,672
1993 0.79 883 332 51,649 1,321 27 53,329
1994 0.81 820 365 65,290 1,435 36 67,126
1995 0.85 854 1,500 67,914 8,042 190 77,646
1996 0.77 832 636 56,529 2,237 2,873 62,275

Figure 3. Relationships between the exponential index and the average of the cumulative exponential index (the average exponential index, roughly 0.76 to 0.795, plotted against the number of cumulative exponential indexes, 1 to 20)

ACKNOWLEDGMENTS

The author would like to express his gratitude to the following: Kyoji Umemura, Professor, Toyohashi University of Technology; Taisuke Horiuchi, Associate Professor, Nagano National College of Technology; Eiko Yamamoto, Researcher, Communication Research Laboratories; Yuji Minami, Associate Professor, Ube National College of Technology; and Michael Hall, Lecturer, Nakamura-Gakuen University, who suggested a large number of improvements, both in content and computer program.

REFERENCES

Adriaans, P., & Zantinge, D. (1996). Data mining. Addison Wesley Longman Limited.

Agrawal, R., & Srikant, R. (1995). Mining of association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data.

Agrawal, R. et al. (1996). Fast discovery of association rules, advances in knowledge discovery and data mining. Cambridge, MA: MIT Press.

Aho, A.V., Kernighan, B.W., & Weinberger, P.J. (1989). The AWK programming language. Addison-Wesley Publishing Company, Inc.

Berland, M., & Charniak, E. (1999). Finding parts in very large corpora. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.

Caraballo, S.A. (1999). Automatic construction of a hypernym-labeled noun hierarchy from text. Proceedings of the Association for Computational Linguistics.

Church, K.W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29.


Figure 4. Gaps between the maximum correct number and the correct number forecasted by the revised method (the correct number, 0 to 1,000, plotted by year from 1991 to 1997 for two series: the maximum correct number and the correct number forecasted by the revised method)

Fayyad, U.M. et al. (1996). Advances in knowledge discovery and data mining. AAAI Press/MIT Press.

Fayyad, U.M., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Glymour, C. et al. (1997). Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1(1), 11-28.

Hagita, N., & Sawaki, M. (1995). Robust recognition of degraded machine-printed characters using complimentary similarity measure and error-correction learning. Proceedings of the SPIE, The International Society for Optical Engineering.

Han, J., & Kamber, M. (2001). Data mining. Morgan Kaufmann Publishers.

Ishiduka, T., Yamamoto, E., & Umemura, K. (2003). Evaluation of a function presumes the one-to-many relationship [unpublished research paper] [Japanese edition].

Manning, C.D., & Schutze, H. (1999). Foundations of statistical natural language processing. MIT Press.

Umemura, K. (2002). Selecting the most highly correlated pairs within a large vocabulary. Proceedings of the COLING Workshop SemaNet02, Building and Using Semantic Networks.

Yamamoto, E., & Umemura, K. (2002). A similarity measure for estimation of one-to-many relationship in corpus [Japanese edition]. Journal of Natural Language Processing, 9, 45-75.

KEY TERMS

Complementary Similarity Measurement: An index developed experientially to recognize a poorly printed character by measuring the resemblance to the correct pattern of the character expressed as a vector. The author calls this a diversion index to identify the one-to-many relationship in the concurrence patterns of words in a large corpus or labels in a large database in this article.

Confidence: An asymmetric index that shows the percentage of records for which A occurred within the group of records for which the other two, X and Y, actually occurred under the association rule X, Y => A.

Correct Number: For example, if the city name is included in the prefecture name geographically, the author calls it correct. So, in this index, the correct number indicates the total number of correct one-to-many relationships calculated on the basis of the distance measures.

Distance Measure: One of the calculation techniques used to discover the relationship between two implicit words in a large corpus or labels in a large database from the viewpoint of similarity.


Mutual Information: Shows the amount of information that one random variable x contains about another y. In other words, it compares the probability of observing x and y together with the probabilities of observing x and y independently.

Phi Coefficient: One metric for corpus similarity measured upon the chi square test. It is an index to calculate the frequencies of the four parameters a, b, c, and d in documents based upon the chi square test with the frequencies expected for independence.

Proposal Measure: An index to measure the concurrence frequency of correlated pair words in documents, based upon the frequency with which two implicit words appear. It will have a high value when the two implicit words in a large corpus or labels in a large database occur with each other frequently.


Discovering Knowledge from XML Documents


Richi Nayak
Queensland University of Technology, Australia

INTRODUCTION

XML is the new standard for information exchange and retrieval. An XML document has a schema that defines the data definition and structure of the XML document (Abiteboul et al., 2000). Due to the wide acceptance of XML, a number of techniques are required to retrieve and analyze the vast number of XML documents. Automatic deduction of the structure of XML documents for storing semi-structured data has been an active subject among researchers (Abiteboul et al., 2000; Green et al., 2002). A number of query languages for retrieving data from various XML data sources have also been developed (Abiteboul et al., 2000; W3c, 2004). The use of these query languages is limited (e.g., limited types of inputs and outputs, and users of these languages should know exactly what kinds of information are to be accessed). Data mining, on the other hand, allows the user to search out unknown facts, the information hidden behind the data. It also enables users to pose more complex queries (Dunham, 2003).

Figure 1 illustrates the idea of integrating data mining algorithms with XML documents to achieve knowledge discovery. For example, after identifying similarities among various XML documents, a mining technique can analyze links between tags occurring together within the documents. This may prove useful in the analysis of e-commerce Web documents recommending personalization of Web pages.

Figure 1. XML mining scheme

BACKGROUND: WHAT IS XML MINING?

XML mining includes mining of structures as well as contents from XML documents, as depicted in Figure 2 (Nayak et al., 2002). Element tags and their nesting therein dictate the structure of an XML document (Abiteboul et al., 2000). For example, the textual structure enclosed by <author> </author> is used to describe the author tuple and its corresponding text in the document. Since XML provides a mechanism for tagging names with data, knowledge discovery on the semantics of the documents becomes easier for improving document retrieval on the Web. Mining of XML structure is essentially mining of schema, including intrastructure mining and interstructure mining.

Figure 2. A taxonomy of XML mining (XML mining divides into XML structure mining, comprising intrastructure and interstructure mining, and XML content mining, comprising content analysis and structural clarification)

Intrastructure Mining

Concerned with the structure within an XML document. Knowledge is discovered about the internal structure of XML documents in this type of mining. The following mining tasks can be applied.

The classification task of data mining maps a new XML document to a predefined class of documents. A schema is interpreted as a description of a class of XML documents. The classification procedure takes a collection of schemas as a training set and classifies new XML documents according to this training set.


The clustering task of data mining identifies similarities among various XML documents. A clustering algorithm takes a collection of schemas and groups them together on the basis of self-similarity. These similarities are then used to generate a new schema. As a generalization, the new schema is a superclass of the training set of schemas. This generated set of clustered schemas can now be used in classifying new schemas. The superclass schema also can be used in the integration of heterogeneous XML documents for each application domain. This allows users to find, collect, filter, and manage information sources more effectively on the Internet.

Association data mining describes relationships between tags that tend to occur together in XML documents, which can be useful in the future. By transforming the tree structure of XML into a pseudo-transaction, it becomes possible to generate rules of the form "if an XML document contains a <craft> tag, then 80% of the time it also will contain a <licence> tag." Such a rule may then be applied in determining the appropriate interpretation for homographic tags.

Interstructure Mining

Concerned with the structure between XML documents. Knowledge is discovered about the relationship between subjects, organizations, and nodes on the Web in this type of mining. The following mining tasks can be applied.

Clustering schemas involves identifying similar schemas. The clusters are used in defining hierarchies of schemas. The schema hierarchy overlaps instances on the Web, thus discovering authorities and hubs (Garofalakis et al., 1999). Creators of schemas are identified as authorities, and creators of instances are hubs. Additional mining techniques are required to identify all instances of a schema present on the Web. The following application of classification can identify the most likely places to mine for instances. Classification is applied with namespaces and URIs (Uniform Resource Identifiers). Having previously associated a set of schemas with a particular namespace or URI, this information is used to classify new XML documents originating from these places.

Content is the text between each start and end tag in XML documents. Mining for XML content is essentially mining for values (an instance of a relation), including content analysis and structural clarification.

Content Analysis

Concerned with analysing texts within XML documents. The following mining tasks can be applied to contents.

Classification is performed on XML content, labeling new XML content as belonging to a predefined class. To reduce the number of comparisons, pre-existing schemas classify the new document's schema. Then, only the instance classifications of the matching schemas need to be considered in classifying a new document.

Clustering on XML content identifies the potential for new classifications. Again, consideration of schemas leads to quicker clustering; similar schemas are likely to have a number of value sets. For example, all schemas concerning vehicles have a set of values representing cars, another set representing boats, and so forth. However, schemas that appear dissimilar may have similar content. Mining XML content inherits some problems faced in text mining and analysis. Synonymy and polysemy can cause difficulties, but the tags surrounding the content usually can help resolve ambiguities.

Structural Clarification

Concerned with distinguishing similarly structured documents based on contents. The following mining tasks can be performed.

Content provides support for alternate clustering of similar schemas. Two distinctly structured schemas may have document instances with identical content. Mining these avails new knowledge. Vice versa, schemas provide support for alternate clustering of content. Two XML documents with distinct content may be clustered together, given that their schemas are similar.

Content also may prove important in clustering schemas that appear different but have instances with similar content. Due to heterogeneity, the incidence of synonyms is increased. Are separate schemas actually describing the same thing, only with different terms? While thesauruses are vital, it is impossible for them to be exhaustive for the English language, let alone handle all languages. Conversely, schemas appearing similar actually are completely different, given homographs. The similarity of the content does not distinguish the semantic intention of the tags. Mining, in this case, provides probabilities of a tag having a particular meaning or a relationship between meaning and a URI.

METHODS OF XML STRUCTURE MINING

Mining of structures from a well-formed or valid document is straightforward, since a valid document has a schema mechanism that defines the syntax and structure of the document. However, since the presence of a schema is not mandatory for a well-formed XML document,


the document may not always have an accompanying schema. To describe the semantic structure of the documents, schema extraction tools are needed to generate schemas for the given well-formed XML documents.

DTD Generator (Kay, 2000) generates the DTD for a given XML document. However, the DTD generator yields a distinct DTD for every XML document; hence, a set of DTDs is defined for a collection of XML documents rather than an overall DTD. Thus, the application of data mining operations will be difficult in this case. Tools such as XTRACT (Garofalakis, 2000) and DTD-Miner (Moh et al., 2000) infer an accurate and semantically meaningful DTD schema for a given collection of XML documents. However, these tools depend critically on being given a relatively homogeneous collection of XML documents. In such a heterogeneous and flexible environment as the Web, it is not reasonable to assume that XML documents related to the same topic have the same document structure.

Due to a number of limitations of using DTDs as an internal structure, such as a limited set of data types, loose structure constraints, and the limitation of content to text, many researchers propose the extraction of XML schema as an extension to XML DTDs (Feng et al., 2002; Vianu, 2001). In Chidlovskii (2002), a novel XML schema extraction algorithm is proposed, based on Extended Context-Free Grammars (ECFG) with a range of regular expressions. Feng et al. (2002) also presented a semantic network-based design to convey the semantics carried by the hierarchical data structures of XML documents and to transform the model into an XML schema. However, both of these proposed algorithms are very complex.

Mining of structures from ill-formed XML documents (those that lack any fixed and rigid structure) is performed by applying the structure extraction approaches developed for semi-structured documents. But not all of these techniques can effectively support the structure extraction from XML documents that is required for the further application of data mining algorithms. For instance, the NoDoSe tool (Adelberg & Denny, 1999) determines the structure of a semi-structured document and then extracts the data. This system is based primarily on plain text and HTML files, and it does not support XML. Moreover, in Green et al. (2002), the proposed extraction algorithm considers both structure and contents in semi-structured documents, but the purpose is to query and build an index. These approaches are difficult to use without some alteration and adaptation for the application of data mining algorithms.

An alternative method is to approach the document as Object Exchange Model (OEM) data (Nestorov et al., 1999; Wang et al., 2000) by using the corresponding data graph to produce the most specific data guide (Nayak et al., 2002). The data graph represents the interactions between the objects in a given data domain. When extracting a schema from a data graph, the goal is to produce the most specific schema graph from the original graph. This way of extracting schema is more general than using the schema as a guide, because most XML documents do not have a schema, and sometimes, if they have a schema, they do not conform to it.
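As a rough illustration of the kind of structural summary these extraction approaches aim at (a minimal sketch, not any of the cited tools), the following Python fragment collects the set of root-to-element tag paths occurring in a well-formed XML document; such path sets can then feed the clustering or classification of document structure discussed earlier.

    import xml.etree.ElementTree as ET

    def element_paths(xml_text):
        # Return the set of root-to-element tag paths found in one XML document.
        root = ET.fromstring(xml_text)
        paths = set()

        def walk(node, prefix):
            path = prefix + "/" + node.tag
            paths.add(path)
            for child in node:
                walk(child, path)

        walk(root, "")
        return paths

    def structural_similarity(doc1, doc2):
        # Jaccard overlap of two path sets: a simple basis for structure clustering.
        p1, p2 = element_paths(doc1), element_paths(doc2)
        return len(p1 & p2) / len(p1 | p2)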
METHODS OF XML CONTENT MINING

Before knowledge discovery in XML documents occurs, it is necessary to query XML tags and content to prepare the XML material for mining. An SQL-based query can extract data from XML documents. There are a number of query languages, some specifically designed for XML and some for semi-structured data in general. Semi-structured data can be described by the grammar of SSD (semi-structured data) expressions. The translation of XML to SSD expressions is easily automated (Abiteboul et al., 2000). Query languages for semi-structured data exploit path expressions. In this way, data can be queried to an arbitrary depth. Path expressions are elementary queries with the results returned as a set of nodes. However, the ability to return results as semi-structured data is required, which path expressions alone cannot do. Combining path expressions with SQL-style syntax provides greater flexibility in testing for equality, performing joins, and specifying the form of query results. Two such languages are Lorel (Abiteboul et al., 2000) and the Unstructured Query Language (UnQL) (Farnandez et al., 2000). UnQL requires more precision and is more reliant on path expressions.

XML-QL, XML-GL, XSL, and XQuery are designed specifically for querying XML (W3c, 2004). XML-QL (Garofalakis et al., 1999) and XQuery bring together regular path expressions, SQL-style query techniques, and XML syntax. The great benefit is the construction of the result in XML and, thus, the transformation of XML data from one schema to another. Extensible Stylesheet Language (XSL) is not implemented as a query language but is intended as a tool for transforming XML to HTML. However, XSL's select pattern is a mechanism for information retrieval and, as such, is akin to a query (W3c, 2004). XML-GL (Ceri et al., 1999) is a graphical language for querying and restructuring XML documents.
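A minimal example of a path-expression style query over XML, using the XPath subset in Python's standard ElementTree library rather than the query languages named above; the element names are invented for illustration only.

    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        "<catalog>"
        "  <vehicle><craft>boat</craft><licence>B-42</licence></vehicle>"
        "  <vehicle><craft>car</craft></vehicle>"
        "</catalog>")

    # Path expression: return all <craft> nodes at any depth.
    crafts = [e.text for e in doc.findall(".//craft")]

    # SQL-style restriction layered on top of the path expression:
    # keep only vehicles that also carry a <licence> element.
    licensed = [v.find("craft").text
                for v in doc.findall(".//vehicle")
                if v.find("licence") is not None]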


FUTURE TRENDS

There has been extensive effort to devise new technologies to process and integrate XML documents, but many open possibilities still exist. For example, integration of data mining, XML data models, and database languages will increase the functionality of relational database products, data warehouses, and XML products. Also, to satisfy the range of data mining users (from naive to expert users), future work should include mining user graphs, which are structural information of Web usage, as well as visualization of mined data. As data mining is applied to large semantic documents or XML documents, extraction of information should consider rights management of shared data. XML mining should have an authorization level to enforce security and restrict the discovery of classified information to appropriate users only.

CONCLUSION

XML has proved effective in the process of transmitting and sharing data over the Internet. Companies want to bring this advantage to analytical data as well. As XML material becomes more abundant, the ability to gain knowledge from XML sources decreases due to their heterogeneity and structural irregularity; the idea behind XML data mining looks like a solution ready to be put to work. Using XML data in the mining process has become possible through new Web-based technologies that have been developed. Simple Object Access Protocol (SOAP) is a new technology that has enabled XML to be used in data mining. For example, the vTag Web Mining Server aims at monitoring and mining of the Web with the use of information agents accessed by SOAP (vtag, 2003). Similarly, XML for Analysis defines a communication structure for an application programming interface, which aims at keeping client programming independent from the mechanics of data transport but, at the same time, providing adequate information concerning the data and ensuring that it is properly handled (XMLanalysis, 2003). Another development, YALE, is an environment for machine learning experiments that uses XML files to describe data mining experiment setups (Yale, 2004). The Data Miner's ARCADE also uses XML as the target language for all data mining tools within its environment (Arcade, 2004).

REFERENCES

Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web: From relations to semistructured data and XML. San Francisco, CA: Morgan Kaufmann.

Adelberg, B., & Denny, M. (1999). Nodose version 2.0. Proceedings of the ACM SIGMOD Conference on Management of Data, Seattle, Washington.

Arcade. (2004). http://datamining.csiro.au/arcade.html

Ceri, S. et al. (1999). XML-GL: A graphical language for querying and restructuring XML documents. Proceedings of the 8th International WWW Conference, Toronto, Canada.

Chidlovskii, B. (2002). Schema extraction from XML collections. Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, Portland, Oregon.

Dunham, M.H. (2003). Data mining: Introductory and advanced topics. Upper Saddle River, NJ: Prentice Hall.

Farnandez, M., Buneman, P., & Suciu, D. (2000). UNQL: A query language and algebra for semistructured data based on structural recursion. VLDB Journal: Very Large Data Bases, 9(1), 76-110.

Feng, L., Chang, E., & Dillon, T. (2002). A semantic network-based design methodology for XML documents. ACM Transactions on Information Systems (TOIS), 20(4), 390-421.

Garofalakis, M. et al. (1999). Data mining and the Web: Past, present and future. Proceedings of the Second International Workshop on Web Information and Data Management, Kansas City, Missouri.

Garofalakis, M.N. et al. (2000). XTRACT: A system for extracting document type descriptors from XML documents. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas.

Green, R., Bean, C.A., & Myaeng, S.H. (2002). The semantics of relationships: An interdisciplinary perspective. Boston: Kluwer Academic Publishers.

Kay, M. (2000). SAXON DTD generator: A tool to generate XML DTDs. Retrieved January 2, 2003, from http://home.iclweb.com/ic2/mhkay/dtdgen.html

Moh, C.-H., & Lim, E.-P. (2000). DTD-Miner: A tool for mining DTD from XML documents. Proceedings of the Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, California.

Nayak, R., Witt, R., & Tonev, A. (2002, June). Data mining and XML documents. Proceedings of the 2002 International Conference on Internet Computing, Nevada.

Nestorov, S. et al. (1999). Representative objects: Concise representation of semi-structured, hierarchical data. Proceedings of the IEEE Conference on Management of Data, Seattle, Washington.


Vianu, V. (2001). A Web odyssey: From Codd to XML. Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, California.

Vtag. (2003). http://www.connotate.com/csp.asp

W3c. (2004). XML query (XQuery). Retrieved March 18, 2004, from http://www.w3c.org/XML/Query

Wang, Q., Yu, X.J., & Wong, K. (2000). Approximate graph scheme extraction for semi-structured data. Proceedings of the 7th International Conference on Extending Database Technology, Konstanz.

XMLanalysis. (2003). http://www.intelligenteai.com/feature/011004/editpage.shtml

Yale. (2004). http://yale.cs.uni-dortmund.de/

KEY TERMS

Ill-Formed XML Documents: Lack any fixed and rigid structure.

Valid XML Document: To be valid, an XML document additionally must conform (at least) to an explicitly associated document schema definition.

Well-Formed XML Documents: To be well-formed, an XML document must have properly nested tags, unique attributes (per element), one or more elements, exactly one root element, and a number of schema-related constraints. Well-formed documents may have a schema, but they do not conform to it.

XML Content Analysis Mining: Concerned with analysing texts within XML documents.

XML Interstructure Mining: Concerned with the structure between XML documents. Knowledge is discovered about the relationship among subjects, organizations, and nodes on the Web.

XML Intrastructure Mining: Concerned with the structure within an XML document(s). Knowledge is discovered about the internal structure of XML documents.

XML Mining: Knowledge discovery from XML documents (heterogeneous and structurally irregular). For example, clustering data mining techniques can group a collection of XML documents together according to the similarity of their structures. Classification data mining techniques can classify a number of heterogeneous XML documents into a set of predefined classifications of schemas to improve XML document handling and achieve efficient searches.

XML Structural Clarification Mining: Concerned with distinguishing similarly structured documents based on contents.

Discovering Ranking Functions for Information Retrieval

Weiguo Fan
Virginia Polytechnic Institute and State University, USA

Praveen Pathak
University of Florida, USA

INTRODUCTION

The field of information retrieval deals with finding relevant documents from a large document collection or the World Wide Web in response to a user's query seeking relevant information. Ranking functions play a very important role in the retrieval performance of such retrieval systems and search engines. A single ranking function does not perform well across different user queries and document collections. Hence it is necessary to discover a ranking function for a particular context. Adaptive algorithms like genetic programming (GP) are well suited for such discovery.

BACKGROUND

In an information retrieval (IR) system, given a user query, a matching function matches the information in the query with that in the documents to rank the documents in decreasing order of their predicted relevance to the user. The top ranked documents are then presented to the user as a response to her query. To facilitate this relevance estimation process, both the documents and the queries need to be transformed into a form that can be easily processed by computers. The Vector Space Model (VSM) (Salton, 1989) is one of the most successful models to represent the documents and queries. We choose this model as the underlying model in our research due to the ease of interpretation of the model and its tremendous success in retrieval performance studies. Moreover, most existing search engines and IR systems are based on this model.

In VSM, both documents and queries are represented as vectors of terms. Suppose there are t terms in the collection; then a document D and a query Q are represented as:

D = (wd1, wd2, ..., wdt)
Q = (wq1, wq2, ..., wqt)

where wdi and wqi (for i = 1 to t) are weights assigned to the different terms in the document and the query, respectively. The similarity between the two vectors is calculated as the cosine of the angle between the two vectors. It is expressed as (Salton & Buckley, 1988):

Similarity(Q, D) = [ sum_{i=1..t} (wqi * wdi) ] / sqrt( [ sum_{i=1..t} (wqi)^2 ] * [ sum_{i=1..t} (wdi)^2 ] )    (1)

The score, called the retrieval status value (RSV), is calculated for each document in the collection, and the documents are ordered and presented to the user in decreasing order of RSV. Various content-based features are available in VSM to compute the term weights. The most common ones are the term frequency (tf) and the inverse document frequency (idf). Term frequency measures the number of times a term appears in the document or the query. The higher this number, the more important the term is assumed to be in describing the document. Inverse document frequency is calculated as log(N / DF), where N is the total number of documents in the collection, and DF is the number of documents in which the term appears. A high value of idf means the term appears in relatively few documents, and hence the term is assumed to be important in describing the document. Many similar content-based features are available in the literature (Salton, 1989; Salton & Buckley, 1988). The features can be combined (e.g., tf*idf) to generate a variety of new composite features that can be used in term weighting.

Equation (1) suggests that in order to discover a good ranking function, we need to discover the optimal way of assigning weights to document and query keywords. A change in term weighting strategy will essentially change the behavior of a ranking function. In this chapter we describe a method to discover an optimal ranking function.

Figure 1. A sample tree representation for a ranking function (internal nodes are operators such as +, -, *, and log; leaves are term weighting features such as tf, df, and N)

MAIN THRUST

In this chapter we present a systematic and automatic discovery process to discover ranking functions. The process is based on an artificial intelligence technique called genetic programming (GP). GP is based on genetic algorithms (GA) (Goldberg, 1989; Holland, 1992). Because of the intrinsic parallel search mechanism and powerful global exploration capability in high-dimensional space, both GA and GP have been used to solve a wide range of hard optimization problems. They are used in various optimal design and data-mining applications (Koza, 1992).

GP represents the solution to a problem as a chromosome (or an individual) in a population pool. It evolves the population of chromosomes in successive generations by following genetic transformation operations such as reproduction, crossover, and mutation to discover chromosomes with better fitness values. A fitness function assigns a fitness value to each chromosome that represents how good the chromosome is at solving the problem at hand.

We use GP for discovering ranking functions for four reasons. First, in GP there is no stringent requirement for an objective function to be continuous. All that is needed is that the objective function should be able to differentiate good solutions from bad ones. This property allows us to use common IR performance measures, like average precision (P_Avg), which are non-linear in nature, as objective functions. Second, GP is well suited to represent the common tree-based representations for the solutions. A tree-based representation allows for easier parsing and implementation. An example of a term weighting formula using a tree structure is given in Figure 1. We will use such a tree-based representation in this chapter. Third, GP is very effective for non-linear function and structure discovery problems where traditional optimization methods do not seem to work well (Banzhaf, Nordin, Keller, & Francone, 1998). Finally, it has been empirically found that GP discovers better solutions than those obtained by conventional heuristic algorithms.

Ranking function discovery as presented in this chapter is different from classification in that we seek to find a function that will be used for ranking or prioritizing documents. There are efforts in IR that treat this as a classification problem in which a classifier or discriminant function is used for ranking. But evidence shows that ranking function discovery has yielded better retrieval results than the results obtained by treating this as a classification problem using Support Vector Machines and Neural Networks (Fan, Gordon, Pathak, Wensi, & Fox, 2004; Fuhr & Pfeifer, 1994). We now proceed to describe the discovery process using GP.

Ranking Function Discovery by GP

In order to apply GP in our context we need to define several components for it. We use the tree structure shown in Figure 1 to represent a term weighting formula. Components needed for such a representation are given in Table 1.

For the purpose of our discovery framework we define these parameters as follows:

An individual in the population is expressed in terms of a tree which represents one possible ranking function. A population in a generation consists of P such trees.

Terminals: We use the features mentioned in Table 2 and real constants as the terminals.

Functions: We use +, -, *, /, and log as the allowed functions.

Table 1. Essential GP components

GP Parameters Meaning
Terminals Leaf nodes in the tree data structure
Functions Non-leaf nodes used to combine the leaf nodes. Typically
numerical operations
Fitness Function The objective function that needs to be optimized
Reproduction and Crossover Genetic operators used to copy fit solutions from one generation
to another and to introduce diversity in the population
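A small Python sketch of how a tree-structured term weighting formula such as the one in Figure 1 can be held and evaluated; the node layout follows the terminals and functions defined above, but the class itself and the example tree are illustrative assumptions, not the authors' implementation.

    import math

    class Node:
        # A GP individual: leaves hold feature names or constants, internal nodes hold operators.
        def __init__(self, value, children=()):
            self.value, self.children = value, list(children)

        def evaluate(self, features):
            if not self.children:                          # terminal: a feature or a real constant
                return features.get(self.value, self.value)
            args = [c.evaluate(features) for c in self.children]
            if self.value == "log":
                return math.log(args[0]) if args[0] > 0 else 0.0
            ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y,
                   "*": lambda x, y: x * y, "/": lambda x, y: x / y if y else 0.0}
            return ops[self.value](args[0], args[1])

    # Example: the classic tf * log(N / df) weighting expressed as a tree.
    tree = Node("*", [Node("tf"), Node("log", [Node("/", [Node("N"), Node("df")])])])
    weight = tree.evaluate({"tf": 3, "df": 20, "N": 1000})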


Fitness Function: We use the P_Avg as the fitness function, which is defined in Equation (2):

P_Avg = { sum_{i=1..|D|} [ r(di) * ( sum_{j=1..i} r(dj) ) / i ] } / TRel    (2)

where r(di), an element of {0, 1}, is the relevance score assigned to a document, it being 1 if the document is relevant and 0 otherwise. |D| is the total number of retrieved documents. TRel is the total number of relevant documents for the query. This equation incorporates both of the standard retrieval measures of precision and recall. It also takes into account the ordering of the relevant retrieved documents. For example, if there are 20 documents retrieved and only 5 of them are relevant, then the P_Avg score is higher if the relevant documents are higher up in the order (say the top 5 retrieved documents are relevant) as compared to the P_Avg score with relevant documents that are lower in the retrieval order (say the 16th to 20th documents). This property is very important in many retrieval scenarios where the user is willing to see only the top few retrieved documents. P_Avg is the most widely used measure in retrieval studies for comparing the performance of different systems. (A small computational sketch of this measure appears after this list.)

Reproduction: Reproduction copies the top (in terms of fitness) trees in the population into the next population. If P is the population size and the reproduction rate is rate_r, then the top rate_r * P trees are copied into the next generation. rate_r is set to 0.1.

Crossover: We use tournament selection to select, with replacement, 6 random trees from the population. The top two among the six trees (in terms of fitness) are selected for crossover, and they exchange sub-trees to form trees for the next generation.
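A direct transcription of Equation (2) as a minimal sketch, assuming the retrieved documents are already in ranked order; relevance is the list of 0/1 judgments and total_relevant is TRel.

    def average_precision(relevance, total_relevant):
        # relevance: list of 0/1 judgments for the retrieved documents, in ranked order (Equation 2)
        score, seen_relevant = 0.0, 0
        for i, r in enumerate(relevance, start=1):
            seen_relevant += r
            score += r * (seen_relevant / i)      # precision at each relevant position
        return score / total_relevant if total_relevant else 0.0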
The detailed process for the ranking function discovery is given in Figure 2. It is an iterative process. First, a population of random ranking functions is created. The training documents are divided into a training set and a validation set. This two-dataset training methodology has been commonly used in machine learning (Mitchell, 1997). Relevance judgments for a query for each of the documents in the training and validation sets are known. The random ranking functions are evaluated using relevance information for the training set. The performance measure used is given in Equation (2). The topmost ranking function in terms of performance is noted. The population is subjected to the genetic operations of selection, reproduction, and crossover to generate the next generation. The process is repeated for each generation. At the end of 30 generations, the thirty ranking functions are applied to the validation set, and the best performing ranking function is chosen as the discovered ranking function for the particular query.

It is to be noted that over the 30 generations the best ranking function in each generation does successively improve retrieval performance on the training data set. This improvement in retrieval performance is as expected. However, to avoid the overfitting problems that are very common in these techniques, we apply the best ranking function from each of the 30 generations to the unseen validation data set and choose the best performing ranking function on the validation data set to be applied to the test data set. Thus the final ranking function that we choose for applying to the test data set need not necessarily come from the last generation.

Experiments and Results

We have applied the discovery process described above on various TREC (Harman, 1996; Hawking, 2000; Hawking & Craswell, 2001) document collections. The TREC datasets were divided into training, validation, and test datasets. The discovered ranking function after the training and validation phase was applied to the test dataset. The performance results of our method were compared with the results obtained by applying well-known ranking functions in the literature (OKAPI, Pivoted TFIDF, and INQUERY) (Singhal, Salton, Mitra, & Buckley, 1996) to the same test dataset. These well-known functions essentially differ in their term weighting strategies, i.e., the way they combine the various weighting features shown in Table 2. We have found that retrieval performance significantly increased by using our method of discovering ranking functions. The performance improvement, in terms of P_Avg, has been anywhere from 11% to 75%, with the results being statistically significant. More details about the performance comparisons are available elsewhere (Fan, Gordon, & Pathak, 2004a, 2004b).

Table 2. Features used in the discovery process


Features Used Statistical Meaning
Tf Number of times a term appears in a document
tf_max Maximum tf in the entire document collection
tf_avg Average tf for a document
Df Number of documents in the collection the term appeared in
df_max Maximum df across all terms
N Number of documents in the entire text collection
Length Length of a document
Length_avg Average length of document in the entire collection
N Number of unique terms in the document


Figure 2. Discovery process

INPUT: Query, Training documents
OUTPUT: Discovered ranking function
Process:
  Training documents divided into a training set and a validation set
  Initial population of random ranking functions generated
  Apply the following on the training dataset for 30 generations:
    Use the individual ranking functions in the generation to rank documents in the training set
    Create a new population by applying genetic operators of reproduction and crossover on the existing population
  The top performing individual from each generation is applied to the documents in the validation set and the best performing ranking function is selected
  This is the discovered ranking function
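An abridged sketch of the process in Figure 2; the genetic operators are passed in as callables because they are not spelled out here, the population size of 200 is an arbitrary assumption, and the generation count and reproduction rate follow the values stated above.

    import random

    def discover_ranking_function(train, validation, random_tree, fitness, crossover,
                                  pop_size=200, generations=30, rate_r=0.1):
        # train / validation: document sets with relevance judgments for the query
        population = [random_tree() for _ in range(pop_size)]
        champions = []
        for _ in range(generations):
            ranked = sorted(population, key=lambda t: fitness(t, train), reverse=True)
            champions.append(ranked[0])                      # best tree of this generation
            survivors = ranked[:int(rate_r * pop_size)]      # reproduction (top 10%)
            offspring = []
            while len(survivors) + len(offspring) < pop_size:
                tournament = sorted(random.choices(ranked, k=6),
                                    key=lambda t: fitness(t, train), reverse=True)
                offspring.append(crossover(tournament[0], tournament[1]))
            population = survivors + offspring
        # choose the generation champion that performs best on the unseen validation set
        return max(champions, key=lambda t: fitness(t, validation))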
The performance results of our method were compared with the results obtained by applying well-known ranking functions from the literature (OKAPI, Pivoted TFIDF, and INQUERY) (Singhal, Salton, Mitra, & Buckley, 1996) to the same test dataset. These well-known functions differ essentially in their term-weighting strategies, that is, the way they combine the various weighting features shown in Table 2. We have found that retrieval performance increases significantly with our method of discovering ranking functions. The performance improvement, in terms of P_Avg, has ranged from 11% to 75%, with the results being statistically significant. More details about the performance comparisons are available elsewhere (Fan, Gordon, & Pathak, 2004a, 2004b).


FUTURE TRENDS

We believe the ranking function discovery outlined in this chapter is just the beginning of a promising avenue of research. The discovery process can be applied either to the routing search task (search specific to a particular query) or to the ad-hoc search task (a generalized search for a set of queries). Thus the process could be made more personalized for an individual user with persistent information requirements, or for a group of users (for example, a department in an organization) with similar, but not identical, information requirements.

Another promising avenue of research is to intelligently leverage both the structural and the statistical information available in documents, specifically HTML and XML documents. The work presented here uses only the statistical information in the documents, in terms of the frequencies of terms within and across documents. Structural information, such as where the terms appear, for example in the title, anchor, body, or abstract of the document, can also be leveraged using the discovery process described in this chapter. Such leveraging, we believe, will yield even better retrieval performance. Finally, we believe the technique presented in this chapter can be combined with data-fusion techniques to combine successful traits from various algorithms and yield better retrieval performance.


CONCLUSION

In this chapter we have presented a discovery process for new ranking functions. Ranking functions match the information in documents with that in queries to rank the documents in decreasing order of their predicted relevance to the user. Although there are well-known ranking functions in the IR literature, we believe even better ranking functions can be discovered for specific queries or sets of queries. The discovery of ranking functions was accomplished using Genetic Programming, an artificial intelligence algorithm based on evolutionary theory. The results of GP-based retrieval have been found to significantly outperform the results obtained by well-known ranking functions. We believe this line of research will be very rewarding in terms of much improved retrieval performance.


REFERENCES

Banzhaf, W., Nordin, P., Keller, R., & Francone, F. (1998). Genetic programming: An introduction - On the automatic evolution of computer programs and its applications. San Francisco, CA: Morgan Kaufmann Publishers.

Fan, W., Gordon, M., & Pathak, P. (2004a). Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering, 16(4), 523-527.

Fan, W., Gordon, M., & Pathak, P. (2004b). A generic ranking function discovery framework by genetic programming for information retrieval. Information Processing and Management, 40(4), 587-602.

Fan, W., Gordon, M. D., Pathak, P., Wensi, X., & Fox, E. (2004). Ranking function optimization for effective web search by genetic programming: An empirical study. Proceedings of the 37th Hawaii International Conference on System Sciences, Big Island, Hawaii. IEEE.

Fuhr, N., & Pfeifer, U. (1994). Probabilistic information retrieval as combination of abstraction, inductive learning and probabilistic assumptions. ACM Transactions on Information Systems, 12, 92-115.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Addison-Wesley.

Harman, D. K. (1996). Overview of the fourth text retrieval conference (TREC-4). In D. K. Harman (Ed.), Proceedings of the Fourth Text Retrieval Conference (Vol. 500-236, pp. 1-24). NIST Special Publication.

Hawking, D. (2000). Overview of the TREC-9 Web track. In E. Voorhees & D. K. Harman (Eds.), Ninth Text Retrieval Conference (Vol. 500-249, pp. 86-102). NIST Special Publication.

Hawking, D., & Craswell, N. (2001). Overview of the TREC-2001 Web track. In E. Voorhees & D. K. Harman (Eds.), Proceedings of the Tenth Text Retrieval Conference (Vol. 500-250, pp. 61-67). NIST.

Holland, J. H. (1992). Adaptation in natural and artificial systems (2nd ed.). MIT Press.

Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press.

Mitchell, T. M. (1997). Machine learning. McGraw-Hill.

Salton, G. (1989). Automatic text processing. Reading, MA: Addison-Wesley Publishing Co.

Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.

Singhal, A., Salton, G., Mitra, M., & Buckley, C. (1996). Document length normalization. Information Processing and Management, 32(5), 619-633.


KEY TERMS

Average Precision (P_Avg): A well-known measure of retrieval performance. It is the average of the precision scores calculated every time a new relevant document is found, normalized by the total number of relevant documents in the collection.

Document Cut-Off Value (DCV): The number of documents that the user is willing to see as a response to the query.

Document Frequency: The number of documents in the document collection in which the term appears.

Genetic Programming: A stochastic search algorithm based on evolutionary theory that aims to optimize a structure or functional form. A tree structure is commonly used to represent solutions.

Precision: The ratio of the number of relevant documents retrieved to the total number of documents retrieved.

Ranking Function: A function that matches the information in documents with that in the user query to assign a score to each document in the collection.

Recall: The ratio of the number of relevant documents retrieved to the total number of relevant documents in the document collection.

Term Frequency: The number of times a term appears in a document.

Vector Space Model (VSM): A common IR model in which both documents and queries are represented as vectors of terms.


Discovering Unknown Patterns in Free Text


Jan H. Kroeze
University of Pretoria, South Africa

INTRODUCTION

A very large percentage of business and academic data is stored in textual format. With the exception of metadata, such as author, date, title and publisher, these data are not overtly structured like the standard, mainly numerical, data in relational databases. Parallel to data mining, which finds new patterns and trends in numerical data, text mining is the process aimed at discovering unknown patterns in free text. Owing to the importance of competitive and scientific knowledge that can be exploited from these texts, text mining has become an increasingly popular and essential theme in data mining (Han & Kamber, 2001, p. 428).

Text mining has a relatively short history: "Unlike search engines and data mining that have a longer history and are better understood, text mining is an emerging technical area that is relatively unknown to IT professionals" (Chen, 2001, p. vi).


BACKGROUND

Definitions of text mining vary a great deal, from views that it is an advanced form of information retrieval (IR) to those that regard it as a sibling of data mining:

•   Text mining is the discovery of texts.
•   Text mining is the exploration of available texts.
•   Text mining is the extraction of information from text.
•   Text mining is the discovery of new knowledge in text.
•   Text mining is the discovery of new patterns, trends and relations in and among texts.

Han & Kamber (2001, pp. 428-435), for example, devote much of their rather short discussion of text mining to information retrieval. However, one should differentiate between text mining and information retrieval. Text mining does not consist of searching through metadata and full-text databases to find existing information. The point of view expressed by Nasukawa & Nagano (2001, p. 969), to wit that text mining is a text version of generalized data mining, is correct. Text mining should focus on finding valuable patterns and rules in text that indicate trends and significant features about specific topics (Nasukawa & Nagano, 2001, p. 967).


MAIN THRUST

Like data mining, text mining is a proactive process that automatically searches data for new relationships and anomalies to serve as a basis for making business decisions aimed at gaining competitive advantage (cf., Rob & Coronel, 2004, p. 597). Although data mining can require some interaction between the investigator and the data-mining tool, it can be considered an automatic process, because data-mining tools automatically search the data for anomalies and possible relationships, thereby identifying problems that have not yet been identified by the end user, while mere data analysis relies on the end users to define the problem, select the data, and initiate the appropriate data analyses to generate the information that helps model and solve the problems those end users uncover (Rob & Coronel, 2004, p. 597). The same distinction is valid for text mining. Therefore, text-mining tools should also initiate analyses to create knowledge (Rob & Coronel, 2004, p. 598).

In practice, however, the borders between data analysis, information retrieval and text mining are not always quite so clear. Montes-y-Gómez et al. (2004) proposed an integrated approach, called contextual exploration, which combines robust access (IR), non-sequential navigation (hypertext) and content analysis (text mining).

According to Smallheiser (2001, pp. 690-691), text mining approaches can be divided into two main types: macro analyses perform data-crunching operations over a large, often global set of papers encompassing one or more fields, in order to identify large-scale trends or to classify and organize the literature. In contrast, micro analyses pose a sharply focused question, in which one searches for complementary information that links two small, pre-specified fields of inquiry.

The Need for Text Mining


Text mining can be used as an effective business intelligence tool for gaining competitive advantage through the discovery of critical, yet hidden, business information. In the field of academic research, text mining can be used to scan large numbers of publications in order to select the most relevant literature and to propose new links between independent research results. Text mining is also needed to formulate and assess hypotheses arising in biomedical research and for helping make policy decisions regarding technical innovation (Smallheiser, 2001, p. 690). Another application of text mining is in medical science, to discover gene interactions, functions and relations, to build and structure medical knowledge bases, and to find undiscovered relations between diseases and medications (De Bruijn & Martin, 2002, p. 8).

Types of Text Mining

Keyword-Based Association Analysis

Association analysis looks for correlations between texts based on the occurrence of related keywords or phrases. Texts with similar terms are grouped together. The pre-processing of the texts is very important and includes parsing and stemming, and the removal of words with minimal semantic content. Another issue is the problem of compounds and non-compounds: should the analysis be based on singular words, or should word groups be accounted for? (cf., Han & Kamber, 2001, p. 433). Kostoff et al. (2002), for example, have measured the frequencies and proximities of phrases regarding electrochemical power to discover central themes and relationships among them. This knowledge discovery, combined with the interpretation of human experts, can be regarded as an example of knowledge creation through intelligent text mining.
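A rough sketch of the idea behind keyword-based association analysis: after crude pre-processing (a stopword list and truncation-based "stemming", both deliberately simplified here), pairs of keywords that co-occur in more than one text are counted as candidate associations. The texts and word lists below are invented.

from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "of", "and", "in", "for", "a", "to"}

def keywords(text):
    # Very crude pre-processing: lowercase, drop stopwords, "stem" by truncation.
    return {w.lower()[:6] for w in text.split() if w.lower() not in STOPWORDS}

texts = [
    "Fuel cells for electrochemical power generation",
    "Electrochemical storage and power electronics",
    "Mining patent documents for power electronics trends",
]

pair_counts = Counter()
for text in texts:
    for pair in combinations(sorted(keywords(text)), 2):
        pair_counts[pair] += 1

# Keyword pairs that co-occur in more than one text hint at an association.
print([pair for pair, n in pair_counts.items() if n > 1])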
Automatic Document Classification

Electronic documents are classified according to a pre-defined scheme or training set. The user compiles and refines the classification parameters, which are then used by a computer program to categorise the texts in the given collection automatically (cf., Sullivan, 2001, p. 198). Classification can also be based on the analysis of collocation [the juxtaposition or association of a particular word with another particular word or words (The Oxford Dictionary, 1995)]. Words that often appear together probably belong to the same class (Lopes et al., 2004). According to Perrin & Petry (2003), useful text structure and content can be systematically extracted by collocational lexical analysis with statistical methods. Text classification can be used by businesses, for example, to categorise customers' e-mails automatically and suggest the appropriate reply templates (Weng & Liu, 2004).

Similarity Detection

Texts are grouped according to their own content into categories that were not previously known. The documents are analysed by a clustering computer program, often a neural network, but the clusters still have to be interpreted by a human expert (Hearst, 1999). Document pre-processing (tagging of parts of speech, lemmatisation, filtering and structuring) precedes the actual clustering phase (Iiritano et al., 2004). The clustering program finds similarities between documents, for example, a common author, the same themes, or information from common sources. The program does not need a training set or taxonomy, but generates it dynamically (cf., Sullivan, 2001, p. 201). One example of the use of text clustering is found in the work of Fattori et al. (2003), whose text-mining tool processes patent documents into dynamic clusters to discover patenting trends, which constitutes information that can be used as competitive intelligence.
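A minimal sketch of similarity detection by clustering, assuming the scikit-learn library is available: the documents are represented as TF-IDF vectors and grouped into clusters that were not defined in advance; a human expert would still have to interpret what each cluster means. The sample documents and the number of clusters are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "battery electrode patent for electric vehicles",
    "cathode materials patent improving battery capacity",
    "search engine ranking of web documents",
    "web crawling and indexing of documents",
]

# Represent each document by TF-IDF weights over its terms.
vectors = TfidfVectorizer().fit_transform(documents)

# Group the documents into two clusters discovered from the content alone.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for doc, label in zip(documents, labels):
    print(label, doc)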
Link Analysis

Link analysis is the process of building up networks of interconnected objects through relationships in order to expose patterns and trends (Westphal & Blaxton, 1998, p. 202). In text databases, link analysis is the finding of meaningful, high-level correlations between text entities. The user can, for example, suggest a broad hypothesis and then analyse the data in order to prove or disprove this hunch. It can also be an automatic or semi-automatic process, in which a surprisingly high number of links between two or more nodes may indicate relations that have hitherto been unknown. Link analysis can also refer to the use of algorithms to build and exploit networks of hyperlinks in order to find relevant and related documents on the Web (Davison, 2003). Yoon & Park (2004) use link analysis to construct a visual network of patents, which facilitates the identification of a patent's relative importance: "The coverage of the application is wide, ranging from new idea generation to ex post facto auditing" (p. 49). Text mining is also used to identify experts by finding and evaluating links between persons and areas of expertise (Ibrahim, 2004).


Sequence Analysis

A sequential pattern is the arrangement of a number of elements, in which the one leads to the other over time (Wong et al., 2000). Sequence analysis is the discovery of patterns that are related to time frames, for example, the origin and development of a news thread (cf., Montes-y-Gómez et al., 2001), or the tracking of a developing trend in politics or business. It can also be used to predict recurring events.

Anomaly Detection

Anomaly detection is the finding of information that violates the usual patterns, for example, a book that refers to a unique source, or a document lacking typical information. An example of anomaly detection is the detection of irregularities in news reports or of different topic profiles in newspapers (Montes-y-Gómez et al., 2001).

Hypertext Analysis

Text mining is about looking for patterns in natural language text. Web mining is the slightly more general case of looking for patterns in hypertext and often applies graph-theoretical approaches to detect and utilise the structure of web sites (New Zealand Digital Library, 2002). Marked-up language, especially XML tags, facilitates text mining because the tags can often be used to simulate database attributes and to convert data-centric documents into databases, which can then be exploited (Tseng & Hwung, 2002). Mark-up tags also make it possible to create "artificial structures [that] help us understand the relationship between documents and document components" (Sullivan, 2001, p. 51).


CRITICAL ISSUES

Many sources on text mining refer to text as unstructured data. However, it is a fallacy that text data are unstructured. Text is actually highly structured in terms of morphology, syntax, semantics and pragmatics. On the other hand, it must be admitted that these structures are not directly visible: text "represents factual information in a complex, rich, and opaque manner" (Nasukawa & Nagano, 2001, p. 967).

Authors also differ on the issue of natural language processing within text mining. Some prefer a more statistical approach (cf., Hearst, 1999), while others feel that linguistic parsing is an essential part of text mining. Sullivan (2001, p. 37) regards the representation of meaning by means of syntactic-semantic representations as essential for text mining: "Text processing techniques, based on morphology, syntax, and semantics, are powerful mechanisms for extracting business intelligence information from documents. We can scan text for meaningful phrase patterns and extract key features and relationships." According to De Bruijn & Martin (2002, p. 16), "[l]arge-scale statistical methods will continue to challenge the position of the more syntax-semantics oriented approaches", although both will hold their own place.

In the light of the various definitions of text mining, it should come as no surprise that authors also differ on what qualifies as text mining and what does not. Building on Hearst (1999), Kroeze, Matthee, & Bothma (2003) use the parameters of novelty and data type to distinguish between information retrieval, standard text mining and intelligent text mining (see Figure 1).

Halliman (2001, p. 7) also hints at a scale of newness of information: "Some text mining discussions stress the importance of discovering new knowledge. And the new knowledge is expected to be new to everybody. From a practical point of view, we believe that business text should be mined for information that is new enough to give a company a competitive edge once the information is analyzed."

Another issue is the question of when text mining can be regarded as intelligent. Intelligent behavior is "the ability to learn from experience and apply knowledge acquired from experience, handle complex situations, solve problems when important information is missing, determine what is important, react quickly and correctly to a new situation, understand visual images, process and manipulate symbols, be creative and imaginative, and use heuristics" (Stair & Reynolds, 2001, p. 421).

Figure 1. A differentiation between information retrieval, standard and intelligent metadata mining, and standard and intelligent text mining (abbreviated from Kroeze, Matthee, & Bothma, 2003)

Novelty level:        Non-novel investigation -    Semi-novel investigation -   Novel investigation -
Data type:            information retrieval        knowledge discovery          knowledge creation
Metadata (overtly     Information retrieval        Standard metadata mining     Intelligent metadata mining
structured)           of metadata
Free text (covertly   Information retrieval        Standard text mining         Intelligent text mining
structured)           of full texts


Intelligent text mining should therefore refer to the interpretation and evaluation of discovered patterns.


FUTURE TRENDS

Mack and Hehenberger (2002, p. S97) regard the automation of "human-like capabilities for comprehending complicated knowledge structures" as one of the frontiers of text-based knowledge discovery. Incorporating more artificial intelligence abilities into text-mining tools will facilitate the transition from mainly statistical procedures to more intelligent forms of text mining.


CONCLUSION

Text mining can be regarded as the next frontier in the science of knowledge discovery and creation, enabling businesses to acquire sought-after competitive intelligence, and helping scientists of all academic disciplines to formulate and test new hypotheses. The greatest challenges will be to select the most appropriate technology for specific problems and to popularise these new technologies so that they become instruments that are generally known, accepted and widely used.


REFERENCES

Chen, H. (2001). Knowledge management systems: A text mining perspective. Tucson, AZ: University of Arizona (Knowledge Computing Corporation).

Davison, B.D. (2003). Unifying text and link analysis. In Text-Mining & Link-Analysis Workshop of the 18th International Joint Conference on Artificial Intelligence. Retrieved from http://www-2.cs.cmu.edu/~dunja/TextLink2003/

De Bruijn, B., & Martin, J. (2002). Getting to the (c)ore of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67(1-3), 7-18.

Fattori, M., Pedrazzi, G., & Turra, R. (2003). Text mining applied to patent mapping: A practical business case. World Patent Information, 25(4), 335-342.

Halliman, C. (2001). Business intelligence using smart techniques: Environmental scanning using text mining and competitor analysis using scenarios and manual simulation. Houston, TX: Information Uncover.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.

Hearst, M.A. (1999). Untangling text data mining. In Proceedings of ACL'99: The 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26 (invited paper). Retrieved from http://www.ai.mit.edu/people/jimmylin/papers/Hearst99a.pdf

Ibrahim, A. (2004). Expertise location: Can text mining help? In N.F.F. Ebecken, C.A. Brebbia, & A. Zanas (Eds.), Data mining IV (pp. 109-118). Southampton: WIT Press.

Iiritano, S., Ruffolo, M., & Rullo, P. (2004). Preprocessing method and similarity measures in clustering-based text mining: A preliminary study. In N.F.F. Ebecken, C.A. Brebbia, & A. Zanas (Eds.), Data mining IV (pp. 73-79). Southampton: WIT Press.

Kostoff, R.N., Tshiteya, R., Pfeil, K.M., & Humenik, J.A. (2002). Electrochemical power text mining using bibliometrics and database tomography. Journal of Power Sources, 110(1), 163-176.

Kroeze, J.H., Matthee, M.C., & Bothma, T.J.D. (2003). Differentiating data- and text-mining terminology. In IT Research in Developing Countries - Proceedings of SAICSIT 2003 (Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists) (pp. 93-101), September 17-19, Pretoria. SAICSIT.

Lopes, M.C.S., Terra, G.S., Ebecken, N.F.F., & Cunha, G.G. (2004). Mining text databases on clients' opinion for oil industry. In N.F.F. Ebecken, C.A. Brebbia, & A. Zanas (Eds.), Data mining IV (pp. 139-147). Southampton: WIT Press.

Mack, R., & Hehenberger, M. (2002). Text-based knowledge discovery: Search and mining of life-science documents. Drug Discovery Today, 7(11) (Suppl.), S89-S98.

Montes-y-Gómez, M., Gelbukh, A., & López-López, A. (2001). Mining the news: Trends, associations, and deviations. Computación y Sistemas, 5(1). Retrieved from http://ccc.inaoep.mx/~mmontesg/publicaciones/2001/NewsMining-CyS01.pdf

Montes-y-Gómez, M., Pérez-Coutiño, M., Villaseñor-Pineda, L., & López-López, A. (2004). Contextual exploration of text collections. Lecture Notes in Computer Science (Vol. 2945). Berlin: Springer-Verlag. Retrieved from http://ccc.inaoep.mx/~mmontesg/publicaciones/2004/ContextualExploration-CICLing04.pdf

Nasukawa, T., & Nagano, T. (2001). Text analysis and knowledge mining system. IBM Systems Journal, 40(4), 967-984.


New Zealand Digital Library, University of Waikato. (2002). Text mining. Retrieved from http://www.cs.waikato.ac.nz/~nzdl/textmining/

Perrin, P., & Petry, F.E. (2003). Extraction and representation of contextual information for knowledge discovery in texts. Information Sciences, 151, 125-152.

Rob, P., & Coronel, C. (2004). Database systems: Design, implementation, and management (6th ed.). Boston, MA: Course Technology.

Smallheiser, N.R. (2001). Predicting emerging technologies with the aid of text-based data mining: The micro approach. Technovation, 21(10), 689-693.

Stair, R.M., & Reynolds, G.W. (2001). Principles of information systems: A managerial approach (5th ed.). Boston, MA: Course Technology.

Sullivan, D. (2001). Document warehousing and text mining: Techniques for improving business operations, marketing, and sales. New York, NY: John Wiley.

Tseng, F.S.C., & Hwung, W.J. (2002). An automatic load/extract scheme for XML documents through object-relational repositories. Journal of Systems and Software, 64(3), 207-218.

Weng, S.S., & Liu, C.K. (2004). Using text classification and multiple concepts to answer e-mails. Expert Systems with Applications, 26(4), 529-543.

Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York, NY: John Wiley.

Wong, P.K., Cowley, W., Foote, H., Jurrus, E., & Thomas, J. (2000). Visualizing sequential patterns for text mining. In Proceedings of the IEEE Symposium on Information Visualization 2000 (p. 105). Retrieved from http://portal.acm.org/citation.cfm

Yoon, B., & Park, Y. (2004). A text-mining-based patent network: Analytical tool for high-technology trend. The Journal of High Technology Management Research, 15(1), 37-50.


KEY TERMS

Business Intelligence: Any information that reveals threats and opportunities that can motivate a company to take some action (Halliman, 2001, p. 3).

Competitive Advantage: The head start a business has owing to its access to new or unique information and knowledge about the market in which it is operating.

Hypertext: A collection of texts containing links to each other to form an interconnected network (Sullivan, 2001, p. 46).

Information Retrieval: The searching of a text collection based on a user's request to find a list of documents organised according to their relevance, as judged by the retrieval engine (Montes-y-Gómez et al., 2004). Information retrieval should be distinguished from text mining.

Knowledge Creation: The evaluation and interpretation of patterns, trends or anomalies that have been discovered in a collection of texts (or data in general), as well as the formulation of their implications and consequences, including suggestions concerning reactive business decisions.

Knowledge Discovery: The discovery of patterns, trends or anomalies that already exist in a collection of texts (or data in general), but have not yet been identified or described.

Mark-Up Language: Tags that are inserted in free text to mark structure, formatting and content. XML tags can be used to mark attributes in free text and to transform free text into an exploitable database (cf., Tseng & Hwung, 2002).

Metadata: Information regarding texts, for example, author, title, publisher, date and place of publication, journal or series, volume, page numbers, key words, etc.

Natural Language Processing (NLP): The automatic analysis and/or processing of human language by computer software, focussed on understanding the contents of human communications. It can be used to identify relevant data in large collections of free text for a data mining process (Westphal & Blaxton, 1998, p. 116).

Parsing: An NLP process that analyses linguistic structures and breaks them down into parts, on the morphological, syntactic or semantic level.

Stemming: Finding the root form of related words, for example singular and plural nouns, or present and past tense verbs, to be used as key terms for calculating occurrences in texts.

Text Mining: The automatic analysis of a large text collection in order to identify previously unknown patterns, trends or anomalies, which can be used to derive business intelligence for competitive advantage or to formulate and test scientific hypotheses.


Discovery Informatics
William W. Agresti
Johns Hopkins University, USA

INTRODUCTION

Discovery informatics is an emerging methodology that brings together several threads of research and practice aimed at making sense out of massive data sources. It is defined as the study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data (Agresti, 2003).


BACKGROUND

In this broad-based conceptualization, discovery informatics may be seen as taking shape by drawing on more of the following established disciplines:

•   Database Management: Data models, data analysis, data structures, data management, federation of databases, data warehouses, database management systems.
•   Pattern Recognition: Statistical processes, classifier design, image data analysis, similarity measures, feature extraction, fuzzy sets, clustering algorithms.
•   Information Storage and Retrieval: Indexing, content analysis, abstracting, summarization, electronic content management, search algorithms, query formulation, information filtering, relevance and recall, storage networks, storage technology.
•   Knowledge Management: Knowledge sharing, knowledge bases, tacit and explicit knowledge, relationship management, content structuring, knowledge portals, collaboration support systems.
•   Artificial Intelligence: Learning, concept formation, neural nets, knowledge acquisition, intelligent systems, inference systems, Bayesian methods, decision support systems, problem solving, intelligent agents, text analysis, natural language processing.

What distinguishes discovery informatics is that it brings coherence across dimensions of technologies and domains to focus on discovery. It recognizes and builds upon excellent programs of research and practice in individual disciplines and application areas. It looks selectively across these boundaries to find anything (e.g., ideas, tools, strategies, and heuristics) that will help with the critical task of discovering new information.

To help characterize discovery informatics, it may be useful to see if there are any roughly analogous developments elsewhere. Two examples, knowledge management and core competence, may be instructive as reference points.

Knowledge management (KM), which began its evolution in the early 1990s, is the practice of transforming the intellectual assets of an organization into business value (Agresti, 2000). Of course, before 1990, organizations, to varying degrees, knew that the successful delivery of products and services depended on the collective knowledge of employees. However, KM challenged organizations to focus on knowledge and recognize its key role in their success. They found value in addressing questions such as the following:

•   What is the critical knowledge that should be managed?
•   Where is the critical knowledge?
•   How does knowledge get into products and services?

When C. K. Prahalad and Gary Hamel published their highly influential paper, "The Core Competence of the Corporation" (Prahalad & Hamel, 1990), companies had some capacity to identify what they were good at. However, as with KM, most organizations did not appreciate how identifying and cultivating core competences (CC) may make the difference between being competitive or not. A core competence is not the same as what you are good at or being more vertically integrated. It takes dedication, skill, and leadership to effectively identify, cultivate, and deploy core competences for organizational success.

Both KM and CC illustrate the potential value of taking on a specific perspective. By doing so, an organization will embark on a worthwhile reexamination of familiar topics: its customers, markets, knowledge sources, competitive environment, operations, and success criteria. The claim of this article is that discovery informatics represents a distinct perspective, one that is potentially highly beneficial, because, like KM and CC, it strikes at what is often an essential element for success and progress: discovery.

MAIN THRUST

Both the technology and application dimensions will be explored to help clarify the meaning of discovery informatics.

Discovery Across Technologies

The technology dimension is considered broadly to include automated hardware and software systems, theories, algorithms, architectures, techniques, methods, and practices. Included here are familiar elements associated with data mining and knowledge discovery, such as clustering, link analysis, rule induction, machine learning, neural networks, evolutionary computation, genetic algorithms, and instance-based learning (Wang, 2003). However, the discovery informatics viewpoint goes further, to activities and advances that are associated with other areas but should be seen as having a role in discovery. Some of these activities, like searching or knowledge sharing, are well known from everyday experiences.

Conducting searches on the Internet is a common practice that needs to be recognized as part of a thread of information retrieval. Because it is practiced essentially by all Internet users and involves keyword search, there is a tendency to minimize its importance. Search technology is extremely sophisticated (Baeza-Yates & Ribiero-Neto, 1999). People always have some starting point for their searches. Often, it is not a keyword but a concept, so people are forced to perform the transformation from a notional concept of what is desired to a list of one or more keywords. The net effect can be the familiar many-thousand hits from the search engine. Even though the responses are ranked for relevance (a rich and research-worthy subject itself), people may still find that the returned items do not match their intended concepts.

Offering hope for improved search are advances in concept-based search (Houston & Chen, 2004), which is more intuitive to a person's sense of "find me content like this", where "this" can be a concept embodied in an entire document or series of documents. For example, a person may be interested in learning which parts of a new process guideline are being used in practice in the pharmaceutical industry. Trying to obtain that information through keyword searches typically would involve trial and error on various combinations of keywords. What the person would like to do is to point a search tool to an entire folder of multimedia electronic content and ask the tool to effectively integrate over the folder contents and then discover new items that are similar. Current technology can support this ability to associate a fingerprint with a document (Heintze, 2004) in order to characterize its meaning, thereby enabling concept-based searching. Discovery informatics recognizes that advances in search and retrieval enhance the discovery process.

This same semantic analysis can be exploited in other settings, such as within organizations. It is possible now to have your e-mail system prompt you based on the content of messages you compose. When you click send, the e-mail system may open a dialogue box (e.g., "Do you also want to send that to Mary?"). The system has analyzed the content of your message, determining that, for messages in the past having similar content, you also have sent them to Mary. So the system is now asking you if you have perhaps forgotten to include her. While this feature can certainly be intrusive and bothersome unless it is wanted, the point is that the same semantic analysis advances are at work here as with the Internet search example.
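The e-mail scenario reduces to comparing a draft message with earlier messages and borrowing the recipients of the most similar ones. The following sketch illustrates only that core idea with a bag-of-words cosine similarity; the messages, names, and threshold are invented, and a real system of the kind described would rely on much richer semantic analysis.

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

past_messages = [
    ("quarterly budget figures for the finance review", {"mary", "raj"}),
    ("budget forecast and finance projections", {"mary"}),
    ("team lunch on friday", {"lee"}),
]

def suggest_recipients(draft, chosen, threshold=0.2):
    # Suggest people who received similar past messages but are missing from the draft.
    draft_vec = Counter(draft.lower().split())
    suggestions = set()
    for text, recipients in past_messages:
        if cosine(draft_vec, Counter(text.lower().split())) >= threshold:
            suggestions |= recipients - chosen
    return suggestions

print(suggest_recipients("updated budget figures for finance", {"raj"}))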
The "informatics" part of discovery informatics also conveys the breadth of science and technology needed to support discovery. There are commercially available computer systems and special-purpose software dedicated to knowledge discovery (see listings at http://www.kdnuggets.com/). The informatics support includes comprehensive hardware-software discovery platforms as well as advances in algorithms and data structures, which are core subjects of computer science. The latest developments in data sharing, application integration, and human-computer interfaces are used extensively in the automated support of discovery. Particularly valuable, because of the voluminous data and complex relationships, are advances in visualization (Marakas, 2003). Commercial visualization packages are used widely to display patterns and to enable expert interaction with, and manipulation of, the visualized relationships.

Discovery Across Domains

Discovery informatics encourages a view that spans application domains. Over the past decade, the term has been associated most often with drug discovery in the pharmaceutical industry, mining biological data. The financial industry also was known for employing talented programmers to write highly sophisticated mathematical algorithms for analyzing stock trading data, seeking to discover patterns that could be exploited for financial gain. Retailers were prominent in developing large data warehouses that enabled mining across inventory, transaction, supplier, marketing, and demographic databases.


The situation (marked by drug discovery informatics, financial discovery informatics, etc.) was evolving into one in which discovery informatics was preceded by more and more words as it was being used in an increasing number of domain areas. One way to see the emergence of discovery informatics is to strip away the domain modifiers and recognize the universality of every application area and organization wanting to take advantage of its data.

Discovery informatics techniques are very influential across professions and domains:

•   In the financial community, the FinCEN Artificial Intelligence System (FAIS) uses discovery informatics methods to help identify money laundering and other crimes. Rule-based inference is used to detect networks of relationships among people and bank accounts so that human analysts can focus their attention on potentially suspicious behavior (Senator, Goldberg & Wooton, 1995).
•   In the education profession, there are exciting scenarios that suggest the potential for discovery informatics techniques. In one example, a knowledge manager works with a special education teacher to evaluate student progress. The data warehouse holds extensive data on students, their performance, teachers, evaluations, and course content. The educators are pleased that the aggregate performance data for the school is satisfactory. However, with discovery informatics capabilities, the story does not need to end there. The data warehouse permits finer-granularity examination, and the discovery informatics tools are able to find patterns that are hidden by the summaries. The tools reveal that examinations involving word problems show much poorer performance for certain students. Further analysis shows that this outcome was also true in previous courses for these students. Now the educators have the specific data enabling them to take action in the form of targeted tutorials for specific students to address the problem (Tsantis & Castellani, 2001).
•   Bioinformatics requires the intensive use of information technology and computer science to address a wide array of challenges (Watkins, 2001). One example illustrates a multi-step computational method to predict gene models, an important activity that has been addressed to date by combinations of gene prediction programs and human experts. The new method is entirely automated, involving optimization algorithms and the careful integration of the results of several gene prediction programs, including evidence from annotation software. Statistical scoring is implemented with decision trees to show the gene prediction results clearly (Allen, Pertea & Salzberg, 2004).

Common Elements of Discovery Informatics

The only constant in discovery informatics is data and an interacting entity with an interest in discovering new information from it. What varies, and has an enormous effect on the ease of discovering new information, is everything else, notably the following:

Data

•   Volume: How much?
•   Accessibility: Ease of access and analysis?
•   Quality: How clean and complete is it? Can it be trusted as accurate?
•   Uniformity: How homogeneous are the data? Are they in multiple forms, structures, formats, and locations?
•   Medium: Text, numbers, audio, video, image, electronic, or magnetic signals or emanations?
•   Structure of the Data: Formatted rigidly, partially, or not at all? If text, does it adhere to a known language?

Interacting Entity

•   Nature: Is it a person or an intelligent agent?
•   Need, Question, or Motivation of the User: What is prompting a person to examine these data? How sharply defined is the question or need? What expectations exist about what might be found? If the motivation is to find something "interesting", what does that mean in context?

A wide variety of activities may be seen as variations on this scheme. An individual using a search engine to search the Internet for a home mortgage serves as an instance of query-driven discovery. A retail company looking for interesting patterns in its transaction data is engaged in data-driven discovery. An intelligent agent examining newly posted content to the Internet based on a person's profile is also engaged in a process of discovery, only the interacting entity is not a person.


FUTURE TRENDS


There are many indications that discovery informatics will grow in relevance. Storage costs are dropping; for example, the cost per gigabyte of magnetic disk storage declined by a factor of 500 from 1990 to 2000 (University of California at Berkeley, 2003). Our stockpiles of data are expanding rapidly in every field of endeavor. Businesses at one time were comfortable with operational data summarized over days or even weeks. Increasing automation led to point-of-decision data on transactions. With online purchasing, it is now possible to know the sequence of clickstreams leading up to the sale. So the granularity of the data is becoming finer, as businesses are learning more about their customers and about ways to become more profitable. This business analytics is essential for organizations to be competitive.

A similar process of finer-granular data exists in bioinformatics. In the human body, there are 23 pairs of human chromosomes, approximately 30,000 genes, and more than 1,000,000 proteins (Watkins, 2001). The advances in decoding the human genome are remarkable, but it is proteins that ultimately "regulate metabolism and disease in the body" (Watkins, 2001, p. 27). So, the challenges for bioinformatics continue to grow along with the data.


CONCLUSION

Discovery informatics is an emerging methodology that promotes a crosscutting and integrative view. It looks across both technologies and application domains to identify and organize the techniques, tools, and models that improve data-driven discovery.

There are significant research questions as this methodology evolves. Continuing progress will be received eagerly from efforts in individual strategies for knowledge discovery and machine learning, such as the excellent contributions in Koza et al. (2003). An additional opportunity is to pursue the recognition of unifying aspects of practices now associated with diverse disciplines. While the anticipation of new discoveries is exciting, the evolving practical application of discovery methods needs to respect individual privacy and a diverse collection of laws and regulations. Balancing these requirements constitutes a significant and persistent challenge as new concerns emerge and as laws are drafted.

Looking ahead to the challenges and opportunities of the 21st century, discovery informatics is poised to help people and organizations learn as much as possible from the world's abundant and ever-growing data assets.


REFERENCES

Agresti, W.W. (2000). Knowledge management. Advances in Computers, 53, 171-283.

Agresti, W.W. (2003). Discovery informatics. Communications of the ACM, 46(8), 25-28.

Allen, J.E., Pertea, M., & Salzberg, S.L. (2004). Computational gene prediction using multiple sources of evidence. Genome Research, 14, 142-148.

Baeza-Yates, R., & Ribiero-Neto, B. (1999). Modern information retrieval. Reading, MA: Addison-Wesley.

Bergeron, B. (2003). Bioinformatics computing. Upper Saddle River, NJ: Prentice Hall.

Heintze, N. (2004). Scalable document fingerprinting. Carnegie Mellon University. Retrieved from http://www2.cs.cmu.edu/afs/cs/user/nch/www/koala/main.html

Houston, A., & Chen, H. (2004). A path to concept-based information access: From national collaboratories to digital libraries. University of Arizona. Retrieved from http://ai.bpa.arizona.edu/go/intranet/papers/Book7.pdf

Koza, J.R. et al. (Eds.) (2003). Genetic programming IV: Routine human-competitive machine intelligence. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Marakas, G.M. (2003). Modern data warehousing, mining, and visualization. Upper Saddle River, NJ: Prentice Hall.

Prahalad, C.K., & Hamel, G. (1990). The core competence of the corporation. Harvard Business Review, 3, 79-91.

Senator, T.E., Goldberg, H.G., & Wooton, J. (1995). The financial crimes enforcement network AI system (FAIS): Identifying potential money laundering from reports of large cash transactions. AI Magazine, 16, 21-39.

Tsantis, L., & Castellani, J. (2001). Enhancing learning environments through solution-based knowledge discovery tools: Forecasting for self-perpetuating systemic reform. Journal of Special Education Technology, 16, 39-52.

University of California at Berkeley. (2003). How much information? School of Information Management and Systems. Retrieved from http://www.sims.berkeley.edu/research/projects/how-much-info/how-much-info.pdf

Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

Watkins, K.J. (2001). Bioinformatics. Chemical & Engineering News, 79, 26-45.


KEY TERMS

Clickstream: The sequence of mouse clicks executed by an individual during an online Internet session.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships such as classification, prediction, estimation, or affinity grouping.

Discovery Informatics: The study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data.

Evolutionary Computation: A solution approach guided by biological evolution, which begins with potential solution models and then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data.

Knowledge Management: The practice of transforming the intellectual assets of an organization into business value.

Neural Networks: Learning systems, designed by analogy with a simplified model of the neural connections in the brain, which can be trained to find nonlinear relationships in data.

Rule Induction: The process of learning from cases or instances the if-then rule relationships consisting of an antecedent (i.e., the if-part, defining the preconditions or coverage of the rule) and a consequent (i.e., the then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).


Discretization for Data Mining


Ying Yang
Monash University, Australia

Geoffrey I. Webb
Monash University, Australia

INTRODUCTION

Discretization is a process that transforms quantitative data into qualitative data. Quantitative data are commonly involved in data mining applications. However, many learning algorithms are designed primarily to handle qualitative data. Even for algorithms that can directly deal with quantitative data, learning is often less efficient and less effective. Hence research on discretization has long been active in data mining.


BACKGROUND

Many data mining systems work best with qualitative data, where the data values are discrete descriptive terms such as young and old. However, lots of data are quantitative, for example, with age being represented by a numeric value rather than by a small number of descriptors. One way to apply existing qualitative systems to such quantitative data is to transform the data.

Discretization is a process that transforms data containing a quantitative attribute so that the attribute in question is replaced by a qualitative attribute. A many-to-one mapping function is created so that each value of the original quantitative attribute is mapped onto a value of the new qualitative attribute. First, discretization divides the value range of the quantitative attribute into a finite number of intervals. The mapping function then associates all of the quantitative values in a single interval with a single qualitative value. A cut point is a value of the quantitative attribute at which a mapping function locates an interval boundary. For example, a quantitative attribute recording age might be mapped onto a new qualitative age attribute with three values: pre-teen, teen, and post-teen. The cut points for such a discretization may be 13 and 18. Values of the original quantitative age attribute that are below 13 might get mapped onto the pre-teen value of the new attribute, values from 13 to 18 onto teen, and values above 18 onto post-teen.
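A minimal sketch of the age example above: a many-to-one mapping that sends each quantitative age onto one of the three qualitative values using the cut points 13 and 18. The cut points and labels come from the example; the code itself is only an illustration.

def discretize_age(age):
    # Many-to-one mapping from a quantitative age to a qualitative value.
    if age < 13:
        return "pre-teen"
    elif age <= 18:
        return "teen"
    return "post-teen"

print([discretize_age(a) for a in [7, 13, 16, 18, 30]])
# ['pre-teen', 'teen', 'teen', 'teen', 'post-teen']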
Various discretization methods have been proposed. Diverse taxonomies exist in the literature to categorize discretization methods. These taxonomies are complementary, each relating to a different dimension along which discretization methods may differ. Typically, discretization methods can be either primary or composite. Primary methods accomplish discretization without reference to any other discretization method. Composite methods are built on top of some primary method(s). Primary methods can be classified according to the following taxonomies.

•   Supervised vs. Unsupervised (Dougherty, Kohavi, & Sahami, 1995): Supervised methods are only applicable when mining data that are divided into classes. These methods refer to the class information when selecting discretization cut points. Unsupervised methods do not use the class information. For example, when trying to predict whether a customer will be profitable, the data might be divided into two classes, profitable and unprofitable. A supervised discretization technique would take account of how useful the selected cut point was for identifying whether a customer was profitable. An unsupervised technique would not. Supervised methods can be further characterized as error-based, entropy-based or statistics-based. Error-based methods apply a learner to the transformed data and select the intervals that minimize error on the training data. In contrast, entropy-based and statistics-based methods assess respectively the class entropy or some other statistic regarding the relationship between the intervals and the class.
•   Parametric vs. Non-Parametric: Parametric discretization requires the user to specify parameters for each discretization performed. An example of such a parameter is the maximum number of intervals to be formed. Non-parametric discretization does not utilize user-specified parameters.


•   Hierarchical vs. Non-Hierarchical: Hierarchical discretization utilizes an incremental process to select cut points. This creates an implicit hierarchy over the value range. Hierarchical discretization can be further characterized as either split or merge (Kerber, 1992). Split discretization starts with a single interval that encompasses the entire value range, then repeatedly splits it into sub-intervals until some stopping criterion is satisfied. Merge discretization starts with each value in a separate interval, then repeatedly merges adjacent intervals until a stopping criterion is met. It is possible to combine both split and merge techniques. For example, initial intervals may be formed by splitting, and a merge process is then applied to post-process these initial intervals. Non-hierarchical discretization creates intervals without forming a hierarchy; for example, many methods form the intervals sequentially in a single scan through the data.
•   Univariate vs. Multivariate (Bay, 2000): Univariate methods discretize an attribute without reference to attributes other than the class. In contrast, multivariate methods consider relationships among attributes during discretization.
•   Disjoint vs. Non-Disjoint (Yang & Webb, 2002): Disjoint methods discretize the value range of an attribute into intervals that do not overlap. Non-disjoint methods allow overlap between intervals.
•   Global vs. Local (Dougherty, Kohavi, & Sahami, 1995): Global methods create a single mapping function that is applied throughout a given classification task. Local methods allow different mapping functions for a single attribute in different classification contexts. For example, decision tree learning may discretize a single attribute into different intervals at different nodes of a tree (Quinlan, 1993). Global techniques are more efficient, because one discretization is used throughout the entire data mining process, but local techniques may result in the discovery of more useful cut points.
•   Eager vs. Lazy (Hsu, Huang, & Wong, 2000, 2003): Eager methods generate the mapping function prior to classification time. Lazy methods generate the mapping function as it is needed during classification time.
•   Ordinal vs. Nominal: Ordinal discretization forms a mapping function from quantitative to ordinal qualitative data. It seeks to retain the ordering information implicit in quantitative attributes. In contrast, nominal discretization forms a mapping function from quantitative to nominal qualitative data, thereby discarding any ordering information. For example, if the value range 0-29 were discretized into three intervals 0-9, 10-19 and 20-29, and the intervals were treated as nominal, then a value in the interval 0-9 would be treated as being as dissimilar to one in 20-29 as to one in 10-19. In contrast, while ordinal discretization will treat the difference between 9 and either 10 or 19 as equivalent, it retains the information that this difference is less than the difference between 9 and 29.
•   Fuzzy vs. Non-Fuzzy (Ishibuchi, Yamamoto, & Nakashima, 2001; Wu, 1999): Fuzzy discretization creates a fuzzy mapping function. A value may belong to multiple intervals, each with varying degrees of strength. Non-fuzzy discretization forms exact cut points.

Composite methods first generate a mapping function using an initial primary method. They then use other primary methods to adjust the initial cut points.
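To make some of these dimensions concrete, the sketch below computes cut points for two simple unsupervised, non-hierarchical, disjoint, global strategies: equal-width and equal-frequency intervals. Neither is one of the specific methods cited above; they are shown only to illustrate how cut points are selected once a strategy is chosen, and the sample data are invented.

def equal_width_cut_points(values, k):
    # Split the value range into k intervals of identical width (unsupervised).
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cut_points(values, k):
    # Choose cut points so each interval holds roughly the same number of values.
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

ages = [3, 7, 9, 12, 15, 19, 23, 31, 40, 66]
print(equal_width_cut_points(ages, 3))       # [24.0, 45.0]
print(equal_frequency_cut_points(ages, 3))   # [12, 23]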
MAIN THRUST

The main thrust of this chapter deals with how to select a discretization method. This issue is particularly important since there exist a large number of discretization methods and none can be universally optimal. When selecting between discretization methods it is critical to take account of the learning context, in particular the learning algorithm, the nature of the data, and the learning objectives. Different learning contexts have different characteristics and hence have different requirements for discretization. It is unrealistic to pursue a universally optimal discretization approach that is blind to its learning context.

Many discretization techniques have been developed primarily in the context of a specific type of learning algorithm, such as decision tree learning, decision rule learning, naive-Bayes learning, Bayes network learning, clustering, and association learning. Different types of learning have different characteristics and hence require different strategies of discretization.

For example, decision tree learners can suffer from the fragmentation problem. If an attribute has many values, a split on this attribute will result in many branches, each of which receives relatively few training instances, making it difficult to select appropriate subsequent tests. Hence they may benefit more than other learners from discretization that results in few intervals. Decision rule learners may require pure intervals (containing instances dominated by a single class), while probabilistic learners such as naive-Bayes do not. The relations between attributes are key themes for association learning, and hence multivariate discretization that can capture the inter-dependencies among attributes is desirable. If coupled with lazy discretization, lazy learners can further save training effort. Non-disjoint discretization is not applicable if the learning algorithm, such as decision tree learning, requires disjoint attribute values.


To facilitate understanding of this issue, we contrast discretization strategies in two popular learning contexts: decision tree learning and naive-Bayes learning. Although both are commonly used for data mining applications, they have very different inductive biases and learning mechanisms. As a result, they call for different discretization methodologies.

Discretization in Decision Tree Learning

The learned concept is represented by a decision tree in decision tree learning. Each non-leaf node tests an attribute. Each branch descending from that node corresponds to one of the attribute's values. Each leaf node assigns a class label. A decision tree classifies instances by sorting them down the tree from the root to some leaf node (Mitchell, 1997). Algorithms such as ID3 (Quinlan, 1986) and its successor C4.5 (Quinlan, 1993) are well-known exemplars.

Fayyad & Irani (1993) proposed multi-interval-entropy-minimization discretization (MIEMD), which has been one of the most popular discretization mechanisms for decision tree learning over the years. Briefly speaking, MIEMD discretizes a quantitative attribute by calculating the class information entropy as if the classification used only that single attribute after discretization. This can be suitable for the divide-and-conquer strategy of decision tree learning, which handles one attribute at a time. However, it is not necessarily appropriate for other learning mechanisms such as naive-Bayes learning, which involves all the attributes simultaneously (Yang, 2003).
Furthermore, MIEMD uses the minimum description maximize the number of discretized intervals to reduce
length criterion (MDL) as its termination condition that discretization bias (Yang, 2003). If employed in decision
decides when to stop further partitioning a quantitative tree learning, this maximization effect of FFD tends to
attributes value range. As An and Cercone (1999) indi- cause a severe fragmentation problem.
cate, this criterion has an effect to form qualitative at- The other way around, MIEMD is effective for deci-
tributes with few values. This effect is desirable for deci- sion tree learning but not for naive-Bayes learning.
sion tree learning, since it helps avoid the fragmentation Because of its attribute independence assumption, na-
problem by minimizing the number of values of an attribute. ive-Bayes learning is not subject to the fragmentation
If an attribute has many values, forming a split on those problem. MIEMDs tendency to minimize the number of
values fragments that data into small subsets with respect intervals has a strong potential to reduce the classifica-
to which it is difficult to perform further learning (Quinlan, tion variance but increase the classification bias. As the
1993). However, this effect of minimization is not so wel- data size becomes large, it is very likely that the loss
come to naive-Bayes learning because it brings adverse through bias increase will soon overshadow the gain
impact as we will detail in the next section. through variance reduction, resulting in inferior learning
performance (Yang, 2003). This impact is particularly
Discretization in Naive-Bayes Learning undesirable, since, due to its efficiency, naive-Bayes
learning is very popular for learning from large data.
When classifying an instance, nave-Bayes learning ap- Hence, MIEMD is not a desirable approach for
plies Bayes theorem to calculate the probability of each discretization in naive-Bayes learning.
class given this instance. The most probable class is
chosen as the class of this instance. In order to simplify the Discretization in Association Rule
calculation, an attribute independence assumption is Discovery
made, assuming attributes conditionally independent of
each other given the class. Although this assumption is Association rules (Agrawal, Imielinski, & Swami, 1993)
have quite distinct discretization requirements to the
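As a rough illustration of the fixed-frequency idea described above, the sketch below (not part of the original article; the interval frequency k = 30 and the tie-handling rule are assumptions) derives cut points so that each interval holds roughly k sorted training values.

```python
# Hypothetical, simplified sketch of fixed-frequency discretization (FFD).
def ffd_cut_points(values, k=30):
    """Return cut points so that each interval holds roughly k sorted values."""
    v = sorted(values)
    cuts, i = [], k
    while i < len(v):
        while i < len(v) and v[i] == v[i - 1]:    # keep identical values in one interval
            i += 1
        if i < len(v):
            cuts.append((v[i - 1] + v[i]) / 2.0)  # cut midway between adjacent values
        i += k
    return cuts

def to_interval(value, cuts):
    """Map a quantitative value to the index of its interval."""
    return sum(value > c for c in cuts)

ages = list(range(18, 80)) * 5                    # made-up training values
cuts = ffd_cut_points(ages, k=30)
print(len(cuts) + 1, to_interval(42, cuts))       # number of intervals, interval of 42
```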

Discretization in Association Rule Discovery

Association rules (Agrawal, Imielinski, & Swami, 1993) have quite distinct discretization requirements to the classification learning techniques discussed above. Because any attribute may appear in the consequent of an association rule, there is no single class variable with respect to which global supervised discretization might be performed. Further, there is no clear evaluation criterion for the sets of rules that are produced. Whereas it is possible to assess the expected error rate of a decision tree or a naïve Bayes classifier, there is no corresponding metric for comparing the quality of two alternative sets of association rules. In consequence, it is not apparent how one might assess the quality of two alternative discretizations of an attribute. It is not even possible to run the system with each alternative and evaluate the quality of the respective results. For these reasons simple global discretization techniques such as fixed frequency discretization are often used.

In contrast, Srikant & Agrawal (1996) present a hybrid global unsupervised and local multi-variate supervised discretization technique. Each attribute is initially discretized using fixed frequency discretization. Then, for each rule a locally optimal discretization is generated by considering new discretizations formed by joining neighboring intervals. This technique is more computationally demanding than a simple global approach, but may result in more useful rules.

FUTURE TRENDS

Many problems still remain open with regard to discretization.

The relationship between the nature of a learning task and the appropriate discretization strategies requires further investigation. One challenge is to identify what aspects of a learning task are relevant to the selection of discretization strategies. It would be useful to characterize tasks and discretization methods into abstract features that facilitate the selection.

Discretization for time-series data is another interesting topic. In time-series data, each instance is associated with a time stamp. The concept that underlies the data may drift over time, which implies that the appropriate discretization cut points may change. It will be time consuming if discretization has to be conducted from scratch each time the data changes. In this case, incremental discretization that only needs the old cut points and new data to form new cut points can be of great utility.

A third trend is discretization for stream data. A fundamental difference between time-series data and stream data is that for stream data, one only has access to data in the current time window, but not any previous data. The data may have large volume, change very fast and require fast response, for example, exchange data from the stock market and observation data from monitoring sensors. The key theme here is to boost discretization's efficiency (without any significant accuracy loss) and to quickly incorporate discretization's results into the learner.

CONCLUSION

The process that transforms quantitative data to qualitative data is discretization. The real world is abundant in quantitative data. In contrast, many learning algorithms are more adept at learning from qualitative data. This gap can be shrunk by discretization, which makes discretization an important research area for knowledge discovery.

Numerous methods have been developed as understanding of discretization evolves. Different methods are tuned to different learning tasks. When seeking to employ an existing discretization method or develop a new discretization mechanism, it is very important to understand the learning context within which the discretization lies. For example, what learning algorithm will make use of the discretized values? Different learning algorithms have different characteristics and require different discretization strategies. There is no universally optimal discretization solution.

There are significant research questions that still remain open in discretization. Continuing progress is both desirable and necessary.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining associations between sets of items in massive databases. Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data (pp. 207-216).

An, A., & Cercone, N. (1999). Discretization of continuous attributes for learning classification rules. Proceedings of the 3rd Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining (pp. 509-514).

Bay, S.D. (2000). Multivariate discretization of continuous variables for set mining. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 315-319).

Bluman, A.G. (1992). Elementary statistics: A step by step approach. Dubuque, IA: Wm. C. Brown Publishers.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the 12th International Conference on Machine Learning (pp. 194-202).

Fayyad, U.M., & Irani, K.B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence (pp. 1022-1027).

Hsu, C.N., Huang, H.J., & Wong, T.T. (2000). Why discretization works for naive Bayesian classifiers. Proceedings of the 17th International Conference on Machine Learning (pp. 309-406).

Hsu, C.N., Huang, H.J., & Wong, T.T. (2003). Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayesian classifiers. Machine Learning, 53(3), 235-263.

Ishibuchi, H., Yamamoto, T., & Nakashima, T. (2001). Fuzzy data mining: Effect of fuzzy discretization. Proceedings of the 2001 IEEE International Conference on Data Mining (pp. 241-248).

Kerber, R. (1992). Chimerge: Discretization for numeric attributes. Proceedings of the 10th National Conference on Artificial Intelligence (pp. 123-128).

Mitchell, T.M. (1997). Machine learning. New York, NY: McGraw-Hill Companies.

Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann Publishers.

Samuels, M.L., & Witmer, J.A. (1999). Statistics for the life sciences (2nd ed.). Prentice-Hall.

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. Proceedings of the 1996 ACM-SIGMOD International Conference on Management of Data (pp. 1-12).

Wu, X. (1999). Fuzzy interpretation of discretized intervals. IEEE Transactions on Fuzzy Systems, 7(6), 753-759.

Yang, Y., & Webb, G.I. (2002). Non-disjoint discretization for naive-Bayes classifiers. Proceedings of the 19th International Conference on Machine Learning (pp. 666-673).

Yang, Y. (2003). Discretization for naive-Bayes learning. PhD thesis, School of Computer Science and Software Engineering, Monash University, Melbourne, Australia.

KEY TERMS

Turning to the authority of introductory statistical textbooks (Bluman, 1992; Samuels & Witmer, 1999), the following definitions are adopted.

Continuous Data: Can assume all values on the number line within their value range. The values are obtained by measuring. An example is: temperature.

Discrete Data: Assume values that can be counted. The data cannot assume all values on the number line within their value range. An example is: number of children in a family.

Discretization: A process that transforms quantitative data to qualitative data.

Nominal Data: Classified into mutually exclusive (non-overlapping), exhaustive categories in which no meaningful order or ranking can be imposed on the data. An example is: blood type of a person: A, B, AB, O.

Ordinal Data: Classified into categories that can be ranked. However, the differences between the ranks cannot be calculated by arithmetic. An example is: assignment evaluation: fail, pass, good, excellent.

Qualitative Data: Also often referred to as categorical data, are data that can be placed into distinct categories. Qualitative data sometimes can be arrayed in a meaningful order, but no arithmetic operations can be applied to them. Qualitative data can be further classified into two groups, nominal or ordinal.

Quantitative Data: Numeric in nature. They can be ranked in order. They also admit to meaningful arithmetic operations. Quantitative data can be further classified into two groups, discrete or continuous.


Discretization of Continuous Attributes


Fabrice Muhlenbach
EURISE, Université Jean Monnet - Saint-Etienne, France

Ricco Rakotomalala
ERIC, Université Lumière - Lyon 2, France

INTRODUCTION

In the data-mining field, many learning methods such as association rules, Bayesian networks, and induction rules (Grzymala-Busse & Stefanowski, 2001) can handle only discrete attributes. Therefore, before the machine-learning process, it is necessary to re-encode each continuous attribute in a discrete attribute constituted by a set of intervals. For example, the age attribute can be transformed in two discrete values representing two intervals: less than 18 (a minor) and 18 or greater. This process, known as discretization, is an essential task of the data preprocessing not only because some learning methods do not handle continuous attributes, but also for other important reasons. The data transformed in a set of intervals are more cognitively relevant for a human interpretation (Liu, Hussain, Tan, & Dash, 2002); the computation process goes faster with a reduced level of data, particularly when some attributes are suppressed from the representation space of the learning problem if it is impossible to find a relevant cut (Mittal & Cheong, 2002); the discretization can provide nonlinear relations (for example, the infants and the elderly people are more sensitive to illness, so the relation between age and illness is not linear), which is why many authors propose to discretize the data even if the learning method can handle continuous attributes (Frank & Witten, 1999). Lastly, discretization can harmonize the nature of the data if it is heterogeneous; for example, in text categorization, the attributes are a mix of numerical values and occurrence terms (Macskassy, Hirsh, Banerjee, & Dayanik, 2001).

An expert realizes the best discretization because he can adapt the interval cuts to the context of the study and can then make sense of the transformed attributes. As mentioned previously, the continuous attribute age can be divided in two categories. Take basketball as an example; what is interesting about this sport is that it has many categories: mini-mite (under 7), mite (7 to 8), squirt (9 to 10), peewee (11 to 12), bantam (13 to 14), midget (15 to 16), junior (17 to 20), and senior (over 20). Nevertheless, this approach is not feasible in the majority of machine-learning problem cases because there are no experts available, no a priori knowledge on the domain, or, for a big dataset, the human cost would be prohibitive. It is then necessary to be able to have an automated method to discretize the predictive attributes and find the cut-points that are better adapted to the learning problem.

Discretization was little studied in statistics except by some rather old articles considering it as a special case of the one-dimensional clustering (Fisher, 1958), but from the beginning of the 1990s, the research expanded very quickly with the development of supervised methods (Dougherty, Kohavi, & Sahami, 1995; Liu et al., 2002). Lately, the applied discretization has affected other fields: An efficient discretization can also improve the performance of discrete methods such as the association rule construction (Ludl & Widmer, 2000a) or the machine learning of a Bayesian network (Friedman & Goldszmidt, 1996).

In this article, we will present the discretization as a preliminary condition of the learning process. The presentation will be limited to the global discretization methods (Frank & Witten, 1999), because in a local discretization, the cutting process depends on the particularities of the model construction, for example, the discretization in rule induction associated with genetic algorithms (Divina, Keijzer, & Marchiori, 2003) or lazy discretization associated with naïve Bayes classifier induction (Yang & Webb, 2002). Moreover, even if this article presents the different approaches to discretize the continuous attributes, whatever the learning method may be used, in the supervised learning framework, only discretizing the predictive attributes will be presented. The cutting of the attributes to be predicted depends a lot on the particular properties of the problem to treat. The discretization of the class attribute is not realistic because this pretreatment, if effectuated, would be the learning process itself.

BACKGROUND

The discretization of a continuous-valued attribute consists of transforming it into a finite number of intervals and to re-encode, for all instances, each value on this
feasible in the majority of machine-learning problem and to re-encode, for all instances, each value on this


attribute by associating it with its corresponding interval. There are many ways to realize this process.

One of these ways consists of realizing a discretization with a fixed number of intervals. In this situation, the user must choose the appropriate number a priori: Too many intervals will be unsuited to the learning problem, and too few intervals can risk losing some interesting information. A continuous attribute can be divided in intervals of equal width (see Figure 1) or equal frequency (see Figure 2). Other methods exist to constitute the intervals based on the clustering principles, for example, k-means clustering discretization (Monti & Cooper, 1999).

Nevertheless, for supervised learning, these discretization methods ignore an important source of information: the instance labels of the class attribute. By contrast, the supervised discretization methods handle the class label repartition to achieve the different cuts and find the more appropriate intervals. Figure 3 shows a situation where it is more efficient to have only two intervals for the continuous attribute instead of three: It is not relevant to separate two bordering intervals if they are composed of the same class data. Therefore, the supervised or unsupervised quality of a discretization method is an important criterion to take into consideration.

Another important criterion to qualify a method is the fact that a discretization either processes on the different attributes one by one or takes into account the whole set of attributes for doing an overall cutting. The second case, called multivariate discretization, is particularly interesting when some interactions exist between the different attributes. In Figure 4, a supervised discretization attempts to find the correct cuts by taking into account only one attribute independently of the others. This will fail: It is necessary to represent the data with the attributes X1 and X2 together to find the appropriate intervals on each attribute.

MAIN THRUST

The two criteria mentioned in the previous section, unsupervised/supervised and univariate/multivariate, will characterize the major discretization method families. In the following sections, we use these criteria to distinguish the particularities of each discretization method.

Univariate Unsupervised Discretization

The simplest discretization methods make no use of the instance labels of the class attribute. For example, the equal width interval binning consists of observing the values of the dataset to identify the minimum and the maximum values observed and to divide the continuous attribute into the number of intervals chosen by the user (Figure 1). Nevertheless, in this situation, if uncharacteristic extreme values exist in the dataset (outliers), the range will be changed, and the intervals will be misappropriated. To avoid this problem, divide the continuous attribute into intervals containing the same number of instances (Figure 2): This method is called the equal frequency discretization method.

The unsupervised discretization can be grasped as a problem of sorting and separating intermingled probability laws (Pötzelberger & Felsenstein, 1993). The existence of an optimum analysis was studied by Teicher (1963) and Yakowitz and Spragins (1968). Nevertheless, these methods are limited in their application in data mining due to too strong statistical hypotheses seldom checked with real data.

Univariate Supervised Discretization

To improve the quality of a discretization in supervised data-mining methods, it is important to take into account the instance labels of the class attribute. Figure 3 shows the problem of constituting intervals without the information of the class attribute. The intervals that are the best adapted to a discrete machine-learning method are the pure intervals containing only instances of a given class. To obtain such intervals, the supervised discretization methods such as the state-of-the-art method Minimum Description Length Principle Cut (MDLPC) are based on statistical or information-theoretical criteria and heuristics (Fayyad & Irani, 1993). In a particular case, even if one supervised method can give better results than another (Kurgan & Cios, 2004), with real data, the improvements of one method compared to the other supervised methods are insignificant.
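The following sketch (my own illustration, not from the article) conveys the core of an entropy-based supervised split such as the one underlying MDLPC: candidate boundaries between differing values are scored by the class-weighted entropy of the two resulting sub-intervals. The MDL stopping criterion and the recursion on sub-intervals are omitted, and the sample data are invented.

```python
# Hypothetical sketch: choose one cut point by minimizing class information entropy.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_cut(pairs):
    """pairs: (value, class_label). Return the boundary with the lowest weighted entropy."""
    pairs = sorted(pairs)
    labels = [label for _, label in pairs]
    best_score, best_cut_point = entropy(labels), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no cut between identical values
        left, right = labels[:i], labels[i:]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if score < best_score:
            best_score = score
            best_cut_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut_point

data = [(1.0, "sick"), (2.0, "sick"), (3.5, "healthy"), (4.0, "healthy"), (6.0, "healthy")]
print(best_cut(data))   # 2.75, the boundary that separates the two classes
```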

Figure 1. Equal width discretization

Figure 2. Equal frequency discretization

Figure 3. Supervised and unsupervised discretizations

Figure 4. Interaction between the attributes X1 and X2

Moreover, the performance of a discretization method is difficult to estimate without a learning algorithm. In addition, the final results can arise from the discretization processing, the learning processing, or the combination of both. Because the discretization is realized in an ad hoc way, independently of the learning algorithm characteristics, there is no guarantee that the interval cut will be optimal for the learning method. Only a little work showed the relevance and the optimality of the global discretization for a very specific classifier such as naïve Bayes (Hsu, Huang, & Wong, 2003; Yang & Webb, 2003). The supervised discretization methods can be distinguished depending on the way the algorithm proceeds: bottom-up (each value represents an interval, which is merged progressively with the others to constitute the appropriate number of intervals) or top-down (the whole dataset represents an interval and is progressively cut to constitute the appropriate number of intervals). However, there are no significant performance differences between these two latest approaches (Zighed, Rakotomalala, & Feschet, 1997).

In brief, the different supervised (univariate) methods can be characterized by (a) the particular statistical criterion used to evaluate the degree of an interval's purity, and (b) the top-down or bottom-up strategy to find the cut-points used to determine the intervals.

Multivariate Unsupervised Discretization

Association rules constitute an unsupervised learning method that needs discrete attributes. For such a method, the discretization of a continuous attribute can be realized in an univariate way but also in a multivariate way. In the latter case, each attribute is cut in relation to the other attributes of the database; this approach can then provide some interesting improvements when unsupervised univariate discretization methods do not yield satisfactory results.

The multivariate unsupervised discretizations can be performed by clustering techniques using all attributes globally. It is also possible to consider each cluster obtained as a class and improve the discretization quality by using (univariate) supervised discretization methods (Chmielewski & Grzymala-Busse, 1994).

An approach called multisupervised discretization (Ludl & Widmer, 2000a) can be seen as a particular unsupervised multivariate discretization. This method starts with the temporary univariate discretization of all attributes. Then the final cutting of a given attribute is based on the univariate supervised discretization of all other attributes previously and temporarily discretized. These attributes play the role of a class attribute one after another. Finally, the smallest intervals are merged. For supervised learning problems, a paving of the representation space can be done by cutting each continuous attribute into intervals. The discretization process consists in merging the bordering intervals in which the data distribution is the same (Bay, 2001). Nevertheless, even if this strategy can introduce the class attribute in the discretization process, it cannot give a particular role to the class attribute and can induce the discretization to nonimpressive results in the predictive model.

Multivariate Supervised Discretization

When the learning problem is supervised and the instance labels are scattered in the representation space with interactions between the continuous predictive attributes (as presented in Figure 4), the methods previously seen will not give satisfactory results. HyperCluster Finder is a method that will fix this problem by combining the advantages of the supervised and multivariate approaches (Muhlenbach & Rakotomalala, 2002). This method is based on clusters constituted as sets of same-class instances that are closed in the representation space. The clusters are identified in a multivariate and supervised way: First, a neighborhood graph is built by using all predictive attributes to determine which instances are close to others; second, the edges connecting two instances belonging

to different classes are cut on the graph to constitute the clusters; third, the minimal and maximal values of each relevant cluster are used as cut-points on each predictive attribute. The intervals found by this method have the characteristic to be pure on a pavement of the whole representation space, even if the purity is not guaranteed for an independent attribute. The combination of all predictive attribute intervals is what will provide pure areas in the representation space.

FUTURE TRENDS

Today, the discretization field is well studied in the supervised and unsupervised cases for a univariate process. However, there is little work in the multivariate case. A related problem exists in the feature selection domain, which needs to be combined with the aforementioned multivariate case. This should bring improved and more pertinent progress. It is virtually certain that better results can be obtained for a multivariate discretization if all attributes of the representation space are relevant for the learning problem.

CONCLUSION

In a data-mining task, for a supervised or unsupervised learning problem, the discretization turned out to be an essential preprocessing step on which the performance of the learning algorithm that uses discretized attributes will depend.

Many methods, supervised or not, multivariate or not, exist to perform this pretreatment, more or less adapted to a given dataset and learning problem. Furthermore, a supervised discretization can also be applied in a regression problem when the attribute to be predicted is continuous (Ludl & Widmer, 2000b). The choice of a particular discretization method depends on (a) its algorithmic complexity (complex algorithms will take more computation time and will be unsuited to very large datasets), (b) its efficiency (the simple unsupervised univariate discretization methods are inappropriate to complex learning problems), and (c) its appropriate combination with the learning method using the discretized attributes (a supervised discretization is better adapted to the supervised learning problem). For the last point, it is also possible to significantly improve the performance of the learning method by choosing an appropriate discretization, for instance, a fuzzy discretization for the naïve Bayes algorithm (Yang & Webb, 2002). Nevertheless, it is unnecessary to employ a sophisticated discretization method if the learning method does not benefit from the discretized attributes (Muhlenbach & Rakotomalala, 2002).

ACKNOWLEDGMENT

Edited with the aid of Christopher Yukna.

REFERENCES

Bay, S. D. (2001). Multivariate discretization for set mining. Knowledge and Information Systems, 3(4), 491-512.

Chmielewski, M. R., & Grzymala-Busse, J. W. (1994). Global discretization of continuous attributes as preprocessing for machine learning. Proceedings of the Third International Workshop on Rough Sets and Soft Computing (pp. 294-301).

Divina, F., Keijzer, M., & Marchiori, E. (2003). A method for handling numerical attributes in GA-based inductive concept learners. Proceedings of the Genetic and Evolutionary Computation Conference (pp. 898-908).

Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the 12th International Conference on Machine Learning (pp. 194-202).

Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence (pp. 1022-1027).

Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798.

Frank, E., & Witten, I. (1999). Making better use of global discretization. Proceedings of the 16th International Conference on Machine Learning (pp. 115-123).

Friedman, N., & Goldszmidt, M. (1996). Discretization of continuous attributes while learning Bayesian networks from mixed data. Proceedings of the 13th International Conference on Machine Learning (pp. 157-165).

Grzymala-Busse, J. W., & Stefanowski, J. (2001). Three discretization methods for rule induction. International Journal of Intelligent Systems, 16, 29-38.

Hsu, H., Huang, H., & Wong, T. (2003). Implications of the Dirichlet assumption for discretization of continuous variables in naïve Bayes classifiers. Machine Learning, 53(3), 235-263.

Kurgan, L., & Cios, K. J. (2004). CAIM discretization algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2), 145-153.

Liu, H., Hussain, F., Tan, C., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6(4), 393-423.

Ludl, M., & Widmer, G. (2000a). Relative unsupervised discretization for association rule mining. Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery (pp. 148-158).

Ludl, M., & Widmer, G. (2000b). Relative unsupervised discretization for regression problems. Proceedings of the 11th European Conference on Machine Learning (pp. 246-253).

Macskassy, S. A., Hirsh, H., Banerjee, A., & Dayanik, A. A. (2001). Using text classifiers for numerical classification. Proceedings of the 17th International Joint Conference on Artificial Intelligence (pp. 885-890).

Mittal, A., & Cheong, L. (2002). Employing discrete Bayes error rate for discretization and feature selection tasks. Proceedings of the First IEEE International Conference on Data Mining (pp. 298-305).

Monti, S., & Cooper, G. F. (1999). A latent variable model for multivariate discretization. Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics.

Muhlenbach, F., & Rakotomalala, R. (2002). Multivariate supervised discretization: A neighborhood graph approach. Proceedings of the First IEEE International Conference on Data Mining (pp. 314-321).

Pötzelberger, K., & Felsenstein, K. (1993). On the Fisher information of discretized data. Journal of Statistical Computation and Simulation, 46(3-4), 125-144.

Teicher, H. (1963). Identifiability of finite mixtures. Annals of Mathematical Statistics, 34, 1265-1269.

Yakowitz, S. J., & Spragins, J. D. (1968). On the identifiability of finite mixtures. Annals of Mathematical Statistics, 39, 209-214.

Yang, Y., & Webb, G. (2002). Non-disjoint discretization for naïve Bayes classifiers. Proceedings of the 19th International Conference on Machine Learning (pp. 666-673).

Yang, Y., & Webb, G. (2003). On why discretization works for naïve Bayes classifiers. Proceedings of the 16th Australian Joint Conference on Artificial Intelligence (pp. 440-452).

Zighed, D., Rakotomalala, R., & Feschet, F. (1997). Optimal multiple intervals discretization of continuous attributes for supervised learning. Proceedings of the Third International Conference on Knowledge Discovery in Databases (pp. 295-298).

KEY TERMS

Cut-Points: A cut-point (or split-point) is a value that divides an attribute into intervals. A cut-point has to be included in the range of the continuous attribute to discretize. A discretization process can produce none or several cut-points.

Discrete/Continuous Attributes: An attribute is a quantity describing an example (or instance); its domain is defined by the attribute type, which denotes the values taken by an attribute. An attribute can be discrete (or categorical, indeed symbolic) when the number of values is finite. A continuous attribute corresponds to real numerical values (for instance, a measurement). The discretization process transforms an attribute from continuous to discrete.

Instances: An instance is an example (or record) of the dataset; it is often a row of the data table. Instances of a dataset are usually seen as a sample of the whole population (the universe). An instance is described by its attribute values, which can be continuous or discrete.

Number of Intervals: The number of intervals corresponds to the different values of a discrete attribute resulting from the discretization process. The number of intervals is equal to the number of cut-points plus 1. The minimum number of intervals of an attribute is equal to 1, and the maximum number of intervals is equal to the number of instances.

Representation Space: The representation space is formed with all the attributes of a learning problem. In the supervised learning, it consists of the representation of the labeled instances in a multidimensional space, where all predictive attributes play the role of a dimension.

Supervised/Unsupervised: A supervised learning algorithm searches a functional link between a class-attribute (or dependent attribute or attribute to be predicted) and predictive attributes (the descriptors). The supervised learning process aims to produce a predictive model that is as accurate as possible. In an unsupervised learning process, all attributes play the same role; the unsupervised learning method tries to group instances in clusters, where instances in the same cluster are similar, and instances in different clusters are dissimilar.

Univariate/Multivariate: A univariate (or monothetic) method processes a particular attribute independently of the others. A multivariate (or polythetic) method processes all attributes of the representation space, so it can fix some problems related to the interactions among the attributes.


Distributed Association Rule Mining


Mafruz Zaman Ashrafi
Monash University, Australia

David Taniar
Monash University, Australia

Kate A. Smith
Monash University, Australia

INTRODUCTION

Data mining is an iterative and interactive process that explores and analyzes voluminous digital data to discover valid, novel, and meaningful patterns (Zaki, 1999). Since digital data may have terabytes of records, data mining techniques aim to find patterns using computationally efficient techniques. It is related to a subarea of statistics called exploratory data analysis. During the past decade, data mining techniques have been used in various business, government, and scientific applications.

Association rule mining (Agrawal, Imielinski & Swami, 1993) is one of the most studied fields in the data-mining domain. The key strength of association mining is completeness. It has the ability to discover all associations within a given dataset. Two important constraints of association rule mining are support and confidence (Agrawal & Srikant, 1994). These constraints are used to measure the interestingness of a rule. The motivation of association rule mining comes from market-basket analysis that aims to discover customer purchase behavior. However, its applications are not limited only to market-basket analysis; rather, they are used in other applications, such as network intrusion detection, credit card fraud detection, and so forth.

The widespread use of computers and the advances in network technologies have enabled modern organizations to distribute their computing resources among different sites. Various business applications used by such organizations normally store their day-to-day data in each respective site. Data of such organizations increases in size every day. Discovering useful patterns from such organizations using a centralized data mining approach is not always feasible, because merging datasets from different sites into a centralized site incurs large network communication costs (Ashrafi, Taniar & Smith, 2004). Furthermore, data from these organizations are not only distributed over various locations, but are also fragmented vertically. Therefore, it becomes more difficult, if not impossible, to combine them in a central location. Therefore, Distributed Association Rule Mining (DARM) emerges as an active subarea of data-mining research.

Consider the following example. A supermarket may have several data centers spread over various regions across the country. Each of these centers may have gigabytes of data. In order to find customer purchase behavior from these datasets, one can employ an association rule mining algorithm in one of the regional data centers. However, employing a mining algorithm to a particular data center will not allow us to obtain all the potential patterns, because customer purchase patterns of one region will vary from the others. So, in order to achieve all potential patterns, we rely on some kind of distributed association rule mining algorithm, which can incorporate all data centers.

Distributed systems, by nature, require communication. Since distributed association rule mining algorithms generate rules from different datasets spread over various geographical sites, they consequently require external communications in every step of the process (Ashrafi, Taniar & Smith, 2004; Assaf & Ron, 2002; Cheung, Ng, Fu & Fu, 1996). As a result, DARM algorithms aim to reduce communication costs in such a way that the total cost of generating global association rules must be less than the cost of combining datasets of all participating sites into a centralized site.

BACKGROUND

DARM aims to discover rules from different datasets that are distributed across multiple sites and interconnected by a communication network. It tries to avoid the communication cost of combining datasets into a centralized site, which requires large amounts of network communication. It offers a new technique to discover knowledge or patterns from such loosely coupled distributed datasets and produces global rule models by using minimal network communication. Figure 1 illustrates a typical DARM framework. It shows three


participating sites, where each site generates local models from its respective data repository and exchanges local models with other sites in order to generate global models.

Figure 1. A distributed data mining framework
[The figure shows three sites (Site 1, Site 2, Site 3), each with its own data repository, mining algorithm, and local model; the local models are exchanged over a communication channel for global model generation, producing the final models.]

Typically, rules generated by DARM algorithms are considered interesting if they satisfy both minimum global support and confidence thresholds. To find interesting global rules, DARM generally has two distinct tasks: (i) global support count, and (ii) global rules generation.

Let D be a virtual transaction dataset comprised of D1, D2, D3, ..., Dm geographically distributed datasets; let n be the number of items and I be the set of items such that I = {a1, a2, a3, ..., an}, where ai (1 ≤ i ≤ n) is an item. Suppose N is the total number of transactions and T = {t1, t2, t3, ..., tN} is the sequence of transactions, such that ti ∈ D. The support of each element of I is the number of transactions in D containing it, and for a given itemset A ⊆ I we can define its support as follows:

Support(A) = |{ti ∈ T : A ⊆ ti}| / N    (1)

Itemset A is frequent if and only if Support(A) ≥ minsup, where minsup is a user-defined global support threshold. Once the algorithm discovers all global frequent itemsets, each site generates global rules that have user-specified confidence. It uses frequent itemsets to find the confidence of a rule R: F1 → F2, which can be calculated by using the following formula:

Confidence(R) = Support(F1 ∪ F2) / Support(F1)    (2)

CHALLENGES OF DARM

All of the DARM algorithms are based on sequential association mining algorithms. Therefore, they inherit all drawbacks of sequential association mining. However, DARM not only deals with the drawbacks of it but also considers other issues related to distributed computing. For example, each site may have different platforms and datasets, and each of those datasets may have different schemas. In the following paragraphs, we discuss a few of them.

Frequent Itemset Enumeration

Frequent itemset enumeration is one of the main association rule mining tasks (Agrawal & Srikant, 1994; Zaki, 2000; Jiawei, Jian & Yiwen, 2000). Association rules are generated from frequent itemsets. However, enumerating all frequent itemsets is computationally expensive. For example, if a transaction of a database contains 30 items, one can generate up to 2^30 itemsets. To mitigate the enumeration problem, we found two basic search approaches in the data-mining literature. The first approach uses breadth-first searching techniques that iterate through the dataset by generating the candidate itemsets. It works efficiently when the user-specified support threshold is high. The second approach uses depth-first searching techniques to enumerate frequent itemsets. This search technique performs better when the user-specified support threshold is low, or if the dataset is dense (i.e., items frequently occur in transactions). For example, Eclat (Zaki, 2000) determines the support of k-itemsets by intersecting the tidlists (Transaction ID lists) of the lexicographically first two (k-1)-length subsets that share a common prefix. However, this approach may run out of main memory when there are large numbers of transactions.

However, DARM datasets are spread over various sites. Due to this, it cannot take full advantage of those searching techniques. For example, breadth-first search performs better when support threshold is high. On the other hand, in DARM, the candidate itemsets are generated by combining frequent items of all datasets; hence, it enumerates those itemsets that are not frequent in a particular site. As a result, DARM cannot utilize the advantage of breadth-first techniques when the user-specified support threshold is high. In contrast, if a depth-first search technique is employed in DARM, then it needs large amounts of network communication. Therefore, without a very fast network connection, depth-first search is not feasible in DARM.
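A small, self-contained sketch (not from the article; the toy transactions and itemsets are invented) may help tie Equations (1) and (2) to the Eclat-style tidlist intersection mentioned above: the support of an itemset is obtained by intersecting the tidlists of its items, and confidence is the ratio of two such supports.

```python
# Hypothetical sketch: tidlist-based support and confidence on a toy dataset.
transactions = {1: {"A", "B", "C"}, 2: {"A", "C"}, 3: {"A", "D"}, 4: {"B", "E"}}
N = len(transactions)

tidlists = {}                                   # item -> set of transaction IDs containing it
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def support(itemset):
    """Equation (1): fraction of transactions containing every item of the itemset."""
    common = set.intersection(*(tidlists[i] for i in itemset))
    return len(common) / N

def confidence(antecedent, consequent):
    """Equation (2): Support(F1 u F2) / Support(F1) for the rule F1 -> F2."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"A", "C"}))        # 0.5
print(confidence({"A"}, {"C"}))   # 0.666..., confidence of the rule A -> C
```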

Communication

A fundamental challenge for DARM is to develop mining techniques without communicating data from various sites unnecessarily. Since, in DARM, each site shares its frequent itemsets with other sites to generate unambiguous association rules, each step of DARM requires communication. The cost of communication increases proportionally to the size of candidate itemsets. For example, suppose there are three sites, such as S1, S2, and S3, involved in distributed association mining. Suppose that after the second pass, site S1 has candidate 2-itemsets equal to {AB, AC, BC, BD}, S2 = {AB, AD, BC, BD}, and S3 = {AC, AD, BC, BD}. To generate global frequent 2-itemsets, each site sends its respective candidate 2-itemsets to other sites and receives candidate 2-itemsets from others. If we calculate the total number of candidate 2-itemsets that each site receives, it is equal to 8. But, if each site increases the number of candidate 2-itemsets by 1, each site will receive 10 candidate 2-itemsets, and, subsequently, this increases communication cost. This cost will further increase when the number of sites is increased. Due to this reason, message optimization becomes an integral part of DARM algorithms.

The basic message exchange technique (Agrawal & Shafer, 1996) incurs massive communication costs, especially when the number of local frequent itemsets is large. To reduce message exchange cost, we found several message optimization techniques (Ashrafi et al., 2004; Assaf & Ron, 2002; Cheung et al., 1996). Each of these optimizations focuses on the message exchange size and tries to reduce it in such a way that the overall communication cost is less than the cost of merging all datasets into a single site. However, those optimization techniques are based on some assumptions and, therefore, are not suitable in many situations. For example, DMA (Cheung et al., 1996) assumes the number of disjoint itemsets among different sites is high. But this is only achievable when the different participating sites have vertically fragmented datasets. To mitigate these problems, we found another framework (Ashrafi et al., 2004) in the data-mining literature. It partitions the different participating sites into two groups, such as sender and receiver, and uses itemset history to reduce message exchange size. Each sender site sends its local frequent itemsets to a particular receiver site. When receiver sites receive all local frequent itemsets, they generate global frequent itemsets for that iteration and send back those itemsets to all sender sites.

Privacy

Association rule-mining algorithms discover patterns from datasets based on some statistical measurements. However, those statistical measurements have significant meaning and may breach privacy (Evfimievski, Srikant, Agrawal & Gehrke, 2002; Rizvi & Haritsa, 2002). Therefore, privacy becomes a key issue of association rule mining. Distributed association rule-mining algorithms discover association rules beyond the organization boundary. For that reason, the chances of breaching privacy in distributed association mining are higher than in centralized association mining (Vaidya & Clifton, 2002; Ashrafi, Taniar & Smith, 2003), because distributed association mining accomplishes the final rule model by combining various local patterns. For example, there are three sites, S1, S2, and S3, each of which has datasets DS1, DS2, and DS3. Suppose A and B are two items having global support threshold; in order to find rule A→B or B→A, we need to aggregate the local support of itemset AB from all participating sites (i.e., sites S1, S2, and S3). When we do such aggregation, all sites learn the exact support count of other sites. However, participating sites generally are reluctant to disclose the exact support of itemset AB to other sites, because the support count of an itemset has a statistical meaning, and this may threaten data privacy.

For the above-mentioned reason, we need secure multi-party computation solutions to maintain privacy of distributed association mining (Kantercioglu & Clifton, 2002). The goal of secure multi-party computation (SMC) in distributed association rule mining is to find global support of all itemsets using a function where multiple parties hold their local support counts, and, at the end, all parties know about the global support of all itemsets but nothing more. And finally, each participating site uses this global support for rule generation.

Partition

DARM deals with different possibilities of data distribution. Different sites may contain horizontally or vertically partitioned data. In horizontal partition, the dataset of each participating site shares a common set of attributes. However, in vertical partition, the datasets of different sites may have different attributes. Figure 2 illustrates horizontal and vertical partition datasets in a distributed context. Generating rules by using DARM becomes more difficult when different participating sites have vertically partitioned datasets. For example, if a participating site has a subset of items of other sites, then that site may have only a few frequent itemsets and may finish the process earlier than others. However, that site cannot move to the next iteration, because it needs to wait for other sites to generate global frequent itemsets of that iteration.
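To illustrate the secure multi-party computation idea sketched in the Privacy discussion above, the following toy ring-based secure sum (my own simplified example; the site counts, the modulus, and the absence of collusion or secure channels are all assumptions) lets an initiating site learn only the global support count, not the individual local counts.

```python
# Hypothetical sketch: ring-based secure sum of local support counts.
import random

def secure_sum(local_counts, modulus=1_000_000):
    """The initiator masks its count with a random value r, every other site adds its
    own count to the running total it receives, and the initiator removes r at the end.
    No site sees another site's individual count (assuming sites do not collude)."""
    r = random.randrange(modulus)
    running = (local_counts[0] + r) % modulus     # initiator sends its masked count
    for count in local_counts[1:]:                # each remaining site adds its count
        running = (running + count) % modulus
    return (running - r) % modulus                # initiator removes the mask

local_support_AB = [120, 75, 240]                 # invented counts of itemset AB at S1, S2, S3
print(secure_sum(local_support_AB))               # 435, the global support count only
```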

Furthermore, the problem becomes more difficult when the dataset of each site does not have the same hierarchical taxonomy. If a different taxonomy level exists in datasets of different sites, as shown in Figure 3, it becomes very difficult to maintain the accuracy of global models.

Figure 2. Distributed (a) homogeneous (b) heterogeneous dataset
[The figure shows sample transaction tables: in (a), Sites A and B both hold transactions over the same attributes X-1, X-2, and X-3; in (b), Sites A, B, and C hold transactions over different attribute subsets (X-1, X-2), (X-1, X-3), and (X-1, X-4).]

Figure 3. A generalization scenario
[The figure shows a three-level taxonomy: Beverage (Level 3) generalizes Coffee, Tea, and Soft drink (Level 2), which in turn generalize items such as Ice Tea, Coke, and Pepsi (Level 1).]

FUTURE TRENDS

The DARM algorithms often consider the datasets of various sites as a single virtual table. On the other hand, such assumptions become incorrect when DARM uses different datasets that are not from the same domain. Enumerating rules using DARM algorithms on such datasets may cause a discrepancy, if we assume that semantic meanings of those datasets are the same. The future DARM algorithms will investigate how such datasets can be used to find meaningful rules without increasing the communication cost.

CONCLUSION

The widespread use of computers and the advances in database technology have provided a large volume of data distributed among various sites. The explosive growth of data in databases has generated an urgent need for efficient DARM to discover useful information and knowledge. Therefore, DARM becomes one of the active subareas of data-mining research. It not only promises to generate association rules with minimal communication cost, but it also utilizes the resources distributed among different sites efficiently. However, the acceptability of DARM depends to a great extent on the issues discussed in this article.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A.N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C.

Agrawal, R., & Shafer, J.C. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 962-969.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the International Conference on Very Large Databases, Santiago de Chile, Chile.

Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2003). Towards privacy preserving distributed association rule mining. Proceedings of Distributed Computing, Lecture Notes in Computer Science, IWDC03, Calcutta, India.

Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2004). Reducing communication cost in privacy preserving distributed association rule mining. Proceedings of Database Systems for Advanced Applications, DASFAA04, Jeju Island, Korea.

Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2004). ODAM: An optimized distributed association rule mining algorithm. IEEE Distributed Systems Online, IEEE.

Assaf, S., & Ron, W. (2002). Communication-efficient distributed mining of association rules. Proceedings of the ACM SIGMOD International Conference on Management of Data, California.

Cheung, D.W., Ng, V.T., Fu, A.W., & Fu, Y. (1996a). Efficient mining of association rules in distributed databases. IEEE Transactions on Knowledge and Data Engineering, 8(6), 911-922.

Cheung, D.W., Ng, V.T., Fu, A.W., & Fu, Y. (1996b). A fast distributed algorithm for mining association rules. Proceedings of the International Conference on Parallel and Distributed Information Systems, Florida.

Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada.

Jiawei, H., Jian, P., & Yiwen, Y. (2000). Mining frequent patterns without candidate generation. Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas.

Kantercioglu, M., & Clifton, C. (2002). Privacy preserving distributed mining of association rules on horizontally partitioned data. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Edmonton, Canada.

Rizvi, S.J., & Haritsa, J.R. (2002). Maintaining data privacy in association rule mining. Proceedings of the International Conference on Very Large Databases, Hong Kong, China.

Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada.

Zaki, M.J. (1999). Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4), 14-25.

Zaki, M.J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(2), 372-390.

Zaki, M.J., & Ya, P. (2002). Introduction: Recent developments in parallel and distributed data mining. Journal of Distributed and Parallel Databases, 11(2), 123-127.

KEY TERMS

DARM: Distributed Association Rule Mining.

Data Center: A centralized repository for the storage and management of information, organized for a particular area or body of knowledge.

Frequent Itemset: A set of items whose support is no less than the user-specified support threshold.

Network Intrusion Detection: A system that detects inappropriate, incorrect, or anomalous activity in a private network.

SMC: SMC computes a function f(x1, x2, x3, ..., xn) that holds inputs from several parties, and, at the end, all parties know about the result of the function f(x1, x2, x3, ..., xn) and nothing else.

Taxonomy: A classification based on a pre-determined system that is used to provide a conceptual framework for discussion, analysis, or information retrieval.

407

TEAM LinG
408

Distributed Data Management of Daily Car


Pooling Problems
Roberto Wolfler Calvo
Universit de Technologie de Troyes, France

Fabio de Luigi
University of Ferrara, Italy

Palle Haastrup
European Commission, Italy

Vittorio Maniezzo
University of Bologna, Italy

INTRODUCTION integrated system for the organization of a car-pooling


service and reports about a real-world case study.
The increased human mobility, combined with high use of
private cars, increases the load on the environment and
raises issues about the quality of life. The use of private BACKGROUND
cars lends to high levels of air pollution in cities, parking
problems, noise pollution, congestion, and the resulting Car pooling can be operated in two main ways: Daily Car
low transfer velocity (and, thus, inefficiency in the use of Pooling Problem (DCPP) or Long-Term Car Pooling Prob-
public resources). Public transportation service is often lem (LCPP). In the case of DCPP (Baldacci et al., 2004;
incapable of effectively servicing non-urban areas, where Mingozzi et al., 2004), each day a number of users (here-
cost-effective transportation systems cannot be set up. after called servers) declare their availability for picking
Based on investigations during the last years, problems up and later bringing back colleagues (hereafter clients)
related to traffic have been among those most commonly on that particular day.
mentioned as distressing, while public transportation The problem is to assign clients to servers and to
systems inherently are incapable of facing the different identify the routes to be driven by the servers in order to
transportation needs arising in modern societies. A solu- minimize service costs and a penalty due to unassigned
tion to the problem of the increased passenger and freight clients, subject to user time window and car capacity
transportation demand could be obtained by increasing constraints. In the case of LCPP, each user is available
both the efficiency and the quality of public transporta- both as a server and as a client, and the objective is to
tion systems, and by the development of systems that define crews or user pools, where each user, in turn, on
could provide alternative solutions in terms of flexibility different days, will pick up the remaining pool members
and costs between the public and private ones. This is the (Hildmann, 2001; Maniezzo et al., 2004). The objective
rationale behind so-called Innovative Transport Systems here becomes that of maximizing pool sizes and minimizing
(ITS) (Colorni et al., 1999), like car pooling, car sharing, the total distance traveled by all users when acting as
dial-a-ride, park-and-ride, card car, park pricing, and road servers, again subject to car capacity and time window
pricing, which are characterized by the exploitation of constraints.
innovative organizational elements and by a large flexibil- Car pooling is most effective for daily commuters. It is
ity in their management (e.g., traffic restrictions and fares an old idea but scarcely applied, at least in Europe, for
can vary according with the time of day). many reasons. First of all, it is necessary to collect a lot of
Specifically, car pooling is a collective transportation information about service users and to make them avail-
system based on the idea that sets of car owners having able to the service planner; it can be difficult to cluster
the same travel destination can share their vehicles (pri- people in such a way that everybody feels satisfied.
vate cars), with the objective of reducing the number of Ultimately, it is important that all participants perceive a
cars on the road. Until now, these systems have had a clear personal gain in order to continue using the service.
limited use due to lack of efficient information processing Moreover, past experience has shown already that the
and communication support. This study presents an effectiveness and competitiveness of public transporta-

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Distributed Data Management of Daily Car Pooling Problems

tion, compared to private transportation, can be improved service is supported by a database of potential users (e.g.,
through an integrated process, based on data manage- employees of a company) that daily commute from their ,
ment facilities. Usually, potential customers of a new houses to their workplace. A subset of them offers seats
transportation system face serious problems of knowl- in their cars. Moreover, they specify the departure time
edge retrieval, which is the reason for devoting substan- (when they leave their house) and the mandatory arrival
tial efforts to providing clients with an easy and powerful time at the office. The employees that offer seats in their
tool to find information. Current information systems give cars are called servers. The employees asking for a lift are
the possibility of real-time data management and include called clients. The set of servers and the set of clients need
the possibility to react to unscheduled events, as they can to be redefined once a day. The effectiveness of the
be reached from almost anywhere. proposed system is related strictly to the architecture and
Architectures for information browsing, like the World the techniques used to manage information.
Wide Web (WWW), are an easy and powerful means The objective of this research is to prove that, at least
through which to deploy databases and, thus, represent in a particular site (the Joint Research Center [JRC] of the
obvious options for setting up services, such as the one European Commission located in Ispra, northern Italy), it
advocated in this work. The WWW is useful for providing could be possible to reduce the number of transport
access to central data storage, for collecting remote data, vehicles without significantly changing either the number
and for allowing GIS and optimization software to interact. of commuters or their comfort-level. The users of the
All these elements have an important impact on the system are commuters normally using their own cars for
implementation of new transport services (Ridematching, traveling between home and a workplace poorly served by
2004). Based on the way they use the WWW and on the public transportation.
services they provide, car-pooling systems range be-
tween two extremes (Carpoolmatch, 2004; Carpooltool, System Architecture
2004; Ridepro, 2004; SAC, 2004, etc.), one where there is
a WWW site collecting information about trips, which are The architecture of the system developed for this problem
open to every registered user, and another one where the is shown in Figure 1 (interested readers can find a com-
users of the system are a restricted group and are granted plete description of the whole system in Wolfler et al.,
more functionalities. 2004). The system consists of the following five main
A main issue for the first type is to guarantee the modules:
reliability of the information. Their interface often is
designed mainly to help set service-related geographical OPT: An optimization module that generates a
information (customer delivery points, customers pick up feasible solution using an algorithm that defines the
points, paths). Such systems only rarely will suggest a paths for the servers. The algorithm makes use of a
matching between clients and servers but operate as a heuristic approach and is used to assign the clients
post-it wall, where users can consult or leave information to the servers and to define the path for each server.
about travel routes. Each path minimizes the travel time, maximizes the
As for the latter type, the idea is normally to set up a number of requests picked up, and satisfies the time
car-pooling service among the users, usually employees and capacity constraints.
of the same organization. This system is more structured.
Moreover, the spontaneous user matching often is sub-
{M-, S-, W-}CAR: Three modules that permit to
stituted by a solution found by means of an algorithmic receive, decrypt, and send SMS (SCAR), EMAIL
approach. An example of this type of system is reported (MCAR) and Web pages (WCAR) to the users,
in Dailey, et al. (1999) or in Lexington-Fayette County respectively. The module Sms Car Pooling (SCAR)
(2004). These systems use the WWW and react to unex- allows the server to send and receive Sms messages.
pected events with e-mail messages. The module Mail Car Pooling (MCAR) supports e-
mail communication and uses POP3 and SMTP as
protocols. The module (WCAR) is the gateway for
the Web interaction. All modules filter the customers
MAIN THRUST access, allowing the entitled user to insert new data,
to query, and to modify the database (e.g., the
This article describes an integrated ICT system that departure and the arrival time desired) and to access
supports the management of a car-pooling service in a service data.
real-world prototypical setting. An approach similar to
Dailey, et al. (1999) is suggested, and a complete system
GUI: A graphical user interface based on ESRI
for supporting the operation of a car-pooling case as a ArcView. It generates a view (a digital map) of the
prototype for a real-life application is described. The current problem instance and provides all relevant
data management.

409

TEAM LinG
Distributed Data Management of Daily Car Pooling Problems

Figure 1. The architecture of the system and offers; the system then generates Web pages con-
taining text and maps presenting the results of the opti-
mization algorithm. Both e-mail and SMS also can be used
SMS SC AR for submitting requests and for receiving results or
Requests
variations of previously proposed results (following
E-mail
Email MCAR
Offers
real-time notifications). The clients know whether their
W EB
requests could be matched and the detail of their pickup,
W CAR
Employees
while the servers simply receive an SMS notifying the
pool formation; travel details are separately sent via e-
GUI Solutions mail. SMS also is used by the servers (via a service relay)
to inform clients of delays.
OPT
Geography Operationally, a user sending an e-mail to the system
must insert the message string in the subject field, which
will be processed by a mail receiver agent and finally sent
to the parser. Economic and privacy considerations sug-
The system collects and uses data of two different gested sending all SMS to an SMS engine, which is
types: geographic and alphanumeric. The geographic da- interfaced with the system and which transfers the string
tabase contains the maps of the region of interest with a sent to the parser. Each time the system receives a
detailed road network, the geocoded client pickup sites, syntactically correct e-mail or SMS, it automatically sends
and all the geographic information needed to generate a an acknowledgment SMS message.
GIS output on the Web. The alphanumeric data repository, While the system generates messages to be sent
maintained by a relational database, contains information directly to users, those that users generate, either e-mail
about the employees and a representation of the road or SMS, must follow a rigidly structured format in order
network of the area where the car-pooling service is active. to allow automatic processing. We implemented a single
Other data, input by the users, are related strictly to the parser for all these messages. The Web interface is
daily situationdetailed information about the service, obviously more user-friendly. Each employee can specify
whether a user is a server or a client, the departure time from through an ASP page the set of days (possibly empty)
home and the maximal acceptable arrival time at work, the when he or she is willing to drive and the set of days when
number of available seats in the cars offered by the servers, he or she asks to be picked up.
and the maximum accepted delay, which is the parameter The system then computes a matching of servers as
used to specify how far out of the shortest way a server is clients. The result is a set of routes starting from the
willing to drive in order to pick up colleagues. The users server houses, arriving at the workplace, and passing
are permitted to consult, modify, or delete their own entries through the set of client houses without violating the
in the database at any time. time windows constraints and the car capacity con-
The road network and all user-related data are pro- straints.
cessed by the optimization module in order to define user These routes are made available both to clients and
pools and car paths. The optimization algorithm can be to server. A well-known GIS module, ArcView, loads the
activated without immediate presence of anybody on some route information given in alphanumeric format, after the
regular schedule, searching a local database and present- optimization. Then, it transforms this information in a set
ing the results on the Web. All these actions are performed of GIS views through a set of queries to its database. The
on a periodic basis on the evening before each work day. last step transforms the views into bitmaps displayed by
The system is designed to support two types of users: the Web server. Moreover, the system alerts clients and
the system administrator and the employee. The system servers in real time by means of SMS when any variation
administrator must update the static databases (users, of the scheduling happens.
maps) and guarantee all the functionalities. All other data
are entered, edited and deleted by the users through a
distributed GUI.
Optimization Algorithms and
Implementation Details
Communication Subsystems
One of the most interesting features of the approach
described in this article, with respect to the other systems
The system supports three different communication chan-
currently in use, is the set of algorithms used to obtain
nels: Web, SMS, and e-mails. The Web is the main interface
a matching of clients and servers. Due to the NP-hard-
to connect a user to the system. The user, by means of a
ness of the underlying problem (Varrentrapp et al., 2002)
standard GUI, is allowed to insert transportation requests
and to the size of the instances to solve, a heuristic

410

TEAM LinG
Distributed Data Management of Daily Car Pooling Problems

approach was in order. In designing it, the efficiency was and more common in the near future. Already, several
the main parameters, ensuring relatively fast response software companies are marketing tailored solutions. ,
time also for larger instances. The matching obtained Therefore, it is easy to envisage that systems featuring
minimizes the total travel length of all servers going to the the services included in the prototype described in this
workplace and maximizes the number of clients serviced, work will define their own market niche. This, however,
while meeting the operational constraints. In addition, can happen only parallel to an increasing sensibility of
real-time services are supported; for example, the system local governments, which alone can define incentive
sends warnings to employees when delays occur. policies for drivers who share their cars. These policies
The system was implemented at the Joint Research will reflect features of the implemented systems. As shown
Center of the European Commission, using standard com- in this article, all needed technological infrastructure is
mercial software and developing extra modules in C++ already available.
using Microsoft Visual Studio C++ compiler. The data-
base used is Microsoft Access. The geographical data are
stored in shape files and used by an ArcView application CONCLUSION
(release 3.2). The system administrator can access data
directly through MSAccess and ArcView, which are both This article presents the essential elements of a prototypi-
packages available and running in the same machine cal deployment of ITC support to a car-pooling service for
where the car-pooling application resides. Through a large-sized organization. The essential features that can
Arcview, the system manager can start the optimization make its actual use possible are in the technological
module. infrastructure, essentially in the distributed data manage-
The Joint Research Center of the European Commis- ment and accompanying optimization, which helps to
sion (JRC) is situated in the northwest of Italy not far from provide real-time response to user needs. The current
Milan. It covers a big area of 2 km2, and its mission is to possibility of using any data access means as a system
carry out research useful to the European Commission. interface (we used e-mails, SMS, and Web, but new smart
The number of employees is about 2,000 people, divided phones and more in general UMTS can provide other
into three main classes: administrative, staff, and re- access points) can ease system acceptance and end-user
searchers. They come from all around Europe and live in interest. However, the ease of usage alone would be of
the area surrounding the center. The wide geographical little interest, if the solutions proposed, in terms of driver
area covered by the commuters is about 100 Km2. Since paths, were not of good quality. Current development of
this is a sparsely populated area, public transportation optimization research, both in the areas of heuristic and
(i.e., trains, buses, etc.) is definitely insufficient; thus, exact approaches, provide firm basis for designing sup-
private cars are necessary and used. port system that also can deal with the dimension of the
Module OPT, in particular, was tested on a set of real- instances induced by large-sized organizations.
world problem instances derived from data provided by
the JRC. The instances used are the same as those de-
scribed in (Baldacci et al., 2004); they are derived from the REFERENCES
real-world instance defined by over 600 employees by
randomly selecting the desired number of clients and Baldacci, R., Maniezzo, V., & Mingozzi, A. (2004). An exact
servers. method for the car pooling problem based on Lagrangean
Computational results show that the CPU time used by column generation. Operations Research, 52(3), 422-439.
our heuristic increases approximately linearly with the
problem dimension, and the number of unserviced re- Carpoolmatch. (2004). http://www.carpoolmatchnw.org/
quests is a constant proportion of the total number of
employees. Moreover, comparing two instances with the Carpooltool. (2004). http://www.carpooltool.com/en/my/
same dimension, but with different percentage of servers, Colorni, A., Cordone, R., Laniado, E., & Wolfler Calvo, R.
one can see that the CPU time increases when fewer (1999). Innovation in transports: Planning and manage-
servers are available, as expected, while the total travel ment [in Italian]. In S. Pallottino, & A. Sciomachen (Eds.),
time decreases, since fewer paths are driven. Scienze delle decisioni per i trasporti. Franco Angeli.
Cordeau, J.-F., & Laporte, G. (2003). The dial-a-ride prob-
FUTURE TRENDS lem (DARP): Variants, modeling issues and algorithms.
Quarterly Journal of the Belgian, French and Italian
Car-pooling services as well as most other alternative Operations Research Societies, 1, 89-101.
public transportation services are likely to become more

411

TEAM LinG
Distributed Data Management of Daily Car Pooling Problems

Dailey, D.J., Loseff, D., & Meyers, D. (1999). Seattle smart KEY TERMS
traveler: Dynamic ridematching on the World Wide Web.
Transportation Research Part C, 7, 17-32. Car Pooling: A collective transportation system based
Hildmann, H. (2001). An ants metaheuristic to solve car on the shared use of private cars (vehicles) with the
pooling problems [masters thesis]. University of objective of reducing the number of cars on the road.
Amsterdam, The Netherlands. Computational Problem: A relation between input
Lexington-Fayette County. (2004). Ride matching ser- and output data, where input data are known (and corre-
vices. Retrieved from http://www.lfucg.com/mobility/ spond to all possible different problem instances), and
rideshare.asp output data are to be identified, but predicates or asser-
tions they must verify are given.
Maniezzo, V. (2002). Decision support for location prob-
lems. Encyclopedia of Microcomputers, 8, 31-52. NY: Data Mining: The application of analytical methods
Marcel Dekker. and tools to data for the purpose of identifying patterns
and relationships, such as classification, prediction, es-
Maniezzo, V., Carbonaro, A., & Hidmann, H. (2004). An timation, or affinity grouping.
ANTS heuristic for the long-term car pooling problem. In
G.C. Onwuboulu, & B.V. Babu (Eds.), New optimization GIS: Geographic Information Systems; tools used to
techniques in engineering (pp. 411-430).Heidelber, Ger- gather, transform, manipulate, analyze, and produce in-
many: Springer-Verlag. formation related to the surface of the Earth.

Mattarelli, M., Maniezzo, V., & Haastrup, P. (1998). A Heuristic Algorithms: Optimization algorithms that
decision support system distributed on the Internet. do not guarantee to identify the optimal solution of the
Journal of Decision Systems, 6(4), 353-368. problem they are applied to, but which usually provide
good quality solutions in an acceptable time.
Ridematching. (2004). University of South Florida,
ridematching systems. Retrieved from http:// NP-Hard Problems: Optimization problems for which
www.nctr.usf.edu/clearinghouse/ridematching.htm a solution can be verified in polynomial time, but no
polynomial solution algorithm is known, even though no
Ridepro. (2004). http://www.ridepro.net/index.asp one so far has been able to demonstrate that none exists.
SAC. (2004). San Antonio Colleges carpool matching Optimization Problem: A computational problem for
service. Retrieved from http://www.accd.edu/sac/ which an objective function associates a merit figure with
carpool/ each problem solution, and it is asked to identify a feasible
solution that minimizes or maximizes the objective function.
Varrentrapp, K., Maniezzo, V., & Sttzle, T. (2002). The
long term car pooling problem: On the soundness of the
problem formulation and proof of NP-completeness. Tech-
nical Report AIDA-02-03. Darmstadt, Germany: Technical ENDNOTE
University of Darmstadt.
1
This article is an abridged and updated version of
Wolfler Calvo, R., de Luigi, F., Haastrup, P., & Maniezzo, the paper, A Distributed Geographic Information
V.(2004). A distributed geographic information system for System for the Daily Car Pooling Problem, pub-
the daily car pooling problem. Computers & Operations lished as Wolfler et al., 2004.
Research, 31, 2263-2278.

412

TEAM LinG
413

Drawing Representative Samples from Large ,


Databases
Wen-Chi Hou
Southern Illinois University, USA

Hong Guo
Southern Illinois University, USA

Feng Yan
Williams Power, USA

Qiang Zhu
University of Michigan, USA

INTRODUCTION tribute to customers who are handicapped, pregnant,


elderly, and etcetera. A uniform sampling may not be able
Sampling has been used in areas like selectivity estima- to include enough such under-populated people. How-
tion (Hou & Ozsoyoglu, 1991; Haas & Swami, 1992, ever, by giving higher inclusion probabilities to (the
Jermaine, 2003; Lipton, Naughton & Schnerder, 1990; Wu, transaction records of) these under-populated customers
Agrawal, & Abbadi, 2001), OLAP (Acharya, Gibbons, & in sampling, the special care can be reflected in the
Poosala, 2000), clustering (Agrawal, Gehrke, Gunopulos, association rules.
& Raghavan, 1998; Palmer & Faloutsos, 2000), and spatial To find representative samples for populations with
data mining (Xu, Ester, Kriegel, & Sander, 1998). Due to its non-uniform probability distributions, some remedies,
importance, sampling has been incorporated into modern such as the density biased sampling (Palmer & Faloutsos,
database systems. 2000) and the Acceptance/Rejection (AR) sampling (Olken,
The uniform random sampling has been used in vari- 1993), have been proposed. The density-biased sampling
ous applications. However, it has also been criticized for is specifically designed for applications where the prob-
its uniform treatment of objects that have non-uniform ability of a group of objects is inversely proportional to
probability distributions. Consider the Gallup poll for a its size. The AR sampling, based on the acceptance/
Federal election as an example. The sample is constructed rejection approach (Rubinstein, 1981), aims for all prob-
by randomly selecting residences telephone numbers. ability distributions and is probably the most general
Unfortunately, the sample selected is not truly represen- approach discussed in the database literature.
tative of the actual voters on the election. A major reason We are interested in finding a general, efficient, and
is that statistics have shown that most voters between accurate sampling method applicable to all probability
ages 18 and 24 do not cast their ballots, while most senior distributions. In this research, we develop a Metropolis
citizens go to the poll-booths on Election Day. Since sampling method, based on the Metropolis algorithm
Gallups sample does not take this into account, the (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller,
survey could deviate substantially from the actual elec- 1953), to draw representative samples. As it will be clear,
tion results. the sample generated by this method is bona fide repre-
Finding representative samples is also important for sentative.
many data mining tasks. For example, a carmaker may like
to add desirable features in its new luxury car model. Since
not all people are equally likely to buy the cars, only from BACKGROUND
a representative sample of potential luxury car buyers can
most attractive features be revealed. Consider another Being a representative sample, it must satisfy some crite-
example in deriving association rules from market basket ria. First, the sample mean and variance must be good
data, recalling that the goal was to place items often estimates of the population mean and variance, respec-
purchased together in near locations. While serving ordi- tively, and converge to the latter when the sample size
nary customers, the store would like to pay some special increases. In addition, a selected sample must have a

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Drawing Representative Samples from Large Databases

similar distribution to that of the underlying population. k


In the following, we briefly describe the population mean 2 = (ri Nwi ) 2 /( Nwi ) , (4)
and variance and the Chi-square (Spiegel, 1991) test used i =1
to examine the similarity of distributions.
where ri is the number of sample objects drawn from the
Mean and Variance Estimation
wi is the probability
k
ith bin, i =1 ri = N is the sample size,
v
Let x be a d-dimensional vector representing a set of d
attributes that characterizes an object in a population, of the i th bin of the population, and Nwi is the expected
and the quantity of interest of the object, denoted as number of sample objects from that bin. A bin here refers
v to a designated group or range of values. The larger
(x ) . Our task is to calculate the mean and variance of
v v the 2 value, the greater is the discrepancy between the
(x ) of the population (relation). Let w( x ) 0 be the
v sample and the population distributions. Usually, a level
probability or weigh function of x . The population
v of significance is specified as the uncertainty of the test.
mean of (x ) , denoted by , is given by 2
If the value of 2 is less than 1 , we are about 1-
v v confident that the sample and population have similar
= r ( x ) w( x ) (1) 2
all x distributions: customarily, = 0.05 . The value of 1 is
also determined by the degree of freedom involved (i.e.,
k - 1 ).
The probability distribution is required to satisfy

v
rw( x ) = 1 . (2) MAIN THRUST
all x

Metropolis algorithm (Metropolis, Rosenbluth,


v Rosenbluth, Teller, & Teller, 1953) has been known as
For example, let x be a registered voter. Assuming the most successful and influential Monte Carlo Method.
there are only two candidates: a Republican candidate and Unlike its use in numerical calculations, we shall use it
v
a Democratic candidate, then we can let (x ) = 1 if the to construct representative samples. In addition, we will
v also incorporate techniques for finding the best start
voter x will cast a vote for the Republican candidate; and sampling point in the algorithm, which can greatly im-
v v
(x ) = 1 , otherwise. w(x ) is the weight of the registered prove the efficiency of the process.
r
voter x . If is positive, the Republican candidate is
Probability Distribution
predicted to win the election. Otherwise, the Democratic
candidate wins the election. v
The probability distribution w(x ) plays an important role
Another useful quantity is the population variance,
which is defined as in the Metropolis algorithm. Unfortunately, such informa-
tion is usually unknown or difficult to obtain due to
v v incompleteness or size of a population. However, the
2 = r( ( x ) ) 2 w( x ) . (3) relative probability distribution or non-normalized prob-
all x v
ability distribution, denoted by W (x ) , can often be ob-
tained from, for example, preliminary analysis, knowledge,
v statistics, and etcetera. Take the Gallup poll for example.
The variance specifies the variability of the (x )
While it may be difficult or impossible to assign a weight
values relative to . v
(i.e., w(x ) ) to each individual voter, it can be easily
known, for example, from Federal Election Commission,
Chi-Square Test that the relative probabilities for people to vote on the
v
Election Day (i.e., W (x ) ) are 18.5%, 38.7%, 56.5%, and
To compare the distributions of a sample and its popula- 61.5% for groups whose ages fall in 18-24, 25-44, 45-
tion, we perform the Chi-square test (Press, Teukolsky, 65, and 65+, respectively. Fortunately, the relative prob-
Vetterling, & Flannery, 1994) by calculating
414

TEAM LinG
Drawing Representative Samples from Large Databases

v ment is that the random selection method must provide


ability distribution W (x ) suffices to perform the calcula-
tions and construct a sample (Kalos & Whilock, 1986). a chance for every element in the entire data space to be ,
picked.
v
Sampling Procedure We now decide if the trial object y should be incor-
porated into the sample. We calculate the ratio
Similar to simple random sampling, objects are selected v v v
= W ( y ) / W ( xi ) . If 1 , we accept y into the sample
randomly from the population one after another. The v v v
Metropolis sampling guarantees that the sample selected and let xi +1 = y . That is, y becomes the last object
has a similar distribution to that of the population. added to the sample. If < 1 , we generate a random
number R, which has an equal probability to lie between
Selecting the Starting Point v
0 and 1. If R , the trial objecty is accepted into the
Objects with higher weight are more important than others. v v v
sample and we let xi +1 = y . Otherwise, the trial object y
If a method does not start with those objects, we could miss
v v
them in the process or take long time to incorporate them is rejected and we let xi +1 = xi . It is noted that in the
into the sample, which is formidable in cost and detrimental latter situation, we incorporate the just selected object
to accuracy. In the following, we propose a general ap- v v v
proach to selecting a starting point. xi into the sample again (i.e. xi +1 = xi ). Therefore, an
We begin with searching for an object with the maxi- object with a high probability may appear more than once
v
mum W (x ) . If there are several objects with the same in our sample. The above step is called a Monte Carlo
maximum value, we could just pick any of them as the best step.
v After each Monte Carlo step, we add one more object
starting host x1 . In many applications, such as the Gallup into the sample. The above Monte Carlo step is repeated
v until a predefined sample size N is reached. It is expected
example, finding the maximal value of W (x ) is straightfor-
ward. For others, there are several useful approaches for that the sample average converges very fast as N in-
searching for the maximum of functions, such as Golden creases and the fluctuation decreases in the order of
Section search, Downhill Simplex method, Conjugate Gra- 1 / N (Metropolis, Rosenbluth, Rosenbluth, Teller, &
dient method, and etcetera. A good review of these meth- Teller, 1953). The complete sampling procedure, which
ods can be found in the reference (Press, Teukolsky, includes selecting the starting point and the Metropolis
Vetterling, & Flannery, 1994). algorithm, is summarized in Figure 1.

Incorporating Objects into Sample Properties of a Metropolis Sample


v Since the selection starts with an object having the
Let the object last added to the sample be xi . Now, we pick
v
v
a trial object y from the population randomly. In general, maximum weight W (x ) , it ensures that the sample
v always includes the most important objects. In addition,
there is no restriction on how to select y . The only require-

Figure 1. The Metropolis Sampling


v
(1) locate an object with the maximum weight W (xv ) as the first object x1 of the sample.
(2) for i from 1 to N-1 do
v
(3) randomly select an object y ;
(4) compute = W ( yv) / W ( xvi ) ;
v v
(5) if 1 then xi +1 = y ;
(6) else generate a random number R;
v v
(7) if R then xi +1 = y ;
v v
(8) else xi +1 = xi ;
(9) end if;
(10) end if;
(11) end for.

415

TEAM LinG
Drawing Representative Samples from Large Databases

the sample has a distribution close to that of the popula- distribution and the greater the Wmax / Wavg value. This
tion when the sample size is large enough. Let r l be a explains why the AR sampling becomes less efficient as
random variable denoting the number of occurrences the dimension increases. In comparison, our Metropolis
sampling method accepts one object in every trial.
for object xl (l = 1, 2, , k) in the sample, then
E (r1 ) : E (r2 ) : ... : E (rk ) W ( x1 ) : W ( x2 ) : ... : W ( xk )
, Quality of the Samples
when the sample size is large enough, where E(rl) is the
expected number of occurrences for object rl in the sample. 2
Instead of showing the variance of estimation < > ,
We also have E (r1 ) / N w( x1 ), E ( r2 ) / N w( x2 ), ,
here we show the second moment of , < 2 > . The
E ( rk ) / N w( xk ) . second moment tells how different the estimates are from
Finally, it should be pointed out that when all objects the actual values, especially when the estimate is biased.
have an equal weight, the Metropolis sampling degener- As shown in Figures 4 and 5, both methods yield pretty
ates to the simple random sampling. accurate estimates of the population mean (= 0.5) and
second moment (= 0.75) for one-dimensional case. Hence,
Experimental Results from Equation (7), they also give good estimates of the
variance.
Now, we report the results of our empirical evaluations of As shown in Figures 6 and 7 for d=3, the AR sampling
the Metropolis sampling and the AR sampling. We com- yields estimates around 1.0 and 1.7 for the mean and
pare the efficiency of the methods and the quality of the second moment, respectively, which are below their re-
samples yielded. spective exact values 3/2 and 15/4. On the other hand, our
To compare the methods quantitatively, we select a Metropolis sampling yields quite accurate estimates. As
model for which we know the analytic values. Here, we the sample size gets larger, our estimates get closer to the
v analytic values, but ARs do not.
have chosen the Gaussian model w(x ) =
Similar results are also observed for higher dimen-
v v v
(1 / ) d / 2 e ( x ) , where d is the dimension and ( x ) = x 2 .
2

sional cases. As shown in Figures 8 and 9, when d=20, the


The mean, the second moment, and variance are: AR sampling yields estimates around 5.1 and 29 for the
mean and second moment, respectively, which are well
v v v below the exact values 10 and 110. As for our method, a
= x 2 w( x ) dx = d / 2 , (5) sample of size as small as of 10,000 objects already gives
< > 10 and < 2 > 110.
v 2 v v
= ( x 2 ) w( x ) dx = ( d 2 + 2d ) / 4 ,
2 (6) The underestimation of the AR sampling was attrib-
uted to two factors: the acceptance/rejection criterion
v
2 = 2 ( ) 2 = d / 2 . (7) W ( x ) / Wmax of the AR sampling and the random number
generator. In the Gaussian model described earlier, W max
v v
appears at the center (i.e., x = 0 ), and the weight W (x )
Figure 2 is a 2-D Gaussian distribution. It has a high diminishes quickly as we move away from the center.
peak at the center. Only a small region near the center Indeed, the majority of the points have very small weights
makes significant contributions to the integrations of
and thus small W ( xv ) / Wmax = e ( x ) ratios, especially when
v2

Equations (5) and (6), and the vast remaining region has
v
vanishing contributions. As the dimension increases, the the dimension is high. The low W ( x ) / Wmax values could
peak becomes higher and narrower, and the distribution make the remote points not selectable when compared
becomes more skewed. In our experiments, we let d = with the random numbers generated in the process.
1, 3, 10, and 20. The random number generators on most computers
are based on the linear congruent method, which first
Sampling Efficiency generates a sequence of integers by the recurrence rela-
tion I j +1 = aI j + c (mod m), where m is the modulus, and
First, let us compare the cost of sampling. As shown in
Figure 3, the AR method roughly needs 5, 10, 75, and 1,750 a and c are positive integers (Press, Teukolsky, Vetterling,
trials to accept just one object into the sample in 1, 3, 10, & Flannery, 1994). The random numbers generated are I1/
and 20-dimensional cases, respectively. It is noted that m, I 2/m, I3/m, ..., and the smallest numbers generated is 1/
the higher the dimension, the more skewed the Gaussian m. As a result, a trial point in the AR sampling, whose

416

TEAM LinG
Drawing Representative Samples from Large Databases

Figure 2. A Gaussian distribution: d=2 Figure 3. Cost of AR sampling


,
2.0M

d=20
1.6M

1.2M
d=10

Trial
s
800.0k

d=1
400.0k
d=3
0.0
0 2k 4k 6k 8k 10k 12k 14k 16k 18k 2k
Sample Size

Figure 4. Sample means for d=1. < > = 0.5 Figure 5. Sample means of second moment. < 2 > = 0.75
0.9
0.8
MC
0.7 AR
0.6
0.5
<>

0.4
0.3
0.2
0.1
0.0
0 2k 4k 6k 8k 10k 12k 14k 16k 18k 20k
Sample Size

Figure 6. Sample means for d=3. < > = 1.5 Figure 7. Sample means of second moment < 2 > = 15/
4
2.5 7

MC 6 MC
2.0 AR AR
5

1.5 4
< 2 >
<>

3
1.0

2
0.5
1

0.0 0
0 2k 4k 6k 8k 10k 12k 14k 16k 18k 20k 0 2k 4k 6k 8k 10k 12k 14k 16k 18k 20k
Sample Size Sample Size

417

TEAM LinG
Drawing Representative Samples from Large Databases

Figure 8. Sample means for d=20. < > = 10 Figure 9. Sample means of second moment. < 2 > = 110
12
160
140 MC
10
AR
120
8

< 2 >
100


<>

6 80
60
4
MC 40
2 AR
20
0 0
0 2k 4k 6k 8k 10k 12k 14k 16k 18k 20k 0 2k 4k 6k 8k 10k 12k 14k 16k 18k 20k
Sample Size Sampel Size

v stabilizes as N gets larger. We have also performed tests


W ( x ) / Wmax is smaller than 1/m, cannot be accepted. Since
on other bin values and the results are similar, all verifying
these remote points have larger values than the points
the agreement of the sample distribution with the popu-
v v
near the center, recalling that ( x ) = x 2 , underestima- lation distribution. As for AR sampling, it also performs
tion sets in. The higher the dimension, the more skewed very well, showing a strong agreement of the sample
the distribution, and the more serious the underestima- distribution with the underlying population distribution.
tion. Increasing sample size would not help because those Extending this test to the three-dimensional Gaussian
points simply will not be accepted. model, we concentrate on the cubic space divided it into
On the other hand, our Metropolis sampling uses the 7 3 = 343 bins. The results are plotted in Figure 11. As
weight ratio of the trial and the last accepted objects. As
points, away from the center, are accepted, the chances of observed, our 2 values are always less than 02.95 ( 386
accepting remote points increase. Therefore, it is immune with = 342 ), indicating a good agreement with the
from the difficulty associated with the AR sampling method. population distribution. As for the AR sampling, it is
Based on our experiments with the Gaussian distribu-
tions, a sufficient minimum size should be in the neighbor- observed that for any sample of reasonable size, its 2
hood of 500d to 1,000d, where d is the number of dimen-
sions for the population. For smoother distributions, the well exceeds 02.95 386 for = 342 : moreover, its 2
results can be even better. increases with the sample size. This indicates that the
samples have different distributions from the population.
Distribution As explained earlier, this is because the AR sampling
could not include points with very low weights and thus
To examine whether a sample has a similar distribution to as more points are included (or as the sample size in-
the underlying population, generally a Chi-square test is creases) the more different the sample distribution is from
performed. Since the Chi-square test is designed for the population distribution. The results are consistent
discrete data, we divide each dimension into 7 bins of with the evaluation of means and variance discussed in
the previous sections. The AR sampling may work well
equal length to compute the 2 values. when the probability distribution is not very skewed, but
our approach works for all distributions. In addition, our
Figure 10 shows the 2 -test results of ARs and our
approach is also more efficient.
samples. We have chosen the commonly used signifi-
cance level =0.05 for the test. It is observed that when
sample size N > 200, our 2 is below 12 = 02.95 = 12.6 FUTURE TRENDS
with = 6 (=7-1). This indicates the agreement of the While modern computers become more and more power-
2 ful, many databases in social, economical, engineering,
sample with the population distributions. The value scientific, and statistical applications may still be too

418

TEAM LinG
Drawing Representative Samples from Large Databases

Figure 10. Chi-square test on 1-d samples Figure 11. Chi-square test on 3-d samples with
with 2
0.95 = 12.6 and = 6 02.95 = 386 and = 342 ,
25 1000
MC
AR
MC
20 800
AR

15
600
2

2
10
400

5
200

0
0 500 1000 1500 2000 2500 3000
0
Sam p le S iz e N 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000
Sample Size N

large to handle. Sampling, therefore, becomes a necessity sional data for data mining application. In Proceedings of
for analyses, surveys, and numerical calculations in these the ACM SIGMOD Conference (pp. 94-105).
applications. In addition, for many modern applications in
OLAP and data mining, where fast responses are required, Haas, P., & Swami, A. (1992). Sequential sampling proce-
sampling also becomes a viable approach for construct- dures for query size estimation. In Proceedings of the
ing an in-core representation of the data. A general ACM SIGMOD Conference (pp. 341-350).
sampling algorithm that applies to all distributions is Hou, W.-C., & Ozsoyoglu, G. (1991). Statistical estimators
needed more than ever. for aggregate relational algebra queries. ACM Transac-
tions on Database Systems, 16(4), 600-654.

CONCLUSION Jermaine, C. (2003). Robust estimation with sampling and


approximate pre-aggregation. In Proceedings of VLDB
The Metropolis sampling presented in this paper can be Conferences (pp. 886-897).
a useful and powerful tool for studying large databases of Kalos, M., & Whilock, P. (1986). Monte Carlo methods:
any distributions. We propose to start the sampling by basic. New York: John Wiley & Sons.
taking an object from where the probability distribution
has its maximum. This guarantees that the sample always Lipton, R., Naughton, J., & Schnerder, D. (1990). Practical
includes the most important objects and improves the selectivity estimation through adaptive sampling. In Pro-
efficiency of the process. Our experiments also indicate a ceedings of the ACM SIGMOD Conference (pp. 1-11).
strong agreement between the selected sample and the Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller
population distributions. The selected sample is bona A., & Teller, E. (1953). Equation of state calculations by
fide representative, better than the samples produced by fast computing machines. Journal of Chemical Phys-
other existing methods. ics, 21(6), 1087-1092.
Olken, F. (1993). Random sampling from databases.
REFERENCES Ph.D. dissertation. University of California.
Palmer, C., & Faloutsos, C. (2000). Density biased
Acharya, S., Gibbons, P., & Poosala, V. (2000). sampling: An improved method for data mining and
Cogressional samples for approximate answering of group- clustering Proceedings of the ACM SIGMOD Confer-
by queries. In Proceedings of ACM SIGMOD Conference ence, 29 (2), 82-92.
(pp. 487-498).
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B.
Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1994). Numerical recipes in C. Cambridge: Cambridge
(1998). Automatic subspace clustering of high dimen- University Press.

419

TEAM LinG
Drawing Representative Samples from Large Databases

Rubinstein, R. (1981). Simulation and the Monte Carlo thermodynamics, solid state physics, biological systems,
Method. New York: John Wiley & Sons. and etcetera. The algorithm is known as the most success-
ful and influential Monte Carlo method.
Spiegel, M. (1991). Probability and statistics. McGraw-
Hill, Inc. Monte Carlo Method: The heart of this method is a
random number generator. The term Monte Carlo Method
Wu, Y., Agrawal, D., & Abbadi, A. (2001). Using the now stands for any sort of stochastic modeling.
golden rule of sampling for query estimation. In Proceed-
ings of ACM SIGMOD Conference (pp. 279-290). OLAP: Online Analytical Processing.
Xu, X., Ester, M., Kriegel, H., & Sander, J. (1998). A Representative Sample: A sample whose distribution
distribution-based clustering algorithm for mining in large is the same as that of the underlying population.
spatial databases. In Proceedings of the IEEE ICDE
Conference (pp. 324-331). Sample: A set of elements drawn from a population.
Selectivity: The ratio of the number of output tuples
of a query to the total number of tuples in the relation.
KEY TERMS
Uniform Sampling: All objects or clusters of objects
Metropolis Algorithm: Was proposed in 1953 by are drawn with equal probability.
Metropolis et al. for studying statistical physics. Since
then it has become a powerful tool for investigating

420

TEAM LinG
421

Efficient Computation of Data Cubes and -


Aggregate Views
Leonardo Tininini
CNR - Istituto di Analisi dei Sistemi e Informatica Antonio Ruberti, Italy

INTRODUCTION precisely to data groups, each containing a subset of the


data and homogeneous with respect to a given set of
This paper reviews the main techniques for the efficient attributes. For example, the data Average duration of
calculation of aggregate multidimensional views and data calls in 2003 by region and call plan is obtained from the
cubes, possibly using specifically designed indexing so-called fact table, which is usually the product of
structures. The efficient evaluation of aggregate multidi- complex source integration activities (Lenzerini, 2002) on
mensional queries is obviously one of the most important the raw data corresponding to each phone call in that year.
aspects in data warehouses (OLAP systems). In particu- Several groups are defined, each consisting of calls made
lar, a fundamental requirement of such systems is the in the same region and with the same call plan, and finally
ability to perform multidimensional analyses in online applying the average aggregation function on the dura-
response times. As multidimensional queries usually in- tion attribute of the data in each group (see Figure 1). The
volve a huge amount of data to be aggregated, the only triple of values (region, call plan, year) is used to identify
way to achieve this is by pre-computing some queries, each group and is associated with the corresponding
storing the answers permanently in the database and average duration value. In multidimensional databases,
reusing these almost exclusively when evaluating queries the attributes used to group data define the dimensions,
in the multidimensional database. These pre-computed whereas the aggregate values define the measures.
queries are commonly referred to as materialized views The term multidimensional data comes from the well-
and carry several related issues, particularly how to effi- known metaphor of the data cube (Gray, Bosworth, Lay-
ciently compute them (the focus of this paper), but also man, & Pirahesh, 1996). For each of n attributes, used to
which views to materialize and how to maintain them. identify a single measure, a dimension of an n-dimen-
sional space is considered. The possible values of the
identifying attributes are mapped to points on the
BACKGROUND dimensions axis, and each point of this n-dimensional
space is thus mapped to a single combination of the
Multidimensional data are obtained by applying aggrega- identifying attribute values and hence to a single aggre-
tions and statistical functions to elementary data, or more gate value. The collection of all these points, along with

Figure 1. From facts to data cubes and drill-down on the time dimension
C P2 C P3 n Italy
Norther
C P1 Italy
C entral
Phone calls
rn Italy
in 2003 So uthe

Average on the
duration attribute

Time
Italy Italy
CP3 hern CP3 hern
No rt aly No rt tral It
aly
CP2 tral It Italy CP2 C en Italy
C en thern thern
CP1 Sou CP1 Sou
2003 (1. quat.)
2003 2003 (2. quat.)
2003 (3. quat.)
Drill-down on the 2003 (4. quat.)
time dimension
2002

2001

plan R eg
C all ion

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Efficient Computation of Data Cubes and Aggregate Views

all possible projections in lower dimensional spaces, historical analyses with trend forecasts. However,
constitutes the so-called data cube. In most cases, dimen- the presence of this temporal dimension may cause
sions are structured in hierarchies, representing several problems in query formulation and processing, as
granularity levels of the corresponding measures schemes may evolve over time and conventional
(Jagadish, Lakshmanan, & Srivastava, 1999). Hence a time query languages are not adequate to cope with them
dimension can be organized into days, months, quarters (Vaisman & Mendelzon, 2001).
and years; a territorial dimension into towns, regions and Target users: Typical OLTP system users are clerks,
countries; a product dimension into brands, families and and the types of query are rather limited and predict-
types. When querying multidimensional data, the user able. In contrast, multidimensional databases are
specifies the measures of interest and the level of detail usually the core of decision support systems, tar-
required by indicating the desired hierarchy level for each geted at management level. Query types are only
dimension. In a multidimensional environment querying partly predictable and often require highly expres-
is often an exploratory process, where the user moves sive (and complex) query language. However, the
along the dimension hierarchies by increasing or reducing user usually has little experience even in easy
the granularity of displayed data. The drill-down opera- query languages like basic SQL: the typical interac-
tion corresponds to an increase in detail, for example, by tion paradigm is a spreadsheet-like environment
requesting the number of calls by region and quarter, based on iconic interfaces and the graphical meta-
starting from data on the number of calls by region or by phor of the multidimensional cube (Cabibbo &
region and year. Conversely, roll-up allows the user to Torlone, 1998).
view data at a coarser level of granularity.
Multidimensional querying systems are commonly
known as OLAP (Online Analytical Processing) Systems, MAIN THRUST
in contrast to conventional OLTP (Online Transactional
Processing) Systems. The two types have several con- In this section we briefly analyze the techniques pro-
trasting features, although they share the same require- posed to compute data cubes and, more generally, mate-
ment of fast online response times: rialized views containing aggregates. The focus here is on
the exact calculation of the views from scratch. In particu-
Number of records involved: One of the key differ- lar, we do not consider (a) the problems of aggregate view
ences between OLTP and multidimensional queries maintenance in the presence of insertions, deletions and
is the number of records required to calculate the updates (Kotidis & Roussopoulos, 1999; Riedewald,
answer. OLTP queries typically involve a rather Agrawal, & El Abbadi, 2003) and (b) the approximated
limited number of records, accessed through pri- calculation of data cube views (Wu, Agrawal, & El Abbadi,
mary key or other specific indexes, which need to be 2000; Chaudhuri, Das, & Narasayya, 2001), as they are
processed for short, isolated transactions or to be beyond the scope of this paper.
issued on a user interface. In contrast, multidimen-
sional queries usually require the classification and (Efficiently) Computing Data Cubes and
aggregation of a huge amount of data.
Indexing techniques: Transaction processing is
Materialized Views
mainly based on the access of a few records through
primary key or other indexes on highly selective A typical multidimensional query consists of an aggre-
attribute combinations. Efficient access is easily gate group by query applied to the join of the fact table
achieved by well-known and established indexes, with two or more dimension tables. In consequence, it has
particularly B+-tree indexes. In contrast, multidi- the form of an aggregate conjunctive query, for example:
mensional queries require a more articulated ap-
proach, as different techniques are required and SELECT D1.dim1, D2.dim2, AGG(F.measure)
each index performs well only for some categories of FROM fact_table F, dim_table1 D1, dim_table2 D2
queries/aggregation functions (Jrgens & Lenz, 1999). WHERE F.dimKey1 = D1.dimKey1
Current state vs. historical DBs: OLTP operations AND F.dimKey2 = D2.dimKey2
require up-to-date data. Simultaneous information GROUP BY D1.dim1, D2.dim2 (Q1)
access/update is a critical issue and the database
usually represents only the current state of the where AGG is an aggregation function, such as SUM,
system. In OLAP systems, the data does not need MIN, AVG, etc.
to be the most recent available and should in fact be For example, in the above-mentioned phone call data
time-stamped, thus enabling the user to perform warehouse the fact table Phone_calls may have the (sim-

422

TEAM LinG
Efficient Computation of Data Cubes and Aggregate Views

For example, in the above-mentioned phone call data warehouse the fact table Phone_calls may have the (simplified) schema (call_id, territ_id, call_plan_id, duration), and the dimension tables Terr and Call_pl the simplified schemas (territ_id, town, region) and (call_plan_id, call_plan_name) respectively. The aggregation illustrated in Figure 1 would be performed by the following query:

SELECT D1.region, D2.call_plan_name, AVG(F.duration)
FROM Phone_calls F, Terr D1, Call_pl D2
WHERE F.territ_id = D1.territ_id
AND F.call_plan_id = D2.call_plan_id
GROUP BY D1.region, D2.call_plan_name (Q2)

Traditional query processing systems first perform all joins expressed in the FROM and WHERE clause, and only afterwards perform the grouping on the result of the join and the aggregation on each group. The algorithms producing the groups can be broadly classified as techniques based on (a) sorting on the GROUP BY attributes and (b) hashing tables [see Graefe (1993)]. However, there are many common cases where an early evaluation of the GROUP BY is possible. This can significantly reduce calculation time, as it (i) reduces the input size of the join (usually very large in the context of multidimensional databases) and (ii) enables the query processing engine to use indexes to perform the (early) GROUP BY on the base tables.

In Chaudhuri & Shim (1995, 1996), some techniques and applicability conditions are proposed for transforming execution plans into equivalent (more efficient) ones. As is typical in query optimization, the technique is based on pull-up transformations, which delay the execution of a costly operation (e.g., a group by on a large dataset) by moving it towards the root of the query tree, and on push-down transformations, used for example to anticipate an aggregation, thus decreasing the size of a join.
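As a rough illustration of the push-down idea, the following sketch (in Python with pandas, not part of the original discussion) pre-aggregates a toy Phone_calls fact table on its foreign keys before the join with the dimension tables; the table contents are invented for the example, and AVG is carried as a (sum, count) pair so that the early aggregation remains correct.

import pandas as pd

# toy stand-ins for the Phone_calls fact table and its dimension tables
fact = pd.DataFrame({"territ_id": [1, 1, 2], "call_plan_id": [10, 10, 20],
                     "duration": [5.0, 7.0, 3.0]})
terr = pd.DataFrame({"territ_id": [1, 2], "region": ["North", "South"]})
call_pl = pd.DataFrame({"call_plan_id": [10, 20],
                        "call_plan_name": ["Basic", "Gold"]})

# plan without the transformation: join first, then group (the join result may be huge)
naive = (fact.merge(terr).merge(call_pl)
             .groupby(["region", "call_plan_name"])["duration"].mean())

# push-down: pre-aggregate the fact table on its foreign keys, shrinking the join input;
# AVG is decomposed into SUM and COUNT so it can be recombined after the join
pre = (fact.groupby(["territ_id", "call_plan_id"])["duration"]
           .agg(["sum", "count"]).reset_index())
rolled = (pre.merge(terr).merge(call_pl)
             .groupby(["region", "call_plan_name"])[["sum", "count"]].sum())
early = rolled["sum"] / rolled["count"]   # same averages as the naive plan

Both plans return the same result; the second simply joins and scans a much smaller intermediate table when the fact table contains many calls per (territ_id, call_plan_id) pair.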
Several transformations for multidimensional queries are also proposed in Gupta, Harinarayan, & Quass (1995), based on the concept of generalized projection (GP). Transformations enable the optimizer to (i) push a GP down the query tree; (ii) pull a GP up the query tree; (iii) coalesce two GPs into one, or conversely split one GP into two. Query tree transformations are also used in the rewriting process.

In Agarwal et al. (1996) some algorithms are proposed and compared to extend the traditional techniques for GROUP BY query evaluation to CUBE operator processing. These are applicable in the case of distributive aggregate functions and are based on the property that, in this case, higher-level aggregates can be calculated from lower levels.
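The distributive property can be illustrated with a small sketch (Python; the group values are invented and do not come from the article): once the SUM cube has been computed at the (region, call_plan_name) level, the coarser region-level totals follow by adding the finer cells, without rescanning the fact table.

# finer-level SUM aggregates, e.g. produced by query Q2 with SUM instead of AVG
fine = {("North", "Basic"): 120.0, ("North", "Gold"): 30.0, ("South", "Basic"): 55.0}

# SUM is distributive, so the coarser GROUP BY (region only) is derived from the finer one
coarse = {}
for (region, _plan), total in fine.items():
    coarse[region] = coarse.get(region, 0.0) + total
# coarse == {"North": 150.0, "South": 55.0}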
In MOLAP systems, however, the above methods for CUBE calculation are inadequate, as they are substantially based on sorting and hashing techniques, which cannot be applied to multidimensional arrays. In Zhao, Deshpande, & Naughton (1997) an algorithm is proposed to calculate CUBEs in a MOLAP environment. It is shown that this can be made significantly more efficient (and even more so than ROLAP-based calculations) by exploiting the inherently compressed data representation of MOLAP systems. The algorithm particularly benefits from the compactness of multidimensional arrays, enabling the query processor to transfer larger chunks of data to the main memory and efficiently process them.

In many practical cases GROUP BYs at the finest granularity level correspond to sparse data cubes, that is, cubes where a high percentage of points correspond to null values. Consider for instance the CUBE corresponding to the query Number of calls by customer, day and antenna: it is evident that a considerable number of combinations correspond to zero. Fast techniques to compute sparse data cubes are proposed in Ross & Srivastava (1997). They are based on (i) decomposing the fact table into fragments that can be stored in main memory; (ii) computing the data cube in main memory for each fragment; and finally (iii) combining the partial results obtained in (i) and (ii).
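A minimal sketch of this fragment-based approach (Python; the data and function names are hypothetical and not taken from Ross & Srivastava) keeps only the non-empty cells of each fragment's cube and then merges the partial cubes:

from collections import defaultdict

def cube_fragment(rows):
    # count calls per (customer, day, antenna) for one memory-sized fragment
    counts = defaultdict(int)
    for customer, day, antenna in rows:
        # only combinations that actually occur are materialized,
        # so the empty cells of the sparse cube never take up space
        counts[(customer, day, antenna)] += 1
    return counts

def combine(partials):
    # merge the partial cubes computed for the individual fragments
    total = defaultdict(int)
    for part in partials:
        for cell, count in part.items():
            total[cell] += count
    return dict(total)

fragments = [[("c1", "d1", "a1"), ("c1", "d1", "a1"), ("c2", "d1", "a3")],
             [("c1", "d2", "a1")]]
cube = combine(cube_fragment(f) for f in fragments)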
Using Indexes to Improve Efficiency

The most common indexes used in traditional DBMSs are probably B+-trees, a particular form of search tree whose nodes are labeled with index key values and which keeps a list of record identifiers at the leaf level, that is, a list of elements specifying the actual position of each record on the disk. Index key values may consist of one or more columns of the indexed table. OLTP queries usually retrieve a very limited number of tuples (or even a single tuple accessed through the primary key index) and in these cases B+-trees have been demonstrated to be particularly efficient.

In contrast, OLAP queries typically involve the aggregation of large groups of tuples, requiring specifically designed indexing structures. In contrast to the OLTP context, there is no universally good index for multidimensional queries, but rather a variety of techniques, each of which may perform well for specific data types and query forms but be inappropriate for others.

Let us again consider the typical multidimensional query expressed by the SQL query (Q1). The core operations related to its evaluation are: (1) the joins of the fact table with two or more dimension tables, (2) tuple grouping by various dimensional values, and (3) application of an aggregation function to each tuple group. An interesting index type which can be used to efficiently perform operation (1) is the join index; while conventional indexes map column values to records in one table, join indexes map them to records in two (or more) joined tables, thus constituting a particular form of materialized view. Join indexes in their original version cannot be used directly for efficient evaluation of OLAP queries, but can be very effective in combination with other indexing techniques, such as bitmaps and partitioning.
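To give a flavor of what a join index precomputes, the following sketch (Python; the toy tables reuse the phone call example and are invented for illustration) maps each region directly to the identifiers of the Phone_calls rows that join with it:

# toy Phone_calls rows: (territ_id, call_plan_id, duration)
fact = [(1, 10, 5.0), (1, 20, 7.0), (2, 10, 3.0)]
terr = {1: "North", 2: "South"}   # territ_id -> region

# join index: region -> record identifiers of the joining fact rows,
# computed once and maintained like any other materialized structure
join_index = {}
for rid, (territ_id, _plan, _dur) in enumerate(fact):
    join_index.setdefault(terr[territ_id], []).append(rid)

# at query time, the rows of the "North" group are located without performing the join
north_rows = [fact[rid] for rid in join_index["North"]]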


Bitmap indexes (Chan & Ioannidis, 1998) are useful for tuple grouping and for performing some forms of aggregation. In practice, these indexes use a bitmap representation for the list of record identifiers at the tree's leaf level: if table t contains n records, then each leaf of the bitmap index (corresponding to a specific value c of the indexed column C) contains a sequence of n bits, where the i-th bit is set to 1 if ti.C = c, and to zero otherwise. Bitmap representations are indicated when the number of distinct key values is low and several predicates of the form (Column = value) are to be combined in AND/OR, as the operation can be efficiently performed by AND-/OR-ing the corresponding bitmap representations bit by bit. This operation can be performed in parallel and very efficiently on modern processors. Finally, bitmap indexes can be used for fast count query evaluation, as counting can be performed directly on the bitmap representation without even accessing the selected records.
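A tiny sketch (Python; the bit patterns are invented) shows why such predicates and counts are cheap: each distinct value owns one bitmap, predicates become bitwise operations, and the count never touches the records.

# one bitmap per distinct value of the indexed columns; bit i corresponds to record i
plan_is_basic = 0b101101    # records whose call_plan_name = 'Basic'
region_is_north = 0b100111  # records whose region = 'North'

# predicate (call_plan_name = 'Basic' AND region = 'North') as a bitwise AND
matches = plan_is_basic & region_is_north

# fast COUNT(*) evaluation directly on the bitmap, without accessing the selected records
count = bin(matches).count("1")   # 3 matching records here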
In projection indexes (O'Neil & Quass, 1997) the tree access structure is coupled with a sort of materialized view, representing the projection of the table on the indexed column. The technique has some analogies with vertically partitioned tables and is indicated when the aggregate operations need to be performed on one or more indexed columns.

Bit-sliced indexes (O'Neil & Quass, 1997) can be considered as a combination of the two previous techniques. Values of the projected column are encoded and a bitmap is associated with each resulting bit component. The technique has some analogies with bit transposed files (Wong, Li, Olken, Rotem, & Wong, 1986), which were proposed for query evaluation in very large scientific and statistical databases. These indexes work best for SUM and AVG aggregations, but are not well suited for aggregations involving more than one column.

In Chan & Ioannidis (1998, 1999) some variations on the general idea of bitmap indexes are presented. Encoding schemes, time-optimal and space-optimal indexes, and trade-off solutions are studied. A comparison of the use of STR-tree based indexes (a particular form of spatial index) and some variations of bitmap indexes in the context of OLAP range queries can be found in Jürgens & Lenz (1999).

FUTURE TRENDS

The efficiency of many evaluation techniques is strictly related to the adopted query language and storage technique (e.g., MOLAP and ROLAP). This stresses the importance of a standardization process: there is indeed a general consensus on data warehouse key concepts, but a common unified framework for multidimensional data querying is still lacking, particularly a standardized query language independent from the specific storage technique.

Although research results on the exact computation of data cubes in the last few years have been mainly incremental, several interesting techniques have been proposed for the approximated computation, particularly some based on wavelets, which certainly require further investigation.

In contrast to the OLTP context, we have shown that there is no universally good index for multidimensional queries, but rather a variety of techniques, each of which may perform well for specific data types and query forms but be inappropriate for others. Hence, an algorithm of index selection for data warehouses should determine not only the several sets of attributes to be indexed, but also the index type(s) to be used. The definition of an indexing structure enabling a good trade-off in the several cases of interest is also an interesting issue for future research.

CONCLUSION

In this paper we have discussed the main issues related to the computation of data cubes and aggregate materialized views in a data warehouse environment. First of all, the main features of OLAP queries with respect to conventional OLTP queries have been summarized, particularly the number of records involved, the temporal aspects, and the specific indexes. Various techniques for the exact computation of materialized views and data cubes from scratch in both ROLAP and MOLAP environments have been discussed. Finally, the main indexing techniques for OLAP queries and their applicability have been illustrated.

REFERENCES

Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J.F., Ramakrishnan, R., & Sarawagi, S. (1996). On the computation of multidimensional aggregates. In International Conference on Very Large Data Bases (VLDB'96) (pp. 506-521).

Cabibbo, L., & Torlone, R. (1998). From a procedural to a visual query language for OLAP. In International Conference on Scientific and Statistical Database Management (SSDBM'98) (pp. 74-83).

Chan, C.Y., & Ioannidis, Y.E. (1998). Bitmap index design and evaluation. In ACM International Conference on Management of Data (SIGMOD'98) (pp. 355-366).


Chan, C.Y., & Ioannidis, Y.E. (1999). An efficient bitmap encoding scheme for selection queries. In ACM International Conference on Management of Data (SIGMOD'99) (pp. 215-226).

Chaudhuri, S., Das, G., & Narasayya, V. (2001). A robust, optimization-based approach for approximate answering of aggregate queries. In ACM International Conference on Management of Data (SIGMOD'01) (pp. 295-306).

Chaudhuri, S., & Shim, K. (1995). An overview of cost-based optimization of queries with aggregates. Data Engineering Bulletin, 18(3), 3-9.

Chaudhuri, S., & Shim, K. (1996). Optimizing queries with aggregate views. In International Conference on Extending Database Technology (EDBT'96) (pp. 167-182).

Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), 73-170.

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In International Conference on Data Engineering (ICDE'96) (pp. 152-159).

Gupta, A., Harinarayan, V., & Quass, D. (1995). Aggregate-query processing in data warehousing environments. In International Conference on Very Large Data Bases (VLDB'95) (pp. 358-369).

Jagadish, H.V., Lakshmanan, L.V.S., & Srivastava, D. (1999). What can hierarchies do for data warehouses? In International Conference on Very Large Data Bases (VLDB'99) (pp. 530-541).

Jürgens, M., & Lenz, H.J. (1999). Tree based indexes vs. bitmap indexes: A performance study. In International Workshop on Design and Management of Data Warehouses (DMDW'99). Retrieved from http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/paper1.pdf

Kotidis, Y., & Roussopoulos, N. (1999). DynaMat: A dynamic view management system for data warehouses. In ACM International Conference on Management of Data (SIGMOD'99) (pp. 371-382).

Lenzerini, M. (2002). Data integration: A theoretical perspective. In ACM Symposium on Principles of Database Systems (PODS'02) (pp. 233-246).

O'Neil, P.E., & Quass, D. (1997). Improved query performance with variant indexes. In ACM International Conference on Management of Data (SIGMOD'97) (pp. 38-49).

Riedewald, M., Agrawal, D., & El Abbadi, A. (2003). Dynamic multidimensional data cubes. In M. Rafanelli (Ed.), Multidimensional databases: Problems and solutions (pp. 200-221). Hershey, PA: Idea Group Publishing.

Ross, K.A., & Srivastava, D. (1997). Fast computation of sparse datacubes. In International Conference on Very Large Data Bases (VLDB'97) (pp. 116-125).

Vaisman, A.A., & Mendelzon, A.O. (2001). A temporal query language for OLAP: Implementation and a case study. In 8th International Workshop on Database Programming Languages (DBPL 2001) (pp. 78-96).

Wong, H.K.T., Li, J., Olken, F., Rotem, D., & Wong, L. (1986). Bit transposition for very large scientific and statistical databases. Algorithmica, 1(3), 289-309.

Wu, Y., Agrawal, D., & El Abbadi, A. (2000). Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes. In Conference on Information and Knowledge Management (CIKM'00) (pp. 414-421).

Zhao, Y., Deshpande, P., & Naughton, J.F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. In ACM International Conference on Management of Data (SIGMOD'97) (pp. 159-170).

KEY TERMS

Aggregate Materialized View: A materialized view (see below) in which the results of a query containing aggregations (like count, sum, average, etc.) are stored.

B+-Tree: A particular form of search tree in which the keys used to access data are stored in the leaves. Particularly efficient for key-access to data stored in slow memory devices (e.g., disks).

Data Cube: A collection of aggregate values classified according to several properties of interest (dimensions). Combinations of dimension values are used to identify the single aggregate values in the data cube.

Dimension: A property of the data used to classify it and navigate the corresponding data cube. In multidimensional databases dimensions are often organized into several hierarchical levels; for example, a time dimension may be organized into days, months and years.

Drill-Down (Roll-Up): Typical OLAP operation, by which aggregate data are visualized at a finer (coarser) level of detail along one or more analysis dimensions.

Fact (Multidimensional Datum): A single elementary datum in an OLAP system, the properties of which correspond to dimensions and measures.

Fact Table: A table of (integrated) elementary data grouped and aggregated in the multidimensional querying process.


Materialized View: A particular form of query whose answer is stored in the database to accelerate the evaluation of further queries.

Measure: A numeric value obtained by applying an aggregate function (such as count, sum, min, max or average) to groups of data in a fact table.

Multidimensional Query: A query on a collection of multidimensional data, which produces a collection of measures classified according to some specified dimensions.

OLAP System: A particular form of information system specifically designed for processing, managing and reporting multidimensional data.


Embedding Bayesian Networks in Sensor Grids


Juan E. Vargas
University of South Carolina, USA

INTRODUCTION

In their simplest form, sensors are transducers that convert physical phenomena into electrical signals. By combining recent innovations in wireless technology, distributed computing, and transducer design, grids of sensors equipped with wireless communication can monitor large geographical areas. However, just getting the data is not enough. In order to react intelligently to the dynamics of the physical world, advances at the lower end of the computing spectrum are needed to endow sensor grids with some degree of intelligence at the sensor and the network levels. Integrating sensory data into representations conducive to intelligent decision making requires significant effort. By discovering relationships between seemingly unrelated data, efficient knowledge representations, known as Bayesian networks, can be constructed to endow sensor grids with the needed intelligence to support decision making under conditions of uncertainty. Because sensors have limited computational capabilities, methods are needed to reduce the complexity involved in Bayesian network inference. This paper discusses methods that simplify the calculation of probabilities in Bayesian networks and perform probabilistic inference with such a small footprint that the algorithms can be encoded in small computing devices, such as those used in wireless sensors and in personal digital assistants (PDAs).

BACKGROUND

Recent innovations in wireless development, distributed computing, and sensing design have resulted in energy-efficient sensor architectures with some computing capabilities. By spreading a number of smart sensors across a geographical area, wireless ad-hoc sensor grids can be configured as complex monitoring systems that can react intelligently to changes in the physical world. Wireless sensor architectures such as the University of California Berkeley's Motes (Pister, Kahn, & Boser, 1999) are being increasingly used in a variety of fields to

1. Monitor information about enemy movements, explosions, and other phenomena of interest (Vargas & Wu, 2003)
2. Monitor chemical, biological, radiological, nuclear, and explosive (CBRNE) attacks and materials
3. Monitor environmental changes in forests, oceans, and so forth
4. Monitor vehicle traffic
5. Provide security in shopping malls, parking garages, and other facilities
6. Monitor parking lots to determine which spots are occupied and which are free

Typically, sensor networks produce vast amounts of data, which may arrive at any time and may contain noise, missing values, or other types of uncertainty. Therefore, a theoretically sound framework is needed to construct virtual worlds populated by decision-making agents that may receive data from sensing agents or other decision-making entities. Specific aspects of the real world could be naturally distributed and mapped to devices acting as data collectors, data integrators, data analysts, effectors, and so forth. The data collectors would be devices equipped with sensors (optical, acoustical, radars, etc.) to monitor the environment or to provide domain-specific variables relevant to the overall decision-making process. The data analysts would be engaged in low-level data filtering or in high-level decision making. In all cases, a single principle should integrate these activities. Bayesian theory is the preferred methodology, because it offers a good balance between the need for separation at the data source level and the integrative needs at the analysis level (Stone, Lawrence, Barlow, & Corwin, 1999). Another major advantage of Bayesian theory is that data from different measurement spaces can be fused into a rich, unified representation that supports inference under conditions of uncertainty and incompleteness. Bayesian theory offers the following additional advantages:

Robust operational behavior: Multisensor data fusion has an increased robustness when compared to single-sensor data fusion. When one sensor becomes unavailable or is inoperative, other sensors can provide information about the environment.

Extended spatial and temporal coverage: Some parts of the environment may not be accessible to some sensors due to range limitations. This occurs especially when the environment being monitored is


vast. In such scenarios, multiple sensors that are mounted at different locations can maximize the regions of scanning. Multisensor data fusion provides increased temporal coverage, as some sensors can provide information when others cannot.

Increased confidence: Single target location can be confirmed by more than one sensor, which increases the confidence in target detection.

Reduced ambiguity: Joint information from multiple sensors can reduce the set of beliefs about data.

Decreased costs: Multiple, inexpensive sensors can replace expensive single-sensor architectures at a significant reduction of cost.

Improved detection: Integrating measurements from multiple sensors can reduce the signal-to-noise ratio, which ensures improved detection.

Bayesian-based or entropy-based algorithms can then be used to construct efficient data structures, known as Bayesian networks, to represent the relations and the uncertainties in the domain (Pearl, 1988). After the Bayesian networks are created, they can act as hyperdimensional knowledge representations that can be used for probabilistic inference. In situations when data is not as rich, the knowledge representations can still be created from statements of causality and independence formulated by expert opinions. Under this framework, activities such as vehicle control, maneuvering, and scheduling could be planned, and the effectiveness of those plans could be evaluated online as the actions of the plans are executed.

To illustrate these ideas, consider a domain composed of threats, assets, and grids of sensors. Although unmanned vehicles loaded with sensors might be able to detect potential targets and provide data to guide the distribution of assets, integrating and transforming those data into meaningful information that is amenable for intelligent decisions is very demanding. The data must be filtered, the relationships between seemingly unrelated data sets must be determined, and knowledge representations must be created to support wise and timely decisions, because conditions of uncertain and incomplete information are the norm, not the exception. Therefore, a solution is to endow sensors with embedded local and global intelligence, as shown in Figure 1. In the figure, friendly airplanes and tanks, in red, use Bayesian networks (BNs) to make decisions. The BNs are illustrated as red boxes containing graphs. Each node in the graphs corresponds to a variable in the domain. The data for each variable may come from sensors spread in the battlefield. The red nodes are variables related to the friendly resources, and the blue variables to the enemy resources. The dotted red arrows connecting the BNs represent wireless communication between the BNs.

Figure 1. A domain scenario

The goal is to recognize situations locally and globally, identify the available options, and make global and local decisions quickly in order to reduce or eliminate the threats and optimize the use of assets. The task is difficult due to the dynamics and uncertainties in the domain. Threats may change in many ways, targets may move, enemy forces may identify the sensing capabilities and eliminate them, and so forth. In most cases, sensing information will contain noise and most likely will be inaccurate, unreliable, and uncertain. These constraints suggest a distributed, bottom-up approach to match the natural dynamics and uncertainties of the problem. Thus, at the core of this problem, a theoretical framework that effectively balances local and global conditions is needed. Distributed Bayesian networks offer that balance (Xiang, 2002; Valtorta, Kim, & Vomlel, 2002).

MAIN THRUST

Embedding Bayesian Networks in Sensor Grids

Recent advances in the theory of Bayesian network inference (Darwiche, 2003; Castillo, Gutierrez, & Hadi, 1996; Utete, 1998) have resulted in algorithms that can perform probabilistic inference on very small-scale computing devices that are comparable to commercially available PDAs. The algorithms can encode, in real-time, families of polynomial equations representing queries of the type p(e|h) involving sets of variables local to the device and its neighbors.

Using the knowledge representations locally encoded into these devices, larger, distributed systems can be interconnected. The devices can assess their local conditions given local observations and engage with other devices in the system to gain better understanding of the global situation, to obtain more assets, or to convey


information needed by other devices engaged in larger scale tactical decisions. The devices can maintain models of the local parameters surrounding them, making them situationally aware and enabling them to participate as cells of a larger decision-making process. Sensory information might be redirected towards a device with no access to that sensor's information. For example, an airplane might have full access to information about the topology of a certain area, but another vehicle may have access to this information only if it can communicate with the airplane. Global information known by groups of devices can be sent to the relevant devices such that whenever a device gets new input data, the findings are used to determine locally the most probable states for each of the variables within the devices and propagate those determinations to the neighboring devices.

The situation is illustrated in Figure 2, in which the variables x3, x4, and x6 are shared by more than one device. The direction of the dotted red arrows indicates the direction of the flow of evidence.

Figure 2. A collection of three Bayesian networks (sharing the variables x3, x4, and x6)

Symbolic Propagation of Evidence

Methods for exact and approximate propagation of evidence in Bayesian network models are discussed in other sections of this encyclopedia. A common requirement for both types of methods is that all the parameters of the joint probability distribution (JPD) must be known prior to propagation. However, complete specifications might not always be available. This may happen when the number of observations in some combinations of variables is not sufficient to support exact quantifications of conditional probabilities, or in cases when domain experts may only know ranges but may not know the exact values of the parameters. In such cases, symbolic propagation can be used to obtain the values of the unknown parameters, to propagate, and to compute the probabilities of all the relevant variables in a query (Zhaoyu & D'Ambrosio, 1994; Castillo, Gutierrez, & Hadi, 1995; Castillo et al., 1996). Symbolic propagation of evidence consists of asserting polynomials involving the known and missing parameters and the available evidence. After those polynomials are found, customized code generators can produce code to solve the expressions in real-time, obtain the exact values of the parameters, and complete the propagation of evidence.

In general, a node Xi having a conditional probability p(xi | πi) can be expressed as a parametric family of the form

θijπ = p(Xi = j | Πi = π),  j ∈ {0, ..., ri}   (1)

where i refers to the node number, j refers to the state of the node, and π is any instantiation of Πi, the parents of Xi. Assume a JPD given on the binary variables {x1, x2, x3, x4}, as

p(x1, x2, x3, x4) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2, x3)   (2)

The complete set of parameters for the binary variables of a Bayesian network compatible with this JPD is given in Table 1.

Table 1. Complete set of parameters for the binary variables of a Bayesian network compatible with the JPD of Equation 2

X1: θ10 = p(x1 = 0), θ11 = 1 - θ10
X2: θ200 = p(x2 = 0 | x1 = 0), θ210 = 1 - θ200; θ201 = p(x2 = 0 | x1 = 1), θ211 = 1 - θ201
X3: θ300 = p(x3 = 0 | x1 = 0), θ310 = 1 - θ300; θ301 = p(x3 = 0 | x1 = 1), θ311 = 1 - θ301
X4: θ4000 = p(x4 = 0 | x2 = 0, x3 = 0), θ4100 = 1 - θ4000; θ4001 = p(x4 = 0 | x2 = 0, x3 = 1), θ4101 = 1 - θ4001; θ4010 = p(x4 = 0 | x2 = 1, x3 = 0), θ4110 = 1 - θ4010; θ4011 = p(x4 = 0 | x2 = 1, x3 = 1), θ4111 = 1 - θ4011


In general, because

Σj=0,...,ri θijs = 1,

the parameters for categorical variables can be obtained from

θiks = 1 - Σj=0,...,ri; j≠k θijs

Knowing the JPD in Equation 2 and the parameters in Table 1, the probability of any set of nodes can be calculated by extracting the marginal probabilities out of the JPD. Keeping with the example, I obtain p(x2 = 1 | x3 = 0) as follows:

p(x2 = 1 | x3 = 0) = [ Σx1,x4 p(x1, 1, 0, x4) ] / [ Σx1,x2,x4 p(x1, x2, 0, x4) ]   (3)

= (θ10 θ300 - θ10 θ200 θ300 + θ301 - θ10 θ301 - θ201 θ301 + θ10 θ201 θ301) / (θ10 θ300 + θ301 - θ10 θ301)   (4)

As you can see in Equation 4, if all the parameters are known, then the expression can be solved by computing nine multiplications, three additions, four subtractions, and one division. These operations can be performed in real-time by using simple computing devices, such as those available to PDAs or most modern wireless sensor architectures. If, on the other hand, some of the parameters are not available but the topology of the graph is known, then the use of clustering algorithms for exact inference can produce additional polynomial expressions that solve for the unknown parameters. In this case, the entire inference process consists of the following steps (a small numerical sketch follows the list):

1. From the topology of the graph, obtain a junction tree, using clustering methods for exact propagation (Lauritzen & Spiegelhalter, 1988; Jensen, 2001; Olesen, Lauritzen, & Jensen, 1992).
2. Use the junction tree to generate the polynomial expressions that solve for the unknown parameters of conditional independence.
3. If the inference process involves goal-oriented queries, obtain expressions for each parameter, as in Equations 3 and 4, involving the parameters and the query. At this point, all the parameters are known, either numerically or symbolically.
4. Use a code generator to configure the equations resulting from steps 2 and 3.
5. Compute the probabilities by executing the generated expressions.
6. Complete the process by propagating the results to the rest of the network.
real-time by using simple computing devices, such as FUTURE TRENDS
those available to PDAs or most modern wireless sensor
architectures. If, on the other hand, some of the param- Conservative estimates compare the impact of wireless
eters are not available but the topology of the graph is sensors in the next decade to the microprocessor revolu-
known, then the use of clustering algorithms for exact tion of the 1980s. Several million efforts are already on
inference can produce additional polynomial expressions their way: Wal-Mart announced in 2003 that it would
that solve for the unknown parameters. In this case, the require its main suppliers to use RFID tags on all pallets
entire inference process consists of the following steps: and cartons of goods, with an estimated savings of over
$8 billion a year (Goode, 2003). It is expected that the future
1. From the topology of the graph, obtain a junction of the automobile and aircraft industries will heavily
tree, using clustering methods for exact propaga- depend on sensors (Goode, 2004). Innovations in trans-
tion (Lauritzen & Spiegelhalter, 1988; Jensen, 2001; ducers and architectures are making worldwide news
Olesen, Lauritzen, & Jensen, 1992). everyday, as industrial giants such as Intel, Microsoft
2. Use the junction tree to generate the polynomial (Microsoft launches, 2002), and IBM enter the compe-
expressions that solve for the unknown parameters tition with their own products (Goode, 2004). The capabil-
of conditional independence. ity of sensors will continue to increase, and their ability
3. If the inference process involves goal-oriented to process information locally will improve. However,
queries, obtain expressions for each parameter, as in despite the progress at the lower level, the fundamental
Equations 3 and 4, involving the parameters and the issues related to the fusion of data and its analysis at the
query. At this point, all the parameters are known, higher levels are not likely to undergo dramatic improve-
either numerically or symbolically. ments. Bayesian theory, and in particular Bayesian net-
4. Use a code generator to configure the equations works, still offer the best framework for data fusion,
resulting from steps 2 and 3. knowledge discovery, and probabilistic inference. Until


Until now, the main barriers to the use of Bayesian networks in sensor grids have been the computational complexity of the inference process, the lack of efficient methods for learning the graph of a network, and adapting to the dynamics of the domain. The latter two, however, are problems that permeate the field of Bayesian networks and constitute fertile ground for future research.

CONCLUSION

This paper outlines recent work aimed at endowing sensor grids with local and global intelligence. I describe the advantages of knowledge representations known as Bayesian networks and argue that Bayesian networks offer an excellent theoretical foundation to accomplish this goal. The power of Bayesian networks lends them to decomposing a problem into a set of smaller, distributed problem solvers. Each variable in a network could be associated to a different sensing agent represented at the local level as a Bayesian network. The sensing agents could decide if and when to send observations to the other agents. As information flows between the agents, they in turn could decide to send their own observations, their local inferences, or request local or remote agents for more information, given the available evidence. Another advantage of this approach is that it is scalable on demand; more agents can be added as the problem gets bigger. Recovery from loss of parts of the system is possible by introducing a set of redundant observations, each of which can be part of a local solver agent.

REFERENCES

Castillo, E., Gutierrez, J. M., & Hadi, A. S. (1995). Symbolic propagation in discrete and continuous Bayesian networks. In V. Keranen & P. Mitic (Eds.), Proceedings of the First International Mathematica Symposium: Mathematics with vision (pp. 77-84). Southampton, UK: Computational Mechanics Publications.

Castillo, E., Gutierrez, J. M., & Hadi, A. S. (1996). A new method for efficient symbolic propagation in discrete Bayesian networks. Networks, 28, 31-43.

Darwiche, A. (2003). Revisiting the problem of belief revision with uncertain evidence. Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico.

Goode, B. (2003, September). Having wonder full time. Sensors Magazine.

Goode, B. (2004, August). Keys to keynotes. Sensors Magazine.

Jensen, F. V. (2001). Bayesian networks and decision graphs. New York: Springer.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50, 157-224.

Microsoft launches smart personal object technology initiative. (2002, November 17). Retrieved from http://www.microsoft.com/presspass/features/2002/nov02/11-17SPOT.asp

Olesen, K. G., Lauritzen, S. L., & Jensen, F. V. (1992). Hugin: A system creating adaptive causal probabilistic networks. Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence (pp. 223-229), USA.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.

Pister, K. S. J., Kahn, J. M., & Boser, B. E. (1999). Smart dust: Wireless networks of millimeter-scale sensor nodes. Highlight article in Electronics Research Laboratory Research Summary. Retrieved from http://www.xbow.com/Products/Wireless_Sensor_Networks.htm

Sensors Magazine. (Ed.). (2004, August). Best of sensors expo awards [Special issue]. Sensors Magazine.

Stone, C. A., Lawrence, D., Barlow, C. A., & Corwin, T. L. (1999). Bayesian multiple target tracking. Boston: Artech House.

Utete, S. W. (1998). Local information processing for decision making in decentralised sensing networks. Proceedings of the 11th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE) (pp. 667-676), Castellon, Spain.

Valtorta, M., Kim, Y. G., & Vomlel, J. (2002). Soft evidential update for probabilistic multiagent systems. International Journal of Approximate Reasoning, 29(1), 71-106.

Vargas, J. E., Tvarlapati, K., & Wu, Z. (2003). Target tracking with Bayesian estimation. In V. Lesser, C. Ortiz, & M. Tambe (Eds.), Distributed sensor networks. Kluwer Academic Press.

Vargas, J. E., & Wu, Z. (2003). Real-time multiple-target tracking using networked wireless sensors. Proceedings of the Second Conference on Autonomous Intelligent Networks and Systems, Palo Alto, CA.


Xiang, Y. (2002). Probabilistic reasoning in multiagent systems: A graphical models approach. Cambridge: Cambridge University Press.

Zhaoyu, L., & D'Ambrosio, B. (1994). Efficient inference in Bayes networks as a combinatorial optimization problem. International Journal of Approximate Reasoning, 11, 55-81.

KEY TERMS

Bayesian Network: A directed acyclic graph (DAG) that encodes the probabilistic dependencies between the variables within a domain and is consistent with a joint probability distribution (JPD) for that domain. For example, a domain with variables {A,B,C,D}, in which the variables B and C depend on A and the variable D depends on B and C, would have the JPD P(A,B,C,D) = p(A)p(B|A)p(C|A)p(D|B,C) and the graph with arcs A → B, A → C, B → D, and C → D.

Joint Probability Distribution: A function that encodes the probabilistic dependencies among a set of variables in a domain.

Junction Tree: A representation that captures the joint probability distribution of a set of variables in a very efficient data structure. A junction tree contains cliques, each of which is a set of variables from the domain. The junction tree is configured to maintain the probabilistic dependencies of the domain variables and provides a data structure for answering queries of the type "What is the most probable value of variable D given that the values of variables A, B, etc., are known?"

Knowledge Discovery: The process by which new pieces of information can be revealed from a set of data. For example, given a set of data for variables {A,B,C,D}, a knowledge discovery process could discover unknown probabilistic dependencies among variables in the domain, using measures such as Kullback's mutual information, which, for the discrete case, is given by the formula

I(A : B) = Σi Σj p(ai, bj) log [ p(ai, bj) / ( p(ai) p(bj) ) ]

Smart Sensors: Transducers that convert some physical parameters into an electrical signal and are equipped with some level of computing power for signal processing.

Symbolic Propagation in Bayesian Networks: A method that reduces the number of operations required to compute a query in a Bayesian network by factorizing the order of operations in the joint probability distribution.

Wireless Ad-Hoc Sensor Network: A number of sensors spread across a geographical area. Each sensor has wireless communication capability and some computing capabilities for signal processing and data networking.


Employing Neural Networks in Data Mining


Mohamed Salah Hamdi
UAE University, UAE

INTRODUCTION

Data-mining technology delivers two key benefits: (i) a descriptive function, enabling enterprises, regardless of industry or size, in the context of defined business objectives, to automatically explore, visualize, and understand their data and to identify patterns, relationships, and dependencies that impact business outcomes (i.e., revenue growth, profit improvement, cost containment, and risk management); (ii) a predictive function, enabling relationships uncovered and identified through the data-mining process to be expressed as business rules or predictive models. These outputs can be communicated in traditional reporting formats (i.e., presentations, briefs, electronic information sharing) to guide business planning and strategy. Also, these outputs, expressed as programming code, can be deployed or hard-wired into business-operating systems to generate predictions of future outcomes, based on newly generated data, with higher accuracy and certainty.

However, there also are barriers to effective large-scale data mining. Barriers related to the technical aspects of data mining concern issues such as large datasets, highly complex data, high algorithmic complexity, heavy data management demands, and qualification of results (Musick, Fidelis & Slezak, 1997). Barriers related to attitudes, policies, and resources concern potential dangers such as privacy concerns (Kobsa, 2002).

The number of research projects and publications reporting experiences with data mining has been growing steadily. Researchers in many different fields, including database systems, knowledge-base systems, artificial intelligence, machine learning, knowledge acquisition, statistics, spatial databases, data visualization, and Internet computing, have shown great interest in data mining. In this contribution, the focus is on the relationship between artificial neural networks and data mining. We review some of the related literature and report on our own experience in this context.

BACKGROUND

Artificial Neural Networks

Neural networks are patterned after the biological ganglia and synapses of the nervous system. The essential element of the neural network is the neuron. A typical neuron j receives a set of input signals from the other connected neurons, each of which is multiplied by a synaptic weight w_ij (the weight for the connection between neurons i and j). The resulting weighted signals are then summed to produce the activation level for neuron j. Learning is carried out by adjusting the weights in a neural network. Neurons that contribute to the correct answer have their weights strengthened, while other neurons have their weights reduced. Several architectures and error correction algorithms have been developed for neural networks (Haykin, 1999).
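A minimal sketch of such a neuron (Python; the weights, inputs, bias, and the choice of a sigmoid activation are illustrative assumptions, not taken from the article):

import math

def neuron_output(inputs, weights, bias=0.0):
    # weighted sum of the incoming signals (sum of w_ij * x_i) plus a bias,
    # squashed by a sigmoid activation function
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))

print(neuron_output([0.5, 0.1, 0.9], [0.8, -0.4, 0.3]))   # a value between 0 and 1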
In general, neural networks can help where an algorithmic solution cannot be formulated, where lots of examples of the required behavior are available, and where picking out the structure from existing data is needed. Neural networks work by feeding in some input variables and producing some output variables. Therefore, they can be used where some known information is available and some unknown information should be inferred.

Data Mining

The data-mining revolution started in the mid-1990s. It was characterized by the incorporation of existing and already well-established tools and algorithms such as machine learning. In 1995, the International Conference on Knowledge Discovery and Data Mining became the most important annual event for data mining. The framework of data mining also was outlined in many books, such as Advances in Knowledge Discovery and Data Mining (Fayyad et al., 1996). Data-mining conferences like ACM SIGKDD, SPIE, PKDD, and SIAM, and journals like Data Mining and Knowledge Discovery Journal (1997), Journal of Knowledge and Information Systems (1999), and IEEE Transactions on Knowledge and Data Engineering (1989) have become an integral part of the data-mining field.

The trends in data mining over the last few years include OLAP (Online Analytical Processing), data warehousing, association rules, high performance data-mining systems, visualization techniques, and applications of data mining. Recently, new trends have emerged that have great potential to benefit the data-mining field, like XML (eXtensible Markup Language) and XML-related technologies, database products that incorporate data-mining tools, and new developments in the design and implementation of the data-mining process.

Another important data-mining issue is concerned with the relationship between theoretical data-mining research and data-mining applications. Data mining is an exponentially growing field with a strong emphasis on applications.

A further issue of great importance is the research in data-mining algorithms and the discussion of issues of scale (Hearst, 1997). The commonly used tools may not scale up to huge volumes of data. Scalable data-mining tools are characterized by the linear increase of their runtime with the increase of the number of data points within a fixed amount of available memory. An overview of scalable data-mining tools is given in Ganti, Gehrke, and Ramakrishnan (1999). In addition to scalability, robust techniques to model noisy data sets containing an unknown number of overlapping categories are of great importance (Krishnapuram et al., 2001).

MAIN THRUST

Exploiting Neural Networks in Data Mining

How is data mining able to tell you important things that you didn't know or tell you what is going to happen next? The technique that is used to perform these feats is called modeling. Modeling is simply the act of building a model based on data from situations where the answer is known, and then applying the model to other situations where the answers aren't known. Modeling techniques have been around for centuries, but it is only recently that the data storage and communication capabilities required to collect and store huge amounts of data, and the computational power to automate modeling techniques to work directly on the data, have become available. Modeling techniques used for data mining include decision trees, rule induction, genetic algorithms, nearest neighbor, artificial neural networks, and many other techniques (Chen, Han & Yu, 1996; Cios, Pedrycz & Swiniarski, 1998; Hand, Mannila & Smyth, 2000).

Exploiting artificial neural networks as a modeling technique for data mining is considered to be an important direction of research. Neural networks can be applied to a number of data-mining problems, including classification, regression, and clustering, and there are quite a few interesting developments and tools being developed in this field. Lu, Setiono, and Liu (1996) applied neural networks to mine symbolic classification rules from large databases. They report that neural networks were able to deliver a lower error rate and are more robust against noise than decision trees. Ainslie and Drèze (1996) show how effective data mining can be achieved by combining the power of neural networks with the rigor of more traditional statistical tools. They argue that this alliance can generate important synergies. Craven and Shavlik (1997) describe neural network learning algorithms for data mining that are able to produce comprehensible models and that do not require excessive training times. They argue that neural network methods deserve a place in the toolboxes of data-mining specialists. Mitra, Pal, and Mitra (2002) provide a survey of the available literature on data mining using soft computing methodologies, including neural networks. They came to the conclusion that neural networks are suitable in data-rich environments and are typically used for extracting embedded knowledge in the form of rules, quantitative evaluation of these rules, clustering, self-organization, classification, and regression. Vesely (2003) argues that, among the methods of data mining based on neural networks, Kohonen's self-organizing maps are the most promising, because, by using a self-organizing map, one can more easily visualize high-dimensional data. Self-organizing maps also outperform other conventional methods such as the popular Principal Component Analysis (PCA) method for screening analysis of high-dimensional data. Although highly successful in typical cases, PCA suffers from the drawback of being a linear method. Furthermore, real-world data manifolds, besides being nonlinear, often are corrupted by noise and embedded in high-dimensional spaces. Self-organizing maps are more robust against noise and are often used to provide representations that can be analyzed successfully using conventional methods like PCA.

In spite of their excellent performance in concept discovery, neural networks do suffer from some shortcomings. They are sensitive to the network topology, the initial weights, and the selection of attributes. If the number of layers is not selected suitably, the learning efficiency will be affected. Too many irrelevant nodes can cause unnecessary computational expense and overfitting (i.e., the network creates meaningless concepts); randomly selected initial weights sometimes can trap the nets in so-called pitfalls; that is, neural nets stabilize around local minima instead of the global minimum. Background knowledge remains unused in neural nets. The knowledge discovered by nets is not transparent to users. This is perhaps the main failing of neural networks, as they are unintelligible black boxes.

Our own work is focused on mining educational data to assist e-learning in a variety of ways. In the following section, we report on the experience with MASACAD (Multi-Agent System for ACademic ADvising), a data-mining, multi-agent system that advises students using neural networks.


Case Study: Academic Advising of Students

At the UAE University, there is enormous interest in the area of online education. Rigorous steps are being taken toward the creation of the technological infrastructure and the academic infrastructure for the improvement of teaching and learning. MASACAD, the academic advising system described in the following, is to be understood as a tool that mines educational data to support learning.

The general goal of academic advising is to assist students in developing educational plans that are consistent with academic, career, and life goals, and to provide students with information and skills needed to pursue those goals. In order to improve the advising process and make it easier, an intelligent assistant in the form of a computer program will be of great interest. The goal of academic advising, as stated previously, is too general, because many experts are involved and a huge amount of expertise is needed. This makes the realization of such an assistant too difficult, if not impossible. Therefore, in the implemented system, the scope of academic advising was restricted. It was understood as just being intended to provide the student with an opportunity to plan programs of study, select appropriate required and elective classes, and schedule classes in a way that provides the greatest potential for academic success.

Data Mined by the Advising System MASACAD

To extract the academic advice (i.e., to provide the student with a set of appropriate courses he or she should register for in the coming term), MASACAD has to mine a huge amount of educational data available in different formats. The data contain the student profile, which includes the courses already attended, the corresponding grades, the desires of the student concerning the courses to be attended, and much other information. The part of the profile consisting of the courses already attended, the corresponding grades, and so forth, is maintained by the university administration in appropriate databases. The part of the profile consisting of the desires of the student concerning the courses to be attended should be asked for from the student before advising is performed. The data to be mined also include the courses that are offered in the semester for which advising is needed. This information is maintained by the university administration in appropriate Web sites. Finally, a very important component of the data that the system has to mine is expertise. For the problem of academic advising, expertise consists partly of the university laws concerning academic advising. These consist of all the details and regulations concerning courses, programs, and curricula. This kind of information is published in Web pages, in booklets, and in many other forms such as printouts and announcements. The sources of knowledge are many; however, the primary source will be a human expert, who should possess more complex knowledge than can be found in documented sources.

The Advising System MASACAD

MASACAD is a multi-agent system that offers academic advice to students by mining the educational data described in the previous section. It consists of a user system, a grading system, a course announcement system, and a mediation agent. The mediation agent provides the information retrieving service. It moves from the site of one application to another, where it interacts with the agent wrappers. The agent wrappers manage the states of the applications they are wrapped around, invoking them when necessary. The application grading system is a database application for answering queries about the students and the courses they have already taken. The application course announcement system is a Web application for answering queries about the courses that are expected to be offered in the semester for which advising is needed. The application user system is the heart of the advising system, and it is here where intelligence resides. The application gives students the opportunity to express their desires concerning the courses to be attended by choosing among the courses that are offered, initiating a query to obtain advice, and, finally, seeing the results returned by the advising system. The system also alerts the user automatically via e-mail when something changes in the offered courses or in the student profile. The advising procedure suggests courses according to university laws, in a way that provides the greatest potential for academic success, as seen by a human academic advisor. Taking into account the adequacy of the machine-learning approach for data mining, together with the availability of experience with advising students, made the adoption of a paradigm of supervised learning from examples using artificial neural networks interesting. For academic advising, the known information (input variables) consists of the profile of the student and of the offered courses. The unknown information (output variables) consists of the advice expected by the student. In order for the network to be able to infer the unknown information, prior training is needed. Training will integrate the expertise in academic advising into the network. The back-propagation algorithm was used for training the neural network.

Information (i.e., training examples) is gained from information about students and the courses they really took in previous semesters. The selection of these courses


was made, based on the advice of human experts specializing in academic advising. About 250 computer science students in different stages of study were available for the learning procedures. Each one of the 250 examples consisted of a pair of input-output vectors. The input vector summarized all the information needed for advising a particular student (85 real-valued components; each component encodes the information about one of the 85 courses of the curriculum). The output vector encodes the final decision concerning the courses in which the student actually enrolled, based on the advice of the human academic advisor (85 integer-valued components; each component represents a priority value for one of the 85 courses of the curriculum, and a higher priority value indicates a more appropriate course for the student). The aim of the learning phase was to determine the most suitable values for the learning rate, the size of the network (number of neurons, number of hidden layers was set to 2), and the number of training cycles that are needed for the convergence of the network. Many experiments were conducted to obtain these parameters and to test the system. With a network topology of 85-100-100-85 and systematically selected network parameters (50 different experiments chosen carefully were performed to obtain these parameters), the layered, fully connected back-propagation network was able to deliver considerable performance. Fifty students participated in the evaluation of the system. In 92% of the cases, the network was able to produce very appropriate advice according to human experts in academic advising. In the remaining 8% of the cases (4 cases), some unsatisfactory course suggestions were produced by the network (Hamdi, 2004).
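A rough sketch of how such an advising network could be set up is given below. The 85-100-100-85 topology and the input/output encoding follow the description above; the choice of library (scikit-learn), the parameter values, and the hypothetical data files are assumptions made for illustration, not details of the system described in the article.

    # Illustrative sketch only: the 85-100-100-85 topology and the encoding of
    # inputs/outputs follow the article; the library, parameter values, and the
    # way the training data are loaded are assumptions for this example.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    N_COURSES = 85  # one component per course in the curriculum

    # X: (n_students, 85) real-valued course information per student
    # Y: (n_students, 85) priority values assigned by the human advisor
    X = np.load("student_inputs.npy")   # hypothetical file
    Y = np.load("advisor_targets.npy")  # hypothetical file

    # Two hidden layers of 100 neurons, trained with back-propagation (SGD).
    net = MLPRegressor(hidden_layer_sizes=(100, 100),
                       activation="logistic",
                       solver="sgd",
                       learning_rate_init=0.1,  # the article tunes this experimentally; 0.1 is a placeholder
                       max_iter=5000)
    net.fit(X, Y)

    # Advising a new student: courses with the highest predicted priority
    # are suggested first.
    priorities = net.predict(X[:1])[0]
    suggested = np.argsort(priorities)[::-1][:5]
    print("Top-ranked course indices:", suggested)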
FUTURE TRENDS

The rapid growth of business, industrial, and educational data sources has overwhelmed the traditional, interactive approaches to data analysis and created a need for a new generation of tools for intelligent and automated discovery in data. Neural networks are well suited for data-mining tasks, due to their ability to model complex, multidimensional data. As data availability has magnified, so has the dimensionality of problems to be solved, thus limiting many traditional techniques, such as manual examination of the data and some statistical methods. Although there are many techniques and algorithms that can be used for data mining, some of which can be used effectively in combination, neural networks offer many desirable qualities, such as the automatic search of all possible interrelationships among key factors, the automatic modeling of complex problems without prior knowledge of the level of complexity, and the ability to extract key findings much faster than many other tools. As computer systems become faster, the value of neural networks as a data-mining tool will only increase.

CONCLUSION

The amount of raw data stored in databases and the Web is exploding. Raw data by itself, however, does not provide much information. One benefits when meaningful trends and patterns are extracted from the data. Data-mining techniques help to recognize significant facts, relationships, trends, patterns, exceptions, and anomalies that might otherwise go unnoticed. In this contribution, we have seen an example of how neural networks can be used to help mine data about students and courses with the aim of developing educational plans that are consistent with academic, career, and life goals, and providing students with the information and skills needed to pursue those goals. The neural network paradigm seems interesting and viable enough to be used as a data-mining tool.

REFERENCES

Ainslie, A., & Drèze, X. (1996). Data mining: Using neural networks as a benchmark for model building. Decision Marketing, 7, 77-86.

Chen, M.S., Han, J., & Yu, P.S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866-883.

Cios, K.J., Pedrycz, W., & Swiniarski, R.W. (1998). Data mining methods for knowledge discovery. Norwell, MA: Kluwer Academic Publishers.

Craven, M.W., & Shavlik, J.W. (1997). Using neural networks for data mining. Future Generation Computer Systems, 13(2-3), 211-229.

Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data mining. Menlo Park, CA: MIT Press.

Ganti, V., Gehrke, J., & Ramakrishnan, R. (1999). Mining very large databases. IEEE Computer, 32(8), 38-45.

Hamdi, M.S. (2004). MASACAD: A learning multi-agent system that mines the Web to advise students. Proceedings of the International Conference on Internet Computing (IC'04), Las Vegas, Nevada.

Hand, D.J., Mannila, H., & Smyth, P. (2000). Principles of data mining. Cambridge, MA: MIT Press.


Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.

Hearst, M. (1997). Distinguishing between web data mining and information access. The Internet <http://www.sims.berkeley.edu/~hearst/talks/data-mining-panel/index.htm>

Kobsa, A. (2002). Personalized hypermedia and international privacy. Communications of the ACM, 45(5), 64-67.

Krishnapuram, R., Joshi, A., Nasraoui, O., & Yi, L. (2001). Low complexity fuzzy relational clustering algorithms for Web mining. IEEE Transactions on Fuzzy Systems, 9(4), 596-607.

Lu, H., Setiono, R., & Liu, H. (1996). Effective data mining using neural networks. IEEE Transactions on Knowledge and Data Engineering, 8(6), 957-961.

Mitra, S., Pal, S.K., & Mitra, P. (2002). Data mining in soft computing framework: A survey. IEEE Transactions on Neural Networks, 13(1), 3-14.

Musick, R., Fidelis, K., & Slezak, T. (1997). Large-scale data mining pilot project in human genome. Retrieved from http://home.comcast.net/~crmusick/papers/RDOFIS.html

Vesely, A. (2003). Neural networks in data mining. AGRIC.ECON.-CZECH, 49(9), 427-431.

KEY TERMS

Artificial Neural Networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Classification: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as close as possible to one another, and different groups are as far as possible from one another, where distance is measured with respect to the specific variable(s) one is trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "good" and "bad". Supervised classification is when we know the class labels and the number of classes.

Clustering: The process of dividing a dataset as in classification, but the distance is now measured with respect to all available variables. Unsupervised classification is when we do not know the class labels and may not know the number of classes.

Data Mining: The extraction of hidden predictive information from large databases.

Decision Tree: A tree-shaped structure that represents a set of decisions. These decisions generate rules for the classification of a dataset.

Linear Regression: A classic statistical problem is to try to determine the relationship between two random variables X and Y. For example, we might consider the height and weight of a sample of adults. Linear regression attempts to explain this relationship with a straight line fit to the data.

OLAP: Online analytical processing. Refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multidimensional databases.


Enhancing Web Search through Query Log Mining
Ji-Rong Wen
Microsoft Research Asia, China

INTRODUCTION ing pages the user has selected to browse. Therefore, a


query log can be viewed as a valuable source containing
Web query log is a type of file keeping track of the a large amount of users implicit relevance judgments.
activities of the users who are utilizing a search engine. Obviously, these relevance judgments can be used to
Compared to traditional information retrieval setting in more accurately detect users query intentions and im-
which documents are the only information source avail- prove the ranking of search results.
able, query logs are an additional information source in One important assumption behind query log mining is
the Web search setting. Based on query logs, a set of Web that the clicked pages are relevant to the query. Al-
mining techniques, such as log-based query clustering, though the clicking information is not as accurate as
log-based query expansion, collaborative filtering and explicit relevance judgment in traditional relevance feed-
personalized search, could be employed to improve the back, the users choice does suggest a certain degree of
performance of Web search. relevance. In the long run with a large amount of log data,
query logs can be treated as a reliable resource containing
abundant implicit relevance judgments from a statistical
BACKGROUND point of view.

Web usage mining is an application of data mining tech-


niques to discovering interesting usage patterns from MAIN THRUST
Web data, in order to understand and better serve the
needs of Web-based applications. Since the majority of Web Query Log Preprocessing
usage data is stored in Web logs, usage mining is usually
also referred to as log mining. Web logs can be divided Typically, each record in a Web query log includes the IP
into three categories based on the location of data collect- address of the client computer, timestamp, the URL of the
ing: server log, client log, and proxy log. Server log requested item, the type of Web browser, protocol, etc.
provides an aggregate picture of the usage of a service by The Web log of a search engine records various kinds of
all users, while client log provides a complete picture of user activities, such as submitting queries, clicking URLs
usage of all services by a particular client, with the proxy in the result list, getting HTML pages and skipping to
log being somewhere in the middle (Srivastava, Cooley, another result list. Although all these activities reflect,
Deshpande, & Tan, 2000). more or less, a users intention, the query terms and the
Query log mining could be viewed as a special kind of Web pages the user visited are the most important data for
Web usage mining. While there is a lot of work about mining tasks. Therefore, a query session, the basic unit of
mining Website navigation logs for site monitoring, site mining tasks, is defined as a query submitted to a search
adaptation, performance improvement, personalization engine together with the Web pages the user visits in
and business intelligence, there is relatively little work of response to the query.
mining search engines query logs for improving Web Since the HTTP protocol requires a separate connec-
search performance. In early years, researchers have tion for every client-server interaction, the activities of
proved that relevance feedback can significantly improve multiple users usually interleave with each other. There
retrieval performance if users provide sufficient and cor- are no clear boundaries among user query sessions in the
rect relevance judgments for queries (Xu & Croft, 2000). logs, which makes it a difficult task to extract individual
However, in real search scenarios, users are usually query sessions from Web query logs (Cooley, Mobasher,
reluctant to explicitly give their relevance feedback. A & Srivastava, 1999). There are mainly two steps to extract
large amount of users past query sessions have been query sessions from query logs: user identification and
accumulated in the query logs of search engines. Each session identification. User identification is the process
query session records a user query and the correspond- of isolating from the logs the activities associated with an
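The two preprocessing steps introduced here, user identification and session identification, can be sketched as follows: log records are first grouped per user and then split into query sessions whenever the gap between consecutive activities exceeds a timeout, as described in the surrounding discussion. The record field names and the 30-minute cutoff are assumptions made for this example; the article does not prescribe specific values.

    # Illustrative sketch of query-session extraction from a Web query log.
    from collections import defaultdict

    TIMEOUT = 30 * 60  # seconds; a common heuristic, not prescribed by the article

    def extract_query_sessions(records):
        """records: iterable of dicts with 'ip', 'agent', 'timestamp', 'query', 'clicked_url'."""
        # Step 1: user identification (here simply IP address + browser agent).
        per_user = defaultdict(list)
        for r in records:
            per_user[(r["ip"], r["agent"])].append(r)

        sessions = []
        for user, events in per_user.items():
            events.sort(key=lambda r: r["timestamp"])
            current, last_t = [], None
            for r in events:
                # Step 2: session identification via the timeout heuristic.
                if last_t is not None and r["timestamp"] - last_t > TIMEOUT:
                    if current:
                        sessions.append(current)
                    current = []
                current.append(r)
                last_t = r["timestamp"]
            if current:
                sessions.append(current)
        return sessions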


individual user. Activities of the same user could be other on query session. Since queries with the same or
grouped by their IP addresses, agent types, site topolo- similar search intentions may be represented with differ- -
gies, cookies, user IDs, etc. The goal of session identifi- ent words and the average length of Web queries is very
cation is to divide the queries and page accesses of each short, content-based query clustering usually does not
user into individual sessions. Finding the beginning of a perform well.
query session is trivial: a query session begins when a Using query sessions mined from query logs to cluster
user submits a query to a search engine. However, it is queries is proved to be a more promising method (Wen,
difficult to determine when a search session ends. The Nie, & Zhang, 2002). Through query sessions, query
simplest method of achieving this is through a timeout, clustering is extended to query session clustering.
where if the time between page requests exceeds a certain The basic assumption here is that the activities following
limit, it is assumed that the user is starting a new session. a query are relevant to the query and represent, to some
extent, the semantic features of the query. The query text
Log-Based Query Clustering and the activities in a query session as a whole can
represent the search intention of the user more precisely.
Query clustering is a technique aiming at grouping users Moreover, the ambiguity of some query terms is elimi-
semantically (not syntactically) related queries in Web nated in query sessions. For instance, if a user visited a
query logs. Query clustering could be applied to FAQ few tourism Websites after submitting a query Java, it
detecting, index-term selection and query reformulation, is reasonable to deduce that the user was searching for
which are effective ways to improve Web search. First of information about Java Island, not Java programming
all, FAQ detecting means to detect Frequently Asked language or Java coffee. Moreover, query clustering
Questions (FAQs), which can be achieved by clustering and document clustering can be combined and reinforced
similar queries in the query logs. A cluster being made up with each other (Beeferman & Berger, 2000).
of many queries can be considered as a FAQ. Some search
engines (e.g. Askjeeves) prepare and check the correct Log-Based Query Expansion
answers for FAQs by human editors, and a significant
majority of users queries can be answered precisely in Query expansion involves supplementing the original
this way. Second, inconsistency between term usages in query with additional words and phrases, which is an
queries and those in documents is a well-known problem effective way to overcome the term-mismatching problem
in information retrieval, and the traditional way of directly and to improve search performance. Log-based query
extracting index terms from documents will not be effec- expansion is a new query expansion method based on
tive when the user submits queries containing terms query log mining. Taking query sessions in query logs as
different from those in the documents. Query clustering a bridge between user queries and Web pages, probabi-
is a promising technique to provide a solution to the word listic correlations between terms in queries and those in
mismatching problem. If similar queries can be recognized pages can then be established. With these term-term
and clustered together, the resulting query clusters will be correlations, relevant expansion terms can be selected
very good sources for selecting additional index terms for from the documents for a query. For example, a recent work
documents. For example, if queries such as atomic bomb, by Cui, Wen, Nie, and Ma (2003) shows that, from query
Manhattan Project, Hiroshima bomb and nuclear logs, some very good terms, such as personal com-
weapon are put into a query cluster, this cluster, not the puter, Apple Computer, CEO, Macintosh and
individual terms, can be used as a whole to index docu- graphical user interface, can be detected to be tightly
ments related to atomic bomb. In this way, any queries correlated to the query Steve Jobs, and using these
contained in the cluster can be linked to these documents. terms to expand the original query can lead to more
Third, most words in the natural language have inherent relevant pages.
ambiguity, which makes it quite difficult for user to formu- Experiments by Cui, Wen, Nie, and Ma (2003) show
late queries with appropriate words. Obviously, query that mining user logs is extremely useful for improving
clustering could be used to suggest a list of alternative retrieval effectiveness, especially for very short queries
terms for users to reformulate queries and thus better on the Web. The log-based query expansion overcomes
represent their information needs. several difficulties of traditional query expansion meth-
The key problem underlying query clustering is to ods because a large number of user judgments can be
determine an adequate similarity function so that truly extracted from user logs, while eliminating the step of
similar queries can be grouped together. There are mainly collecting feedbacks from users for ad-hoc queries. Log-
two categories of methods to calculate the similarity based query expansion methods have three other impor-
between queries: one is based on query content, and the tant properties. First, the term correlations are pre-
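A simple way to combine the two kinds of evidence discussed here, the query text and the pages clicked in the query session, is a weighted sum of two overlap scores, in the spirit of Wen, Nie, and Zhang (2002). The tokenization, the equal weighting, and the toy data in the sketch below are assumptions made for illustration, not the exact similarity function of that work.

    # Sketch of a combined query-similarity measure for log-based query clustering:
    # part query-content overlap, part overlap of the clicked pages.

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def query_similarity(q1, q2, alpha=0.5):
        """q1, q2: dicts with 'text' (query string) and 'clicks' (set of visited URLs)."""
        content_sim = jaccard(q1["text"].lower().split(), q2["text"].lower().split())
        click_sim = jaccard(q1["clicks"], q2["clicks"])
        return alpha * content_sim + (1 - alpha) * click_sim

    # Queries whose pairwise similarity exceeds a threshold can then be grouped,
    # e.g., with any standard agglomerative clustering routine.
    q_a = {"text": "java island travel", "clicks": {"java-tourism.example/borobudur"}}
    q_b = {"text": "java", "clicks": {"java-tourism.example/borobudur"}}
    print(query_similarity(q_a, q_b))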


computed offline and thus the performance is better than means that objects newly introduced into the system
traditional local analysis methods which need to calculate have not been rated by any users and can therefore not
term correlations on the fly. Second, since user logs be recommended. Due to the absence of recommenda-
contain query sessions from different users, the term tions, users tend to not be interested in these new
correlations can reflect the preference of the majority of objects. This in turn has the consequence that the newly
the users. Third, the term correlations may evolve along added objects remain in their state of not being recom-
with the accumulation of user logs. Hence, the query mendable. Combining collaborative and content-based
expansion process can reflect updated user interests at a filtering is therefore a promising approach to solving the
specific time. above problems (Baudisch, 1999).

Collaborative Filtering and Personalized Personalized Web Search


Web Search
While most modern search engines return identical re-
Collaborative filtering and personalized Web search are sults to the same query submitted by different users,
two most successful examples of personalization on the personalized search targets to return results related to
Web. Both of them heavily rely on mining query logs to users preferences, which is a promising way to alleviate
detect users preferences or intentions. the increasing information overload problem on the Web.
The core task of personalization is to obtain the prefer-
Collaborative Filtering ence of each individual user, which is called user profile.
User profiles could be explicitly provided by users or
Collaborative filtering is a method of making automatic implicitly learned from users past experiences (Hirsh,
predictions (filtering) about the interests of a user by Basu, & Davison, 2000; Mobasher, Cooley, & Srivastava,
collecting taste information from many users (collaborat- 2000). For example, Googles personalized search re-
ing). The basic assumption of collaborative filtering is that quires users to explicitly enter their preferences. In the
users having similar tastes on some items may also have Web context, query log provides a very good source for
similar preferences on other items. From Web logs, a learning user profiles since it records the detailed infor-
collection of items (books, music CDs, movies, etc.) with mation about users past search behaviors. The typical
users searching, browsing and ranking information could method of mining a users profile from query logs is to
be extracted to train prediction and recommendation mod- first collect all of this users query sessions from logs,
els. For a new user with a few items he/she likes or dislikes, and then learn the users preferences on various topics
other items meeting the users taste could be selected based on the queries submitted and the pages viewed by
based on the trained models and recommended to him/her the user (Liu, Yu, & Meng, 2004).
(Shardanand & Maes, 1995; Konstan, Miller, Maltz, User preferences could be incorporated into search
Herlocker, & Gordon, L. R., et al., 1997). Generally, vector engines to personalize both their relevance ranking and
space models and probabilistic models are two major mod- their importance ranking. A general way for personaliz-
els for collaborative filtering (Breese, Heckerman, & Kadie, ing relevance ranking is through query reformulation,
1998). Collaborative filtering systems have been imple- such as query expansion and query modification, based
mented by several e-commerce sites including Amazon. on user profiles. A main feature of modern Web search
Collaborative filtering has two main advantages over engines is that they assign importance scores to pages
content-based recommendation systems. First, content- through mining the link structures and the importance
based annotations of some data types are often not avail- scores will significantly affect the final ranking of search
able (such as video and audio) or the available annotations results. The most famous link analysis algorithms are
usually do not meet the tastes of different users. Second, PageRank (Page, Brin, Motwani, & Winograd, 1998) and
through collaborative filtering, multiple users knowledge HITS (Kleinberg, 1998). One main shortcoming of
and experiences could be remembered, analyzed and shared, PageRank is that it assigns a static importance score to
which is a characteristic especially useful for improving a page, no matter who is searching and what query is
new users information seeking experiences. being used. Recently, a topic-sensitive PageRank algo-
The original form of collaborative filtering does not use rithm (Haveliwala, 2002) and a personalized PageRank
the actual content of the items for recommendation, which algorithm (Jeh & Widom, 2003) are proposed to calculate
may suffers from the scalability, sparsity and synonymy different PageRank scores based on different users
problems. Most importantly, it is incapable of overcoming preferences and interests. A search engine using topic-
the so-called first-rater problem. The first-rater problem sensitive PageRank and Personalized PageRank is ex-
pected to retrieve Web pages closer to user preferences.
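The term-term correlations that drive log-based query expansion can be approximated from query sessions with simple co-occurrence counts, as sketched below. This is a deliberate simplification of the probabilistic model of Cui, Wen, Nie, and Ma (2003); the data structures and the toy sessions are assumptions made for illustration.

    # Simplified sketch of log-based query expansion: query sessions act as a
    # bridge between query terms and terms of the clicked documents.
    from collections import Counter, defaultdict

    def build_correlations(sessions):
        """sessions: iterable of (query_terms, clicked_doc_terms) pairs."""
        cooc = defaultdict(Counter)
        for q_terms, d_terms in sessions:
            for qt in set(q_terms):
                cooc[qt].update(set(d_terms))
        return cooc

    def expand(query_terms, cooc, k=5):
        scores = Counter()
        for qt in query_terms:
            scores.update(cooc.get(qt, Counter()))
        for qt in query_terms:          # do not suggest terms already in the query
            scores.pop(qt, None)
        return [t for t, _ in scores.most_common(k)]

    sessions = [(["steve", "jobs"], ["apple", "computer", "ceo", "macintosh"]),
                (["steve", "jobs"], ["apple", "graphical", "user", "interface"])]
    cooc = build_correlations(sessions)
    print(expand(["steve", "jobs"], cooc))   # e.g. ['apple', 'computer', 'ceo', ...]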


FUTURE TRENDS ACM SIGKDD International Conference on Knowledge


Discovery and Data Mining (pp. 407-416). -
Although query log mining is promising to improve the Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical
information seeking process on the Web, the number of analysis of predictive algorithms for collaborative filter-
publications in this field is relatively small. The main ing. Proceedings of the Fourteenth Conference on Un-
reason might lie in the difficulty of obtaining a large certainty in Artificial Intelligence (pp. 43-52).
amount of query log data. Due to lack of data, especially
lack of a standard data set, it is difficult to repeat and Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data
compare existing algorithms. This may block knowledge preparation for mining World Wide Web browsing pat-
accumulation in the field. On the other hand, search terns. Journal of Knowledge and Information Systems, 1(1).
engine companies are usually reluctant to make pubic of
their log data because of the costs and/or legal issues Cui, H., Wen, J.-R., Nie, J.-Y., & Ma, W.-Y. (2003). Query
involved. However, to create such a kind of standard log expansion by mining user logs. IEEE Transaction on
dataset will greatly benefit both the research community Knowledge and Data Engineering, 15(4), 829-839.
and the search engine industry, and thus may become a Haveliala, T. H. (2002). Topic-sensitive PageRank. Pro-
major challenge of this field in the future. ceeding of the Eleventh World Wide Web conference
Nearly all of the current published experiments were (WWW 2002).
conducted on snapshots of small to medium scale query
logs. In practice, query logs are usually continuously Hirsh, H., Basu, C., & Davison, B. D. (2000). Learning to
generated and with huge sizes. To develop scalable and personalize. Communications of the ACM, 43(8), 102-106.
incremental query log mining algorithms capable of han- Jeh, G., & Widom, J. (2003). Scaling personalized Web
dling the huge and growing dataset is another challenge. search. Proceedings of the Twelfth International World
We also foresee that client-side query log mining and Wide Web Conference (WWW 2003).
personalized search will attract more and more attentions
from both academy and industry. Personalized informa- Kleinberg, J. (1998). Authoritative sources in a hyperlinked
tion filtering and customization could be effective ways to environment. Proceedings of the 9th ACM SIAM Interna-
ease the deteriorated information overload problem. tional Symposium on Discrete Algorithms (pp. 668-677).
Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L.,
Gordon, L. R., & Riedl, J. (1997). GroupLens: applying
CONCLUSION collaborative filtering to Usenet news. Communications
of the ACM, 40(3), 77-87.
Web query log mining is an application of data mining
techniques to discovering interesting knowledge from Liu, F., Yu, C., & Meng, W. (2004). Personalized Web
Web query logs. In recent years, some research work and search for improving retrieval effectiveness. IEEE Transac-
industrial applications have demonstrated that Web log tions on Knowledge and Data Engineering, 16(1), 28-40.
mining is an effective method to improve the performance
of Web search. Web search engines have become the Mobasher, B., Cooley, R., & Srivastava, J. (2000). Auto-
main entries for people to exploit the Web and find needed matic personalization based on Web usage mining. Com-
information. Therefore, to keep improving the perfor- munications of the ACM, 43(8), 142-151.
mance of search engines and to make them more user- Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The
friendly are important and challenging tasks for research- PageRank citation ranking: Bringing order to the Web.
ers and developers. Query log mining is expected to play Technical report of Stanford University.
a more and more important role in enhancing Web search.
Shardanand, U., & Maes, P. (1995). Social information
filtering: Algorithms for automating word of mouth. Pro-
REFERENCES ceedings of the Conference on Human Factors in Com-
puting Systems.
Baudisch, P. (1999). Joining collaborative and content- Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.
based filtering. Proceedings of the Conference on Human (2000). Web usage mining: Discovery and applications of
Factors in Computing Systems (CHI99). usage patterns from Web data. SIGKDD Explorations,
Beeferman, D., & Berger. A. (2000). Agglomerative clus- 1(2), 12-23.
tering of a search engine query log. Proceedings of the 6th


Wen, J.-R., Nie, J.-Y., & Zhang, H.-J. (2002). Query clustering using user logs. ACM Transactions on Information Systems (ACM TOIS), 20(1), 59-81.

Xu, J., & Croft, W.B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems, 18(1), 79-112.

KEY TERMS

Collaborative Filtering: A method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating).

Log-Based Query Clustering: A technique aiming at grouping users' semantically related queries collected in Web query logs.

Log-Based Query Expansion: A new query expansion method based on query log mining. Probabilistic correlations between terms in the user queries and those in the documents can be established through user logs. With these term-term correlations, relevant expansion terms can be selected from the documents for a query.

Log-Based Personalized Search: Personalized search targets to return results related to users' preferences. The core task of personalization is to obtain the preference of each individual user, which could be learned from query logs.

Query Log: A type of file keeping track of the activities of the users who are utilizing a search engine.

Query Log Mining: An application of data mining techniques to discover interesting knowledge from Web query logs. The mined knowledge is usually used to enhance Web search.

Query Session: A query submitted to a search engine together with the Web pages the user visits in response to the query. The query session is the basic unit of many query log mining tasks.


Enhancing Web Search through Web Structure Mining
Ji-Rong Wen
Microsoft Research Asia, China

INTRODUCTION BACKGROUND

The Web is an open and free environment for people to Web structure mining can be further divided into three
publish and get information. Everyone on the Web can categories based on the kind of structured data used.
be either an author, a reader, or both. The language of the
Web, HTML (Hypertext Markup Language), is mainly Web graph mining: Compared to a traditional
designed for information display, not for semantic rep- document set in which documents are indepen-
resentation. Therefore, current Web search engines dent, the Web provides additional information about
usually treat Web pages as unstructured documents, and how different documents are connected to each
traditional information retrieval (IR) technologies are other via hyperlinks. The Web can be viewed as a
employed for Web page parsing, indexing, and search- (directed) graph whose nodes are the Web pages
ing. The unstructured essence of Web pages seriously and whose edges are the hyperlinks between them.
blocks more accurate search and advanced applications There has been a significant body of work on ana-
on the Web. For example, many sites contain structured lyzing the properties of the Web graph and mining
information about various products. Extracting and inte- useful structures from it (Page et al., 1998; Kleinberg,
grating product information from multiple Web sites 1998; Bharat & Henzinger, 1998; Gibson, Kleinberg,
could lead to powerful search functions, such as com- & Raghavan, 1998). Because the Web graph struc-
parison shopping and business intelligence. However, ture is across multiple Web pages, it is also called
these structured data are embedded in Web pages, and interpage structure.
there are no proper traditional methods to extract and Web information extraction (Web IE): In addition,
integrate them. Another example is the link structure of although the documents in a traditional information
the Web. If used properly, information hidden in the retrieval setting are treated as plain texts with no or
links could be taken advantage of to effectively improve few structures, the content within a Web page does
search performance and make Web search go beyond have inherent structures based on the various HTML
traditional information retrieval (Page, Brin, Motwani, and XML tags within the page. While Web content
& Winograd, 1998, Kleinberg, 1998). mining pays more attention to the content of Web
Although XML (Extensible Markup Language) is an pages, Web information extraction has focused on
effort to structuralize Web data by introducing seman- automatically extracting structures with various
tics into tags, it is unlikely that common users are accuracy and granularity out of Web pages. Web
willing to compose Web pages using XML due to its content structure is a kind of structure embedded in
complication and the lack of standard schema defini- a single Web page and is also called intrapage
tions. Even if XML is extensively adopted, a huge amount structure.
of pages are still written in the HTML format and remain Deep Web mining: Besides Web pages that are
unstructured. Web structure mining is the class of meth- accessible or crawlable by following the
ods to automatically discover structured data and infor- hyperlinks, the Web also contains a vast amount of
mation from the Web. Because the Web is dynamic, noncrawlable content. This hidden part of the Web,
massive and heterogeneous, automated Web structure referred to as the deep Web or the hidden Web
mining calls for novel technologies and tools that may (Florescu, Levy, & Mendelzon, 1998), comprises
take advantage of state-of-the-art technologies from various a large number of online Web databases. Compared
areas, including machine learning, data mining, information to the static surface Web, the deep Web contains a
retrieval, and databases and natural language processing. much larger amount of high-quality structured in-


formation (Chang, He, Li, & Zhang, 2003). Auto- weight and a hub weight to solve the first problem and
matically discovering the structures of Web data- combine connectivity and content analysis to solve the
bases and matching semantically related attributes latter two. Chakrabarti, Joshi, and Tawde (2001) ad-
between them is critical to understanding the struc- dressed another problem with HITS: regarding the whole
tures and semantics of the deep Web sites and to page as a hub is not suitable, because a page always
facilitating advanced search and other applications. contains multiple regions in which the hyperlinks point
to different topics. They proposed to disaggregate hubs
into coherent regions by segmenting the DOM (docu-
MAIN THRUST ment object model) tree of an HTML page.

Web Graph Mining PageRank

Mining the Web graph has attracted a lot of attention in The main drawback of the HITS algorithm is that the hubs
the last decade. Some important algorithms have been and authority score must be computed iteratively from
proposed and have shown great potential in improving the query result on the fly, which does not meet the real-
the performance of Web search. Most of these mining time constraints of an online search engine. To over-
algorithms are based on two assumptions. (a) Hyperlinks come this difficulty, Page et al. (1998) suggested using
convey human endorsement. If there exists a link from a random surfing model to describe the probability that
page A to page B, and these two pages are authored by a page is visited and taking the probability as the impor-
different people, then the first author found the second tance measurement of the page. They approximated this
page valuable. Thus the importance of a page can be probability with the famous PageRank algorithm, which
propagated to those pages it links to. (b) Pages that are computes the probability scores in an iterative manner.
co-cited by a certain page are likely related to the same The main advantage of the PageRank algorithm over the
topic. Therefore, the popularity or importance of a page HITS algorithm is that the importance values of all pages
is correlated to the number of incoming links to some are computed off-line and can be directly incorporated
extendt, and related pages tend to be clustered together into ranking functions of search engines.
through dense linkages among them. Noisy link and topic drifting are two main problems
in the classic Web graph mining algorithms. Some links,
Hub and Authority such as banners, navigation panels, and advertisements,
can be viewed as noise with respect to the query topic
In the Web graph, a hub is defined as a page containing and do not carry human editorial endorsement. Also,
pointers to many other pages, and an authority is de- hubs may be mixed, which means that only a portion of
fined as a page pointed to by many other pages. An the hub content may be relevant to the query. Most link
authority is usually viewed as a good page containing analysis algorithms treat each Web page as an atomic,
useful information about one topic, and a hub is usually indivisible unit with no internal structure. This leads to
a good source to locate information related to one false reinforcements of hub/authority and importance
topic. Moreover, a good hub should contain pointers to calculation. Cai, He, Wen, and Ma (2004) used a vision-
many good authorities, and a good authority should be based page segmentation algorithm to partition each
pointed to by many good hubs. Such a mutual reinforce- Web page into blocks. By extracting the page-to-block,
ment relationship between hubs and authorities is taken block-to-page relationships from the link structure and
advantage of by an iterative algorithm called HITS page layout analysis, a semantic graph over the Web can
(Kleinberg, 1998). HITS computes authority scores and be constructed such that each node exactly represents a
hub scores for Web pages in a subgraph of the Web, single semantic topic. This graph can better describe the
which is obtained from the (subset of) search results of semantic structure of the Web. Based on block-level
a query with some predecessor and successor pages. link analysis, they proposed two new algorithms, Block
Bharat and Henzinger (1998) addressed three prob- Level PageRank and Block Level HITS, whose perfor-
lems in the original HITS algorithm: mutually rein- mances are shown to exceed the classic PageRank and
forced relationships between hosts (where certain docu- HITS algorithms.
ments conspire to dominate the computation), auto-
matically generated links (where no humans opinion is Community Mining
expressed by the link), and irrelevant documents (where
the graph contains documents irrelevant to the query Many communities, either in an explicit or implicit
topic). They assign each edge of the graph an authority form, exist in the Web today, and their number is grow-

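The PageRank and HITS algorithms discussed above both work by iterating scores over the Web graph. The sketch below shows the PageRank power iteration on a toy graph; the damping factor of 0.85, the convergence tolerance, and the graph itself are assumptions made for illustration, not values taken from the article. A HITS sketch would look similar, except that it alternates hub and authority updates instead of propagating a single score.

    # Minimal power-iteration sketch of the PageRank idea: a page's importance is
    # the stationary probability that a random surfer visits it.

    def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
        """links: dict mapping page -> list of pages it links to."""
        pages = set(links) | {q for outs in links.values() for q in outs}
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(max_iter):
            new = {p: (1 - d) / n for p in pages}
            for p in pages:
                outs = links.get(p, [])
                if outs:
                    share = d * rank[p] / len(outs)
                    for q in outs:
                        new[q] += share
                else:  # dangling page: spread its score over all pages
                    for q in pages:
                        new[q] += d * rank[p] / n
            if sum(abs(new[p] - rank[p]) for p in pages) < tol:
                return new
            rank = new
        return rank

    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(toy_web))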

ing at a very fast speed. Discovering communities from a analysis, semantic analysis, and discourse analysis. IE
network environment such as the Web has recently be- tools for semistructured pages are different from the -
come an interesting research problem. The Web can be classical ones, as IE utilizes available structured informa-
abstracted into directional or nondirectional graphs with tion, such as HTML tags and page layouts, to infer the
nodes and links. It is usually rather difficult to understand data formats or pages. Such a kind of methods is also
a networks nature directly from its graph structure, par- called wrapper induction (Kushmerick, Weld, &
ticularly when it is a large scale complex graph. Data mining Doorenbos, 1997; Cohen, Hurst, & Jensen, 2002). In
is a method to discover the hidden patterns and knowledge contrast to classic IE approaches, wrapper induction
from a huge network. The mined knowledge could provide operates less dependently of the specific contents of
a higher logical view and more precise insight of the nature Web pages and mainly focuses on page structure and
of a network and will also dramatically decrease the dimen- layout. Existing approaches for Web IE mainly include
sionality when trying to analyze the structure and evolu- the manual approach, supervised learning, and unsuper-
tion of the network. vised learning. Although some manually built wrappers
Quite a lot of work has been done in mining the exist, supervised learning and unsupervised learning
implicit communities of users, Web pages, or scientific are viewed as more promising ways to learn robust and
literature from the Web or document citation database scalable wrappers, because building IE tools manually
using content or link analysis. Several different defini- is not feasible and scalable for the dynamic, massive and
tions of community were also raised in the literature. In diverse Web contents. Moreover, because supervised
Gibson et al. (1998), a Web community is a number of learning still relies on manually labeled sample pages and
representative authority Web pages linked by important thus also requires substantial human effort, unsuper-
hub pages that share a common topic. Kumar, Raghavan, vised learning is the most suitable method for Web IE.
Rajagopalan, and Tomkins (1999) define a Web commu- There have been several successful fully automatic IE
nity as a highly linked bipartite subgraph with at least one tools using unsupervised learning (Arasu & Garcia-
core containing complete bipartite subgraph. In Flake, Molina, 2003; Liu, Grossman, & Zhai, 2003).
Lawrence, and Lee Giles (2000), a set of Web pages that
linked more pages in the community than those outside Deep Web Mining
of the community could be defined as a Web community.
Also, a research community could be based on a single- In the deep Web, it is usually difficult or even impos-
most-cited paper and could contain all papers that cite it sible to directly obtain the structures (i.e. schemas) of the
(Popescul, Flake, Lawrence, Ungar, & Lee Giles, 2000). Web sites backend databases without cooperation from
the sites. Instead, the sites present two other distin-
Web Information Extraction guishing structures, interface schema and result schema,
to users. The interface schema is the schema of the query
Web IE has the goal of pulling out information from a interface, which exposes attributes that can be queried in
collection of Web pages and converting it to a homoge- the backend database. The result schema is the schema
neous form that is more readily digested and analyzed for of the query results, which exposes attributes that are
both humans and machines. The results of IE could be shown to users.
used to improve the indexing process, because IE re- The interface schema is useful for applications, such
moves irrelevant information in Web pages and facili- as a mediator that queries multiple Web databases, be-
tates other advanced search functions due to the struc- cause the mediator needs complete knowledge about the
tured nature of data. The structuralization degrees of search interface of each database. The result schema is
Web pages are diverse. Some pages can be just taken as critical for applications, such as data extraction, where
plain text documents. Some pages contain a little loosely instances in the query results are extracted. In addition
structured data, such as a product list in a shopping page to the importance of the interface schema and result
or a price table in a hotel page. Some pages are organized schema, attribute matching across different schemas is
with more rigorous structures, such as the home pages of also important. First, matching between different inter-
the professors in a university. Other pages have very strict face schemas and matching between different results
structures, such as the book description pages of Amazon, schemas (intersite schema matching) are critical for
which are usually generated by a uniform template. metasearching and data-integration among related Web
Therefore, basically two kinds of Web IE techniques databases. Second, matching between the interface
exist: IE from unstructured pages and IE from schema and the result schema of a single Web database
semistructured pages. IE tools for unstructured pages (intrasite schema matching) enables automatic data an-
are similar to those classical IE tools that typically use notation and database content crawling.
natural language processing techniques such as syntactic

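One way to relate the schemas of two Web databases without relying on attribute labels is to compare the values their attributes actually take. The toy sketch below illustrates this instance-based idea; the overlap score, the greedy one-to-one pairing, and the sample data are simplifications assumed for this example, not the method of the work cited in the following discussion.

    # Toy sketch of instance-based schema matching: attributes from two Web
    # databases are paired by how similar their observed values are, which can
    # work even when attribute labels are missing or uninformative.

    def value_overlap(vals_a, vals_b):
        a = {str(v).strip().lower() for v in vals_a}
        b = {str(v).strip().lower() for v in vals_b}
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def match_attributes(schema_a, schema_b, threshold=0.3):
        """schema_a, schema_b: dicts mapping attribute name -> list of sample values."""
        candidates = sorted(
            ((value_overlap(va, vb), na, nb)
             for na, va in schema_a.items() for nb, vb in schema_b.items()),
            reverse=True)
        matches, used_a, used_b = [], set(), set()
        for score, na, nb in candidates:          # greedy one-to-one assignment
            if score >= threshold and na not in used_a and nb not in used_b:
                matches.append((na, nb, round(score, 2)))
                used_a.add(na)
                used_b.add(nb)
        return matches

    site1 = {"Title": ["Data Mining", "Neural Networks"], "Price": ["$25", "$40"]}
    site2 = {"book": ["data mining", "neural networks"], "cost": ["$25", "$40"]}
    print(match_attributes(site1, site2))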

Most existing schema-matching approaches for Web greatly improve the effectiveness of current Web search
databases primarily focus on matching query interfaces and will enable much more sophisticated Web information
(He & Chang, 2003; He, Meng, Yu, & Wu, 2003; Raghavan retrieval technologies in the future.
& Garcia-Molina, 2001). They usually adopt a label-
based strategy to identify attribute labels from the de-
scriptive text surrounding interface elements and then REFERENCES
find synonymous relationships between the identified
labels. The performance of these approaches may be Arasu, A., & Garcia-Molina, H. (2003). Extracting struc-
affected when no attribute description can be identified tured data from Web pages. Proceedings of the ACM
or when the identified description is not informative. In SIGMOD International Conference on Management
Wang, Wen, Lochovsky, and Ma (2004), an instance- of Data.
based schema-matching approach was proposed to iden-
tify both the interface and result schemas of Web data- Bharat, K., & Henzinger, M. R. (1998). Improved algo-
bases. Instance-based approaches depend on the content rithms for topic distillation in a hyperlinked environ-
overlap or statistical properties, such as data ranges and ment. Proceedings of the 21st Annual International
patterns, to determine the similarity of two attributes. ACM SIGIR Conference.
Thus, they could effectively deal with the cases where
attribute names or labels are missing or not available, Cai, D., He, X., Wen J.-R., & Ma, W.-Y. (2004). Block-
which are common for Web databases. level link analysis. Proceedings of the 27th Annual
International ACM SIGIR Conference.
Chakrabarti, S., Joshi, M., & Tawde, V. (2001). En-
FUTURE TRENDS hanced topic distillation using text, markup tags, and
hyperlinks. Proceedings of the 24th Annual Interna-
It is foreseen that the biggest challenge in the next tional ACM SIGIR Conference (pp. 208-216).
several decades is how to effectively and efficiently dig
out a machine-understandable information and knowl- Chang, C. H., He, B., Li, C., & Zhang, Z. (2003). Structured
edge layer from unorganized and unstructured Web data. databases on the Web: Observations and implications
However, Web structure mining techniques are still in discovery (Tech. Rep. No. UIUCCDCS-R-2003-2321). Ur-
their youth today. For example, the accuracy of Web bana-Champaign, IL: University of Illinois, Department of
information extraction tools, especially those auto- Computer Science.
matically learned tools, is still not satisfactory to meet Cohen, W., Hurst, M., & Jensen, L. (2002). A flexible
the requirements of some rigid applications. Also, deep learning system for wrapping tables and lists in HTML
Web mining is a new area, and researchers have many documents. Proceedings of the 11th World Wide Web
challenges and opportunities to further explore, such as Conference.
data extraction, data integration, schema learning and
matching, and so forth. Moreover, besides Web pages, Flake, G. W., Lawrence, S., & Lee Giles, C. (2000).
various other types of structured data exist on the Web, Efficient identification of Web communities. Proceed-
such as e-mail, newsgroup, blog, wiki, and so forth. ings of the Sixth International Conference on Knowl-
Applying Web mining techniques to extract structures edge Discovery and Data Mining.
from these data types is also a very important future Florescu, D., Levy, A. Y., & Mendelzon, A. O. (1998).
research direction. Database techniques for the World Wide Web: A sur-
vey. SIGMOD Record, 27(3), 59-74.

CONCLUSION Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring


Web communities from link topology. Proceedings of the
Despite the efforts of XML and semantic Web, which Ninth ACM Conference on Hypertext and Hypermedia.
target to bring structures and semantics to the Web, He, B., & Chang, C. C. (2003). Statistical schema matching
Web structure mining is considered a more promising across Web query interfaces. Proceedings of the ACM
way to structuralize the Web due to its characteristics of SIGMOD International Conference on Management of
automation, scalability, generality, and robustness. As a Data.
result, there has been a rapid growth of technologies for
automatically discovering structures from the Web, namely He, H., Meng, W., Yu, C., & Wu, Z. (2003). WISE-Integra-
Web graph mining, Web information extraction, and deep tor: An automatic integrator of Web search interfaces for
Web mining. The mined information and knowledge will


e-commerce. Proceedings of the 29th International Conference on Very Large Data Bases.

Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proceedings of the Ninth ACM SIAM International Symposium on Discrete Algorithms (pp. 668-677).

Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling the Web for emerging cyber-communities. Proceedings of the Eighth International World Wide Web Conference.

Kushmerick, N., Weld, D., & Doorenbos, R. (1997). Wrapper induction for information extraction. Proceedings of the International Joint Conference on Artificial Intelligence.

Liu, B., Grossman, R., & Zhai, Y. (2003). Mining data records in Web pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web (Tech. Rep.). Stanford University.

Popescul, A., Flake, G. W., Lawrence, S., Ungar, L. H., & Lee Giles, C. (2000). Clustering and identifying temporal trends in document databases. Proceedings of the IEEE Conference on Advances in Digital Libraries.

Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden Web. Proceedings of the 27th International Conference on Very Large Data Bases.

Wang, J., Wen, J.-R., Lochovsky, F., & Ma, W.-Y. (2004). Instance-based schema matching for Web databases by domain-specific query probing. Proceedings of the 30th International Conference on Very Large Data Bases.

KEY TERMS

Community Mining: A Web graph mining algorithm to discover communities from the Web graph in order to provide a higher logical view and more precise insight of the nature of the Web.

Deep Web Mining: Automatically discovering the structures of Web databases hidden in the deep Web and matching semantically related attributes between them.

HITS: A Web graph mining algorithm to compute authority scores and hub scores for Web pages.

PageRank: A Web graph mining algorithm that uses the probability that a page is visited by a random surfer on the Web as a key factor for ranking search results.

Web Graph Mining: The mining techniques used to discover knowledge from the Web graph.

Web Information Extraction: The class of mining methods to pull out information from a collection of Web pages and convert it to a homogeneous form that is more readily digested and analyzed by both humans and machines.

Web Structure Mining: The class of methods used to automatically discover structured data and information from the Web.


Ensemble Data Mining Methods


Nikunj C. Oza
NASA Ames Research Center, USA

INTRODUCTION and then combines them in some appropriate manner (e.g.,


averaging or voting). As mentioned earlier, it is important
Ensemble data mining methods, also known as committee to have base models that are competent but also comple-
methods or model combiners, are machine learning meth- mentary. To further motivate this point, consider Figure
ods that leverage the power of multiple models to achieve 1. This figure depicts a classification problem in which the
better prediction accuracy than any of the individual goal is to separate the points marked with plus signs from
models could on their own. The basic goal when design- points marked with minus signs. None of the three indi-
ing an ensemble is the same as when establishing a vidual linear classifiers (marked A, B, and C) is able to
committee of people: Each member of the committee should separate the two classes of points. However, a majority
be as competent as possible, but the members should vote over all three linear classifiers yields the piecewise-
complement one another. If the members are not comple- linear classifier shown as a thick line. This classifier is able
mentary, that is, if they always agree, then the committee to separate the two classes perfectly. For example, the
is unnecessary any one member is sufficient. If the plusses at the top of the figure are correctly classified by
members are complementary, then when one or a few A and B, but are misclassified by C. The majority vote over
members make an error, the probability is high that the these classifiers correctly identifies them as plusses. This
remaining members can correct this error. Research in happens because A and B are very different from C. If our
ensemble methods has largely revolved around designing ensemble instead consisted of three copies of C, then all
ensembles consisting of competent yet complementary three classifiers would misclassify the plusses at the top
models. of the figure, and so would a majority vote over these
classifiers.

BACKGROUND
MAIN THRUST
A supervised machine learning task involves construct-
ing a mapping from input data (normally described by We now discuss the key elements of an ensemble-learn-
several features) to the appropriate outputs. In a classi- ing method and ensemble model and, in the process,
fication learning task, each output is one or more classes discuss several ensemble methods that have been devel-
to which the input belongs. The goal of classification oped.
learning is to develop a model that separates the data into
the different classes, with the aim of classifying new Ensemble Methods
examples in the future. For example, a credit card company
may develop a model that separates people who defaulted The example shown in Figure 1 is an artificial example. We
on their credit cards from those who did not, based on cannot normally expect to obtain base models that
other known information such as annual income. The goal misclassify examples in completely separate parts of the
would be to predict whether a new credit card applicant is input space and ensembles that classify all the examples
likely to default on his or her credit card and thereby correctly. However, many algorithms attempt to generate
decide whether to approve or deny this applicant a new a set of base models that make errors that are as uncorrelated
card. In a regression learning task, each output is a as possible. Methods such as bagging (Breiman, 1994)
continuous value to be predicted (e.g., the average bal- and boosting (Freund & Schapire, 1996) promote diver-
ance that a credit card holder carries over to the next sity by presenting each base model with a different subset
month). of training examples or different weight distributions over
Many traditional machine learning algorithms gener- the examples. For example, in Figure 1, if the plusses in the
ate a single model (e.g., a decision tree or neural network). top part of the figure were temporarily removed from the
Ensemble learning methods instead generate multiple training set, then a linear classifier learning algorithm
models. Given a new example, the ensemble passes it to trained on the remaining examples would probably yield
each of its multiple base models, obtains their predictions, a classifier similar to C. On the other hand, removing the


Figure 1. An ensemble of linear classifiers gating network essentially keeps track of how well each
base model performs in each part of the input space. The -
hope is that each model learns to specialize in different
input regimes and is weighted highly when the input falls
into its specialty. Merzs method uses PCA to lower the
weights of base models that perform well overall but are
redundant and, therefore, effectively give too much weight
to one model. For example, in Figure 1, if an ensemble of
three models instead had two copies of A and one copy of
B, we may prefer to lower the weights of the two copies of
A because, essentially, A is being given too much weight.
Here, the two copies of A would always outvote B, thereby
rendering B useless. Merzs method also increases the
weight on base models that do not perform as well overall
but perform well in parts of the input space, where the
other models perform poorly. In this way, a base models
plusses in the bottom part of the figure would probably unique contributions are rewarded.
yield classifier B, or something similar. In this way, run- When designing an ensemble learning method, in
ning the same learning algorithm on different subsets of addition to choosing the method by which to bring about
training examples can yield very different classifiers, diversity in the base models and choosing the combining
which can be combined to yield an effective ensemble. method, one has to choose the type of base model and
Input decimation ensembles (IDE) (Oza & Tumer, 2001; base model learning algorithm to use. The combining
Tumer & Oza, 2003) and stochastic attribute selection method may restrict the types of base models that can be
committees (SASC) (Zheng & Webb, 1998) instead pro- used. For example, to use average combining in a classi-
mote diversity by training each base model with the same fication problem, one must have base models that can
training examples but different subsets of the input fea- yield probability estimates. This precludes the use of
tures. SASC trains each base model with a random subset linear discriminant analysis or support vector machines,
of input features. IDE selects, for each class, a subset of which cannot return probabilities. The vast majority of
features that has the highest correlation with the presence ensemble methods use only one base model learning
of that class. Each feature subset is used to train one base algorithm but use the methods described earlier to bring
model. However, in both SASC and IDE, all the training about diversity in the base models. Surprisingly, little
patterns are used with equal weight to train all the base work has been done (e.g., Merz, 1999) on creating en-
models. sembles with many different types of base models.
When designing an ensemble learning method, in addition to choosing the method by which to bring about diversity in the base models and choosing the combining method, one has to choose the type of base model and base model learning algorithm to use. The combining method may restrict the types of base models that can be used. For example, to use average combining in a classification problem, one must have base models that can yield probability estimates. This precludes the use of linear discriminant analysis or support vector machines, which cannot return probabilities. The vast majority of ensemble methods use only one base model learning algorithm but use the methods described earlier to bring about diversity in the base models. Surprisingly, little work has been done (e.g., Merz, 1999) on creating ensembles with many different types of base models.

Two of the most popular ensemble learning algorithms are bagging and boosting, which we briefly explain next.

Bagging

Bootstrap aggregating (Bagging) generates multiple bootstrap training sets from the original training set (by using sampling with replacement) and uses each of them to generate a classifier for inclusion in the ensemble. The algorithms for bagging and sampling with replacement are given in Figure 2. In these algorithms, T is the original training set of N examples, M is the number of base models to be learned, Lb is the base model learning algorithm, the hi's are the base models, random_integer(a,b) is a function that returns each of the integers from a to b with equal probability, and I(A) is the indicator function that returns 1 if A is true and 0 otherwise.


Figure 2. Batch bagging algorithm and sampling with replacement

Bagging(T, M)
    For each m = 1, 2, ..., M:
        Tm = Sample_With_Replacement(T, |T|)
        hm = Lb(Tm)
    Return h_fin(x) = argmax_{y in Y} sum_{m=1..M} I(hm(x) = y).

Sample_With_Replacement(T, N)
    S = {}
    For i = 1, 2, ..., N:
        r = random_integer(1, N)
        Add T[r] to S.
    Return S.

Figure 3. Batch boosting algorithm

AdaBoost({(x1, y1), (x2, y2), ..., (xN, yN)}, Lb, M)
    Initialize D1(n) = 1/N for all n in {1, 2, ..., N}.
    For each m = 1, 2, ..., M:
        hm = Lb({(x1, y1), (x2, y2), ..., (xN, yN)}, Dm).
        em = sum of Dm(n) over all n such that hm(xn) != yn.
        If em >= 1/2, then set M = m - 1 and abort this loop.
        Update distribution Dm:
            Dm+1(n) = Dm(n) * 1/(2(1 - em))   if hm(xn) = yn
            Dm+1(n) = Dm(n) * 1/(2em)         otherwise.
    Return h_fin(x) = argmax_{y in Y} sum_{m=1..M} I(hm(x) = y) * log((1 - em)/em).
To create a bootstrap training set from an original training set of size N, we perform N multinomial trials, where in each trial, we draw one of the N examples. Each example has the probability 1/N of being drawn in each trial. The second algorithm shown in Figure 2 does exactly this: N times, the algorithm chooses a number r from 1 to N and adds the rth training example to the bootstrap training set S. Clearly, some of the original training examples will not be selected for inclusion in the bootstrap training set, and others will be chosen one time or more. In bagging, we create M such bootstrap training sets and then generate classifiers using each of them. Bagging returns a function h_fin(x) that classifies new examples by returning the class y that gets the maximum number of votes from the base models h1, h2, ..., hM. In bagging, the M bootstrap training sets that are created are likely to have some differences. If these differences are enough to induce noticeable differences among the M base models while leaving their performances reasonably good, then the ensemble will probably perform better than the base models individually. Breiman (1994) demonstrates that bagged ensembles tend to improve upon their base models more if the base model learning algorithms are unstable, that is, if differences in their training sets tend to induce significant differences in the models. He notes that decision trees are unstable, which explains why bagged decision trees often outperform individual decision trees; however, decision stumps (decision trees with only one variable test) are stable, which explains why bagging with decision stumps tends to not improve upon individual decision stumps.

Boosting

We now explain the AdaBoost algorithm, because it is the most frequently used among all boosting algorithms. AdaBoost generates a sequence of base models with different weight distributions over the training set. The AdaBoost algorithm is shown in Figure 3. Its inputs are a set of N training examples, a base model learning algorithm Lb, and the number M of base models that we wish to combine. AdaBoost was originally designed for two-class classification problems; therefore, for this explanation, we will assume that two possible classes exist. However, AdaBoost is regularly used with a larger number of classes. The first step in AdaBoost is to construct an initial distribution of weights D1 over the training set. This distribution assigns equal weight to all N training examples. We now enter the loop in the algorithm. To construct the first base model, we call Lb with distribution D1 over the training set.¹ After getting back a model h1, we calculate its error e1 on the training set itself, which is just the sum of the weights of the training examples that h1 misclassifies. We require that e1 < 1/2 (this is the weak learning assumption; the error should be less than what we would achieve through randomly guessing the class²). If this condition is not satisfied, then we stop and return the ensemble consisting of the previously generated base models.


If this condition is satisfied, then we calculate a new distribution, D2, over the training examples as follows. Examples that were correctly classified by h1 have their weights multiplied by 1/(2(1-e1)). Examples that were misclassified by h1 have their weights multiplied by 1/(2e1). Note that, because of our condition e1 < 1/2, correctly classified examples have their weights reduced, and misclassified examples have their weights increased. Specifically, examples that h1 misclassified have their total weight increased to 1/2 under D2, and examples that h1 correctly classified have their total weight reduced to 1/2 under D2. We then go into the next iteration of the loop to construct base model h2 using the training set and the new distribution D2. The point is that the next base model will be generated by a weak learner (i.e., the base model will have an error less than 1/2); therefore, at least some of the examples misclassified by the previous base model will have to be correctly classified by the current base model. In this way, boosting forces subsequent base models to correct the mistakes made by earlier models. We construct M base models in this fashion. The ensemble returned by AdaBoost is a function that takes a new example as input and returns the class that gets the maximum weighted vote over the M base models, where each base model's weight is log((1-em)/em), which is proportional to the base model's accuracy on the weighted training set presented to it.

AdaBoost has performed very well in practice and is one of the few theoretically motivated algorithms that has turned into a practical algorithm. However, AdaBoost can perform poorly when the training data is noisy (Dietterich, 2000); that is, the inputs or outputs have been randomly contaminated. Noisy examples are normally difficult to learn. Because of this fact, the weights assigned to noisy examples often become much higher than for the other examples, often causing boosting to focus too much on those noisy examples at the expense of the remaining data. Some work has been done to mitigate the effect of noisy examples on boosting (Oza, 2003, 2004; Ratsch, Onoda, & Muller, 2001).
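The loop just described can be summarized in the following Python sketch of AdaBoost, again assuming a base learner Lb(examples, weights) that returns a model with a predict method; the early stop on a zero-error model is an added guard for the divisions below and is not part of the algorithm in Figure 3.

import math
from collections import defaultdict

def adaboost(examples, Lb, M):
    # examples: list of (x, y) pairs; Lb(examples, weights) -> model with predict(x).
    N = len(examples)
    D = [1.0 / N] * N                                  # initial distribution D1
    models, errors = [], []
    for _ in range(M):
        h = Lb(examples, D)
        # Weighted error: total weight of the training examples that h misclassifies.
        e = sum(D[n] for n, (x, y) in enumerate(examples) if h.predict(x) != y)
        if e == 0.0 or e >= 0.5:                       # guard, or weak-learning assumption violated
            break
        models.append(h)
        errors.append(e)
        # Correctly classified examples shrink to total weight 1/2; misclassified ones grow to 1/2.
        D = [D[n] * (1.0 / (2.0 * (1.0 - e)) if h.predict(x) == y else 1.0 / (2.0 * e))
             for n, (x, y) in enumerate(examples)]
    def h_fin(x):
        votes = defaultdict(float)
        for h, e in zip(models, errors):
            votes[h.predict(x)] += math.log((1.0 - e) / e)
        return max(votes, key=votes.get)
    return h_fin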
FUTURE TRENDS

The fields of machine learning and data mining are increasingly moving away from working on small datasets in the form of flat files that are presumed to describe a single process. They are changing their focus toward the types of data increasingly being encountered today: very large datasets, possibly distributed over different locations, describing operations with multiple regimes of operation, time-series data, online applications (the data is not a time series but nevertheless arrives continually and must be processed as it arrives), partially labeled data, and documents. Research in ensemble methods is beginning to explore these new types of data. For example, ensemble learning traditionally has required access to the entire dataset at once; that is, it performs batch learning. However, this idea is clearly impractical for very large datasets that cannot be loaded into memory all at once. Oza and Russell (2001) and Oza (2001) apply ensemble learning to such large datasets. In particular, this work develops online bagging and boosting; that is, they learn in an online manner. Whereas standard bagging and boosting require at least one scan of the dataset for every base model created, online bagging and online boosting require only one scan of the dataset, regardless of the number of base models. Additionally, as new data arrive, the ensembles can be updated without reviewing any past data. However, because of their limited access to the data, these online algorithms do not perform as well as their standard counterparts. Other work has also been done to apply ensemble methods to other types of data, such as time-series data (Weigend, Mangeas, & Srivastava, 1995). However, most of this work is experimental. Theoretical frameworks that can guide us in the development of new ensemble learning algorithms specifically for modern datasets have yet to be developed.
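To illustrate the one-pass idea, the sketch below follows the spirit of online bagging as described by Oza and Russell (2001), in which each arriving example is shown to each base model k times with k drawn from a Poisson(1) distribution; the incremental-learner interface (an update method) is an assumption made only for this illustration.

import numpy as np

class OnlineBaggingSketch:
    def __init__(self, make_base_model, M):
        # make_base_model() must return a learner with update(x, y) and predict(x) methods.
        self.models = [make_base_model() for _ in range(M)]

    def update(self, x, y):
        # Each example is processed once, as it arrives, and never revisited.
        for model in self.models:
            k = np.random.poisson(1.0)      # how many times this base model "sees" (x, y)
            for _ in range(int(k)):
                model.update(x, y)

    def predict(self, x):
        predictions = [model.predict(x) for model in self.models]
        values, counts = np.unique(predictions, return_counts=True)
        return values[np.argmax(counts)]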
CONCLUSION

Ensemble methods began about 10 years ago as a separate area within machine learning and were motivated by the idea of wanting to leverage the power of multiple models and not just trust one model built on a small training set. Significant theoretical and experimental developments have occurred over the past 10 years and have led to several methods, especially bagging and boosting, being used to solve many real problems. However, ensemble methods also appear to be applicable to current and upcoming problems of distributed data mining and online applications. Therefore, practitioners in data mining should stay tuned for further developments in the vibrant area of ensemble methods. An excellent way to do this is to follow the series of workshops called the International Workshop on Multiple Classifier Systems. This series' balance between theory, algorithms, and applications of ensemble methods gives a comprehensive idea of the work being done in the field.

REFERENCES

Breiman, L. (1994). Bagging predictors (Tech. Rep. 421). Berkeley: University of California, Department of Statistics.


Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40, 139-158.

Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In M. Kaufmann (Ed.), Proceedings of the 13th International Conference on Machine Learning (pp. 148-156). Bari, Italy: Morgan Kaufmann Publishers.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixture of experts and the EM algorithm. Neural Computation, 6, 181-214.

Merz, C. J. (1999). A principal component approach to combining regression estimates. Machine Learning, 36, 9-32.

Oza, N. C. (2001). Online ensemble learning. Unpublished doctoral dissertation, University of California, Berkeley.

Oza, N. C. (2003). Boosting with averaged weight vectors. In T. Windeatt & F. Roli (Eds.), Proceedings of the Fourth International Workshop on Multiple Classifier Systems (pp. 15-24). Guildford, UK: Springer-Verlag.

Oza, N. C. (2004). AveBoost2: Boosting with noisy data. In F. Roli, J. Kittler, & T. Windeatt (Eds.), Proceedings of the Fifth International Workshop on Multiple Classifier Systems (pp. 31-40). Cagliari, Italy: Springer-Verlag.

Oza, N. C., & Russell, S. (2001). Experimental comparisons of online and batch versions of bagging and boosting. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA (pp. 359-364). ACM Press.

Oza, N. C., & Tumer, K. (2001). Input decimation ensembles: Decorrelation through dimensionality reduction. Proceedings of the Second International Workshop on Multiple Classifier Systems, Berlin (pp. 238-247). Springer-Verlag.

Ratsch, G., Onoda, T., & Muller, K. R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287-320.

Tumer, K., & Ghosh, J. (1996). Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4), 385-404.

Tumer, K., & Oza, N. C. (2003). Input decimated ensembles. Pattern Analysis and Applications, 6(1), 65-77.

Weigend, A. S., Mangeas, M., & Srivastava, A. N. (1995). Nonlinear gated experts for time-series: Discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6(4), 373-399.

Zheng, Z., & Webb, G. (1998). Stochastic attribute selection committees. Proceedings of the 11th Australian Joint Conference on Artificial Intelligence (pp. 321-332), Brisbane, Australia. Springer-Verlag.

KEY TERMS

Batch Learning: Learning by using an algorithm that views the entire dataset at once and can access any part of the dataset at any time and as many times as desired.

Decision Tree: A model consisting of nodes that contain tests on a single attribute and branches representing the different outcomes of the test. A prediction is generated for a new example by performing the test described at the root node and then proceeding along the branch that corresponds to the outcome of the test. If the branch ends in a prediction, then that prediction is returned. If the branch ends in a node, then the test at that node is performed and the appropriate branch selected. This continues until a prediction is found and returned.

Ensemble: A function that returns a combination of the predictions of multiple machine learning models.

Machine Learning: The branch of artificial intelligence devoted to enabling computers to learn.

Neural Network: A nonlinear model derived through analogy with the human brain. It consists of a collection of elements that linearly combine their inputs and pass the result through a nonlinear transfer function.

Online Learning: Learning by using an algorithm that only examines the dataset once in order. This paradigm is often used in situations when data arrives continually in a stream and when predictions must be obtainable at any time.

Principal Components Analysis (PCA): Given a dataset, PCA determines the axes of maximum variance. For example, if the dataset were shaped like an egg, then the long axis of the egg would be the first principal component, because the variance is greatest in this direction. All subsequent principal components are found to be orthogonal to all previous components.

ENDNOTES

1. If Lb cannot take a weighted training set, then one can call it with a training set that is generated by sampling with replacement from the original training set according to the distribution Dm.

2. This requirement is perhaps too strict when more than two classes exist. AdaBoost has a multiclass version (Freund & Schapire, 1997) that does not have this requirement. However, the AdaBoost algorithm presented here is often used even when more than two classes exist, if the base model learning algorithm is strong enough to satisfy the requirement.


Ethics of Data Mining


Jack Cook
Rochester Institute of Technology, USA

INTRODUCTION

Decision makers thirst for answers to questions. As more data is gathered, more questions are posed: Which customers are most likely to respond positively to a marketing campaign, product price change, or new product offering? How will the competition react? Which loan applicants are most likely or least likely to default? The ability to raise questions, even those that currently cannot be answered, is a characteristic of a good decision maker. Decision makers no longer have the luxury of making decisions based on gut feeling or intuition. Decisions must be supported by data; otherwise, decision makers can expect to be questioned by stockholders, reporters, or attorneys in a court of law. Data mining can support and often direct decision makers in ways that are often counterintuitive. Although data mining can provide considerable insight, "there is an inherent risk that what might be inferred may be private or ethically sensitive" (Fule & Roddick, 2004, p. 159).

Extensively used in telecommunications, financial services, insurance, customer relationship management (CRM), retail, and utilities, data mining more recently has been used by educators, government officials, intelligence agencies, and law enforcement. It helps alleviate data overload by extracting value from volume. However, data analysis is not data mining. Query-driven data analysis, perhaps guided by an idea or hypothesis, that tries to deduce a pattern, verify a hypothesis, or generalize information in order to predict future behavior is not data mining (Edelstein, 2003). It may be a first step, but it is not data mining. Data mining is the process of discovering and interpreting meaningful, previously hidden patterns in the data. It is not a set of descriptive statistics. Description is not prediction. Furthermore, the focus of data mining is on the process, not a particular technique, used to make reasonably accurate predictions. It is iterative in nature and generically can be decomposed into the following steps: (1) data acquisition through translating, cleansing, and transforming data from numerous sources, (2) goal setting or hypotheses construction, (3) data mining, and (4) validating or interpreting results.

The process of generating rules through a mining operation becomes an ethical issue when the results are used in decision-making processes that affect people or when mining customer data unwittingly compromises the privacy of those customers (Fule & Roddick, 2004). Data miners and decision makers must contemplate ethical issues before encountering one. Otherwise, they risk not identifying when a dilemma exists or making poor choices, since all aspects of the problem have not been identified.

BACKGROUND

Technology has moral properties, just as it has political properties (Brey, 2000; Feenberg, 1999; Sclove, 1995; Winner, 1980). Winner (1980) argues that technological artifacts and systems function like laws, serving as frameworks for public order by constraining individuals' behaviors. Sclove (1995) argues that technologies possess the same kinds of structural effects as other elements of society, such as laws, dominant political and economic institutions, and systems of cultural beliefs. Data mining, being a technological artifact, is worthy of study from an ethical perspective due to its increasing importance in decision making, both in the private and public sectors. Computer systems often function less as background technologies and more as active constituents in shaping society (Brey, 2000). Data mining is no exception. Higher integration of data mining capabilities within applications ensures that this particular technological artifact will increasingly shape public and private policies.

Data miners and decision makers obviously are obligated to adhere to the law. But ethics are oftentimes more restrictive than what is called for by law. Ethics are standards of conduct that are agreed upon by cultures and organizations. Supreme Court Justice Potter Stewart defines the difference between ethics and laws as "knowing the difference between what you have a right to do (legally, that is) and what is right to do." Sadly, a number of IS professionals either lack an awareness of what their company actually does with data and data mining results or purposely come to the conclusion that it is not their concern. They are enablers in the sense that they solve management's problems. What management does with that data or results is not their concern.

Most laws do not explicitly address data mining, although court cases are being brought to stop certain data mining practices. A federal court ruled that using data mining tools to search Internet sites for competitive information may be a crime under certain circumstances (Scott, 2002). In EF Cultural Travel BV vs. Explorica Inc.


(No. 01-2000 1st Cir. Dec. 17, 2001), the First Circuit Court of Appeals in Massachusetts held that Explorica, a tour operator for students, improperly obtained confidential information about how rival EF's Web site worked and used that information to write software that gleaned data about student tour prices from EF's Web site in order to undercut EF's prices (Scott, 2002). In this case, Explorica probably violated the federal Computer Fraud and Abuse Act (18 U.S.C. Sec. 1030). Hence, the source of the data is important when data mining.

Typically, with applied ethics, a morally controversial practice, such as how data mining impacts privacy, is described and analyzed in descriptive terms, and finally moral principles and judgments are applied to it and moral deliberation takes place, resulting in a moral evaluation, and operationally, a set of policy recommendations (Brey, 2000, p. 10). Applied ethics is adopted by most of the literature on computer ethics (Brey, 2000). Data mining may appear to be morally neutral, but appearances in this case are deceiving. This paper takes an applied perspective to the ethical dilemmas that arise from the application of data mining in specific circumstances as opposed to examining the technological artifacts (i.e., the specific software and how it generates inferences and predictions) used by data miners.

MAIN THRUST

Computer technology has redefined the boundary between public and private information, making much more information public. Privacy is the freedom granted to individuals to control their exposure to others. A customary distinction is between relational and informational privacy. Relational privacy is the control over one's person and one's personal environment, and concerns the freedom to be left alone without observation or interference by others. Informational privacy is one's control over personal information in the form of text, pictures, recordings, and so forth (Brey, 2000).

Technology cannot be separated from its uses. It is the ethical obligation of any information systems (IS) professional, through whatever means he or she finds out that the data that he or she has been asked to gather or mine is going to be used in an unethical way, to act in a socially and ethically responsible manner. This might mean nothing more than pointing out why such a use is unethical. In other cases, more extreme measures may be warranted. As data mining becomes more commonplace and as companies push for even greater profits and market share, ethical dilemmas will be increasingly encountered. Ten common blunders that a data miner may cause, resulting in potential ethical or possibly legal dilemmas, are (Skalak, 2001):

1. Selecting the wrong problem for data mining.
2. Ignoring what the sponsor thinks data mining is and what it can and cannot do.
3. Leaving insufficient time for data preparation.
4. Looking only at aggregated results, never at individual records.
5. Being nonchalant about keeping track of the mining procedure and results.
6. Ignoring suspicious findings in a haste to move on.
7. Running mining algorithms repeatedly without thinking hard enough about the next stages of the data analysis.
8. Believing everything you are told about the data.
9. Believing everything you are told about your own data mining analyses.
10. Measuring results differently from the way the sponsor will measure them.

These blunders are hidden ethical dilemmas faced by those who perform data mining. In the next subsections, sample ethical dilemmas raised with respect to the application of data mining results in the public sector are examined, followed briefly by those in the private sector.

Ethics of Data Mining in the Public Sector

Many times, the objective of data mining is to build a customer profile based on two types of data: factual (who the customer is) and transactional (what the customer does) (Adomavicius & Tuzhilin, 2001). Often, consumers object to transactional analysis. What follows are two examples; the first (identifying successful students) creates a profile based primarily on factual data, and the second (identifying criminals and terrorists) primarily on transactional.

Identifying Successful Students

Probably the most common and well-developed use of data mining is the attraction and retention of customers. At first, this sounds like an ethically neutral application. Why not apply the concept of students as customers to the academe? When students enter college, the transition from high school for many students is overwhelming, negatively impacting their academic performance. High school is a highly structured Monday-through-Friday schedule. College requires students to study at irregular hours that constantly change from week to week, depending on the workload at that particular point in the course. Course materials are covered at a faster pace; the duration of a single class period is longer; and subjects are often more difficult.


Tackling the changes in a student's academic environment and living arrangement as well as developing new interpersonal relationships is daunting for students. Identifying students prone to difficulties and intervening early with support services could significantly improve student success and, ultimately, improve retention and graduation rates.

Consider the following scenario that realistically could arise at many institutions of higher education. Admissions at the institute has been charged with seeking applicants who are more likely to be successful (i.e., graduate from the institute within a five-year period). Someone suggests data mining existing student records to determine the profile of the most likely successful student applicant. With little more than this loose definition of success, a great deal of disparate data is gathered and eventually mined. The results indicate that the most likely successful applicant, based on factual data, is an Asian female whose family's household income is between $75,000 and $125,000 and who graduates in the top 25% of her high school class. Based on this result, admissions chooses to target market such high school students. Is there an ethical dilemma? What about diversity? What percentage of limited marketing funds should be allocated to this customer segment? This scenario highlights the importance of having well-defined goals before beginning the data mining process. The results would have been different if the goal were to find the most diverse student population that achieved a certain graduation rate after five years. In this case, the process was flawed fundamentally and ethically from the beginning.

Identifying Criminals and Terrorists

The key to the prevention, investigation, and prosecution of criminals and terrorists is information, often based on transactional data. Hence, government agencies increasingly desire to collect, analyze, and share information about citizens and aliens. However, according to Rep. Curt Weldon (R-PA), chairman of the House Subcommittee on Military Research and Development, there are 33 classified agency systems in the federal government, but none of them link their raw data together (Verton, 2002). As Steve Cooper, CIO of the Office of Homeland Security, said, "I haven't seen a federal agency yet whose charter includes collaboration with other federal agencies" (Verton, 2002, p. 5). Weldon lambasted the federal government for failing to act on critical data mining and integration proposals that had been authored before the terrorists' attacks on September 11, 2001 (Verton, 2002).

Data to be mined is obtained from a number of sources. Some of these are relatively new and unstructured in nature, such as help desk tickets, customer service complaints, and complex Web searches. In other circumstances, data miners must draw from a large number of sources. For example, the following databases represent some of those used by the U.S. Immigration and Naturalization Service (INS) to capture information on aliens (Verton, 2002):

Employment Authorization Document System
Marriage Fraud Amendment System
Deportable Alien Control System
Reengineered Naturalization Application Casework System
Refugees, Asylum, and Parole System
Integrated Card Production System
Global Enrollment System
Arrival Departure Information System
Enforcement Case Tracking System
Student and Schools System
General Counsel Electronic Management System
Student Exchange Visitor Information System
Asylum Prescreening System
Computer-Linked Application Information Management System (two versions)
Non-Immigrant Information System

There are islands of excellence within the public sector. One such example is the U.S. Army's Land Information Warfare Activity (LIWA), which is credited with having "one of the most effective operations for mining publicly available information in the intelligence community" (Verton, 2002, p. 5).

Businesses have long used data mining. However, recently, governmental agencies have shown growing interest in using data mining in national security initiatives (Carlson, 2003, p. 28). Two government data mining projects, the latter renamed by the euphemism "factual data analysis," have been under scrutiny (Carlson, 2003). These projects are the U.S. Transportation Security Administration's (TSA) Computer Assisted Passenger Prescreening System II (CAPPS II) and the Defense Advanced Research Projects Agency's (DARPA) Total Information Awareness (TIA) research project (Gross, 2003). TSA's CAPPS II will analyze the name, address, phone number, and birth date of airline passengers in an effort to detect terrorists (Gross, 2003). James Loy, director of the TSA, stated to Congress that, with CAPPS II, the percentage of airplane travelers going through extra screening is expected to drop significantly from 15% that undergo it today (Carlson, 2003). Decreasing the number of false positive identifications will shorten lines at airports.

TIA, on the other hand, is a set of tools to assist agencies such as the FBI with data mining. It is designed to detect extremely rare patterns.


The program will include "terrorism scenarios based on previous attacks, intelligence analysis, war games in which clever people imagine ways to attack the United States and its deployed forces," testified Anthony Tether, director of DARPA, to Congress (Carlson, 2003, p. 22). When asked how DARPA will ensure that personal information caught in TIA's net is correct, Tether stated that "we're not the people who collect the data. We're the people who supply the analytical tools to the people who collect the data" (Gross, 2003, p. 18). Critics of data mining say that while the technology is guaranteed to invade personal privacy, it is not certain to enhance national security. Terrorists do not operate under discernable patterns, critics say, and therefore the technology will likely be targeted primarily at innocent people (Carlson, 2003, p. 22). Congress voted to block funding of TIA. But privacy advocates are concerned that the TIA architecture, dubbed "mass dataveillance," may be used as a model for other programs (Carlson, 2003).

Systems such as TIA and CAPPS II raise a number of ethical concerns, as evidenced by the overwhelming opposition to these systems. One system, the Multistate Anti-TeRrorism Information EXchange (MATRIX), represents how data mining has a bad reputation in the public sector. MATRIX is self-defined as "a pilot effort to increase and enhance the exchange of sensitive terrorism and other criminal activity information between local, state, and federal law enforcement agencies" (matrix-at.org, accessed June 27, 2004). Interestingly, MATRIX states explicitly on its Web site that it is not a data-mining application, although the American Civil Liberties Union (ACLU) openly disagrees. At the very least, the perceived opportunity for creating ethical dilemmas and ultimately abuse is something the public is very concerned about, so much so that the project felt that the disclaimer was needed. Due to the extensive writings on data mining in the private sector, the next subsection is brief.

Ethics of Data Mining in the Private Sector

Businesses discriminate constantly. Customers are classified, receiving different services or different cost structures. As long as discrimination is not based on protected characteristics such as age, race, or gender, discriminating is legal. Technological advances make it possible to track in great detail what a person does. Michael Turner, executive director of the Information Services Executive Council, states, "For instance, detailed consumer information lets apparel retailers market their products to consumers with more precision. But if privacy rules impose restrictions and barriers to data collection, those limitations could increase the prices consumers pay when they buy from catalog or online apparel retailers by 3.5% to 11%" (Thibodeau, 2001, p. 36). Obviously, if retailers cannot target their advertising, then their only option is to mass advertise, which drives up costs.

With this profile of personal details comes a substantial ethical obligation to safeguard this data. Ignoring any legal ramifications, the ethical responsibility is placed firmly on IS professionals and businesses, whether they like it or not; otherwise, they risk lawsuits and harming individuals. The data industry has come under harsh review. There is a raft of federal and local laws under consideration to control the collection, sale, and use of data. American companies have yet to match the tougher privacy regulations already in place in Europe, while personal and class-action litigation against businesses over data privacy issues is increasing (Wilder & Soat, 2001, p. 38).

FUTURE TRENDS

Data mining traditionally was performed by a trained specialist, using a stand-alone package. This once nascent technique is now being integrated into an increasing number of broader business applications and legacy systems used by those with little formal training, if any, in statistics and other related disciplines. Only recently has privacy and data mining been addressed together, as evidenced by the fact that the first workshop on the subject was held in 2002 (Clifton & Estivill-Castro, 2002). The challenge of ensuring that data mining is used in an ethically and socially responsible manner will increase dramatically.

CONCLUSION

Several lessons should be learned. First, decision makers must understand key strategic issues. The data miner must have an honest and frank dialog with the sponsor concerning objectives. Second, decision makers must not come to rely on data mining to make decisions for them. The best data mining is susceptible to human interpretation. Third, decision makers must be careful not to explain away with intuition data mining results that are counterintuitive. Decision making inherently creates ethical dilemmas, and data mining is but a tool to assist management in key decisions.


REFERENCES

Adomavicius, G., & Tuzhilin, A. (2001). Using data mining methods to build customer profiles. Computer, 34(2), 74-82.

Brey, P. (2000). Disclosive computer ethics. Computers and Society, 30(4), 10-16.

Carlson, C. (2003a). Feds look at data mining. eWeek, 20(19), 22.

Carlson, C. (2003b). Lawmakers will drill down into data mining. eWeek, 20(13), 28.

Clifton, C., & Estivill-Castro, V. (Eds.). (2002). Privacy, security and data mining. Proceedings of the IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, Maebashi City, Japan.

Edelstein, H. (2003). Description is not prediction. DM Review, 13(3), 10.

Feenberg, A. (1999). Questioning technology. London: Routledge.

Fule, P., & Roddick, J.F. (2004). Detecting privacy and ethical sensitivity in data mining results. Proceedings of the 27th Conference on Australasian Computer Science, Dunedin, New Zealand.

Gross, G. (2003). U.S. agencies defend data mining plans. ComputerWorld, 37(19), 18.

Sclove, R. (1995). Democracy and technology. New York: Guilford Press.

Scott, M.D. (2002). Can data mining be a crime? CIO Insight, 1(10), 65.

Skalak, D. (2001). Data mining blunders exposed! 10 data mining mistakes to avoid making today. DB2 Magazine, 6(2), 10-13.

Thibodeau, P. (2001). FTC examines privacy issues raised by data collectors. ComputerWorld, 35(13), 36.

Verton, D. (2002a). Congressman says data mining could have prevented 9-11. ComputerWorld, 36(35), 5.

Verton, D. (2002b). Database woes thwart counterterrorism work. ComputerWorld, 36(49), 14.

Wilder, C., & Soat, J. (2001). The ethics of data. Information Week, 1(837), 37-48.

Winner, L. (1980). Do artifacts have politics? Daedalus, 109, 121-136.

KEY TERMS

Applied Ethics: The study of a morally controversial practice, whereby the practice is described and analyzed, and moral principles and judgments are applied, resulting in a set of recommendations.

Ethics: The study of the general nature of morals and values as well as specific moral choices; it also may refer to the rules or standards of conduct that are agreed upon by cultures and organizations that govern personal or professional conduct.

Factual Data: Data that include demographic information such as name, gender, and birth date. It also may contain information derived from transactional data such as someone's favorite beverage.

Factual Data Analysis: Another term for data mining, often used by government agencies. It uses both factual and transactional data.

Informational Privacy: The control over one's personal information in the form of text, pictures, recordings, and such.

Mass Dataveillance: Suspicion-less surveillance of large groups of people.

Relational Privacy: The control over one's person and one's personal environment.

Transactional Data: Data that contains records of purchases over a given period of time, including such information as date, product purchased, and any special requests.


Ethnography to Define Requirements and Data Model
Gary J. DeLorenzo
Robert Morris University, USA

INTRODUCTION

Ethnographic research offers an orientation to understand the process and structure of a social setting and employs research techniques consistent with this orientation. The ethnographic study is rooted in gaining an understanding of cultural knowledge used to interpret the experience and behavior patterns of the researched participants. The ethnographic method aids the researcher in identifying descriptions and in interpreting language, terms, and meanings as the basis of defining user requirements for information systems.

In the business sector, enterprises that use third-party application providers to support day-to-day operations have limited independence to access, review, and analyze the data needed for strategic and tactical decision making. Because of security and firewall protection controls maintained by the provider, enterprises do not have direct access to the data needed to build analytical reporting solutions for decision support.

This article reports a unique methodology used to define the conceptual model of one enterprise's (PPG Industries, Inc.) business needs to reduce working capital when using application service providers (ASPs).

BACKGROUND

The Methodology

Ethnographic research is the study of both explicit and tacit knowledge within a culture. The technique is based on gaining an understanding of the acquired knowledge people use to interpret experience and generate behavior. Whereas explicit cultural knowledge can be communicated at a conscious level and with relative ease, tacit cultural knowledge remains largely outside of people's awareness (Creswell, 1994).

The ethnographic technique aids the researcher in identifying cultural descriptions and uncovering meaning to language within an organization. From my 12 years of experience at PPG Industries, Inc. (PPG), a U.S.-based $8 billion per year manufacturing company specializing in coatings, chemicals, glass and fiberglass, the normal workflow process to define user requirements for decision support is through a quick, low cost, rapid application development approach.

To gain an in-depth understanding of the culture and knowledge base of the Credit Services Department, the ethnographic research technique was selected to construct an understanding to questions such as "why do they do what they do?" and, more importantly for me, "what reporting solutions are needed to understand customer payments so that actions can be prioritized accordingly?" The ability to answer the second question was critical in order to define the conceptual model and information system specifications based on Credit's user requirements. While using a third-party vendor solution through an ASP model, another challenge was in gaining access to data that was not readily accessible (Boyd, 2001).

Why the Ethnographic Technique?

Ethnographic research is based on understanding the language, terms, and meaning within a culture (Agar, 1980). Ethnography is an appealing approach, because it promotes an understanding of process, terms and meaning, and beliefs of the people in a particular culture. Interviews, discussions, participant observation, and social interaction would help me attain an understanding of a culture. As opposed to the quantitative approach, which reflects on researching a situation, problem, or hypothesis at a given point in time, the qualitative approach using ethnographic techniques is more of an evolutionary process over time, which covers months, if not years, for the researcher. The time consideration is important because of the ever changing and evolving environment of the enterprise. New business plans, direction, and project activities frequently change within a business. Emerging technology processes such as RAD (rapid application development) to streamline the process of developing and implementing information systems continue to change the way information systems are created.


MAIN THRUST

Cultural Setting and Contextual Baseline

To gain an understanding of what the Credit Services Department actually does, an overview of the specific department goals, objectives, and responsibilities are noted in this section and called the contextual baseline, which uses the native, emic terms of the department. A contextual baseline identifies the driving motivators of Credit Services to understand what people do and how they accomplish their jobs. The General Credit Manager noted how the Credit Services Department is viewed as "the financial conscience of PPG in the relationship between sales and receiving payments against those sales by our customers" (Camsuzou, 2001). The Credit Services Department monitors credit ratings to ensure that customers have the financial stability to pay invoices while also tracking and researching with outside banking services on the application of payments against invoices.

Field Work

The interview process included different informants throughout PPG, with informants from strategic associates (business unit directors), tactical members (credit managers, business unit managers), and operational members (users with the business units). Fourteen individuals from throughout the enterprise were interviewed as key informants. Also, general observations were captured in the field notebook by sitting and working on a daily basis in the Credit Department area (over 50 employees) from June, 2001 through December, 2001.

Geertz (1973) suggests that qualitative research often is presented as working up from data, whereas in ethnography the techniques and procedures do not define the enterprise; what defines it is the intellectual effort borrowed from Ryle called thick descriptions (p. 6). Thick descriptions are a reference to getting an understanding and insight into a culture through the uncovering of facts, at a very detailed level. It is the ability to gain a truer, real understanding of a culture through an interpretation of language, symbols, and signs. Geertz (1973) summarized that the challenge of the ethnographer does not rest on the ability to capture facts and carry them home like a mask or carving, but on the degree to which he is able to clarify what goes on in such places, to reduce the puzzlement where unfamiliar acts naturally give rise to meaning (p. 17). Ethnography is a technique based on a theory where the understanding of a culture can be gained through the analysis of the cumulative, unstructured, and disconnected findings captured during the field research. These findings are assembled into a coherent sequence of bold, structured, and related stories.

Initial Themes

According to Spradley (1979), the next step is the identification of initial themes. The themes are thin descriptions of high-level, overview insights toward a culture. Geertz (1973) states that systematic modes of assessment are needed in interpretative approaches; for a field of study to assert itself as a science, the conceptual structure of a cultural interpretation should be formulable (p. 24). He continues to state that theoretical formulations tend to start as general interpretations, but the scientific, theoretical process pushing forward should include the essential task of theory building, not to codify abstract regularities, but to make thick descriptions possible (p. 26). This theory-building approach, from an overview perspective, helped me to define the initial themes of the Credit Services Department.

After analyzing the transcribed interviews, field notes, and results from categorizing the field-work terms, three general themes began to surface. The themes, also called categories by Spradley (1979), are the following: (1) At the enterprise level of PPG, there is a strategic mission to reduce working capital through improved cash flow. (2) At the functional level for both the Credit Services Department and the business units, there is a tactical vision to decrease daily sales outstanding. (3) In pursuing the first two themes, information systems are used to track customer payment patterns and to monitor incoming cash flow.

Domain Analysis

For Spradley (1979), a domain structure includes the terms that belong to a category and the semantic relationship of terms and categories that are linked together (p. 100). The process of domain identification included contextualizing the terms and categories, drawing overview diagrams of terms and categories, and identifying how the terms and categories related to each other. Spradley (1979) summarized domain analysis as the process to gain a better understanding, from the gathered field data, of the semantic relationships of terms and categories (p. 107).

Further review of the language to identify the thick descriptions and rich points for PPG to reduce working capital and decrease sales outstanding resulted in defining 43 critical and essential terms. These terms were used as a baseline to define the domain analysis. The semantic relationships are links between two categories that have boundaries either inside or outside of the domain. The domain analysis identifies the relationships among categories.


The domain analysis concept also is used in information systems to show how data relate to each other. Peter Chen (1976) proposed the Entity-Relationship (ER) model as a way to define and unify relational database views, and it can be found in popular relational database management systems such as Oracle and Microsoft SQL Server. In the entity-relationship model, tables are shown as entities, data fields are shown as attributes, and the line shows how the entities and attributes relate to each other. Entities are categories such as a person, place, or thing. The domain analysis identified in Figure 1 is based on the initial themes of reducing working capital and decreasing sales outstanding, where rectangles identify primary domain entities for both the internal PPG associates and external PPG customers.
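As a purely illustrative aside that is not part of the original study, the following Python sketch shows how two of the domains in Figure 1 might be represented as entities with attributes and a one-to-many relationship, in the spirit of the entity-relationship model; the entity names and fields are hypothetical simplifications.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Customer:
    # Entity from the external customer audience; attributes are hypothetical examples.
    customer_id: int
    name: str
    credit_limit: float
    invoices: List["Invoice"] = field(default_factory=list)   # one-to-many relationship

@dataclass
class Invoice:
    # Entity for a statement/invoice presented for payment (hypothetical attributes).
    invoice_id: int
    customer_id: int        # foreign key linking the invoice back to its customer
    amount: float
    paid_in_full: bool = False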
Taxonomic Analysis

Further analysis of the fields in the domain analysis included a review of the specific terms that were considered critical because of the frequency and consistency with which these terms were used in the language. These terms eventually became the baseline data field attributes required for the decision support system model. A taxonomic analysis is a technique used to capture the essential terms, also called fields, in a hierarchical fashion to define not only the relationship among the fields but also to understand what terms are a subset meaning to a grander, higher term. The design of the taxonomic analysis (see Figure 2) is based on a spiraling concept where all of the terms and relationships defined previously are dependent on the central theme; that is, to get dollars for sales through reducing working capital, which is noted in the center of the drawing. The taxonomic analysis is broken into five component areas of who, what, why, when and how. Each area identifies the term, relationship, and hierarchical structure of each term within the area. Structured questions allow the researcher to find out how informants have organized their knowledge. A review of the transcribed words from the interviews, additional notes, and the categorized terms from the field work began to identify key, structured questions that were important to the Credit Services Department. These in-depth, structured, thick-description questions began to emerge out of the terms and categories used in the language.

Componential Analysis

Spradley (1979) states that the development research sequence evolves through locating informants, conducting interviews, and collecting the terms and symbols used for language and communication. This research included the domain analysis process with an overview of terms that make up categories and the relationship among categories. The domain analysis provided an overview of the language in the cultural scene. The technique of taxonomic analysis identified relationships and differences among terms within each domain. The next technique in the process included searching and organizing the attributes associated with cultural terms and categories. Attributes are specific terms that have significant meaning in the language. This process is called componential analysis.

Spradley (1973) states that, in drafting a schematic diagram on domains, "This thinking process is one of the best strategies for discovering cultural themes" (p. 199). The schematic diagram in Figure 3 identifies the one central theme that motivates Credit Services and this research project, and is based on one principle: understanding customer payment behavior.

Figure 1. Domain analysis (DeLorenzo, 2001)

Figure 2. Taxonomic analysis (DeLorenzo, 2001)

Figure 3. Cultural themes (DeLorenzo, 2002)

FUTURE TRENDS

Requirements Definition

The final effort in the ethnography included the drafting of the decision support system model to assist PPG in reducing working capital. The model is based on evolutionary changes in computer architecture over the past 40 years that businesses have taken to manage data. It addresses the information system needs at PPG to better understand customer payment behavior (Craig & Jutla, 2001; Forcht & Cochran, 1999; Inmon, 1996).

The accumulation of the domain analysis and important entities, the taxonomic analysis with critical categories and attributes, and the componential analytical efforts identifying cultural themes brought me to my last activity.


The final stage in the ethnographic process was to define the decision support system model to assist PPG in tracking customer payments. The model was based on using an information systems-based approach to capture the data required to understand customer payment behavior and to provide trend analysis capabilities to gain knowledge and insight from that understanding.

CONCLUSION

An analysis of a culture's language and communication mechanisms are important areas to understand in building information systems models. The research included the gathering of data requirements for the Credit Services Department at PPG by living and working within the researched community to gain a better understanding of its problem-solving and reporting needs to monitor customer payment behavior. The data requirements became the baseline prerequisite in defining the conceptual decision support model and database schema for the Credit Services Department's decision support project. In 2003, the conceptual model and schema were used by PPG's Information Technology Department to develop an application system to assist the Credit Services Department in tracking and managing customer payment behavior.

REFERENCES

Agar, M. (1980). The professional stranger: An informal introduction to ethnography. San Diego, CA: Academic Press, Inc.

Boyd, J. (2001, April 16). Think ASPs make sense. Internet Week, 48.

Camsuzou, C. (2001). The ecommerce vision of the credit services department. PPG Industries IT Strategy, 15-17.

Chen, P. (1976). The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1(1), 9-36.

Craig, J., & Jutla, D. (2001). eBusiness readiness: A customer-focused framework. Upper Saddle River, NJ: Addison Wesley Publishers.

Creswell, J. (1994). Research design: Qualitative and quantitative approaches. Thousand Oaks, CA: Sage Publications.

Forcht, K.A., & Cochran, K. (1999). Using data mining and data warehousing techniques. Industrial Management and Data Systems, 189-196.

Geertz, C. (1973). Emphasizing interpretation from the interpretation of cultures. New York: Basic Books, Inc.

Inmon, W. (1996). Building the data warehouse. New York: Wiley Computer Publishing.

Spradley, J.P. (1979). The ethnographic interview. Orlando, FL: Harcourt Brace.

KEY TERMS

Application Solution Providers: Third-party vendors who provide data center, telecommunications, and application options for major companies.

Business Intelligence: Synonym for online analytical processing system.

Data Mining: Decision support technique that can predict future trends and behaviors based on past historical data.

Data Warehouse: The capturing of attributes from multiple databases in a centralized repository to assist in the decision-making and problem-solving needs throughout the organization.

Decision Support: Evolutionary step in the 1990s with characteristics to review retrospective, dynamic data. OLAP is an example of an enabling technology in this area.

Dimensions: Static, fact-oriented information used in decision support to drill down into measurements.

Domain Analysis Technique: Search for the larger units of cultural knowledge called domains, a synonym for person, place, or thing. Used to gain an understanding of the semantic relationships of terms and categories.

Drill Down: User interface technique to navigate into lower levels of information in decision support systems.

Ethnography: The work of describing a culture. The essential core aims to understand another way of life from the native point of view.

Measurements: Dynamic, numeric values associated with dimensions found through drilling down into lower levels of detail within decision support systems.

Online Analytical Processing Systems (OLAP): Technology-based solution with data delivered at multiple dimensions to allow drilling down at multiple levels.

Requirements: Specifics based on defined criterion used as the basis for information systems design.

Systems Development Life Cycle: A controlled, phased approach in building information systems from understanding wants, defining specifications, designing and coding the system through implementing the final solution.


Evaluation of Data Mining Methods


Paolo Giudici
University of Pavia, Italy

INTRODUCTION The previous considerations suggest the need for a


systematic study of the methods for comparison and
Several classes of computational and statistical meth- evaluation of data mining models.
ods for data mining are available. Each class can be
parameterised so that models within the class differ in
terms of such parameters (see, for instance, Giudici, MAIN THRUST
2003; Hastie et al., 2001; Han & Kamber, 2000; Hand et
al., 2001; Witten & Frank, 1999): for example, the class Comparison criteria for data mining models can be
of linear regression models, which differ in the number classified schematically into: criteria based on statisti-
of explanatory variables; the class of Bayesian net- cal tests, based on scoring functions, computational
works, which differ in the number of conditional depen- criteria, Bayesian criteria and business criteria.
dencies (links in the graph); the class of tree models,
which differ in the number of leaves; and the class of multi- Criteria Based on Statistical Tests
layer perceptrons, which differ in terms of the number
of hidden strata and nodes. Once a class of models has The first are based on the theory of statistical hypoth-
been established the problem is to choose the best esis testing and, therefore, there is a lot of detailed
model from it. literature related to this topic. See, for example, a text
about statistical inference, such as Mood, Graybill, &
Boes (1991) and Bickel & Doksum (1977). A statistical
BACKGROUND model can be specified by a discrete probability func-
tion or by a probability density function, f(x). Such
A rigorous method to compare models is statistical model is usually left unspecified, up to unknown quan-
hypothesis testing. With this in mind one can adopt a tities that have to be estimated on the basis of the data at
sequential procedure that allows a model to be chosen hand. Typically, the observed sample is not sufficient
through a sequence of pairwise test comparisons. How- to reconstruct each detail of f(x), but can indeed be used
ever, we point out that these procedures are generally to approximate f(x) with a certain accuracy. Often a
not applicable in particular to computational data min- density function is parametric so that it is defined by a
ing models, which do not necessarily have an underlying vector of parameters =( 1 ,,I ), such that each value
probabilistic model and, therefore, do not allow the of corresponds to a particular density function,
application of statistical hypotheses testing theory. Fur- p (x). In order to measure the accuracy of a parametric

thermore, it often happens that for a data problem it is model, one can resort to the notion of distance between
possible to use more than one type of model class, with a model f, which underlies the data, and an approximat-
different underlying probabilistic assumptions. For ex- ing model g (see, for instance, Zucchini, 2000). Notable
ample, for a problem of predictive classification it is examples of distance functions are, for categorical
possible to use both logistic regression and tree models variables: the entropic distance, which describes the
as well as neural networks. proportional reduction of the heterogeneity of the de-
We also point out that model specification and, pendent variable; the chi-squared distance, based on the
therefore, model choice is determined by the type of distance from the case of independence; and the 0-1
variables used. These variables can be the result of distance, which leads to misclassification rates. For
transformations or of the elimination of observations, quantitative variables, the typical choice is the Euclid-
following an exploratory analysis. We then need to ean distance, representing the distance between two
compare models based on different sets of variables vectors in a Cartesian space. Another possible choice is
present at the start. For example, how do we compare a the uniform distance, applied when nonparametric mod-
linear model with the original explanatory variables els are being used.
with one with a set of transformed explanatory vari- Any of the previous distances can be employed to
ables? define the notion of discrepancy of a statistical model.
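For concreteness, the distance functions just listed can be computed in a few lines. The following is a minimal sketch in plain Python; the function names and the small example distributions are illustrative choices, not values taken from the article.

import math

def entropic_distance(f, g):
    # Entropic distance of g from the target f: sum_i f_i * log(f_i / g_i);
    # terms with f_i = 0 contribute nothing.
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

def chi_squared_distance(f, g):
    # Chi-squared distance of g from the target f: sum_i (f_i - g_i)^2 / g_i
    return sum((fi - gi) ** 2 / gi for fi, gi in zip(f, g))

def euclidean_distance(xf, xg):
    # Euclidean distance between observed (xf) and predicted (xg) vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xf, xg)))

def zero_one_distance(xf, xg):
    # 0-1 distance: number of positions where prediction and observation disagree.
    return sum(1 for a, b in zip(xf, xg) if a != b)

# Example: an observed categorical distribution versus a model's estimate.
observed = [0.5, 0.3, 0.2]
model = [0.4, 0.4, 0.2]
print(entropic_distance(observed, model), chi_squared_distance(observed, model))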


The discrepancy of a model, g, can be obtained comparing assumption is not always valid and therefore the AIC
the unknown probabilistic model, f, and the best paramet- criterion does not lead to a consistent estimate of the
ric statistical model. Since f is unknown, closeness can be dimension of the unknown model. An alternative, and
measured with respect to a sample estimate of the un- consistent, scoring function is the BIC criterion (Baye-
known density f. A common choice of discrepancy func- sian Information Criterion), also called SBC, formulated
tion is the Kullback-Leibler divergence, which can be by Schwarz (1978). As can be seen from its definition, the
applied to any type of observations. In such context, the BIC differs from the AIC only in the second part, which
best model can be interpreted as that with a minimal loss now also depends on the sample size n. Compared to the
of information from the true unknown distribution. AIC, when n increases the BIC favours simpler models. As
It can be shown that the statistical tests used for model n gets large, the first term (linear in n) will dominate the
comparison are generally based on estimators of the total second term (logarithmic in n). This corresponds to the
Kullback-Leibler discrepancy; the most used is the log- fact that, for a large n, the variance term in the mean
likelihood score. Statistical hypothesis testing is based squared error expression tends to be negligible. We also
on subsequent pairwise comparisons of log-likelihood point out that, despite the superficial similarity between
scores of alternative models. Hypothesis testing allows the AIC and the BIC, the first is usually justified by
one to derive a threshold below which the difference resorting to classical asymptotic arguments, while the
between two models is not significant and, therefore, the second by appealing to the Bayesian framework.
simpler models can be chosen. To conclude, the scoring function criteria for select-
Therefore, with statistical tests it is possible make ing models are easy to calculate and lead to a total
an accurate choice among the models. The defect of this ordering of the models. From most statistical packages
procedure is that it allows only a partial ordering of we can get the AIC and BIC scores for all the models
models, requiring a comparison between model pairs considered. A further advantage of these criteria is that
and, therefore, with a large number of alternatives it is they can be used also to compare non-nested models
necessary to make heuristic choices regarding the com- and, more generally, models that do not belong to the
parison strategy (such as choosing among the forward, same class (for instance a probabilistic neural network
backward and stepwise criteria, whose results may di- and a linear regression model).
verge). Furthermore, a probabilistic model must be However, the limit of these criteria is the lack of a
assumed to hold, and this may not always be possible. threshold, as well the difficult interpretability of their
measurement scale. In other words, it is not easy to
Criteria Based on scoring functions determine if the difference between two models is
significant or not, and how it compares to another
A less structured approach has been developed in the difference. These criteria are indeed useful in a prelimi-
field of information theory, giving rise to criteria based nary exploratory phase. To examine this criteria and to
on score functions. These criteria give each model a compare it with the previous ones see, for instance,
score, which puts them into some kind of complete Zucchini (2000), or Hand, Mannila, & Smyth (2001).
order. We have seen how the Kullback-Leibler discrep-
ancy can be used to derive statistical tests to compare Bayesian Criteria
models. In many cases, however, a formal test cannot be
derived. For this reason, it is important to develop A possible compromise between the previous two cri-
scoring functions that attach a score to each model. The teria is the Bayesian criteria, which could be developed in
Kullback-Leibler discrepancy estimator is an example a rather coherent way (see, e.g., Bernardo & Smith,
of such a scoring function that, for complex models, can 1994). It appears to combine the advantages of the two
be often be approximated asymptotically. A problem previous approaches: a coherent decision threshold and a
with the Kullback-Leibler score is that it depends on the complete ordering. One of the problems that may arise is
complexity of a model as described, for instance, by the connected to the absence of a general purpose software.
number of parameters. It is thus necessary to employ For data mining works using Bayesian criteria the reader
score functions that penalise model complexity. could see, for instance, Giudici (2001), Giudici & Castelo
The most important of such functions is the AIC (2003) and Brooks et al. (2003).
(Akaike Information Criterion) (Akaike, 1974). From
its definition, notice that the AIC score essentially Computational Criteria
penalises the loglikelihood score with a term that in-
creases linearly with model complexity. The AIC criterion The intensive wide spread use of computational meth-
is based on the implicit assumption that q remains con- ods has led to the development of computationally inten-
stant when the size of the sample increases. However this sive model comparison criteria. These criteria are usually


based on using dataset different than the one being that of SAS Enterprise Miner (SAS Institute, 2004).
analysed (external validation) and are applicable to all The idea behind these methods is to focus the atten-
the models considered, even when they belong to differ- tion, in the choice among alternative models, to the
ent classes (for example in the comparison between utility of the obtained results. The best model is the one
logistic regression, decision trees and neural networks, that leads to the least loss.
even when the latter two are non probabilistic). A pos- Most of the loss function based criteria are based
sible problem with these criteria is that they take a long on the confusion matrix. The confusion matrix is used
time to be designed and implemented, although general as an indication of the properties of a classification
purpose software has made this task easier. rule. On its main diagonal it contains the number of
The most common of such criterion is based on observations that have been correctly classified for
cross-validation. The idea of the cross-validation method each class. The off-diagonal elements indicate the num-
is to divide the sample into two sub-samples, a training ber of observations that have been incorrectly classi-
sample, with n - m observations, and a validation sample, fied. If it is assumed that each incorrect classification
with m observations. The first sample is used to fit a has the same cost, the proportion of incorrect classifi-
model and the second is used to estimate the expected cations over the total number of classifications is
discrepancy or to assess a distance. Using this criterion called rate of error, or misclassification error, and it is
the choice between two or more models is made by the quantity that must be minimised. The assumption of
evaluating an appropriate discrepancy function on the equal costs can be replaced by weighting errors with
validation sample. Notice that the cross-validation idea their relative costs.
can be applied to the calculation of any distance function. The confusion matrix gives rise to a number of
One problem regarding the cross-validation criterion graphs that can be used to assess the relative utility of
is in deciding how to select m, that is, the number of the a model, such as the Lift Chart, and the ROC Curve (see
observations contained in the validation sample. For Giudici, 2003). The lift chart puts the validation set
example, if we select m = n/2 then only n/2 observations observations, in increasing or decreasing order, on the
would be available to fit a model. We could reduce m but basis of their score, which is the probability of the
this would mean having few observations for the valida- response event (success), as estimated on the basis of
tion sampling group and therefore reducing the accuracy the training set. Subsequently, it subdivides such scores
with which the choice between models is made. In prac- in deciles. It then calculates and graphs the observed
tice proportions of 75% and 25% are usually used, probability of success for each of the decile classes in
respectively, for the training and the validation samples. the validation set. A model is valid if the observed
To summarise, these criteria have the advantage of success probabilities follow the same order (increas-
being generally applicable but have the disadvantage of ing or decreasing) as the estimated ones. Notice that, in
taking a long time to be calculated and of being sensitive order to be better interpreted, the lift chart of a model
to the characteristics of the data being examined. A way is usually compared with a baseline curve, for which the
to overcome this problem is to consider model combina- probability estimates are drawn in the absence of a
tion methods, such as bagging and boosting. For a thor- model, that is, taking the mean of the observed success
ough description of these recent methodologies, see probabilities.
Hastie, Tibshirani, & Friedman (2001). The ROC (Receiver Operating Characteristic) curve
is a graph that also measures predictive accuracy of a
Business Criteria model. It is based on four conditional frequencies that
can be derived from a model, and the choice of a cut-off
One last group of criteria seem specifically tailored for points for its scores: a) the observations predicted as
the data mining field. These are criteria that compare the events and effectively such (sensitivity); b) the obser-
performance of the models in terms of their relative vations predicted as events and effectively non-events;
losses, connected to the errors of approximation made c) the observations predicted as non-events and effec-
by fitting data mining models. Criteria based on loss tively events; d) the observations predicted as non-
functions have appeared recently, although related ideas events and effectively such (specificity). The ROC
are long time known in Bayesian decision theory (see, curve is obtained representing, for any fixed cut-off
for instance, Bernardo & Smith, 1994). They have a great value, a point in the plane having as x-value the false
application potential, although at present they are mainly positive value (1-specificity) and as y-value the sensitiv-
concerned with classification problems. For a more de- ity value. Each point in the curve corresponds therefore
tailed examination of these criteria the reader can see, for to a particular cut-off. In terms of model comparison, the
example, Hand (1997), Hand, Mannila, & Smyth (2001), or best curve is the one that is leftmost, the ideal one
the reference manuals on data mining software, such as coinciding with the y-axis.
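To make the loss-function criteria above concrete, the following sketch in plain Python builds a confusion matrix from scored validation cases, derives the misclassification rate, sensitivity and specificity, and computes the observed success rate per score decile, which is the quantity a lift chart plots. The data, cut-off and function names are invented for illustration.

def confusion_matrix(actuals, scores, cutoff=0.5):
    # actuals: 1 = event, 0 = non-event; scores: estimated event probabilities.
    tp = sum(1 for a, s in zip(actuals, scores) if a == 1 and s >= cutoff)
    fn = sum(1 for a, s in zip(actuals, scores) if a == 1 and s < cutoff)
    fp = sum(1 for a, s in zip(actuals, scores) if a == 0 and s >= cutoff)
    tn = sum(1 for a, s in zip(actuals, scores) if a == 0 and s < cutoff)
    return tp, fn, fp, tn

def accuracy_summary(actuals, scores, cutoff=0.5):
    tp, fn, fp, tn = confusion_matrix(actuals, scores, cutoff)
    n = tp + fn + fp + tn
    return {"misclassification rate": (fp + fn) / n,
            "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
            "specificity": tn / (tn + fp) if (tn + fp) else 0.0}

def lift_by_decile(actuals, scores, n_bins=10):
    # Sort validation cases by decreasing score and report the observed
    # success rate in each bin, the values plotted in a lift chart.
    ranked = sorted(zip(scores, actuals), reverse=True)
    size = max(1, len(ranked) // n_bins)
    return [sum(a for _, a in ranked[i:i + size]) / len(ranked[i:i + size])
            for i in range(0, len(ranked), size)]

actuals = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
print(accuracy_summary(actuals, scores))
print(lift_by_decile(actuals, scores, n_bins=4))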


To summarise, criteria based on loss functions have parison criteria based on business quantities are ex-
the advantage of being easy to understand but, on the tremely useful. -
other hand, they still need formal improvements and
mathematical refinements.
REFERENCES

FUTURE TRENDS Akaike, H. (1974). A new look at statistical model


identification. IEEE Transactions on Automatic Con-
It is well known that data mining methods can be classi- trol, 19, 716-723.
fied into exploratory, descriptive (or unsupervised),
predictive (or supervised) and local (see, e.g., Hand et Bernardo, J.M., & Smith, A.F.M. (1994). Bayesian
al., 2001). Exploratory methods are preliminary to oth- theory. New York: Wiley.
ers and, therefore, do not need a performance measure. Bickel, P.J., & Doksum, K.A. (1977). Mathematical
Predictive problems, on the other hand, are the setting statistics. New York: Prentice Hall.
where model comparison methods are most needed,
mainly because of the abundance of the models avail- Brooks, S.P., Giudici, P., & Roberts, G.O. (2003).
able. All presented criteria can be applied to predictive Efficient construction of reversible jump MCMC pro-
models: this is a rather important aid for model choice. posal distributions. Journal of The Royal Statistical
For descriptive models aimed at summarising vari- Society Series B, 1, 1-37.
ables, such as clustering methods, the evaluation of the Castelo, R., & Giudici, P. (2003). Improving Markov
results typically proceeds on the basis of the Euclidean chain model search for data mining. Machine Learning,
distance, leading at the R2 index. We remark that it is 50, 127-158.
important to examine the ratio between the between
and total sums of squares that leads to R 2 for each Giudici, P. (2001). Bayesian data mining, with applica-
variable in the dataset. This can give a variable-specific tion to credit scoring and benchmarking. Applied Sto-
measure of the goodness of the cluster representation. chastic Models in Business and Industry, 17, 69-81.
Finally, it is difficult to assess local models, such as
association rules, for the bare fact that a global measure Giudici, P. (2003). Applied data mining. London, Wiley.
of evaluation of such model contradicts with the very Han, J., & Kamber, M (2001). Data mining: Concepts and
notion of a local model. The idea that prevails in the techniques. New York: Morgan Kaufmann.
literature is to measure the utility of patterns in terms of
how interesting or unexpected they are to the analyst. As Hand, D. (1997). Construction and assessment of classi-
measures of interest one can consider, for instance, the fication rules. London: Wiley.
support, the confidence and the lift. The former can be Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of
used to assess the importance of a rule, in terms of its data mining. New York: MIT Press.
frequency in the database; the second can be used to
investigate possible dependences between variables; Hastie, T., Tibshirani, R., & Friedman, J. (2001). The
finally the lift can be employed to measure the distance elements of statistical learning: Data mining, inference
from the situation of independence. and prediction. New York: Springer-Verlag.
Mood, A.M., Graybill, F.A., & Boes, D.C. (1991). Intro-
duction to the theory of statistics. Tokyo: McGraw
CONCLUSION Hill.
The evaluation of data mining methods requires a great SAS Institute Inc. (2004). SAS enterprise miner refer-
deal of attention. A valid model evaluation and compari- ence manual. Cary: SAS Institute.
son can improve considerably the efficiency of a data
mining process. We have presented several ways to Schwarz, G. (1978). Estimating the dimension of a
perform model comparison; each has its advantages and model. Annals of Statistics, 62, 461-464.
disadvantages. Choosing which of them is most suited Witten, I., & Frank, E. (1999). Data Mining: Practical
for a particular application depends on the specific machine learning tools and techniques with Java
problem and on the resources (e.g., computing tools and implementation. New York: Morgan Kaufmann.
time) available. Furthermore, the choice must also be
based upon the kind of usage of the final results. This Zucchini, W. (2000). An introduction to model selec-
implies that, for example, in a business setting, com- tion. Journal of Mathematical Psychology, 44, 41-61.


KEY TERMS

0-1 Distance: The 0-1 distance between a vector of predicted values, $X_g$, and a vector of observed values, $X_f$, is:

$_{0\text{-}1}d = \sum_{r=1}^{n} 1(X_{fr} \neq X_{gr})$

where $1(w, z) = 1$ if $w \neq z$ and 0 otherwise.

AIC Criterion: The AIC criterion is defined by the following equation:

$\mathrm{AIC} = -2 \log L(\hat{\theta}; x_1, \ldots, x_n) + 2q$

where $\log L(\hat{\theta}; x_1, \ldots, x_n)$ is the logarithm of the likelihood function calculated at the maximum likelihood parameter estimate and $q$ is the number of parameters of the model.

BIC Criterion: The BIC criterion is defined by the following expression:

$\mathrm{BIC} = -2 \log L(\hat{\theta}; x_1, \ldots, x_n) + q \log(n)$

Chi-Squared Distance: The chi-squared distance of a distribution $g$ from a target distribution $f$ is:

$\chi^2 d = \sum_i \frac{(f_i - g_i)^2}{g_i}$

Discrepancy of a Model: Assume that $f$ represents the unknown density of the population, and let $g = p_\theta$ be a family of density functions (indexed by a vector of $I$ parameters, $\theta$) that approximates it. Using, to exemplify, the Euclidean distance, the discrepancy of a model $g$ with respect to a target model $f$ is:

$\Delta(f, p_\theta) = \sum_{i=1}^{n} \left( f(x_i) - p_\theta(x_i) \right)^2$

Entropic Distance: The entropic distance of a distribution $g$ from a target distribution $f$ is:

$E_d = \sum_i f_i \log \frac{f_i}{g_i}$

Euclidean Distance: The distance between a vector of predicted values, $X_g$, and a vector of observed values, $X_f$, is expressed by the equation:

$_{2}d(X_f, X_g) = \sqrt{\sum_{r=1}^{n} \left( X_{fr} - X_{gr} \right)^2}$

Kullback-Leibler Divergence: The Kullback-Leibler divergence of a parametric model $p_\theta$ with respect to an unknown density $f$ is defined by:

$\Delta_{K\text{-}L}(f, p_\theta) = \sum_i f(x_i) \log \frac{f(x_i)}{p_{\hat{\theta}}(x_i)}$

where the suffix $\hat{\theta}$ indicates the values of the parameters that minimize the distance with respect to $f$.

Log-Likelihood Score: The log-likelihood score is defined by

$-2 \sum_{i=1}^{n} \log \left[ p_{\hat{\theta}}(x_i) \right]$

Uniform Distance: The uniform distance between two distribution functions, $F$ and $G$, with values in $[0, 1]$, is defined by:

$\sup_{0 \leq t \leq 1} \left| F(t) - G(t) \right|$
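As an illustration of the scoring-function criteria defined above, the following minimal sketch in plain Python computes AIC and BIC for two competing fitted models and prefers the one with the smaller score. The log-likelihood values, parameter counts and sample size are invented for illustration.

import math

def aic(log_likelihood, q):
    # AIC = -2 log L(theta_hat) + 2q, where q is the number of parameters.
    return -2.0 * log_likelihood + 2.0 * q

def bic(log_likelihood, q, n):
    # BIC = -2 log L(theta_hat) + q log(n), where n is the sample size.
    return -2.0 * log_likelihood + q * math.log(n)

# Two hypothetical fitted models for the same data set of n = 500 observations:
# a simpler model with 3 parameters and a richer one with 10 parameters.
n = 500
candidates = {"simple model": (-612.4, 3), "richer model": (-604.9, 10)}
for name, (loglik, q) in candidates.items():
    print(name, "AIC:", round(aic(loglik, q), 1), "BIC:", round(bic(loglik, q, n), 1))
# The preferred model is the one with the smaller score; because the BIC
# penalty q*log(n) grows with n, BIC favours the simpler model more strongly.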

Evolution of Data Cube Computational Approaches
Rebecca Boon-Noi Tan
Monash University, Australia

INTRODUCTION in business application for the past few decades. Common


forms of data analysis include histograms, roll-up totals
Aggregation is a commonly used operation in decision and sub-totals for drill-downs and cross tabulation. The
support database systems. Users of decision support three common forms are difficult to use with these SQL
queries are interested in identifying trends rather than aggregation constructs (Gray, Bosworth, Lyaman, &
looking at individual records in isolation. Decision sup- Pirahesh, 1996). An explanation of the three common
port system (DSS) queries consequently make heavy use problems: (a) Histograms, (b) Roll-up Totals and Sub-
of aggregations, and the ability to simultaneously aggre- Totals for drill-downs, and (c) Cross tabulation will be
gate across many sets of dimensions (in SQL terms, this presented.
translates to many simultaneous group-bys) is crucial for Firstly the problem with histograms is that the SQL
Online Analytical Processing (OLAP) or multidimensional standard Group-by operator does not allow a direct con-
data analysis applications (Datta, VanderMeer, & struction with aggregation over computed categories.
Ramamritham, 2002; Dehne, Eavis, Hambrusch, & Rau- Unfortunately, not all SQL systems directly support his-
Chaplin, 2002; Elmasri & Navathe, 2004; Silberschatz, tograms including the standard. In standard SQL, histo-
Korth & Sudarshan, 2002). grams are computed indirectly from a table-valued expres-
sion which is then aggregated. However, the cube-by
operator is able to have direct construction. The histo-
BACKGROUND gram can be easily obtained from the data cube using the
cube-by operator to compute the raw data. The histogram
Although aggregation functions together with another on the right-hand side of Figure 1 shows the sales amount
operator, Group-by in SQL terms, have been widely used as well as the sub-totals and total of a range of the

Figure 1. An example showing how a histogram is formed from results obtained from a data cube


Figure 2. An example of cross-tabulation from a data cube
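A cross-tabulation of the kind sketched in Figure 2 can be obtained directly from cube-style aggregation. The following short example uses the pandas library; the column names and sales figures are invented for illustration and are not taken from the article's figures.

import pandas as pd

# A small fact table of sales: one row per (product, year, location) with an amount.
sales = pd.DataFrame({
    "Product": ["P1", "P1", "P2", "P2", "P1", "P2"],
    "Year": [2003, 2004, 2003, 2004, 2004, 2003],
    "Location": ["L001"] * 6,
    "Amount": [10, 15, 7, 12, 5, 9],
})

# Cross-tab of total sales by Product (rows) and Year (columns), with
# row/column sub-totals and a grand total added via margins=True.
crosstab = pd.pivot_table(sales, values="Amount", index="Product", columns="Year",
                          aggfunc="sum", fill_value=0,
                          margins=True, margins_name="Total")
print(crosstab)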

products at a range of the time-frame on a particular operators (also known as cube-by) to conveniently sup-
location. port such aggregates. The data cube is identified as the
The second problem relates to roll-ups totals and sub- core operator in data warehousing and OLAP
totals for drill-down. Reports commonly aggregate data (Lakshmanan, Pei & Zhao, 2003). The cube-by operator
initially at a coarse level, and then at successively finer computes group-bys corresponding to all possible com-
levels. This type of report is difficult with the normal SQL binations in a list of attributes. An example of data cube
construct. However, the Cube-by operator is able to query is as follows:
present the roll-ups totals and sub-totals for drill-down
easily. SELECT Product, Year, City, SUM(amount)
The third problem relates to cross-tabulation which is FROM Sales
difficult to construct with the current standard SQL. The CUBE BY Product, Year, City
symmetric aggregation result is cross-tabulation table or
cross tab for short (known as a pivot-table in spread- The above query will produce the SUM of amount of
sheets). Using the cube-by operator, cross tab data can all tuples in the database according the 7 group-bys, i.e.
be readily obtained which is routinely displayed in the (Product, Year, City), (Product, Year), (Product, City),
more compact format as shown in Figure 2. This cross tab (Year, City), (Product), (Year), (City). Lastly, the 8th group-
is a two dimensional aggregation within the red-dotted by denotes as ALL, which contains an empty attribute so
line. If we add another location such as L002, it becomes as to make all group-bys results union compatible. For
a 3D aggregation. example, a cube-by of three attributes (ABC) in data cube
In summary, the problem of representing aggregate query will generate eight or 23 group-bys of ([ABC], [AB],
data in a relational data model with the standard SQL can [AC], [BC], [A], [B], [C] and [ALL]).
be a difficult and daunting task. A six dimensional cross- The most straightforward way to execute the data
tab will require a 64 way union of 64 different group-by cube query is to rewrite it as a collection of eight group-
operators in order to build the underlying representation. by queries and execute them separately as shown in
This is an important reason why the use of Group-bys is Figure 3. This means that the eight group-by queries will
inadequate as the resulting representation of aggregation need to access the raw data eight times. It is likely to be
is too complex for optimal analysis. quite expensive in execution time. If the number of dimen-
sion attributes increases, it becomes very expensive to
compute the data cube. This is because the required
MAIN THRUST computation cost grows exponentially with the increase
of dimension attributes. For instance, N number of
Birth of cube-by operator dimension attributes of cube-by will form a 2N number of
group-bys. However, there are a number of ways in which
To overcome the difficulty with these SQL aggregation this simple solution can be improved.
constructs, (Gray et al., 1996) proposed using data cube


Figure 3. A straightforward way to execute each of the group-bys separately
Figure 4. A lattice of the cube-by operator
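The straightforward plan of Figure 3, executing each group-by independently over the raw data, can be sketched in plain Python as below; the record layout and attribute names are illustrative. The sketch makes the exponential cost visible: N cube-by attributes yield 2^N separate scans of the data.

from itertools import combinations
from collections import defaultdict

def group_by(records, dims, measure):
    # One ordinary GROUP BY: sum the measure for each combination of dims.
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[d] for d in dims) if dims else ("ALL",)
        totals[key] += rec[measure]
    return dict(totals)

def naive_cube(records, dims, measure):
    # CUBE BY: execute every one of the 2^N group-bys separately,
    # scanning the raw data once per group-by (the expensive, naive plan).
    result = {}
    for k in range(len(dims), -1, -1):
        for subset in combinations(dims, k):
            result[subset] = group_by(records, subset, measure)
    return result

records = [
    {"Product": "P1", "Year": 2003, "City": "Melbourne", "amount": 10},
    {"Product": "P1", "Year": 2004, "City": "Melbourne", "amount": 15},
    {"Product": "P2", "Year": 2003, "City": "Sydney", "amount": 7},
]
cube = naive_cube(records, ("Product", "Year", "City"), "amount")
print(len(cube))            # 8 group-bys for 3 cube-by attributes
print(cube[("Product",)])   # {('P1',): 25, ('P2',): 7}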

Lattice Framework Figure 4 shows a pattern of the eight group-bys based


on cube-by with 3 attributes. One unique feature of the
Harinarayan, Rajaraman & Ullman (1996) noted that some cube-by computation is that the computation of the
of the group-by queries in the data cube query could be various group-bys are not independent of each other,
answered using the results of other. This is known as but are closely related as some of them can be computed
dependence relation on queries. Consider a situation using others. Figure 5 shows a simple example where the
where there are two queries Q1 and Q2. We can say that Q1 results of A and B are both derived from the group-by of
is a subset of Q 2 when Q1can be answered by using the two attributes AB (T# and P#)
result of Q2. In this case, Q1 is dependent on Q2. This results Why is AB chosen in preference to AC or BC in order
in a formation of lattice diagram as shown in Figure 4. to get both the results of A and B? A number of factors
The cuboid may or may not have a common prefix with need to be considered when choosing the best candidate
its parent. For a simple illustration, [AB] and [AC] have to obtain the result of the others group-by. Table 1 shows
common prefix [A] with their parent [ABC]. However, there how complex it can be especially when the number of
is no common prefix between [BC] and its parent [ABC]. An attributes is increased.
edge from a cuboid to its child indicates that the child
cuboid can be computed from its parent.

Figure 5. An example of how A and B can be derived from AB
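The point of Figure 5, that the coarser group-bys A and B can be computed from the already aggregated AB result rather than from the raw data, can be illustrated with a small plain-Python sketch; the attribute names follow the figure, the totals are invented.

from collections import defaultdict

def roll_up(parent, keep_positions):
    # Aggregate a finer group-by (e.g., AB) down to a coarser one (e.g., A or B)
    # by dropping the unwanted key positions and re-summing the totals.
    child = defaultdict(int)
    for key, total in parent.items():
        child[tuple(key[i] for i in keep_positions)] += total
    return dict(child)

# Result of the two-attribute group-by AB (A = T#, B = P# in Figure 5).
ab = {("T1", "P1"): 30, ("T1", "P2"): 20, ("T2", "P1"): 10, ("T2", "P2"): 40}

a = roll_up(ab, keep_positions=(0,))   # group-by A: {('T1',): 50, ('T2',): 50}
b = roll_up(ab, keep_positions=(1,))   # group-by B: {('P1',): 40, ('P2',): 60}
print(a, b)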


Table 1. An illustration of the increment in the number of paths and group-bys

Table 1 illustrates the increment in the number of paths with hierarchies on the dimensions. Lis represent Stores
and group-bys whenever there is an increase in one and Pis represent Products. Stores L1 L3 are in Churchill,
attribute. For example, three attributes [ABC] will result in while L4 L6 are in Morwell, and this roll-up into two
3 D cuboid level and 12 paths if we increase one more towns. Products P1 P3 are of type Cup, while products
attributes [ABCD], this will generate 4 D cuboid level and P4 P5 are of type Plate. Both cup and plate are further
32 paths. When the number of attributes is small, the grouped into the category Kitchenware. The xs are sales
number of group-bys and number of paths are relatively volumes; entries that are blank correspond to (product,
similar. However, when the number of attributes increases, store) combinations for which there are no sales.
the number of paths increases significantly faster by a
factor of Np-1 where p is the path number. For instance, In summary, there are two types of query dependen-
when Na is equal to 2, Ng and Np are equal to 4. When Na cies: dimension dependency, and attribute dependency.
is equal to 3, Ng is 23 = 8 and Np = (23 + 4) = 12. It is important Dimension dependency is present when there is an inter-
to note that with the number of different paths available, action of the different dimensions with one another as
it is difficult to decide which path should be used since shown in Figures 4 and 5. Attribute dependency is intro-
some of the group-bys can be computed by using others. duced when it falls within a dimension caused by attribute
hierarchies as shown in Figures 6 and 7.
Hierarchies in Lattice Diagram
Optimisations in Existing Approaches
In real-life environments, the dimensions of a data cube
normally consist of more than one attribute, and the Sarawagi, Agrawal, and Megiddo (1996) adapted these
dimensions are organized as hierarchies of these at- methods to compute group-bys by incorporating a num-
tributes. Harinarayan, Rajaraman, & Ullman (1996) used a ber of optimization:
simple time dimension example to illustrate a hierarchy:
day, month, and year in Figure 6. Hierarchies are very Smallest-Parent: This optimization was first pro-
important, as they form a basis of two frequently used posed in (Gray et. al, 1997). It aims at computing a
querying operations: drill-down and roll-up. Drill-up group-by from the smallest (in terms of either clos-
is the process of viewing data at gradually more detailed est parent group-by or size of the parent group-by)
levels while roll-up is just the opposite. More details can of previously computed group-by. Each group-by
be found in (Ramakrishnan & Gehrke, 2003, p. 852). can be computed from a number of other group-bys.
Figure 7 shows an example of real-life data set, which Figures 8 and 9 show a three attributes cube (ABC).
can be viewed conceptually as a two-dimensional array
There are a number of options for computing a group-
by A from its group-bys parent (ABC, AB, or AC). For
example, A can be computed from ABC, AB or AC. Figure
Figure 6. A hierarchy example of time attributes 8 shows an example where it is better choice to compute
A from smallest or closest parent group-bys AB or AC
Day rather than ABC. Another example shows another sce-
nario where the size of smallest-parent is smaller than the
Week Month other (Figure 9). AC is a better choice as compared to AB
in terms of size.

Year Cache-Results: This optimization aims at caching


None (in memory) the results of a group-by from which


Figure 7. A hierarchy example of two attributes



Figure 8. AB or AC is clearly a better choice for computing A
Figure 9. An example of different size in AB and AC

Figure 10. Reduction in disk I/O by computing A from AB instead of AC
Figure 11. Reduction of disk reads by computing as many group-bys as possible in the memory
[AB] [AB, AC, A]

other group-bys are computed to reduce disk I/O. has brought equal items together, and duplicate
Figure 10 shows an example AB is better choice to removal will then be easy (Figure 12).
compute A as compared to AC. The reason is that Share-Partitions: This optimization is specific to
A can be computed while AB is still in memory so the hash-based algorithms and share the partition
that there is no disk I/O cost involved. costs across multiple group-bys. Data to be aggre-
Amortize-Scans: This optimization aims at amortiz- gated is usually too large for the hash-tables to fit
ing disk reads by computing as many group-bys as in memory. Hence, the conventional way to deal
possible, together in memory. In Figure 11 as many with limited memory when constructing hash tables
group-bys as possible are computed such as AB, is to partition the data on one or more attributes. If
AC and A from ABC while ABC is still in the memory. data is partitioned on an attribute, say A, then all
Thus, there is no need to involve disk read. group-bys that contain A can be computed by
Share-Sorts: This optimization is specific to the independently grouping on each partition.
sort-based algorithms and aims at sharing sorting
cost across multiple group-bys. Share-sort is using Unfortunately, the above optimizations are often
data sorted in a particular order to compute all contradictory for OLAP databases especially when the
group-bys that are prefixes of that order. There is no size of the data to be aggregated is usually much larger
need to re-sort for the subsequent group-bys as it than the available main memory. For instance, a group-by


Figure 12. An example of share-sorts
Figure 13. An example of sharing partitioning cost
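The share-sorts idea of Figure 12, sorting once and reusing that order for every group-by that is a prefix of it, is sketched below in plain Python; the attribute order and data are invented. After a single sort on (A, B, C), the group-bys ABC, AB, A and ALL are produced in one scan, in the spirit of PipeSort's pipelined paths; dictionaries stand in for the running totals that a true sort-based pipeline would emit as each prefix changes.

from collections import defaultdict

def prefix_group_bys(rows, dims, measure_index):
    # rows are tuples ordered as dims followed by the measure. A single sort
    # on the full attribute list serves every prefix group-by (ABC, AB, A, ALL).
    rows = sorted(rows, key=lambda r: r[:len(dims)])
    totals = {k: defaultdict(int) for k in range(len(dims) + 1)}
    for row in rows:                      # one scan of the sorted data
        for k in range(len(dims) + 1):    # update every prefix aggregate
            totals[k][row[:k]] += row[measure_index]
    return {dims[:k]: dict(v) for k, v in totals.items()}

rows = [("a1", "b1", "c1", 5), ("a1", "b1", "c2", 3), ("a2", "b1", "c1", 7)]
out = prefix_group_bys(rows, ("A", "B", "C"), measure_index=3)
print(out[("A",)])   # {('a1',): 8, ('a2',): 7}
print(out[()])       # {(): 15}, the grand total (ALL)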

[A] can be computed from one of the several parent group- by for each group-by and this has resulted in biased
bys ([AB] [ABC] [AC]), but the bigger one AC (in term of towards optimizing for smallest-parent. Second
size) is in memory and the smallest one AB is not. In this optimization, share-partitions, is achieved by com-
case, based on the Cache-results optimization, AC maybe puting from the same partition all group-bys that
is a better choice. contain the partitioning attribute. Third and fourth
With the possible optimization techniques, Agarwal optimizations are achieved when computing a
et al. (1996) suggested two proposed approaches, which subtree, the algorithm maintains all hash-tables of
are basically sort-based PipeSort and hash-based group-bys in the subtree in memory until all its
PipeHash. However, there is a need for some global children are created and also for each group-by,
planning, which uses the search lattice introduced in therefore its children can be computed in one scan
(Harinarayan et al., 1996). of the group-by. However, the limitation of PipeHash
algorithm relates to the NP-Hard problem especially
Search Lattice in minimizing overall disk scan cost.
Overlap: Deshpande et al. (1998) proposed the
The search lattice is a graph where a vertex represents a OVERLAP algorithm for data cube computation and
group-by of the cube as shown in Figure 14. A directed it based on sorting-based method. The Overlap
edge connects group-by i to group-by j whenever j can be algorithm is executed in four stages. The overlap
generated from i and j has exactly one attribute less than algorithm has minimized the number of sorting steps
i (i is called the parent of j, for instance, AB is called the required to compute many sub-aggregates and also
parent of A). Level k of the search lattice denotes all minimized the number of disk accesses by overlap-
group-bys that contain exactly k attributes. ping the computation of the cuboids. However, the
limitation of the overlap algorithm is that the memory
Data Cube Computation Methodology may be not large enough to store more than one such
partition simultaneously. This is because each of
PipeSort: Agrawal et al. (1996) have incorporated the O(k) nodes has O(k) such descendants, so there
the share-sorts, cache-results and amortize-scans are at least O(k2) nodes in the search tree that
into the PipeSort algorithm. The aim is to get mini- involve additional disk I/O. As a result, the total I/
mum total cost and also to use Pipelining to achieve O cost of OVERLAP is at least quadratic in k for
cache-results and amortize-scans. However, the main sparse data.
limitation is that it does not scale well with respect Partitioning: Ross et al. (1997) have taken into
to the number of Cube-by attributes. PipeSort per- consideration the fact that real data is frequently
forms one sort operation for the pipelined evalua- sparse. It exists for two reasons: (a) large domain
tion of each path. When the underlying relation is sizes of some cube-by attributes and (b) large num-
sparse and much larger than available memory, this ber of cube-by attributes in the data cube query. In
results in many of the cuboids that PipeSort sorts (Ross et al., 1997), large relations are partitioned into
are also larger than the available memory. Hence, fragments that fit in memory. Thus, there is always
this has resulted a considerable amount of I/O when enough memory to fit in the fragments of large
performing PipeSort. relation. The memory-cube is similar to PipeSort as
PipeHash: The PipeHash algorithm is able to in- it computes the various cuboids (group-bys) of the
clude the four stated optimizations only if the data cube using the idea of pipelined paths. The
memory is available. First optimization is smallest- memory-cube performs multiple in-memory sorts,
parent where PipeHash has fixed the parent group- and does not incur any I/O beyond the input of the
relation and the output of the data cube itself.


Figure 14. A lattice search of the cube-by operator
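The search lattice described here, with a vertex for every group-by and a directed edge from i to j whenever j can be generated from i and has exactly one attribute less, can be built mechanically. A small plain-Python sketch, with illustrative attribute names:

from itertools import combinations

def search_lattice(dims):
    # Vertices: every subset of the cube-by attributes (2^N group-bys).
    vertices = [frozenset(c) for k in range(len(dims), -1, -1)
                for c in combinations(dims, k)]
    # Edges: parent -> child where the child has exactly one attribute less.
    edges = [(p, c) for p in vertices for c in vertices
             if c < p and len(p) - len(c) == 1]
    return vertices, edges

vertices, edges = search_lattice(("A", "B", "C"))
print(len(vertices))   # 8 group-bys
print(len(edges))      # 12 parent-child edges for the 3-attribute lattice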



Array Computation: Zhao, Deshpande, Naughton CONCLUSION


and Shukla (1997) introduced the multi-way array
cubing algorithm in which the cells are visited in the Data cube computation requires all possible combina-
right order so that a cell does not have to be revisited tions of group-bys which correspond to the attributes.
for each sub-aggregate. Two main issues: (a) the However, due to the complexity associated with data cube
array itself is far too large to fit in memory especially computation, a number of approaches have been pro-
in a multidimensional application; and (b) many of posed. In spite of these approaches, there are a number of
the cells in the array are empty. The goal is to overlap research issues that still need to be resolved.
the computation of all these group-bys and finish
the cube in one scan of the array with the require-
ment of memory minimized. REFERENCES
Table 2 shows a summary of benefits and optimization
Agarwal, S., Agrawal, R., Deshpande, P.M., Gupta, A.,
of existing techniques. The existing techniques show that
Naughton, J.F., Ramakrishnan, R. et al. (1996, September).
their performance based on uni-processor environments.
On the computation of multidimensional aggregates. In-
They either heavily relied on estimation of memory re-
ternational VLDB Conference (pp. 506-521), Bombay,
quirements, or simply partitioned data into equal-sized
India.
partitions based on a uniform distribution assumption.
Datta, A., VanderMeer, D., & Ramamritham, K. (2002).
Parallel star join + DataIndexes: Efficient query process-
FUTURE TRENDS ing in data warehouses and OLAP. IEEE Transactions
Journal on Knowledge & Data Engineering, 14, 1299-
Other approaches have utilized parallel techniques (Datta, 1316.
et al., 2002; Tan, Taniar & Lu, 2003; Taniar & Tan, 2002).
Dehne, F., Eavis, T., Hambrusch, S., & Rau-Chaplin, A.
Dehne et al. (2003) have proposed the use of parallel
(2002). Parallelizing the data cube. Journal of Distributed
techniques on data cubes whereas (Lu, Yu, Feng, & Li.,
& Parallel Databases, 11, 181-201.
2003) focused on the handling of data skew in parallel data
cube computations. However, due to the complexity of the Elmasri, R., & Navathe, S.B. (2004). Fundamentals of
data cube there is still capacity for computations to be database systems. Boston, MA: Addison Wesley.
further optimized. There is thus still scope for researchers
to address these new challenges. Gray, J., Bosworth, A., Lyaman, A., & Pirahesh, H. (1996,
June). Data cube: A relational aggregation operator gen-
eralizing GROUP-BY, CROSS-TAB, and SUB-TOTALS.
International ICDE Conference (pp. 152-159), New Or-
leans, Louisiana.


Table 2. Benefits and optimization of existing techniques

Harinarayan, V., Rajaraman, A., & Ullman, J. (1996, June). tiple dimensional queries. International ACM SIGMOD
Implementing data cubes efficiently. International ACM Conference (pp. 271-282), Seattle, Washington.
SIGMOD Conference (pp. 205-216), Montreal, Canada.
Lakshmanan, L.V.S., Pei, J., & Zhao, Y. (2003, September).
Efficacious data cube exploration by semantic summariza- KEY TERMS
tion and compression. International VLDB Conference
(pp. 1125-1128), Berlin, Germany.
Amortize-Scans: Amortizing disk reads by comput-
Lu, H.J., Yu, J.X., Feng, L., & Li, Z.X. (2003). Fully dynamic ing as many group-bys as possible, simultaneously in
partitioning: Handling data skew in parallel data cube memory.
computation. Journal Distributed & Parallel Databases,
13, 181-202. Attribute Dependency: Introduces when it falls within
a dimension caused by attribute hierarchies.
Ramakrishnan, R., & Gehrke, J. (2003). Database manage-
ment systems. NY: McGraw-Hill. Cache-Results: the result of a group-by is obtained
from other group-bys computation (in memory).
Ross, K.A., & Srivastava, D. (1997, August). Fast compu-
tation of sparse datacubes. International VLDB Confer- Data Cube: Is known as core operator in data ware-
ence (pp. 116-185), Athens, Greece. house and OLAP (Lakshmanan, Pei & Zhao, 2003).

Sarawagi, S., Agrawal, R., & Megiddo, N. (1998, March). Data Cube Operator: Computes Group-by correspond-
Discovery-driven exploration of OLAP data cubes. Inter- ing to all possible combinations of attributes in the Cube-
national EDBT Conference (pp. 168-182), Valencia, Spain. by clause.

Silberschatz, A., Korth, H., & Sudarshan, S. (2002). Data- Dependence Relation: Relates to data cube query
base system concepts. NY: McGraw-Hill. where some of the group-by queries could be answered
using the results of other.
Tan, R.B.N., Taniar, D., & Lu, G.J. (2003, March). Efficient
execution of parallel aggregate data cube queries in data Dimension Dependency: Presents when there is an
warehouse environments. International IDEAL Confer- interaction of the different dimensions with one another.
ence (pp. 709-7016), Hong Kong, China. OLAP: Describes a technology that uses a multi-
Taniar, D., & Tan, R.B.N. (2002, May). Parallel processing dimensional view of aggregate data to provide quick
of multi-join expansion_aggregate data cube query in access to strategic information for the purposes of ad-
high performance database systems. International I- vanced analysis.
SPAN Conference (pp. 51-58), Manila, Philippines. Smallest-Parent: In terms of either closest parent
Zhao, Y., Deshpande, P., Naughton, J., & Shukla, A. (1998, group-by or size of the parent group-by.
June). Simultaneous optimization and evaluation of mul-


Evolutionary Computation and Genetic Algorithms
William H. Hsu
Kansas State University, USA

INTRODUCTION This article begins with a survey of the following GA


variants: the simple genetic algorithm, evolutionary algo-
A genetic algorithm (GA) is a procedure used to find rithms, and extensions to variable-length individuals. It
approximate solutions to search problems through the then discusses GA applications to data-mining problems,
application of the principles of evolutionary biology. such as supervised inductive learning, clustering, and
Genetic algorithms use biologically inspired techniques, feature selection and extraction. It concludes with a dis-
such as genetic inheritance, natural selection, mutation, cussion of current issues in GA systems, particularly
and sexual reproduction (recombination, or crossover). alternative search techniques and the role of building
Along with genetic programming (GP), they are one of the block (schema) theory.
main classes of genetic and evolutionary computation
(GEC) methodologies.
Genetic algorithms typically are implemented using BACKGROUND
computer simulations in which an optimization problem is
specified. For this problem, members of a space of candi- The field of genetic and evolutionary computation (GEC)
date solutions, called individuals, are represented using was first explored by Turing, who suggested an early
abstract representations called chromosomes. The GA template for the genetic algorithm. Holland (1975) per-
consists of an iterative process that evolves a working set formed much of the foundational work in GEC in the 1960s
of individuals called a population toward an objective and 1970s. His goal of understanding the processes of
function, or fitness function (Goldberg, 1989; Wikipedia, natural adaptation and designing biologically inspired
2004). Traditionally, solutions are represented using fixed- artificial systems led to the formulation of the simple
length strings, especially binary strings, but alternative genetic algorithm (Holland, 1975).
encodings have been developed.
The evolutionary process of a GA is a highly simpli- State of the Field: To date, GAs have been applied
fied and stylized simulation of the biological version. It successfully to many significant problems in ma-
starts from a population of individuals randomly gener- chine learning and data mining, most notably clas-
ated according to some probability distribution (usually sification, pattern detectors (Gonzlez & Dasgupta,
uniform) and updates this population in steps called 2003; Rizki et al., 2002) and predictors (Au et al.,
generations. For each generation, multiple individuals 2003), and payoff-driven reinforcement learning1
are selected randomly from the current population, based (Goldberg, 1989).
upon some application of fitness, bred using crossover, Theory of GAs: Current GA theory consists of two
and modified through mutation, to form a new population. main approachesMarkov chain analysis and
schema theory. Markov chain analysis is primarily
Crossover: Exchange of genetic material concerned with characterizing the stochastic dy-
(substrings) denoting rules, structural components, namics of a GA system (i.e., the behavior of the
features of a machine learning, search, or optimiza- random sampling mechanism of a GA over time). The
tion problem. most severe limitation of this approach is that, while
Selection: The application of the fitness criterion to crossover is easy to implement, its dynamics are
choose which individuals from a population will go difficult to describe mathematically. Markov chain
on to reproduce. analysis of simple GAs has therefore been more
Replication: The propagation of individuals from successful at capturing the behavior of evolution-
one generation to the next. ary algorithms with selection and mutation only.
Mutation: The modification of chromosomes for These include evolutionary algorithms (EAs) and
single individuals. evolution strategies (Schwefel, 1977).
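A minimal, self-contained sketch of the simple genetic algorithm outlined above, with fixed-length bit strings, fitness-proportionate selection, single-point crossover and bit-flip mutation, is given below in plain Python. The fitness function, population size and rates are illustrative choices only, not values prescribed by the works cited here.

import random

def evolve(fitness, length=20, pop_size=30, generations=50,
           crossover_rate=0.9, mutation_rate=0.01):
    # Initial population: random fixed-length bit strings.
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        total = sum(scores) or 1e-9
        def select():
            # Fitness-proportionate (roulette-wheel) selection.
            r, acc = random.uniform(0, total), 0.0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return ind
            return pop[-1]
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = select(), select()
            if random.random() < crossover_rate:
                point = random.randint(1, length - 1)   # single-point crossover
                child = p1[:point] + p2[point:]
            else:
                child = p1[:]                            # replication
            child = [1 - b if random.random() < mutation_rate else b
                     for b in child]                     # bit-flip mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# Toy fitness: count of 1-bits ("one-max"); the GA should approach the all-ones string.
best = evolve(fitness=sum)
print(best, sum(best))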


Successful building blocks can become redundant in Applications


a GA population. This can slow down processing and also
can result in a phenomenon called takeover, where the Genetic algorithms have been applied to many classifica-
population collapses to one or a few individuals. Goldberg tion and performance tuning applications in the domain of
(2002) characterizes steady-state innovation in GAs as the knowledge discovery in databases (KDD). De Jong, et al.
situation where time to produce a new, more highly-fit produced GABIL (Genetic Algorithm-Based Inductive
building block (the innovation time, ti) is lower than the Learning), one of the first general-purpose GAs for learn-
expected time for the most fit individual to dominate the ing disjunctive normal form concepts (De Jong et al.,
entire population (the takeover time, t*). Steady state 1993). GABIL was shown to produce rules achieving
innovation is achieved, facilitating convergence toward an validation set accuracy comparable to that of decision
optimal solution, when ti < t*, because the countdown to trees induced using ID3 and C4.5.
takeover, or race, between takeover and innovation is reset. Since GABIL, there has been work on inducing rules
(Zhou et al., 2003) and decision trees (Cant-Paz & Kamath,
2003) using evolutionary algorithms. Other representa-
MAIN THRUST tions that can be evolved using a genetic algorithm
include predictors (Au et al., 2003) and anomaly detectors
The general strengths of genetic algorithms lie in their (Gonzlez & Dasgupta, 2003). Unsupervised learning
ability to explore the search space efficiently through methodologies, such as data clustering (Hall et al., 1999;
parallel evaluation of fitness (Cant-Paz, 2000) and mixing Lorena & Furtado, 2001) also admit GA-based represen-
of partial solutions through crossover (Goldberg, 2002); to tation, with application to such current data-mining prob-
maintain a search frontier to seek global optima (Goldberg, lems as gene expression profiling in the domain of compu-
1989); and to solve multi-criterion optimization problems. tational biology (Iba, 2004). KDD from text corpora is
The basic units of partial solutions are referred to in the another area where evolutionary algorithms have been
literature as building blocks, or schemata. Modern GEC applied (Atkinson-Abutridy et al., 2003).
systems also are able to produce solutions of variable GAs can be used to perform meta-learning, or higher-
length (De Jong et al., 1993; Kargupta & Ghosh, 2002). order learning, by extracting features (Raymer et al., 2000),
A more specific advantage of GAs is their ability to selecting features (Hsu, 2003), or selecting training in-
represent rule-based, permutation-based, and construc- stances (Cano et al., 2003). They also have been applied
tive solutions to many pattern-recognition and machine- to combine, or fuse, classification functions (Kuncheva &
learning problems. Examples of this include induction of Jain, 2000).
decision trees (Cant-Paz & Kamath, 2003) among several
other recent applications surveyed in the following.
FUTURE TRENDS
Types of Gas
Some limitations of GAs are that, in certain situations,
The simplest genetic algorithm represents each chromo- they are overkill compared to more straightforward opti-
some as a bit string (containing binary digits 0 and 1) of mization methods such as hill-climbing, feed-forward ar-
fixed length. Numerical parameters can be represented by tificial neural networks using back propagation, and even
integers, though it is possible to use floating-point rep- simulated annealing and deterministic global search. In
resentations for reals. The simple GA performs crossover global optimization scenarios, GAs often manifest their
and mutation at the bit level for all of these (Goldberg, strengths: efficient, parallelizable search; the ability to
1989; Wikipedia, 2004). evolve solutions with multiple objective criteria (Llor &
Other variants treat the chromosome as a parameter Goldberg, 2003); and a characterizable and controllable
list, containing indices into an instruction table or an process of innovation.
arbitrary data structure with pre-defined semantics (e.g., Several current controversies arise from open research
nodes in a linked list, hashes, or objects. Crossover and problems in GEC:
mutation are required to preserve semantics by respecting
object boundaries, and formal invariants for each genera- Selection is acknowledged to be a fundamentally
tion can be specified according to these semantics. For important genetic operator. Opinion, however, is
most data types, operators can be specialized, with differ- divided over the importance of crossover vs. muta-
ing levels of effectiveness that generally are domain- tion. Some argue that crossover is the most impor-
dependent (Wikipedia, 2004). tant, while mutation is only necessary to ensure that
potential solutions are not lost. Others argue that


crossover in a largely uniform population only serves to propagate innovations originally found by mutation, and that in a non-uniform population, crossover is nearly always equivalent to a very large mutation, which is likely to be catastrophic.
•	In the field of GEC, basic building blocks for solutions to engineering problems primarily have been characterized using schema theory, which has been critiqued as being insufficiently exact to characterize the expected convergence behavior of a GA. Proponents of schema theory have shown that it provides useful normative guidelines for design of GAs and automated control of high-level GA properties (e.g., population size, crossover parameters, and selection pressure).
•	Recent and current research in GEC relates certain evolutionary algorithms to ant colony optimization (Parpinelli, Lopes & Freitas, 2002).

CONCLUSION

Genetic algorithms provide a comprehensive search methodology for machine learning and optimization. They have been shown to be efficient and powerful through many data-mining applications that use optimization and classification. The current literature (Goldberg, 2002; Wikipedia, 2004) contains several general observations about the generation of solutions using a genetic algorithm:

•	GAs are sensitive to deceptivity, the irregularity of the fitness landscape. This includes locally optimal solutions that are not globally optimal, lack of fitness gradient for a given step size, and jump discontinuities in fitness.
•	In general, GAs have difficulty with adaptation to dynamic concepts or objective criteria. This phenomenon, called concept drift in supervised learning and data mining, is a problem, because GAs traditionally are designed to evolve highly fit solutions (populations containing building blocks of high relative and absolute fitness) with respect to stationary concepts.
•	GAs are not always effective at finding globally optimal solutions but can rapidly locate good solutions, even for difficult search spaces. This makes steady-state GAs (i.e., Bayesian optimization GAs that collect and integrate solution outputs after convergence to an accurate representation of building blocks) a useful alternative to generational GAs (maximization GAs that seek the best individual of the final generation after convergence).

Looking ahead to future opportunities and challenges in data mining, genetic algorithms are widely applicable to classification by means of inductive learning. GAs also provide a practical method for optimization of data preparation and data transformation steps. The latter includes clustering, feature selection and extraction, and instance selection. In data mining, GAs likely are most useful where high-level, fitness-driven search is needed. Non-local search (global search or search with an adaptive step size) and multi-objective data mining are also problem areas where GAs have proven promising.

REFERENCES

Atkinson-Abutridy, J., Mellish, C., & Aitken, S. (2003). A semantically guided and domain-independent evolutionary model for knowledge discovery from texts. IEEE Transactions on Evolutionary Computation, 7(6), 546-560.

Au, W.-H., Chan, K.C.C., & Yao, X. (2003). A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation, 7(6), 532-545.

Cano, J.R., Herrera, F., & Lozano, M. (2003). Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study. IEEE Transactions on Evolutionary Computation, 7(6), 561-575.

Cantú-Paz, E. (2000). Efficient and accurate parallel genetic algorithms. Norwell, MA: Kluwer.

Cantú-Paz, E., & Kamath, C. (2003). Inducing oblique decision trees with evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 7(1), 54-68.

De Jong, K.A., Spears, W.M., & Gordon, F.D. (1993). Using genetic algorithms for concept learning. Machine Learning, 13, 161-188.

Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.

Goldberg, D.E. (2002). The design of innovation: Lessons from and for competent genetic algorithms. Norwell, MA: Kluwer.

González, F.A., & Dasgupta, D. (2003). Anomaly detection using real-valued negative selection. Genetic Programming and Evolvable Machines, 4(4), 383-403.

Hall, L.O., Ozyurt, I.B., & Bezdek, J.C. (1999). Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation, 3(2), 103-112.


Holland, J.H. (1975). Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press.

Hsu, W.H. (2003). Control of inductive bias in supervised learning using evolutionary computation: A wrapper-based approach. In J. Wang (Ed.), Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

Iba, H. (2004). Classification of gene expression profile using combinatory method of evolutionary computation and machine learning. Genetic Programming and Evolvable Machines, Special Issue on Biological Applications of Genetic and Evolutionary Computation, 5(2), 145-156.

Kargupta, H., & Ghosh, S. (2002). Toward machine learning through genetic code-like transformations. Genetic Programming and Evolvable Machines, 3(3), 231-258.

Kuncheva, L.I., & Jain, L.C. (2000). Designing classifier fusion systems by genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(4), 327-336.

Llorà, X., & Goldberg, D.E. (2003). Bounding the effect of noise in multiobjective learning classifier systems. Evolutionary Computation, 11(3), 278-297.

Lorena, L.A.N., & Furtado, J.C. (2001). Constructive genetic algorithm for clustering problems. Evolutionary Computation, 9(3), 309-328.

Parpinelli, R.S., Lopes, H.S., & Freitas, A.A. (2002). Data mining with an ant colony optimization algorithm. IEEE Transactions on Evolutionary Computation, 6(4), 321-332.

Raymer, M.L., Punch, W.F., Goodman, E.D., Kuhn, L.A., & Jain, A.K. (2000). Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(2), 164-171.

Rizki, M.M., Zmuda, M.A., & Tamburino, L.A. (2002). Evolving pattern recognition systems. IEEE Transactions on Evolutionary Computation, 6(6), 594-609.

Schwefel, H.-P. (1997). Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie. Basel: Birkhäuser Verlag.

Wikipedia. (2004). Genetic algorithm. Retrieved from http://en.wikipedia.org/wiki/Genetic_algorithm

Zhou, C., Xiao, W., Tirpak, T.M., & Nelson, P.C. (2003). Evolving accurate and compact classification rules with gene expression programming. IEEE Transactions on Evolutionary Computation, 7(6), 519-531.

KEY TERMS

Crossover: In biology, a process of sexual recombination, by which two chromosomes are paired up and exchange some portion of their genetic sequence. Crossover in GAs is highly stylized and typically involves exchange of strings. These can be performed using a crossover bit mask in bit-string GAs but require complex exchanges (i.e., partial-match, order, and cycle crossover) in permutation GAs.

Evolutionary Computation: A solution approach based on simulation models of natural selection, which begins with a set of potential solutions and then iteratively applies algorithms to generate new candidates and select the fittest from this set. The process leads toward a model that has a high proportion of fit individuals.

Generation: The basic unit of progress in genetic and evolutionary computation, a step in which selection is applied over a population. Usually, crossover and mutation are applied once per generation, in strict order.

Individual: A single candidate solution in genetic and evolutionary computation, typically represented using strings (often of fixed length) and permutations in genetic algorithms, or using problem-solver representations (programs, generative grammars, or circuits) in genetic programming.

Meta-Learning: Higher-order learning by adapting the parameters of a machine-learning algorithm or the algorithm definition itself, in response to data. An example of this approach is the search-based wrapper of inductive learning, which wraps a search algorithm around an inducer to find locally optimal parameter values such as the relevant feature subset for a given classification target and data set. Validation set accuracy typically is used as fitness. Genetic algorithms have been used to implement such wrappers for decision tree and Bayesian network inducers.

Mutation: In biology, a permanent, heritable change to the genetic material of an organism. Mutation in GAs involves string-based modifications to the elements of a candidate solution. These include bit-reversal in bit-string GAs and shuffle and swap operators in permutation GAs.

Permutation GA: A type of GA where individuals represent a total ordering of elements, such as cities to be visited in a minimum-cost graph tour (the Traveling Salesman Problem). Permutation GAs use specialized crossover and mutation operators compared to the more common bit-string GAs.


Schema (pl. Schemata): An abstract building block of a GA-generated solution, corresponding to a set of individuals. Schemata typically are denoted by bit strings with don't-care symbols # (e.g., 1#01#00# is a schema with 2³ = 8 possible instances, one for each instantiation of the # symbols to 0 or 1). Schemata are important in GA research, because they form the basis of an analytical approach called schema theory, for characterizing building blocks and predicting their proliferation and survival probability across generations, thereby describing the expected relative fitness of individuals in the GA.

Selection: In biology, a mechanism by which the fittest individuals survive to reproduce and the basis of speciation according to the Darwinian theory of evolution. Selection in GP involves evaluation of a quantitative criterion over a finite set of fitness cases, with the combined evaluation measures being compared in order to choose individuals.

ENDNOTE

1. Payoff-driven reinforcement learning describes a class of learning problems for intelligent agents that receive rewards, or reinforcements, from the environment in response to actions selected by a policy function. These rewards are transmitted in the form of payoffs, sometimes strictly non-negative. A GA acquires policies by evolving individuals, such as condition-action rules, that represent candidate policies.
criterion over a finite set of fitness cases, with the com-


Evolutionary Data Mining for Genomics


Laetitia Jourdan
LIFL, University of Lille 1, France

Clarisse Dhaenens
LIFL, University of Lille 1, France

El-Ghazali Talbi
LIFL, University of Lille 1, France

INTRODUCTION

Knowledge discovery from genomic data has become an important research area for biologists. Nowadays, a lot of data is available on the Web, but it is wrong to say that corresponding knowledge is also available. For example, the first draft of the human genome, which contains 3,000,000,000 letters, was achieved in June 2000, but, up to now, only a small part of the hidden knowledge has been discovered. This is the aim of bioinformatics, which brings together biology, computer science, mathematics, statistics, and information theory to analyze biological data for interpretation and prediction. Hence, many problems encountered while studying genomic data may be modeled as data mining tasks, such as feature selection, classification, clustering, or association rule discovery.
An important characteristic of genomic applications is the large amount of data to analyze, and, most of the time, it is not possible to enumerate all the possibilities. Therefore, we propose to model these knowledge discovery tasks as combinatorial optimization tasks in order to apply efficient optimization algorithms to extract knowledge from large datasets. To design an efficient optimization algorithm, several aspects have to be considered. The main one is the choice of the type of resolution method according to the characteristics of the problem. Is it an easy problem, for which a polynomial algorithm may be found? If the answer is yes, then let us design such an algorithm. Unfortunately, most of the time, the response to the question is no, and only heuristics that may find good but not necessarily optimal solutions can be used. In our approach, we focus on evolutionary computation, which has already shown an interesting ability to solve highly complex combinatorial problems.
In this article, we will show the efficacy of such an approach while describing the main steps required to solve data mining problems from genomics with evolutionary algorithms. We will illustrate these steps with a real problem.

BACKGROUND

Evolutionary data mining for genomics groups three important fields: evolutionary computation, knowledge discovery, and genomics.
It is now well known that evolutionary algorithms are well suited for some data mining tasks (Freitas, 2002). Here, we want to show the interest of dealing with genomic data, thanks to evolutionary approaches. A first proof of this interest may be the recent book by Gary Fogel and David Corne, Evolutionary Computation in Bioinformatics, which groups several applications of evolutionary computation to problems in the biological sciences and, in particular, in bioinformatics (Fogel & Corne, 2002). In this article, several data mining tasks are addressed, such as feature selection or clustering, and solved, thanks to evolutionary approaches.
Another proof of the interest of such approaches is the number of sessions around evolutionary computation in bioinformatics and computational biology that have been organized during the last Congress on Evolutionary Computation (CEC) in Portland, Oregon in 2004.
The aim of genomic studies is to understand the function of genes, to determine which genes are involved in a given process, and how genes are related. Hence, experiments are conducted, for example, to localize coding regions in DNA sequences and/or to evaluate the expression level of genes in certain conditions. Resulting from this, data available for the bioinformatics researcher may deal with DNA sequence information that is related to other types of data. The example used to illustrate this article may be classified in this category.
Another type of data deals with the recent technology called microarray, which allows the simultaneous measurement of the expression level of thousands of genes under different conditions (i.e., various time points of a process, absorption of different drugs, etc.). This new type of data requires specific data mining tasks, as the number of genes to study is very large and the


number of conditions may be limited. Classical questions are the classification or the clustering of genes based on their expression pattern, and commonly used approaches may vary from statistical approaches (Yeung & Ruzzo, 2001) to evolutionary approaches (Merz, 2002) and may use additional biological information, such as gene ontology (GO) (Speer, Spieth & Zell, 2004). Recently, the bi-clustering that allows the grouping of instances having similar characteristics for a subset of attributes (here, genes having the same expression patterns for a subset of conditions) has been applied to this type of data and evolutionary approaches proposed (Bleuler, Prelić & Zitzler, 2004). In this context of microarray data analysis, association rule discovery also has been realized using evolutionary algorithms (Khabzaoui, Dhaenens & Talbi, 2004).

MAIN THRUST

In order to extract knowledge from genomic data using evolutionary algorithms, several steps have to be considered:

1. Identification of the knowledge discovery task from the biological problem under study;
2. Design of this task as an optimization problem;
3. Resolution using an evolutionary approach.

Hence, in this section, we will focus on each of these steps. First, we will present the genomic application that we will use to illustrate the rest of the article and indicate the knowledge discovery tasks that have been extracted. Then, we will show the challenges and some proposed solutions for the two other steps.

Genomics Application

The genomic problem under study is to formulate hypotheses on predisposition factors of different multifactorial diseases, such as diabetes and obesity. In such diseases, one of the difficulties is that healthy people can become affected during their life, so only the affected status is relevant. This work has been done in collaboration with the Biology Institute of Lille (IBL, France). One approach aims to discover the contribution of environmental factors and genetic factors in the pathogenesis of the disease under study by discovering complex interactions, such as ([gene A and gene B] or [gene C and environmental factor D]) in one or more populations. The rest of the article will use this problem as an illustration.
To solve such a problem, the first thing is to formulate it into a classical data mining task. The difficulty of such a formulation is to identify the task. This work must be done through discussions and cooperation with biologists in order to agree on the objective of the problems. For example, in our data, identifying groups of people can be modeled as a clustering task, as we cannot take into account non-affected people. Moreover, a lot of loci have to be studied (3,652 points of comparison on the 23 chromosomes and two environmental factors) and classical clustering algorithms are not able to cope with so many points. So, we decided first to execute a feature selection in order to reduce the number of loci in consideration and to extract the most influential features that will be used for the clustering. Hence, the model of this problem is decomposed into two phases: feature selection and clustering.

From a Data Mining Task to an Optimization Problem

The most difficult aspect of turning a data mining task into an optimization problem is to define the criterion to optimize. The choice of the optimization criterion, which measures the quality of candidate knowledge to be extracted, is very important, and the quality of the results of the approach depends on it. Indeed, developing a very efficient method that does not use the right criterion will lead to obtaining the right answer to the wrong question. The optimization criterion either can be specific to the data mining task or dependent on the biological application. Several different choices exist. For example, considering gene clustering, the optimization criterion can be the minimization of the minimum sum-of-squares (MSS) (Merz, 2002), while for the determination of the members of a predictive gene group, the criterion can be the maximization of the classification success using a maximum likelihood (MLHD) classification method (Ooi & Tan, 2003).
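To make the notion of an optimization criterion concrete, the following Python sketch scores a candidate clustering by its within-cluster sum of squares; it only illustrates the kind of quantity that can serve as a fitness value and is not the exact MSS formulation used by Merz (2002).

```python
def within_cluster_sum_of_squares(points, assignment, k):
    """Sum of squared distances of each point to the mean of its cluster.

    points     -- list of feature vectors (lists of floats)
    assignment -- cluster index (0..k-1) for each point, i.e. a candidate solution
    """
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if not members:
            continue
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members)
    return total

# A GA minimizing this value could use, e.g., fitness = -within_cluster_sum_of_squares(...)
data = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.2]]
print(within_cluster_sum_of_squares(data, [0, 0, 1, 1], k=2))
```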
Once the optimization criterion is defined, the sec-
The genomic problem under study is to formulate hy- ond step of the design of the data mining task into an
potheses on predisposition factors of different multi- optimization problem is to define the encoding of a
factorial diseases, such as diabetes and obesity. In such solution, which may be independent of the resolution
diseases, one of the difficulties is that sane people can method. For example, for clustering problems in gene
become affected during their life, so only the affected expression mining with evolutionary algorithm,
status is relevant. This work has been done in collabora- Faulkenauer and Marchand (2001) use the specific CGA
tion with the Biology Institute of Lille (IBL, France). encoding that is dedicated to grouping problems and is
One approach aims to discover the contribution of well suited to clustering.
environmental factors and genetic factors in the patho- Regarding the genomic application used to illustrate
genesis of the disease under study by discovering com- this article, two phases have been isolated. For the
plex interactions, such as ([gene A and gene B] or [gene feature selection, an optimization approach has been
C and environmental factor D]) in one or more popula- adopted, using an evolutionary algorithm (see next para-
tion. The rest of the article will use this problem as an graph), whereas a classical approach (k-means) has been
illustration. chosen for the clustering phase. Determining the opti-
To solve such a problem, the first thing is to formu- mization criterion for the feature selection was not an easy
late it into a classical data mining task. The difficulty of task, as it was difficult not to favor small sets of features.


A corrective factor has been introduced (Jourdan et al., 2002).

Solving with Evolutionary Algorithms

Once the formalization of the data mining task into an optimization problem is done, resolution methods either can be exact methods, specific heuristics, or metaheuristics. As the space of the potential knowledge is exponential in genomics problems (Zaki & Ho, 2000), exact methods are almost always discarded. The drawbacks of heuristic approaches are that it is difficult to cope with multiple solutions and not easy to integrate specific knowledge in a general approach. The advantages of metaheuristics are that you can define a general framework to solve the problem while specializing some agents in order to suit a specific problem. Genetic Algorithms (GAs), which represent a class of evolutionary methods, have given good results on hard combinatorial problems (Michalewicz, 1996).
In order to develop a genetic algorithm for knowledge discovery, we have to focus on the following:

•	Operators
•	Diversification mechanisms
•	Intensification mechanisms

Operators allow GAs to explore the search space and must be adapted to the problem. Generally, there are two classes of operators: mutation and crossover. The mutation allows diversity. For the feature selection task under study, the mutation flips n bits (Jourdan, Dhaenens & Talbi, 2001). The crossover produces one, two, or more children solutions by recombining two or more parents. The objective of this mechanism is to keep useful information of the parents in order to ameliorate the solutions. In the considered problem, the subset-oriented common feature crossover operator (SSOCF) has been used. Its objective is to produce offspring that have the same distribution as the parents. This operator is well adapted for feature selection (Emmanouilidis, Hunter & MacIntyre, 2000). Another advantage of evolutionary algorithms is that you easily can use other data mining algorithms as an operator; for example, a k-means iteration may be used as an operator in a clustering problem (Krishna & Murty, 1999).
Working on knowledge discovery on a particular domain by optimization leads to the definition and use of several operators, where some may use domain knowledge, and others may be specific to the model and/or the encoding. To take advantage of all the operators, the idea is to use adaptive mechanisms (Hong, Wang & Chen, 2000) that help to adapt the application probabilities of these operators according to the progress they produce. Hence, operators that are less efficient are less used, which may change during the search, if they become more efficient.
Diversification mechanisms are designed to avoid premature convergence. There exist several mechanisms. The more classical are the sharing and the random immigrant. The sharing boosts the selection of individuals that lie in less crowded areas of the search space (Mahfoud, 1995). To apply such a mechanism, a distance between solutions has to be defined. In the feature selection for the genomic association discovery, a distance has been defined by integrating knowledge of the application domain. The distance is correlated to a Hamming distance, which integrates biological notions (chromosomal cut, inheritance notion, etc.).
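A classical way to implement sharing over bit-string solutions is sketched below; the plain Hamming distance, niche radius, and triangular sharing function are illustrative assumptions and do not reproduce the biologically informed distance described above.

```python
def hamming(a, b):
    """Number of positions at which two bit-string solutions differ."""
    return sum(x != y for x, y in zip(a, b))

def shared_fitness(population, raw_fitness, radius=3):
    """Divide each raw fitness by a niche count so that crowded regions are penalized."""
    adjusted = []
    for ind, fit in zip(population, raw_fitness):
        niche = sum(max(0.0, 1.0 - hamming(ind, other) / radius) for other in population)
        adjusted.append(fit / niche if niche > 0 else fit)
    return adjusted

pop = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1]]
print(shared_fitness(pop, raw_fitness=[4.0, 4.0, 3.0]))
```

The isolated third individual keeps most of its raw fitness, while the two near-duplicates see theirs reduced, which is the intended diversification effect.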
A further approach to the diversification of the population, the random immigrant, introduces new individuals. An idea is to generate new individuals by recording statistics on previous selections.
Assessing the efficiency of such algorithms applied to genomic data is not easy, as most of the time, biologists have no exact idea about what must be found. Hence, one step in this analysis is to develop simulated data for which optimal results are known. In this manner, it is possible to measure the efficiency of the proposed method. For example, in the problem under study, predetermined genomic associations were constructed to form simulated data. Then, the algorithm was tested on these data and found these associations.

FUTURE TRENDS

There has been much work on evolutionary data mining for genomics. In order to be more efficient and to propose more interesting solutions for decision makers, researchers are investigating multi-criteria design of the data mining tasks. Indeed, we exposed that one of the critical phases was the determination of the optimization criterion, and it may be difficult to select a single one. In response to this problem, the multi-criteria design allows us to take into account some criteria dedicated to a specific data mining task and some criteria coming from the application domain. Evolutionary algorithms that work on a population of solutions are well adapted to multi-criteria problems, as they can exploit Pareto approaches and propose several good solutions (i.e., solutions of best compromise).
For data mining in genomics, rule discovery has not been very well applied and should be studied carefully, as this is a very general model. Moreover, an interesting multi-criteria model has been proposed for this task and starts to give some interesting results by using multi-criteria genetic algorithms (Jourdan et al., 2004).


CONCLUSION

Genomics is a real challenge for researchers in data mining, as most of the problems encountered are difficult to address for several reasons: (1) the objectives are not always very clear, neither for the data mining specialist nor for the biologist, and a real dialog has to exist between the two; (2) data may be uncertain and noisy; (3) the number of experiments may not be enough in comparison to the number of elements that have to be studied.
In this article, we present the different phases required to deal with data mining problems arising in genomics, thanks to optimization approaches. The objective was to give for each phase the main challenges and to illustrate through an example some responses. We saw that the definition of the optimization criterion is a central point of this approach, and the multi-criteria optimization may be used as a response to this difficulty.

REFERENCES

Bleuler, S., Prelić, A., & Zitzler, E. (2004). An EA framework for biclustering of gene expression data. Proceedings of the Congress on Evolutionary Computation (CEC04), Portland, Oregon.

Emmanouilidis, C., Hunter, A., & MacIntyre, J. (2000). A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator. Proceedings of the Congress on Evolutionary Computation, San Diego, California.

Falkenauer, E., & Marchand, A. (2001). Using k-means? Consider ArrayMiner. Proceedings of the 2001 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS2001), Las Vegas, Nevada.

Fogel, G., & Corne, D. (2002). Evolutionary computation in bioinformatics. Morgan Kaufmann.

Freitas, A.A. (2002). Data mining and knowledge discovery with evolutionary algorithms. Springer-Verlag.

Hong, T.-P., Wang, H.-S., & Chen, W.-C. (2000). Simultaneously applying multiple mutation operators in genetic algorithms. J. Heuristics, 6(4), 439-455.

Jourdan, L., Dhaenens, C., & Talbi, E.-G. (2001). An optimization approach to mine genetic data. Proceedings of Biological Data Mining and Knowledge Discovery (METMBS01).

Jourdan, L., Dhaenens, C., Talbi, E.-G., & Gallina, S. (2002). A data mining approach to discover genetic factors involved in multifactorial diseases. Knowledge Based Systems (KBS) Journal, 15(4), 235-242.

Jourdan, L., Khabzaoui, M., Dhaenens, C., & Talbi, E.-G. (2004). A hybrid metaheuristic for knowledge discovery in microarray experiments. Handbook of bioinspired algorithms and applications. CRC Press.

Khabzaoui, M., Dhaenens, C., & Talbi, E.-G. (2004). Association rules discovery for DNA microarray data. Proceedings of the Bioinformatics Workshop of SIAM International Conference on Data Mining, Orlando, Florida.

Krishna, K., & Murty, M. (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man and Cybernetics, 29(3), 433-439.

Mahfoud, S.W. (1995). Niching methods for genetic algorithms [doctoral thesis]. IL: University of Illinois.

Merz, P. (2002). Clustering gene expression profiles with memetic algorithms. Proceedings of the 7th International Conference on Parallel Problem Solving from Nature (PPSN VII).

Michalewicz, Z. (1996). Genetic algorithms + data structures = evolution programs. Springer-Verlag.

Ooi, C.H., & Tan, P. (2003). Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics, 19, 37-44.

Speer, N., Spieth, C., & Zell, A. (2004). A memetic co-clustering algorithm for gene expression profiles and biological annotation. Proceedings of the Congress on Evolutionary Computation (CEC04), Portland, Oregon.

Yeung, K.Y., & Ruzzo, W.L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics, 17, 763-774.

Zaki, M., & Ho, C.T. (2000). Large-scale parallel data mining. State-of-the-Art Survey, LNAI.

KEY TERMS

Association Rule Discovery: Implication of the form X → Y, where X and Y are sets of items; implies that if the condition X is verified, the prediction Y is valid.

Bioinformatics: Field of science in which biology, computer science, and information technology merge into a single discipline.

Clustering: Data mining task in which the system has to classify a set of objects without any information on the characteristics of the classes.


Feature (or Attribute): Quantity or quality describing an instance.

Feature Selection: Task of identifying and selecting a useful subset of features from a large set of redundant, perhaps irrelevant, features.

Genetic Algorithm: Evolutionary algorithm using a population and based on the Darwinian principle, the survival of the fittest.

Multi-Factorial Disease: Disease caused by several factors. Often, multi-factorial diseases are due to two kinds of causality that interact: one is genetic (and often polygenetic) and the other is environmental.

Optimization Criterion: Criterion that gives the quality of a solution of an optimization problem.


Evolutionary Mining of Rule Ensembles


Jorge Muruzábal
University Rey Juan Carlos, Spain

INTRODUCTION

Ensemble rule based classification methods have been popular for a while in the machine-learning literature (Hand, 1997). Given the advent of low-cost, high-computing power, we are curious to see how far we can go by repeating some basic learning process, obtaining a variety of possible inferences, and finally basing the global classification decision on some sort of ensemble summary. Some general benefits to this idea have been observed indeed, and we are gaining wider and deeper insights on exactly why this is the case on many fronts of interest.
There are many ways to approach the ensemble-building task. Instead of locating ensemble members independently, as in Bagging (Breiman, 1996), or with little feedback from the joint behavior of the forming ensemble, as in Boosting (see, e.g., Schapire & Singer, 1998), members can be created at random and then made subject to an evolutionary process guided by some fitness measure. Evolutionary algorithms mimic the process of natural evolution and thus involve populations of individuals (rather than a single solution iteratively improved by hill climbing or otherwise). Hence, they are naturally linked to ensemble-learning methods. Based on the long-term processing of the data and the application of suitable evolutionary operators, fitness landscapes can be designed in intuitive ways to prime the ensemble's desired properties. Most notably, beyond the intrinsic fitness measures typically used in pure optimization processes, fitness can also be endogenous, that is, it can prime the context of each individual as well.

BACKGROUND

A number of evolutionary mining algorithms are available nowadays. These algorithms may differ in the nature of the evolutionary process or in the basic models considered for the data or in other ways. For example, approaches based on the genetic programming (GP), evolutionary programming (EP), and classifier system (CS) paradigms have been considered, while predictive rules, trees, graphs, and other structures have been evolved. See Eiben and Smith (2003) for an introduction to the general GP, EP, and CS frameworks and Koza, Keane, Streeter, Mydlowec, Yu, and Lanza (2003) for an idea of performance by GP algorithms at the patent level. Here I focus on ensemble rule based methods for classification tasks or supervised learning (Hand, 1997).
The CS architecture is naturally suitable for this sort of rule assembly problems, for its basic representation unit is the rule, or classifier (Holland, Holyoak, Nisbett, & Thagard, 1986). Interestingly, tentative ensembles in CS algorithms are constantly tested for successful cooperation (leading to correct predictions). The fitness measure seeks to reinforce those classifiers leading to success in each case. However interesting, the CS approach in no way exhausts the scope of evolutionary computation ideas for ensemble-based learning; see, for example, Kuncheva and Jain (2000), Liu, Yao, and Higuchi (2000), and Folino, Pizzuti, and Spezzano (2003).
Ensembles of trees or rules are the natural reference for evolutionary mining approaches. Smaller trees, made by rules (leaves) with just a few tests, are of particular interest. Stumps place a single test and constitute an extreme case (which is nevertheless used often). These rules are more general and hence tend to make more mistakes, yet they are also easier to grasp and explain. A related notion is that of support, the estimated probability that new data satisfy a given rule's conditions. A great deal of effort has been done in the contemporary CS literature to discern the idea of adequate generality, a recurrent topic in the machine-learning arena.

MAIN THRUST

Evolutionary and Tree-Based Rule Ensembles

In this section, I review various methods for ensemble formation. As noted earlier, in this article, I use the ensemble to build averages of rules. Instead of averaging, one could also select the most suitable classifier in each case and make the decision on the basis of that rule alone (Hand, Adams, & Kelly, 2001). This alternative idea may provide additional insights of interest, but I do not analyze it further here.


It is conjectured that maximizing the degree of interaction amongst the rules already available is critical for efficient learning (Kuncheva & Jain, 2000; Hand et al., 2001). A fundamental issue concerns then the extent to which tentative rules work together and are capable of influencing the learning of new rules. Conventional methods like Bagging and Boosting show at most moderate amounts of interaction in this sense. While Bagging and Boosting are useful, well-known data-mining tools, it is appropriate to explore other ensemble-learning ideas as well. In this article, I focus mainly on the CS algorithm. CS approaches provide interesting architectures and introduce complex nonlinear processes to model prediction and reinforcement. I discuss a specific CS algorithm and show how it opens interesting pathways for emergent cooperative behaviour.

Conventional Rule Assembly

In Bagging methods, different training samples are created by bootstrapping, and the same basic learning procedure is applied on each bootstrapped sample. In Bagging trees, predictions are decided by majority voting or by averaging the various opinions available in each case. This idea is known to reduce the basic instability of trees (Breiman, 1996).
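The following Python sketch illustrates the Bagging scheme just described (bootstrap resampling plus majority voting). The one-feature threshold stump used as the base learner is a toy assumption standing in for the decision trees of Breiman's setting.

```python
import random
from collections import Counter

def bootstrap(dataset):
    """Resample the training set with replacement (same size as the original)."""
    return [random.choice(dataset) for _ in dataset]

def bag(dataset, learn, rounds=25):
    """Train one model per bootstrapped sample; 'learn' is any base learner
    that returns a callable classifier."""
    return [learn(bootstrap(dataset)) for _ in range(rounds)]

def bagging_predict(models, x):
    """Majority vote over the predictions of the individual models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Toy base learner (an assumption for illustration): a one-feature threshold stump.
def learn_stump(sample):
    threshold = sum(x for x, _ in sample) / len(sample)
    above = Counter(y for x, y in sample if x > threshold).most_common(1)
    hi = above[0][0] if above else 0
    return lambda x: hi if x > threshold else 1 - hi

data = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]
models = bag(data, learn_stump)
print(bagging_predict(models, 0.8))
```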
A distinctive feature of the Boosting approach is the iterative calling to a basic weak learner (WL) algorithm (Schapire & Singer, 1998). Each time the WL is invoked, it takes as input the training set together with a dynamic (probability) weight distribution over the data and returns a single tree. The output of the algorithm is a weighted sum itself, where the weights are proportional to individual performance error. The WL learning algorithm needs only to produce moderately successful models. Thus, trees and simplified trees (stumps) constitute a popular choice. Several weight updating schemes have been proposed. Schapire and Singer update weights according to the success of the last model incorporated, whereas in their LogitBoost algorithm, Friedman, Hastie, and Tibshirani (2000) let the weights depend on overall probabilistic estimates. This latter idea better reflects the joint work of all classifiers available so far and hence should provide a more effective guide for the WL in general.
The notion of abstention brings a connection with the CS approach that will be apparent as I discuss the match set idea in the following sections. In standard boosting trees, each tree contributes a leaf to the overall prediction for any new x input data vector, so the number of expressing rules is the number of boosting rounds independently of x. In the system proposed by Cohen and Singer (1999), the WL essentially produces rules or single leaves C (rather than whole trees). Their classifiers are then maps taking only two values, a real number for those x verifying the leaf and 0 elsewhere. The final boosting aggregation for x is thus unaffected by all abstaining rules (with x ∉ C), so the number of expressing rules may be a small fraction of the total number of rules.

The General CS-Based Evolutionary Approach

The general classifier system (CS) architecture invented by John Holland constitutes perhaps one of the most sophisticated classes of evolutionary computation algorithms (Holland et al., 1986). Originally conceived as a model for cognitive tasks, it has been considered in many (simplified) forms to address a number of learning problems. The nowadays standard stimulus-response (or single-step) CS architecture provides a fascinating approach to the representation issue. Straightforward rules (classifiers) constitute the CS building blocks. CS algorithms maintain a population of such predictive rules whose conditions are hyperplanes involving the wild-card character #. If we generalize the idea of hyperplane to mean conjunctions of conditions on predictors where each condition involves a single predictor, we see that these rules are also used by many other learning algorithms. Undoubtedly, hyperplane interpretability is a major factor behind this popularity.
Critical subsystems in CS algorithms are the performance, credit-apportionment, and rule discovery modules (Eiben & Smith, 2003). As regards credit-apportionment, the question has been recently raised about the suitability of endogenous reward schemes, where endogenous refers to the overall context in which classifiers act, versus other schemes based on intrinsic value measures (Booker, 2000). A well-known family of algorithms is XCS (and descendants), some of which have been previously advocated as data-mining tools (see, e.g., Wilson, 2001). The complexity of the CS dynamics has been analyzed in detail in Westerdale (2001).
The match set M=M(x) is the subset of matched (concurrently activated) rules, that is, the collection of all classifiers whose condition is verified by the input data vector x. The (point) prediction for a new x will be based exclusively on the information contained in this ensemble M.
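A minimal Python sketch of the match set idea follows, assuming toy hand-written rules with #-wildcard conditions and fixed class distributions. In a real classifier system these distributions would be learned from matched training items, so the fragment is illustrative only and is not the BYPASS or XCS implementation.

```python
def matches(condition, x):
    """A condition is a string over {'0', '1', '#'}; '#' matches either bit."""
    return all(c == '#' or c == bit for c, bit in zip(condition, x))

def match_set(population, x):
    """M(x): the classifiers whose condition is satisfied by the input x."""
    return [rule for rule in population if matches(rule["condition"], x)]

def point_prediction(population, x):
    """Average the matched rules' class distributions and return the best label."""
    m = match_set(population, x)
    if not m:
        return None
    classes = m[0]["distribution"].keys()
    mixture = {c: sum(r["distribution"][c] for r in m) / len(m) for c in classes}
    return max(mixture, key=mixture.get)

rules = [
    {"condition": "1#0#", "distribution": {"A": 0.8, "B": 0.2}},
    {"condition": "##0#", "distribution": {"A": 0.4, "B": 0.6}},
    {"condition": "0###", "distribution": {"A": 0.1, "B": 0.9}},
]
print(point_prediction(rules, "1101"))
```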
A System Based on Support and Predictive Scoring

Support is a familiar notion in various data-mining scenarios. There is a general trade-off between support and



predictive accuracy: the larger the support, the lower the Furthermore, the use of probabilistic predictions
accuracy. The importance of explicitly bounding support R(j) for j=1, ..., k, where k is the number of output classes, -
(or deliberately seeking high support) has been recog- makes possible a natural ranking of the M=M(x) as-
nized often in the literature (Greene & Smith, 1994; Fried- signed probabilities Ri(y) for the true class y related to
man & Fisher, 1999; Muselli & Liberati, 2002). Because the the current x. Rules i with large Ri(y) (scoring high) are
world of generality intended by using only high-support generally preferred in each niche. In fact, only a few rules
rules introduces increased levels of uncertainty and error, are rewarded at each step, so rules compete with each
statistical tools would seem indispensable for its proper other for the limited amount of resources. Persistent lack
modeling. To this end, classifiers in the BYPASS algorithm of reward means extinction to survive, classifiers must
(Muruzbal, 2001) differ from other CS alternatives in that get reward from time to time. Note that newly discovered,
they enjoy (support-minded) probabilistic predictions (thus more effective rules with even better scores may cut
extending the more common single-label predictions). dramatically the reward given previously to other rules
Support plays an outstanding role: A minimum support in certain niches. The fitness landscape is thus highly
level b is input by the analyst at the outset, and consider- dynamic, and lower scores pi(y) may get reward and
ation is restricted to rules with (estimated) support above survive provided they are the best so far at some niche.
b. The underlying predictive distributions, R, are easily An intrinsic measure of fitness for a classifier C R (such
constructed and coherently updated following a standard as the lifetime average score -log R(Y), where Y is condi-
Bayesian Multinomial-Dirichlet process, whereas their toll tioned by X C) could hardly play the same role.
on the system is minimal memory-wise. Actual predictions It is worth noting that BYPASS integrates three learn-
are built by first averaging the matched predictive distri- ing modes in its classifiers: Bayesian at the data-process-
butions and then picking the maximum a posteriori label of ing level, reinforcement at the survival (competition)
the result. Hence, by promoting mixtures of probability level, and genetic at the rule-discovery (exploration)
distributions, the BYPASS algorithm connects readily level. Standard genetic algorithms (GAs) are triggered by
with mainstream ensemble-learning methods. system failure and act always circumscribed to M. Be-
We can find sometimes perfect regularities, that is, cause the Bayesian updating guarantees that in the long
subsets C for which the conditional distribution of the run, predictive distributions R reflect the true conditional
response Y (given X C) equals 1 for some output class: probabilities P(Y=j | X C), scores become highly reli-
P(Y=j | X C)=1 for some j and 0 elsewhere. In the well- able to form the basis of learning engines such as reward
known multiplexer environment, for example, there ex- or crossover selection. The BYPASS algorithm is sketched
ists a set of such classifiers such that 100% performance in Table 1. Note that utility reflects accumulated reward
can be achieved. But in real situations, it will be difficult (Muruzbal, 2001). Because only matched rules get re-
to locate strictly neat C unless its support is quite small. ward, high support is a necessary (but not sufficient)
Moreover, putting too much emphasis on error-free condition to have high utility. Conversely, low-uncer-
behavior may increase the risk of overfitting; that is, we tainty regularities need to comply with the (induced)
may infer rules that do not apply (or generalize poorly) bound on support. The background generalization rate
over a test sample. When restricting the search to high- P# (controlling the number of #s in the random C built
support rules, probability distributions are well equipped along the run) is omitted for clarity, although some
to represent high-uncertainty patterns. When the largest tuning of P# with regard to threshold u is often required
P(Y=j | X C) is small, it may be especially important to in practice.
estimate P(Y=j | X C) for all j.

Table 1. The skeleton BYPASS algorithm

Input parameters (h: reward intensity) (u: minimum utility)


Initialize classifier population, say P
Iterate for a fixed number of cycles:
o Sample training item (x,y) from file
o Build match set M=M(x) including all rules with x C
o Build point prediction y*
o Check for success y* = y
If successful, reward the h 2 highest scores Ri(y) in M
If unsuccessful, add a new rule to P via a GA or otherwise
o Update predictive distributions R (in M)
o Update utility counters (in P)
o Check for low utility via threshold u and possibly delete some rules

489

TEAM LinG
Evolutionary Mining of Rule Ensembles

BYPASS has been tested on various tasks under uncertainty associated to rules. Also, further research
demanding b (and not very large h), and the results have should be conducted to clearly delineate the strengths
been satisfactory in general. Comparative smaller popu- of the various CS approaches against current alternative
lations are used, and low uncertainty but high support methods for rule ensemble formation and data mining.
rules are uncovered. Very high values for P# (as high as
0.975) have been tested successfully in some cases
(Muruzbal, 2001). In the juxtaposed (or concatenated) CONCLUSION
multiplexer environment, BYPASS is shown to maintain a
compact population of relatively high uncertainty rules Evolutionary rule mining is a successful, promising
that solves the problem by bringing about appropriate research area. Evolutionary algorithms constitute by
match sets (nearly) all the time. Recent work by Butz, now a very useful and wide class of stochastic optimiza-
Goldberg, and Tharakunnel (2003) shows that XCS also tion methods. The evolutionary CS approach is likely to
solves this problem, although working at a lower level of provide interesting insights and cross-fertilization of
support (generality). ideas with other data-mining methods. The BYPASS
To summarize, BYPASS does not relay on rule plural- algorithm discussed in this article has been shown to
ity for knowledge encoding because it uses compact tolerate the high support constraint well, leading to
probabilistic predictions (bounded by support). It re- pleasant and unexpected results in some problems. These
quires no intrinsic value for rules and no added tailor- results stress the latent predictive power of the en-
made heuristics. Besides, it tends to keep population size sembles formed by high uncertainty rules.
under control (with increased processing speed and
memory savings). The ensembles (match sets) derived
from evolution in BYPASS have shown good promise of REFERENCES
cooperation.
Booker, L. B. (2000). Do we really need to estimate rule
utilities in classifier systems? Lecture Notes in Artifi-
FUTURE TRENDS cial Intelligence, 1813, 125-142.
Quick interactive data-mining algorithms and protocols Breiman, L. (1996). Bagging predictors. Machine Learn-
are nice when human judgment is available. When not, ing, 24, 123-140.
computer-intensive, autonomous algorithms capable of
thoroughly squeezing the data are also nice for prelimi- Bull, L. (2002). On using constructivism in neural classifier
nary exploration and other purposes. In a sense, we systems. Lecture Notes in Computer Science, 2439, 558-
should tend to rely on the latter to mitigate the nearly 567.
ubiquitous data overflow problem. Representation Butz, M. V., Goldberg, D. E., & Tharakunnel, K. (2003).
schemes and learning engines are crucial to the success Analysis and improvement of fitness exploitation in
of these unmanned agents and need, of course, further XCS: Bounding models, tournament selection, and bi-
investigation. Ensemble methods have lots of appealing lateral accuracy. Evolutionary Computation, 11(3), 239-
features and will be subject to further analysis and 277.
testing. Evolutionary algorithms will continue to uprise
and succeed in yet some other application areas. Addi- Cohen, W. W., & Singer, Y. (1999). A simple, fast, and
tional commercial spin-offs will keep coming. Although effective rule learner. Proceedings of the 16th Na-
great progress has been made in identifying many key tional Conference on Artificial Intelligence,.
insights in the CS framework, some central points still Eiben, A. E., & Smith, J. E. (2003). Introduction to
need further discussion. Specifically, the idea of rules evolutionary computing. Springer.
that perform its prediction following some kind of more
elaborate computation is appealing, and indeed more Folino, G., Pizzuti, C., & Spezzano, G. (2003). En-
functional representations of classifiers (such as multi- semble techniques for parallel genetic programming
layer perceptrons) have been proposed in the CS litera- based classifiers. Lecture Notes in Computer Science,
ture (see, e.g., Bull, 2002). On the theoretical side, a 2610, 59-69.
formal framework for more rigorous analysis in high-
support learning is much needed. The task is not easy, Friedman, J. H., & Fisher, N. (1999). Bump hunting in high-
however, because the target is somewhat more vague, dimensional data. Statistics and Computing, 9(2), 1-20.
and individual as well as collective interests should be
brought to terms when evaluating the generality and

490

TEAM LinG
Evolutionary Mining of Rule Ensembles

Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Addi- Wilson, S. W. (2001). Mining oblique data with XCS.
tive logistic regression: A statistical view of boosting. Lecture Notes in Artificial Intelligence, 1996, 158-176. -
Annals of Statistics, 28(2), 337-407.
Greene, D. P., & Smith, S. F. (1994). Using coverage as a
model building constraint in learning classifier systems. KEY TERMS
Evolutionary Computation, 2(1), 67-91.
Hand, D. J. (1997). Construction and assessment of clas- Classification: The central problem in (supervised)
sification rules. Wiley. data mining. Given a training data set, classification
algorithms provide predictions for new data based on
Hand, D. J., Adams, N. M., & Kelly, M. G. (2001). Multiple predictive rules and other types of models.
classifier systems based on interpretable linear classifi-
ers. Lecture Notes in Computer Science, 2096, 136-147. Classifier System: A rich class of evolutionary com-
putation algorithms building on the idea of evolving a
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. population of predictive (or behavioral) rules under the
R. (1986). Induction: Processes of inference, learning enforcement of certain competition and cooperation pro-
and discovery. MIT Press. cesses. Note that classifier systems can also be under-
stood as systems capable of performing classification.
Koza, J. R., Keane, M. A., Streeter, M. J., Mydlowec, W., Not all CSs in the sense meant here qualify as classifier
Yu, J., & Lanza, G. (Eds.). (2003). Genetic programming systems in the broader sense, but a variety of CS algo-
IV: Routine human-competitive machine intelligence. rithms concerned with classification do.
Kluwer.
Ensemble-Based Methods: A general technique
Kuncheva, L. I., & Jain, L. C. (2000). Designing classifier that seeks to profit from the fact that multiple rule
fusion systems by genetic algorithms. IEEE Transactions generation followed by prediction averaging reduces
on Evolutionary Computation, 4(4), 327-336. test error.
Liu, Y., Yao, X., & Higuchi, T. (2000). Evolutionary en- Evolutionary Computation: The solution approach
sembles with negative correlation learning. IEEE Trans- guided by artificial evolution, which begins with random
actions on Evolutionary Computation, 4(4), 380-387. populations (of solution models), then iteratively ap-
Muruzbal, J. (2001). Combining statistical and reinforce- plies algorithms of various kinds to find the best or
ment learning in rule-based classification. Computational fittest models.
Statistics, 16(3), 341-359. Fitness Landscape: Optimization space due to the
Muselli, M., & Liberati, D. (2002). Binary rule generation characteristics of the fitness measure used to define the
via hamming clustering. IEEE Transactions on Knowl- evolutionary computation process.
edge and Data Engineering, 14, 1258-1268. Predictive Rules: Standard if-then rules with the
Schapire, R. E., & Singer, Y. (1998). Improved boosting consequent expressing some form of prediction about
algorithms using confidence-rated predictions. Machine the output variable.
Learning, 37(3), 297-336. Rule Mining: A computer-intensive task whereby
Westerdale, T. H. (2001). Local reinforcement and recom- data sets are extensively probed for useful predictive
bination in classifier systems. Evolutionary Computa- rules.
tion, 9(3), 259-281. Test Error: Learning systems should be evaluated
with regard to their true error rate, which in practice is
approximated by the error rate on test data, or test error.

491

TEAM LinG
492

Explanation-Oriented Data Mining


Yiyu Yao
University of Regina, Canada

Yan Zhao
University of Regina, Canada

INTRODUCTION oriented data mining is a recent result from such an


investigation (Yao, 2003; Yao, Zhao, & Maguire, 2003).
Data mining concerns theories, methodologies, and, in
particular, computer systems for knowledge extraction Common Goals of Scientific Research
or mining from large amounts of data (Han & Kamber, and Data Mining
2000). The extensive studies on data mining have led to
many theories, methodologies, efficient algorithms and Scientific research is affected by the perceptions and
tools for the discovery of different kinds of knowledge the purposes of science. Martella et al. summarized the
from different types of data. In spite of their differ- main purposes of science, namely, to describe and pre-
ences, they share the same goal, namely, to discover new dict, to improve or manipulate the world around us, and
and useful knowledge, in order to gain a better under- to explain our world (Martella, et al., 1999). The results
standing of nature. of the scientific research process provide a description
The objective of data mining is, in fact, the goal of of an event or a phenomenon. The knowledge obtained
scientists when carrying out scientific research, inde- from this research helps us to make predictions about
pendent of their various disciplines. Data mining, by what will happen in the future. Research findings are a
combining research methods and computer technology, useful tool for making an improvement in the subject
should be considered as a research support system. This matter. Research findings also can be used to determine
goal-oriented view enables us to re-examine data min- the best or the most effective ways of bringing about
ing in the wider context of scientific research. Such a desirable changes. Finally, scientists develop models
re-examination leads to new insights into data mining and theories to explain why a phenomenon occurs.
and knowledge discovery. Goals similar to those of scientific research have
The result, after an immediate comparison between been discussed by many researchers in data mining. For
scientific research and data mining, is that an explana- example, Fayyad, Piatetsky-Shapiro, Smyth and
tion construction and evaluation task is added to the Uthurusamy identified two high-level goals of data min-
existing data mining framework. In this chapter, we ing as prediction and description (Fayyad, et al., 1996).
elaborate upon the basic concerns and methods of ex- Prediction involves the use of some variables to predict
planation construction and evaluation. Explanation-ori- the values of some other variables, while description
ented association mining is employed as a concrete focuses on patterns that describe the data. Ling, Chen,
example to demonstrate the entire framework. Yang and Cheng studied the issue of manipulation and
action based on discovered knowledge (Ling et al.,
2002). Yao, Zhao, et al. introduced the notion of
BACKGROUND explanation-oriented data mining, which focuses on
constructing models for the explanation of data mining
Scientific research and data mining have much in com- results (2003).
mon in terms of their goals, tasks, processes and meth-
odologies. As a recently emerged area of multi-disci- Common Processes of Scientific
plinary study, data mining and knowledge discovery Research and Data Mining
research can benefit from the long established studies
of scientific research and investigation (Martella, Research is a highly complex and subtle human activity,
Nelson, & Marchand-Martella, 1999). By viewing data which is difficult to formally define. It seems impos-
mining in a wider context of scientific research, we can sible to give any formal instruction on how to do re-
obtain insights into the necessities and benefits of search. On the other hand, some lessons and general
explanation construction. The model of explanation- principles can be learnt from the experience of scien-

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Explanation-Oriented Data Mining

Table 1. The model of scientific research processes




Idea-generation phase: to identify a topic of interest.
Problem-definition phase: to precisely and clearly define and formulate vague and general
-
ideas generated in the previous phase.
Procedure-design/planning phase: to make a workable research plan by considering all issues
involved.
Observation/experimentation phase: to observe real world phenomenon, collect data, and carry
out experiments.
Data-analysis phase: to make sense out of the data collected.
Results-interpretation phase: to build rational models and theories that explain the results from
the data-analysis phase.
Communication phase: to present the research results to the research community.

Table 2. The model of data mining processes


Data pre-processing phase: to select and clean working data.
Data transformation phase: to change the working data into the required form.
Pattern discovery and evaluation phase: to apply algorithms to identify knowledge embedded
in data, and to evaluate the discovered knowledge.
Explanation construction and evaluation phase: to construct plausible explanations for
discovered knowledge, and to evaluate different explanations.
Pattern presentation: to present the extracted knowledge and explanations.

tists. There are some basic principles and techniques There are bi-directional benefits. The experiences and
that are commonly used in most types of scientific results from the studies of research methods can be
investigations. We adopt the model of the research applied to data mining problems; the data mining algo-
process from Garziano and Raulin (2000), and combine rithms can be used to support scientific research.
it with other models (Martella, et al., 1999). The basic
phases and their objectives are summarized in Table 1.
It is possible to combine several phases into one, or to MAIN THRUST
divide one phase into more detailed steps. The division
between phases is not clear-cut. The research process Explanations of data mining address several important
does not follow a rigid sequencing of the phases. Itera- questions. What needs to be explained? How to explain
tion of different phrases may be necessary (Graziano & the discovered knowledge? Moreover, is an explanation
Raulin, 2000). correct and complete? By answering these questions,
Many researchers have proposed and studied models one can better understand explanation-oriented data
of data mining processes (Fayyad, et al. 1996; Mannila, mining. The ideas and processes of explanation con-
1997; Yao, Zhao, et al., 2003; Zhong, Liu, & Ohsuga, struction and explanation evaluation are demonstrated
2001). A model that adds the explanation facility to the by explanation-oriented association mining.
commonly used models has been recently proposed by
Yao, Zhao, et al.; it is remarkably similar to the model of
scientific research. The basic phases and their objec- Figure 1. A framework of explanation-oriented data
tives are summarized in Table 2. Like the research mining
process, the data mining process is also an iterative
process and there is no clear-cut difference among the Pattern discovery & evaluation

different phases. In fact, Zhong, et al. argue that it should Data transformation Explanation construction
be a dynamically organized process (Zhong, et al., 2001). Data preprocessing
& evaluation

The entire framework is illustrated in Figure 1. Data Target Transformed Patterns


Pattern representation
There is a parallel correspondence between the data data

processes of scientific research and data mining. Their


main difference lies in the subjects that perform the
tasks. Research is carried out by scientists, and data
mining is done by computer systems. In particular, data
mining may be viewed as a study of domain-independent Explained
patterns
Knowledge

research methods with emphasis on data analysis. The


higher and more abstract level of comparisons and con- Background Selected
attributes
Explanation
profiles
nections between scientific research and data mining Attribute selection
can be further studied in levels that are more concrete. Profile construction

493

TEAM LinG
Explanation-Oriented Data Mining

Basic Issues methods include decision tree learning, rule-based learn-


ing, and decision graph learning. The learned results are
Explanation-oriented data mining explains and in- represented as either a tree, or a set of if-then rules.
terprets the knowledge discovered from data.
The constructed explanations provide some evi-
Knowledge can be discovered by unsupervised learn- dence about conditions (within the background
ing methods. Unsupervised learning studies how sys- knowledge) under which the discovered pattern is
tems can learn to represent, summarize, and organize the most likely to happen, or how the background
data in a way that reflects the internal structure (namely, knowledge is related to the pattern.
a pattern) of the overall collection. This process does
not explain the patterns, but describes them. The primary The role of explanation in data mining is linked to
unsupervised techniques include clustering mining, be- proper description, relation and causality. Comprehen-
lief (usually Bayesian) networks learning, and associa- sibility is the key factor. The accuracy of the con-
tion mining. The criteria for choosing which pattern to be structed explanations relies on the amount of training
explained are directly related to the pattern evaluation examples. Explanation-oriented data mining performs
step of data mining. poorly with insufficient data or poor presuppositions.
Different background knowledge may infer different
Explanation-oriented data mining requires back- explanations. There is no reason to believe that only
ground knowledge to infer features that can possi- one unique explanation exists. One can use statistical
bly explain a discovered pattern. measures and domain knowledge to evaluate different
explanations.
Understanding the theory of explanation requires
considerations from many branches of inquiry: physics, A Concrete Example: Explanation-
chemistry, meteorology, human culture, logic, psychol- oriented Association Mining
ogy, and, above all, the methodology of science. In data
mining, explanation can be made at a shallow, syntactic Explanations, also expressed as conditions, can pro-
level based on statistical information, or at a deep, se- vide additional semantics to a standard association. For
mantic level based on domain knowledge. The required example, by adding time, place, and/or customer pro-
information and knowledge for explanation may not nec- files as conditions, one can identify when, where, and/
essarily be inside the original dataset. Additional infor- or to whom an association occurs.
mation often is required for an explanation construction.
It is argued that the power of explanation involves the Explanation Construction
power of insight and anticipation. One collects certain
features based on the underlying hypothesis that these The approach of explanation-oriented data mining com-
may provide explanations of the discovered pattern. That bines unsupervised and supervised learning methods,
something is unexplainable may simply be an expression namely, forming a concept first, and then explaining the
of the inability to discover an explanation of a desired concept. It consists of two main steps and uses two data
sort. The process of selecting the relevant and explana- tables. One table is used to learn a pattern. The other
tory features may be subjective, and trial-and-error. In table, an explanation profile constructed with respect
general, the better the background knowledge is, the to an interesting pattern, is used to search for explana-
more accurate the inferred explanations are likely to be. tions of the pattern. In the first step, an unsupervised
learning algorithm, like the Apriori algorithm (Agrawal,
Explanation-oriented data mining utilizes induc- Imielinski, & Swami, 1993), can be applied to discover
tion, namely, drawing an inference from a set of frequent associations. To discover other types of
acquired training instances, and justifying or pre- associations, different algorithms can be applied. In
dicting the instances one might observe in the the second step, an explanation profile is constructed.
future. Objects in the profile are labelled as positive instances if
they satisfy the desired pattern, and negative instances,
Supervised learning methods can be applied to this if they do not. Conditions that explain the pattern are
explanation construction. The goal of supervised learn- searched by using a supervised learning algorithm. A
ing is to find a model that will correctly associate the classical supervised learning algorithm such as ID3
input patterns with the classes. In real world applica- (Quinlan, 1983), C4.5 (Quinlan, 1993), or PRISM
tions, supervised learning models are extremely useful (Cendrowska, 1987), may be used to construct explana-
analytic techniques. The widely used supervised learning tions.

494

TEAM LinG
Explanation-Oriented Data Mining

Table 3. The Apriori-ID3 algorithm


Input: A transaction table, and a related explanation profile table. -
Output: Associations and explained associations.
1. Use the Apriori algorithm to generate a set of frequent associations in the transaction table. For each
association in the set, support ( ) minsup , and confidence( ) minconf .
2. If the association is considered interesting (with respect to the user feedback or interestingness
measures), then
a. Introduce a binary attribute named Decision. Given a transaction, its value on Decision is + if
it satisfies in the original transaction table, otherwise its value is -.
b. Construct an information table by using the attribute Decision and explanation profiles. The new
table is called an explanation table.
c. By treating Decision as the target class, one can apply the ID3 algorithm to derive classification
rules of the form ( Decision = "+ ") . The condition is a formula discovered in the
explanation table, which states that under the association occurs.
d. Evaluate the constructed explanation(s).

The Apriori-ID3 algorithm, which can be regarded as The absolute difference represents the disparity be-
an example of explanation-oriented association mining tween the pattern and the pattern under the condition.
method, is described in Table 3. For a positive value, one may say that the condition
supports ; for a negative value, one may say that the
Explanation Evaluation
condition rejects . The relative difference is the ratio
Once explanations are generated, it is necessary to of absolute difference to the value of the unconditional
evaluate them. For explanation-oriented association pattern. The ratio of change compares the actual change
mining, we want to compare a conditional association and the maximum potential change.
(explained association) with its unconditional counter- Generality is the measure to quantify the size of a
part, in addition to comparing different conditions. condition with respect to the whole data, defined by
Let T be a transaction table, and E be an explanation ||
profile table associated with T. Suppose that for a generality ( ) = . When the generality of conditions
|U |
desired pattern generated by an unsupervised learning
is essential, a compound measure should be applied. For
algorithm from T, there is a set K of conditions (expla-
example, one may be interested in discovering an accu-
nations) discovered by a supervised learning algorithm
rate explanation with a high ratio of change and a high
from E, and K is one explanation. Two points are
generality. However, it often happens that an explana-
noted. First, the set K of explanations can be different
tion has a high generality but a low RC value, while
according to various explanation profile tables, or vari-
another explanation has a low generality but a high RC
ous supervised learning algorithms. Second, not all
value. A trade-off between these two explanations does
explanations in K are equally interesting. Different con-
not necessarily exist.
ditions may have different degrees of interestingness.
A good explanation system must be able to rank the
Suppose is a quantitative measure used to evaluate constructed explanations and be able to reject the bad
plausible explanations, which can be the support mea- explanations. It should be realized that evaluation is a
sure for an undirected association, the confidence or difficult process because so many different kinds of knowl-
coverage measure for a one-way association, or the edge can come into play. In many cases, one must rely on
similarity measure for a two-way association (Yao & domain experts to reject uninteresting explanations.
Zhong, 1999). A condition K provides an explana-
tion of a discovered pattern if ( | ) > ( ) . One can
further evaluate explanations quantitatively based on FUTURE TRENDS
several measures, such as absolute difference (AD),
relative difference (RD) and ratio of change (RC): Considerable research remains to be done for explana-
tion construction and evaluation.
In this chapter, rule-based explanation is con-
AD( | ) = ( | ) ( ),
structed by inductive supervised learning algorithms.
( | ) ( ) Considering the structure of explanation, case-based
RD( | ) = ,
( ) explanations also need to be addressed. Based on the
( | ) ( ) case-based explanation, a pattern is explained if an
RC ( | ) = .
1 ( ) actual prior case is presented to provide compelling

495

TEAM LinG
Explanation-Oriented Data Mining

support. One of the perceived benefits of case-based Brodie, M. & Dejong, G. (2001). Iterated phantom
explanation is that the rule generation effort is saved. induction: A knowledge-based approach to learning con-
Instead, similarity functions need to be studied in order trol. Machine Learning, 45(1), 45-76.
to evaluate the distance between the description of the
new pattern and an existing case, and retrieve the most Cendrowska, J. (1987). PRISM: An algorithm for induc-
similar case as an explanation. ing modular rules. International Journal of Man-Ma-
The constructed explanations of the discovered pat- chine Studies, 27, 349-370.
tern provide conclusive evidence for the new instances. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. &
In other words, the new instances can be explained and Uthurusamy, R. (Eds.). (1996). Advances in Knowl-
implied by the explanations. This is normally true when edge Discovery and Data Mining. AAAI/MIT Press.
the explanations are sound and complete. However,
sometimes, the constructed explanations cannot guar- Graziano, A.M. & Raulin, M.L. (2000). Research meth-
antee that a certain instance is a perfect fit. Even worse, ods: A process of inquiry (4th ed.). Boston: Allyn &
a new data set, as a whole, may show a change or a Bacon.
confliction with the learnt explanations. This is because Han, J. & Kamber, M. (2000). Data mining: Concept
the explanations may be context-dependent on certain and techniques. Morgan Kaufmann Publisher.
spatial and/or temporal intervals. To consolidate the
explanations we have constructed, we cannot simply Ling, C.X., Chen, T., Yang, Q. & Cheng, J. (2002).
logically and, or, or ignore the new explanation. Mining optimal actions for profitable CRM. Proceed-
Instead, a spatial-temporal reasoning model needs to be ings of International Conference on Data Mining (pp.
introduced to show the trend and evolution of the pattern 767-770).
to be explained.
The explanations we have introduced so far are not Mannila, H. (1997). Methods and problems in data mining.
necessarily the causal interpretation of the discovered Proceedings of International Conference on Database
pattern, i.e. the relationships expressed in the form of Theory (pp. 41-55).
deterministic and functional equations. They can be Martella, R.C., Nelson, R. & Marchand-Martella, N.E.
inductive generalizations, descriptions, or deductive (1999). Research methods: Learning to become a
implications. Explanation as causality is the strongest critical research consumer. Boston: Allyn & Bacon.
explanation and coherence.
We might think of Bayesian networks as an inference Mitchell, T. (1999). Machine learning and data mining.
that unveils the internal relationship between attributes. Communications of the ACM, 42(11), 30-36.
Searching for an optimal model is difficult and NP-hard. Quinlan, J.R. (1983). Learning efficient classification
Arrow direction is not guaranteed. Expert knowledge procedures. In J.S. Michalski, J.G. Carbonell & T.M.
could be integrated in the a priori search function, such Mirchell (Eds.), Machine learning: An artificial intel-
as the presence of links and orders. ligence approach (pp. 463-482). Palo Alto, CA: Mor-
gan Kaufmann.

CONCLUSION Quinlan, J.R. (1993). C4.5: Programs for Machine


learning. Morgan Kaufmann Publisher.
Explanation-oriented data mining offers a new perspec- Yao, Y.Y. (2003). A framework for web-based research
tive. It closely relates scientific research and data min- support systems. Proceedings of the Twenty-Seventh
ing, which have bi-directional benefits. The ideas of Annual International Computer Software and Appli-
explanation-oriented mining can have a significant im- cations Conference (pp. 601-606).
pact on the understanding of data mining and effective
applications of data mining results. Yao, Y.Y., Zhao, Y. & Maguire, R.B. (2003). Explana-
tion-oriented association mining using rough set theory.
Proceedings of Rough Sets, Fuzzy Sets and Granular
REFERENCES Computing (pp. 165-172).
Yao, Y.Y. & Zhong, N. (1999). An analysis of quantita-
Agrawal, R., Imielinski, T. & Swami, A. (1993). Mining tive measures associated with rules. Proceedings Pa-
association rules between sets of items in large data- cific-Asia Conference on Knowledge Discovery and
bases. Proceedings of ACM Special Interest Group on Data Mining (pp. 479-488).
Management of Data 1993 (pp. 207-216).

496

TEAM LinG
Explanation-Oriented Data Mining

Zhong, N., Liu, C. & Ohsuga, S. (2001). Dynamically Method of Explanation-Oriented Data Mining: The
organizing KDD processes. International Journal of method consists of two main steps and uses two data -
Pattern Recognition and Artificial Intelligence, 15, 451- tables. One table is used to learn a pattern. The other table,
473. an explanation table, is used to explain one desired pat-
tern. In the first step, an unsupervised learning algorithm
is used to discover a pattern of interest. In the second
step, by treating objects satisfying the pattern as positive
KEY TERMS instances, and treating the rest as negative instances, one
can search for conditions that explain the pattern by a
Absolute Difference: A measure that represents the supervised learning algorithm.
difference between an association and a conditional
association based on a given measure. The condition Ratio of Change: A ratio of actual change (absolute
provides a plausible explanation. difference) to the maximum potential change.

Explanation-Oriented Data Mining: A general Relative Difference: A measure that represents the
framework includes data pre-processing, data transfor- difference between an association and a conditional
mation, pattern discovery and evaluation, pattern expla- association relative to the association based on a given
nation and explanation evaluation, and pattern presenta- measure.
tion. This framework is consistent with the general Scientific Research Processes: A general model
model of scientific research processes. consists of the following phases: idea generation, prob-
Generality: A measure that quantifies the coverage lem definition, procedure design/planning, observation/
of an explanation in the whole data set. experimentation, data analysis, results interpretation,
and communication. It is possible to combine several
Goals of Scientific Research: The purposes of phases, or to divide one phase into more detailed steps.
science are to describe and predict, to improve or to The division between phases is not clear-cut. Iteration
manipulate the world around us, and to explain our of different phrases may be necessary.
world. One goal of scientific research is to discover
new and useful knowledge for the purpose of science.
As a specific research field, data mining shares this
common goal, and may be considered as a research
support system.

497

TEAM LinG
498

Factor Analysis in Data Mining


Zu-Hsu Lee
Montclair State University, USA

Richard L. Peterson
Montclair State University, USA

Chen-Fu Chien
National Tsing Hua University, Taiwan

Ruben Xing
Montclair State University, USA

INTRODUCTION BACKGROUND

The rapid growth and advances of information technol- The concept of FA was created in 1904 by Charles
ogy enable data to be accumulated faster and in much Spearman, a British psychologist. The term factor analy-
larger quantities (i.e., data warehousing). Faced with vast sis was first introduced by Thurston in 1931. Exploratory
new information resources, scientists, engineers, and FA and confirmatory FA are two main types of modern FA
business people need efficient analytical techniques to techniques. The goals of FA are (1) to reduce the number
extract useful information and effectively uncover new, of variables and (2) to classify variables through detec-
valuable knowledge patterns. tion of the structure of the relationships between vari-
Data preparation is the beginning activity of exploring ables. FA achieves the goals by creating a fewer number
for potentially useful information. However, there may be of new dimensions (i.e., factors) with potentially useful
redundant dimensions (i.e., variables) in the data, even knowledge. The applications of FA techniques can be
after the data are well prepared. In this case, the perfor- found in various disciplines in science, engineering, and
mance of data-mining methods will be affected negatively social sciences, such as chemistry, sociology, econom-
by this redundancy. Factor Analysis (FA) is known to be ics, and psychology. To sum up, FA can be considered as
a commonly used method, among others, to reduce a broadly used statistical approach that explores the
data dimensions to a small number of substantial char- interrelationships among variables and determines a
acteristics. smaller set of common underlying factors. Furthermore,
FA is a statistical technique used to find an underlying the information contained in the original variables can be
structure in a set of measured variables. FA proceeds with explained by these factors with a minimum loss of infor-
finding new independent variables (factors) that describe mation.
the patterns of relationships among original dependent
variables. With FA, a data miner can determine whether or
not some variables should be grouped as a distinguishing MAIN THRUST
factor, based on how these variables are related. Thus, the
number of factors will be smaller than the number of In order to represent the important structure of the data
original variables in the data, enhancing the performance efficiently (i.e., in a reduced number of dimensions), there
of the data-mining task. In addition, the factors may be are a number of techniques that can be used for data
able to reveal underlying attributes that cannot be ob- mining. These generally are referred to as multi-dimen-
served or interpreted explicitly so that, in effect, a recon- sional scaling methods. The most basic one is Principle
structed version of the data is created and used to make Component Analysis (PCA). Through transforming the
hypothesized conclusions. In general, FA is used original variables in the data into the same number of new
with many data-mining methods (e.g., neural net- ones, which are mutually orthogonal (uncorrelated), PCA
work, clustering). sequentially extracts most of the variance (variability) of

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Factor Analysis in Data Mining

the data. The hope is that most of the information in the ment (Rummel, 2002). The analysis procedures can be
data might be contained in the first few components. FA performed through a geometrical presentation by plotting .
also extracts a reduced number of new factors from the data points in a multi-dimensional coordinate axis (explor-
original data set, although it has different aims from PCA. atory FA) or through mathematical techniques to test the
FA usually starts with a survey or a number of ob- specified model and suspected relationship among vari-
served traits. Before FA is applied, the assumptions of ables (confirmatory FA).
correlations in the data (normality, linearity, homogeneity In order to illustrate how FA proceeds step by step,
of sample, and homoscedasticity) need to be satisfied. In here is an example from a case study on the key variables
addition, the factors to extract should all be orthogonal (or characteristics) for induction machines, conducted by
to one another. After defining the measured variables to Mat and Caldern (2000). The sample of a group of
represent the data, FA considers these variables as a motors was selected from a catalog published by Siemens
linear combination of latent factors that cannot be mea- (1988). It consists of 134 cases with no missing values and
sured explicitly. The objective of FA is to identify these 13 variables that are power (P), speed (W), efficiency (E),
unobserved factors, reflect what the variables share in power factor (PF), current (I), locked-rotor current (ILK),
common, and provide further information about them. torque (M), locked-rotor torque (MLK), breakdown torque
Mathematically, let X represent a column vector that (MBD), inertia (J), weight (WG), slip (S), and slope of M-
contains p measured variables and has a mean vector , s curve (M_S). FA can be implemented in the following
F stand for a column vector which contains q latent procedures using this sample data.
factors, and L be a p q matrix that transforms F to X.
The elements of L (i.e., factor loadings) give the weights Step 1: Ensuring the Adequacy of the
that each factor contributes to each measured variable. In Date
addition, let be a column vector containing p uncorrelated
random errors. Note that q is smaller than p. The following The correlation matrix containing correlations between
equation simply illustrates the general model of FA the variables is first examined to identify the variables that
(Johnson & Wichern, 1998): are statistically significant. In the case study, this matrix
from the sample data showed that the correlations be-
X - = LF + . tween the variables are satisfactory, and thus, all vari-
ables are kept for the next step. Meanwhile, preliminary
FA and PCA yield similar results in many cases, but, tests, such as the Bartlett test, the Kaiser-Meyer-Olkin
in practice, PCA often is preferred for data reduction, (KMO) test, and the Measures of Sampling Adequacy
while FA is preferred to detect structure of the data. (MSA) test, are used to evaluate the overall significance
In any experiment, any one scenario may be delineated of the correlation. Table 1 shows that the values of MSA
by a large number of factors. Identifying important factors (rounded to two decimal places) are higher than 0.5 for all
and putting them into more general categories generates variables but variable W. However, the MSA value of
an environment or structure that is more advantageous to variable W is close to 0.5 (MSA should be higher than 0.5,
data analysis, reducing the large number of variables to according to Hair, et al. (1998)).
smaller, more manageable, interpretable factors (Kachigan,
1986). Technically, FA allows the determination of the Step 2: Finding the Number of Factors
interdependency and pattern delineation of data. It un-
tangles the linear relationships into their separate pat- There are many approaches available for this purpose
terns as each pattern will appear as a factor delineating a (e.g., common factor analysis, parallel analysis). The case
distinct cluster of interrelated data (Rummel, 2002, Sec- study first employed the plot of eigenvalues vs. the factor
tion 2.1). In other words, FA attempts to take a group of number (the number of factors may be 1 to 13) and found
interdependent variables and create separate descriptive that choosing three factors accounts for 91.3% of the total
categories and, after this transformation, thereby de- variance. Then, it suggested that the solution be checked
crease the number of variables that are used in an experi-

Table 1. Measures of the adequacy of FA to the sample: MSA

E I ILK J M M_S MBD MLK P PF S


0.75 0.73 0.86 0.78 0.76 0.82 0.79 0.74 0.74 0.76 0.85
W WG
0.39 0.87

499

TEAM LinG
Factor Analysis in Data Mining

with the attempt to extract two or three more factors. Based methods may be incorporated within the concept of FA.
on the comparison between results from different selected Proper selection of methods should depend on the na-
methods, the study ended up with five factors. ture of data and the problem to which FA is applied.
Garson (2003) points out the abilities of FA: determin-
Step 3: Determining Transformation (or ing the number of different factors needed to explain the
Rotation) Matrix pattern of relationships among the variables, describing
the nature of those factors, knowing how well the hy-
A commonly used method is orthogonal rotation. Table 2 pothesized factors explain the observed data, and find-
can represent the transformation matrix, if each cell shows ing the amount of purely random or unique variance that
the factor loading of each variable on each factor. For the each observed variable includes. Because of these abili-
sample size, only loadings with an absolute value bigger ties, FA has been used for various data analysis prob-
than 0.5 were accepted (Hair et al., 1998), and they are lems and may be used in a variety of applications in data
marked X in Table 2 (other loadings lower than 0.5 are not mining from science-oriented to business applications.
listed). From the table, J, I, M, S, M, WG, and P can be One of the most important uses is to provide a sum-
grouped to the first factor; PF, ILK, E, and S belong to the mary of the data. The summary facilitates learning the
second factor; W and MLK can be considered another two data structure via an economic description. For example,
single factors, respectively. Note that MBD can go to Pan et al. (1997) employed Artificial Neural Network
factor 2 or be an independent factor (factor 5) of the other (ANN) techniques combined with FA for spectroscopic
four. The case study settled the need to retain it in the fifth quantization of amino acids. Through FA, the number of
factor, based on the results obtained from other samples. input nodes for neural networks was compressed effec-
In this step, an oblique rotation is another method to tively, which greatly sped up the calculations of neural
determine the transformation matrix. Since it is a non- networks. Tan and Wisner (2001) used FA to reduce a set
orthogonal rotation, the factors are not required to be of factors affecting operations management constructs
uncorrelated to each other. This gives better flexibility and their relationships. Kiousis (2004) applied an explor-
than an orthogonal rotation. Using the sample data of the atory FA to he New York Times news coverage of eight
case study, a new transformation matrix may be obtained major political issues during the 2000 presidential elec-
from an oblique rotation, which provides new loadings to tion. FA identified two indices that measured the con-
group variables into factors. struct of the key independent variable in agenda-setting
research, which could then be used in future investiga-
tions.
Step 4: Interpreting the Factors. Screening variables is another important function of
FA. A co-linearity problem will appear, if the factors of
In the case study, the factor consisting of J, I, M, S, M, WG, the variables in the data are very similar to each other. In
and P was named size, because the higher value of the order to avoid this problem, a researcher can group
weight (WG) and power (P) reflects the larger size of the closely related variables into one category and then
machine. The factor containing PF, ILK, E, and S is ex- extract the one that would have the greatest use in
plained as global efficiency. determining a solution (Kachigan, 1986). For example,
This example provides a general demonstration of the Borovec (1996) proposed a six-step sequential extraction
application of FA techniques to data analysis. Various

Table 2. Grouping of variables using the orthogonal rotation matrix

Factor 1 Factor 2 Factor 3 Factor 4 Factor 5


E X (0.93540)
I X (0.93314)
ILK X (0.84356)
J X (0.97035)
M X (0.94582)
M_S X (0.98255)
MBD X (0.62605) X (0.67072)
MLK X (0.91674)
P X (0.92265)
PF X (0.81847)
S X (-0.92848)
W X (0.95222)
WG X (0.94153)

500

TEAM LinG
Factor Analysis in Data Mining

procedure and applied FA, which found three dominant FUTURE TRENDS
trace elements from 12 surface stream sediments. These .
three factors accounted for 78% of the total variance. In At the Factor Analysis at 100 Conference held in May
another example, Chen, et al. (2001) performed an explor- 2004, the future of FA was discussed. Millsap and Meredith
atory FA on 48 financial ratios from 63 firms. Four critical (2004) suggested further research in the area of ordinal
financial ratios were concluded, which explained 80% of measures in multiple populations and technical issues of
the variation in productivity. small samples. These conditions can generate bias in
FA can be used as a scaling method, as well. Oftentimes, current FA methods, causing results to be suspect. They
after the data are collected, the development of scales is also suggested further study in the impact of violations
needed among individuals, groups, or nations, when they of factorial invariance and explanations for these viola-
are intended to be compared and rated. As the character- tions. Wall and Amemiya (2004) feel that there are chal-
istics are grouped to independent factors, FA assigns lenges in the area of non-linear FA. Although models exist
weights to each characteristic according to the observed for non-linear analysis, there are aspects of this area that
relationships among the characteristics. For instance, are not fully understood.
Tafeit, et al. (1999, 2000) provided a comparison between However, the flexibility of FA and its ability to reduce
FA and ANN for low-dimensional classification of high- the complexity of the data still make FA one of commonly
dimensional body fat topography data of healthy and used techniques. Incorporated with advances in informa-
diabetic subjects with a high-dimensional and partly tion technologies, the future of FA shows great promise
highly intercorrelated set of data. They found that the for the applications in the area of data mining.
analysis of the extracted weights yielded useful informa-
tion about the structure of the data. As the weights for
each characteristic are obtained by FA, the score (by CONCLUSION
summing characteristics times these weights) can be used
to represent the scale of the factor to facilitate the rating FA is a useful multivariate statistical technique that has
of factors. been applied in a wide range of disciplines. It enables
In addition, FAs ability to divide closely related researchers to effectively extract information from huge
variables into different groups is also useful for statistical databases and attempts to organize and minimize the
hypothesis testing, as Rummel (2002) stated, when hy- amount of variables used in collecting or measuring data.
potheses are about the dimensions that can be a group of However, the applications of FA in business sectors (e.g.,
highly intercorrelated characteristics, such as personal- e-business) is relatively new.
ity, attitude, social behavior, and voting. For instance, in Currently, the increasing volumes of data in databases
a study of resource investments in tourism business, and data warehouses are the key issue governing their
Morais, et al. (2003) use confirmatory FA to find that pre- future development. Allowing the effective mining of
established resource investment scales could not fit their potentially useful information from huge databases with
model well. They reexamined each subscale with explor- many dimensions, FA definitely is helpful in sorting out
atory FA to identify factors that should not have been the significant parts of information for decision makers, if
included in the original model. it is used appropriately.
There have been controversies about uses of FA.
Hand, et al. (2001) pointed out that one important reason
is that FAs solutions are not invariant to various trans-
formations. More precisely, the extracted factors are
REFERENCES
basically non-unique, unless extra constraints are im-
posed (Hand et al., 2001, p. 84). The same information Borovec, Z. (1996). Evaluation of the concentrations of
may reach different interpretations with personal judg- trace elements in stream sediments by factor and cluster
ment. Nevertheless, no method is perfect. In some situa- analysis and the sequential extraction procedure. The
tions, other statistical methods, such as regression analy- Science of the Total Environment, 117, 237-250.
sis and cluster analysis, may be more appropriate than FA. Chen, L., Liaw, S., & Chen, Y. (2001). Using financial
However, FA is a well-known and useful tool among data- factors to investigate productivity: An empirical study in
mining techniques. Taiwan. Industrial Management & Data Systems, 101(7),
378-384.

501

TEAM LinG
Factor Analysis in Data Mining

Garson, D. (2003). Factor analysis. Retrieved from http:/ Tan, K., & Wisner, J. (2003). A study of operations
/www2.chass.ncsu.edu/garson/pa765/factor.htm management constructs and their relationships. Interna-
tional Journal of Operations & Production Manage-
Hair, J., Anderson, R., Tatham R., & Black, W. (1998). ment, 23(11), 1300-1325.
Multivariate data analysis with readings. Englewood
Cliffs, NJ: Prentice-Hall, Inc. Wall, M., & Amemiya, Y. (2004). A review of nonlinear
factor analysis methods and applications. Proceedings of
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of the Factor Analysis at 100: Historical Developments
data mining. Cambridge, MA. MIT Press. and Future Directions Conference, Chapel Hill, North
Johnson, R., & Wichern, D. (1998). Applied multivariate Carolina.
statistical analysis. Englewood Cliffs, NJ: Prentice Hall, Williams, R.H., Zimmerman, D.W., Zumbo, B.D., & Ross,
Inc. D. (2003). Charles Spearman: British behavioral scientist.
Kachigan, S. (1986). Statistical analysis: An interdisci- Human Nature Review. Retrieved from http://human-
plinary introduction to univariate and multivariate nature.com/nibbs/03/spearman.html
methods. New York: Radius Press.
Kiousis, S. (2004). Explicating media salience: A factor KEY TERMS
analysis of New York Times issue coverage during the
2000 U.S. presidential election. Journal of Communica- Cluster Analysis: A multivariate statistical technique
tion, 54(1), 71-87. that assesses the similarities between individuals of a
population. Clusters are groups or categories formed so
Mate, C., & Calderon, R. (2000). Exploring the character- members within a cluster are less different than members
istics of rotating electric machines with factor analysis. from different clusters.
Journal of Applied Statistics, 27(8), 991-1006.
Eigenvalue: The quantity representing the variance of
Millsap, R., & Meredith, W. (2004). Factor invariance: a set of variables included in a factor.
Historical trends and new developments. Proceedings of
the Factor Analysis at 100: Historical Developments Factor Score: A measure of a factors relative weight
and Future Directions Conference, Chapel Hill, North to others, which is obtained using linear combinations of
Carolina. variables.
Morais, D., Backman, S., & Dorsch, M. (2003). Toward the Homogeneity: The degree of similarity or uniformity
operationalization of resource investments made between among individuals of a population.
customers and providers of a tourism service. Journal of
Travel Research, 41, 362-374. Homoscedasticity: A statistical assumption for linear
regression models. It requires that the variations around
Pan, Z. et al. (1997). Spectroscopic quantization of amino the regression line be constant for all values of input
acids by using artificial neural networks combined with variables.
factor analysis. Spectrochimica Acta Part A 53, 1629-
1632. Matrix: An arrangement of rows and columns to
display quantities. A p q matrix contains p q quan-
Rummel, R.J. (2002). Understanding factor analysis. Re- tities arranged in p rows and q columns (i.e., each row has
trieved from http://www.hawaii.edu/powerkills/ q quantities,and each column has p quantities).
UFA.HTM
Normality: A statistical assumption for linear regres-
Tafeit, E., Moller, R., Sudi, K., & Reibnegger, G. (1999). The sion models. It requires that the errors around the regres-
determination of three subcutaneous adipose tissue com- sion line be normally distributed for each value of input
partments in non-insulin-dependent diabetes mellitus variable.
women with artificial neural networks and factor analysis.
Artificial Intelligence in Medicine, 17, 181-193. Variance: A statistical measure of dispersion around
the mean within the data. Factor analysis divides variance
Tafeit, E., Moller, R., Sudi, K., & Reibnegger, G. (2000). of a variable into three elementscommon, specific, and
Artificial neural networks compared to factor analysis for error.
low-dimensional classification of high-dimensional body
fat topography data of healthy and diabetic subjects. Vector: A quantity having both direction and magni-
Computers and Biomedical Research, 33, 365-374. tude. This quantity can be represented by an array of
components in a column (column vector) or in a row (row
vector).
502

TEAM LinG
503

Financial Ratio Selection for Distress .


Classification
Roberto Kawakami Harrop Galvo
Instituto Tecnolgico de Aeronutica, Brazil

Victor M. Becerra
University of Reading, UK

Magda Abou-Seada
Middlesex University, UK

INTRODUCTION w = S1(m 1 m 2) (3)

Prediction of corporate financial distress is a subject The optimal cut-off value for classification z c can be
that has attracted the interest of many researchers in calculated as
finance. The development of prediction models for
financial distress started with the seminal work by Altman 1 2)TS1(
z c = 0.5( 1 + 2) (4)
(1968), who used discriminant analysis. Such a tech-
nique is aimed at classifying a firm as bankrupt or A given vector x should be assigned to Population 1
nonbankrupt on the basis of the joint information con- if Z(x) > zc, and to Population 2 otherwise.
veyed by several financial ratios. The generalization (or prediction) performance of
The assessment of financial distress is usually based the Z-score model, that is, its ability to classify objects
on ratios of financial quantities, rather than absolute not used in the modeling phase, can be assessed by using
values, because the use of ratios deflates statistics by an independent validation set or cross-validation meth-
size, thus allowing a uniform treatment of different ods (Duda et al., 2001). The simplest cross-validation
firms. Moreover, such a procedure may be useful to technique, termed leave-one-out, consists of separat-
reflect a synergy or antagonism between the constitu- ing one of the m modelling objects and obtaining a Z-
ents of the ratio. score model with the remaining m 1 objects. This model
is used to classify the object that was left out. The
procedure is repeated for each object in the modeling
BACKGROUND set in order to obtain a total number of cross-validation
errors.
The classification of companies on the basis of finan- Resampling techniques (Good, 1999) such as the
cial distress can be performed by using linear discrimi- Bootstrap method (Davison & Hinkley, 1997) can also
nant models (also called Z-score models) of the follow- be used to assess the sensitivity of the analysis to the
ing form (Duda, Hart, & Stork, 2001): choice of the training objects.

1 2)TS1x
Z(x) = ( (1) The Financial Ratio Selection Problem

where x = [x1 x2 ... xn]T is a vector of n financial ratios, 1n The selection of appropriate ratios from the available
and 2n are the sample mean vectors of each group financial information is an important and nontrivial
(continuing and failed companies), and Snn is the common stage in building distress classification models. The
sample covariance matrix. Equation 1 can also be written best choice of ratios will normally depend on the types
as of companies under analysis and also on the economic
context. Although the analysts market insight plays an
Z = w1x 1 + w2x2 + ... + wnxn = wTx (2) important role at this point, the use of data-driven
selection techniques can be of value, because the rel-
where w = [w1 w2 ... wn]T is a vector of coefficients obtained as evance of certain ratios may only become apparent when
their joint contribution is considered in a multivariate

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Financial Ratio Selection for Distress Classification

context. Moreover, some combinations of ratios may not Algorithm A (Preselection Followed by
satisfy the statistical assumptions required in the model- Exhaustive Search)
ing process, such as normal distribution and identical
covariances in the groups being classified, in the case of If N variables are initially available for selection, they can
standard linear discriminant analysis (Duda et al., 2001). be combined in 2N 1 different subsets (each subset with
Finally, collinearity between ratios may cause the model a number of variables between 1 and N). Thus, the com-
to have poor prediction ability (Naes & Mevik, 2001). putational workload can be substantially reduced if
Techniques proposed for ratio selection include nor- some variables are preliminarily excluded.
mality tests (Taffler, 1982), and clustering followed by In this algorithm, such a preselection is carried out
stepwise discriminant analysis (Alici, 1996). according to a multivariate relevance index W(x) that
Most of the works cited in the preceding paragraph measures the contribution of each variable x to the
begin with a set of ratios chosen from either popularity classification output when a Z-score model is employed.
in the literature, theoretical arguments, or suggestions This index is obtained by using all variables to build a
by financial analysts. However, this article shows that it model as in Equation 1 and by multiplying the absolute
is possible to select ratios on the basis of data taken value of each model weight by the sample standard
directly from the financial statements. deviation (including both groups) of the respective vari-
For this purpose, we compare two selection methods able.
proposed by Galvo, Becerra, and Abou-Seada (2004). An appropriate threshold value for the relevance
A case study involving 60 failed and continuing British index W(x) can be determined by augmenting the model-
firms in the period from 1997 to 2000 is employed for ing data with artificial uninformative variables (noise)
illustration. and then obtaining a Z-score model. Those variables
whose relevance is not considerably larger than the
average relevance of the artificial variables are then
MAIN THRUST eliminated (Centner, Massart, Noord, Jong, Vandeginste,
& Sterna, 1996).
It is not always advantageous to include all available After the preselection phase, all combinations of the
variables in the building of a classification model (Duda remaining variables are tested. Subsets with the same
et al., 2001). Such an issue has been studied in depth in number of variables are compared on the basis of the
the context of spectrometry (Andrade, Gomez- number of classification errors on the modelling set for
Carracedo, Fernandez, Elbergali, Kubista, & Prada, a Z-score model and the condition number of the matrix
2003), in which the variables are related to the wave- of modeling data. The condition number (the ratio
lengths monitored by an optical instrumentation frame- between the largest and smallest singular value of the
work. This concept also applies to the Z-score modeling matrix) should be small to avoid collinearity problems
process described in the preceding section. In fact, (Navarro-Villoslada, Perez-Arribas, Leon-Gonzalez, &
numerical ill-conditioning tends to increase with (m Polodiez, 1995). After the best subset has been deter-
n)1, where m is the size of the modeling sample, and n mined for each given number of variables, a cross-
is the number of variables (Tabachnick & Fidell, 2001). validation procedure is employed to find the optimum
If n > m, matrix S becomes singular, thus preventing the use number of variables.
of Equation 1. In this sense, it may be more appropriate to
select a subset of the available variables for inclusion in Algorithm B (Genetic Selection)
the classification model.
The selection procedures to be compared in this The drawback of the preselection procedure employed
article search for a compromise between maximizing in Algorithm A is that some variables that display a small
the amount of discriminating information available for relevance index when all variables are considered to-
the model and minimizing collinearity between the clas- gether could be useful in smaller subsets. An alternative
sification variables, which is a known cause of generali- to such a preselection consists of employing a genetic
zation problems (Naes & Mevik, 2001). These goals are algorithm (GA), which tests subsets of variables in an
usually conflicting, because the larger the number of efficient way instead of performing an exhaustive search
variables, the more information is available, but also the (Coley, 1999; Lestander, Leardi, & Geladi, 2003).
more difficult it is to avoid collinearity. The GA represents subsets of variables as individu-
als competing for survival in a population. The genetic

504

TEAM LinG
Financial Ratio Selection for Distress Classification

code of each individual is stored in a chromosome, which were extracted from the statements, allowing 28 ratios to
is a string of N binary genes, each gene associated to one be built, as shown in Table 1. Quantities WC, PBIT, EQ, .
of the variables available for selection. The genes with S, TL, ARP, and TA are commonly found in the financial
value 1 indicate the variables that are to be included in the distress literature (Altman, 1968; Taffler, 1982; Alici,
classification model. 1996), and the ratios shown in boxes are those adopted
In the formulation adopted here, the measure F of the by Altman (1968). It is worth noting that the book value
survival fitness of each individual is defined as follows. A of equity was used rather than the market value of equity
Z-score model is obtained from Equation 1 with the vari- to allow the inclusion of firms not quoted in the stock
ables indicated in the chromosome, and then F is calcu- market. Quantity RPY is not typically employed in dis-
lated as tress models, but we include it here to illustrate the ability
of the selection algorithms to discard uninformative
F = (e + r)1 (5) variables. The data employed in this example are given in
Galvo, Becerra, and Abou-Seada (2004).
where e is the number of classification errors in the mod- The data set was divided into a modeling set (21 failed
eling set, r is the condition number associated to the and 21 continuing firms) and a validation set (8 failed and
variables included in the model, and > 0 is a design 10 continuing firms). In what follows, the errors will be
parameter that balances modeling accuracy against col- divided into Type 1 (failed company classified as con-
linearity prevention. The larger is, the more emphasis is tinuing) and 2 (continuing company classified as failed).
placed on avoiding collinearity.
After a random initialization of the population, the Conventional Financial Ratios
algorithm proceeds according to the classic evolution-
ary cycle (Coley, 1999). At each generation, the roulette Previous studies (Becerra, Galvo, & Abou-Seada,
method is used for mating pool selection, followed by 2001) with this data set revealed that when the five
the genetic operators of one-point crossover and point conventional ratios are employed, Ratio 13 (PBIT/TA)
mutation. The population size is kept constant, with each is actually redundant and should be excluded from the
generation replacing the previous one completely. How- Z-score model in order to avoid collinearity problems.
ever, the best-fitted individual is preserved from one Thus, Equation 1 was applied only to the remaining four
generation to the next (elitism) in order to prevent ratios, leading to the results shown in Table 2. It is
good solutions from being lost. worth noting that if Ratio PBIT/TA is not discarded, the
number of validation errors increases from four to
seven.
CASE STUDY
Algorithm A
This example employs financial data from 29 failed and
31 continuing British corporations in the period from The preselection procedure was carried out by aug-
1997 to 2000. The data for the failed firms were taken menting the 28 financial ratios with seven uninforma-
from the last financial statements published prior to the tive variables yielded by an N(0,1) random number
start of insolvency proceedings. Eight financial quantities generator. The relevance index thus obtained is shown
in Figure 1. The threshold value, represented by a hori-
zontal line, was set to five times the average relevance of
Table 1. Numbering of financial ratios (Num/Den). the uninformative variables. As a result, 13 ratios were
Conventional ratios are displayed in boxes. WC = discarded.
working capital, PBIT = profit before interest and tax, After the preselection phase, combinations of the 15
EQ = equity, S = Sales, TL = total liabilities, ARP = remaining ratios were tested for modeling accuracy and
accumulated retained profit, RPY = retained profit for
the year, TA = total assets.
Table 2. Results of a Zscore model using four
Den
Num conventional ratios
WC PBIT EQ S TL ARP RPY
WC
PBIT 1 Type 1 Type 2 Percent
EQ 2 8
Data set
errors errors accuracy
S 3 9 14
Modeling 2 7 79%
TL 4 10 15 19
ARP 5 11 16 20 23 Cross- 3 8 74%
RPY 6 12 17 21 24 26 validation
TA 7 13 18 22 25 27 28 Validation 0 4 78%

505

TEAM LinG
Financial Ratio Selection for Distress Classification

Figure 1. Relevance index of the financial ratios. Log10 values are displayed for the convenience of visualization.

After the preselection phase, combinations of the 15 remaining ratios were tested for modeling accuracy and condition number. Figure 2 displays the number of modeling and cross-validation errors for the best subsets obtained as a function of the number n of ratios included. The cross-validation curve reaches a minimum between five and seven ratios. In this situation, the use of five ratios was deemed more appropriate due to the well-known Parsimony Principle, which states that given models with similar prediction ability, the simplest should be favored (Duda et al., 2001). The selected ratios were 7 (WC/TA), 9 (PBIT/S), 10 (PBIT/TL), 13 (PBIT/TA), and 25 (TL/TA). Interestingly, two of these ratios (WC/TA and PBIT/TA) belong to the set advocated by Altman (1968). Note that no ratio involving the quantity RPY was selected, which is in agreement with the fact that, in comparison with the other financial quantities used in this study, RPY has not often been used in the financial distress literature. The results of using the five selected ratios are summarized in Table 3.

Figure 2. Modeling (dashed line) and cross-validation (solid line) errors as a function of the number of ratios included in the Z-score model. The arrow indicates the final choice of ratios.

Table 3. Results of a Z-score model using the five ratios selected by Algorithm A

    Data set           Type 1 errors   Type 2 errors   Percent accuracy
    Modeling           1               3               90%
    Cross-validation   2               5               83%
    Validation         1               2               83%

Algorithm B

The genetic algorithm was employed with a population size of 100 individuals. The crossover and mutation probabilities were set to 60% and 5%, respectively, and the number of generations was set to 100.

Three values were tested for the design (weight) parameter: 10, 1, and 0.1. For each value, the algorithm was run five times, starting from different populations, and the best result (in terms of fitness) was kept. The selected ratios are shown in Table 4, which also presents the number of modeling and cross-validation errors in each case. Regardless of the value of the weight parameter, no ratio involving the quantity RPY was selected (see Table 1).

On the basis of cross-validation performance, a weight of 1 is seen to be the best choice (Ratios 2, 9, and 15). In fact, a smaller value, which causes the algorithm to place less emphasis on collinearity avoidance, results in the selection of more ratios. As a result, there is a gain in modeling accuracy but not in generalization ability (as assessed by cross-validation), which means that the model has become excessively complex. On the other hand, a larger value discouraged the selection of sets with more than one ratio. In this case, there is not enough discriminating information, which results in poor modeling and cross-validation performances.

Table 4. GA results for different values of the weight parameter

    Weight   Selected ratios              Modeling errors   Cross-validation errors
    10       {7}                          11                11
    1        {2, 9, 15}                   4                 5
    0.1      {2, 9, 13, 18, 19, 22, 25}   3                 7

Table 5 details the results for a weight of 1. Notice that one of the ratios obtained in this case belongs to the set used by Altman (EQ/TL). It is also worth pointing out that Ratios 2 and 9 were discarded in the preselection phase of Algorithm A, which prevented the algorithm from taking the GA solution {2, 9, 15} into account.

Table 5. Results of a Z-score model using the three ratios selected by Algorithm B

    Data set           Type 1 errors   Type 2 errors   Percent accuracy
    Modeling           4               0               90%
    Cross-validation   4               1               88%
    Validation         1               3               78%
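The entry does not spell out the GA's fitness function, so the following Python fragment is only a hedged sketch of the general idea it describes: reward low cross-validation error and penalize collinearity (measured by the condition number), with a weight playing the role of the design parameter discussed above. The scikit-learn estimator and all identifiers are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_predict

    def fitness(X, y, subset, weight):
        """Score a candidate ratio subset for a GA that maximizes this value."""
        Xs = X[:, subset]                                 # keep only the candidate ratios
        pred = cross_val_predict(LinearDiscriminantAnalysis(), Xs, y, cv=5)
        errors = np.sum(pred != y)                        # cross-validation misclassifications
        collinearity = np.log10(np.linalg.cond(Xs))       # condition-number penalty
        return -(errors + weight * collinearity)

A GA along these lines would evolve binary chromosomes over the 15 preselected ratios, each bit indicating whether a ratio enters the subset.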

Resampling Study

The results in the preceding two sections were obtained for one given partition of the available data into modeling and validation sets. In order to assess the validity of the ratio selection scheme employed, a resampling study was carried out. For this purpose, 1,000 different modeling/validation partitions were randomly generated with the same size as the one employed before (42 modeling companies and 18 validation companies). For each partition, a Z-score model was built and validated by using the subsets of ratios obtained in the previous subsections. Table 6 presents the average number of resulting errors. As can be seen, the ratios selected by Algorithms A and B lead to better classifiers than the conventional ones. The best result, in terms of both modeling and validation performances, was obtained by Algorithm B. Such a finding is in line with the parsimony of the associated classifier, which employs only three ratios.

Table 6. Resampling results (average number of errors)

    Data set     Conventional   Algorithm A   Algorithm B
                 (4 ratios)     (5 ratios)    (3 ratios)
    Modeling     8.52           5.70          5.36
    Validation   4.69           3.22          2.89
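As a rough illustration of the resampling protocol just described (1,000 random 42/18 modeling/validation splits with averaged error counts), one might proceed as in the sketch below. The use of scikit-learn's linear discriminant analysis as a stand-in for the Z-score model, and all names, are our assumptions, not the authors' code.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def resample_errors(X, y, subset, n_trials=1000, n_model=42,
                        rng=np.random.default_rng(0)):
        """Average modeling/validation errors of a ratio subset over random partitions."""
        model_err, valid_err = [], []
        for _ in range(n_trials):
            idx = rng.permutation(len(y))
            m, v = idx[:n_model], idx[n_model:]           # modeling / validation companies
            clf = LinearDiscriminantAnalysis().fit(X[m][:, subset], y[m])
            model_err.append(np.sum(clf.predict(X[m][:, subset]) != y[m]))
            valid_err.append(np.sum(clf.predict(X[v][:, subset]) != y[v]))
        return np.mean(model_err), np.mean(valid_err)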

CONCLUSION

The selection of appropriate ratios from the available financial quantities is an important and nontrivial step in building models for corporate financial distress classification. This article shows how data-driven variable selection techniques can be useful tools in building distress classification models. The article compares the results of two such techniques, one involving preselection followed by exhaustive search, and the other employing a genetic algorithm.

FUTURE TRENDS

The research on distress prediction has been moving towards nonparametric modeling in order to circumvent the limitations of discriminant analysis (such as the need for the classes to exhibit multivariate normal distributions with identical covariances). In this context, neural networks have been found to be a useful alternative, as demonstrated in a number of works (Wilson & Sharda, 1994; Alici, 1996; Atiya, 2001; Becerra et al., 2001).

In light of the utility of data-driven ratio selection techniques for discriminant analysis, as discussed in this article, it is to be expected that ratio selection may also play an important role for neural network models. However, future developments along this line will have to address the difficulty that nonlinear classifiers usually have to be adjusted by numerical search techniques (training algorithms), which may be affected by local minima problems (Duda et al., 2001).

REFERENCES

Alici, Y. (1996). Neural networks in corporate failure prediction: The UK experience. In A. P. Refenes, Y. Abu-Mostafa, & J. Moody (Eds.), Neural networks in financial engineering. London: World Scientific.

Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23, 589-609.

Andrade, J. M., Gomez-Carracedo, M. P., Fernandez, E., Elbergali, A., Kubista, M., & Prada, D. (2003). Classification of commercial apple beverages using a minimum set of mid-IR wavenumbers selected by Procrustes rotation. Analyst, 128(9), 1193-1199.

Atiya, A. F. (2001). Bankruptcy prediction for credit risk using neural networks: A survey and new results. IEEE Transactions on Neural Networks, 12(4), 929-935.

Becerra, V. M., Galvão, R. K. H., & Abou-Seada, M. (2001). Financial distress classification employing neural networks. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (pp. 45-49).

Centner, V., Massart, D. L., Noord, O. E., Jong, S., Vandeginste, B. M., & Sterna, C. (1996). Elimination of uninformative variables for multivariate calibration. Analytical Chemistry, 68, 3851-3858.
Coley, D. A. (1999). An introduction to genetic algorithms for scientists and engineers. Singapore: World Scientific.

Davison, A. C., & Hinkley, D. V. (Eds.). (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.

Galvão, R. K. H., Becerra, V. M., & Abou-Seada, M. (2004). Ratio selection for classification models. Data Mining & Knowledge Discovery, 8, 151-170.

Good, P. I. (1999). Resampling methods: A practical guide to data analysis. Boston, MA: Birkhauser.

Lestander, T. A., Leardi, R., & Geladi, P. (2003). Selection of near infrared wavelengths using genetic algorithms for the determination of seed moisture content. Journal of Near Infrared Spectroscopy, 11(6), 433-446.

Naes, T., & Mevik, B. H. (2001). Understanding the collinearity problem in regression and discriminant analysis. Journal of Chemometrics, 15(4), 413-426.

Navarro-Villoslada, F., Perez-Arribas, L. V., Leon-Gonzalez, M. E., & Polodiez, L. M. (1995). Selection of calibration mixtures and wavelengths for different multivariate calibration methods. Analytica Chimica Acta, 313(1-2), 93-101.

Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston, MA: Allyn & Bacon.

Taffler, R. J. (1982). Forecasting company failure in the UK using discriminant analysis and financial ratio data. Journal of the Royal Statistical Society, Series A, 145, 342-358.

Wilson, R. L., & Sharda, R. (1994). Bankruptcy prediction using neural networks. Decision Support Systems, 11, 545-557.

KEY TERMS

Condition Number: Ratio between the largest and smallest singular values of a matrix, often employed to assess the degree of collinearity between variables associated to the columns of the matrix.

Cross-Validation: Resampling method in which elements of the modeling set itself are alternately removed and reinserted for validation purposes.

Financial Distress: A company is said to be under financial distress if it is unable to pay its debts as they become due, which is aggravated if the value of the firm's assets is lower than its liabilities.

Financial Ratio: Ratio formed from two quantities taken from a financial statement.

Genetic Algorithm: Optimization technique inspired by the mechanisms of evolution by natural selection, in which the possible solutions are represented as the chromosomes of individuals competing for survival in a population.

Linear Discriminant Analysis: Multivariate classification technique that models the classes under consideration by normal distributions with equal covariances, which leads to hyperplanes as the optimal decision surfaces.

Resampling: Validation technique employed to assess the sensitivity of the classification method with respect to the choice of modeling data.


Flexible Mining of Association Rules


Hong Shen
Japan Advanced Institute of Science and Technology, Japan

INTRODUCTION

The discovery of association rules showing conditions of data co-occurrence has attracted the most attention in data mining. An example of an association rule is the rule "the customer who bought bread and butter also bought milk," expressed by T(bread, butter) ⇒ T(milk).

Let I = {x1, x2, ..., xm} be a set of (data) items, called the domain; let D be a collection of records (transactions), where each record, T, has a unique identifier and contains a subset of items in I. We define an itemset to be a set of items drawn from I and denote an itemset containing k items to be a k-itemset. The support of itemset X, denoted by σ(X/D), is the ratio of the number of records (in D) containing X to the total number of records in D. An association rule is an implication rule X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The confidence of X ⇒ Y is the ratio of σ(X ∪ Y/D) to σ(X/D), indicating the percentage of those containing X that also contain Y. Based on the user-specified minimum support (minsup) and confidence (minconf), the following statements hold: An itemset X is frequent if σ(X/D) ≥ minsup, and an association rule X ⇒ Y is strong if X ∪ Y is frequent and σ(X ∪ Y/D)/σ(X/D) ≥ minconf. The problem of mining association rules is to find all strong association rules, which can be divided into two subproblems:

1. Find all the frequent itemsets.
2. Generate all strong rules from all frequent itemsets.

Because the second subproblem is relatively straightforward (we can solve it by extracting every subset from an itemset and examining the ratio of its support), most of the previous studies (Agrawal, Imielinski, & Swami, 1993; Agrawal, Mannila, Srikant, Toivonen, & Verkamo, 1996; Park, Chen, & Yu, 1995; Savasere, Omiecinski, & Navathe, 1995) emphasized developing efficient algorithms for the first subproblem.

This article introduces two important techniques for association rule mining: (a) finding N most frequent itemsets and (b) mining multiple-level association rules.

BACKGROUND

An association rule is called a binary association rule if all items (attributes) in the rule have only two values: 1 (yes) or 0 (no). Mining binary association rules was the first proposed data mining task and was studied most intensively. Centered on the Apriori approach (Agrawal et al., 1993), various algorithms were proposed (Savasere et al., 1995; Shen, 1999; Shen, Liang, & Ng, 1999; Srikant & Agrawal, 1996). Almost all the algorithms observe the downward property that all the subsets of a frequent itemset must also be frequent, with different pruning strategies to reduce the search space. Apriori works by finding frequent k-itemsets from frequent (k-1)-itemsets iteratively for k = 1, 2, ..., m-1.

Two alternative approaches, mining on domain partition (Shen, L., Shen, H., & Cheng, 1999) and mining based on knowledge network (Shen, 1999), were proposed. The first approach partitions items suitably into disjoint itemsets, and the second approach maps all records to individual items; both approaches aim to improve the bottleneck of Apriori that requires multiple phases of scans (reads) on the database.

Finding all the association rules that satisfy minimal support and confidence is undesirable in many cases for a user's particular requirements. It is therefore necessary to mine association rules more flexibly according to the user's needs. Mining different sets of association rules of a small size for the purpose of prediction and classification was proposed (Li, Shen, & Topor, 2001; Li, Shen, & Topor, 2002; Li, Shen, & Topor, 2004; Li, Topor, & Shen, 2002).

MAIN THRUST

Association rule mining can be carried out flexibly to suit different needs. We illustrate this by introducing important techniques to solve two interesting problems.
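To make the definitions above concrete, here is a small, self-contained Python sketch (not from the article) that computes σ(X/D) by brute force and enumerates the strong rules derivable from a single frequent itemset. It is meant only as an illustration of the definitions, not as an efficient mining algorithm.

    from itertools import combinations

    def support(itemset, transactions):
        """sigma(X/D): fraction of transactions containing every item of X."""
        itemset = set(itemset)
        return sum(itemset <= set(t) for t in transactions) / len(transactions)

    def strong_rules(itemset, transactions, minconf):
        """Enumerate strong rules X => Y with X union Y equal to the given itemset.
        Assumes the itemset actually occurs in the data (support > 0)."""
        rules = []
        for k in range(1, len(itemset)):
            for x in combinations(itemset, k):
                conf = support(itemset, transactions) / support(x, transactions)
                if conf >= minconf:
                    rules.append((set(x), set(itemset) - set(x), conf))
        return rules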


Finding N Most Frequent Itemsets

Given x, y ⊆ I, we say that x is greater than y, or y is less than x, if σ(x/D) > σ(y/D). The largest itemset in D is the itemset that occurs most frequently in D. We want to find the N largest itemsets in D, where N is a user-specified number of interesting itemsets. Because users are usually interested in those itemsets with larger supports, finding the N most frequent itemsets is significant, and its solution can be used to generate an appropriate number of interesting itemsets for mining association rules (Shen, L., Shen, H., Pritchard, & Topor, 1998).

We define the rank of itemset x, denoted by ρ(x), as follows: ρ(x) = |{σ(y/D) : σ(y/D) > σ(x/D), y ⊆ I}| + 1. Call x a winner if ρ(x) ≤ N and σ(x/D) ≥ 1, which means that x is one of the N largest itemsets and it occurs in D at least once. We don't regard any itemset with support 0 as a winner, even if it is ranked below N, because we do not need to provide users with an itemset that doesn't occur in D at all.

Use W to denote the set of all winners and call the support of the smallest winner the critical support, denoted by crisup. Clearly, W exactly contains all itemsets with support not less than crisup; we also have crisup ≥ 1. It is easy to see that |W| may be different from N: If the number of all itemsets occurring in D is less than N, |W| will be less than N; |W| may also be greater than N, as different itemsets may have the same support. The problem of finding the N largest itemsets is to generate W.

Let x be an itemset. Use Pk(x) to denote the set of all k-subsets (subsets with size k) of x. Use Uk to denote P1(I) ∪ ... ∪ Pk(I), the set of all itemsets with a size not greater than k. We introduce the k-rank of x, denoted by ρk(x), as follows: ρk(x) = |{σ(y/D) : σ(y/D) > σ(x/D), y ∈ Uk}| + 1. Call x a k-winner if ρk(x) ≤ N and σ(x/D) ≥ 1, which means that among all itemsets with a size not greater than k, x is one of the N largest itemsets and it also occurs in D at least once. Use Wk to denote the set of all k-winners. We define the k-critical-support, denoted by k-crisup, as follows: If |Wk| < N, then k-crisup is 1; otherwise, k-crisup is the support of the smallest k-winner. Clearly, Wk exactly contains all itemsets with a size not greater than k and support not less than k-crisup. We present some useful properties of the preceding concepts as follows.

Property: Let k and i be integers such that 1 ≤ k < k+i ≤ |I|.

(1) Given x ∈ Uk, we have x ∈ Wk iff σ(x/D) ≥ k-crisup.
(2) If Wk-1 = Wk, then W = Wk.
(3) Wk+i ∩ Uk ⊆ Wk.
(4) 1 ≤ k-crisup ≤ (k+i)-crisup.

To find all the winners, the algorithm makes multiple passes over the data. In the first pass, we count the supports of all 1-itemsets, select the N largest ones from them to form W1, and then use W1 to generate potential 2-winners with size 2. Each subsequent pass k involves three steps: First, we count the support for potential k-winners with size k (called candidates) during the pass over D; then we select the N largest ones from a pool precisely containing the supports of all these candidates and all (k-1)-winners to form Wk; finally, we use Wk to generate potential (k+1)-winners with size k+1, which will be used in the next pass. This process continues until we cannot get any potential (k+1)-winners with size k+1, which implies that Wk+1 = Wk. From Property (2), we know that the last Wk exactly contains all winners.

We assume that Mk is the number of itemsets with support equal to k-crisup and a size not greater than k, where 1 ≤ k ≤ |I|, and M is the maximum of M1, ..., M|I|. Thus, we have |Wk| = N + Mk - 1 ≤ N + M. It was shown that the time complexity of the algorithm is proportional to the number of all the candidates generated in the algorithm, which is O(log(N+M) * min{N+M, |I|} * (N+M)) (Shen et al., 1998). Hence, the time complexity of the algorithm is polynomial for bounded N and M.

Mining Multiple-Level Association Rules

Although most previous research emphasized mining association rules at a single concept level (Agrawal et al., 1993; Agrawal et al., 1996; Park et al., 1995; Savasere et al., 1995; Srikant & Agrawal, 1996), some techniques were also proposed to mine rules at generalized abstract (multiple) levels (Han & Fu, 1995). However, they can only find multiple-level rules in a fixed concept hierarchy. Our study in this fold is motivated by the goal of mining multiple-level rules in all concept hierarchies (Shen, L., & Shen, H., 1998).

A concept hierarchy can be defined on a set of database attribute domains such as D(a1), ..., D(an), where, for i ∈ [1, n], ai denotes an attribute and D(ai) denotes the domain of ai. The concept hierarchy is usually partially ordered according to a general-to-specific ordering. The most general concept is the null description ANY, whereas the most specific concepts correspond to the specific attribute values in the database. Given a set of D(a1), ..., D(an), we define a concept hierarchy H as follows: Hn ⇒ Hn-1 ⇒ ... ⇒ H0, where Hi = D(a1^i) × ... × D(a_{n_i}^i) for i ∈ [0, n], and {a1^n, ..., a_{n_n}^n} = {a1, ..., an} ⊇ {a1^{n-1}, ..., a_{n_{n-1}}^{n-1}} ⊇ .... Here, Hn represents the set of concepts at the primitive level, Hn-1 represents the concepts at one level higher than those at Hn, and so forth; H0, the highest level hierarchy, may contain solely the most general concept,


start with L k-1, the set of all large [k-1]-itemsets, and use
ANY. We also use { a1n ,..., ann }{ a1n ,..., ann11 }...{ a11 } to
denote H directly, and H0 may be omitted here.
Lk-1 to generate Ck, a superset of all large [k]-itemsets. Call .
the elements in Ck candidate itemsets, and count the
We introduce FML items to represent concepts at any support for these itemsets during the pass over the data.
level of a hierarchy. Let *, called a trivial digit, be a dont- At the end of the pass, we determine which of these
care digit. An FML item is represented by a sequence of itemsets are actually large and obtain Lk for the next pass.
digits, x = x1x2 xn, xi D(ai) 7 {*}. The flat-set of x is This process continues until no new large itemsets are
defined as Sf (x) ={(I, x i) | i [1; n] and xi *}. Given two found. Not that L1={{trivial item}}, because {trivial item}
items x and y, x is called a generalized item of y if Sf (x) Sf is the unique [1]-itemset and is supported by all transac-
(y), which means that x represents a higher-level concept tions.
that contains the lower-level concept represented by y. The computation cost of the preceding algorithm,
Thus, *5* is a generalized item of 35* due to Sf(*5*) which finds all frequent FML itemsets is O( cC s( x) ),
={(2,5)} Sf (35*)={(1,3),(2,5)}. If Sf(x)=, then x is called where C is the set of all candidates, g(c) is the cost for
a trivial item, which represents the most general concept, generating c as a candidate, and s(c) is the cost for
ANY. counting the support of c (Shen, L., & Shen, H., 1998).
Let T be an encoded transaction table, t a transaction The algorithm is optimal if the method of support
in T, x an item, and c an itemset. We can say that (a) t counting is optimal.
supports x if an item y exists in t such that x is a generalized After all frequent FML itemsets have been found,
item of y and (b) t supports c if t supports every item in c. we can proceed with the construction of strong FML
The support of an itemset c in T, (c/T), is the ratio of the rules. Use r(l,a) to denote rule Fs(a)Fs(Fc(l)-Fc(a)),
number of transactions (in T) that support c to the total where l is an itemset and a l. Use Fo(l,a) to denote
number of transactions in T. Given a minsup, an itemset c Fc(l)-Fc(a) and say that Fo(l,a) is an outcome form of l
is large if (c/T)e>minsup; otherwise, it is small. or the outcome form of r(l,a). Note that Fo(l,a) repre-
Given an itemset c, we define its simplest form as sents only a specific form rather than a meaningful
Fs(c)={x c| y c; Sf (x) Sf(y)} and its complete form itemset, so it is not equivalent to any other itemset
as Fc(c)={x|Sf (x) Sf (y), y c}. whose simplest form is Fs(Fo(l,a)). Outcome forms are
Given an itemset c, we call the number of elements in also called outcomes directly. Clearly, the correspond-
Fs(c) its size and the number of elements in Fc(c) its weight. ing relationship between rules and outcomes is one to
An itemset of size j and weight k is called a (j)-itemset, [k]- one. An outcome is strong if it corresponds to a strong
itemset, or (j)[k]-itemset. Let c be a (j)[k]-itemset. Use Gi(c) rule. Thus, all strong rules related to a large itemset can
to indicate the set of all [i]-generalized-subsets of c, where be obtained by finding all strong outcomes of this
i<k. Thus, the set of all [k-1]-generalized-subsets of c can itemset.
be generated as follows: Gk-1(c) = {Fs(Fc(c)- {x}) | x Fs(c)}; Let l be an itemset. Use O(l) to denote the set of all
the size of Gk-1(c) is j. Hence, for k>1, c is a (1)-itemset iff outcomes of l; that is, O(l) ={Fo(l,a)|a l}. Thus, from
|Gk-1(c)|=1. With this observation, we call a [k-1]-itemset a O(l), we can output all rules related to l: Fs(Fc(l)-o) Fs(o)
self-extensive if a (1)[k]-itemset b exists such that Gk-1(b) (denoted by r(l , o) ), where o O(l). Clearly, r(l,a) and
= {a}; at the same time, we call b the extension result of a.
r(l , o) denote the same rule if o=Fo(l,a). Let o, o O(l ) . We
Thus, all self-extensive itemsets as can be generated from
can say two things: (a) o is a |k|-outcome of l if o exactly
all (1)-itemsets bs as follows: Fc(a) = Fc(b)-Fs(b).
Let b be a (1)-itemset and Fs(b)={x}. From |Fc(b)| = |{y|Sf contains k elements and (b) o is a sub-outcome of o
(y) Sf (x)}| = 2| S ( x )| , we know that b is a (1) [ 2| S ( x )| ]-itemset.
f f versus l if o o . Use Ok(l) to denote the set of all the |k|-
Let a be the self-extensive itemset generated by b; that is, outcomes of l and use Vm(o,l) to denote the set of all the
a is a [ 2| S ( x )| -1]-itemset such that Fc(a)=Fc(b)-Fs(b). Thus,
f |m|-sub-outcomes of o versus l. Let o be an |m+1|-outcome
if |Sf (x)| > 1, there exist y, z Fs(a) and yz such that Sf of l and me>1. If |Vm(o,l)| =1, then o is called an elementary
outcome; otherwise, o is a non-elementary outcome.
(x)=Sf(y) 7 Sf(z). Clearly, this property can be used to gen-
Let r(l,a) and r(l,b) be two rules. We can say that r(l,a)
erate the corresponding extension result from any self-
is an instantiated rule of r(l,b) if b a. Clearly, b a
extensive [2m-1]-itemset, where m > 1. For simplicity, given
a self-extensive [2m-1]-itemset a, where m>1, we directly (l / T ) (l / T )
implies (b/T)>(a/T) and . Hence, all
use Er(a) to denote its extension result. For example, (a / T ) (b / T )
Er({12*,1*3,*23}) = {123}. instantiated rules of a strong rule must also be strong. Let
The algorithm makes multiple passes over the data- l be a large itemset, o1, o2 O(l), o1=Fo(l,a), and o2 =
base. In the first pass, we count the supports of all [2]- Fo(l,b). The three straightforward conclusions are (a)
itemsets and then select all the large ones. In pass k-1, we


o1 ⊆ o2 iff b ⊆ a, (b) r(l, o1) = r(l, a), and (c) r(l, o2) = r(l, b). Therefore, r(l, o1) is an instantiated rule of r(l, o2) iff o1 ⊆ o2, which implies that all the sub-outcomes of a strong outcome must also be strong. This characteristic is similar to the property that all generalized subsets of a large itemset must also be large. Hence, the algorithm works as follows. From a large itemset l, we first generate all the strong rules with |1|-outcomes (ignore the |0|-outcome ∅, which only corresponds to the trivial strong rules ∅ ⇒ Fs(l) for every large itemset l). We then use all these strong |1|-outcomes to generate all the possible strong |2|-outcomes of l, from which all the strong rules with |2|-outcomes can be produced, and so forth.

The computation cost of the algorithm constructing all the strong FML rules is O(|C|t + |C|s) (Shen, L., & Shen, H., 1998), which is optimal for bounded costs t and s, where C is the set of all the candidate rules, s is the average cost for generating one element in C, and t is the average cost for computing the confidence of one element in C.

FUTURE TRENDS

The discovery of data association is a central topic of data mining. Mining this association flexibly to suit different needs has an increasing significance for applications. Future research trends include mining association rules in data of complex types such as multimedia, time-series, text, and biological databases, in which certain properties (e.g., multidimensionality), constraints (e.g., time variance), and structures (e.g., sequence) of the data must be taken into account during the mining process; mining association rules in databases containing incomplete, missing, and erroneous data, which requires people to produce correct results regardless of the unreliability of data; and mining association rules online for streaming data that come from an open-ended data stream and flow away instantaneously at a high rate, which usually adopts techniques such as active learning, concept drift, and online sampling and proceeds in an incremental manner.

CONCLUSION

We have summarized research activities in mining association rules and some recent results of my research in different streams of this topic. We have introduced important techniques to solve two interesting problems of mining N most frequent itemsets and mining multiple-level association rules, respectively. These techniques show that mining association rules can be performed flexibly in user-specified forms to suit different needs. More relevant results can be found in Shen (1999). We hope that this article provides useful insight to researchers for better understanding of more complex problems and their solutions in this research direction.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining associations between sets of items in massive databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 207-216).

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In U. Fayyad (Ed.), Advances in knowledge discovery and data mining (pp. 307-328). MIT Press.

Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. Proceedings of the 21st VLDB International Conference (pp. 420-431).

Li, J., Shen, H., & Topor, R. (2001). Mining the smallest association rule set for predictions. Proceedings of the IEEE International Conference on Data Mining (pp. 361-368).

Li, J., Shen, H., & Topor, R. (2002). Mining the optimum class association rule set. Knowledge-Based Systems, 15(7), 399-405.

Li, J., Shen, H., & Topor, R. (2004). Mining the informative rule set for prediction. Journal of Intelligent Information Systems, 22(2), 155-174.

Li, J., Topor, R., & Shen, H. (2002). Construct robust rule sets for classification. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 564-569).

Park, J. S., Chen, M., & Yu, P. S. (1995). An effective hash based algorithm for mining association rules. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 175-186).

Savasere, A., Omiecinski, R., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. Proceedings of the 21st VLDB International Conference (pp. 432-443).

Shen, H. (1999). New achievements in data mining. In J. Sun (Ed.), Science: Advancing into the new millennium (pp. 162-178). Beijing: People's Education.

Shen, H., Liang, W., & Ng, J. K-W. (1999). Efficient computation of frequent itemsets in a subcollection of multiple set families. Informatica, 23(4), 543-548.


Shen, L., & Shen, H. (1998). Mining flexible multiple-level association rules in all concept hierarchies. Proceedings of the Ninth International Conference on Database and Expert Systems Applications (pp. 786-795).

Shen, L., Shen, H., & Cheng, L. (1999). New algorithms for efficient mining of association rules. Information Sciences, 118, 251-268.

Shen, L., Shen, H., Pritchard, P., & Topor, R. (1998). Finding the N largest itemsets. Proceedings of the International Conference on Data Mining, 19 (pp. 211-222).

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. ACM SIGMOD Record, 25(2), 1-8.

KEY TERMS

Association Rule: An implication rule X ⇒ Y that shows the conditions of co-occurrence of disjoint itemsets (attribute value sets) X and Y in a given database.

Concept Hierarchy: The organization of a set of database attribute domains into different levels of abstraction according to a general-to-specific ordering.

Confidence of Rule X ⇒ Y: The fraction of the database containing X that also contains Y, which is the ratio of the support of X ∪ Y to the support of X.

Flexible Mining of Association Rules: Mining association rules in user-specified forms to suit different needs, such as on dimension, level of abstraction, and interestingness.

Frequent Itemset: An itemset that has a support greater than user-specified minimum support.

Strong Rule: An association rule whose support (of the union of itemsets) and confidence are greater than user-specified minimum support and confidence, respectively.

Support of Itemset X: The fraction of the database that contains X, which is the ratio of the number of records containing X to the total number of records in the database.


Formal Concept Analysis Based Clustering


Jamil M. Saquer
Southwest Missouri State University, USA

INTRODUCTION

Formal concept analysis (FCA) is a branch of applied mathematics with roots in lattice theory (Wille, 1982; Ganter & Wille, 1999). It deals with the notion of a concept in a given universe, which it calls a context. For example, consider the context of transactions at a grocery store, where each transaction consists of the items bought together. A concept here is a pair of two sets (A, B): A is the set of transactions that contain all the items in B, and B is the set of items common to all the transactions in A. A successful area of application for FCA has been data mining. In particular, techniques from FCA have been successfully used in the association mining problem and in clustering (Kryszkiewicz, 1998; Saquer, 2003; Zaki & Hsiao, 2002). In this article, we review the basic notions of FCA and show how they can be used in clustering.

BACKGROUND

A fundamental notion in FCA is that of a context, which is defined as a triple (G, M, I), where G is a set of objects, M is a set of features (or attributes), and I is a binary relation between G and M. For object g and feature m, gIm if and only if g possesses the feature m. An example of a context is given in Table 1, where an X is placed in the ith row and jth column to indicate that the object in row i possesses the feature in column j.

The set of features common to a set of objects A is denoted by β(A) and is defined as {m ∈ M | gIm ∀ g ∈ A}. Similarly, the set of objects possessing all the features in a set of features B is denoted by α(B) and is given by {g ∈ G | gIm ∀ m ∈ B}. The operators α and β satisfy the assertions given in the following lemma.

Lemma 1 (Wille, 1982): Let (G, M, I) be a context. Then the following assertions hold:

1. A1 ⊆ A2 implies β(A2) ⊆ β(A1) for every A1, A2 ⊆ G, and B1 ⊆ B2 implies α(B2) ⊆ α(B1) for every B1, B2 ⊆ M.
2. A ⊆ α(β(A)) and β(A) = β(α(β(A))) for all A ⊆ G, and B ⊆ β(α(B)) and α(B) = α(β(α(B))) for all B ⊆ M.

A formal concept in the context (G, M, I) is defined as a pair (A, B) where A ⊆ G, B ⊆ M, β(A) = B, and α(B) = A. A is called the extent of the formal concept and B is called its intent. For example, the pair (A, B) where A = {2, 3, 4} and B = {a, g, h} is a formal concept in the context given in Table 1. A subconcept/superconcept order relation on concepts is as follows: (A1, B1) ≤ (A2, B2) iff A1 ⊆ A2 (or, equivalently, iff B2 ⊆ B1). The fundamental theorem of FCA states that the set of all concepts on a given context is a complete lattice, called the concept lattice (Ganter & Wille, 1999). Concept lattices are drawn using Hasse diagrams, where concepts are represented as nodes. An edge is drawn between concepts C1 and C2 iff C1 ≤ C2 and there is no concept C3 such that C1 ≤ C3 ≤ C2. The concept lattice for the context in Table 1 is given in Figure 1.

Table 1. A context excerpted from (Ganter & Wille, 1999, p. 18). a = needs water to live; b = lives in water; c = lives on land; d = needs chlorophyll; e = two seed leaves; f = one seed leaf; g = can move around; h = has limbs; i = suckles its offspring

                    a    b    c    d    e    f    g    h    i
    1 Leech         X    X                        X
    2 Bream         X    X                        X    X
    3 Frog          X    X    X                   X    X
    4 Dog           X         X                   X    X    X
    5 Spike-weed    X    X         X         X
    6 Reed          X    X    X    X         X
    7 Bean          X         X    X    X
    8 Maize         X         X    X         X

A less condensed representation of a concept lattice is possible using reduced labeling (Ganter & Wille, 1999). Figure 2 shows the concept lattice in Figure 1 with reduced labeling. It is easier to see the relationships and similarities among objects when reduced labeling is used. The extent of a concept C in Figure 2 consists of the objects at C and the objects at the concepts that can be reached from C going downward, following descending paths towards the bottom concept. Similarly, the intent of C consists of the features at C and the features at the concepts that can be reached from C going upwards, following ascending paths to the top concept.

Consider the context presented in Table 1. Let B = {a, f}. Then α(B) = {5, 6, 8}, and β(α(B)) = β({5, 6, 8}) = {a, d, f} ⊇ {a, f}; therefore, in general, β(α(B)) ≠ B. A set of features B that satisfies the condition β(α(B)) = B is called a closed feature set. Intuitively, a closed feature set is a


Figure 1. Concept lattice for the context in Table 1


.

Figure 2. Concept lattice for the context in Table 1 with reduced labeling

maximal set of features shared by a set of objects. It is easy Traditionally, most clustering algorithms do not allow
to show that intents of the concepts of a concept lattice clusters to overlap. However, this is not a valid assump-
are all closed feature sets. tion for many applications. For example, in Web docu-
The support of a set of features B is defined as the ments clustering, many documents have more than one
percentage of objects that possess every feature in B. topic and need to reside in more than one cluster (Beil,
That is, support(B) = |(B)|/|G|, where |B| is the cardinality Ester, & Xu, 2002; Hearst, 1999; Zamir & Etzioni, 1998).
of B. Let minSupport be a user-specified threshold value Similarly, in the market basket data, items purchased in a
for minimum support. A feature set B is frequent iff transaction may belong to more than one category of
support(B) minSupport. A frequent closed feature set is items.
a closed feature set, which is also frequent. For example, The concept lattice structure provides a hierarchical
for minSupport = 0.3, {a, f} is frequent, {a, d, f} is frequent clustering of objects, where the extent of each node could
closed, while {a, c, d, f} is closed but not frequent. be a cluster and the intent provides a description of that
cluster. There are two main problems, though, that make
it difficult to recognize the clusters to be used. First, not
CLUSTERING BASED ON FCA all objects are present at all levels of the lattice. Second,
the presence of overlapping clusters at different levels is
It is believed that the method described below is the first not acceptable for disjoint clustering. The techniques
for using FCA for disjoint clustering. Using FCA for described in this chapter solve these problems. For ex-
conceptual clustering to gain more information about data ample, for a node to be a cluster candidate, its intent must
is discussed in Carpineto & Romano (1999) and Mineau be frequent (meaning a minimum percentage of objects
& Godin (1995). In the remainder of this article we show must possess all the features of the intent). The intuition
how FCA can be used for clustering. is that the objects within a cluster must contain many


features in common. Overlapping is resolved by using a negative(g, Ci) be the set of features in b(g) which are
score function that measures the goodness of a cluster for global-frequent but not cluster-frequent. The function
an object and keeps the object in the cluster where it scores score(g, Ci) is then given by the following formula:
best.
score(g, Ci) = Σ f∈positive(g, Ci) cluster-support(f) − Σ f∈negative(g, Ci) global-support(f).

Given a set of objects G = {g1, g2, , gn}, where each object The first term in score(g, Ci) favors Ci for every feature
is described by the set of features it possesses, (i.e., gi is in positive(g, Ci) because these features contribute to
described by b(gi)), = {C1, C2, , Ck} is a clustering of G intra-cluster similarity. The second term penalizes Ci for
if and only if each Ci is a subset of G and UCi = G, where i every feature in negative(g, C i) because these features
ranges from 1 to k. For disjoint clustering, the additional contribute to inter-cluster similarities.
condition Ci I Cj = must be satisfied for all i j. An overlapping object will be deleted from all initial
Our method for disjoint clustering consists of two clusters except for the cluster where it scores highest.
steps. First, assign objects to their initial clusters. Second, Ties are broken by assigning the object to the cluster
make these clusters disjoint. For overlapping clustering, with the longest label. If this does not resolve ties, then
only the first step is needed. one of the clusters is chosen randomly.

Assigning Objects to Initial Clusters An Illustrating Example

Each frequent closed feature set (FCFS) is a cluster candi- Consider the objects in Table 1. The closed feature sets
date, with the FCFS serving as a label for that cluster. Each are the intents of the concept lattice in Figure 1. For
object g is assigned to the cluster candidates described by minSupport = 0.35, a feature set must appear in at least 3
the maximal frequent closed feature set (MFCFS) con- objects to be frequent. The frequent closed feature sets
tained in b(g). These initial clusters may not be disjoint are a, ag, ac, ab, ad, agh, abg, acd, and adf. These are the
because an object may contain several MFCFS. For ex- candidate clusters. Using the notation C[x] to indicate
ample, for minSupport = 0.375, object 2 in Table 1 contains the cluster with label x, and assigning objects to MFCFS
the MFCFSs agh and abg. results in the following initial clusters: C[agh] = {2, 3, 4},
Notice that all of the objects in a cluster must contain C[abg] = {1, 2, 3}, C[acd] = {6, 7, 8}, and C[adf] = {5, 6,
all the features in the FCFS describing that cluster (which 8}. To find the most suitable cluster for object 6, we need
is also used as the cluster label). This is always true even to calculate its score in each cluster containing it. For
after any overlapping is removed. This means that this cluster-support threshold value of 0.7, it is found that,
method produces clusters with their descriptions. This is score(6,C[acd]) = 1 - 0.63 + 1 + 1 - 0.38 = 1.99 and
a desirable property. It helps a domain expert to assign score[6,C[adf]) = 1 - 0.63 - 0.63 + 1 + 1 = 1.74.
labels and descriptions to clusters. We will use score(6,C[acd]) to explain the calculation.
All features in b(6) are global-frequent. They are a, b, c,
d, and f, with global frequencies 1, 0.63, 0.63, 0.5, and 0.38,
Making Clusters Disjoint respectively. Their respective cluster-support values are
1, 0.33, 1, 1, and 0.67. For a feature to be cluster-frequent
To make the clusters disjoint, we find the best cluster for in C[acd], it must appear in at least [ x |C[acd]|] = 3 of its
each overlapping object g and keep g only in that cluster. objects. Therefore, a, c, d positive(6,C[acd]), and b, f
To achieve this, a score function is used. The function negative(6,C[acd]). Substituting these values into the
score(g, C i) measures the goodness of cluster Ci for object formula for the score function, it is found that the
g. Intuitively, a cluster is good for an object g if g has many score(6,C[acd]) = 1.99. Since score(6,C[acd]) >
frequent features which are also frequent in Ci. On the other score[6,C[adf]), object 6 is assigned to C[acd].
hand, C i is not good for g if g has frequent features that are Tables 2 and 3 show the score calculations for all
not frequent in Ci. overlapping objects. Table 2 shows initial cluster as-
Define global-support(f) as the percentage of objects signments. Features that are both global-frequent and
possessing f in the whole database, and cluster-support(f) cluster-frequent are shown in the column labeled
as the percentages of objects possessing f in a given positive(g,C i), and features that are only global-frequent
cluster C i. We say f is cluster frequent in C i if cluster- are in the column labeled negative(g,Ci). For elements in
support(f) is at least a user-specified minimum threshold the column labeled positive(g,Ci), the cluster-support
value q. For cluster Ci, let positive(g, Ci) be the set of values are listed between parentheses after each feature
features in b(g) which are both global-frequent (i.e., fre- name. The same format is used for global-support values
quent in the whole database) and cluster-frequent. Also let


Table 2. Initial cluster assignments and support values


(minSupport = 0.37, = 0.7) .
Ci initial clusters Positive(g,Ci) negative(g,Ci)

C[agh] 2: abgh a(1), b(0.63),

3: abcgh g(1), c(0.63)

4: acghi h(1)

C[abg] 1: abg a(1), c(0.63),

2: abgh b(1), h(0.38)

3: abcgh g(1)

C[acd] 6: abcdf a(1), b(0.63),

7: acde c(1), f(0.38)

8: acdf d(1)

C[adf] 5: abdf a(1), b(0.63),

6: abcdf d(1), c(0.63)

8: acdf f(1)

Table 3. Score calculations and final cluster assignments and = 0.5, we get the following final clusters C[agh] = {3,
(minSupport = 0.37, = 0.7) 4}, C[abg] = {1, 2}, C[acd] = {7, 8}, and C[adf] = {5, 6}.

Ci Score(g, Ci) Final clusters


C[agh] 2: 1 - .63 + 1 + 1 = 2.37 FUTURE TRENDS
3: 1 - .63 - .63 + 1 + 1 = 1.74
4: does not overlap 4 We have shown how FCA can be used for clustering.
C[abg] 1: does not overlap 1 Researchers have also used the techniques of FCA and
2: 1 + 1 + 1 - .38 = 2.62 2 frequent itemsets in the association mining problem (Fung,
3: 1 + 1 - .63 + 1 - .38 = 1.99 3
Wang, & Ester, 2003; Wang, Xu, & Liu, 1999; Yun, Chuang,
C[acd] 6: 1 - .63 + 1 +1 - .38 = 2.62 6
5: does not overlap 7
& Chen, 2001). We anticipate this framework to be suitable
8: 1 + 1 + 1 -.38 = 2.62 8 for other problems in data mining such as classification.
C[adf] 5: does not overlap 5 Latest developments in efficient algorithms for generat-
6: 1 - .63 - .63 + 1 + 1 = 1.74 ing frequent closed itemsets make this approach efficient
8: 1 - .63 + 1 + 1 = 2.37 for handling large amounts of data (Gouda & Zaki, 2001;
Pei, Han, & Mao, 2000; Zaki & Hsiao, 2002).

CONCLUSION
for features in the column labeled negative(g,Ci). It is only
a coincidence that in this example all cluster-support This chapter introduces formal concept analysis (FCA);
values are 1 (try = 0.5 for different values). Table 3 shows a useful framework for many applications in computer
score calculations, and final cluster assignments. science. We also showed how the techniques of FCA can
Notice that, different threshold values may result in be used for clustering. A global support value is used to
different clusters. The value of minSupport affects cluster specify which concepts can be candidate clusters. A
labels and initial clusters while that of affects final score function is then used to determine the best cluster
elements in clusters. For example, for minSupport = 0.375 for each object. This approach is appropriate for cluster-


ing categorical data, transaction data, text data, Web Saquer, J. (2003). Using concept lattices for disjoint clus-
documents, and library documents. These data usually tering. In The Second IASTED International Conference
suffer from the problem of high dimensionality with only on Information and Knowledge Sharing (pp. 144-148).
few items or keywords being available in each transaction
or document. FCA contexts are suitable for representing Wang, K., Xu, C., & Liu, B. (1999). Clustering transactions
this kind of data. using large items. In ACM International Conference on
Information and Knowledge Management (pp. 483-490).
Wille, R. (1982). Restructuring lattice theory: An ap-
REFERENCES proach based on hierarchies of concepts. In I. Rival (Ed.),
Ordered sets (pp. 445-470). Dordecht-Boston: Reidel.
Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based Yun, C., Chuang, K., & Chen, M. (2001). An efficient
text clustering. In The 8th International Conference on clustering algorithm for market basket data based on small
Knowledge Discovery and Data Mining (KDD 2002) large ratios. In The 25th COMPSAC Conference (pp. 505-
(pp. 436-442). 510).
Carpineto, C., & Romano, G. (1993). GALOIS: An order- Zaki, M.J., & Hsiao, C. (2002). CHARM: An efficient
theoretic approach to conceptual clustering. In Proceed- algorithm for closed itemset mining. In Second SIAM
ings of 1993 International Conference on Machine Learn- International Conference on Data Mining.
ing (pp. 33-40).
Zamir, O., & Etzioni, O. (1998). Web document clustering:
Fung, B., Wang, K., & Ester, M. (2003). Large hierarchical A feasibility demonstration. In The 21st Annual Interna-
document clustering using frequent itemsets. In Third tional ACM SIGIR (pp. 46-54).
SIAM International Conference on Data Mining (pp. 59-
70).
KEY TERMS
Ganter, B., & Wille, R. (1999). Formal concept analysis:
Mathematical foundations. Berlin: Springer-Verlag. Cluster-Support of Feature F In Cluster Ci: Percent-
Gouda, K., & Zaki, M. (2001). Efficiently mining maximal age of objects in Ci possessing f.
frequent itemsets. In First IEEE International Confer- Concept: A pair (A, B) of a set A of objects and a set
ence on Data Mining (pp. 163-170). San Jose, USA. B of features such that B is the maximal set of features
Hearst, M. (1999). The use of categories and clusters for possessed by all the objects in A and A is the maximal set
organizing retrieval results. In T. Strzalkowski (Ed.), Natu- of objects that possess every feature in B.
ral language information retrieval (pp. 333-369). Bos- Context: A triple (G, M, I) where G is a set of
ton: Kluwer Academic Publishers. objects, M is a set of features and I is a binary relation
Kryszkiewicz, M. (1998). Representative association rules. between G and M such that gIm if and only if object g
In Proceedings of PAKDD 98. Lecture Notes in Artifi- possesses the feature m.
cial Intelligence (Vol. 1394) (pp. 198-209). Berlin: Springer- Formal Concept Analysis: A mathematical framework
Verlag. that provides formal and mathematical treatment of the
Mineau, G., & Godin, R. (1995). Automatic structuring of notion of a concept in a given universe.
knowledge bases by conceptual clustering. IEEE Trans- Negative(g, Ci): Set of features possessed by g which
actions on Knowledge and Data Engineering, 7 (5), 824- are global-frequent but not cluster-frequent.
829.
Positive(g, Ci): Set of features possessed by g which
Pei J., Han J., & Mao R. (2000). CLOSET: an efficient are both global-frequent and cluster-frequent.
algorithm for mining frequent closed itemsets. In ACM-
SIGMOD Workshop on Research Issues in Data Mining Support or Global-Support of Feature F: Percentage
and Knowledge Discovery (pp. 21-30). Dallas, USA. of object transactions in the whole context (or whole
database) that possess f.


Fuzzy Information and Data Analysis


Reinhard Viertl
Vienna University of Technology, Austria

INTRODUCTION

The results of data warehousing and data mining depend essentially on the quality of the data. Usually data are assumed to be numbers or vectors, but this is often not realistic. In particular, the result of a measurement of a continuous quantity is never a precise number, but is more or less non-precise. This kind of uncertainty is also called fuzziness and should not be confused with errors. Data mining techniques have to take care of fuzziness in order to avoid unrealistic results.

BACKGROUND

In standard data warehousing and data analysis, data are treated as numbers, vectors, words, or symbols. These data types do not take care of the fuzziness of data and prior information. Whereas some methodology for fuzzy data analysis has been developed, statistical data analysis usually does not take care of fuzziness. Recently, some methods for the statistical analysis of non-precise data were published (Viertl, 1996, 2003).

Historically, fuzzy sets were first introduced by K. Menger in 1951 (Menger, 1951). Later L. Zadeh made fuzzy models popular. For more information on fuzzy modeling compare (Dubois & Prade, 2000). Most data analysis techniques are statistical techniques. Only in the last 20 years were alternative methods using fuzzy models developed. For a detailed discussion compare (Bandemer & Näther, 1992; Berthold & Hand, 2003).

MAIN THRUST

The main thrust of this chapter is to provide the quantitative description of fuzzy data, as well as generalized methods for the statistical analysis of fuzzy data.

Non-Precise Data

The result of one measurement of a continuous quantity is not a precise real number but more or less non-precise. For details see (Viertl, 2002). This kind of uncertainty can be best described by a so-called fuzzy number. A fuzzy number x* is defined by a so-called characterizing function ξ: IR → [0,1] which obeys the following:

(1) ∃ x0 ∈ IR : ξ(x0) = 1
(2) ∀ α ∈ (0,1] the so-called α-cut Cα[ξ(·)], defined by Cα[ξ(·)] := {x ∈ IR : ξ(x) ≥ α} = [aα, bα], is a finite closed interval.

Examples of non-precise data are results on analogue measurement equipment as well as readings on digital instruments.

The main thrust of this chapter is to provide the quanti-


tative description of fuzzy data, as well as generalized Remark: A vector of fuzzy numbers is essentially
methods for the statistical analysis of fuzzy data. different from a fuzzy vector. But it is possible to
construct a fuzzy vector x from a vector 1 (),L , k () of
Non-Precise Data fuzzy numbers. The vector-characterizing function
(,L ,) of x can be obtained by

The result of one measurement of a continuous quantity
is not a precise real number but more or less non-


(x1 , L , x k ) = min{1 (x1 ),L , k (x k )} (x1 ,L , x k ) IR k The generalized integral of a fuzzy valued function
f () defined on M is a fuzzy number I denoted by

Examples of 2-dimensional fuzzy data are light


points on radar screens. I = f (x )dx
,
M
Descriptive Statistics with Fuzzy Data
which is defined via its cuts
Analysis of variable data by forming histograms has to
take care of fuzziness. This is possible based on the

characterizing functions i () of the observations xi for C [I ] = f (x )dx, f (x )dx (0,1]
i = 1(1)n . The height h j over a class K j , j = 1(1)k of the
M M
histogram is a fuzzy number whose characterizing func-
tion j () is obtained in the following way: in case of integrable level curves f () and f () .
For each the cut Fuzzy probability densities on measurable spaces
level
(M , A ) are special fuzzy valued functions f () defined
[ ] [ ]
C j () = h n , (K j ), h n, (K j ) of j () is defined by
on M with integrable level curves, for which

# {xi: C [ i ()] K j }
f (x )dx = 1

+
,
h n, (K j ) =
M

n
where 1+ is a fuzzy number fulfilling
# {xi: C [ i ()] K j }
h n, (K j ) = . 1 C1 [1+ ] and C [1+ ] (0, ) (0,1] .
n

Based on fuzzy probability densities so-called fuzzy


By the representation lemma of fuzzy numbers hereby probability distributions P on A are defined in the
following way:
the characterizing functions j (), j = 1(1)k are determined:
Denoting the set of all classical probability densi-
ties () on M which are bounded by level curves
j (x ) = max I C [ j ( )] (x ) x IR
(0 ,1] f () and f () by

The resulting generalized histogram is also called


fuzzy histogram. For more details compare (Viertl & {
S = () : f (x ) (x ) f (x ) x M }
Hareter, 2004).
the fuzzy associated probability P (A) for all A A is
Fuzzy Probability Distributions the fuzzy number whose cuts
In standard data analysis probability densities are con- [
C [P ( A)] = P (A), P ( A) are ] defined by
sidered as limits of histograms. For fuzzy data limits of
fuzzy histograms are fuzzy valued functions f () whose
values f (x ) are fuzzy numbers. These fuzzy valued P (A) = sup (x )dx
S A
functions are normalized by generalizing classical inte-
gration with the help of so-called level curves f ()
P (A) = inf (x )dx .
and f () of f () , which are defined by the endpoints of S
A

the cuts of f (x ) for all x Def [ f ()]:


By this definition the probability of the extreme
events and M are precise numbers, i.e.
[
C [ f (x )] = f (x ), f (x ) ] for all x Def [ f ()]


Statistical Inference based on Fuzzy Information

Statistical inference procedures can be generalized to the situation of fuzzy data. Moreover, also Bayesian inference procedures can be generalized for fuzzy a-priori distributions and fuzzy sample data.

For a stochastic model X ~ f(·|θ), θ ∈ Θ, with observation space M and sample x₁, …, xₙ, let ϑ(x₁, …, xₙ) be an estimator for the true parameter, i.e. a measurable function from the sample space Mⁿ to Θ. In case of a fuzzy sample x₁*, …, xₙ* the estimator can be adapted using the so-called extension principle from fuzzy set theory in the following way. In order to do this, first the fuzzy sample x₁*, …, xₙ* with characterizing functions ξ₁(·), …, ξₙ(·) has to be combined into a fuzzy element x* of the sample space Mⁿ with vector-characterizing function ξ(·, …, ·) whose values are defined by

\xi(x_1, \ldots, x_n) = \min_{i=1,\ldots,n} \xi_i(x_i) \quad \forall (x_1, \ldots, x_n) \in M^n.

Let ϑ : Mⁿ → Θ be an estimator for the parameter θ of a stochastic model X ~ f(·|θ), θ ∈ Θ. Using the so-called fuzzy combined sample x*, the characterizing function ψ(·) of the generalized fuzzy estimator is given by its values

\psi(\theta) = \begin{cases} \sup\{\xi(x) : \vartheta(x) = \theta\} & \text{if } \vartheta^{-1}(\{\theta\}) \neq \emptyset \\ 0 & \text{if } \vartheta^{-1}(\{\theta\}) = \emptyset. \end{cases}

The generalized estimator is a fuzzy element of the parameter space Θ.

A similar construction is possible to generalize the concept of confidence sets. Let κ(X₁, …, Xₙ) be a confidence function for θ at given confidence level 1-α. For fuzzy data with fuzzy combined sample x* with vector-characterizing function ξ(·, …, ·), a generalized fuzzy confidence region Θ*_{1-α} with membership function φ(·) is defined by

\varphi(\theta) = \begin{cases} \sup\{\xi(x) : \theta \in \kappa(x)\} & \text{if } \exists\, x : \theta \in \kappa(x) \\ 0 & \text{if } \nexists\, x : \theta \in \kappa(x). \end{cases}

For more details compare the book by Viertl (1996).

Remark: For precise data x_i ∈ ℝ the characterizing functions are the one-point indicator functions I_{x_i}(·), and the resulting characterizing functions are the indicator functions of the classical results.

Statistical tests in case of fuzzy data yield fuzzy values t* of test statistics t(x₁, …, xₙ). Therefore simple significance tests based on rejection regions don't apply. A solution to this problem is to use p-values. An alternative approach is so-called fuzzy p-values as described in Filzmoser and Viertl (2004).
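As a small illustration of the extension principle at work (added here; the data are invented and only interval-valued δ-cuts are used), consider propagating fuzzy observations through the sample mean. Because the mean is monotone in every argument, the supremum and infimum required by the extension principle are attained at the endpoints of the δ-cut intervals, so interval arithmetic on the cuts suffices.

```python
# Sketch: fuzzy sample mean via the extension principle.
# Each fuzzy observation x_i* is represented by its delta-cuts, i.e. an
# interval (lo, hi) for every level delta of a fixed grid.  For the
# coordinate-wise monotone sample mean, sup/inf over the delta-cut box are
# attained at the interval endpoints.
def fuzzy_mean(observations, deltas=(0.25, 0.5, 0.75, 1.0)):
    """observations: list of dicts mapping delta -> (lo, hi)."""
    n = len(observations)
    return {d: (sum(o[d][0] for o in observations) / n,    # lower endpoint
                sum(o[d][1] for o in observations) / n)    # upper endpoint
            for d in deltas}

# two triangular-shaped fuzzy observations, given by their delta-cuts
x1 = {0.25: (1.0, 3.0), 0.5: (1.5, 2.5), 0.75: (1.8, 2.2), 1.0: (2.0, 2.0)}
x2 = {0.25: (3.0, 5.0), 0.5: (3.5, 4.5), 0.75: (3.8, 4.2), 1.0: (4.0, 4.0)}

for delta, cut in fuzzy_mean([x1, x2]).items():
    print(delta, cut)        # e.g. 1.0 -> (3.0, 3.0): the core of the estimate
```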


Fuzzy Bayesian Inference

The second kind of fuzzy information appears as fuzzy a-priori information in the context of Bayesian data analysis. Fuzzy a-priori information is expressed by fuzzy probability distributions on the parameter space Θ in connection with a stochastic model X ~ f(·|θ), θ ∈ Θ, with observation space M. Frequently such fuzzy probability distributions are generated by fuzzy probability densities as described in the subsection Fuzzy Probability Distributions above.

Using the concept of level curves π_δ(·) and π̄_δ(·) of fuzzy a-priori densities π*(·) on the parameter space Θ, it is possible to generalize Bayes' theorem to the situation of fuzzy a-priori information and fuzzy data D*. The resulting a-posteriori distribution is a fuzzy probability distribution with fuzzy density π*(·|D*) on the parameter space.

The fuzzy density π*(·|D*) can be used to calculate predictive densities p*(·|D*) by the generalized integration from above in the following way:

p^*(x \mid D^*) = \int_\Theta f(x \mid \theta)\, \pi^*(\theta \mid D^*)\, d\theta \quad \text{for all } x \in M.

This is again a fuzzy probability distribution, now on the observation space M. For more details see Viertl and Hareter (2004).

Fuzzy Data Analysis

In recent times non-statistical approaches to data analysis problems were developed using ideas from fuzzy theory. Especially for linguistic data this approach was successful. Instead of stochastic models to describe uncertainty it is possible to use fuzzy models. Such alternative methods are fuzzy clustering, fuzzy regression analysis, integration of fuzzy valued functions, fuzzy optimization, and fuzzy approximation.


For more details compare Bandemer and Näther (1992).

FUTURE TRENDS

Up to now fuzzy models were mainly used to describe vague data in the form of linguistic data. But precision measurements are also connected with uncertainty which is not only of stochastic nature. Therefore hybrid methods of data analysis, which take care of the fuzziness of data as well as their statistical variation, have to be applied and further developed. Especially, research on how to obtain the characterizing functions of non-precise data is necessary.

CONCLUSION

In this contribution different kinds of uncertainty are discussed. These are variation and fuzziness. Variation is usually described by stochastic models and fuzziness by fuzzy numbers and fuzzy vectors, respectively. Looking at real data it is necessary to distinguish between fuzziness and variability in order to obtain realistic results from data analyses.

REFERENCES

Bandemer, H., & Näther, W. (1992). Fuzzy data analysis. Dordrecht: Kluwer.

Berthold, M., & Hand, D. (Eds.). (2003). Intelligent data analysis (2nd ed.). Berlin: Springer.

Bertoluzza, C., Gil, M., & Ralescu, D. (Eds.). (2002). Statistical modeling, analysis and management of fuzzy data. Heidelberg: Physica-Verlag.

Dubois, D., & Prade, H. (Eds.). (2000). Fundamentals of fuzzy sets. Boston: Kluwer.

Filzmoser, P., & Viertl, R. (2004). Testing hypotheses with fuzzy data: The fuzzy p-value. Metrika, 59, 21-29.

Grzegorzewski, P., Hryniewicz, O., & Gil, M. (Eds.). (2002). Soft methods in probability, statistics and data analysis. Heidelberg: Physica-Verlag.

Menger, K. (1951). Ensembles flous et fonctions aléatoires. Comptes Rendus Acad. Sci., 232, 2001-2003.

Möller, B., & Beer, M. (2004). Fuzzy randomness: Uncertainty in civil engineering and computational mechanics. Berlin: Springer.

Ross, R., Booker, J., & Parkinson, J. (Eds.). (2002). Fuzzy logic and probability applications: Bridging the gap. Philadelphia: American Statistical Association and Society of Industrial and Applied Mathematics.

Viertl, R. (1996). Statistical methods for non-precise data. Boca Raton: CRC Press.

Viertl, R. (2002). On the description and analysis of measurements of continuous quantities. Kybernetika, 38, 353-362.

Viertl, R. (2003). Statistical inference with imprecise data. Encyclopedia of life support systems. Retrieved from UNESCO, Paris, Web site: www.eolss.unesco.org

Viertl, R., & Hareter, D. (2004). Fuzzy information and imprecise probability. Zeitschrift für Angewandte Mathematik und Mechanik, 84(10-11), 1-10.

Wolkenhauer, O. (2001). Data engineering: Fuzzy mathematics in systems theory and data analysis. New York: Wiley.

KEY TERMS

Characterizing Function: Mathematical description of a fuzzy number.

Fuzzy Estimate: Generalized statistical estimation technique for the situation of non-precise data.

Fuzzy Histogram: A generalized histogram based on non-precise data, whose heights are fuzzy numbers.

Fuzzy Information: Information which is not given in the form of precise numbers, precise vectors, or precisely defined terms.

Fuzzy Number: Quantitative mathematical description of fuzzy information concerning a one-dimensional numerical quantity.

Fuzzy Statistics: Statistical analysis methods for the situation of fuzzy information.

Fuzzy Valued Functions: Generalized real valued functions whose values are fuzzy numbers.

Fuzzy Vector: Mathematical description of non-precise vector quantities.

Hybrid Data Analysis Methods: Data analysis techniques using methods from statistics and from fuzzy modeling.

Non-Precise Data: Data which are not precise numbers or not precise vectors.


A General Model for Data Warehouses


Michel Schneider
Blaise Pascal University, France

INTRODUCTION

Basically, the schema of a data warehouse lies on two kinds of elements: facts and dimensions. Facts are used to memorize measures about situations or events. Dimensions are used to analyse these measures, particularly through aggregation operations (counting, summation, average, etc.). To fix the ideas, let us consider the analysis of the sales in a shop according to the product type and to the month in the year. Each sale of a product is a fact. One can characterize it by a quantity. One can calculate an aggregation function on the quantities of several facts. For example, one can make the sum of quantities sold for the product type "mineral water" during January in 2001, 2002 and 2003. Product type is a criterion of the dimension Product. Month and Year are criteria of the dimension Time. A quantity is so connected both with a type of product and with a month of one year. This type of connection concerns the organization of facts with regard to dimensions. On the other hand, a month is connected to one year. This type of connection concerns the organization of criteria within a dimension. The possibilities of fact analysis depend on these two forms of connection and on the schema of the warehouse. This schema is chosen by the designer in accordance with the users' needs.

Determining the schema of a data warehouse cannot be achieved without adequate modelling of dimensions and facts. In this article we present a general model for dimensions and facts and their relationships. This model will greatly facilitate the choice of the schema and its manipulation by the users.

BACKGROUND

Concerning the modelling of dimensions, the objective is to find an organization which corresponds to the analysis operations and which provides strict control over the aggregation operations. In particular it is important to avoid double-counting or summation of non-additive data. Many studies have been devoted to this problem. Most recommend organizing the criteria (also called members) of a given dimension into hierarchies with which the aggregation paths can be explicitly defined. In (Pourabbas, 1999), hierarchies are defined by means of a containment function. In (Lehner, 1998), the organization of a dimension results from the functional dependences which exist between its members, and a multi-dimensional normal form is defined. In (Hüsemann, 2000), the functional dependences are also used to design the dimensions and to relate facts to dimensions. In (Abello, 2001), relationships between levels in a hierarchy are apprehended through the Part-Whole semantics. In (Tsois, 2001), dimensions are organized around the notion of a dimension path, which is a set of drilling relationships. The model is centered on a parent-child (one to many) relationship type. A drilling relationship describes how the members of a children level can be grouped into sets that correspond to members of the parent level. In (Vassiliadis, 2000), a dimension is viewed as a lattice and two functions anc and desc are used to perform the roll up and the drill down operations. Pedersen (1999) proposes an extended multidimensional data model which is also based on a lattice structure, and which provides non-strict hierarchies (i.e. too many relationships between the different levels in a dimension).

Modelling of facts and their relationships has not received so much attention. Facts are generally considered in a simple fashion which consists in relating a fact with the roots of the dimensions. However, there is a need for considering more sophisticated structures where the same set of dimensions is connected to different fact types and where several fact types are inter-connected. The model described in (Pedersen, 1999) permits some possibilities in this direction but is not able to represent all the situations.

Apart from these studies it is important to note various propositions (Agrawal, 1997; Datta, 1999; Gyssens, 1997; Nguyen, 2000) for cubic models where the primary objective is the definition of an algebra for multidimensional analysis. Other works must also be mentioned. In (Golfarelli, 1998), a solution is proposed to derive multidimensional structures from E/R schemas. In (Hurtado, 2001), conditions are established for reasoning about summarizability in multidimensional structures.

MAIN THRUST

Our objective in this article is to propose a generic model based on our personal research work and which


integrates existing models. This model can be used to apprehend the sharing of dimensions in various ways and to describe different relationships between fact types. Using this model, we will also define the notion of well-formed warehouse structures. Such structures have desirable properties for applications. We suggest a graph representation for such structures which can help the users in designing and requesting a data warehouse.

Modelling Facts

A fact is used to record measures or states concerning an event or a situation. Measures and states can be analysed through different criteria organized in dimensions.

A fact type is a structure

fact_name[(fact_key), (list_of_reference_attributes), (list_of_fact_attributes)]

where

• fact_name is the name of the type;
• fact_key is a list of attribute names; the concatenation of these attributes identifies each instance of the type;
• list_of_reference_attributes is a list of attribute names; each attribute references a member in a dimension or another fact instance;
• list_of_fact_attributes is a list of attribute names; each attribute is a measure for the fact.

The set of referenced dimensions comprises the dimensions which are directly referenced through the list_of_reference_attributes, but also the dimensions which are indirectly referenced through other facts. Each fact attribute can be analysed along each of the referenced dimensions. Analysis is achieved through the computing of aggregate functions on the values of this attribute.

As an example let us consider the following fact type for memorizing the sales in a set of stores.

Sales[(ticket_number, product_key), (time_key, product_key, store_key), (price_per_unit, quantity)]

The key is (ticket_number, product_key). This means that there is an instance of Sales for each different product of a ticket. There are three dimension references: time_key, product_key, store_key. There are two fact attributes: price_per_unit, quantity. The fact attributes can be analysed through aggregate operations by using the three dimensions.

There may be no fact attribute; in this case a fact records the occurrence of an event or a situation. In such cases, analysis consists in counting occurrences satisfying a certain number of conditions.

For the needs of an application, it is possible to introduce different fact types sharing certain dimensions and having references between them.
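Purely as an illustration of the structure just defined (the model itself is language-independent; the class and field names below are choices made for this sketch), the fact type and the Sales example can be written down as follows.

```python
# Illustrative encoding of the fact-type structure defined above.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FactType:
    name: str
    fact_key: List[str]              # concatenation identifies each instance
    reference_attributes: List[str]  # references to dimension members or facts
    fact_attributes: List[str] = field(default_factory=list)  # the measures

sales = FactType(
    name="Sales",
    fact_key=["ticket_number", "product_key"],
    reference_attributes=["time_key", "product_key", "store_key"],
    fact_attributes=["price_per_unit", "quantity"],
)

# one fact instance: a single product line of a ticket, analysable along the
# three referenced dimensions (time, product, store)
fact: Dict[str, object] = {
    "ticket_number": 4711, "product_key": "P12",
    "time_key": "2003-01-15", "store_key": "S3",
    "price_per_unit": 0.45, "quantity": 6,
}
```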
Modelling Dimensions

The different criteria which are needed to conduct analysis along a dimension are introduced through members. A member is a specific attribute (or a group of attributes) taking its values on a well defined domain. For example, the dimension TIME can include members such as DAY, MONTH, YEAR, etc. Analysing a fact attribute A along a member M means that we are interested in computing aggregate functions on the values of A for any grouping defined by the values of M. In the article we will also use the notation Mij for the j-th member of the i-th dimension.

Members of a dimension are generally organized in a hierarchy which is a conceptual representation of the hierarchies of their occurrences. Hierarchy in dimensions is a very useful concept that can be used to impose constraints on member values and to guide the analysis. Hierarchies of occurrences result from various relationships which can exist in the real world: categorization, membership of a subset, mereology. Figure 1 illustrates a typical situation which can occur. Note that a hierarchy is not necessarily a tree.

Figure 1. A typical hierarchy in a dimension (occurrences of cust_key are grouped into towns such as Dallas, Mesquite and Akron and into demographic zones, both of which roll up to regions such as Texas and Ohio)

We will model a dimension according to a hierarchical relationship (HR) which links a child member Mij (i.e. town) to a parent member Mik (i.e. region), and we will use the notation Mij → Mik. For the following we consider only situations where a child occurrence is linked to a unique parent occurrence in a type. However, a child occurrence, as in case (b) or (c), can have several parent occurrences, but each of different types. We will also suppose that HR is reflexive, antisymmetric and transitive. This kind of relationship covers the great majority of real situations. Existence of this HR is very important since it means that the members of a dimension


can be organized into levels and correct aggregation of fact attribute values along levels can be guaranteed.

Each member of a dimension can be an entry for this dimension, i.e. can be referenced from a fact type. This possibility is very important since it means that dimensions can be shared between several fact types in various ways. In particular, it is possible to reference a dimension at different levels of granularity. A dimension root represents a standard entry. For the three dimensions in Figure 1, there is a single root. However, the definition below authorizes several roots.

As in other models (Hüsemann, 2000), we consider property attributes which are used to describe the members. A property attribute is linked to its member through a functional dependence, but does not introduce a new member and a new level of aggregation. For example, the member town in the customer dimension may have property attributes such as population, administrative position, etc. Such attributes can be used in the selection predicates of requests to filter certain groups.

We now define the notion of member type, which incorporates the different features presented above. A member type is a structure:

member_name[(member_key), dimension_name, (list_of_reference_attributes)]

where

• member_name is the name of the type;
• member_key is a list of attribute names; the concatenation of these attributes identifies each instance of the type;
• list_of_reference_attributes is a list of attribute names where each attribute is a reference to the successors of the member instance in the cover graph of the dimension.

Only the member_key is mandatory.

Using this model, the representations of the members of dimension customer in Figure 1 are the following:

cust_root[(cust_key), customer, (town_name, zone_name)]
town[(town_name), customer, (region_name)]
demozone[(zone_name), customer, (region_name)]
region[(region_name), customer, ()]
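To show what the hierarchical relationship buys in practice, the following sketch (hypothetical data; only the town and region names of Figure 1 are reused) aggregates a fact attribute by town and then rolls the totals up to region along the town-to-region level of the customer dimension.

```python
# Rolling up a measure along the customer hierarchy of Figure 1
# (town -> region).  Mappings and quantities are invented for the example.
town_to_region = {"Dallas": "Texas", "Mesquite": "Texas", "Akron": "Ohio"}

sales_quantities = [           # (town of the customer, quantity sold)
    ("Dallas", 10), ("Mesquite", 4), ("Akron", 7), ("Dallas", 3),
]

by_town = {}
for town, qty in sales_quantities:
    by_town[town] = by_town.get(town, 0) + qty

by_region = {}
for town, qty in by_town.items():
    region = town_to_region[town]       # follow the hierarchical relationship
    by_region[region] = by_region.get(region, 0) + qty

print(by_town)      # {'Dallas': 13, 'Mesquite': 4, 'Akron': 7}
print(by_region)    # {'Texas': 17, 'Ohio': 7}
```

Because every town is linked to exactly one region, the regional totals are simply sums of town totals, which is exactly the kind of correct aggregation the hierarchical relationship is meant to guarantee.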
members of a dimension cannot reference facts.
where Figure 2 illustrates the typical structures we want
to model. Case (a) corresponds to the simple case, also
member_name is the name of the type; known as star structure, where there is a unique fact
member_key is a list of attribute names; the con- type F1 and several separate dimensions D1, D2, etc.
catenation of these attributes identifies each in- Cases (b) and (c) correspond to the notion of facts of
stance of the type; fact. Cases (d), (e) and (f) correspond to the sharing of
list_of_reference_attributes is a list of attribute the same dimension. In case (f) there can be two differ-
names where each attribute is a reference to the ent paths starting at F2 and reaching the same member
successors of the member instance in the cover M of the sub-dimension D21. So analysis using these
graph of the dimension.

Only the member_key is mandatory. Figure 2. Typical warehouse structures


Using this model, the representations of the members
F1 F1 F1 F2
of dimension customer in Figure 1 are the following:
D2 D1 F2 D1 D2 F3 D2
cust_root[(cust_key), customer, (town_name,
zone_name)]
D2 D3
town[(town_name), customer,(region_name)] (a) (b) (c)
demozone[(zone_name), customer, (region_name)]
region[(region_name), customer, ()] F2
F1 F2 F1 F2
F1
Well-Formed Structures D1 D1
D1
D21
D21
In this section we explain how the fact types and the (d) (e) (f)
member types can be interconnected in order to model
various warehouse structures.


Figure 3. The DWG of a star-snowflake structure: the fact type sales[(sales_key), (time_key, product_key, cust_key), (quantity, price_per_unit)] references the dimension roots time_root[(time_key), ...], product_root[(product_key), ...] and cust_root[(cust_key), ...]; below them lie the members day[(day_no), ...] and year[(year_no), ...], category[(category_no), ...] and family[(family_name), ...], and town[(town_name), ...], demozone[(zone_name), ...] and region[(region_name), ...]

two paths cannot give the same results when reaching M. To pose this problem we introduce the DWG and the path coherence constraint.

To represent data warehouse structures, we suggest using a graph representation called DWG (data warehouse graph). It consists in representing each type (fact type or member type) by a node containing the main information about this type, and representing each reference by a directed edge.

Suppose that in the DWG graph there are two different paths P1 and P2 starting from the same fact type F and reaching the same member type M. We can analyse instances of F by using P1 or P2. The path coherence constraint is satisfied if we obtain the same results when reaching M.

For example, in the case of Figure 1 this constraint means the following: for a given occurrence of cust_key, whether the town path or the demozone path is used, one always obtains the same occurrence of region.

We are now able to introduce the notion of well-formed structures.

Definition of a Well-Formed Warehouse Structure

A warehouse structure is well-formed when the DWG is acyclic and the path coherence constraint is satisfied for any couple of paths having the same starting node and the same ending node.

A well-formed warehouse structure can thus have several roots. The different paths from the roots can always be divided into two sub-paths: the first one with only fact nodes and the second one with only member nodes. So, roots are fact types.

Since the DWG is acyclic, it is possible to distribute its nodes into levels. Each level represents a level of aggregation. Each time we follow a directed edge, the level increases (by one or more depending on the used path). When using aggregate operations, this action corresponds to a ROLLUP operation (corresponding to the semantics of the HR) and the opposite operation to a DRILLDOWN. Starting from the reference to a dimension D in a fact type F, we can then roll up in the hierarchies of dimensions by following a path of the DWG.
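The structural half of this definition is easy to check mechanically. The sketch below (an illustration, not part of the article; the graph roughly follows the star-snowflake example of Figure 3) tests acyclicity and lists the node pairs connected by more than one path, which are exactly the places where the path coherence constraint still has to be verified on the instance level.

```python
# Sketch: checking the structural part of well-formedness of a DWG.
from itertools import product

dwg = {   # adjacency lists; edges roughly follow the references of Figure 3
    "sales": ["time_root", "product_root", "cust_root"],
    "time_root": ["day"], "day": ["year"], "year": [],
    "product_root": ["category"], "category": ["family"], "family": [],
    "cust_root": ["town", "demozone"],
    "town": ["region"], "demozone": ["region"], "region": [],
}

def paths(graph, start, end, seen=()):
    if start == end:
        return [[end]]
    return [[start] + p for nxt in graph[start] if nxt not in seen
            for p in paths(graph, nxt, end, seen + (start,))]

def is_acyclic(graph):
    # acyclic iff no node can reach itself again through one of its edges
    return all(not paths(graph, nxt, node)
               for node in graph for nxt in graph[node])

print("acyclic:", is_acyclic(dwg))                     # True
for a, b in product(dwg, dwg):
    if a != b and len(paths(dwg, a, b)) > 1:
        print("verify path coherence for", a, "->", b)
# flags sales -> region and cust_root -> region
```

The semantic half, i.e. that both paths deliver the same region for every cust_key, must be checked against the member instances themselves, as in the town/demozone example above.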
Illustrating the Modelling of a Typical Case with Well-Formed Structures

Well-formed structures are able to model correctly and completely the different cases of Figure 2. We illustrate in this section the modelling for the star-snowflake structure. We have a star-snowflake structure when:

• there is a unique root (which corresponds to the unique fact type);
• each reference in the root points towards a separate subgraph in the DWG (this subgraph corresponds to a dimension).

Our model does not differentiate star structures from snowflake structures. The difference will appear with the mapping towards an operational model (a relational model, for example). The DWG of a star-snowflake structure is represented in Figure 3. This representation is well-formed. Such a representation can be very useful to a user for formulating requests.

FUTURE TRENDS

A first open problem is the one concerning the property of summarizability between the levels of the different dimensions. For example, the total of the sales of a


product for 2001 must be equal to the sum of the totals for the sales of this product for all months of 2001. Any model of data warehouse has to respect this property. In our presentation we supposed that the HR function verified this property. In practice, various functions were proposed and used. It would be interesting and useful to begin a general formalization which would regroup all these propositions.

Another open problem concerns the elaboration of a design method for the schema of a data warehouse. A data warehouse is a database and one could think that its design does not differ from that of a database. In fact a data warehouse presents specificities which it is necessary to take into account, notably the data loading and the performance optimization. Data loading can be complex since source schemas can differ from the data warehouse schema. Performance optimization arises particularly when using a relational DBMS for implementing the data warehouse.

CONCLUSION

In this paper we propose a model which can describe various data warehouse structures. It integrates and extends existing models for sharing dimensions and for representing relationships between facts. It allows for different entries in a dimension corresponding to different granularities. A dimension can also have several roots corresponding to different views and uses. Thanks to this model, we have also suggested the concept of the Data Warehouse Graph (DWG) to represent a data warehouse schema. Using the DWG, we define the notion of well-formed warehouse structures, which guarantees desirable properties.

We have illustrated how typical structures such as star-snowflake structures can be advantageously represented with this model. Other useful structures like those depicted in Figure 2 can also be represented.

The DWG gathers the main information from the warehouse and it can be very useful to users for formulating requests. We believe that the DWG can be used as an efficient support for a graphical interface to manipulate multidimensional structures through a graphical language.

The schema of a data warehouse represented with our model can be easily mapped into an operational model. Since our model is object-oriented, a mapping towards an object model is straightforward. But it is also possible to map the schema towards a relational model or an object-relational model. It appears that our model has a natural place between the conceptual schema of the application and an object-relational implementation of the warehouse. It can thus serve as a helping support for the design of relational data warehouses.

REFERENCES

Abello, A., Samos, J., & Saltor, F. (2001). Understanding analysis dimensions in a multidimensional object-oriented model. Int'l Workshop on Design and Management of Data Warehouses, DMDW2000, Interlaken, Switzerland.

Agrawal, R., Gupta, A., & Sarawagi, S. (1997). Modelling multidimensional databases. International Conference on Data Engineering, ICDE'97 (pp. 232-243), Birmingham, UK.

Datta, A., & Thomas, H. (1999). The cube data model: A conceptual model and algebra for on-line analytical processing in data warehouses. Decision Support Systems, 27(3), 289-301.

Golfarelli, M., Maio, D., & Rizzi, S. (1998). Conceptual design of data warehouses from E/R schemes. 32nd Hawaii International Conference on System Sciences, HICSS 1998.

Gyssens, M., & Lakshmanan, V.S. (1997). A foundation for multi-dimensional databases. Int'l Conference on Very Large Databases (pp. 106-115).

Hurtado, C., & Mendelzon, A. (2001). Reasoning about summarizability in heterogeneous multidimensional schemas. International Conference on Database Theory, ICDT'01.

Hüsemann, B., Lechtenbörger, J., & Vossen, G. (2000). Conceptual data warehouse design. Int'l Workshop on Design and Management of Data Warehouses, DMDW2000, Stockholm, Sweden.

Lehner, W., Albrecht, J., & Wedekind, H. (1998). Multidimensional normal forms. 10th Int'l Conference on Scientific and Statistical Data Management, SSDBM'98, Capri, Italy.

Nguyen, T., Tjoa, A.M., & Wanger, S. (2000). Conceptual multidimensional data model based on metacube. Int'l Conference on Advances in Information Systems (pp. 24-33), Izmir, Turkey.

Pedersen, T.B., & Jensen, C.S. (1999). Multidimensional data modelling for complex data. Int'l Conference on Data Engineering, ICDE'99.

Pourabbas, E., & Rafanelli, M. (1999). Characterization of hierarchies and some operators in OLAP environment. ACM Second International Workshop on Data Warehousing and OLAP, DOLAP'99 (pp. 54-59), Kansas City, USA.

Tsois, A., Karayannidis, N., & Sellis, T. (2001). MAC: Conceptual data modeling for OLAP. Int'l Workshop on Design and Management of Data Warehouses, DMDW2000, Interlaken, Switzerland.


Vassiliadis, P., & Skiadopoulos, S. (2000). Modelling and optimisation issues for multidimensional databases. International Conference on Advanced Information Systems Engineering, CAiSE 2000 (pp. 482-497), Stockholm, Sweden.

KEY TERMS

Data Warehouse: A database which is specifically elaborated to allow different analyses of data. Analysis generally consists of making aggregation operations (count, sum, average, etc.). A data warehouse is different from a transactional database since it accumulates data along time and other dimensions. Data of a warehouse are loaded and updated at regular intervals from the transactional databases of the company.

Dimension: Set of members (criteria) allowing to drive the analysis (examples for the Product dimension: product type, manufacturer type). Members are used to drive the aggregation operations.

Drilldown: Opposite operation of Rollup.

Fact: Element recorded in a warehouse (example: each product sold in a shop) and whose characteristics (i.e. measures) are the object of the analysis (example: quantity of a product sold in a shop).

Galaxy Structure: Structure of a warehouse in which two different types of facts share a same dimension.

Hierarchy: The members of a dimension are generally organized along levels into a hierarchy.

Member: Every criterion in a dimension is materialized through a member.

Rollup: Operation consisting of moving up a hierarchy to a more aggregated level.

Star Structure: Structure of a warehouse in which a fact is directly connected to several dimensions and can thus be analysed according to these dimensions. It is the simplest and the most used structure.


Genetic Programming
William H. Hsu
Kansas State University, USA

INTRODUCTION

Genetic programming (GP) is a subarea of evolutionary computation first explored by John Koza (1992) and independently developed by Nichael Lynn Cramer (1985). It is a method for producing computer programs through adaptation according to a user-defined fitness criterion, or objective function.

GP systems and genetic algorithms (GAs) are related but distinct approaches to problem solving by simulated evolution. As in the GA methodology, GP uses a representation related to some computational model, but in GP, fitness is tied to task performance by specific program semantics. Instead of strings or permutations, genetic programs most commonly are represented as variable-sized expression trees in imperative or functional programming languages, as grammars (O'Neill & Ryan, 2001) or as circuits (Koza et al., 1999). GP uses patterns from biological evolution to evolve programs:

• Crossover: Exchange of genetic material such as program subtrees or grammatical rules.
• Selection: The application of the fitness criterion in order to choose which individuals from a population will go on to reproduce.
• Replication: The propagation of individuals from one generation to the next.
• Mutation: The structural modification of individuals.

To work effectively, GP requires an appropriate set of program operators, variables, and constants. Fitness in GP typically is evaluated over fitness cases. In data mining, this usually means training and validation data, but cases also can be generated dynamically using a simulator or be directly sampled from a real-world problem-solving environment. GP uses evaluation over these cases to measure performance over the required task, according to the given fitness criterion.
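As a concrete, deliberately small illustration of these ingredients (this is not Koza's implementation; the function set, the fitness cases, and all parameters are invented for the example), a tree-based GP for a toy symbolic-regression task can be sketched as follows.

```python
# Toy tree-based GP for symbolic regression, illustrating the crossover,
# selection, replication, and mutation steps listed above.
import random

FUNCS = {"+": 2, "*": 2}                             # function set with arities
TERMS = ["x", 1.0, 2.0]                              # terminal set
CASES = [(x, x * x + 1.0) for x in range(-3, 4)]     # fitness cases

def random_tree(depth=3):
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    op = random.choice(list(FUNCS))
    return (op,) + tuple(random_tree(depth - 1) for _ in range(FUNCS[op]))

def evaluate(tree, x):
    if isinstance(tree, tuple):
        vals = [evaluate(arg, x) for arg in tree[1:]]
        return vals[0] + vals[1] if tree[0] == "+" else vals[0] * vals[1]
    return x if tree == "x" else tree

def fitness(tree):                 # total error over the fitness cases (lower is better)
    return sum(abs(evaluate(tree, x) - y) for x, y in CASES)

def subtrees(tree, path=()):
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(a, b):               # swap a random subtree of a for one of b
    target, _ = random.choice(list(subtrees(a)))
    _, donor = random.choice(list(subtrees(b)))
    return replace(a, target, donor)

def mutate(tree):                  # structural modification of an individual
    target, _ = random.choice(list(subtrees(tree)))
    return replace(tree, target, random_tree(depth=2))

def tournament(pop, k=3):          # fitness-based selection
    return min(random.sample(pop, k), key=fitness)

def next_generation(pop, p_mut=0.2):
    new = []
    for _ in range(len(pop)):      # replication into the next generation
        child = crossover(tournament(pop), tournament(pop))
        if random.random() < p_mut:
            child = mutate(child)
        new.append(child)
    return new

population = [random_tree() for _ in range(60)]
for _ in range(20):
    population = next_generation(population)
best = min(population, key=fitness)
print(best, fitness(best))
```

Even a toy run of this kind exhibits a behaviour discussed later in the article: the solution trees tend to grow across generations (code bloat) unless parsimony pressure or size limits are added.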
This article begins with a survey of the design of GP systems and their applications to data-mining problems, such as pattern classification, optimization of representations for inputs and hypotheses in machine learning, grammar-based information extraction, and problem transformation by reinforcement learning. It concludes with a discussion of current issues in GP systems (i.e., scalability, human-comprehensibility, code growth and reuse, and incremental learning).

BACKGROUND

Although Cramer (1985) first described the use of crossover, selection, mutation, and tree representations for using genetic algorithms to generate programs, Koza (1992) is indisputably the field's most prolific and influential author (Wikipedia, 2004). In four books, Koza et al. (1992, 1994, 1999, 2003) have described GP-based solutions to numerous toy problems and several important real-world problems.

State of the Field: To date, GPs have been applied successfully to a few significant problems in machine learning and data mining, most notably symbolic regression and feature construction. The method is very computationally intensive, however, and it is still an open question in current research whether simpler methods can be used instead. These include supervised inductive learning, deterministic optimization, randomized approximation using non-evolutionary algorithms (i.e., Markov chain Monte Carlo approaches), genetic algorithms, and evolutionary algorithms. It is postulated by GP researchers that the adaptability of GPs to structural, functional, and structure-generating solutions of unknown forms makes them more amenable to solving complex problems. Specifically, Koza et al. (1999, 2003) demonstrate that, in many domains, GP is capable of human-competitive automated discovery of concepts deemed to be innovative through technical review such as patent evaluation.

MAIN THRUST

The general strengths of genetic programming lie in its ability to produce solutions of variable functional form, reuse partial solutions, solve multi-criterion optimization problems, and explore a large search space of solutions in parallel. Modern GP systems also are able to produce structured, object-oriented, and functional programming solutions involving recursion or iteration, subtyping, and higher-order functions.

A more specific advantage of GPs is their ability to represent procedural, generative solutions to pattern recognition and machine-learning problems. Examples of


this include image compression and reconstruction (Koza, whereas type coercion can implement even better prefer-
1992) and several of the recent applications surveyed in ential bias.
the following.
Grammar-Based GP for Data Mining
GP for Pattern Classification
Not all GP-based approaches use expression tree-based
GP in pattern classification departs from traditional super- representations or functional program interpretation as
vised inductive learning in that it evolves solutions whose the computational model. Wong and Leung (2000) survey
functional form is not determined in advance, and in some data mining using grammars and formal languages. This
cases can be theoretically arbitrary. Koza (1992, 1994) general approach has been shown to be effective for some
developed GPs for several pattern reproduction prob- natural language learning problems, and extension of the
lems, such as the multiplexer and symbolic regression approach to procedural information extraction is a topic of
problems. current research in the GP community.
Since then, there has been continuing work on induc-
tive GP for pattern classification (Kishore et al., 2000), GP Software Packages: Functionality
prediction (Brameier & Banzhaf, 2001), and numerical curve and Research Features
fitting (Nikolaev & Iba, 2001). GP has been used to boost
performance in learning polynomial functions (Nikolaev & A number of GP software packages are available publicly
Iba, 2001). More recent work on tree-based multi-crossover and commercially. General features common to most GP
schemes has produced positive results in GP-based design systems for research and development include a very
of classification functions (Muni et al., 2004). high-period random number generator, such as the
Mersenne Twister for random constant generation and GP
GP for Control of Inductive Bias, operations; a variety of selection, crossover, and muta-
Feature Construction, and Feature tion operations; and trivial parallelism (e.g., through multi-
Extraction threading).
One of the most popular packages for experimentation
GP approaches to inductive learning face the general with GP is Evolutionary Computation in Java, or ECJ (Luke
problem of optimizing inductive bias: the preference for et al., 2004). ECJ implements the previously discussed
groups of hypotheses over others on bases other than features as well as parsimony, strongly-typed GP, migra-
pure consistency with training data or other fitness cases. tion strategies for exchanging individual subpopulations
Krawiec (2002) approaches this problem by using GP to in island mode GP (a type of GP featuring multiple demes
preserve useful components of representation (features) local populations or breeding groups), vector representa-
during an evolutionary run, validating them using the tions, and reconfigurability using parameter files.
classification data and reusing them in subsequent gen-
erations. This technique is related to the wrapper ap- Other Applications: Optimization,
proach to knowledge discovery in databases (KDD), Policy Learning
where validation data is held out and used to select
examples for supervised learning or to construct or select Like other genetic and evolutionary computation method-
variables given as input to the learning system. Because ologies, GP is driven by fitness and suited to optimization
GP is a generative problem-solving approach, feature approaches to machine learning and data mining. Its
construction in GP tends to involve production of new program-based representation makes it good for acquir-
variable definitions rather than merely selecting a subset. ing policies by reinforcement learning.1 Many GP prob-
Evolving dimensionally-correct equations on the ba- lems are error-driven or payoff-driven (Koza, 1992), in-
sis of data is another area where GP has been applied. cluding the ant trail problems and foraging problems now
Keijzer and Babovic (2002) provide a study of how GP explored more heavily by the swarm intelligence and ant
formulates its declarative bias and preferential (search- colony optimization communities. A few problems use
based) bias. In this and related work, it is shown that a specific information-theoretic criteria, such as maximum
proper units of measurement (strong typing) approach entropy or sequence randomization.
can capture declarative bias toward correct equations,


FUTURE TRENDS CONCLUSION


Limitations: Scalability and Solution Genetic programming (GP) is a search methodology that
Comprehensibility provides a flexible and complete mechanism for machine
learning, automated discovery, and cost-driven optimi-
Genetic programming remains a controversial approach zation. It has been shown to work well in many optimiza-
due to its high computational cost, scalability issues, and tion and policy-learning problems, but scaling GP up to
current gaps in fundamental theory for relating its perfor- most real-world data mining domains is a challenge, due
mance to traditional search methods, such as hill climbing. to its high computational complexity. More often, GP is
While GP has achieved results in design, optimization, and used to evolve data transformations by constructing
intelligent control that are as good as and sometimes better features or to control the declarative and preferential
than those produced by human engineers, it is not yet inductive bias of the machine-learning component. Mak-
widely used as a technique due to these limitations in ing GP practical poses several key questions dealing with
theory. An additional controversy, stemming from the how to scale up, make solutions comprehensible to
intelligent systems community, is the role of knowledge in humans and statistically validate them, control the growth
search-driven approaches such as GP. Some proponents of solutions, reuse partial solutions efficiently, and learn
of GP view it as a way to generate innovative solutions with incrementally.
little or no domain knowledge, while critics have expressed Looking ahead to future opportunities and challenges
skepticism over original results due to the lower human in data mining, genetic programming provides one of the
comprehensibility of some results. The crux of this debate more general frameworks for machine learning and adap-
is a tradeoff between innovation and originality vs. com- tive problem solving. In data mining, they are likely to be
prehensibility, robustness, and ease of validation. Suc- most useful where a generative or procedural solution is
cesses in replicating previously patented engineering desired, or where the exact functional form of the solution
designs such as analog circuits using GP (Koza et al., 2003) (whether a mathematical formula, grammar, or circuit) is
have increased its credibility in this regard. not known in advance.

Open Issues: Code Growth, Diversity,


REFERENCES
Reuse, and Incremental Learning
Brameier, M., & Banzhaf, W. (2001). Evolving teams of
Some of the most important open problems in GP deal with predictors with linear genetic programming. Genetic
the proliferation of solution code (called code growth or Programming and Evolvable Machines, 2(4), 381-407.
code bloat), the reuse of previously evolved partial solu-
tions, and incremental learning. Code growth is an increase Burke, E.K., Gustafson, S., & Kendall, G. (2004). Diversity
in solution size across generations and generally refers to in genetic programming: An analysis of measures and
one that is not matched by a proportionate increase in correlation with fitness. IEEE Transactions on Evolu-
fitness. It has been studied extensively in the field of GP tionary Computation, 8(1), 47-62.
by many researchers. Luke (2000) provides a survey of
known and hypothesized causes of code growth, along Cramer, N.L. (1985). A representation for the adaptive
with methods for monitoring and controlling growth. Re- generation of simple sequential programs. Proceedings
cently, Burke, et al. (2004) explored the relationship be- of the International Conference on Genetic Algorithms
tween diversity (variation among solutions) and code and Their Applications.
growth and fitness. Some techniques for controlling code Hsu, W.H., & Gustafson, S.M. (2002). Genetic program-
growth include reuse of partial solutions through such ming and multi-agent layered learning by reinforcements.
mechanisms as automatically defined functions, or ADFs Proceedings of the Genetic and Evolutionary Computa-
(Koza, 1994) and incremental learning; that is, learning in tion Conference (GECCO-2002), New York.
stages. One incremental approach in GP is to specify
criteria for a simplified problem and then transfer the Keijzer, M., & Babovic, V. (2002). Declarative and prefer-
solutions to a new GP population (Hsu & Gustafson, 2002). ential bias in GP-based scientific discovery. Genetic
Programming and Evolvable Machines, 3(1), 41-79.


Kishore, J.K., Patnaik, L.M., Mani, V., & Agrawal, V.K. KEY TERMS
(2000). Application of genetic programming for
multicategory pattern classification. IEEE Transactions Automatically-Defined Function (ADF): Parametric
on Evolutionary Computation, 4(3), 242-258. functions that are learned and assigned names for reuse
Koza, J.R. (1992). Genetic programming: On the program- as subroutines. ADFs are related to the concept of macro-
ming of computers by means of natural selection. Cam- operators or macros in speedup learning.
bridge, MA: MIT Press. Code Growth (Code Bloat): The proliferation of solu-
Koza, J.R. (1994). Genetic programming II: Automatic tion elements (e.g., nodes in a tree-based GP representa-
discovery of reusable programs. Cambridge, MA: MIT tion) that do not contribute toward the objective function.
Press. Crossover: In biology, a process of sexual recombina-
Koza, J.R. et al. (2003). Genetic programming IV: Routine tion, by which two chromosomes are paired up and ex-
human-competitive machine intelligence. San Mateo, change some portion of their genetic sequence. Cross-
CA: Morgan Kaufmann. over in GP is highly stylized and involves structural
exchange, typically using subexpressions (subtrees) or
Koza, J.R., Bennett III, F.H., Andr, D., & Keane, M.A. production rules in a grammar.
(1999). Genetic programming III: Darwinian invention
and problem solving. San Mateo, CA: Morgan Kaufmann. Evolutionary Computation: A solution approach based
on simulation models of natural selection, which begins
Krawiec, K. (2002). Genetic programming-based construc- with a set of potential solutions and then iteratively
tion of features for machine learning and knowledge applies algorithms to generate new candidates and select
discovery tasks. Genetic Programming and Evolvable the fittest from this set. The process leads toward a model
Machines, 3(4), 329-343. that has a high proportion of fit individuals.
Kushchu, I. (2002). Genetic programming and evolution- Generation: The basic unit of progress in genetic and
ary generalization. IEEE Transactions on Evolutionary evolutionary computation, a step in which selection is
Computation, 6(5), 431-442. applied over a population. Usually, crossover and muta-
tion are applied once per generation, in strict order.
Luke, S. (2000). Issues in scaling genetic programming:
Breeding strategies, tree generation, and code bloat [doc- Individual: A single candidate solution in genetic and
toral thesis]. College Park, MD: University of Maryland. evolutionary computation, typically represented using
strings (often of fixed length) and permutations in genetic
Luke, S. et al. (2004). Evolutionary computation in Java algorithms, or using problem-solver representations (i.e.,
v11. Retrieved from http://www.cs.umd.edu/projects/plus/ programs, generative grammars, or circuits) in genetic
ec/ecj/ programming.
Muni, D.P., Pal, N.R., & Das, J. (2004). A novel approach to Island Mode GP: A type of parallel GP, where multiple
design classifiers using genetic programming. IEEE Trans- subpopulations (demes) are maintained and evolve inde-
actions on Evolutionary Computation, 8(2), 183-196. pendently, except during scheduled exchanges of indi-
Nikolaev, N.Y., & Iba, H. (2001a). Regularization approach viduals.
to inductive genetic programming. IEEE Transactions on Mutation: In biology, a permanent, heritable change
Evolutionary Computation, 5(4), 359-375. to the genetic material of an organism. Mutation in GP
Nikolaev, N.Y., & Iba, H. (2001b). Accelerated genetic involves structural modifications to the elements of a
programming of polynomials. Genetic Programming and candidate solution. These include changes, insertion,
Evolvable Machines, 2(3), 231-257. duplication, or deletion of elements (subexpressions,
parameters passed to a function, components of a resis-
ONeill, M., & Ryan, C. (2001). Grammatical evolution. tor-capacitor-inducer circuit, non-terminals on the right-
IEEE Transactions on Evolutionary Computation. hand side of a production rule).
Wikipedia. (2004). Genetic programming. Retrieved from Parsimony: An approach in genetic and evolutionary
http://en.wikipedia.org/wiki/Genetic_programming computation related to minimum description length, which
Wong, M.L., & Leung, K.S. (2000). Data mining using rewards compact representations by imposing a penalty
grammar based genetic programming and applications. for individuals in direct proportion to their size (e.g.,
Norwell, MA: Kluwer. number of nodes in a GP tree). The rationale for parsimony
is that it promotes generalization in supervised inductive


learning and produces solutions with less code, which ENDNOTE


can be more efficient to apply.
Selection: In biology, a mechanism by which the
1
Reinforcement learning describes a class of prob-
fittest individuals survive to reproduce, and the basis of lems in machine learning, in which an agent explores
speciation, according to the Darwinian theory of evolu- an environment, selecting actions based upon per-
tion. Selection in GP involves evaluation of a quantitative ceived state and an internal policy. This policy is
criterion over a finite set of fitness cases, with the com- updated, based upon a (positive or negative) reward
bined evaluation measures being compared in order to that it receives from the environment as a result of
choose individuals. its actions and that it seeks to maximize in the
expected case.


Graph Transformations and Neural Networks


Ingrid Fischer
Friedrich-Alexander University Erlangen-Nrnberg, Germany

INTRODUCTION possible: changes can be done for a neuron, for the


connections weight or new connections, and neurons
As the beginning of the area of artificial neural networks, can be inserted. For unsupervised learning, the results of
the introduction of the artificial neuron of McCulloch and an input are not known.
Pitts is considered. They were inspired by the biological There are many advantages and limitations to neural
neuron. Since then, many new networks or new algorithms network analysis, and to discuss this subject properly,
for neural networks have been invented with the result one must look at each individual type of network. Never-
that this area is not very clearly laid out. In most textbooks theless, there is one specific limitation of neural networks
on (artificial) neural networks (Rojas, 2000; Silipo, 2002), that potential users should be aware of. Neural networks
there is no general definition on what a neural net is, but are more or less, depending on the different types, the
rather an example-based introduction leading from the ultimate black boxes. The final result of the learning
biological model to some artificial successors. Perhaps process is a trained network that provides no equations
the most promising approach to define a neural network or coefficients defining a relationship beyond its own
is to see it as a network of many simple processors (units), internal mathematics.
each possibly having a small amount of local memory. The Graphs are widely-used concepts within computer
units are connected by communication channels (connec- science; in nearly every field, graphs serve as a tool for
tions) that usually carry numeric (as opposed to sym- visualization, summarization of dependencies, explana-
bolic) data called the weight of the connection. The units tion of connections, and so forth. Famous examples are all
operate only on their local data and on the inputs they kinds of different nets and graphs as semantic nets, petri
receive via the connections. It is typical of neural net- nets, flow charts, interaction diagrams, or neural net-
works that they have great potential for parallelism, since works. Invented first 35 years ago, graph transformations
the computations of the components are largely indepen- have been expanding constantly. Wherever graphs are
dent of each other. Neural networks work best if the used, graph transformations also are applied (Blostein &
system modeled by them has a high tolerance to error. Schrr, 1999; Ehrig et al., 1999a; Ehrig et al., 1999b;
Therefore, one would not be advised to use a neural Rozenberg, 1997).
network to balance ones checkbook. However, they work Graph transformations are a very promising method
very well for: for modeling and programming neural networks. The
graph part is automatically given, as the name neural
capturing associations or discovering regularities network already indicates. Having graph transformations
within a set of patterns; as methodology, it is easy to model algorithms on this
any application where the number of variables or graph structure. Structure-preserving and structure-
diversity of the data is very great; changing algorithms can be modeled equally well. This is
any application where the relationships between not the case for the widely used matrices programmed
variables are vaguely understood; or, mostly in C or C++. In these approaches, modeling
any application where the relationships are difficult structure change becomes more difficult.
to describe adequately with conventional ap- This leads directly to a second advantage. Graph
proaches. transformations have proven useful for visualizing the
network and its algorithms. Most modern neural network
Neural networks are not programmed but can be trained simulators have some kind of visualization tool. Graph
in different ways. In supervised learning, examples are transformations offer a basis for this visualization, as the
presented to an initialized net. From the input and the algorithms are already implemented in visual rules. Also,
output of these examples, the neural net learns somehow. in nearly all books, neural networks are visualized as
There are as many learning algorithms as there are types graphs.
of neural nets. Also, learning is motivated physiologi- When having a formal methodology at hand, it is also
cally. When an example is presented to a neural network possible to use it for proving properties of nets and
that it cannot recalculate, several different steps are algorithms. Especially in this area, earlier results for graph


transformation systems can be used. Three possibilities are especially promising: first, it is interesting whether an algorithm is terminating. Though this question is undecidable in the general case, the formal methods of graph rewriting and general rewriting offer some chances to prove termination for neural network algorithms. The same holds for the question whether the result produced by an algorithm is useful, whether the learning of a neural network was successful. Then it helps to prove whether two algorithms are equivalent. Finally, possible parallelism in algorithms can be detected and described, based on results for graph transformation systems.

BACKGROUND

A Short Introduction to Graph Transformations

Despite the different approaches to handling graph transformations, there are some properties that all approaches have in common. When transforming a graph G somehow, it is necessary to specify what part of the graph, what subgraph L, has to be exchanged. For this subgraph, a new graph R must be inserted. When applying such a rule to a graph G, three steps are necessary:

• Choose an occurrence of L in G.
• Delete L from G.
• Insert R into the remainder of G.

In Figure 1, a sample application of a graph transformation rule is shown. The left-hand side L consists of three nodes (1:, 2:, 3:) and three edges. This graph is embedded into a graph G. Numbers in G indicate how the nodes of L are matched. The embedding of edges is straightforward. In the next step L is deleted from G, and R is inserted. If L is simply deleted from G, hanging edges remain. All edges ending/starting at 1:, 2:, 3: are missing one node after deletion. With the help of the numbers 1:, 2:, 3: in the right-hand side R, it is indicated how these hanging edges are attached to R inserted in G/L. The resulting graph is H.

Simple graphs are not enough for modeling real-world applications. Among the different extensions, two are of special interest. First, graphs and graph rules can be labeled. When G is labeled with numbers, L is labeled with variables, and R is labeled with terms over L's variables. This way, calculations can be modeled. Taking our example and extending G with the numbers 1, 2, 3, the left-hand side L with variables x, y, z and the right-hand side with terms like x+y, x-y, and x·y is shown in Figure 1. When L is embedded in G, the variables are set to the numbers of the corresponding nodes. The nodes in H are labeled with the result of the terms in R when the variable settings resulting from the embedding of L in G are used.

Also, application conditions can be added, restricting the application of a rule. For example, the existence of a certain subgraph A in G can be allowed or forbidden. A rule can only be applied if A can be found resp. not found in G. Additionally, label-based application conditions are possible. This rule could be extended by asking for x < y. Only in this case would the rule be applied.
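The three steps and the labelled variant can be imitated in a few lines of ordinary code. The following sketch is a strong simplification (it keeps the matched nodes, so hanging edges stay attached through the 1:, 2:, 3: correspondence automatically, and the host graph and the concrete terms are invented); a real graph-transformation engine such as AGG handles far more general rules.

```python
# Simplified sketch of applying a labelled graph rewrite rule in the spirit
# of the L -> R example of Figure 1: find an occurrence of L in G, then
# replace the labels x, y, z of the matched nodes by terms computed from it.
from itertools import permutations

def find_occurrence(g_edges, l_edges, g_nodes):
    # try every injective assignment of the rule's nodes 1:, 2:, 3: to G
    for image in permutations(g_nodes, 3):
        match = dict(zip(("1:", "2:", "3:"), image))
        if all((match[a], match[b]) in g_edges for a, b in l_edges):
            return match
    return None

# host graph G: labelled nodes and directed edges (invented)
g_labels = {"a": 1, "b": 2, "c": 3, "d": 7}
g_edges = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}

# left-hand side L: nodes 1:, 2:, 3: with edges 1:->2:, 1:->3:, 2:->3:
l_edges = [("1:", "2:"), ("1:", "3:"), ("2:", "3:")]
# right-hand side R: new labels given as terms over the matched values x, y, z
r_terms = {"1:": lambda x, y, z: x + y,
           "2:": lambda x, y, z: x * y,
           "3:": lambda x, y, z: x - y}

match = find_occurrence(g_edges, l_edges, list(g_labels))
if match is not None:
    x, y, z = (g_labels[match[k]] for k in ("1:", "2:", "3:"))
    for k, term in r_terms.items():
        g_labels[match[k]] = term(x, y, z)      # insert R in place of L

print(g_labels)    # {'a': 3, 'b': 2, 'c': -1, 'd': 7}
```

Node d and its edge are untouched: because the matched nodes keep their identity, all edges that would otherwise be left hanging remain attached, which is what the node numbering of the rule achieves in the general construction.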

Figure 1. The application of a graph rewrite rule L → R to a graph G: the nodes 1:, 2:, 3: of L carry the variable labels x, y, z, the corresponding nodes of R carry terms over these variables, the matched nodes of G carry the numbers 1, 2, 3, and the resulting graph H carries the evaluated terms


Combining Neural Networks and Graph Transformations

Various proposals exist as to how graph transformations and neural networks can be combined, having different goals in mind. Several ideas originate from evolutionary computing (Curran & O'Riodan, 2002; De Jong & Pollack, 2001; Siddiqi & Lucas, 1998); others stem from electrical engineering (Wan & Beaufays, 1998). The approach of Fischer (2000) has its roots in graph transformations itself. Despite these sources, only three really different ideas can be found:

• In most of the papers (Curran & O'Riodan, 2002; De Jong & Pollack, 2001; Siddiqi & Lucas, 1998), some basic graph operators like insert-node, delete-node, change-attribute, and so forth are used. This set of operators differs in the approaches. In addition to these operators, some kind of application condition exists, stating which rule has to be applied when and on which nodes or edges. These application conditions can be directed acyclic graphs giving the sequence of the rule applications. It also can be tree-based, where newly created nodes are handed to different paths in the tree. The main application area of these approaches is to grow neural networks from just one node. In this field, other grammar-based approaches also can be found (Cantu-Paz & Kamath, 2002), where matrices are rewritten (Browse, Hussain & Smilie, 1999), taking attributed grammars. In Tsakonas and Dounias (2002), feed-forward neural networks are grown with the help of a grammar in Backus-Naur-Form.

• In Wan and Beaufays (1998), signal flow graphs known from electrical engineering are used to model the information flow through the net. With the help of rewrite rules, the elements can be reversed, so that the signal flow is going in the opposite direction as before. This way, gradient descent-based training methods such as back propagation, a famous training algorithm (Rojas, 2000), can be derived. A disadvantage of this method is that no topology-changing algorithms can be modeled.

• In Fischer (2000), arbitrary neural nets are modeled as graphs and transformed by arbitrary transformation rules. Because this is the most general approach, it will be explained in detail in the following sections.

MAIN THRUST

In the remainder of this article, one special sort of neural network, the so-called probabilistic neural networks, together with training algorithms, are explained in detail.

Probabilistic Neural Networks

The main purpose of a probabilistic neural network is to sort patterns into classes. It always has three layers: an input layer, a hidden layer, and an output layer. The input neurons are connected to each neuron in the hidden layer. The neurons of the hidden layer are connected to one output neuron each. Hidden neurons are modeled with two nodes, as they have an input value and do some calculations on it, resulting in an output value. Neurons and connections are labeled with values resp. weights. In Figure 2, a probabilistic neural network is shown.

Figure 2. A probabilistic neural network seen as a graph. Please note that not all edges are labeled, due to space reasons

First, input is presented to the input neurons. The next step is to calculate the input value of the hidden neurons. The main purpose of the hidden neurons is to represent examples of classes. Each neuron represents one example. This example forms the weights from the input neurons to this special hidden neuron. When an input is presented to the net, each neuron in the hidden layer computes the probability that it is the example it models. Therefore, first, the Euclidean distance between the activation of the input neurons and the weight of the connection to the hidden layer's neuron is computed. The result coming via the connections is summed up within a hidden neuron. This is the distance that the current input has from the example modeled by the neuron. If the exact example is inserted into the net, the result is 0. In Figure 3, this calculation is shown in detail. The given graph rewrite rule can be applied to the net shown in Figure 2. The labels i are variables modeling the input values of the input neurons, w models the weights of the connections.
536

TEAM LinG
Graph Transformations and Neural Networks

Figure 3. Calculating the input activation of a neuron


/
L R
in ||ij - wj||
w1 wn w1 wn
i1 in i1 in

Figure 4. A new neuron is inserted into a net. This A more sophisticated training method is called Dy-
example is taken from Dynamic Decay Adjustment (Silipo, namic Decay Adjustment (Silipo, 2002). The algorithm
2002) also starts with no neuron in the hidden layer. If a pattern
is presented to the net, first, the activation of all existing
hidden neurons is calculated. If there is a neuron whose
1 1 1 activation is equal to or higher than a given threshold +,
d+ this neuron covers the input pattern. If this is not the case,
1 d a new hidden neuron is inserted into the net, as shown in
Figure 4. With the help of this algorithm, fewer hidden
0 neurons are inserted.

i1 i2 i3 i1 i2 i3
FUTURE TRENDS

The area of graph grammars and graph transformation


of the connections. In the right-hand side, R a formula is systems several programming environments, especially
given to calculate the input activation. AGG and Progress (Ehrig, Engels, Kreowski, & Rozenberg,
Then, a Gaussian function is used to calculate the 1999) have been developed. They can be used to obtain
probability that the inserted pattern is the example pattern new implementations of neural network algorithms, espe-
of the neuron. The radius s of the Gaussian function has cially of algorithms that change the nets topology.
to be chosen during the training of the net, or it has to be Pruning means the deletion of neurons and connec-
adapted by the user. The result is the output activation of tions between neurons in already trained neural networks
the neuron. to minimize their size. Several mechanisms exist that mainly
The main purpose of the connection between hidden differ in the evaluation function determining the neurons/
and output layer is to sum up the activations of the hidden connections to be deleted. It is an interesting idea to
layers neurons for one class. These values are multiplied model pruning with graph transformation rules.
with a weight associated to the corresponding connec- A factor graph is a bipartite graph that expresses how
tion. The output neuron with the highest activation rep- a global function of many variables factors into a product
resents the class of the input pattern. of local functions. It subsumes other graphical means of
artificial intelligence as Bayesian networks or Markov
Training Neural Networks random fields. It would be worth checking whether graph
transformation systems are helpful in formulating algo-
rithms for these soft computing approaches. Also, for
For the probabilistic neural networks, different methods
more biological applications, the use of graph transforma-
are used. The classical training method is the following:
tion is interesting.
The complete net is built up by inserting one hidden
neuron for each training pattern. As weights of the con-
nections from the input neuron to the hidden neuron
representing this example, the training pattern itself is CONCLUSION
taken. The connections from the hidden layer to the
output layer are weighted with 1/m if there are m training Modeling neural networks as graphs and algorithms on
patterns for one class. s, the radius of the Gaussian neural networks as graph transformations is an easy-to-
function used, is usually simply set to the average dis- use, straightforward method. It has several advantages.
tance between centers of Gaussians modeling training First, the structure of neural nets is supported. When
patterns. modeling the net topology and topology changing algo-

537

TEAM LinG
Graph Transformations and Neural Networks

rithms, graph transformation systems can present their Rojas, R. (2000). Neural networks. A systematic introduc-
full power. This might be of special interest for educa- tion. New York, NY: Springer Verlag.
tional purposes where it is useful to visualize step-by-
step what algorithms do. Finally, the theoretical back- Rozenberg, G. (Ed.). (1997). Handbook of graph gram-
ground of graph transformation and rewriting systems mars and computing by graph transformations. Singapore:
offers several possibilities for proving termination, equiva- World Scientific.
lence, and the like, of algorithms. Siddiqi, A., & Lucas, S.M. (1998). A comparison of matrix
rewriting versus direct encoding for evolving neural net-
works. Proceedings of the IEEE International Confer-
REFERENCES ence on Evolutionary Computation, Anchorage, Alaska.

Blostein, D., & Schrr, A. (1999). Computing with graphs Silipo, R. (2002). Artificial neural networks. In M. Berthold,
and graph transformation. Software Practice and Experi- & D. Hand (Eds.), Intelligent data analysis (pp. 269-319).
ence, 29(3), 1-21. New York, NY: Springer Verlag.

Browse, R.A., Hussain, T.S., & Smillie, M.B. (1999). Using Tsakonas, A., & Dounias, D. (2002). A scheme for the
attribute grammars for the genetic selection of evolution of feedforward neural Networks using BNF-
backpropagation networks for character recognition. Pro- grammar driven genetic programming. Proceedings of the
ceedings of Applications of Artificial Neural Networks EUNITEEUropean Network on Intelligent Technolo-
in Image Processing IV. San Jose, California. gies for Smart Adaptive Systems, Algarve, Portugal.

Cantu-Paz, E., & Kamath, C. (2002). Evolving neural net- Wan, E., & Beaufays, F. (1998). Diagrammatic methods for
works for the classification of galaxies. Proceedings of deriving and relating temporal neural network algorithms.
the Genetic and Evolutionary Computation Conference. In C. Giles, & M. Gori (Eds.), Adaptive processing of
sequences and data structures (pp. 63-98). Salerno, Italy:
Curran, D., & ORiordan, C. (2002). Applying evolution- International Summer School on Neural Networks.
ary computation to designing neural networks: A study
of the state of the art [technical report]. Galway, Ireland:
Department of Information Technology, National Univer- KEY TERMS
sity of Ireland.
Confluence: A rewrite system is confluent, if no mat-
De Jong, E., & Pollack, J. (2001). Utilizing bias to evolve ter in which order rules are applied, they lead to the same
recurrent neural networks. Proceedings of the Interna- result.
tional Joint Conference on Neural Networks, Washing-
ton, D.C. Graph: A graph consists of vertices and edges. Each
edge is connected to a source node and a target node.
Ehrig, H., Engels, G., Kreowski, H.-J., & Rozenberg, G. Vertices and edges can be labeled with numbers and
(Eds.). (1999a). Handbook on graph grammars and symbols.
computing by graph transformation. Singapore: World
Scientific. Graph Production: Similar to productions in general
Chomsky grammars, a graph production consists of a left-
Ehrig, H., Kreowski, H.-J., Montanari, U., & Rozenberg, G. hand side and a right-hand side. The left-hand side is
(Eds.). (1999b). Handbook on graph grammars and embedded in a host graph. Then, it is removed, and in the
computing by graph transformation. Singapore: World resulting hole, the right-hand side of the graph produc-
Scientific. tion is inserted. To specify how this right-hand side is
Fischer, I. (2000). Describing neural networks with graph attached into this hole, how edges are connected to the
transformations [doctoral thesis]. Nuremberg: Friedrich- new nodes, some additional information is necessary.
Alexander University Erlangen-Nuremberg. Different approaches exist how to handle this problem.

Klop, J.W., De Vrijer, R.C., & Bezem, M. (2003). Term Graph Rewriting: The application of a graph produc-
rewriting systems. Cambridge, MA: Cambridge Univer- tion to a graph is also called graph rewriting.
sity Press. Neural Networks: Learning systems, designed by
Nipkow, T., & Baader, F. (1999). Term rewriting and all analogy with a simplified model of the neural connections
that. Cambridge, MA: Cambridge University Press. in the brain, which can be trained to find nonlinear rela-
tionships in data. Several neurons are connected to form
the neural networks.

538

TEAM LinG
Graph Transformations and Neural Networks

Neuron: The smallest processing unit in a neural follows the configuration y with the help of a rule appli-
network. cation. /
Probabilistic Neural Network: One of the many dif- Termination: A rewrite system terminates if it has no
ferent kinds of neural networks with the application area infinite chain.
to classify input data into different classes.
Weight: Connections between neurons of neural net-
Rewrite System: Consists of a set of configurations works have a weight. This weight can be changed during
and a relation x y denoting that the configuration x the training of the net.

539

TEAM LinG
540

Graph-Based Data Mining


Lawrence B. Holder
University of Texas at Arlington, USA

Diane J. Cook
University of Texas at Arlington, USA

INTRODUCTION BACKGROUND

Graph-based data mining represents a collection of Graph-based data mining (GDM) is the task of finding
techniques for mining the relational aspects of data novel, useful, and understandable graph-theoretic pat-
represented as a graph. Two major approaches to graph- terns in a graph representation of data. Several ap-
based data mining are frequent subgraph mining and proaches to GDM exist based on the task of identifying
graph-based relational learning. This article will fo- frequently occurring subgraphs in graph transactions,
cus on one particular approach embodied in the Subdue that is, those subgraphs meeting a minimum level of
system, along with recent advances in graph-based su- support. Washio & Motoda (2003) provide an excellent
pervised learning, graph-based hierarchical conceptual survey of these approaches. We here describe four
clustering, and graph-grammar induction. representative GDM methods.
Most approaches to data mining look for associa- Kuramochi and Karypis (2001) developed the FSG
tions among an entitys attributes, but relationships system for finding all frequent subgraphs in large graph
between entities represent a rich source of information, databases. FSG starts by finding all frequent single and
and ultimately knowledge. The field of multi-relational double edge subgraphs. Then, in each iteration, it gener-
data mining, of which graph-based data mining is a part, ates candidate subgraphs by expanding the subgraphs
is a new area investigating approaches to mining this found in the previous iteration by one edge. In each
relational information by finding associations involving iteration the algorithm checks how many times the
multiple tables in a relational database. Two main ap- candidate subgraph occurs within an entire graph. The
proaches have been developed for mining relational candidates, whose frequency is below a user-defined
information: logic-based approaches and graph-based level, are pruned. The algorithm returns all subgraphs
approaches. occurring more frequently than the given level.
Logic-based approaches fall under the area of induc- Yan and Han (2002) introduced gSpan, which com-
tive logic programming (ILP). ILP embodies a number bines depth-first search and lexicographic ordering to
of techniques for inducing a logical theory to describe find frequent subgraphs. Their algorithm starts from all
the data, and many techniques have been adapted to frequent one-edge graphs. The labels on these edges
multi-relational data mining (Dzeroski & Lavrac, 2001; together with labels on incident vertices define a code
Dzeroski, 2003). Graph-based approaches differ from for every such graph. Expansion of these one-edge
logic-based approaches to relational mining in several graphs maps them to longer codes. Since every graph
ways, the most obvious of which is the underlying rep- can map to many codes, all but the smallest code are
resentation. Furthermore, logic-based approaches rely pruned. Code ordering and pruning reduces the cost of
on the prior identification of the predicate or predicates matching frequent subgraphs in gSpan. Yan & Han (2003)
to be mined, while graph-based approaches are more describe a refinement to gSpan, called CloseGraph,
data-driven, identifying any portion of the graph that has which identifies only subgraphs satisfying the minimum
high support. However, logic-based approaches allow support, such that no supergraph exists with the same
the expression of more complicated patterns involving, level of support.
for example, recursion, variables, and constraints among Inokuchi et al. (2003) developed the Apriori-based
variables. These representational limitations of graphs Graph Mining (AGM) system, which searches the space
can be overcome, but at a computational cost. of frequent subgraphs in a bottom-up fashion, beginning

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Graph-Based Data Mining

with a single vertex, and then continually expanding by a times referred to as value) as calculated using the MDL
single vertex and one or more edges. AGM also employs principle. The queues length is bounded by a user- /
a canonical coding of graphs in order to support fast defined constant.
subgraph matching. AGM returns association rules sat- The search terminates upon reaching a user-speci-
isfying user-specified levels of support and confidence. fied limit on the number of substructures extended, or
The last approach to GDM, and the one discussed in upon exhaustion of the search space. Once the search
the remainder of this chapter, is embodied in the Subdue terminates and Subdue returns the list of best substruc-
system (Cook & Holder, 2000). Unlike the above sys- tures found, the graph can be compressed using the best
tems, Subdue seeks a subgraph pattern that not only substructure. The compression procedure replaces all
occurs frequently in the input graph, but also signifi- instances of the substructure in the input graph by single
cantly compresses the input graph when each instance of vertices, which represent the substructures instances.
the pattern is replaced by a single vertex. Subdue per- Incoming and outgoing edges to and from the replaced
forms a greedy search through the space of subgraphs, instances will point to, or originate from the new vertex
beginning with a single vertex and expanding by one that represents the instance. The Subdue algorithm can
edge. Subdue returns the pattern that maximally com- be invoked again on this compressed graph.
presses the input graph. Holder & Cook (2003) describe Figure 1 illustrates the GDM process on a simple
current and future directions in this graph-based rela- example. Subdue discovers substructure S1, which is
tional learning variant of GDM. used to compress the data. Subdue can then run for a
second iteration on the compressed graph, discovering
substructure S2. Because instances of a substructure can
MAIN THRUST appear in slightly different forms throughout the data, an
inexact graph match, based on graph edit distance, is
As a representative of GDM methods, this section will used to identify substructure instances.
focus on the Subdue graph-based data mining system. Most GDM methods follow a similar process. Varia-
The input data is a directed graph with labels on vertices tions involve different heuristics (e.g., frequency vs.
and edges. Subdue searches for a substructure that best MDL) and different search operators (e.g., merge vs.
compresses the input graph. A substructure consists of extend).
a subgraph definition and all its occurrences throughout
the graph. The initial state of the search is the set of Graph-Based Hierarchical Conceptual
substructures consisting of all uniquely labeled verti- Clustering
ces. The only operator of the search is the Extend
Substructure operator. As its name suggests, it extends Given the ability to find a prevalent subgraph pattern in
a substructure in all possible ways by a single edge and a larger graph and then compress the graph with this
a vertex, or by only a single edge if both vertices are pattern, iterating over this process until the graph can no
already in the subgraph. longer be compressed will produce a hierarchical, con-
Subdues search is guided by the minimum descrip- ceptual clustering of the input data. On the i th iteration,
tion length (MDL) principle, which seeks to minimize the best subgraph Si is used to compress the input graph,
the description length of the entire data set. The evalu- introducing new vertices labeled Si in the graph input to
ation heuristic based on the MDL principle assumes that the next iteration. Therefore, any subsequently-discov-
the best substructure is the one that minimizes the ered subgraph Sj can be defined in terms of one or more
description length of the input graph when compressed of Sis, where i < j. The result is a lattice, where each
by the substructure. The description length of the sub- cluster can be defined in terms of more than one parent
structure S given the input graph G is calculated as
DL(G,S) = DL(S)+DL(G|S), where DL(S) is the descrip- Figure 1. Graph-based data mining: A simple example
tion length of the substructure, and DL(G|S) is the
description length of the input graph compressed by the S2
S1 S1
substructure. Subdue seeks a substructure S that mini-
mizes DL(G,S).
The search progresses by applying the Extend Sub-
structure operator to each substructure in the current
S1 S1 S1
state. The resulting state, however, does not contain all
the substructures generated by the Extend Substructure
operator. The substructures are kept on a queue and are S2 S2
ordered based on their description length (or some-

541

TEAM LinG
Graph-Based Data Mining

subgraph. For example, shows such a clustering done on We can bias the search toward a more characteristic
a DNA molecule. Note that the ordering of pattern dis- description by using the information-theoretic mea-
covery can affect the parents of a pattern. For instance, sure to look for a subgraph that compresses the positive
the lower-left pattern in could have used the C-C-O examples, but not the negative examples. If I(G) repre-
pattern, rather than the C-C pattern, but in fact, the lower- sents the description length (in bits) of the graph G, and
left pattern is discovered before the C-C-O pattern. For I(G|S) represents the description length of graph G
more information on graph-based clustering, see Jonyer compressed by subgraph S, then we can look for an S
et al. (2001). that minimizes I(G +|S) + I(S) + I(G-) I(G-|S), where the
last two terms represent the portion of the negative
Graph-Based Supervised Learning graph incorrectly compressed by the subgraph. This
approach will lead the search toward a larger subgraph
Extending a graph-based data mining approach to per- that characterizes the positive examples, but not the
form supervised learning involves the need to handle negative examples.
negative examples (focusing on the two-class scenario). Finally, this process can be iterated in a set-cover-
In the case of a graph the negative information can come ing approach to learn a disjunctive hypothesis. If using
in three forms. First, the data may be in the form of the error measure, then any positive example contain-
numerous smaller graphs, or graph transactions, each ing the learned subgraph would be removed from subse-
labeled either positive or negative. Second, data may be quent iterations. If using the information-theoretic
composed of two large graphs: one positive and one measure, then instances of the learned subgraph in both
negative. Third, the data may be one large graph in which the positive and negative examples (even multiple in-
the positive and negative labeling occurs throughout. We stances per example) are compressed to a single ver-
will talk about the third scenario in the section on future tex. Note that the compression is a lossy one, that is we
directions. The first scenario is closest to the standard do not keep enough information in the compressed
supervised learning problem in that we have a set of graph to know how the instance was connected to the
clearly defined examples. Let G + represent the set of rest of the graph. This approach is consistent with our
positive graphs, and G- represent the set of negative goal of learning general patterns, rather than mere
graphs. Then, one approach to supervised learning is to compression. For more information on graph-based
find a subgraph that appears often in the positive graphs, supervised learning, see Gonzalez et al. (2002).
but not in the negative graphs. This amounts to replacing
the information-theoretic measure with simply an er- Graph Grammar Induction
ror-based measure. This approach will lead the search
toward a small subgraph that discriminates well. How- As mentioned earlier, two of the advantages of logic-
ever, such a subgraph does not necessarily compress based approach to relational learning are the ability to
well, nor represent a characteristic description of the learn recursive hypotheses and constraints among vari-
target concept. ables. However, there has been much work in the area of
graph grammars, which overcome this limitation. Graph
grammars are similar to string grammars except that
Figure 2. Graph-based hierarchical, conceptual terminals can be arbitrary graphs rather than symbols
clustering of a DNA molecule from an alphabet. While much of the work on graph
grammars involves the analysis of various classes of
DNA graph grammars, recent research has begun to develop
techniques for learning graph grammars (Doshi et al.,
O
CN CC | 2002; Jonyer et al., 2002).
O == P OH
Figure 3b shows an example of a recursive graph
grammar production rule learned from the graph in a.
CC
C
\ \ O A GDM approach can be extended to consider graph
O
NC
\
|
O == P OH grammar productions by analyzing the instances of a
C |
O subgraph to see how they are related to each other. If
|
CH2 two or more instances are connected to each other by
O one or more edges, then a recursive production rule
\
C generating an infinite sequence of such connected sub-
CC
/ \
NC graphs can be constructed. A slight modification to the
O
/ \
C information-theoretic measure taking into account the
extra information needed to describe the recursive

542

TEAM LinG
Graph-Based Data Mining

component of the production is all that is needed to recursive hypotheses in many different domains includ-
allow such a hypothesis to compete along side simple ing learning the building blocks of proteins and commu- /
subgraphs (i.e., terminal productions) for maximizing nication chains in organized crime.
compression.
These graph grammar productions can include non-
terminals on the right-hand side. These productions can FUTURE TRENDS
be disjunctive, as in c, which represents the final produc-
tion learned from a using this approach. The disjunction The field of graph-based relational learning is still
rule is learned by looking for similar, but not identical, young, but the need for practical algorithms is growing
extensions to the instances of a subgraph. A new rule can fast. Therefore, we need to address several challenging
be constructed that captures the disjunctive nature of scalability issues, including incremental learning in
this extension, and included in the pool of production dynamic graphs. Another issue regarding practical ap-
rules competing based on their ability to compress the plications involves the blurring of positive and negative
input graph. With a proper encoding of this disjunction examples in a supervised learning task, that is, the graph
information, the MDL criterion will tradeoff the com- has many positive and negative parts, not easily sepa-
plexity of the rule with the amount of compression it rated, and with varying degrees of class membership.
affords in the input graph. An alternative to defining
these disjunction non-terminals is to instead construct Partitioning and Incremental Mining for
a variable whose range consists of the different disjunc-
tive values of the production. In this way we can intro-
Scalability
duce constraints among variables contained in a sub-
graph by adding a constraint edge to the subgraph. For Scaling GDM approaches to very large graphs, graphs
example, if the four instances of the triangle structure in too big to fit in main memory, is an ever-growing
a each had another edge to a c, d, f and f vertex respec- challenge. Two approaches to address this challenge are
tively, then we could propose a new subgraph, where being investigated. One approach involves partitioning
these two vertices are the graph into smaller graphs that can be processed in a
represented by variables, and an equality constraint distributed fashion (Cook et al., 2001). A second ap-
is introduced between them. If the range of the variable proach involves implementing GDM within a relational
is numeric, then we can also consider inequality con- database management system, taking advantage of user-
straints between variables and other vertices or vari- defined functions and the optimized storage capabilities
ables in the subgraph pattern. of the RDBMS.
Jonyer (2003) has developed a graph grammar learn- A newer issue regarding scalability involves dy-
ing approach with the above capabilities. The approach namic graphs. With the advent of real-time streaming
has shown promise both in handling noise and learning data, many data mining systems must mine incremen-
tally, rather than off-line from scratch. Many of the
domains we wish to mine in graph form are dynamic
Figure 3. Graph grammar learning example with (a) domains. We do not have the time to periodically re-
the input graph, (b) the first grammar rule learned, build graphs of all the data to date and run a GDM system
and (c) the second and third grammar rules learned from scratch. We must develop methods to incremen-
tally update the graph and the patterns currently preva-
(a) a a a a
lent in the graph. One approach is similar to the graph
b c b d b f b f
partitioning approach for distributed processing. New
k
data can be stored in an increasing number of partitions.
x y x y r x y x y Information within partitions can be exchanged, or a
z q z q z q z q repartitioning can be performed if the information loss
exceeds some threshold. GDM can be used to search the
(b) S1 x y S1 x y new partitions, suggesting new subgraph patterns as they
z q z q
evaluate highly in new and old partitions.

Learning with Supervised Graphs


(c) S2 a S2 a

b S3 b S3
In a highly relational domain the positive and negative
examples of a concept are not easily separated. Such a
S3 c d f graph is called a supervised graph, in that the graph as a

543

TEAM LinG
Graph-Based Data Mining

whole contains class information, but perhaps not indi- REFERENCES


vidual components of the graph. This scenario presents
a challenge to any data mining system, but especially to Cook, D., & Holder, L. (2000). Graph-based data mining.
a GDM system, where clearly classified data may be IEEE Intelligent Systems, 15(2), 32-41.
tightly related to less classified data. Two approaches to
this task are being investigated. The first involves modi- Cook, D., Holder, L., Galal, G., & Maglothin, R. (2001).
fying the MDL encoding to take into account the amount Approaches to parallel graph-based knowledge discov-
of information necessary to describe the class member- ery. Journal of Parallel and Distributed Computing,
ship of compressed portions of the graph. The second 61(3), 427-446.
approach involves treating the class membership of a Doshi, S., Huang, F., & Oates, T. (2002). Inferring the
vertex or edge as a cost and weighting the information- structure of graph grammars from data. In Proceedings
theoretic value of the subgraph patterns by the costs of of the International Conference on Knowledge-based
the instances of the pattern. The ability to learn from Computer Systems.
supervised graphs will allow the user more flexibility in
indicating class membership where known, and to varying Deroski, S. (2003). Multi-relational data mining: An intro-
degrees, without having to clearly separate the graph into duction. SIGKDD Explorations, 5(1), 1-16.
disjoint examples.
Deroski, S., & Lavra, N. (2001). Relational data mining.
Berlin: Springer Verlag.
CONCLUSION Gonzalez, J., Holder, L., & Cook, D. (2002). Graph-based
relational concept learning. In Proceedings of the Nine-
Graph-based data mining (GDM) is a fast-growing field teenth International Conference on Machine Learning.
due to the increasing interest in mining the relational
aspects of data. We have described several approaches Holder, L., & Cook, D. (2003). Graph-based relational
to GDM including logic-based approaches in ILP sys- learning: Current and future directions. SIGKDD Ex-
tems, graph-based frequent subgraph mining approaches plorations, 5(1), 90-93.
in AGM, FSG and gSpan, and a graph-based relational Inokuchi, A., Washio, T., & Motoda, H. (2003). Com-
learning approach in Subdue. We described the Subdue plete mining of frequent patterns from graphs: Mining
approach in detail along with recent advances in super- graph data. Machine Learning, 50, 321-254.
vised learning, clustering, and graph-grammar induc-
tion. Jonyer, I. (2003). Context-free graph grammar induction
However, much work remains to be done. Because using the minimum description length principle. Ph.D.
many of the graph-theoretic operations inherent in GDM thesis. Department of Computer Science and Engineering,
are NP-complete or definitely not in P, scalability is a University of Texas at Arlington.
constant challenge. With the increased need for mining
Jonyer, I., Cook, D., & Holder, L. (2001). Graph-based
streaming data, the development of new methods for
hierarchical conceptual clustering. Journal of Machine
incremental learning from dynamic graphs is important.
Learning Research, 2, 19-43.
Also, the blurring of example boundaries in a supervised
learning scenario gives rise to graphs, where the class Jonyer, I., Holder, L., & Cook, D. (2002). Concept
membership of even nearby vertices and edges can vary formation using graph grammars. In Proceedings of the
considerably. We need to develop better methods for KDD Workshop on Multi-Relational Data Mining.
learning in these supervised graphs.
Finally, we discussed several domains throughout Kuramochi, M., & Karypis, G. (2001). Frequent subgraph
this paper that benefit from a graphical representation discovery. In Proceedings of the First IEEE Conference
and the use of GDM to extract novel and useful patterns. on Data Mining.
As more and more domains realize the increased predic- Washio, T., & Motoda, H. (2003). State of the art of graph-
tive power of patterns in relationships between entities, based data mining. SIGKDD Explorations, 5(1), 59-68.
rather than just attributes of entities, graph-based data
mining will become foundational to our ability to better Yan, X., & Han, J. (2002). Graph-based substructure pat-
understand the ever-increasing amount of data in our tern mining. In Proceedings of the International Confer-
world. ence on Data Mining.
Yan, X., & Han, J. (2003). CloseGraph: Mining closed
frequent graph patterns. In Proceedings of the Ninth

544

TEAM LinG
Graph-Based Data Mining

International Conference on Knowledge Discovery and Graph Grammar: Grammar describing the construc-
Data Mining. tion of a set of graphs, where terminals and non-terminals /
represent vertices, edges or entire subgraphs.
Inductive Logic Programming: Techniques for learn-
KEY TERMS ing a first-order logic theory to describe a set of relational
data.
Conceptual Graph: Graph representation described Minimum Description Length (MDL) Principle: Prin-
by a precise semantics based on first-order logic. ciple stating that the best theory describing a set of data
Dynamic Graph: Graph representing a constantly is the one minimizing the description length of the theory
changing stream of data. plus the description length of the data described (or
compressed) by the theory.
Frequent Subgraph Mining: Finding all subgraphs
within a set of graph transactions whose frequency Multi-Relational Data Mining: Mining patterns that
satisfies a user-specified level of minimum support. involve multiple tables in a relational database.

Graph-Based Data Mining: Finding novel, useful, and Supervised Graph: Graph in which each vertex and
understandable graph-theoretic patterns in a graph rep- edge can belong to multiple categories to varying de-
resentation of data. grees. Such a graph complicates the ability to clearly
define transactions on which to perform data mining.

545

TEAM LinG
546

Group Pattern Discovery Systems for Multiple


Data Sources
Shichao Zhang
University of Technology Sydney, Australia

Chengqi Zhang
University of Technology Sydney, Australia

INTRODUCTION This effort provides a good insight into knowledge dis-


covery from multiple data sources.
Multiple data source mining is the process of identifying However, there are two fundamental problems that
potentially useful patterns from different data sources, or prevent local pattern analysis from widespread applica-
datasets (Zhang et al., 2003). Group pattern discovery tion. First, when the data collected from the Internet is of
systems for mining different data sources are based on poor quality, the poor-quality data can disguise useful
local pattern-analysis strategy, mainly including logical patterns. For example, a stock investor might need to
systems for information enhancing, a pattern discovery collect information from outside data sources when mak-
system, and a post-pattern-analysis system. ing an investment decision, such as news. If fraudulent
information collected is applied directly to investment
decisions, the investor might lose money. In particular,
BACKGROUND much work has been built on consistent data. In the input
to the distributed data mining algorithms, it is assumed
Many large organizations have multiple data sources, that the data sources do not conflict with each other.
such as different branches of a multinational company. However, reality is much more inconsistent than the ideal;
Also, as the Web has emerged as a large distributed data the inconsistency must be resolved before any mining
repository, it is easy nowadays to access a multitude of algorithms can be applied. These generate a crucial re-
data sources. Therefore, individuals and organizations quirement: ontology-based data enhancement.
have taken into account the Internets low-cost informa- The second fundamental challenge is the efficiency of
tion and knowledge when making decisions (Lesser et al., mining algorithms for identifying potentially useful pat-
2000). Although the data collected from the Internet terns in MDSs. Over the years, there has been a great deal
(called external data) brings us opportunities in improving of work in multiple source data mining (Aounallah et al.,
the quality of decisions, it generates a significant chal- 2004; Krishnaswamy et al., 2000; Li et al., 2001; Yin et al.,
lenge: efficiently identifying quality knowledge from differ- 2004). However, traditional multiple data source mining
ent data sources (Kargupta et al., 2000; Liu et al., 2001; still utilizes mono-database mining techniques. That is, all
Prodromidis et al., 1998; Zhong et al., 1999). Potentially, the data from relevant data sources is pooled to amass a
almost every company must confront the multiple data huge dataset for discovery. These algorithms cannot
source (MDS) problem (Hurson et al., 1994). This problem discover some useful patterns; for example, the pattern
is difficult to solve, due to the facts that multiple data source that 70% of the branches within a company agrees that a
mining is a procedure of searching for useful patterns in married customer usually has at least two cars if his or her
multidimensional spaces; and putting all data together from age is between 45 and 65. On the other hand, using our
different sources might amass a huge database for central- local pattern analysis, there can be huge amounts of the
ized processing and cause problems, such as data privacy local patterns. These generate a strong requirement: the
breaches, data inconsistency, and data conflict. development of efficient algorithms for identifying useful
Recently, the authors have developed local pattern patterns in MDSs.
analysis, a new multi-database mining strategy for dis- This article introduces a group of pattern discovery
covering some kinds of potentially useful patterns that systems for dealing with the MDS problem, mainly (1) a
cannot be mined in traditional multi-database mining logical system for enhancing data quality, a logical sys-
techniques (Zhang et al., 2003). Local pattern analysis tem for resolving conflicts, a data cleaning system, and a
delivers high-performance pattern discovery from MDSs. database clustering system, for solving the first problem;

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Group Pattern Discovery Systems for Multiple Data Sources

and (2) a pattern discovery system and a post-mining branches. Therefore, these patterns may be far more
system for solving the second problem. important in terms of decision-making within the /
company. The key problem is how to efficiently
search for high-vote patterns of interest in multi-
MAIN THRUST dimensional spaces. It can be attacked by mining
the distribution of all patterns.
Group pattern discovery systems are able to (i) effectively (b) Finding Exceptional Patterns: Like high-vote pat-
enhance data quality for mining MDSs and (ii) automati- terns, exceptional patterns also are regarded as
cally identify potentially useful patterns from the multi- novel patterns in multiple data sources, which re-
dimension data in MDSs. flect the individuality of data sources. While high-
vote patterns are useful when a company is reaching
Data Enhancement common decisions, headquarters also are interested in
viewing exceptional patterns used when special deci-
Data enhancement includes the following: sions are made at only a few of the branches, perhaps
for predicting the sales of a new product. Exceptional
1. The data cleaning system mainly includes these patterns can capture the individuality of branches.
functions: recovering incomplete data (filling the Therefore, although an exceptional pattern receives
values missed or expelling ambiguity); purifying votes from only a few branches, it is extremely valuable
data (consistency of data name, consistency of data information in MDSs. The key problem is how to
format, correcting errors, or removing outliers); and construct efficient methods for measuring the inter-
resolving data conflicts (using domain knowledge estingness of exceptional patterns.
or expert decision to settle discrepancy). (c) Searching for Synthesizing Patterns by Weighting
2. The logical system for enhancing data quality fo- Majority: Although each data source can have an
cuses on the following epistemic properties: equal power to vote for patterns for making deci-
veridicality, introspection, and consistency. sions, data sources may be different in importance to
3. The logical system for resolving conflict has the a company. For example, in a company, if the sale of
property of obeying the weighted majority principle branch A is four times that of branch B, branch A is
in case of conflicts. certainly more important than branch B in the com-
4. The fuzzy database clustering system generates pany. (Here, each branch in a company is viewed as
good database clusters. a data source in an MDS environment.) The decisions
of the company are reasonably partial to high-sale
branches. Also, local patterns may have different
Identifying Interesting Patterns supports. For example, let the supports of patterns X1
and X2 be 0.9 and 0.4 in a branch. Pattern X1 is far more
A local pattern may be a frequent itemset, an association believable than pattern X2. These two examples
rule, causal rule, dependency, or some other expression. present the importance of branches and patterns for
Local pattern analysis is an in-place strategy specifically decision making within a company. Therefore, syn-
designed for mining MDSs, providing a feasible way to thesizing patterns is very useful.
generate globally interesting models from data in multi-
dimensional spaces.
Based on our local pattern analysis, three key systems
Post Pattern Analysis
can be developed for automatically searching for poten-
tially useful patterns from local patterns: (a) identifying In an MDS environment, a pattern (e.g., a high-vote
high-vote patterns; (b) finding exceptional patterns; and association rule) is attached to certain factors, including
(c) synthesizing patterns by weighting majority. name, vote, vsupp, and vconf. For a very large set of data
sources, a high-vote association rule may be supported
(a) Identifying High-Vote Patterns: Within an MDS by a number of data sources. So, the sets of its support and
environment, each data source, large or small, can confidence in these data sources are too large to be
have an equal power to vote for their patterns for the browsed by users, and thus, it is rather difficult to apply
decision-making of a company. Some patterns can the rule to decision making for users. Therefore, post-
receive votes from most of the data sources. These pattern analysis is very important in MDS mining. The key
patterns are referred to as high-vote patterns. High- problem is how to construct effective partition for classi-
vote patterns represent the commonness of the fying the patterns mined.

547

TEAM LinG
Group Pattern Discovery Systems for Multiple Data Sources

FUTURE TRENDS Krishnaswamy, S. et al. (2000). An architecture to sup-


port distributed data mining services in e-commerce
The data that are distributed in multiple data sources environments. Proceedings of the WECWIS 2000.
present many complications, and associated mining tasks Lesser, V. et al. (2000). BIG: An agent for resource-
and mining approaches are diverse. Therefore, many chal- bounded information gathering and decision making.
lenging research issues have emerged. Important tasks Artificial Intelligence, 118(1-2), 197-244.
that present themselves to multiple data source mining
researchers and multiple data source mining system and Li, B. et al. (2001). An architecture for multidatabase
application developers are as follows: systems based on CORBA and XML. Proceedings of the
12th International Workshop on Database and Expert
Establishment of a powerful representation for pat- Systems Applications.
terns in distributed data;
Ontology-based methodologies and technologies Liu, H., Lu, H., & Yao, J. (2001). Towards multi-database
for data preparation/enhancement; mining: Identifying relevant databases. IEEE Transactions
Development of efficient and effective mining strat- on Knowledge and Data Engineering, 13(4), 541-553.
egies and systems; and Prodromidis, A., & Stolfo, S. (1998). Pruning meta-clas-
Construction of interactive and integrated multiple sifiers in a distributed data mining system. Proceedings
data source mining environments. of the First National Conference on New Information
Technologies.

CONCLUSION Wu, X., & Zhang, S. (2003). Synthesizing high-frequency


patterns from different data sources. IEEE Transactions
As stated previously, many companies/organizations have on Knowledge and Data Engineering, 15(2), 353-367.
utilized the Internets low-cost information when making Yin, X., Han, J., Yang, J., & Yu, P. (2004). CrossMine:
decisions. Potentially, almost every company must con- Efficient classification across multiple database rela-
front the MDS problem. Although the data from the Inter- tions. Proceedings of the ICDE 2004.
net assists in improving the quality of decisions, the data
is of poor quality. This leaves a large gap between the Zhang, S., Wu, X., & Zhang, C. (2003). Multi-database
available data and the machinery available to process the mining. IEEE Computational Intelligence Bulletin, 2(1),
data. It generates a crucial need: efficiently identifying 5-13.
quality knowledge from MDSs. To support this need, Zhang, S., Zhang, C., & Wu, X. (2004). Knowledge
group pattern discovery systems have been introduced in discovery in multiple databases. Springer.
the context of real MDS mining systems.
Zhong, N., Yao, Y., & Ohsuga, S. (1999). Peculiarity
oriented multi-database mining. Proceedings of PKDD.
REFERENCES

Aounallah, M. et al. (2004). Distributed data mining vs.


sampling techniques: A comparison. Proceedings of the
KEY TERMS
Canadian Conference on AI 2004.
Database Clustering: The process of grouping simi-
Grossman, R., Bailey, S., Ramu, A., Malhi, B., & Turinsky, lar databases together.
A. (2000). The preliminary design of papyrus: A system for
high performance, distributed data mining over clusters. Exceptional Pattern: A pattern that is strongly sup-
Proceedings of Advances in Distributed and Parallel ported (voted for) by only a few of a group data sources.
Knowledge Discovery. High-Vote Pattern: A pattern that is supported (voted
Hurson, A., Bright, M., & Pakzad, S. (1994). Multidatabase for) by most of a group data sources.
systems: An advanced solution for global information Knowledge Discovery in Databases (KDD): Also re-
sharing. IEEE Computer Society Press. ferred to as data mining; the extraction of hidden predic-
Kargupta, H., Huang, W., Sivakumar, K., Park, B., & Wang, tive information large databases.
S. (2000). Collective principal component analysis from Local Pattern Analysis: An in-place strategy specifi-
distributed, heterogeneous data. Principles of Data Min- cally designed for generating globally interesting mod-
ing and Knowledge Discovery (pp. 452-457). els from local patterns in multi-dimensional spaces.

548

TEAM LinG
Group Pattern Discovery Systems for Multiple Data Sources

Multi-Database Mining: The mining of potentially


useful patterns in multi-databases. /
Multiple Data Source Mining: The process of identi-
fying potentially useful patterns from different data
sources.

549

TEAM LinG
550

Heterogeneous Gene Data for Classifying


Tumors
Benny Yiu-ming Fung
The Hong Kong Polytechnic University, Hong Kong

Vincent To-yee Ng
The Hong Kong Polytechnic University, Hong Kong

INTRODUCTION are photolithographically synthesized oligonucleotide


probe arrays and spotted cDNA probe arrays. These have
When classifying tumors using gene expression data, both been reviewed by Sebastiani, Gussoni, Kohane, and
mining tasks commonly make use of only a single data set. Ramoni (2003). They differ in their criteria for measur-
However, classification models based on patterns ex- ing gene expression levels. Oligonucleotide probe ar-
tracted from a single data set are often not indicative of rays measure mRNA abundance indirectly, while spot-
an entire population and heterogeneous samples subse- ted cDNA probe arrays measure cDNA relative to hy-
quently applied to these models may not fit, leading to bridized reference mRNA samples. With the two com-
performance degradation. In short, it is not possible to mon microarray technologies, there exist different probe
guarantee that mining results based on a single gene array notations (Lee et al., 2003). For example, human
expression data set will be reliable or robust (Miller et probe array notations include GeneChip (Affymetrix,
al., 2002). This problem can be addressed using classi- Santa Clara, CA) U133, U95, and U35 accession num-
fication algorithms capable of handling multiple, het- ber sets, BMR chips (Stanford University), UniGene
erogeneous gene expression data sets. Apart from im- clusters, cDNA clone ID and GenBank identifiers. Al-
proving mining performance, the use of such algorithms though the notations used in different technologies
would make mining results less sensitive to the varia- sometimes referred to the same set of genes, this does
tions of different microarray platforms and to experi- not indicate a simple one-to-one mapping (Ramaswamy,
mental conditions embedded in heterogeneous gene Ross, Lander, & Golub, 2003). Users of these notations
expression data sets. should be aware of the potential for duplicated acces-
sion numbers in mapped results.
Another type of variation, the statistical variation
BACKGROUND among different gene expression data sets is unavoid-
able because experimental and biological variations are
Recent research into the mining of gene expression data embedded in data sets (Miller et al., 2002). First of all,
has operated upon multiple, heterogeneous gene ex- individual gene expression data sets are conducted by
pression data sets. This research has taken two broad different laboratories with different experimental ob-
approaches, addressing issues related either to the theo- jectives and conditions even when using the same
retical flexibility that is required to integrate gene microarray technology. Integration of them is a painful
expression data sets with various microarray platforms task. Secondly, the expression levels of genes in experi-
and technologies (Lee et al., 2003), or the focus of ments are normally measured by the ratio of the expres-
this chapter - issues related to tumor classification sion levels of the genes in the varying conditions of
using an integration of multiple, heterogeneous gene interest to the expression levels of the genes in some
expression data sets (Bloom et al., 2004; Ng, Tan, & reference conditions. These reference conditions are
Sundarajan, 2003). This type of tumor classification is varied from experiment to experiment. This is not a
made more difficult by three types of variation, varia- problem if sample sizes are large enough. Zien, Fluck,
tion in the available microarray technologies, experi- Zimmer, and Lengauer (2003) proposed that the use of
mental and biological variations, and variation in the larger sample sizes (e.g. 20 samples) can prevent mining
types of cancers themselves. results of gene expression data from suffering technical
The first type of variation is caused by different and biological variations, and produce more reliable
probe array notations of available microarray technolo- results. Most gene expression data sets, however, con-
gies. The two most common microarray technologies tain fewer than 20 samples per class. A more flexible

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Heterogeneous Gene Data for Classifying Tumors

solution would be to meta-analyze multiple, heteroge- The second and a better approach is to select a subset
neous gene expression data sets, forming meta-deci- of reference genes, also known as significant genes, and 0
sions from a number of individual decisions.

The last difficulty is to find common features in various cancer types. These features can be referred to as sets of significant genes which are most likely expressed in most cancer types, but which may be expressed differently in varying cancer types. The study of human cancer has recently discovered that the development of antigen-specific cancer vaccines leads to the discovery of immunogenic genes. This group of tumor antigens has been introduced under the term cancer-testis (CT) antigen (Coulie et al., 2002). Discovered CT antigens have recently been grouped into distinct subsets and named cancer/testis (CT) immunogenic gene families. Some works show that most CT immunogenic gene families are expressed in more than one cancer type, but with various expression frequencies. Researchers have reviewed and summarized the current discovery as 44 CT immunogenic gene families consisting of 89 individual genes in total (Scanlan, Simpson, & Old, 2004).

MAIN THRUST

It is possible to make classification algorithms more reliable and robust by combining multiple, heterogeneous gene expression data sets. A simple combination method is to merge or append one data set to another. Unfortunately, this method is inflexible because data sets have various scales and ranges of variation. These are required to be the same in order to have consistent scales for comparison after the combination.

In this chapter, we discuss two approaches to combining data sets that vary in the available microarray technologies. The first, and simplest, approach is to normalize the gene expression levels of genes in the data sets to mean zero and standard deviation one (i.e., the standard normal distribution, N(0, 1)) according to the means and standard deviations across samples in individual data sets. While this approach is simple to apply, it assumes that all genes have the same or similar expression rates. However, this assumption is incorrect. The fact is that only a small subset of genes reflects the existence of tumors, and the remaining genes in a tumor are not epidemiologically significant. It should also be noted that the reflected genes do not all express at the same rate. Therefore, when all genes in data sets are normalized to have N(0, 1), the variations of the reflected genes may be underestimated and the variations of genes which are stable and irrelevant may be overestimated. This situation worsens as the number of genes in data sets increases.

The second approach is to select a subset of significant genes as reference genes and to use the expression levels of these genes to estimate scaling factors which are used to rescale the expression levels of genes in other data sets with the same set of reference genes as in the original subset. This approach has two advantages. The first is that it allows the effects of outliers caused by non-significant genes to be eliminated while using only a subset of significant genes. In a gene expression data set, only a proportion of genes is tumor-specific. Because gene expression data is high-dimensional, focusing on such tumor-specific genes in classification reduces computational costs. The second advantage is that it improves the quality of the normalization or re-scaling, since it avoids the underestimation of the expression levels of significant genes, a problem which may arise because of the presence of large amounts of non-significant genes. We also note that selection algorithms are the focus of much current research. Some works that utilize existing feature selection algorithms include Dudoit, Yang, Callow, and Speed (2002), Bloom et al. (2004), and Lee et al. (2003). New or enhanced algorithms have been proposed by Park et al. (2003), Ng, Tan, and Sundarajan (2003), Choi, Yu, Kim, and Yoo (2003), Storey and Tibshirani (2003), Chilingaryan, Gevorgyan, Vardanyan, Jones, and Szabo (2002), and Golub et al. (1999).

In recent years, detection of significant genes was mainly done using fold-change detection. This detection method is unreliable because it does not take into account statistical variability. Currently, however, most algorithms that are used to select significant genes apply statistical methods. In the rest of the chapter, we first present some recent works on the identification of significant genes using statistical methods. We then briefly describe our proposed measure, Impact Factors (IFs), which can be used to carry out tumor classification using heterogeneous gene expression data (Fung & Ng, 2003).

Statistical Methods

The most common statistical method for identifying significant genes is the two-sample t-test (Cui & Churchill, 2003). The advantage of this test is that, because it requires only one gene to be studied for each t-test, it is insensitive to heterogeneity in variance across genes. However, while reliable t-values require large sample sizes, gene expression data sets normally consist of small sample sizes. This problem of small sample sizes can be overcome using global t-tests, but these assume that the variance is homogeneous between different genes (Tusher, Tibshirani, & Chu, 2001).
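As an illustration of gene selection with the per-gene two-sample t-test discussed above, the following Python sketch ranks genes by comparing tumor and normal samples. The use of SciPy's ttest_ind, the variable names, and the simple p-value cut-off are our assumptions for illustration, not the procedure of any particular study cited here.

```python
import numpy as np
from scipy.stats import ttest_ind

def select_significant_genes(tumor, normal, alpha=0.01):
    """tumor, normal: arrays of shape (n_samples, n_genes).
    Returns the indices of genes whose two-sample t-test p-value is below alpha."""
    significant = []
    for g in range(tumor.shape[1]):
        t_stat, p_value = ttest_ind(tumor[:, g], normal[:, g])  # one test per gene
        if p_value < alpha:
            significant.append(g)
    return np.array(significant)
```

Because each gene is tested independently, such a selection is insensitive to variance heterogeneity across genes, but, as noted above, small sample sizes limit the reliability of the individual t-values.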
Tusher, Tibshirani, and Chu (2001) proposed a modified t-test that they called significance analysis of microarrays (SAM). SAM identifies significant genes in microarray experiments by measuring fluctuations of the expression levels of genes across a number of microarray experiments. These fluctuations are estimated using permutation tests and, to avoid inflated t-values, are expressed as a constant in the denominator of the two-sample t-test. Tibshirani, Hastie, Narasimhan, and Chu (2002) proposed a modified nearest-centroid classification. For all genes, it uses a t-test to calculate a centroid distance, defined as the distance from the class centroids to the overall centroid among classes of the genes. The centroid distance is then used to shrink the class centroids towards the overall centroid in order to reduce overfitting.

Correlation analysis is another common statistical method used to rank the significance of genes. Kuo, Jenssen, Butte, Ohno-Machado, and Kohane (2002) applied the Pearson linear and Spearman rank-order correlation coefficients to study the flexibility of cross-platform utilization of data from multiple gene expression data sets. Lee et al. (2003) also used the Pearson and Spearman correlation coefficients to study the correlation among NCI-60 cancer data sets consisting of different cancer types. They, however, proposed a measure to rank the correlation of correlation among various data sets. Later studies of correlation focused on multi-platform, multi-type tumor data sets. Bloom et al. (2004) used the Kruskal-Wallis H-test to identify significant genes within multi-type, multi-platform tumor data sets consisting of 21 data sets and 15 different cancer types.

Impact Factors

Recently, we proposed a dissimilarity measure called Impact Factors (IFs) that measures the inter-experimental variations between individual classes in training samples and heterogeneous testing samples (Fung & Ng, 2003). The calculation of IFs takes place in two stages of selection: selection for re-scaling and selection for classification. In the first stage, we first use SAM to select a set of significant genes. From these, we calculate individual reference points corresponding to the different classes in the training set. These reference points are then used to calculate their own scaling factors for the corresponding classes. The factors are used to rescale the expression levels of all genes in the testing samples. There are two advantages to using individual scaling factors corresponding to different classes: they ensure that the gene expression levels of one class are not underestimated or overestimated because of unbalanced sample sizes between classes, and they allow individual testing samples to be compared with individual classes.

The second stage of selection is selection for classification. This is done by calculating the differences between the rescaled testing samples (since there are two scaling factors, there are two rescaled samples in binary-class tumor classification) and the individual classes in the training set. It should be noted that, to improve the discriminative power of the IFs, only those genes with higher differences are selected and used in classification.

IFs have been integrated into classifiers to perform meta-classification of heterogeneous cancer gene expression data (Fung & Ng, 2003). For most classifiers using either similarity or dissimilarity measures for making classification decisions, IFs can be integrated into the classifier by multiplying the IFs directly with the original measures. The actual multiplication to be carried out depends on whether the measures used by the classifier for making decisions are dissimilarity or similarity measures. If it is being applied to dissimilarity measures, the IF of a class is multiplied by the measure having the same class as the corresponding IF. In contrast, if it is being applied to similarity measures, the IF of a class is multiplied by the measure having another class as the corresponding IF.

FUTURE TRENDS

Although DNA microarrays can be used to predict patients' responses to medical treatment as well as clinical outcomes, tumor classification using gene expression data is as yet unreliable. This unreliability has multiple causes. First of all, while some international organizations such as the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) have created their own gene expression data repositories, there still exist no international benchmark data repositories. This makes it difficult for researchers to validate findings. Certainly, there is a need for integrated gene expression databases which can help researchers validate their findings against data produced by different laboratories around the world, so that the efficiency and effectiveness of different proposed mining algorithms can be compared objectively.

Unreliability could also be said to arise from inadequate interdisciplinary communication between professionals and researchers in relevant fields. Recently, a number of promising mining algorithms have been proposed, but much work of this kind remains to be analyzed and validated in molecular and biological terms (Sevenet & Cussenot, 2003). The developers of these algorithms would welcome such input, as it would be an invaluable assistance to them in constructing more general frameworks for defining different mining tasks in bioinformatics.
Lastly, in terms of clinical studies, findings that have been made using tumor classification based on gene expression data are still not sufficiently reliable to draw conclusions. This is because the performance of tumor classification may be influenced by many as yet unidentified factors. Such factors may include the effects of tumor heterogeneity and inflammatory cells.

Unreliability also arises from the narrow use of gene expression data. Data mining algorithms mainly focus on a set of genes, which are used as diagnostic markers in clinical diagnosis. It could be more interesting to look for algorithms which use as diagnostic markers not only the expression levels of genes, but also other categories of data, such as histopathology and patients' information.

CONCLUSION

DNA microarray techniques are nowadays important in cancer-related studies. These techniques operate on large, publicly available gene expression data sets, contributed by dispersed laboratories operating under various experimental conditions and using a variety of microarray technologies and platforms. A realistic response to these facts is to mine data using integrated, multiple, heterogeneous gene expression data. The robustness and reliability of mining algorithms developed to handle such data can be validated on a new set of heterogeneous gene expression data in addition to cross-validation within a single data set. Algorithms that pass such validation are less sensitive to variations induced by experimental conditions and to different microarray platforms, and thus the robustness and reliability of the mining results can be guaranteed.

In this chapter, we have discussed the difficulties of three types of variation in tumor classification of multiple, heterogeneous gene expression data: variation in the available microarray technologies, experimental and biological variation, and variation in the types of cancers themselves. We further point out that identification of significant genes is an important process in tumor classification using gene expression data, especially for multiple, heterogeneous data. We have reviewed some recent works on the identification of significant genes. Finally, we have presented the concept of Impact Factors (IFs), which are used to combine and classify heterogeneous gene expression data.

REFERENCES

Bloom, G., Yang, I.V., Boulware, D., Kwong, K.Y., Coppola, D., Eschrich, S., et al. (2004). Multi-platform, multi-site, microarray-based human tumor classification. American Journal of Pathology, 164(1), 9-16.

Chilingaryan, A., Gevorgyan, N., Vardanyan, A., Jones, D., & Szabo, A. (2002). Multivariate approach for selecting sets of differentially expressed genes. Mathematical Biosciences, 176(1), 59-69.

Choi, J.K., Yu, U., Kim, S., & Yoo, O.J. (2003). Combining multiple microarray studies and modeling interstudy variation. Bioinformatics, 19(Suppl. 1), i84-i90.

Coulie, P.G., Karanikas, V., Lurquin, C., Colau, D., Connerotte, T., Hanagiri, T., et al. (2002). Cytolytic T-cell responses of cancer patients vaccinated with a MAGE antigen. Immunological Reviews, 188(1), 33-42.

Cui, X., & Churchill, G.A. (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biology, 4(4), 210.

Dudoit, S., Yang, Y.H., Callow, M.J., & Speed, T.P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12, 111-139.

Fung, B.Y.M., & Ng, V.T.Y. (2003). Classification of heterogeneous gene expression data. SIGKDD Special Issue on Microarray Data Mining, 5(2), 69-78.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene-expression monitoring. Science, 286, 531-537.

Kuo, W.P., Jenssen, T.K., Butte, A.J., Ohno-Machado, L., & Kohane, I.S. (2002). Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics, 18(3), 405-412.

Lee, J.K., Bussey, K.J., Gwadry, F.G., Reinhold, W., Riddick, G., Pelletier, S.L., et al. (2003). Comparing cDNA and oligonucleotide array data: Concordance of gene expression across platforms for the NCI-60 cancer cells. Genome Biology, 4(12), R82.

Miller, L.D., Long, P.M., Wong, L., Mukherjee, S., McShane, L.M., & Liu, E.T. (2002). Optimal gene expression analysis by microarrays. Cancer Cell, 2(5), 353-361.
Ng, S.K., Tan, S.H., & Sundarajan, V.S. (2003). On combining multiple microarray studies for improved functional classification by whole-dataset feature selection. Genome Informatics, 14, 44-53.

Park, T., Yi, S.G., Lee, S., Lee, S.Y., Yoo, D.H., Ahn, J.I., et al. (2003). Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics, 19(6), 694-703.

Ramaswamy, S., Ross, K.N., Lander, E.S., & Golub, T.R. (2003). Evidence for a molecular signature of metastasis in primary solid tumors. Nature Genetics, 33, 49-54.

Scanlan, M.J., Simpson, A.J.G., & Old, L.J. (2004). The cancer/testis genes: Review, standardization, and commentary. Cancer Immunity, 4, 1.

Sebastiani, P., Gussoni, E., Kohane, I.S., & Ramoni, M.F. (2003). Statistical challenges in functional genomics. Statistical Science, 18(1), 33-70.

Sevenet, N., & Cussenot, O. (2003). DNA microarrays in clinical practice: Past, present, and future. Clinical and Experimental Medicine, 3(1), 1-3.

Storey, J.D., & Tibshirani, R.J. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440-9445.

Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America, 99(10), 6567-6572.

Tusher, V.G., Tibshirani, R., & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America, 98(9), 5116-5121.

Zien, A., Fluck, J., Zimmer, R., & Lengauer, T. (2003). Microarrays: How many do you need? Journal of Computational Biology, 10(3-4), 653-667.

KEY TERMS

Bioinformatics: An integration of mathematical, statistical, and computational methods to analyze and handle biological, biomedical, biochemical, and biophysical information.

Cancer-Testis (CT) Antigen: An antigen that is immunogenic in cancer patients, exhibits highly tissue-restricted expression, and is considered a promising target molecule for cancer vaccines.

Classification: The process of distributing things into classes or categories of the same type by a learnt mapping function.

Gene Expression: Describes how the information of transcription and translation encoded in a segment of DNA is converted into proteins in a cell.

Microarrays: The technology for biological exploration which allows the amount of mRNA in up to tens of thousands of genes to be measured simultaneously in a single experiment.

Normalization: In terms of gene expression data, a pre-processing step to minimize systematic bias and remove the impact of non-biological influences before data analysis is performed.

Probe Arrays: Lists of labeled, single-stranded DNA or RNA molecules with specific nucleotide sequences, which are used to detect complementary base sequences by hybridization.
Hierarchical Document Clustering


Benjamin C. M. Fung
Simon Fraser University, Canada

Ke Wang
Simon Fraser University, Canada

Martin Ester
Simon Fraser University, Canada

INTRODUCTION

Document clustering is the automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters. Unlike document classification (Wang, Zhou, & He, 2001), no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Hierarchical document clustering organizes clusters into a tree or a hierarchy that facilitates browsing. The parent-child relationship among the nodes in the tree can be viewed as a topic-subtopic relationship in a subject hierarchy such as the Yahoo! directory.

This chapter discusses several special challenges in hierarchical document clustering: high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. State-of-the-art document clustering algorithms are reviewed: the partitioning method (Steinbach, Karypis, & Kumar, 2000), agglomerative and divisive hierarchical clustering (Kaufman & Rousseeuw, 1990), and frequent itemset-based hierarchical clustering (Fung, Wang, & Ester, 2003). The last one, which was recently developed by the authors, is further elaborated since it has been specially designed to address the hierarchical document clustering problem.

BACKGROUND

Document clustering is widely applicable in areas such as search engines, web mining, information retrieval, and topological analysis. Most document clustering methods perform several preprocessing steps, including stop words removal and stemming, on the document set. Each document is represented by a vector of frequencies of the remaining terms within the document. Some document clustering algorithms employ an extra preprocessing step that divides the actual term frequency by the overall frequency of the term in the entire document set. The idea is that if a term is too common across different documents, it has little discriminating power (Rijsbergen, 1979). Although many clustering algorithms have been proposed in the literature, most of them do not satisfy the special requirements for clustering documents:

• High Dimensionality: The number of relevant terms in a document set is typically in the order of thousands, if not tens of thousands. Each of these terms constitutes a dimension in a document vector. Natural clusters usually do not exist in the full dimensional space, but in the subspace formed by a set of correlated dimensions. Locating clusters in subspaces can be challenging.
• Scalability: Real world data sets may contain hundreds of thousands of documents. Many clustering algorithms work fine on small data sets, but fail to handle large data sets efficiently.
• Accuracy: A good clustering solution should have high intra-cluster similarity and low inter-cluster similarity, i.e., documents within the same cluster should be similar but dissimilar to documents in other clusters. An external evaluation method, the F-measure (Rijsbergen, 1979), is commonly used for examining the accuracy of a clustering algorithm.
• Easy to Browse with Meaningful Cluster Description: The resulting topic hierarchy should provide a sensible structure, together with meaningful cluster descriptions, to support interactive browsing.
• Prior Domain Knowledge: Many clustering algorithms require the user to specify some input parameters, e.g., the number of clusters. However, the user often does not have such prior domain knowledge. Clustering accuracy may degrade drastically if an algorithm is too sensitive to these input parameters.
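The preprocessing described above can be illustrated with a short Python sketch. The tokenizer, the tiny stop word list, the crude suffix-stripping stand-in for stemming, and the weighting of term frequency by overall frequency are simplifying assumptions for illustration, not the exact scheme of any particular system reviewed in this chapter.

```python
from collections import Counter

STOP_WORDS = {"the", "this", "a", "an", "and", "of", "to"}   # tiny illustrative list

def stem(word):
    # crude stand-in for a real stemmer (e.g., Porter): strip a few common suffixes
    for suffix in ("ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_vectors(documents):
    """Return one term-frequency vector (as a dict) per document, with each
    frequency divided by the overall frequency of the term in the whole set."""
    tokenized = [
        [stem(w) for w in doc.lower().split() if w not in STOP_WORDS]
        for doc in documents
    ]
    overall = Counter(term for doc in tokenized for term in doc)
    return [
        {term: count / overall[term] for term, count in Counter(doc).items()}
        for doc in tokenized
    ]
```

Terms that appear in many documents receive a small weight under this scheme, reflecting their low discriminating power.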
DOCUMENT CLUSTERING METHODS

Hierarchical Clustering Methods

One popular approach in document clustering is agglomerative hierarchical clustering (Kaufman & Rousseeuw, 1990). Algorithms in this family build the hierarchy bottom-up by iteratively computing the similarity between all pairs of clusters and then merging the most similar pair. Different variations may employ different similarity measuring schemes (Karypis, 2003; Zhao & Karypis, 2001). Steinbach (2000) shows that the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (Kaufman & Rousseeuw, 1990) is the most accurate one in its category. The hierarchy can also be built top-down, which is known as the divisive approach. It starts with all the data objects in the same cluster and iteratively splits a cluster into smaller clusters until a certain termination condition is fulfilled.

Methods in this category usually suffer from their inability to perform adjustment once a merge or split has been performed. This inflexibility often lowers the clustering accuracy. Furthermore, due to the complexity of computing the similarity between every pair of clusters, UPGMA is not scalable for handling large data sets in document clustering, as experimentally demonstrated in (Fung, Wang, & Ester, 2003).

Partitioning Clustering Methods

K-means and its variants (Cutting, Karger, Pedersen, & Tukey, 1992; Kaufman & Rousseeuw, 1990; Larsen & Aone, 1999) represent the category of partitioning clustering algorithms that create a flat, non-hierarchical clustering consisting of k clusters. The k-means algorithm iteratively refines a randomly chosen set of k initial centroids, minimizing the average distance (i.e., maximizing the similarity) of documents to their closest (most similar) centroid. The bisecting k-means algorithm first selects a cluster to split, and then employs basic k-means to create two sub-clusters, repeating these two steps until the desired number k of clusters is reached. Steinbach (2000) shows that the bisecting k-means algorithm outperforms basic k-means as well as agglomerative hierarchical clustering in terms of accuracy and efficiency (Zhao & Karypis, 2002).

Both the basic and the bisecting k-means algorithms are relatively efficient and scalable, and their complexity is linear in the number of documents. As they are easy to implement, they are widely used in different clustering applications. A major disadvantage of k-means, however, is that an incorrect estimation of the input parameter, the number of clusters, may lead to poor clustering accuracy. Also, the k-means algorithm is not suitable for discovering clusters of largely varying sizes, a common scenario in document clustering. Furthermore, it is sensitive to noise that may have a significant influence on the cluster centroid, which in turn lowers the clustering accuracy. The k-medoids algorithm (Kaufman & Rousseeuw, 1990; Krishnapuram, Joshi, & Yi, 1999) was proposed to address the noise problem, but this algorithm is computationally much more expensive and does not scale well to large document sets.

Frequent Itemset-Based Methods

Wang et al. (1999) introduced a new criterion for clustering transactions using frequent itemsets. The intuition of this criterion is that many frequent items should be shared within a cluster while different clusters should have more or less different frequent items. By treating a document as a transaction and a term as an item, this method can be applied to document clustering; however, the method does not create a hierarchy of clusters.

The Hierarchical Frequent Term-based Clustering (HFTC) method proposed by Beil, Ester, and Xu (2002) attempts to address the special requirements in document clustering using the notion of frequent itemsets. HFTC greedily selects the next frequent itemset, which represents the next cluster, minimizing the overlap of clusters in terms of shared documents. The clustering result depends on the order of selected itemsets, which in turn depends on the greedy heuristic used. Although HFTC is comparable to bisecting k-means in terms of clustering accuracy, experiments show that HFTC is not scalable (Fung, Wang, & Ester, 2003).

A Scalable Algorithm for Hierarchical Document Clustering: FIHC

A scalable document clustering algorithm, Frequent Itemset-based Hierarchical Clustering (FIHC) (Fung, Wang, & Ester, 2003), is discussed in greater detail because this method satisfies all of the requirements of document clustering mentioned above. We use item and term as synonyms below. In classical hierarchical and partitioning methods, the pairwise similarity between documents plays a central role in constructing a cluster; hence, those methods are document-centered. FIHC is cluster-centered in that it measures the cohesiveness of a cluster directly using frequent itemsets: documents in the same cluster are expected to share more common itemsets than those in different clusters.

A frequent itemset is a set of terms that occur together in some minimum fraction of documents.
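As a concrete illustration of the partitioning methods described above, the following sketch implements bisecting k-means on document vectors. It assumes scikit-learn's KMeans is available; splitting the largest cluster at each step is one common heuristic, and the function and parameter names are ours rather than those of any cited system.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Repeatedly split one cluster with basic 2-means until k clusters remain.
    X: (n_documents, n_terms) matrix of document vectors."""
    clusters = [np.arange(X.shape[0])]            # start with all documents together
    while len(clusters) < k:
        # heuristic: split the largest cluster (other selection criteria are possible)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters
```

The number of clusters k must still be supplied by the user, which is exactly the prior-knowledge requirement criticized above.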
To illustrate the usefulness of this notion for the task of clustering, let us consider two frequent items, windows and apple. Documents that contain the word windows may relate to renovation. Documents that contain the word apple may relate to fruits. However, if both words occur together in many documents, then another topic that talks about operating systems should be identified. By precisely discovering these hidden topics as the first step and then clustering documents based on them, the quality of the clustering solution can be improved. This approach is very different from HFTC, where the clustering solution greatly depends on the order of selected itemsets. Instead, FIHC assigns documents to the best cluster from among all available clusters (frequent itemsets). The intuition of the clustering criterion is that there are some frequent itemsets for each cluster in the document set, but different clusters share few frequent itemsets. FIHC uses frequent itemsets to construct clusters and to organize clusters into a topic hierarchy.

The following definitions are introduced in (Fung, Wang, & Ester, 2003): A global frequent itemset is a set of items that appear together in more than a minimum fraction of the whole document set. A global frequent item refers to an item that belongs to some global frequent itemset. A global frequent itemset containing k items is called a global frequent k-itemset. A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of documents in Ci. FIHC uses only the global frequent items in document vectors; thus, the dimensionality is significantly reduced.

The FIHC algorithm can be summarized in three phases: First, construct initial clusters. Second, build a cluster (topic) tree. Finally, prune the cluster tree in case there are too many clusters.

Constructing Clusters

For each global frequent itemset, an initial cluster is constructed to include all the documents containing this itemset. Initial clusters are overlapping because one document may contain multiple global frequent itemsets. FIHC utilizes this global frequent itemset as the cluster label to identify the cluster. For each document, the best initial cluster is identified and the document is assigned only to the best matching initial cluster. The goodness of a cluster Ci for a document docj is measured by some score function using cluster frequent items of initial clusters. After this step, each document belongs to exactly one cluster. The set of clusters can be viewed as a set of topics in the document set.

Example: Figure 1 depicts a set of initial clusters. Each of them is labeled with a global frequent itemset. A document Doc1 containing global frequent items Sports, Tennis, and Ball is assigned to clusters {Sports}, {Sports, Ball}, {Sports, Tennis} and {Sports, Tennis, Ball}. Suppose {Sports, Tennis, Ball} is the best cluster for Doc1 measured by some score function. Doc1 is then removed from {Sports}, {Sports, Ball}, and {Sports, Tennis}.

Figure 1. Initial clusters (the figure shows the overlapping initial clusters {Sports}, {Sports, Ball}, {Sports, Tennis}, {Sports, Tennis, Ball}, and {Sports, Tennis, Racket}, with document Doc1, containing Sports, Tennis, and Ball, belonging to the first four of them)

Building Cluster Tree

In the cluster tree, each cluster (except the root node) has exactly one parent. The topic of a parent cluster is more general than the topic of a child cluster, and they are similar to a certain degree (see Figure 2 for an example). Each cluster uses a global frequent k-itemset as its cluster label. A cluster with a k-itemset cluster label appears at level k in the tree. The cluster tree is built bottom-up by choosing the best parent at level k-1 for each cluster at level k. The parent's cluster label must be a subset of the child's cluster label. By treating all documents in the child cluster as a single document, the criterion for selecting the best parent is similar to the one for choosing the best cluster for a document.

Example: Cluster {Sports, Tennis, Ball} has a global frequent 3-itemset label. Its potential parents are {Sports, Ball} and {Sports, Tennis}. Suppose {Sports, Tennis} has a higher score. It becomes the parent cluster of {Sports, Tennis, Ball}.

Pruning Cluster Tree

The cluster tree can be broad and deep, which makes it unsuitable for browsing. The goal of tree pruning is to efficiently remove the overly specific clusters based on the notion of inter-cluster similarity. The idea is that if two sibling clusters are very similar, they should be merged into one cluster. If a child cluster is very similar to its parent (high inter-cluster similarity), then the child cluster is replaced with its parent cluster. The parent cluster will then also include all documents of the child cluster.
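The first phase (constructing clusters) sketched above can be illustrated in Python. The score used here, which simply prefers the cluster whose label shares the most items with the document, is a deliberately simplified stand-in for the score function of FIHC, and all names are illustrative assumptions.

```python
def initial_clusters(doc_items, global_frequent_itemsets):
    """doc_items: dict doc_id -> set of global frequent items in the document.
    Returns dict cluster_label (frozenset) -> set of doc_ids (overlapping clusters)."""
    clusters = {}
    for itemset in global_frequent_itemsets:          # one initial cluster per itemset
        label = frozenset(itemset)
        clusters[label] = {d for d, items in doc_items.items() if label <= items}
    return clusters

def assign_best_cluster(doc_items, clusters):
    """Keep each document only in its best matching initial cluster."""
    best = {}
    for d, items in doc_items.items():
        candidates = [label for label, members in clusters.items() if d in members]
        # simplified score: the cluster whose label shares the most items with the document
        best[d] = max(candidates, key=lambda label: (len(label & items), len(label)))
    for label, members in clusters.items():
        clusters[label] = {d for d in members if best[d] == label}
    return clusters
```

For the document Doc1 = {Sports, Tennis, Ball} of Figure 1, this simplified score would keep Doc1 only in the cluster labeled {Sports, Tennis, Ball}, matching the example above.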
Figure 2. Sample cluster tree (the figure shows cluster label {Sports} at the root, its children {Sports, Ball} and {Sports, Tennis}, and the children {Sports, Tennis, Ball} and {Sports, Tennis, Racket} under {Sports, Tennis})

Example: Suppose the cluster {Sports, Tennis, Ball} is very similar to its parent {Sports, Tennis} in Figure 2. {Sports, Tennis, Ball} is pruned and its documents, e.g., Doc1, are moved up into cluster {Sports, Tennis}.

Evaluation of FIHC

The FIHC algorithm was experimentally evaluated and compared to state-of-the-art document clustering methods. See (Fung, Wang, & Ester, 2003) for more details. FIHC uses only the global frequent items in document vectors, drastically reducing the dimensionality of the document set. Experiments show that clustering with reduced dimensionality is significantly more efficient and scalable. FIHC can cluster 100K documents within several minutes while HFTC and UPGMA cannot even produce a clustering solution. FIHC is not only scalable, but also accurate. The clustering accuracy of FIHC consistently outperforms other methods. FIHC allows the user to specify an optional parameter, the desired number of clusters in the solution. However, close-to-optimal accuracy can still be achieved even if the user does not specify this parameter.

The cluster tree provides a logical organization of clusters which facilitates browsing documents. Each cluster is attached with a cluster label that summarizes the documents in the cluster. Different from other clustering methods, no separate post-processing is required for generating these meaningful cluster descriptions.

RELATED LINKS

The following are some clustering tools on the Internet:

• Tools: FIHC implements Frequent Itemset-based Hierarchical Clustering. Website: http://www.cs.sfu.ca/~ddm/
• Tools: CLUTO implements Basic/Bisecting K-means and Agglomerative methods. Website: http://www-users.cs.umn.edu/~karypis/cluto/
• Tools: Vivisimo is a clustering search engine. Website: http://vivisimo.com/

FUTURE TRENDS

Incrementally Updating the Cluster Tree

One potential research direction is to incrementally update the cluster tree (Guha, Mishra, Motwani, & O'Callaghan, 2000). In many cases, the number of documents in a document set is growing continuously, and it is infeasible to rebuild the cluster tree upon every arrival of a new document. Using FIHC, one can simply assign the new document to the most similar existing cluster. But the clustering accuracy may degrade in the course of time, since the original global frequent itemsets may no longer reflect the current state of the overall document set. Incremental clustering is closely related to some of the recent research on data mining in stream data (Ordonez, 2003).

CONCLUSION

Most traditional clustering methods do not completely satisfy the special requirements for hierarchical document clustering, such as high dimensionality, high volume, and ease of browsing. In this chapter, we review several document clustering methods in the context of these requirements, and a new document clustering method, FIHC, is discussed in a bit more detail. Due to the massive volumes of unstructured data generated in the globally networked environment, the importance of document clustering will continue to grow.

REFERENCES

Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. International Conference on Knowledge Discovery and Data Mining, KDD'02 (pp. 436-442), Edmonton, Alberta, Canada.
Cutting, D.R., Karger, D.R., Pedersen, J.O., & Tukey, J.W. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. International Conference on Research and Development in Information Retrieval, SIGIR'92 (pp. 318-329), Copenhagen, Denmark.

Fung, B., Wang, K., & Ester, M. (2003, May). Hierarchical document clustering using frequent itemsets. SIAM International Conference on Data Mining, SDM'03 (pp. 59-70), San Francisco, CA, United States.

Guha, S., Mishra, N., Motwani, R., & O'Callaghan, L. (2000). Clustering data streams. Symposium on Foundations of Computer Science (pp. 359-366).

Karypis, G. (2003). Cluto 2.1.1: A software package for clustering high dimensional datasets. Retrieved from http://www-users.cs.umn.edu/~karypis/cluto/

Kaufman, L., & Rousseeuw, P.J. (1990, March). Finding groups in data: An introduction to cluster analysis. New York: John Wiley & Sons, Inc.

Krishnapuram, R., Joshi, A., & Yi, L. (1999, August). A fuzzy relative of the k-medoids algorithm with application to document and snippet clustering. IEEE International Conference on Fuzzy Systems, FUZZ-IEEE'99, Korea.

Larsen, B., & Aone, C. (1999). Fast and effective text mining using linear-time document clustering. International Conference on Knowledge Discovery and Data Mining, KDD'99 (pp. 16-22), San Diego, California, United States.

Ordonez, C. (2003). Clustering binary data streams with K-means. Workshop on Research Issues in Data Mining and Knowledge Discovery, SIGMOD'03 (pp. 12-19), San Diego, California, United States.

Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. Workshop on Text Mining, SIGKDD'00.

van Rijsbergen, C.J. (1979). Information retrieval (2nd ed.). London: Butterworth Ltd.

Wang, K., Xu, C., & Liu, B. (1999). Clustering transactions using large items. International Conference on Information and Knowledge Management, CIKM'99 (pp. 483-490), Kansas City, Missouri, United States.

Wang, K., Zhou, S., & He, Y. (2001, April). Hierarchical classification of real life documents. SIAM International Conference on Data Mining, SDM'01, Chicago, United States.

Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: Experiments and analysis. Technical report, Department of Computer Science, University of Minnesota.

Zhao, Y., & Karypis, G. (2002, November). Evaluation of hierarchical clustering algorithms for document datasets. International Conference on Information and Knowledge Management (pp. 515-524), McLean, Virginia, United States.

KEY TERMS

Cluster Frequent Item: A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of documents in Ci.

Document Clustering: The automatic organization of documents into clusters or groups so that documents within a cluster have high similarity in comparison to one another, but are very dissimilar to documents in other clusters.

Document Vector: Each document is represented by a vector of frequencies of the items remaining after preprocessing within the document.

Global Frequent Itemset: A set of words that occur together in some minimum fraction of the whole document set.

Inter-Cluster Similarity: The overall similarity among documents from two different clusters.

Intra-Cluster Similarity: The overall similarity among documents within a cluster.

Medoid: The most centrally located object in a cluster.

Stemming: For text mining purposes, morphological variants of words that have the same or similar semantic interpretations can be considered as equivalent. For example, the words "computation" and "compute" can be stemmed into "comput".

Stop Words Removal: A preprocessing step for text mining. Stop words, like "the" and "this", which rarely help the mining process, are removed from the input data.
High Frequency Patterns in Data Mining


Tsau Young Lin
San Jose State University, USA

INTRODUCTION

The principal focus is to examine the foundation of association (rule) mining (AM) via granular computing (GrC). The main result is: the set of all high frequency patterns can be found by solving linear inequalities within polynomial time.

BACKGROUND

Some Foundation Issues in Data Mining

What is data mining? The following informal paraphrase of Fayad et al.'s (1996) definition seems quite universal:

Deriving useful patterns from data.

The keys are data, patterns, derivation system, and usefulness. We will examine critically the current practices of AM.

Some Basic Terms in Association Mining (AM)

In AM, two measures, support and confidence, are the main criteria. It is well known among researchers that the support is the main hurdle; in other words, high frequency patterns are the main focus. AM originated from market basket data (Agrawal, 1993). However, we will be interested in AM for relational tables. To be definite, we assert:

1. A relational table is a bag relation, that is, repetitions of tuples are permissible (Garcia-Molina et al., 2002).
2. An item is an attribute value.
3. A q-itemset is a subtuple of length q.
4. A high frequency pattern of length q is a q-subtuple whose number of occurrences is greater than or equal to a given threshold.

Emerging Data Mining Method - Granular Computing

Bitmap index is a common notion in database theory. The advantage of the bitmap representation is that it is computationally efficient (Louie & Lin, 2000), and the drawback is that the order of the table has to be fixed (Garcia-Molina et al., 2002). Based on granular computing, we propose a new method, called granular representations, that avoids this drawback. We will illustrate the idea by examples. The following example is modified from the text cited above (p. 702). A relational table K is viewed as a knowledge representation of a set V, called the universe, of real world entities by tuples of data; see Table 1.

A bitmap index for an attribute is a collection of bit-vectors, one for each possible value that may appear in the attribute. For the first attribute, BusinesSize (the amount of business in millions), the bitmap index would have three bit-vectors. The first bit-vector, for value TWENTY, is 100011100, because the first, fifth, sixth, and seventh tuples have BusinesSize = TWENTY. The other two, for values TEN and THIRTY, are 011100000 and 000000011 respectively; Table 1 shows both the original table and the bitmap table. Bmonth means Birth month; City means the location of the entities.
Table 1. K and B are isomorphic

    V    Relational Table K                 Bitmap Table B
         BusinesSize  Bmonth  City          BusinesSize  Bmonth     City
    v1   TWENTY       MAR     NY            100011100    110011000  101000000
    v2   TEN          MAR     SJ            011100000    110011000  010011100
    v3   TEN          FEB     NY            011100000    001100000  101000000
    v4   TEN          FEB     LA            011100000    001100000  000100011
    v5   TWENTY       MAR     SJ            100011100    110011000  010011100
    v6   TWENTY       MAR     SJ            100011100    110011000  010011100
    v7   TWENTY       APR     SJ            100011100    000000100  010011100
    v8   THIRTY       JAN     LA            000000011    000000011  000100011
    v9   THIRTY       JAN     LA            000000011    000000011  000100011
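The bitmap (and, equivalently, granular) representation of Table 1 can be computed directly from the relational table. The following sketch, with illustrative names only, builds the elementary granule (the set of entity identifiers) for each attribute value; only the first two tuples of Table 1 are written out here.

```python
from collections import defaultdict

def elementary_granules(table, attribute):
    """table: list of dicts, one per tuple. Returns dict value -> set of entity ids."""
    granules = defaultdict(set)
    for row_id, row in enumerate(table, start=1):
        granules[row[attribute]].add(f"v{row_id}")
    return dict(granules)

K = [
    {"BusinesSize": "TWENTY", "Bmonth": "MAR", "City": "NY"},
    {"BusinesSize": "TEN",    "Bmonth": "MAR", "City": "SJ"},
    # ... the remaining tuples of Table 1
]
# elementary_granules(K, "BusinesSize") -> {"TWENTY": {"v1"}, "TEN": {"v2"}} for these two rows
```

A bit-vector is simply the characteristic vector of such a granule over the fixed tuple order, which is why the granular form avoids the bitmap's dependence on that order.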
Table 2a. Granular data model (GDM) for the BusinesSize attribute

    BusinesSize   Granular Representation   Bitmap Representation
    TWENTY        = {v1, v5, v6, v7}        = 100011100
    TEN           = {v2, v3, v4}            = 011100000
    THIRTY        = {v8, v9}                = 000000011

Table 2b. Granular data model (GDM) for the Bmonth attribute

    Bmonth   Granular Representation   Bitmap Representation
    JAN      = {v8, v9}                = 000000011
    FEB      = {v3, v4}                = 001100000
    MAR      = {v1, v2, v5, v6}        = 110011000
    APR      = {v7}                    = 000000100

Table 2c. Granular data model (GDM) for the City attribute

    City   Granular Representation   Bitmap Representation
    LA     = {v4, v8, v9}             = 000100011
    NY     = {v1, v3}                 = 101000000
    SJ     = {v2, v5, v6, v7}         = 010011100

Table 2. K and G are isomorphic

    V    Bag Relation K                  Granular Table G
         BusinesSize  Bmonth  City       BusinesSize      Bmonth           City
    v1   TWENTY       MAR     NY         {v1,v5,v6,v7}    {v1,v2,v5,v6}    {v1,v3}
    v2   TEN          MAR     SJ         {v2,v3,v4}       {v1,v2,v5,v6}    {v2,v5,v6,v7}
    v3   TEN          FEB     NY         {v2,v3,v4}       {v3,v4}          {v1,v3}
    v4   TEN          FEB     LA         {v2,v3,v4}       {v3,v4}          {v4,v8,v9}
    v5   TWENTY       MAR     SJ         {v1,v5,v6,v7}    {v1,v2,v5,v6}    {v2,v5,v6,v7}
    v6   TWENTY       MAR     SJ         {v1,v5,v6,v7}    {v1,v2,v5,v6}    {v2,v5,v6,v7}
    v7   TWENTY       APR     SJ         {v1,v5,v6,v7}    {v7}             {v2,v5,v6,v7}
    v8   THIRTY       JAN     LA         {v8,v9}          {v8,v9}          {v4,v8,v9}
    v9   THIRTY       JAN     LA         {v8,v9}          {v8,v9}          {v4,v8,v9}
Next, we will interpret the bit-vectors in terms of set theory. A bit-vector can be viewed as a representation of a subset of V. For example, the bit-vector 100011100 of BusinesSize = TWENTY says that the first, fifth, sixth, and seventh entities have been selected; in other words, the bit-vector represents the subset {v1, v5, v6, v7}. The other two bit-vectors, for values TEN and THIRTY, represent the subsets {v2, v3, v4} and {v8, v9} respectively. We summarize such translations in Tables 2a, 2b, and 2c, and refer to these subsets as elementary granules.

Some easy observations:

1. The collection of elementary granules of an attribute (column) forms a partition, that is, all granules of this attribute are pairwise disjoint. This fact was observed by Pawlak (1982) and Tony Lee (1983).
2. From Tables 1 and 2, one can easily conclude that the relational table K, the bitmap table B and the granular table G are isomorphic. Two tables are isomorphic if one can transform a table to the other by renaming all attribute values in a one-to-one fashion.
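Because a subtuple corresponds to the intersection of the elementary granules of its attribute values, the support of a q-subtuple can be read off as the cardinality of that intersection, without rescanning the table, as discussed in the following sections. A minimal sketch (the names are ours) follows; for the granules of Table 2, it reproduces the supports shown later in Table 4.

```python
def support_of_subtuple(granules_by_attribute, subtuple):
    """granules_by_attribute: dict attribute -> (dict value -> set of entity ids).
    subtuple: dict attribute -> value. Returns the support as an integer."""
    sets = [granules_by_attribute[a][v] for a, v in subtuple.items()]
    return len(set.intersection(*sets))

# Example with two of the granules of Table 2:
granules = {
    "BusinesSize": {"TWENTY": {"v1", "v5", "v6", "v7"}},
    "Bmonth": {"MAR": {"v1", "v2", "v5", "v6"}},
}
# support_of_subtuple(granules, {"BusinesSize": "TWENTY", "Bmonth": "MAR"}) == 3
```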
Granular Data Model (GDM) - Uninterpreted Relational Table in Free Format

The middle columns of Tables 2a, 2b and 2c define 3 partitions. The universe and these 3 partitions, denoted by (V, {E_BusinesSize, E_Bmonth, E_City}), determine the granular table G and vice versa. More generally, a 3-tuple (V, E, C) is called a GDM, where E is a finite family of partitions and C consists of the names of all elementary granules. A partition (equivalence relation) of V that is not in the given E is referred to as an uninterpreted attribute of the GDM, and its elementary granules are un-interpreted attribute values.

GDM Theorem: The granular table G determines the GDM and vice versa.

In view of the Isomorphic Theorem below, it is sufficient to do AM in GDM.

MAIN THRUST

Analysis of Association Mining (AM)

To understand the mathematical mechanics of AM, let us examine how the information has been created and processed. We will take the deductive data mining approach. First, let us set up some terminology. A symbol is a string of bits and bytes that represents a slice of the real world; however, such a real world meaning does not participate in the formal processing or computing. We term such processing computing with symbols. In AI, such a symbol is termed a semantic primitive (Feigenbaum, 1981). A symbol is termed a word if the intended real world meaning participates in the formal processing or computing. We term such processing computing with words. Note that mathematicians use words (in group theory) as symbols; their words are our symbols.

Data Processing and Computing with Words

In traditional data processing (TDP), a relational table is a knowledge representation of a slice of the real world. So each symbol of the table represents (to a human) a piece of the real world; however, such a representation is not implemented in the system. Nevertheless, the DBMS, under human commands, does process the data, for example, Bmonth (attribute), April, March (attribute values), with human-perceived semantics. So in TDP the relational table is a table of words; TDP is human directed computing with words.

Data Mining and Computing with Symbols

In (automated) AM we use the table created in TDP. However, AM algorithms regard the TDP data as symbols; no real world meaning of any word participates in the process of AM. High frequency patterns are completely deduced from the counting of the symbols. AM is computing with symbols. The input data of AM is a relational table of symbols, whose real world meaning does not participate in formal computing.

Under such a circumstance, if we replace the given set of symbols by a new set, then we can derive new patterns by simply replacing the symbols in the old patterns. Formally, we have (Lin, 2002):

Isomorphic Theorem: Isomorphic relational tables have isomorphic patterns.

This theorem implies that the theory of AM is a syntactic theory.
Table 3. The isomorphism of tables K and K′

    Bag Relation K                        Bag Relation K′
    V    BusinesSize  Bmonth  City        U    WT  NAME    Material
    v1   TWENTY       MAR     NY          u1   20  SCREW   STEEL
    v2   TEN          MAR     SJ          u2   10  SCREW   BRASS
    v3   TEN          FEB     NY          u3   10  NAIL    STEEL
    v4   TEN          FEB     LA          u4   10  NAIL    ALLOY
    v5   TWENTY       MAR     SJ          u5   20  SCREW   BRASS
    v6   TWENTY       MAR     SJ          u6   20  SCREW   BRASS
    v7   TWENTY       APR     SJ          u7   20  PIN     BRASS
    v8   THIRTY       JAN     LA          u8   30  HAMMER  ALLOY
    v9   THIRTY       JAN     LA          u9   30  HAMMER  ALLOY
Table 4. Three isomorphic 2-patterns; support = cardinality of granules

    K               K′               GDM in Granules                        Support
    (TWENTY, MAR)   (20, SCREW)      {v1, v5, v6, v7} ∩ {v1, v2, v5, v6}    3
    (MAR, SJ)       (SCREW, BRASS)   {v1, v2, v5, v6} ∩ {v2, v5, v6, v7}    3
    (TWENTY, SJ)    (20, BRASS)      {v1, v5, v6, v7} ∩ {v2, v5, v6, v7}    3
Example: From Table 3, it should be clear that the one-to-one correspondence between K and K′ induces consistently a one-to-one correspondence between the two sets of distinct attribute values. We describe such a phenomenon by the statement: K and K′ are isomorphic.

In Table 4, we display the high frequency patterns of length 2 from Table K, K′ and GDM; the three sets of patterns are isomorphic to each other. So for AM, we can use any one of the three tables. An observation: in using K or K′ for AM, one needs to scan the table to get the support, while in using GDM, the support can be read from the cardinality of the granules; no database scan is required - one strength of GDM. Another observation: from the definition of elementary granules, it should be obvious that subtuples are mapped to the intersections of elementary granules.

Patterns and Granular Formulas

Implicitly, AM has assumed that high frequency patterns are expressions of the input symbols (elements of the input relational table). Such assumptions are not made in other techniques. In neural network techniques, the input data are numerical, but the patterns are not numerical expressions. They are essentially functions that are derived from activation functions (Lin, 1996; Park & Sandberg, 1991).

Let us go back to AM; the implicit assumption simplifies the problem. What are the possible expressions of the input symbols? There are two possible formalisms, logic formulas and set theoretical algebraic expressions. In logic form, we have several choices, deductive database systems, datalog, or decision logic among others (Pawlak, 1991; Ullman, 1988-89); we choose decision logic because it is simpler. In set theoretical form, we use GDM (Lin, 2000).

Expressions of High Frequency Patterns

1. Logic Based: A high frequency pattern in decision logic is a logic formula whose meaning set (support) has cardinality greater than or equal to the threshold.
2. Set Based: A high frequency pattern in GDM is a granular expression, which is a set theoretical algebraic expression of elementary granules; when the expression is evaluated set theoretically, the cardinality of the resultant set is greater than or equal to the threshold. We will call such algebraic expressions granular patterns. Note that several distinct algebraic expressions of elementary granules may have the same resultant set.

Informally, the logical formula of a granular pattern is the logic formula of the names of the elementary granules (Lin, 2000); more precisely, we translate elementary granules, ∪ and ∩ into their names, "or" and "and", respectively. Next, we note that there are only finitely many distinct subsets that can be generated by the intersections and unions of elementary granules in GDM. If we only consider the disjunctive normal form, the total number of possible high frequency patterns in AM is finite.

Finding High Frequency Patterns by Solving a Set of Linear Inequalities

Let B be the Boolean algebra generated by the elementary granules; the partial order is the set theoretical inclusion ⊆. Then B is the set of all granular expressions. Let O be the smallest element (it is not necessarily the empty set) and I the greatest element (I is the universe V). An element p is an atom if p ≠ O and there is no element x such that p ⊃ x ⊃ O. Each atom p is an intersection of some elementary granules. Let S(b) be the set of all atoms pj such that pj ⊆ b, and let s(b) be its cardinality. From (Birkoff & MacLane, 1977, Chapter 11), we have

Proposition: Every b ∈ B can be expressed in the form b = p1 ∪ . . . ∪ ps(b).

For convenience, let us define an operation of a binary number x and a set S. We write S*x to mean the following:

S*x = S, if x = 1 and S ≠ ∅
S*x = ∅, if x = 0 or S = ∅
sion logic is a logic formula, whose meaning set S*x = , if x=0 or S =
(support) has cardinality greater than or equal to the threshold.

563

TEAM LinG
High Frequency Patterns in Data Mining

Let p1, p2, . . ,pm be the set of all atoms in B. Then a (TWENTY, MAR) from K has no meaning on its own,
granular expression b can be expressed as (20, SCREW) from K has a valid meaning.
Let RW(K) be the Real World that K is representing.
b=p1*x 1 . . . p m* xm . The summary implies that the subtuple (TWENTY, MAR),
even though occurs very frequently in the table, there is
and its cardinality can be expressed as no real world event correspond to it. The data implies
that three entities v 1, v5, v6 have common properties
|b| = | p i |*xi encoded by Twenty and Mar. In the table K, they are
naively summarized into one concept (TWENTY,
where | | is the cardinality of . MAR). Unfortunately, in the real world RW(K), the
three occurrences of Twenty and Mar (from three
Main Theorem: Let s be the threshold. Then b is a entities, v1, v 5, v 6) do not integrate into an appropriate
high frequency pattern, if new concept (TWENTY, MAR). Such error occurs,
because high frequency is an inadequate or inaccurate
|b| = | pi |*xi s (*) criterion. We need a tighter notion of patterns.

In applications, pis are readily computable; it is the Semantic Oriented Data Mining
elementary granules of the intersection of all partitions
(defined by attributes); see the Table 1 and 2. So we only If we do know how to compute the semantics, then the
need to find all binary solutions of xi. The generators of the computation should tell us that the repeated two words
solution can be enumerated along the hyperplanes of the TWENTY and MAR can not be combined into a new
inequalities of the constraints. concept regardless of high repetitions, and should be
dropped out. So semantic oriented data mining is needed
Observations (Lin & Louie. 2001, 2002). As ontology, semantic web,
and computing with words (semantic computing) are
Theoretically, this is a remarkable theorem. It says all heating up, it could be a right time to move onto seman-
possible high frequency patterns can be found by solving tic oriented data mining.
linear inequalities. However, the practicality of the
main theorem is highly depended on the complexity of New Notions of Patterns and
the problem. If both | pi | and s are small, then the number Algorithmic Information Theory
of solutions will be out of hands, simply due to the size of
the number. We would like to stress that the difficulty is In (Lin, 1993), based on algorithmic information theory
simply due to the size of possible solutions, not the or Kolmogorov complexity theory, we proposed that a
methodology. The result implies that the notion of high non-random (compressible string) is a string with pat-
frequency patterns may not be tight enough. At this terns and the shortest Turing machine that generates this
moment, (*) is useful only if the number of attributes under string is the pattern. We concluded, then, that a finite
considerations is small. sequence (a relational table is a finite sequence) with
long constant subsequences (the length of such constant
sequence is the support) is trivially compressible (hav-
FUTURE TRENDS ing a pattern). High frequency patterns are such patterns.
Taking the same thought, what would be the next less
Tighter Notion of Patterns trivial compressible finite sequences?

Let us consider the real world meaning of the patterns of


length 2, namely, (TWENTY, MAR) and (20, SCREW). CONCLUSION
What does this subtuple (TWENTY, MAR) mean? 20
million dollar business on March? The last statement is not Our analysis on association mining seems fruitful: (1)
the original meaning of the schema: Originally it means v1, High frequency patterns are natural generalizations of
v5, v6 have 20 million dollar business and they were born in association rules. (2) All high frequency patterns (gener-
March. This subtuple has no meaning on its own. On the alized associations) can be found by solving linear in-
other hand, (20, SCREW) from K is a valid pattern (most of equalities. (3) High frequency patterns are rather lean in
screws have weight 20). In summary, we have semantics (Isomorphic Theorem). So semantic oriented
AM or new notion of patterns may be needed.

564

REFERENCES Pawlak Z. (1991). Rough sets. Theoretical aspects of


reasoning about data. Kluwer Academic Publishers. 0
Agrawal, R,. Imielinski, T., & Swami, A. (1993, June). Ullman, J. (1988). Principles of database and knowledge-
Mining association rules between sets of items in large base systems,vol. I. Computer Science Press.
databases. In Proceeding of ACM-SIGMOD interna-
tional Conference on Management of Data (pp. 207-216), Ullman, J. (1989). Principles of database and knowledge-
Washington, DC. base systems, vol. II. Computer Science Press.
Barr, A., & Feigenbaum, E.A. (1981). The handbook of
artificial intelligence. Willam Kaufmann.
Fayad, U.M., Piatetsky-Sjapiro, G., & Smyth, P. (1996).
KEY TERMS
From data mining to knowledge discovery: An overview.
In Fayard, Piatetsky-Sjapiro, Smyth, & Uthurusamy (Eds.), Algorithmic Information Theory: Absolute infor-
Knowledge Discovery in Databases. AAAI/MIT Press. mation theory based on Kolmogorov complexity theory.

Garcia-Molina, H., Ullman, J. D., & Widom, J. (2002). Association (Undirected Association Rule): A
Database systems-The complete book. Prentice Hall. subtuple of a bag relation whose support is greater than
a given threshold.
Lee, T.T. (1983). Algebraic theory of relational data-
bases. The Bell System Technical Journal, 62(10),3159- Bag Relation: A relation that permits repetition of
3204. tuples

Lin, T.Y. (1993). Rough patterns in data - Rough sets and Computing with Symbols: The interpretations of
foundation of intrusion detection systems. Journal of the symbols are not participating in the formal data
Foundation of Computer Science and Decision Sup- processing or computing,
port, 18(3-4), 225-241. Computing with Words: In this article, computing
Lin, T.Y. (1996, July). The power and limit of neural with words means one form of formal data processing or
networks. In Proceedings of the 1996 Engineering computing, in which the interpretations of the symbols
Systems Design and Analysis Conference, 7 (pp. 49-53), do participate. L.A .Zadeh uses this term in a much
Montpellier, France. deeper way.

Lin, T.T. (2000). Data mining and machine oriented mod- Deductive Data Mining: A data mining methodol-
eling: A granular computing approach. Journal of Ap- ogy that requires to list explicitly the input data and
plied Intelligence, 13(2), 113-124. background knowledge. Roughly it treats data mining as
deductive science (axiomatic method)
Lin, T.Y., & Louie, E. (2001). Semantics oriented asso-
ciation rules. In 2002 World Congress of Computa- Granulation and Partition: Partition is a decomposi-
tional Intelligence (pp. 956-961), Honolulu, Hawaii, May tion of a set into a collection of mutually disjoint subsets.
12-17, (paper # 5754). Granulation is defined similarly, but allows the subsets to
be generalized subsets, such as fuzzy sets, and permits
Louie, E., & Lin, T.Y. (2000, October). Finding associa- the overlapping.
tion rules using fast bit computation: Machine-oriented
modeling. In Z. Ras, & S. Ohsuga (Eds.), Foundations Kolmogorov Complexity of a String: The length
of intelligent systems, Lecture Notes in Artificial Intelli- of shortest program that can generate the given string.
gence #1932, 486-494, Springer-Verlag. 12th Interna- Semantic Primitive: It is a symbol that has interpre-
tional Symposium on Methodologies for Intelligent Sys- tation, but the interpretation is not implemented in the
tems, Charlotte, NC. system. So in automated computing, semantic primitive is
Park, J., & Sandberg, I.W. (1991). Universal approximation treated as symbols. However, in interactive computing; it
using radial-basis-function networks. Neural Computa- may be treated as a word (not necessary).
tion, 3, 246-257. Support: Support is the percentage of the tuples in
Pawlak, Z. (1982). Rough sets. International Journal which that subtuple occurs.
of Computer and Information Sciences, 11, 341-356.

565

TEAM LinG
566

Homeland Security Data Mining and Link


Analysis
Bhavani Thuraisingham
The MITRE Corporation, USA

INTRODUCTION sources and multiple domains. There is a clear need to


analyze the data to support planning and other functions
Data mining is the process of posing queries to large of an enterprise.
quantities of data and extracting information often pre- Some of the data-mining techniques include those
viously unknown using mathematical, statistical, and based on statistical reasoning techniques, inductive logic
machine-learning techniques. Data mining has many programming, machine learning, fuzzy sets, and neural
applications in a number of areas, including marketing networks, among others. The data-mining problems in-
and sales, medicine, law, manufacturing, and, more re- clude classification (finding rules to partition data into
cently, homeland security. Using data mining, one can groups), association (finding rules to make associations
uncover hidden dependencies between terrorist groups between data), and sequencing (finding rules to order
as well as possibly predict terrorist events based on past data). Essentially, one arrives at some hypotheses, which
experience. One particular data-mining technique that is is the information extracted from examples and patterns
being investigated a great deal for homeland security is observed. These patterns are observed from posing a
link analysis, where links are drawn between various series of queries; each query may depend on the re-
nodes, possibly detecting some hidden links. sponses obtained to the previous queries posed.
This article provides an overview of the various Data mining is an integration of multiple technolo-
developments in data-mining applications in homeland gies. These include data management, such as database
security. The organization of this article is as follows. management, data warehousing, statistics, machine learn-
First, we provide some background on data mining and ing, decision support, and others, such as visualization
the various threats. Then, we discuss the applications of and parallel computing. There is a series of steps involved
data mining and link analysis for homeland security. in data mining. These include getting the data organized
Privacy considerations are discussed next as part of for mining, determining the desired outcomes to mining,
future trends. The article is then concluded. selecting tools for mining, carrying out the mining, prun-
ing the results so that only the useful ones are considered
further, taking actions from the mining, and evaluating the
BACKGROUND actions to determine benefits. There are various types of
data mining. By this we do not mean the actual techniques
We provide background information on both data min- used to mine the data, but what the outcomes will be.
ing and security threats. These outcomes also have been referred to as data-
mining tasks. These include clustering, classification
anomaly detection, and forming associations.
Data Mining While several developments have been made, there
also are many challenges. For example, due to the large
Data mining is the process of posing various queries and volumes of data, how can the algorithms determine
extracting useful information, patterns, and trends often which technique to select and what type of data mining
previously unknown from large quantities of data possi- to do? Furthermore, the data may be incomplete and/or
bly stored in databases. Essentially, for many organiza- inaccurate. At times, there may be redundant informa-
tions, the goals of data mining include improving mar- tion, and at times, there may not be sufficient informa-
keting capabilities, detecting abnormal patterns, and tion. It is also desirable to have data-mining tools that
predicting the future, based on past experiences and can switch to multiple techniques and support multiple
current trends. There is clearly a need for this technol- outcomes. Some of the current trends in data mining
ogy. There are large amounts of current and historical include mining Web data, mining distributed and hetero-
data being stored. Therefore, as databases become larger, geneous databases, and privacy-preserving data mining,
it becomes increasingly difficult to support decision where one ensures that one can get useful results from
making. In addition, the data could be from multiple mining and at the same time maintain the privacy of

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Homeland Security Data Mining and Link Analysis

individuals (Berry & Linoff; Han & Kamber, 2000; a hypothesis and then determine whether the hypothesis
Thuraisingham, 1998). is true, or bottom-up reasoning, where we start with 0
examples and then come up with a hypothesis
Security Threats (Thuraisingham, 1998). In the following, we will exam-
ine how data-mining techniques may be applied for
Security threats have been grouped into many catego- homeland security applications. Later, we will examine
ries (Thuraisingham, 2003). These include informa- a particular data-mining technique called link analysis
tion-related threats, where information technologies (Thuraisingham, 2003).
are used to sabotage critical infrastructures, and non- Data-mining techniques include techniques for mak-
information-related threats, such as bombing buildings. ing associations, clustering, anomaly detection, predic-
Threats also may be real-time threats and non-real-time tion, estimation, classification, and summarization. Es-
threats. Real-time threats are threats where attacks have sentially, these are the techniques used to obtain the
timing constraints associated with them, such as build- various data-mining outcomes. We will examine a few
ing X will be attacked within three days. Non-real-time of these techniques and show how they can be applied to
threats are those threats that do not have timing con- homeland security. First, consider association rule min-
straints associated with them. Note that non-real-time ing techniques. These techniques produce results, such
threats could become real-time threats over time. as John and James travel together or Jane and Mary
Threats also include bioterrorism, where biological travel to England six times a year and to France three
and possibly chemical weapons are used to attack, and times a year. Essentially, they form associations be-
cyberterrorism, where computers and networks are at- tween people, events, and entities. Such associations
tacked. Bioterrorism could cost millions of lives, and also can be used to form connections between different
cyberterrorism, such as attacks on banking systems, could terrorist groups. For example, members from Group A
cost millions of dollars. Some details on the threats and and Group B have no associations, but Groups A and B
countermeasures are discussed in various texts (Bolz, have associations with Group C. Does this mean that
2001). The challenge is to come up with techniques to there is an indirect association between A and C?
handle such threats. In this article, we discuss data- Next, let us consider clustering techniques. Clusters
mining techniques for security applications. essentially partition the population based on a charac-
teristic such as spending patterns. For example, those
living in the Manhattan region form a cluster, as they
MAIN THRUST spend over $3,000 on rent. Those living in the Bronx
from another cluster, as they spend around $2,000 on
First, we will discuss data mining for homeland secu- rent. Similarly, clusters can be formed based on terror-
rity. Then, we will focus on a specific data-mining ist activities. For example, those living in region X
technique called link analysis for homeland security. bomb buildings, and those living in region Y bomb
An aspect of homeland security is cyber security. There- planes.
fore, we also will discuss data mining for cyber security. Finally, we will consider anomaly detection tech-
niques. A good example here is learning to fly an air-
plane without wanting to learn to take off or land. The
Applications of Data Mining for general pattern is that people want to get a complete
Homeland Security training course in flying. However, there are now some
individuals who want to learn to fly but do not care about
Data-mining techniques are being examined extensively take off or landing. This is an anomaly. Another example
for homeland security applications. The idea is to gather is John always goes to the grocery store on Saturdays.
information about various groups of people and study But on Saturday, October 26, 2002, he went to a fire-
their activities and determine if they are potential ter- arms store and bought a rifle. This is an anomaly and may
rorists. As we have stated earlier, data-mining outcomes need some further analysis as to why he is going to a
include making associations, linking analyses, forming firearms store when he has never done so before. Some
clusters, classification, and anomaly detection. The tech- details on data mining for security applications have
niques that result in these outcomes are techniques been reported recently (Chen, 2003).
based on neural networks, decisions trees, market-bas-
ket analysis techniques, inductive logic programming, Applications of Link Analysis
rough sets, link analysis based on graph theory, and
nearest-neighbor techniques. The methods used for data Link analysis is being examined extensively for applica-
mining include top-down reasoning, where we start with tions in homeland security. For example, how do we

567

TEAM LinG
Homeland Security Data Mining and Link Analysis

connect the dots describing the various events and make that Johns computer is never used between 2:00 A .M.
links and connections between people and events? One and 5:00 A.M.. When Johns computer is in use at 3:00
challenge to using link analysis for counterterrorism is A .M ., for example, then this is flagged as an unusual
reasoning with partial information. For example, agency pattern.
A may have a partial graph, agency B another partial Data mining is also being applied to other applica-
graph, and agency C a third partial graph. The question is tions in cyber security, such a auditing. Here again, data
how do you find the associations between the graphs, on normal database access is gathered, and when some-
when no agency has the complete picture? One would thing unusual happens, then this is flagged as a possible
ague that we need a data miner that would reason under access violation. Digital forensics is another area where
uncertainty and be able to figure out the links between data mining is being applied. Here again, by mining the
the three graphs. This would be the ideal solution, and the vast quantities of data, one could detect the violations
research challenge is to develop such a data miner. The that have occurred. Finally, data mining is being used
other approach is to have an organization above the three for biometrics. Here, pattern recognition and other
agencies that will have access to the three graphs and machine-learning techniques are being used to learn
make the links. the features of a person and then to authenticate the
The strength behind link analyses is that by visualiz- person, based on the features.
ing the connections and associations, one can have a
better understanding of the associations among the vari-
ous groups. Associations such as A and B, B and C, D and FUTURE TRENDS
A, C and E, E and D, F and B, and so forth can be very
difficult to manage, if we assert them as rules. However, While data mining has many applications in homeland
by using nodes and links of a graph, one can visualize the security, it also causes privacy concerns. This is be-
connections and perhaps draw new connections among cause we need to collect all kinds of information about
different nodes. Now, in the real world, there would be people, which causes private information to be di-
thousands of nodes and links connecting people, groups, vulged. Privacy and data mining have been the subject of
events, and entities from different countries and conti- much debate during the past few years, although some
nents as well as from different states within a country. early discussions also have been reported
Therefore, we need link analysis techniques to deter- (Thuraisingham, 1996).
mine the unusual connection, such as a connection be- One promising direction is privacy-preserving data
tween G and P, for example, which is not obvious with mining. The challenge here is to carry out data mining but
simple reasoning strategies or by human analysis. at the same time ensure privacy. For example, one could
Link analysis is one of the data-mining techniques use randomization as a technique and give out approxi-
that is still in its infancy. That is, while much has been mate values instead of the actual values. The challenge is
written about techniques such as association rule min- to ensure that the approximate values are still useful.
ing, automatic clustering, classification, and anomaly Many papers on privacy-preserving data mining have
detection, very little material has been published on link been published recently (Agrawal & Srikant, 2000).
analysis. We need interdisciplinary researchers such as
mathematicians, computational scientists, computer sci-
entists, machine-learning researchers, and statisticians CONCLUSION
working together to develop better link analysis tools.
This article has discussed data-mining applications in
Applications of Data Mining for Cyber homeland security. Applications in national security
Security and cyber security both are discussed. We first pro-
vided an overview of data mining and security threats
Data mining also has applications in cyber security, and then discussed data-mining applications. We also
which is an aspect of homeland security. The most promi- emphasized a particular data-mining techniquelink
nent application is in intrusion detection. For example, analysis. Finally, we discussed privacy-preserving data
our computers and networks are being intruded by unau- mining.
thorized individuals. Data-mining techniques, such as It is only during the past three years that data mining
those for classification and anomaly detection, are being for security applications has received a lot of attention.
used extensively to detect such unauthorized intrusions. Although a lot of progress has been made, there is also
For example, data about normal behavior is gathered, and a lot of work that needs to be done. First, we need to
when something occurs out of the ordinary, it is flagged have a better understanding of the various threats. We
as an unauthorized intrusion. Normal behavior could be need to determine which data-mining techniques are

568

TEAM LinG
Homeland Security Data Mining and Link Analysis

applicable to which threats. Much research is also needed Thuraisingham, B. (2003). Web data mining technolo-
on link analysis. To develop effective solutions, data- gies and their applications in business intelligence 0
mining specialists have to work with counter-terrorism and counter-terrorism. FL: CRC Press.
experts. We also need to motivate the tool vendors to
develop tools to handle terrorism.

KEY TERMS
NOTE
Cyber Security: Techniques used to protect the
The views and conclusions expressed in this article are those computer and networks from the threats.
of the author and do not reflect the policies of the MITRE Data Management: Techniques used to organize,
Corporation or of the National Science Foundation. structure, and manage the data, including database man-
agement and data administration.

REFERENCES Digital Forensics: Techniques to determine the


root causes of security violations that have occurred in
Agrawal, R, & Srikant, R. (2000). Privacy-preserving a computer or a network.
data mining. Proceedings of the ACM SIGMOD Con- Homeland Security: Techniques used to protect a
ference, Dallas, Texas. building or an organization from threats.
Berry, M., & Linoff, G. (1997). Data mining tech- Intrusion Detection: Techniques used to protect
niques for marketing, sales, and customer support. the computer system or a network from a specific
New York: John Wiley. threat, which is unauthorized access.
Bolz, F. (2001). The counterterrorism handbook: Tac- Link Analysis: A data-mining technique that uses
tics, procedures, and techniques. CRC Press. concepts and techniques from graph theory to make
Chen, H. (Ed.). (2003). Proceedings of the 1st Confer- associations.
ence on Security Informatics, Tucson, Arizona. Privacy: Process of ensuring that information
Han, J., & Kamber, M. (2000). Data mining, concepts deemed personal to an individual is protected.
and techniques. CA: Morgan Kaufman. Privacy-Preserving Data Mining: Data-mining
Thuraisingham, B. (1996). Data warehousing, data min- techniques that extract useful information but at the
ing and security. Proceedings of the IFIP Database same time ensure the privacy of individuals.
Security Conference, Como, Italy. Threats: Events that disrupt the normal operation of
Thuraisingham, B. (1998). Data mining: Technologies, a building, organization, or network of computers.
techniques, tools and trends, FL: CRC Press.

569

TEAM LinG
570

Humanities Data Warehousing


Janet Delve
University of Portsmouth, UK

INTRODUCTION putting data into a model that matches it closely, some-


thing that is very hard to achieve with the relational
Data Warehousing is now a well-established part of the model. (Burt2 & James, 1996) considered the relative
business and scientific worlds. However, up until re- freedom of using source-oriented data modeling
cently, data warehouses were restricted to modeling (Denley, 1994) as compared to relational modeling
essentially numerical data examples being sales fig- with its restrictions due to normalization (which splits
ures in the business arena (e.g. Wal-Marts data ware- data into many separate tables), and highlighted the
house) and astronomical data (e.g. SKICAT) in scien- possibilities of data warehouses. Normalization is not
tific research, with textual data providing a descriptive the only hurdle historians encounter when using the
rather than a central role. The lack of ability of data relational model.
warehouses to cope with mainly non-numeric data is Date and time fields provide particular difficulties:
particularly problematic for humanities1 research uti- historical dating systems encompass a number of dif-
lizing material such as memoirs and trade directories. ferent calendars, including the Western, Islamic, Revo-
Recent innovations have opened up possibilities for lutionary and Byzantine. Historical data may refer to
non-numeric data warehouses, making them widely ac- the first Sunday after Michaelmas, requiring calcula-
cessible to humanities research for the first time. Due tion before a date may be entered into a database. Unfor-
to its irregular and complex nature, humanities research tunately, some databases and spreadsheets cannot handle
data is often difficult to model and manipulating time dates falling outside the late 20th century. Similarly, for
shifts in a relational database is problematic as is fitting researchers in historical geography, it might be neces-
such data into a normalized data model. History and sary to calculate dates based on the local introduction of
linguistics are exemplars of areas where relational data- the Gregorian calendar, for example. These difficulties
bases are cumbersome and which would benefit from can be time-consuming and arduous for researchers.
the greater freedom afforded by data warehouse dimen- Awkward and irregular data with abstruse dating systems
sional modeling. thus do not fit easily into a relational model that does not
lend itself to hierarchical data. Many of these problems
also occur in linguistics computing.
BACKGROUND Linguistics is a data-rich field, with multifarious
forms for words, multitudinous rules for coding sounds,
Hudson (2001, p. 240) declared relational databases to be words and phrases, and also numerous other parameters
the predominant software used in recent, historical re- - geography, educational and social status. Databases
search involving computing. Historical databases have are used for housing many types of linguistic data from
been created using different types of data from diverse a variety of research domains - phonetics, phonology,
countries and time periods. Some databases are modest morphology, syntax, lexicography, computer-assisted
and independent, others part of a larger conglomerate learning (CAL), historical linguistics and dialectology.
like the North Atlantic Population Project (NAPP) Data integrity and consistency are of utmost importance
project that entails integrating international census data. in this field. Relational DataBase Management Systems
One issue that is essential to good database creation is (RDBMSs) are able to provide this, together with pow-
data modeling; which has been contentiously debated erful and flexible search facilities (Nerbonne, 1998).
recently in historical circles. Bliss and Ritter (2001) discussed the constraints
When reviewing relational modeling in historical imposed on them when using the rigid coding structure
research, (Bradley, 1994) contrasted straightforward of the database developed to house pronoun systems
business data with incomplete, irregular, complex- or from 109 languages. They observed that coding intro-
semi-structured historical data. He noted that the rela- duced interpretation of data and concluded that design-
tional model worked well for simply-structured busi- ing a typological database is not unlike trying to fit a
ness data, but could be tortuous to use for historical square object into a round hole. Linguistic data is highly
data. (Breure, 1995) pointed out the advantages of in- variable, database structures are highly rigid, and the
two do not always fit. Brown (2001) outlined the fact

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Humanities Data Warehousing

that different database structures may reflect a particular is placed on choosing the right subjects to model as
linguistic theory, and also mentioned the trade-off be- opposed to being constrained to model around applica- 0
tween quality and quantity in terms of coverage. tions. Data warehouses do not replace databases as such
The choice of data model thus has a profound effect - they co-exist alongside them in a symbiotic fashion.
on the problems that can be tackled and the data that can Databases are needed both to serve the clerical commu-
be interrogated. For both historical and linguistic re- nity who answer day-to-day queries such as what is A.R.
search, relational data modeling using normalization Smiths current overdraft? and also to feed a data
often appears to impose data structures which do not fit warehouse. To do this, snapshots of data are extracted
naturally with the data and which constrain subsequent from a database on a regular basis (daily, hourly and in
analysis. Coping with complicated dating systems can the case of some mobile phone companies almost real-
also be very problematic. Surprisingly, similar difficul- time). The data is then transformed (cleansed to ensure
ties have already arisen in the business community, and consistency) and loaded into a data warehouse. In addi-
have been addressed by data warehousing. tion, a data warehouse can cope with diverse data sources,
including external data in a variety of formats and sum-
marized data from a database. The myriad types of data
MAIN THRUST of different provenance create an exceedingly rich and
varied integrated data source opening up possibilities
Data Warehousing in the Business not available in databases. Thus all the data in a data
warehouse is integrated. Crucially, data in a warehouse
Context is not updated - it is only added to, thus making it non-
volatile, which has a profound effect on data modeling,
Data warehouses came into being as a response to the as the main function of normalization is to obviate
problems caused by large, centralized databases which update anomalies. Finally, a data warehouse has a time
users found unwieldy to query. Instead, they extracted horizon (that is contains data over a period) of five to ten
portions of the databases which they could then control, years, whereas a database typically holds data that is
resulting in the spider-web problem where each de- current for two to three months.
partment produces queries from its own, uncoordinated
extract database (Inmon, 2001, 99. 6-14). The need was
thus recognized for a single, integrated source of clean
Data Modeling in a Data Warehouse
data to serve the analytical needs of a company. Dimensional Modeling
A data warehouse can provide answers to a com-
pletely different range of queries than those aimed at a There is a fundamental split in the data warehouse com-
traditional database. Using an estate agency as a typical munity as to whether to construct a data warehouse from
business, the type of question their local databases scratch, or to build them via data marts. A data mart is
should be able to answer might be How many three- essentially a cut-down data warehouse that is restricted
bedroomed properties are there in the Botley area up to to one department or one business process. Inmon (2001,
the value of 150,000? The type of over-arching ques- p. 142) recommended building the data warehouse first,
tion a business analyst (and CEOs) would be interested then extracting the data from it to fill up several data marts.
in might be of the general form Which type of property The data warehouse modeling expert Kimball (2002)
sells for prices above the average selling price for advised the incremental building of several data marts
properties in the main cities of Great Britain and how that are then carefully integrated into a data warehouse.
does this correlate to demographic data? (Begg & Whichever way is chosen, the data is normally modeled
Connolly, 2004, p. 1154). To trawl through each local via dimensional modeling. Dimensional models need to
estate agency database and corresponding local county be linked to the companys corporate ERD (Entity Re-
council database, then amalgamate the results into a lationship Diagram) as the data is actually taken from
report would take a long time and a lot of resources. The this (and other) source(s). Dimensional models are
data warehouse was created to answer this type of need. somewhat different from ERDs, the typical star model
having a central fact table surrounded by dimension
Basic Components of a Data tables. Kimball (2002, pp. 16-18) defined a fact table as
the primary table in a dimensional model where the
Warehouse numerical performance measurements of the business
are storedSince measurement data is overwhelmingly
Inmon (2002, p. 31), the father of data warehousing, the largest part of any data mart, we avoid duplicating it
defined a data warehouse as being subject-oriented, in multiple places around the enterprise. Thus the fact
integrated, non-volatile and time-variant. Emphasis table contains dynamic numerical data such as sales

571

TEAM LinG
Humanities Data Warehousing

quantities and sales and profit figures. It also contains required. It is comparatively easy to extend a data ware-
key data in order to link to the dimension tables. Dimen- house and add material from a new source. The data
sion tables contain the textual descriptors of the busi- cleansing techniques developed for data warehousing
ness process being modeled and their depth and breadth are of interest to researchers, as is the tracking facility
define the analytical usefulness of the data warehouse. afforded by the meta data manager (Begg & Connolly,
As they contain descriptive data, it is assumed they will 2004, p. 1169-1170).
not change at the same rapid rate as the numerical data in In terms of using data warehouses off the shelf,
the fact table that will certainly change every time the some humanities research might fit into the numerical
data warehouse is refreshed. Dimension tables typically fact topology, but some might not. The factless fact
have 50-100 attributes (sometimes several hundreds) table has been used to create several American univer-
and these are not usually normalized. The data is often sity data warehouses, but expertise in this area would
hierarchical in the tables and can be an accurate reflec- not be as widespread as that with normal fact tables. The
tion of how data actually appears in its raw state (Kimball whole area of data cleansing may perhaps be daunting
2002, pp. 19-21). There is not the need to normalize as data for humanities researchers (as it is to those in indus-
is not updated in the data warehouse, although there are try). Ensuring vast quantities of data is clean and con-
variations on the star model such as the snowflake and sistent may be an unattainable goal for humanities
starflake models which allow varying degrees of normal- researchers without recourse to expensive data cleans-
ization in some or all of their dimension tables. Coding is ing software. The data warehouse technology is far
disparaged due to the long-term view that definitions may from easy and is based on having existing databases to
be lost and that the dimension tables should contain the extract from, hence double the work. It is unlikely that
fullest, most comprehensible descriptions possible researchers would be taking regular snapshots of their
(Kimball 2002, p. 49). The restriction of data in the fact table data, as occurs in industry, but they could equate to data
to numerical data has been a hindrance to academic com- sets taken at different periods of time to data ware-
puting. However, Kimball has recently developed factless house snapshots (e.g. 1841 census, 1861 census).
fact tables (Kimball, 2002) which do not contain measure- Whilst many data warehouses use familiar WYSIWYGs
ments, thus opening the door to a much broader spectrum and can be queried with SQL-type commands, there is
of possible data warehouses. undeniably a huge amount to learn in data warehousing.
Nevertheless, there are many areas in both linguistics
Applying the Data Warehouse and historical research where data warehouses may
Architecture to Historical and Linguistic prove attractive.
Research

One of the major advantages of data warehousing is the


FUTURE TRENDS
enormous flexibility in modeling data. Normalization is
no longer an automatic straightjacket and hierarchies can Data Warehouses and Linguistics
be represented in dimension tables. The expansive time Research
dimension (Kimball, 2002, p. 39) is a welcome by-product
of this modeling freedom, allowing country-specific calen- The problems related by Bliss and Ritter (2001) con-
dars, synchronization across multiple time zones and the cerning rigid relational data structures and pre-coding
inclusion of multifarious time periods. It is possible to add problems would be alleviated by data warehousing.
external data from diverse sources and summarized data Brown (2001) outlined the dilemma arising from the
from the source database(s). The data warehouse is built alignment of database structures with particular lin-
for analysis which immediately makes it attractive to hu- guistic theories, and also the conflict of quality and
manities researchers. It is designed to continuously re- quantity of data. With a data warehouse there is room
ceive huge volumes (terabytes) of data, but is sensitive for both vast quantities of data and a plethora of detail.
enough to cope with the idiosyncrasies of geographic No structure is forced onto the data so several theoreti-
location dimensions within GISs (Kimball, 2002, p. 227). cal approaches can be investigated using the same data
Additionally a data warehouse has advanced indexing warehouse. Dalli (2001) observed that many linguistic
facilities that make it desirable for those controlling vast databases are standalone with no hope of
quantities of data. With a data warehouse it is theoretically interoperability. His proffered solution to create an
possible to publish the right data that has been collected interoperable database of Maltese linguistic data in-
from a variety of sources and edited for quality and con- volved an RDBMS and XML. Using data warehouses to
sistency. In a data warehouse all data is collated so a store linguistic data should ensure interoperability.
variety of different subsets can be analyzed whenever There is growing interest in corpora databases, with the

572

TEAM LinG
Humanities Data Warehousing

recent dedicated conference at Portsmouth, November trade figures. Similarly a Street directories data ware-
2003. Teich, Hansen and Fankhauser drew attention to the house would contain data from this rich source for 0
multi-layered nature of corpora and speculated as to how whole country for the last 100 years. Lastly, a Taxation
multi-layer corpora can be maintained, queried and ana- data warehouse could afford an overview of taxation of
lyzed in an integrated fashion. A data warehouse would different types, areas or periods. 19th century British
be able to cope with this complexity. census data does not fit into the typical data warehouse
Nerbonne (1998) alluded to the importance of co- model as it does not have the numerical facts to go into
ordinating the overwhelming amount of work being a fact table, but with the advent of factless fact tables a
done and yet to be done. Kretzschmar (2001) delin- data warehouse could now be made to house this data.
eated the challenge of preservation and display for The fact that some institutions have Oracle site licenses
massive amounts of survey data. There appears to be opens to way for humanities researchers with Oracle
many linguistics databases containing data from a range databases to use Oracle Warehouse Builder as part of
of locations/countries. For example, ALAP, the Ameri- the suite of programs available to them. These are
can Linguistic Atlas Project; ANAE, the Atlas of North practical project suggestions which would be impos-
American English (part of the TELSUR Project); TDS, sible to construct using relational databases, but which,
the Typological Database System containing European if achieved, could grant new insights into our history.
data; AMPER, the Multimedia Atlas of the Romance Comparisons could be made between counties and cit-
Languages. Possible research ideas for the future may ies and much broader analysis would be possible than
include a broadening of horizons - instead of the empha- has previously been the case.
sis on individual database projects, there may develop an
integrated warehouse approach with the emphasis on
larger scale, collaborative projects. These could com- CONCLUSION
pare different languages or contain many different types
of linguistic data for a particular language, allowing for The advances made in business data warehousing are
new orders of magnitude analysis. directly applicable to many areas of historical and lin-
guistics research. Data warehouse dimensional model-
Data Warehouses and Historical ing would allow historians and linguists to model vast
Research amounts of data on a countrywide basis (or larger),
incorporating data from existing databases and other
There are inklings of historical research involving data external sources. Summary data could also be included,
warehousing in Britain and Canada. A data warehouse of and this would all lead to a data warehouse containing
current census data is underway at the University of more data than is currently possible, plus the fact that the
Guelph, Canada and the Canadian Century Research data would be richer than in current databases due to the
Infrastructure aims to house census data from the last fact that normalization is no longer obligatory. Whole
100 years in data marts constructed using IBM software data sources could be captured, and more post-hoc analy-
at several sites based in universities across the country. sis would result. Dimension tables particularly lend them-
At the University of Portsmouth, UK, a historical data selves to hierarchical modeling, so data would not need
warehouse of American mining data is under construc- splitting into many tables thus forcing joins while query-
tion using Oracle Warehouse Builder (Delve, Healey, & ing. The time dimension particularly lends itself to his-
Fletcher, 2004). These projects give some idea of the torical research where significant difficulties have been
scale of project a data warehouse can cope with that is, encountered in the past. These suggestions for historical
really large country-/state-wide problems. Following and linguistics research will undoubtedly resonate in
these examples, it would be possible to create a data other areas of humanities research, such as historical
warehouse to analyze all British censuses from 1841 to geography, and any literary or cultural studies involving
1901 (approximately 108 bytes of data). Data from a textual analysis (for example biographies, literary criti-
variety of sources over time such as hearth tax, poor cism and dictionary compilation).
rates, trade directories, census, street directories, wills
and inventories, GIS maps for a city such as Winchester
could go into a city data warehouse. Such a project is REFERENCES
under active consideration for Oslo, Norway. Similarly,
a Voting data warehouse could contain voting data poll Begg, C., & Connolly, T. (2004). Database systems.
book data and rate book data up to 1870 for the whole Harlow: Addison-Wesley.
country. A Port data warehouse could contain all data
from portbooks for all British ports together with yearly Bliss, & Ritter (2001). IRCS (Institute for Research
into Cognitive Science) Conference Proceedings. Re-
573

TEAM LinG
Humanities Data Warehousing

trieved from http://www.ldc.upenn.edu/annotation/da- Wal-Mart. Retrieved from http://www.tgc.com/dsstar/99/


tabase/proceedings.html 0824/100966.html
Bradley, J. (1994). Relational database design. History
and Computing, 6(2), 71-84.
Breure, L. (1995). Interactive data entry. History and
KEY TERMS
Computing, 7(1), 30-49.
Data Modeling: The process of producing a model
Brown. (2001). IRCS Conference Proceedings. Re- of a collection of data which encapsulates its semantics
trieved from http://www.ldc.upenn.edu/annotation/ and hopefully its structure.
database/proceedings.html
Dimension Table: Dimension tables contain the tex-
Burt2, J., & James, T. B. (1996). Source-Oriented Data tual descriptors of the business process being modeled
Processing: The triumph of the micro over the macro. and their depth and breadth define the analytical useful-
History and Computing, 8(3), 160-169. ness of the data warehouse.
Dalli. (2001). IRCS Conference Proceedings. Retrieved Dimensional Model: The dimensional model is the
from http://www.ldc.upenn.edu/annotation/database/ data model used in data warehouses and data marts, the
proceedings.html most common being the star schema, comprising a fact
table surrounded by dimension tables.
Delve, J., Healey, R., & Fletcher, A. (2004, July).
Teaching data warehousing as a standalone unit using Fact Table: the primary table in a dimensional model
Oracle Warehouse Builder. Proceedings of the British where the numerical performance measurements of the
Computer Society TLAD Conference, Edinburgh, UK. business are stored (Kimball, 2002).
Denley, P. (1994). Models, sources and users: Histori- Factless Tact Table: A fact table which contains no
cal database design in the 1990s. History and Comput- measured facts (Kimball 2002).
ing, 6(1), 33-43.
Normalization: The process developed by E.C. Codd
Hudson, P. (2000). History by numbers. An introduc- whereby each attribute in each table of a relational
tion to quantitative approaches. London: Arnold. database depend entirely on the key(s) of that table. As
a result, relational databases comprise many tables,
Inmon, W. H. (2002). Building the data warehouse. each containing data relating to one entity.
New York: Wiley.
Snowflake Schema: A dimensional model having
Kimball, R., & Ross, M. (2002). The data warehouse some or all of the dimension tables normalized.
toolkit. New York: Wiley.
Kretzschmar. (2001). IRCS Conference Proceedings.
Retrieved from http://www.ldc.upenn.edu/annotation/
database/proceedings.html
ENDNOTES

Nerbonne, J. (Ed.). (1998). Linguistic databases. 1


learning or literature concerned with human cul-
Stanford, CA: CSLI Publications. ture (Compact OED)
SKICAT. Retrieved from http://www-aig.jpl.nasa.gov/
2
Now Delve
public/mls/home/fayyad/SKICAT-PR1-94.html
Teich, Hansen, & Fankhauser. (2001). IRCS Conference
Proceedings. Retrieved from http://www.ldc.upenn.edu/
annotation/database/proceedings.html

574

TEAM LinG
575

Hyperbolic Space for Interactive Visualization 0


Jrg Andreas Walter
University of Bielefeld, Germany

INTRODUCTION mensions). Unfortunately, there is no good embedding


of the H2 in R3, which makes it harder to grasp the unusual
For many tasks of exploratory data analysis, visualiza- properties of the H2. Local patches resemble the situation
tion plays an important role. It is a key for efficient at a saddle point, where the neighborhood grows faster
integration of human expertise not only to include his than in flat space. Standard textbooks on Riemannian
background knowledge, intuition and creativity, but also geometry (see, e.g., Morgan, 1993) examine the relation-
his powerful pattern recognition and processing capa- ship and expose that the area a of a circle of radius r are
bilities. The design goals for an optimal user interaction given by
strongly depend on the given visualization task, but they
certainly include an easy and intuitive navigation with a(r) = 4 sinh2(r/2) (1)
strong support for the users orientation.
Since most of available data display devices are two- This bears two remarkable asymptotic properties: (i)
dimensional paper and screens the following for small radius r the space looks flat since a(r) = r2.
problem must be solved: finding a meaningful spatial (ii) For larger r both grow exponentially with the
mapping of data onto the display area. One limiting radius. As observed in Lamping & Rao (1994, 1999) and
factor is the restricted neighborhood around a point in Lamping et al. (1995), this property makes the hyper-
a Euclidean 2-D surface. Hyperbolic spaces open an bolic space ideal for embedding hierarchical structures
interesting loophole. The extraordinary property of ex- (since the number of leaves grows exponentially with
ponential growth of neighborhood with increasing ra- the tree depth). This led to the development of the
dius around all points enables us to build novel hyperbolic tree browser at Xerox Parc (today
displays and browsers. We describe a versatile hyper- starviewer: see www.inxight.com).
bolic data viewer for building data landscapes in a non- The question how effective is visualization and navi-
Euclidean space, which is intuitively browseable with a gation in the hyperbolic space was studied by Pirolli et
very pleasing focus and context technique. al. (2001). By conducting eye-tracker experiments they
found that the focus+context navigation can signifi-
cantly accelerate the information foraging. Risden et
BACKGROUND al. (2000) compared traditional and hyperbolic brows-
ers and found significant improvement in task execution
time for this novel display type.
The Hyperbolic Space H2

2,300 years ago, the Greek mathematician Euclid


founded his geometry on five axioms. The fifth, the
MAIN THRUST
parallel axiom, appeared unnecessary to many of his
colleagues. And they tried hard to prove it derivable In order to use the visualization potential of the H2 we
without success. After almost 2,000 years Lobachevsky must solve two problems:
(1793-1856), Bolyai (1802-1860), and Gauss (1777-
1855) negated the axiom and independently discovered (P1) How can we accommodate the data in the
the non-Euclidean geometry. There exist only two ge- hyperbolic space, and
ometries with constant non-zero curvature. Through our (P2) how to project the H2 onto a suitable display?
sight of common spherical surfaces (e.g., earth, orange)
we are familiar with the spherical geometry and its In the following the answers are described in reverse
constant positive curvature. Its counterpart with con- sequence, first for the second problem (P2) and then
stant negative curvature is known as the hyperbolic three principal techniques for P1, which are today avail-
plane H2 (with analogous generalizations to higher di- able for different data types.

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Hyperbolic Space for Interactive Visualization

Figure 1. Regular H2 tessellation with equilateral triangles (here 8 triangles meet at each vertex). Three
snapshots of a simple focus transfer are visible. Note the circular appearance of lines in the Poincar disk PD
and the fish-eye-lens effect: the triangles in the center appear larger and take less space in regions further
away.
Poincare Poincare Poincare

Solution for P2: Poincar Disk PD Regular Tessellations with triangles offer richer
possibilities than the R2. It turns out that there is an
For practical and technological reasons most available infinite set of choices to tessellate H2: for any
displays are flat. The perfect projection into the flat integer n 7, one can construct a regular tessella-
display area should preserve length, area, and angles tions in which n triangles meet at each vertex (in
(=form). But it lays in the nature of a curved space to resist contrast to the plane with allows only n=3,4,6 and
the attempt to simultaneously achieve these goals. Con- the sphere only n=3,4,5). Figure 1 depicts an ex-
sequently several projections or maps of the hyperbolic amples for n=8.
space were developed, four are especially well examined: Moving Around and Changing the Focus: For
(i) the Minkowski, (ii) the upper-half plane, (iii) the Klein- changing the focus point in PD we need a transla-
Beltrami, and (iv) the Poincar or disk mapping. For our tion operation, which can be bound to mouse click
purpose the latter is particularly suitable. Its main charac- and drag events. In the Poincar disk model the
teristics are: Mbius transformation T(z) is the appropriate so-
lution. By describing the Poincar disk PD as the
Display Compatibility: The infinite large area unit circle in the complex plane, the isometric
of the H2 is mapped entirely into a circle, the transformations for a point zPD can be written as
Poincar disk PD.
Circle Rim is Infinity : All remote points are z = T(z; c, ) = ( z+c)/(c* z+1), with | |=1,
close to the rim, without touching it. |c|<1. (2)
Focus+Context: The focus can be moved to each
location in H2, like a fovea. The zooming factor Here the complex number describes a pure rota-
is 0.5 in the center and falls (exponentially) off tion of PD around the origin 0 (the star * denotes
with distance to the fovea. Therefore, the context complex conjugation). The following translation by c
appears very natural. As more remote things are, maps the origin to c and -c becomes the new center 0 (if
the less spatial representation is assigned in the =1). The Mbius transformations are also called the
current display (compare Figure 1). circle automorphies of the complex plane, since they
describe the transformations from circles to (general-
Lines Become Circles: All H2-lines appear as
ized) circles. Here they serve to translate H2 straight
circle arc segments of centered straight lines in lines to lines both appearing as generalized circles in
PD (both belong to the set of so-called general- the PD projection. For further details, see for example,
ized circles). There extensions cross the PD-rim Lamping & Rao (1999) or Walter (2004).
always perpendicular on both ends.
Conformal Mapping: Angles (and therefore Three Layout Techniques in H2
form) relations are preserved in PD, area and
length relations obviously not. Now we turn to the question raised earlier: how to accom-
modate data in the hyperbolic space. In the following

576

TEAM LinG
Hyperbolic Space for Interactive Visualization

section three known layout techniques for the H2 are has its prototype vector wa closest to the given input a
shortly reviewed. BMU
= argmin a| wa - x |.
H
The distribution of the reference vectors wa, is itera-
Hyperbolic Tree Layout (HTL) for Tree- tively adapted by a sequence of training vectors x. After
Like Graph Data finding the aBMU all reference vectors are updated to-
wards the stimulus x: wa new:= wa old + h(d a, aBMU)(x - wa).
A first solution to this question for the case of acyclic, tree- Here h(.) is a bell-shaped Gaussian centered at the BMU
like graph data in H2 was provided by Lamping & Rao and decaying with increasing distance d a, aBMU =|ga - g aBMU|
(1994, 1999). Each tree node receives a certain open space in the node ga. Thus, each node or neuron in the
pie segment, where the node chooses the locations of its neighborhood of the aBMU participates in the current learn-
siblings. For all its siblings i it calls recursively the layout- ing step (as indicated by the gray shading in Figure 2).
routine after applying the Mbius transformation in order This neighborhood cooperation in the adaptation
to center i. algorithm has important advantages: (i) it is able to
Tamara Munzner developed another graph layout al- generate topological order between the wa, which means
gorithm for the three-dimensional hyperbolic space that similar inputs are mapped to neighboring nodes;
(1997). While she gains much more space for the layout, (ii) As a result, the convergence of the algorithm can be
the problem of more complex navigation (and viewport sped up by involving a whole group of neighboring
control) in 3-D and, more serious, the problem of occlu- neurons in each learning step.
sion appears. The structure of this neighborhood is essentially
The next two layout techniques are freed from the governed by the structure of h(a, aBMU) = h(d a, aBMU)
requirement of hierarchical data. therefore also called the neighborhood function.
While most learning and visualization applications choose
d a, aBMU as distances in an rectangular (2-D, 3-D) lattice
Hyperbolic Self-Organizing Map (HSOM) this can be generalized to the non-Euclidean case as
suggested by Ritter (1999). The core idea of the Hyper-
The standard Self-Organizing Map (SOM) algorithm is bolic Self-Organizing Map (HSOM) is to employ an H2-
used in many applications for learning and visualization grid of nodes. A particular convenient choice is to take the
(Kohonen, 2001). Figure 2 illustrates the basic opera- ga in PD of a finite patch of the triangular tessellation grid
tion. The feature map is built by a lattice of nodes (or as displayed in Figure 2. The internode distance is com-
formal neurons) a A, each with a reference vector or puted in the appropriate Poincar metric (see equation 4).
prototype vector wa attached, projecting into the input
space X. The response of a SOM to an input vector x is
determined by the reference vector of wa of the discrete
Hyperbolic Multidimensional Scaling
best-matching unit (BMU) a BMU, that is, the node which (HMDS)

Multidimensional scaling refers to a class of algo-


rithms for finding a suitable embedding of N objects in
Figure 2. The Hyperbolic Self-Organizing Map is formed
a low dimensional usually Euclidean space given
by a grid of processing units, called neurons. Here the
only pairwise proximity values (in the following we
usual rectangular grid is replaced by a part of a
represent proximity as dissimilarity or distance values
tessellation grid seen in Figure 1.
between pairs of objects, mathematically written as
dissimilarity ij 0 between the i and j item). Often the
raw dissimilarity distribution is not suitable for the
low-dimensional embedding and an additional prepro-
a* cessing step is applied. We model it here as a mono-
tonic transformation D(.) of dissimilarities ij into
disparities Dij = D(ij).
x The goal of the MDS algorithm is to find a spatial
representation xi of each object i in the M-dimensional
w a* space, where the pair distances d ij = d(xi , xj) (in the
Grid of target space) match the disparities Dij as faithfully as
Neurons a
possible i j: Dij ij.
One of the most widely known MDS algorithms was
Input Space X introduced by Sammon (1969). He formulates a mini-

577

TEAM LinG
Hyperbolic Space for Interactive Visualization

mization problem of a cost function which sums over the ibility of the entire structure and space for naviga-
squares of disparities-distance misfits tion in the detail-rich areas.
Comparison: The following table compares the main
E({xi}) = i=1N j>i wi j (ij - Dij)2. (3) properties of the three available layout techniques.

The factors wi j are introduced to weight the disparities All three techniques share the concept of spatial
individually and also to normalize the cost function E() to proximity representing similarities in the data. In HTL
be independent to the absolute scale of the disparities Dij. close relatives are directly connected. Skupin (2002)
The set of xi is found by a gradient descent procedure, pointed out that we humans learn early the usage and the
minimizing iteratively the cost or stress function [see versatile concepts of maps and suggested to build dis-
Sammon (1969) or Cox (1994) for further details on this and plays implementing this underlying map metaphor. The
other MDS algorithms]. HSOM can handle many objects and can generate topic
The recently introduced Hyperbolic Multi-Dimen- maps, while the HMDS is ideal to smaller sets, pre-
sional Scaling (HMDS) combines the concept of MDS sented on the object level (see below Figures 4 & 5).
and hyperbolic geometry (Walter & Ritter, 2002). In-
stead of finding a MDS solution in the low-dimensional Application Examples
Euclidean RM and transferring it to the H2 (which can
not work well), the MDS formalism operates in the
hyperbolic space from the beginning. The key is to Even though the look and feel of an interactive visualiza-
replace the Euclidean distance in the target space by the tion and navigation is hardly compressible to paper
appropriate distance metric for the Poincar model. format, we present some application screenshots of
visualization experiments. Figure 3 displays an applica-
dij = 2 arctanh [ | xi - xj | / |1- x i xj* | ] with xi , xj PD; tion of browsing image collections with an HMDS. Here
(4) the direct image features based on color are employed
to define the concept of similarity.
While the gradients ij,q/ xi, q required for the gradi- Figures 4 and 5 depict the hybrid combination of
ent descent are rather simple to compute for the Euclidean HSOM and HMDS for an application from the field of
geometry, the case becomes complex for HMDS, see text mining of newsfeed articles. The similarity concept
Walter (2002, 2004) for details. is based on semantic information gained via a standard
bag-of-words model of the unstructured text, here
Reuters news articles [see Walter et al. (2003) for
Disparity preprocessing: Due to the non-lin-
further details].
earity of the distance function above, the prepro-
cessing function D(.) has more influence in H2.
Consider, for example, linear rescaling of the
dissimilarities Dij = dij. In the Euclidean case the
FUTURE TRENDS
visual structure is not affected only magnified
by . In contrast in H2, a scales the distribution and A closer look at the comparative table above suggests a
with it the amount of curvature felt by the data. The hybrid architecture of techniques. This includes the com-
optimal depends on the given task, the dataset, bination of HSOM and HMDS for a two-stage navigation
and its dissimilarity structure. One way is to set and retrieval process. This embraces a coarse grain theme
manually and choose a compromise between vis- map (e.g., with the HSOM) and a detailed map using

Table 1.

                      HTL                        HSOM                                  HMDS
Data type             Acyclic graph data         Vectorial data                        Dissimilarity / distance data
Scaling (#objects N)  O(N)                       O(N)                                  O(N^2)
Layout                Recursive space            Assignment to grid of neurons         Spatial arrangement resembles
                      partitioning               with topographic ordering             similarity structure of the data
New object            Partial re-layout          Map to BMU (possibly adapt)           Iterate cost minimization procedure
Result                Good for tree data         Handles large data collections;       Good for up to a few hundred
                                                 coarse topic / theme maps             objects; detail-level map


Figure 3. Snapshot of a Hyperbolic Image Viewer using a color distance metric for pairwise dissimilarity definition for 100 pictures. Note that the dynamic zooming of images depends on the distance to the current focus point.

Figure 4. An HSOM projection of a large collection of newswire articles (Reuters-21578) forms semantically related category clusters, as seen by the glyphs indicating the dominant out-of-band category label of the news objects gathered by the HSOM in each node. The similarity of objects is here derived from the angle of the feature vectors in the bag-of-words or vector space model of unstructured text [the standard model in information retrieval, after Salton (1988)].

Figure 5. Screenshot of the HMDS visualizing all documents in the node previously selected in the HSOM (direction 5 o'clock). The news headlines were enabled for the article cluster now in focus. They reveal that the object group is semantically close and all related to a strike in the oilseed company Cargill U.K.

FUTURE TRENDS

A closer look at the comparative table above suggests a hybrid architecture of techniques. This includes the combination of HSOM and HMDS for a two-stage navigation and retrieval process. This embraces a coarse-grain theme map (e.g., with the HSOM) and a detailed map using HMDS, as suggested in the Hybrid Hyperbolic Data Viewer prototype in Walter et al. (2003).

The HMDS allows different notions of similarity to be supported and, furthermore, the actual distance metric to be modulated dynamically while observing the change in the spatial arrangement of the objects. This feature is valuable in various multimedia applications with various natural and task-dependent notions of similarity. For example, when browsing a collection of images, the similarity of objects can be based on textual description, metadata, or image features, for example, using color, shape, and texture. Due to the general data type dissimilarity, the dynamic modulation of the distance metric is possible and can be controlled via the GUI or by other techniques such as relevance feedback, which has proved successful in the field of information retrieval.
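One simple way to realise such a user-controlled modulation is sketched below in Python (an illustration, not the authors' implementation; the feature channels and weights are invented for the example): the dissimilarity fed into the HMDS layout is a convex combination of per-channel dissimilarity matrices whose weights can be changed from the GUI, after which the layout is re-run.

import numpy as np

def combined_dissimilarity(d_color, d_texture, d_text,
                           w_color=1.0, w_texture=1.0, w_text=1.0):
    # User-adjustable mixture of several pairwise dissimilarity matrices;
    # changing the weights and re-running the layout corresponds to the
    # dynamic modulation of the distance metric described above.
    w = np.array([w_color, w_texture, w_text], dtype=float)
    w = w / w.sum()
    return w[0] * d_color + w[1] * d_texture + w[2] * d_text

# toy example with random symmetric dissimilarity matrices for 5 objects
rng = np.random.default_rng(0)

def random_dissimilarity(n):
    m = rng.random((n, n))
    m = (m + m.T) / 2.0
    np.fill_diagonal(m, 0.0)
    return m

D = combined_dissimilarity(random_dissimilarity(5), random_dissimilarity(5),
                           random_dissimilarity(5), w_color=2.0, w_text=0.5)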
CONCLUSION

Information visualization can benefit from the use of the hyperbolic space as a projection manifold. On the one hand it gains the exponentially growing space around each point, which provides extra space for compressing object relationships. On the other hand the Poincaré model offers superb visualization and navigation properties, which were found to yield significant improvement in task time compared to traditional browsing methods
(Risden et al., 2000; Pirolli et al., 2001). By simple mouse interaction the focus can be transferred to any location of interest. The core area close to the center of the Poincaré disk magnifies the data with a maximal zoom factor, which decreases gradually towards the outer area. The object placeholder (text box, image thumbnail, etc.) is scaled in proportion. The fovea is an area with high resolution, while remote areas are gradually compressed but are still visible as context. Interestingly, this situation resembles the log-polar density distribution of neurons in the retina, which governs the natural resolution allocation in our visual perception system.

REFERENCES

Cox, T., & Cox, M. (1994). Multidimensional scaling. Monographs on statistics and applied probability. Chapman & Hall.

Kohonen, T. (2001). Self-organizing maps. Berlin: Springer.

Lamping, J., & Rao, R. (1994). Laying out and visualizing large trees using a hyperbolic space. In ACM Symposium on User Interface Software and Technology (pp. 13-14).

Lamping, J., & Rao, R. (1999). The hyperbolic browser: A focus+context technique for visualizing large hierarchies. In Readings in Information Visualization (pp. 382-408). Morgan Kaufmann.

Lamping, J., Rao, R., & Pirolli, P. (1995). A focus+context technique based on hyperbolic geometry for viewing large hierarchies. In ACM SIGCHI Conference on Human Factors in Computing Systems (pp. 401-408).

Morgan, F. (1993). Riemannian geometry: A beginner's guide. Jones and Bartlett Publishers.

Munzner, T. (1997). H3: Laying out large directed graphs in 3D hyperbolic space. In Proceedings of the IEEE Symposium on Information Visualization (pp. 2-10).

Pirolli, P., Card, S.K., & Van Der Wege (2001). Visual information foraging in a focus+context visualization. In CHI (pp. 506-513).

Risden, K., Czerwinski, M., Munzner, T., & Cook, D. (2000). An initial examination of ease of use for 2D and 3D information visualizations of Web content. International Journal of Human Computer Studies, 53(5), 695-714.

Ritter, H. (1999). Self-organizing maps on non-Euclidean spaces. In Kohonen Maps (pp. 97-110). Elsevier.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 5(24), 513-523.

Sammon, J. (1969). A non-linear mapping for data structure analysis. IEEE Transactions on Computers, 18, 401-409.

Skupin, A. (2002). A cartographic approach to visualizing conference abstracts. IEEE Computer Graphics and Applications (pp. 50-58).

Walter, J. (2004). H-MDS: A new approach for interactive visualization with multidimensional scaling in the hyperbolic space. Information Systems, 29(4), 273-292.

Walter, J., Ontrup, J., Wessling, D., & Ritter, H. (2003). Interactive visualization and navigation in large data collections using the hyperbolic space. In IEEE International Conference on Data Mining (ICDM'03) (pp. 355-362).

Walter, J., & Ritter, H. (2002). On interactive visualization of high-dimensional data using the hyperbolic plane. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 123-131).

KEY TERMS

Focus+Context Technique: Allows the focus to be transferred interactively as desired, while the context of the region in focus remains in view with gradually degrading resolution towards the rim. In contrast, the standard zoom+panning technique allows selecting the magnification factor, trading detail against overview and involving harsh view cut-offs.

H2: Two-dimensional hyperbolic space (plane).

H2DV: The Hybrid Hyperbolic Data Viewer incorporates a two-stage browsing and navigation using the HSOM for a coarse thematic mapping of large object collections and the HMDS for detailed inspection of smaller subsets at the object level.

HMDS: Hyperbolic Multi-Dimensional Scaling for laying out objects in the H2 such that the spatial arrangement resembles the dissimilarity structure of the data as closely as possible.

HSOM: Hyperbolic Self-Organizing Mapping. Extension of Kohonen's topographic map, which offers the exponential growth of neighborhood for the inner nodes.

Hyperbolic Geometry: Geometry with constant negative curvature, in contrast to the flat Euclidean geometry. The space around each point grows exponentially and allows better layout results than the corresponding flat space.

Hyperbolic Tree Viewer: Visualizes tree-like graphs in the Poincaré mapping of the H2 or H3.

Poincaré Mapping of the H2: A conformal mapping from the entire H2 into the unit circle in the flat R2 (disk model).


Identifying Single Clusters in Large Data Sets


Frank Klawonn
University of Applied Sciences Braunschweig/Wolfenbuettel, Germany

Olga Georgieva
Institute of Control and System Research, Bulgaria

INTRODUCTION

Most clustering methods have to face the problem of characterizing good clusters among noise data. The arbitrary noise points that just do not belong to any class being searched for are of a real concern. The outliers or noise data points are data that severely deviate from the pattern set by the majority of the data, and rounding and grouping errors result from the inherent inaccuracy in the collection and recording of data. In fact, a single outlier can completely spoil the least squares (LS) estimate and thus the results of most LS-based clustering techniques such as the hard C-means (HCM) and the fuzzy C-means algorithm (FCM) (Bezdek, 1999).

For these reasons, a family of robust clustering techniques has emerged. There are two major families of robust clustering methods. The first includes techniques that are directly based on robust statistics. The second family, assuming a known number of clusters, is based on modifying the objective function of FCM in order to make the parameter estimates more resistant to the data noise. Among them one promising approach is the noise clustering (NC) technique (Dave, 1991; Klawonn, 2004). It maintains the principle of probabilistic clustering, but an additional noise cluster is introduced. NC was developed and investigated in the context of a variety of objective function-based clustering algorithms and it has demonstrated its reliable ability to detect clusters amongst noise data.

BACKGROUND

Objective function-based clustering aims at minimizing an objective function that indicates a kind of fitting error of the determined clusters to the given data. In this objective function, the number of clusters has to be fixed in advance. However, as the number of clusters is usually unknown, an additional scheme has to be applied to determine the number of clusters (Guo, 2002; Tao, 2002). The parameters to be optimized are the membership degrees, which are values of belonging of each data point to every cluster, and the parameters characterizing the cluster, which finally determine the distance values. In the simplest case, a single vector named cluster centre (prototype) represents each cluster. The distance of a data point to a cluster is simply the Euclidean distance between the cluster centre and the corresponding data point. More generally, one can use the squared inner-product distance norm, in which, by a norm-inducing symmetric and positive matrix, different variances in the directions of the coordinate axes of the data space are accounted for. If the norm-inducing matrix is the identity matrix we obtain the standard Euclidean distance that forms spherical clusters. Clustering approaches that use more complex cluster prototypes than only the cluster centres, leading to adaptive distance measures, are for instance the Gustafson-Kessel (GK) algorithm (Gustafson, 1979), the volume adaptation strategy (Höppner, 1999; Keller, 2003) and the Gath-Geva (GG) algorithm (Gath, 1989). The latter is not a proper objective function algorithm, but corresponds to a fuzzified expectation maximization strategy. No matter which kind of cluster prototype is used, the assignment of the data to the clusters is based on the corresponding distance measure. In hard clustering, a data object is assigned to the closest cluster, whereas in fuzzy clustering a membership degree, that is, a value that belongs to the interval [0,1], is computed. The highest membership degree of a data object corresponds to the closest cluster.

Noise clustering has the benefit of collecting the noise points in one single cluster. A virtual noise prototype with no parameters to be adjusted is introduced that always has the same distance to all points in the data set. The remaining clusters are assumed to be the good clusters in the data set. The objective function that considers the noise cluster is defined in the same manner as the general scheme for the clustering minimization functional. The main problem of NC is the proper choice of the noise distance. If it is set too small, then most of the points will get classified as noise points, while for a large noise distance most of the points will be classified into clusters other than the noise cluster. A right selection of the distance will result in a classification where the points that are
actually close enough to the good clusters will get correctly classified into a good cluster, while the points that are away from the good clusters will get classified into the noise cluster. The detection of noise and outliers is a serious problem and has been addressed in various approaches (Leski, 2003; Yang, 2004; Zhang, 2003). However, all these methods need complex additional computations.

MAIN THRUST

The purpose of most clustering algorithms is to partition a given data set into clusters. However, in data mining tasks partitioning is not always the main goal. Finding interesting patterns or substructures in a large data set in the form of one or a few clusters that do not cover or partition the data set completely is an important issue in data mining. For this purpose a new clustering algorithm named noise clustering with one good cluster, based on the noise clustering technique and able to detect single clusters step-by-step in a given data set, has been developed recently (Georgieva, 2004). In addition to identifying clusters step by step, as a side effect noise data are detected automatically.

The algorithm assesses the dynamics of the number of points that are assigned to only one good cluster of the data set by slightly decreasing the noise distance. Starting with some large enough noise distance, it is decreased with a prescribed decrement until the reasonable smallest distance is reached. The number of data belonging to the good cluster is calculated for every noise distance using the formula for the hard membership values or fuzzy membership values, respectively. Note that in this scheme only one cluster centre has to be computed, which in the case of hard noise clustering is the mean value of the good cluster data points and in the case of fuzzy noise clustering is the weighted average of all data points. It is obvious that by decreasing the noise distance a process of losing data, that is, better separating them to the noise cluster, will begin. Continuing to decrease the noise distance, we will start to separate points from the good cluster and add them to the noise cluster. A further reduction of the noise distance will lead to a decreasing amount of data in the good cluster until the cluster is entirely empty, as all data will be assigned to the noise cluster. The described dynamics can be illustrated in a curve viewing the number of data points assigned to the good cluster over the noise distance. In this curve a plateau indicates that we are in a phase of assigning proper noise data to the noise cluster, whereas a strong slope means that we actually lose data belonging to the good cluster to the noise cluster.

Generally, a number of clusters with different shapes and densities exist in large data sets and thus a complicated dynamics of the data assigned to the single cluster will be observed. However, the smooth part of the considered curve corresponds to the situation that a relatively small amount of data is removed, which is usually caused by losing noise data. When we lose data from a good cluster (with higher density than the noise data), a small decrease of the noise distance will lead to a large amount of data being lost to the noise cluster, so that we will see a strong slope in our curve instead of a plateau. Thus, a strong slope indicates that at least one cluster is just removed and separated from the (single) good cluster we try to find. In this way the algorithm determines the number of clusters and detects noise data. It does not depend on the initialisation, so that the danger of converging into local optima is further reduced compared to standard fuzzy clustering (Höppner, 2003).

The described procedure is implemented in a cluster identification algorithm that assesses the dynamics of the quantity of the points assigned to the good cluster or, equivalently, assigned to the noise cluster through the slight decrease of the noise distance. By detecting the strong slopes the algorithm separates one cluster at every algorithm pass. A significant reduction of the noise is achieved even in the first algorithm pass. The clustering procedure is repeated by proceeding with a smaller data set, as the original one is reduced by the identified noise data and the data belonging to the already identified cluster(s).

The curve that is determined by the dynamics of the data assigned to the noise cluster is smoother in the case of fuzzy noise clustering compared to the hard clustering case due to the fuzzily defined membership values. Also, local minima of this curve could be observed due to the given freedom of the points to belong to both the good and the noise cluster simultaneously. Fuzzy clustering can deal with more complex data sets than hard clustering due to the given relative degree of membership of a point to the good cluster. However, for the same reason the amount of the identified noise points is less than in the hard clustering case.
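The following minimal Python sketch (illustrative only; the decrement, the number of alternating optimisation steps and the toy data are assumptions, not taken from the article) shows the hard-assignment variant of the sweep described above: the noise distance is gradually decreased while the size of the single good cluster is tracked.

import numpy as np

def good_cluster_sweep(X, d_start, d_min, step):
    # Track how many points stay in the single good cluster while the
    # noise distance is gradually decreased (hard noise clustering variant):
    # a point belongs to the good cluster if it is closer to the cluster
    # centre than the current noise distance.
    counts = []
    noise_distances = np.arange(d_start, d_min, -step)
    for delta in noise_distances:
        members = X                       # start with all points
        for _ in range(10):               # a few alternating optimisation steps
            centre = members.mean(axis=0) if len(members) else X.mean(axis=0)
            dist = np.linalg.norm(X - centre, axis=1)
            members = X[dist < delta]
        counts.append(len(members))
    return noise_distances, np.array(counts)

# toy data: one dense cluster plus uniform noise
rng = np.random.default_rng(1)
cluster = rng.normal(loc=[2.0, 2.0], scale=0.2, size=(100, 2))
noise = rng.uniform(low=-5, high=5, size=(50, 2))
X = np.vstack([cluster, noise])

d, n = good_cluster_sweep(X, d_start=8.0, d_min=0.1, step=0.1)
# a plateau in n indicates that noise is being shed; a strong slope indicates
# that points of the good cluster itself start to be lost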


FUTURE TRENDS

Whereas standard clustering partitions the whole data set, the main goal of the noise clustering with one good cluster is to identify single clusters even in the case when a large part of the data does not have any kind of group structure at all. This will have a large benefit in some application areas of cluster analysis like, for instance, gene expression data and astrophysics data analysis, where the ultimate goal is not to partition the data, but to find some well-defined clusters that only cover a small fraction of the data. By removing a large amount of the noise data, the obtained clusters are used to find some interesting substructures in the data set.

One future improvement will lead to an extended procedure that first finds the location of the centres of the interesting clusters using the standard Euclidean distance. Then, the algorithm could be started again with these cluster centres but using more sophisticated distance measures such as GK or the volume adaptation strategy, which can better adapt to the shape of the identified cluster. For extremely large data sets the algorithm can be combined with speed-up techniques such as the one proposed by Höppner (2002).

Another further application consists in the incorporation of the proposed algorithm as an initialisation step for other, more sophisticated clustering algorithms, providing the number of clusters and their approximate location.

CONCLUSION

A clustering algorithm, based on the principles of noise clustering and able to find good clusters with variable shape and density step-by-step, is proposed. It can be applied to find just a few substructures, but also as an iterative method to partition a data set, including the identification of the number of clusters and noise data. The algorithm is applicable in terms of both hard and fuzzy clustering. It automatically determines the number of clusters, while in standard objective function-based clustering additional strategies have to be applied in order to define the number of clusters.

The identification algorithm is independent of the dimensionality of the considered problem. It does not require comprehensive or costly computations and the computation time increases only linearly with the number of data points.

REFERENCES

Bezdek, J.C., Keller, J., Krisnapuram, R., & Pal, N.R. (1999). Fuzzy models and algorithms for pattern recognition and image processing. Kluwer Academic Publishers.

Dave, R.N. (1991). Characterization and detection of noise in clustering. Pattern Recognition Letters, 12, 657-664.

Gath, I., & Geva, A.B. (1989). Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7, 773-781.

Georgieva, O., & Klawonn, F. (2004). A clustering algorithm for identification of single clusters in large data sets. In Proceedings of the East West Fuzzy Colloquium, 81 (pp. 118-125), Sept. 8-10, Zittau, Germany.

Guo, P., Chen, C.L.P., & Lyu, M.R. (2002). Cluster number selection for a small set of samples using the Bayesian Ying-Yang model. IEEE Transactions on Neural Networks, 13, 757-763.

Gustafson, D., & Kessel, W. (1979). Fuzzy clustering with a fuzzy covariance matrix. In Advances in fuzzy set theory and applications (pp. 605-620). North-Holland.

Höppner, F. (2002). Speeding up fuzzy c-means: Using a hierarchical data organisation to control the precision of membership calculation. Fuzzy Sets and Systems, 12(3), 365-378.

Höppner, F., & Klawonn, F. (2003). A contribution to convergence theory of fuzzy c-means and derivatives. IEEE Transactions on Fuzzy Systems, 11, 682-694.

Höppner, F., Klawonn, F., Kruse, R., & Runkler, T. (1999). Fuzzy cluster analysis. Chichester: John Wiley & Sons.

Keller, A., & Klawonn, F. (2003). Adaptation of cluster sizes in objective function based fuzzy clustering. In C.T. Leondes (Ed.), Intelligent Systems: Technology and Applications IV. Database and Learning Systems (pp. 181-199). Boca Raton, FL: CRC Press.

Klawonn, F. (2004). Noise clustering with a fixed fraction of noise. In A. Lotfi & J.M. Garibaldi (Eds.), Applications and Science in Soft Computing (pp. 133-138). Berlin: Springer.

Leski, J.M. (2003). Generalized weighted conditional fuzzy clustering. IEEE Transactions on Fuzzy Systems, 11, 709-715.

Tao, C.W. (2002). Unsupervised fuzzy clustering with multi-center clusters. Fuzzy Sets and Systems, 128, 305-322.

Yang, T.-N., & Wang, S.-D. (2004). Competitive algorithms for the clustering of noisy data. Fuzzy Sets and Systems, 141, 281-299.

Zhang, J.-S., & Leung, Y.-W. (2003). Robust clustering by pruning outliers. IEEE Transactions on Systems, Man and Cybernetics Part B, 33, 983-999.


KEY TERMS

Cluster Analysis: Partition a given data set into clusters where data assigned to the same cluster should be similar, whereas data from different clusters should be dissimilar.

Cluster Centres (Prototypes): Clusters in objective function-based clustering are represented by prototypes that define how the distance of a data object to the corresponding cluster is computed. In the simplest case a single vector represents the cluster, and the distance to the cluster is the Euclidean distance between cluster centre and data object.

Fuzzy Clustering: Cluster analysis where a data object can have membership degrees to different clusters. Usually it is assumed that the membership degrees of a data object to all clusters sum up to one, so that a membership degree can also be interpreted as the probability that the data object belongs to the corresponding cluster.

Hard Clustering: Cluster analysis where each data object is assigned to a unique cluster.

Noise Clustering: An additional noise cluster is introduced in objective function-based clustering to collect the noise data or outliers. All data objects are assumed to have a fixed (large) distance to the noise cluster, so that only data far away from all other clusters will be assigned to this cluster.

Objective Function-Based Clustering: Cluster analysis carried out by minimizing an objective function that indicates a fitting error of the clusters to the data.

Robust Clustering: Refers to clustering techniques that behave robustly with respect to noise; that is, adding some noise to the data in the form of changing the values of the data objects slightly, as well as adding some outliers, will not drastically influence the clustering result.


Immersive Image Mining in Cardiology


Xiaoqiang Liu
Delft University of Technology, The Netherlands and Donghua University, China

Henk Koppelaar
Delft University of Technology, The Netherlands

Ronald Hamers
Erasmus Medical Thorax Center, The Netherlands

Nico Bruining
Erasmus Medical Thorax Center, The Netherlands

INTRODUCTION

Buried within the human body, the heart prohibits direct inspection, so most knowledge about heart failure is obtained by autopsy (in hindsight). Live immersive inspection within the human heart requires advanced data acquisition, image mining and virtual reality techniques. Computational sciences are being exploited as means to investigate biomedical processes in cardiology.

IntraVascular UltraSound (IVUS) has become a clinical tool in recent years. In this immersive data acquisition procedure, voluminous separated slice images are taken by a camera, which is pulled back in the coronary artery. Image mining deals with the extraction of implicit knowledge, image data relationships, or other patterns not explicitly stored in the image databases (Hsu, Lee, & Zhang, 2002). Human medical data are among the most rewarding and difficult of all biological data to mine and analyze; they have the uniqueness of heterogeneity and are privacy-sensitive (Cios & Moore, 2002). The goals of immersive IVUS image mining are to provide medical quantitative measurements, qualitative assessment, and cardiac knowledge discovery to serve clinical needs regarding diagnostics, therapies, safety level, and cost and risk effectiveness, etc.

BACKGROUND

Heart disease is the leading cause of death in industrialized nations and is characterized by diverse cellular abnormalities associated with decreased ventricular function. At the onset of many forms of heart disease, cardiac hypertrophy and ventricular changes in wall thickness or chamber volume occur as a compensatory response to maintain cardiac output. These changes eventually lead to greater vascular resistance, chamber dilation and wall fibrosis, which ultimately impair the ability of the ventricles to pump blood and lead to overt failure. Diagnosing the many possible anomalies and heart diseases is difficult because physicians cannot literally see into the human heart. Various data acquisition techniques have been invented to partly remedy the lack of sight: non-invasive inspection including CT (Computed Tomography), angiography, MRI (Magnetic Resonance Imaging), ECG signals, etc. These techniques do not take into account crucial features of lesion physiology and vascular remodeling to really mine blood-plaque. IVUS is a minimally invasive technique in which a camera is pulled back inside the artery, and the resulting immersive tomographic images are used to remodel the vessel. This remodeling of the vessel and its virtual reality (VR) aspect offer interesting future alternatives for mining these data to unearth anomalies and diseases in the moving heart and coronary vessels at an earlier stage. It also serves in clinical trials to evaluate the results of novel interventional techniques, e.g. local killing of cancerous cells by heating via an electrical current through a piezoelectric transducer, as well as local nano-technology pharmaceutical treatments. Figure 1 explains some aspects of IVUS technology.

However, IVUS images are more complicated than medical data in general since they suffer from some artifacts during immersed data acquisition (Mintz et al., 2001):

•	Non-uniform rotational distortion and motion artifacts.
•	Ring-down, blood speckle, and near-field artifacts.
•	Obliquity, eccentricity, and problems of vessel curvature.
•	Problems of spatial orientation.

The second type of artifacts is treated by image processing and therefore falls outside the scope of this paper. The pumping heart, respiring lungs and moving immersed catheter camera cause the other three types of artifacts.

Figure 1. IVUS immersive data acquisition, measurements and remodeling. A: IVUS catheter/endosonics and the wire. B: angiography of a contrast-filled section of coronary artery. C: two IVUS cross-sectional images and some quantitative measurements on them. D: virtual L-View of the vessel reconstruction. E: virtual 3D impression of the vessel reconstruction.

These cause distortion of the longitudinal position, x-y position, and spatial orientation of the slices on the coronary vessel. For example, it has been reported that more than 5 mm of longitudinal catheter motion relative to the vessel may occur during one cardiac cycle (Winter et al., 2004), when the catheter was pulled back at 0.5 mm/sec and the non-gated samples were stored on S-VHS videotape at a rate of 25 images/sec. Figure 2 explains the longitudinal displacement caused by cardiac cycles during a camera pullback in a segment of coronary artery. The catheter position equals the sum of the pullback distance and the longitudinal catheter displacement. In F, the absolute catheter positions of the solid dots are in disorder, which will cause a disordered sequence of camera images. The consecutive image samples selected in relation to the positions of the catheter relative to the coronary vessel wall are highlighted in G. In conclusion, these samples used for analysis are anatomically dispersed in space (III, I, V, II, IV, and VI).

Figure 2. Trajectory position anomalies of the invasive camera

The data of IVUS sources are voluminous separated slices with artifacts, and the quantitative measurements and the physicians' interpretations are also essential components to remodel the vessel, but the accompanying mathematical models are poorly characterized. Mining in these datasets is more difficult and it inspires interdisciplinary work in medicine, physics and computer science.

MAIN THRUST

Immersive Image Mining Processes in Cardiology IVUS

IVUS image mining includes a series of complicated mining procedures due to the complex immersive data: data reconciliation, image mining, remodeling and VR display, and knowledge discovery. Figure 3 shows the function-driven processes.

The data reconciliation compensates for or decreases artifacts to improve the data, usually fused with non-invasive inspection cardiology data such as angiography and ECG, etc. This data mining extracts features in the individual dataset to use for data reconciliation.

Cardiac computation is performed on the reconciled individual dataset. Quantitative measurement calculates features such as lumen, atheroma, calcium and stent on every slice. Vessel remodeling based on these measurements forms a straight 3D volume, and the fusion with the pullback path determined from angiography yields an accurate spatio-temporal (4D) model of the coronary vasculature. This 4D model is used for further cardiac computation to gain VR display and volume measurements of the vessel. IVUS images are fundamentally different from histology and cannot be used to detect and quantify specific histologic contents directly. But based on quantitative measurements and VR display, combined with cardiac knowledge, qualitative assessments such as atheroma morphology, unstable lesions and complications after intervention may be gained semi-automatically or artificially by the physicians.

These quantitative measurements and qualitative assessments may be organized and stored in a database or data warehouse. Statistics-based mining on colony data and mining on individual history data lead to knowledge discovery of heart diseases.

Individual Data Mining to Reconcile the Data

Data reconciliation is the basis and most important step towards cardiac computation, and it is awaiting more effective methods to attack the data artifacts. Three types of methods are applied: parsimonious data acquiring, data fusion of invasive and non-invasive data, and hybrid motion compensation.

Figure 3. Immersive IVUS image mining (flow diagram: immersive image acquisition and non-invasive data acquisition feed data reconciliation and cardiac computation; these yield quantitative measurements such as border identification, lumen and calcium measurements, qualitative assessments such as atheroma morphology, unstable lesions, ruptured plaque and complications after intervention, vessel remodeling and visual display including L-Mode, 3D reconstruction and picture-in-picture, and finally knowledge discovery drawing on cardiac and expert knowledge)


•	Parsimonious Phase-Dependent Data Acquisition: Cardiac knowledge dictates systolic/diastolic timing features. It has been suggested to select IVUS images recorded in the end-diastolic phase of the cardiac cycle, in which the heart is motionless and blood flow has ceased, so that their influences on the catheter can be neglected. Online ECG-gated pullback has been used to acquire phase-dependent data, but the technology is expensive and prolongs the acquisition procedure. Instead, a retrospective image-based gating mining method has been studied (Winter et al., 2004; Zhu, Oakeson, & Friedman, 2003). In this method, different features are mined from sampled IVUS images over time by transforming the data with spectral analysis, to discover the most prominent repetition frequencies of appearance of these image features. From this mining, the images near the end-diastolic phases can be inferred. The selection of images is parsimonious: only about 5% of the dataset are selected, and about 10% of the selections are mispositioned. (A minimal sketch of such frequency mining is given after this list.)
•	Data Fusion: The motion vessel courses provide helpful information to identify the space and time of the IVUS camera in the coronary of the moving heart. Fusion of complementary information from two or more differing modalities enhances insights into the underlying anatomy and physiology. Combining non-invasive data mining techniques to battle measurement errors is a preferred method. The positioning of the camera could be remedied if the outer form of a vessel is available from angiograms, as a path-road for the camera. Fusing the route and IVUS data, a simulator generates a VR reconstruction of the vessel (Wahle, Mitchell, Olszewski, Long, & Sonka, 2000; Sarry & Boire, 2001; Ding & Friedman, 2000; Rotger, Radeva, Mauri, & Fernandez-Nofrerias, 2002; Weichert, Wawro, & Wilke, 2004). This should help to detect the absolute catheter spatial positions and orientations, but usually the routes are static and the data are parsimonious, phase-dependent, or do not exhibit the distortion. There are few papers on the consecutive motion tracking of the coronary tree (Chen & Carroll, 2003; Shechter, Devernay, Coste-Maniere, Quyyumi, & McVeigh, 2003), but they omitted the stretch of the arteries, which is an important property for accurate positioning analyses.
•	Hybrid Motion Compensation: Physical prior knowledge would predict the space and time of the global position of the IVUS camera if it were fully modeled. The many mechanical and electrical mechanisms in a heart make a full model intractable, but if most of its mechanisms could be dispensed with, the prediction model would become considerably simplified. A hybrid approach could solve the problem if a full Courant-type model were available, coined by the term Computational Cardiology (Cipra, 1999). An effective and pragmatic model is as yet futuristic. Timinger reconstructed the catheter position on a 3D roadmap by a motion compensation algorithm based on an affine model for compensating the respiratory motion and an ECG gating method for the catheter positions acquired using a magnetic tracking system (Timinger, Krueger, Borgert, & Grewer, 2004). Fusing the empirical features of the coronary longitudinal movement with a motion compensation model is a novel way to resolve the longitudinal distortion of the IVUS dataset (Liu, Koppelaar, Koffijberg, Bruining, & Hamers, 2004).
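The following small Python sketch illustrates the idea behind retrospective image-based gating (it is not taken from the article; the per-frame feature, frame rate and toy signal are assumptions): a scalar feature computed for every frame is analysed with an FFT to find the most prominent repetition frequency, i.e. the cardiac rate, from which frames near one chosen phase of each cycle can be selected.

import numpy as np

def dominant_frequency(feature_per_frame, fps):
    # Estimate the most prominent repetition frequency (Hz) of a per-frame
    # image feature (e.g., mean gray value of a region) via the FFT.
    x = np.asarray(feature_per_frame, dtype=float)
    x = x - x.mean()                      # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    return freqs[np.argmax(spectrum[1:]) + 1]   # skip the zero-frequency bin

# toy example: a 1.2 Hz "cardiac" oscillation sampled at 25 images/sec
fps = 25.0
t = np.arange(0, 40, 1.0 / fps)
feature = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)
cardiac_hz = dominant_frequency(feature, fps)

# frames nearest to one phase of each cardiac cycle can then be kept
gate_period = fps / cardiac_hz            # frames per cardiac cycle
selected = np.arange(0, t.size, int(round(gate_period)))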
Cardiac Computation to Quantitative and Qualitative Analysis

Cardiac computation aims at quantitative measurements, remodeling and qualitative assessments of the vessel and the lesion zones. The technologies of border detection in image processing and pattern recognition of the vessel layers are important in the process of cardiac calculation (Koning, Dijkstra, von Birgelen, Tuinenburg, et al., 2002). For qualitative assessments, the expert knowledge of physicians must be fused within the reasoning to obtain the assessments.

Quantitative Measurements

•	Cross-Sectional Slice Calculation: The normal coronary artery consists of the lumen surrounded by the intima, media, and adventitia of the vessel wall (Halligan & Higano, 2003). The innermost layer consists of a complex of three elements: intima, atheroma (in diseased arteries), and internal elastic membrane. After obtaining the vessel layers and their attributes through edge identification and pattern recognition of image processing, every slice is calculated separately and measurements such as lumen area, EEM (external elastic membrane) area, and maximum atheroma thickness can be reported in detail (a per-slice area computation is sketched after this list).
•	Vessel Remodeling and Virtual Display: Based on the above calculation results, vessel layers including plaques can be remodeled and visualized in longitude. VR display is an important way to assist inspection of the artery and the distribution of plaque and lesions, and to help navigation in guided-surgery facilities for minimally invasive surgery. For example, the L-Mode displays sets of slices taken from a single cut plane in longitude (Figure 1, panel D); the 3D reconstruction displays a shaded or wire-frame image of the vessel to give an entire view (Figure 1, panel E).


•	Derived Measurements: Calculating on the virtual remodeled vessel, derived measurements can be obtained, such as hemodynamics, length, and volumes of the vessel and of special lesion zones.
•	Qualitative Assessments: IVUS images are fundamentally different from histology and cannot be used to detect and quantify specific histologic contents directly. However, based on quantitative measurements and a virtual model, as well as combined cardiac knowledge, qualitative assessments such as atheroma morphology, unstable lesions and complications after intervention may be gained semi-automatically or artificially to serve physicians.
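As a minimal illustration of a per-slice quantitative measurement (not from the article; the contour coordinates and pixel calibration are assumed inputs produced by a border detection step), the Python sketch below derives the lumen area from a detected border via the shoelace formula.

import numpy as np

def polygon_area_mm2(contour, mm_per_pixel):
    # Area enclosed by a closed contour (N x 2 pixel coordinates),
    # computed with the shoelace formula and converted to mm^2.
    x, y = contour[:, 0], contour[:, 1]
    area_px = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    return area_px * mm_per_pixel ** 2

# toy example: a circular "lumen" border of radius 60 pixels
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
lumen_border = np.column_stack([60 * np.cos(theta), 60 * np.sin(theta)])
lumen_area = polygon_area_mm2(lumen_border, mm_per_pixel=0.02)

# with an EEM border detected on the same slice, plaque burden could be
# derived as (EEM area - lumen area) / EEM area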
Knowledge Discovery in Cardiology IVUS

Once a large amount of quantitative and qualitative features is in hand, mining on these data can reveal knowledge about the forming and regression of some heart diseases, and can simulate the different stages of the disease as well as assess the surgical treatment procedure and risk level.

Some papers have reported on medical knowledge discovery in cardiology using data mining. Pressure calculation by using fluid dynamic equations on 3D IVUS volumetric measurements predicts the physiologic severity of coronary lesions (Takayama & Hodgson, 2001). Artificial intelligence methods (a structural description, syntactic reasoning and pattern recognition method) are applied on angiography to recognize stenosis of the coronary artery lumen (Ogiela & Tadeusiewicz, 2002). Logical Analysis of Data, a method based on combinatorics, optimization and the theory of Boolean functions, is used on a dataset of 21 heart variables to predict coronary risk, but these datasets do not include any medical image information (Alexe, 2003). Mining in single proton emission computed tomography images accompanied by clinical information and physician interpretation uses inductive machine learning and heuristic approaches to mimic cardiologists' diagnosis (Kurgan, Cios, Tadeusiewicz, Ogiela, & Goodenday, 2001).

Immersive data mining, or fusing the mined data with other cardiac data, is a challenge for improving medical knowledge in cardiology. Mining will contribute to some difficult cardiology applications, for example: interventional procedures and healthcare; mining coronary vessel movement anomalies; highlighting local abnormal cellular growth; accurate heart and virtual vessel reconstruction; adjusting ferro-electric materials to monitor heart movement anomalies; prophylactic patient monitoring, etc. Some technicalities should be considered in this mining field:

•	Volume and complexity in data: First, IVUS data have a large volume in one acquisition procedure for one person. Second, tracing a patient usually needs a long time: more than several years. Third, the data from patients may be incomplete. Finally, immersive data acquisition is a complicated procedure depending on body condition, the immersive mechanical system, and even the operating procedure. The immersive complexity may even bias the historical data from the same heart.
•	Data standards and semantics: IVUS images, and especially the qualitative assessments, need a consistent standard and semantics to support mining. IVUS documents (Mintz et al., 2001) and DICOM standards may be the base to follow. It is necessary to consider the importance of relative parameters, spatial positions, and multiple interpretations of image quantitative measurements, which also need to be addressed in the IVUS standards.
•	Data fusion: Since heart diseases are complicated and IVUS is one of the preferred diagnostic tools, mining IVUS needs to consider other clinical information, including the physicians' interpretation, at the same time.

FUTURE TRENDS

Immersive image mining in cardiology is a new challenge for medical informatics. In mining our body, physicians thus inspire interdisciplinary work in medicine, physics and computer science, which will improve the monitoring data towards many deep applications serving clinical needs in diagnostics, therapies, safety level, and cost and risk effectiveness. For instance, in due course of time nanotechnology will mature to a degree of immersive medical mining equipment, and physicians will directly control medicine by instruction via mobile communication through a transducer inside the human body.

CONCLUSION

Over the last several years, IVUS has developed into an important clinical tool in the assessment of atherosclerosis. The limitations regarding data artifacts and the difficulty of discerning between entities with similar echodensities (Halligan & Higano, 2003) are waiting for a better solution. Hospitals have stored a large amount of immersive data, which may be mined for effective application in cardiac knowledge discovery.


Data mining technologies play crucial roles in the whole procedure of IVUS application: from data acquisition, data reconciliation, image processing, vessel remodeling and virtual display, to knowledge discovery, disease diagnosing, clinical treatment, etc. Reconciling coronary longitudinal movement with a motion compensation model is a novel way to resolve the longitudinal distortion of the IVUS dataset. This fusion with other cardiac datasets and online processing are very important for future application, which means that more effective and efficient mining methods for the complicated datasets need to be studied.

REFERENCES

Alexe, S. (2003). Coronary risk prediction by logical analysis of data. Annals of Operations Research, 119(1-4), 15-42.

Chen, S-Y. J., & Carroll, J. D. (2003). Kinematic and deformation analysis of 4-D coronary arterial trees reconstructed from cine angiograms. IEEE Transactions on Medical Imaging, 22(6), 710-720.

Cios, K. J., & Moore, W. (2002). Uniqueness of medical data mining. Artificial Intelligence in Medicine Journal, 26(1-2), 1-24.

Cipra, B. A. (1999). Failure in sight for a mathematical model of the heart. SIAM News, 32(8).

Ding, Z., & Friedman, M. H. (2000). Quantification of 3D coronary arterial motion using clinical biplane cineangiograms. The International Journal of Cardiac Imaging, 16(5), 331-346.

Halligan, S., & Higano, S. T. (2003). Coronary assessment beyond angiography. Applications in Imaging Cardiac Interventions, 12, 29-35.

Hsu, W., Lee, M. L., & Zhang, J. (2002). Image mining: Trends and developments. Journal of Intelligent Information Systems: Special Issue on Multimedia Data Mining, 19(1), 7-23.

Koning, G., Dijkstra, J., von Birgelen, C., Tuinenburg, J. C., et al. (2002). Advanced contour detection for three-dimensional intracoronary ultrasound: A validation in vitro and in vivo. The International Journal of Cardiovascular Imaging, 18, 235-248.

Kurgan, L. A., Cios, K. J., Tadeusiewicz, R., Ogiela, M., & Goodenday, L. S. (2001). Knowledge discovery approach to automated cardiac SPECT diagnosis. Artificial Intelligence in Medicine, 23(2), 149-169.

Liu, X., Koppelaar, H., Koffijberg, H., Bruining, N., & Hamers, R. (2004, October). Data reconciliation of immersive heart inspection. Paper presented at the IEEE International Conference on Systems, Man and Cybernetics, The Hague, Netherlands.

Mintz, G. S., Nissen, S. E., Anderson, W. D., Bailey, S. R., Erbel, R., Fitzgerald, P. J., Pinto, F. J., Rosenfield, K., Siegel, R. J., Tuzcu, E. M., & Yock, P. G. (2001). ACC clinical expert consensus document on standards for the acquisition, measurement and reporting of intravascular ultrasound studies: A report of the American College of Cardiology task force on clinical expert consensus documents. Journal of the American College of Cardiology, 37, 1478-1492.

Ogiela, M. R., & Tadeusiewicz, R. (2002). Syntactic reasoning and pattern recognition for analysis of coronary artery images. Artificial Intelligence in Medicine, 26, 145-159.

Rotger, B., Radeva, P., Mauri, J., & Fernandez-Nofrerias, E. (2002). Internal and external coronary vessel images registration. In M. T. Escrig, F. Toledo, & E. Golobardes (Eds.), Topics in Artificial Intelligence, 5th Catalonian Conference on AI, Castellón (pp. 408-418). Spain.

Sarry, L., & Boire, J. Y. (2001). Three-dimensional tracking of coronary arteries from biplane angiographic sequences using parametrically deformable models. IEEE Transactions on Medical Imaging, 20(12), 1341-1351.

Shechter, G., Devernay, F., Coste-Maniere, E., Quyyumi, A., & McVeigh, E. R. (2003). Three-dimensional motion tracking of coronary arteries in biplane cineangiograms. IEEE Transactions on Medical Imaging, 2, 1-16.

Takayama, T., & Hodgson, J. M. (2001). Prediction of the physiologic severity of coronary lesions using 3D IVUS: Validation by direct coronary pressure measurements. Catheterization and Cardiovascular Interventions, 53(1), 48-55.

Timinger, H., Krueger, S., Borgert, J., & Grewer, R. (2004). Motion compensation for interventional navigation on 3D static roadmaps based on an affine model and gating. Physics in Medicine and Biology, 49, 719-732.

Wahle, A., Mitchell, S. C., Olszewski, M. E., Long, R. M., & Sonka, M. (2000). Accurate visualization and quantification of coronary vasculature by 3-D/4-D fusion from biplane angiography and intravascular ultrasound. In EBiOS 2000, Biomonitoring and Endoscopy Technologies (pp. 144-155). Amsterdam, NL: SPIE Europto, 4158.

Weichert, F., Wawro, M., & Wilke, C. (2004). A 3D computer graphics approach to brachytherapy planning. The International Journal of Cardiovascular Imaging, 20, 173-182.


Winter, S. A. de, Hamers, R., Degertekin, M., Tanabe, K., Lemos, P. A., Serruys, P. W., Roelandt, J. R. T. C., & Bruining, N. (2004). Retrospective image-based gating of intracoronary ultrasound images for improved quantitative analysis: The intelligate method. Catheterization and Cardiovascular Interventions, 61(1), 84-94.

Zhu, H., Oakeson, K. D., & Friedman, M. H. (2003). Retrieval of cardiac phase from IVUS sequences. In W. F. Walker & M. F. Insana (Eds.), Medical Imaging 2003: Ultrasonic Imaging and Signal Processing, Proceedings of SPIE 5035 (pp. 135-146).

KEY TERMS

Cardiac Data Fusion: Fusion of complementary information from two or more differing cardiac modalities (IVUS, ECG, CT, physicians' interpretation, etc.) that enhances insights into the underlying anatomy and physiology.

Computational Cardiology: Using mathematical and computer models to simulate the heart motion and its properties as a whole.

Immersive IVUS Images: The real-time cross-sectional images obtained from a pullback IntraVascular UltraSound transducer in human arteries. The dataset is usually a volume with artifacts caused by the complicated immersed environments.

IVUS Data Reconciliation: Mining in the IVUS individual dataset to compensate or decrease artifacts in order to obtain improved data for further cardiac calculation and medical knowledge discovery.

IVUS Standards and Semantics: Refers to the standard and semantics of IVUS data and their medical quantitative measurements and qualitative assessments. A consistent definition and description improves medical data management and mining.

Medical Image Mining: Involves extracting the most relevant image features into a form suitable for data mining for medical knowledge discovery, or generating image patterns to improve the accuracy of images retrieved from image databases.

Virtual Reality Vasculature Reconstruction: For effective applications of intravascular analyses and brachytherapy, reconstructing and visualizing the vessel wall's interior structure in a single 3D/4D model by fusing invasive IVUS data and non-invasive angiography.


Imprecise Data and the Data Mining Process


Marvin L. Brown
Grambling State University, USA

John F. Kros
East Carolina University, USA

INTRODUCTION

Missing or inconsistent data has been a pervasive problem in data analysis since the origin of data collection. The management of missing data in organizations has recently been addressed as more firms implement large-scale enterprise resource planning systems (see Vosburg & Kumar, 2001; Xu et al., 2002). The issue of missing data becomes an even more pervasive dilemma in the knowledge discovery process, in that as more data is collected, the higher the likelihood of missing data becomes.

The objective of this research is to discuss imprecise data and the data mining process. The article begins with a background analysis, including a brief review of both seminal and current literature. The main thrust of the chapter focuses on reasons for data inconsistency along with definitions of various types of missing data. Future trends followed by concluding remarks complete the chapter.

BACKGROUND

The analysis of missing data is a comparatively recent discipline. However, the literature holds a number of works that provide perspective on missing data and data mining. Afifi and Elashoff (1966) provide an early seminal paper reviewing the missing data and data mining literature. Little and Rubin's (1987) milestone work defined three unique types of missing data mechanisms and provided parametric methods for handling these types of missing data. These papers sparked numerous works in the area of missing data. Lee and Siau (2001) present an excellent review of data mining techniques within the knowledge discovery process. The references in this section are given as suggested reading for any analyst beginning their research in the area of data mining and missing data.

MAIN THRUST

The article focuses on the reasons for data inconsistency and the types of missing data. In addition, trends regarding missing data and data mining are discussed along with future research opportunities and concluding remarks.

REASONS FOR DATA INCONSISTENCY

Data inconsistency may arise for a number of reasons, including:

•	Procedural Factors
•	Refusal of Response
•	Inapplicable Responses

These three reasons tend to cover the largest areas of missing data in the data mining process.

Procedural Factors

Data entry errors are common and their impact on the knowledge discovery process and data mining can generate serious problems. Inaccurate classifications, erroneous estimates, predictions, and invalid pattern recognition may also take place. In situations where databases are being refreshed with new data, blank responses from questionnaires further complicate the data mining process. If a large number of similar respondents fail to complete similar questions, the deletion or misclassification of these observations can take the researcher down the wrong path of investigation or lead to inaccurate decision-making by end users.

Refusal of Response

Some respondents may find certain survey questions offensive or they may be personally sensitive to certain questions. For example, some respondents may have no opinion regarding certain questions such as political or religious affiliation. In addition, questions that refer to one's education level, income, age or weight may be deemed too private for some respondents to answer.

Furthermore, respondents may simply have insufficient knowledge to accurately answer particular questions. Students or inexperienced individuals may have insufficient knowledge to answer certain questions (such as salaries in various regions of the country, retirement options, insurance choices, etc.).

Inapplicable Responses

Sometimes questions are left blank simply because the questions apply to a more general population rather than to an individual respondent. If a subset of questions on a questionnaire does not apply to the individual respondent, data may be missing for a particular expected group within a data set. For example, adults who have never been married or who are widowed or divorced are likely not to answer a question regarding years of marriage.

TYPES OF MISSING DATA

The following is a list of the standard types of missing data:

•	Data Missing at Random
•	Data Missing Completely at Random
•	Non-Ignorable Missing Data
•	Outliers Treated as Missing Data

It is important for an analyst to understand the different types of missing data before they can address the issue. Each type of missing data is defined next.

[Data] Missing At Random (MAR)

Rubin (1978), in a seminal missing data research paper, defined missing data as MAR when, given the variables X and Y, the probability of response depends on X but not on Y. Cases containing incomplete data must be treated differently than cases with complete data. For example, if the likelihood that a respondent will provide his or her weight depends on the probability that the respondent will not provide his or her age, then the missing data is considered to be Missing At Random (MAR) (Kim, 2001).

[Data] Missing Completely At Random (MCAR)

Kim (2001), based on an earlier work, classified data as MCAR when the probability of response [shows that] independence exists between X and Y. MCAR data exhibits a higher level of randomness than does MAR. In other words, the observed values of Y are truly a random sample for all values of Y, and no other factors included in the study may bias the observed values of Y.

Consider the case of a laboratory providing the results of a chemical compound decomposition test in which a significant level of iron is being sought. If certain levels of iron are met or missing entirely and no other elements in the compound are identified to correlate, then it can be determined that the identified or missing data for iron is MCAR.

Non-Ignorable Missing Data

In contrast to the MAR situation, where data missingness is explained by other measured variables in a study, non-ignorable missing data arise due to the data missingness pattern being explainable, and only explainable, by the very variable(s) on which the data are missing. For example, given two variables, X and Y, data is deemed non-ignorable when the probability of response depends on variable X and possibly on variable Y. For example, if the likelihood of an individual providing his or her weight varied within various age categories, the missing data is non-ignorable (Kim, 2001). Thus, the pattern of missing data is non-random and possibly predictable from other variables in the database.

In practice, the MCAR assumption is seldom met. Most missing data methods are applied upon the assumption of MAR. In correspondence with Kim (2001), non-ignorable missing data is the hardest condition to deal with but, unfortunately, the most likely to occur as well.
594

TEAM LinG
Imprecise Data and the Data Mining Process

contained in a particular compound is 500 parts/million, procedures resulting in the replacement of missing
then the value for the variable iron should never exceed values by attributing them to other available data. This I
that amount. If, for some reason, the value does exceed research investigates the most common imputation
500 parts/million, then some visualization technique methods including:
should be implemented to identify that value. Those
offending cases are then presented to the end users. Case Deletion
Mean Substitution
Cold Deck Imputation
COMMONLY USED METHODS OF Hot Deck Imputation
ADDRESSING MISSING DATA Regression Imputation

These methods were chosen mainly as they are very


Several methods have been developed for the treatment common in the literature and their ease and expediency
of missing data. The simplest of these methods can be of application (Little & Rubin, 1987). However, it has
broken down into the following categories: been concluded that although imputation is a flexible
method for handling missing-data problems it is not
Use Of Complete Data Only without its drawbacks. Caution should be used when
Deleting Selected Cases Or Variables employing imputation methods as they can generate
Data Imputation substantial biases between real and imputed data. None-
theless, imputation methods tend to be a popular method
These categories are based on the randomness of the for addressing the issue of missing data.
missing data, and how the missing data is estimated and
used for replacement. Each category is discussed next.
Case Deletion
Use of Complete Data Only
The simple deletion of data that contains missing val-
ues may be utilized when a nonrandom pattern of miss-
Use of complete data only is generally referred to as the ing data is present. Large sample sizes permit the
complete case approach and is readily available in all deletion of a predetermined percentage of cases, and/
statistical analysis packages. When the relationships or when the relationships within the data set are strong
within a data set are strong enough to not be significantly enough to not be significantly affected by missing data.
affected by missing data, large sample sizes may allow Case deletion is not recommended for small sample
for the deletion of a predetermined percentage of cases. sizes or when the user knows strong relationships within
While the use of complete data only is a common ap- the data exist.
proach, the cost of lost data and information when cases
containing missing value are simply deleted can be dra-
matic. Overall, this method is best suited to situations
Mean Substitution
where the amount of missing data is small and when
missing data is classified as MCAR. This type of imputation is accomplished by estimating
missing values by using the mean of the recorded or
Delete Selected Cases or Variables available values. It is important to calculate the mean
only from valid responses that are chosen from a veri-
fied population having a normal distribution. If the data
The simple deletion of data that contains missing values distribution is skewed, the median of the available data
may be utilized when a non-random pattern of missing can be used as a substitute.
data is present. However, it may be ill-advised to elimi- The main advantage of mean substitution is its ease
nate ALL of the samples taken from a test. This method of implementation and ability to provide all cases with
tends to be the method of last choice. complete information. Although a common imputation
method, three main disadvantages to mean substitution
Data Imputation Methods do exist:

A number of researchers have discussed specific impu- Understatement of variance


tation methods. Seminal works include Little & Rubin Distortion of actual distribution of values mean
(1987), and Rubin (1978) has published articles regard- substitution allows more observations to fall into
ing imputation methodologies. Imputation methods are the category containing the calculated mean than
may actually exist

595

TEAM LinG
Imprecise Data and the Data Mining Process

Depression of observed correlations due to the Once the most similar case has been identified, hot
repetition of a constant value deck imputation substitutes the most similar complete
cases value for the missing value. Since case two con-
Obviously, a researcher must weigh the advantages tains the value of 23 for item four, a value of 23 replaces
against the disadvantages before implementation. the missing data point for case three. The advantages of
hot deck imputation include conceptual simplicity, main-
Cold Deck Imputation tenance and proper measurement level of variables, and
the availability of a complete set of data at the end of the
Cold deck imputation methods select values or use imputation process that can be analyzed like any complete
relationships obtained from sources other than the cur- set of data. One of hot decks disadvantages is the difficulty
rent data. With this method, the end user substitutes a in defining what is similar, Hence, many different
constant value derived from external sources or from schemes for deciding on what is similar may evolve.
previous research for the missing values. It must be
ascertained by the end user that the replacement value Regression Imputation
used is more valid than any internally derived value.
Unfortunately, feasible values are not always provided Regression Analysis is used to predict missing values
using cold deck imputation methods. Many of the same based on the variables relationship to other variables in
disadvantages that apply to the mean substitution method the data set. Single and/or multiple regression can be
apply to cold deck imputation. Cold deck imputation used to impute missing values. The first step consists of
methods are rarely used as the sole method of imputa- identifying the independent variables and the dependent
tion and instead are generally used to provide starting variable. In turn, the dependent variable is regressed on
values for hot deck imputation methods. the independent variables. The resulting regression equa-
tion is then used to predict the missing values. Table 2
Hot Deck Imputation displays an example of regression imputation.
From the table, twenty cases with three variables
Generally speaking, hot deck imputation replaces miss- (income, age, and years of college education) are listed.
ing values with values drawn from the next most similar Income contains missing data and is identified as the
case. The implementation of this imputation method dependent variable while age and years of college edu-
results in the replacement of a missing value with a value cation are identified as the independent variables.
selected from an estimated distribution of similar re- The following regression equation is produced for
sponding units for each missing value. In most instances, the example
the empirical distribution consists of values from re-
sponding units. For example, Table 1 displays a data set
containing missing data. Table 2. Illustration of regression imputation
From Table 1, it is noted that case three is missing
Years of College Regression
data for item four. In this example, case one, two, and Case Income Age Education Prediction
four are examined. Using hot deck imputation, each of 1 $95,131.25 26 4 $96,147.60
the other cases with complete data is examined and the 2 $108,664.75 45 6 $104,724.04
value for the most similar case is substituted for the 3 $98,356.67 28 5 $98,285.28
missing data value. Case four is easily eliminated, as it 4 $94,420.33 28 4 $96,721.07
5 $104,432.04 46 3 $100,318.15
has nothing in common with case three. Case one and
6 $97,151.45 38 4 $99,588.46
two both have similarities with case three. Case one has 7 $98,425.85 35 4 $98,728.24
one item in common whereas case two has two items in 8 $109,262.12 50 6 $106,157.73
common. Therefore, case two is the most similar to 9 $95,704.49 45 3 $100,031.42
case three. 10 $99,574.75 52 5 $105,167.00
11 $96,751.11 30 0 $91,037.71
12 $111,238.13 50 6 $106,157.73
Table 1. Illustration of hot deck imputation: incomplete 13 $102,386.59 46 6 $105,010.78
data set 14 $109,378.14 48 6 $105,584.26
15 $98,573.56 50 4 $103,029.32
16 $94,446.04 31 3 $96,017.08
Case Item 1 Item 2 Item 3 Item 4
17 $101,837.93 50 4 $103,029.32
1 10 22 30 25
2 23 20 30 23 18 ??? 55 6 $107,591.43
3 25 20 30 ??? 19 ??? 35 4 $98,728.24
4 11 25 10 12 20 ??? 39 5 $101,439.40

596

TEAM LinG
Imprecise Data and the Data Mining Process

y = $79,900.95 + $268.50(age) + (integration of domain-specific logic into data mining


$2,180,97(years of college education)
solutions). As gigabyte-, terabyte-, and petabyte-size I
data sets become more prevalent in data warehousing
applications, the issue of dealing with missing data will
Predictions of income can be made using the regres-
itself become an integral solution for the use of such
sion equation and the right-most column of the table
data rather than simply existing as a component of the
displays these predictions. For cases eighteen, nine-
knowledge discovery and data mining processes (Han &
teen, and twenty, income is predicted to be $107,591.43,
Kamber, 2001).
$98,728.24, and $101,439.40, respectfully. An advan-
Although the issue of statistical analysis with miss-
tage to regression imputation is that it preserves the
ing data has been addressed since the early 1970s, the
variance and covariance structures of variables with
advent of data warehousing, knowledge discovery, data
missing data.
mining and data cleansing has pushed the concept of
Although regression imputation is useful for simple
dealing with missing data into the limelight. Brown &
estimates, it has several inherent disadvantages:
Kros (2003) provide an overview of the trends, tech-
niques, and impacts of imprecise data on the data mining
Regression imputation reinforces relationships
process related to the k-Nearest Neighbor Algorithm,
that already exist within the data as the method is
Decision Trees, Association Rules, and Neural Networks.
utilized more often, the resulting data becomes
more reflective of the sample and becomes less
generalizable
Understates the variance of the distribution CONCLUSION
An implied assumption that the variable being
estimated has a substantial correlation to other It can be seen that future generations of data miners will
attributes within the data set be faced with many challenges concerning the issues of
Regression imputation estimated value is not con- missing data. This article gives a background analysis
strained and therefore may fall outside predeter- and brief literature review of missing data concepts. The
mined boundaries for the given variable authors addressed reasons for data inconsistency and
methods for addressing missing data. Finally, the au-
In addition to these points, there is also the problem thors offered their opinions on future developments and
of over-prediction. Regression imputation may lead to trends on issues expected to face developers of knowl-
over-prediction of the models explanatory power. For edge discovery software and the needs of end users
example, if the regression R 2 is too strong, when confronted with the issues of data inconsistency
multicollinearity most likely exists. Otherwise, if the and missing data.
R2 value is modest, errors in the regression prediction
equation will be substantial. Overall, regression imputa-
tion not only estimates the missing values but also REFERENCES
derives inferences for the population.
Afifi, A., & Elashoff, R. (1966). Missing observations
in multivariate statistics I: Review of the literature.
FUTURE TRENDS Journal of the American Statistical Association, 61,
595-604.
Poor data quality has plagued the knowledge discovery
Brown, M.L., & Kros, J.F. (2003). The impact of missing
process and all associated data mining techniques. Fu-
data on data mining. In J. Wang (Ed.), Data mining
ture data mining systems should be sensitive to noise
opportunities and challenges (pp. 174-198). Hershey,
and have the ability to deal with all types of data pollu-
PA: Idea Group Publishing.
tion, both internally and in conjunction with end users.
Systems should still produce the most significant find- Han, J., & Kamber, M. (2001). Data mining: Concepts
ings from the data set possible even if noise is present. and techniques. San Francisco: Academic Press.
As data mining continues to evolve and mature as a
viable business tool, it will be monitored to address its Kim, Y. (2001). The curse of the missing data. Re-
role in the technological life cycle. Tools for dealing trieved from http://209.68.240.11:8080/2ndMoment/
with missing data will grow from being used as a hori- 978476655/addPostingForm/
zontal solution (not designed to provide business-spe- Lee, S., & Siau, K. (2001). A review of data mining
cific end solutions) into a type of vertical solution techniques. Industrial Management & Data Systems,
101(1), 41-46.
597

TEAM LinG
Imprecise Data and the Data Mining Process

Little, R., & Rubin, D. (1987). Statistical analysis with Data Imputation: The process of estimating missing
missing data. New York: Wiley. data of an observation based on the valid values of other
variables.
Rubin, D. (1978). Multiple imputations in sample surveys:
A phenomenological Bayesian approach to nonresponse. Data Missing at Random (MAR): When given the
In Imputation and Editing of Faulty or Missing Survey variables X and Y, the probability of response depends
Data (pp. 1-23). Washington, DC: U.S. Department of on X but not on Y.
Commerce.
Inapplicable Responses: Respondents omit answer
Vosburg, J., & Kumar, A. (2001). Managing dirty data in due to doubts of applicability.
organizations using ERP: Lessons from a case study.
Industrial Management & Data Systems, 101(1), 21-31. Knowledge Discovery Process: The overall pro-
cess of information discovery in large volumes of ware-
Xu, H., Horn Nord, J., Brown, N., & Nord, G.D. (2002). housed data.
Data quality issues in implementing an ERP. Industrial
Management & Data Systems, 102(1), 47-58. Non-Ignorable Missing Data: Arise due to the data
missingness pattern being explainable, non-random, and
possibly predictable from other variables.
Procedural Factors: Inaccurate classifications of
KEY TERMS new data, resulting in classification error or omission.

[Data] Missing Completely at Random (MCAR): When Refusal of Response: Respondents outward omission
the observed values of a variable are truly a random of a response due to personal choice, conflict, or inexpe-
sample of all values of that variable (i.e., the response rience.
exhibits independence from any variables).

598

TEAM LinG
599

Incorporating the People Perspective into I


Data Mining
Nilmini Wickramasinghe
Cleveland State University, USA

INTRODUCTION namely by incorporating a people-based perspective into


the traditional KDD process, not only provides a more
Todays economy is increasingly based on knowledge complete and macro perspective on knowledge creation
and information (Davenport & Grover, 2001). Knowl- but also a more balanced approach, which in turn serves
edge is now recognized as the driver of productivity and to enhance the knowledge base of an organization and
economic growth, leading to a new focus on the roles of facilitates the realization of effective knowledge. The
information technology and learning in economic per- implications for data mining are clearly far-reaching and
formance. Organizations trying to survive and prosper in are certain to help organizations more effectively realize
such an economy are turning their focus to strategies, the full potential of their knowledge assets, improve the
processes, tools, and technologies that can facilitate the likelihood of using/reusing the created knowledge, and
creation of knowledge. A vital and well-respected tech- thereby enables them to be well positioned in todays
nique in knowledge creation is data mining, which en- knowledgeable economy.
ables critical knowledge to be gained from the analysis
of large amounts of data and information. Traditional
data mining and the KDD process (knowledge discovery BACKGROUND
in data bases) tends to view the knowledge product as a
homogeneous product. Knowledge, however, is a multi- Knowledge Creation through Data
faceted construct, drawing upon various philosophical Mining and the KDD Process
perspectives including Lockean/Leibnitzian and
Hegelian/Kantian, exhibiting subjective and objective KDD, and more specifically, data mining, approaches
aspects, as well as having tacit and explicit forms knowledge creation from a primarily technology driven
(Nonaka, 1994; Alavi & Leidner, 2001; Schultze & perspective. In particular, the KDD process focuses on
Leidner, 2002; Wickramasinghe et al., 2003). The the- how data are transformed into knowledge by identifying
sis of this article is that taking a broader perspective of valid, novel, potentially useful, and ultimately under-
the resultant knowledge product from the KDD process,

Figure 1. Integrated view of the knowledge discovery process (Adapted from Wickramasinghe et al., 2003)

Knowledge
evolution

Data Information Knowledge

Steps in knowledge
discovery

Selection Preprocessing Transformation Data Mining Interpretation / Evaluation

Data Target Preprocessed Transformed Patterns Knowledge


Data Data Data

Types of
data mining
Exploratory Predictive
Data Mining Data Mining

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Incorporating the People Perspective into Data Mining

standable patterns in data (Spiegler, 2003; Fayyad, there is an articulation of nuances; for example, as in
Piatetsky-Shapiro, & Smyth, 1996). KDD is primarily health care, if a renowned surgeon is questioned as to why
used on data sets for creating knowledge through model he does a particular procedure in a certain manner, by his
building or by finding patterns and relationships in data. articulation of the steps, the tacit knowledge becomes
From an application perspective, data mining and explicit, and d) Explicit-to-tacit knowledge transformation
KDD are often used interchangeably. Figure 1 presents a usually occurs as new explicit knowledge is internalized;
generic representation of a typical knowledge discovery it can then be used to broaden, reframe, and extend ones
process. This figure not only depicts each stage within tacit knowledge. These transformations are often referred
the KDD process but also highlights the evolution of to as the modes of socialization, combination,
knowledge from data through information in this process externalization, and internalization, respectively (Nonaka,
as well as the two major types of data mining, namely, 1994). Integral to this changing of knowledge through the
exploratory and predictive, whereas the last two steps knowledge spiral is that new knowledge is created (Nonaka
(i.e., data mining and interpretation/evaluation) in the & Nishiguchi, 2001), which can bring many benefits to
KDD process are considered predictive data mining. It is organizations. Specifically, in todays knowledge-centric
important to note in Figure 1 that typically in the KDD economy, processes that effect a positive change to the
process, the knowledge component itself is treated as a existing knowledge base of the organization and facilitate
homogeneous block. Given the well-established multifac- better use of the organizations intellectual capital, as the
eted nature of the knowledge construct (Boland & Tenkasi, knowledge spiral does, are of paramount importance.
1995; Malhotra, 2000; Alavi & Leidner, 2001; Schultze & Two other primarily people-driven frameworks that
Leidner, 2002; Wickramasinghe et al., 2003), this would focus on knowledge creation as a central theme are
appear to be a significant limitation or oversimplification Spenders and Blacklers respective frameworks
of knowledge creation through data mining as a technique (Newell, Robertson, Scarbrough, & Swan, 2002; Swan,
and of the KDD process in general. Scarbrough, & Preston, 1999). Spender draws a distinc-
tion between individual knowledge and social knowl-
The Psychosocial-Driven Perspective edge, each of which he claims can be implicit or explicit
to Knowledge Creation (Newell et al.). From this framework, you can see that
Spenders definition of implicit knowledge corresponds
Knowledge can exist in essentially two forms: explicit, to Nonakas tacit knowledge. However, unlike Spender,
or factual, knowledge and tacit, or experiential (i.e., Nonaka doesnt differentiate between individual and
know how) (Polyani, 1958, 1966). Of equal signifi- social dimensions of knowledge; rather, he focuses on
cance is the fact that organizational knowledge is not the nature and types of the knowledge itself. In contrast,
static; rather, it changes and evolves during the lifetime Blackler (Newell et al.) views knowledge creation from
of an organization (Becerra-Fernandez & Sabherwal, an organizational perspective, noting that knowledge
2001; Bendoly, 2003; Choi & Lee, 2003). Furthermore, can exist as encoded, embedded, embodied, encultured,
it is possible to change the form of knowledge, that is, and/or embrained. In addition, Blackler emphasized that
transform existing tacit knowledge into new explicit for different organizational types, different types of
knowledge, and existing explicit knowledge into new knowledge predominate, and he highlighted the connec-
tacit knowledge, or to transform the subjective form of tion between knowledge and organizational processes
knowledge into the objective form of knowledge (Nonaka (Newell et al.).
& Nishiguchi, 2001; Nonaka, 1994). This process of Blacklers types of knowledge can be thought of in
transforming the form of knowledge and thus increasing terms of spanning a continuum of tacit (implicit) through
the extant knowledge base as well as the amount and to explicit with embrained being predominantly tacit
utilization of the knowledge within the organization is (implicit) and encoded being predominantly explicit
known as the knowledge spiral (Nonaka & Nishiguchi, while embedded, embodied, and encultured types of
2001). In each of these instances, the overall extant knowledge exhibit varying degrees of a tacit (implicit)/
knowledge base of the organization grows to a new explicit combination. An integrated view of all three
superior knowledge base. frameworks is presented in Figure 2. Specifically, from
According to Nonaka and Nishiguchi (2001), four Figure 2, Spenders and Blacklers perspectives comple-
things are true: a) Tacit-to-tacit knowledge transforma- ment Nonakas conceptualization of knowledge cre-
tion usually occurs through apprenticeship-type rela- ation and, more importantly, do not contradict his thesis
tions, where the teacher or master passes on the skill to the of the knowledge spiral, wherein the extant knowledge
apprentice, b) Explicit-to-explicit knowledge transforma- base is continually being expanded to a new knowledge
tion usually occurs via formal learning of facts, c) Tacit- base, be it tacit/explicit (in Nonakas terminology),
to-explicit knowledge transformation usually occurs when implicit/explicit (in Spenders terminology), or

600

TEAM LinG
Incorporating the People Perspective into Data Mining

Figure 2. People driven knowledge creation map metaframework for knowledge creation. By so doing, a
richer and more complete approach to knowledge cre- I
ation is realized. Such an approach not only leads to a
The Knowledge Continuum
deeper understanding of the knowledge creation pro-
cess but also offers a knowledge creation methodology
Explicit Other K Types
(embodied, encultured,embedded)
Tacit/Implicit/Embrained
that is more customizable to specific organizational
Social
contexts, structures, and cultures. Furthermore, it brings
the human factor back into the knowledge creation
process and doesnt oversimplify the complex knowl-
edge construct as a homogenous product.
The Knowledge Spiral
Specifically, in Figure 3, the knowledge product of
(Spenders
Spenders
Actors)
Actors
data mining is broken into its constituent components
based on the people-driven perspectives (i.e., Blackler,
Spender, and Nonaka, respectively) of knowledge cre-
ation. On the other hand, the specific modes of trans-
Individual formation of the knowledge spiral discussed by Nonaka
in his Knowledge Spiral should benefit from the algo-
rithmic structured nature of both exploratory and pre-
embrained/encultured/embodied/embedded/encoded (in dictive data-mining techniques. For example, if you
Blacklers terminology). consider socialization, which is described in Nonaka
and Nishiguchi (2001) and Nonaka (1994) as the pro-
cess of creating new tacit knowledge through discus-
ENRICHING DATA MINING WITH THE sion within groups, more specifically groups of ex-
KNOWLEDGE SPIRAL perts, and you then incorporate the results of data-
mining techniques into this context, this provides a
To conceive of knowledge as a collection of information structured forum and hence a jump start for guiding the
seems to rob the concept of all of its life. . . . Knowledge dialogue and, consequently, knowledge creation. Note,
resides in the user and not in the collection. It is how the however, that this only enriches the socialization pro-
user reacts to a collection of information that matters cess without restricting the actual brainstorming activi-
(Churchman, 1971, p. 10). ties and thus not necessarily leading to the side effect of
truncating divergent thoughts. This also holds for
Churchman is clearly underscoring the importance of Nonakas other modes of knowledge transformation.
people in the process of knowledge creation. However,
most formulations of information technology (IT) en- An Example within a Health Care
abled knowledge management, and data mining in par- Context
ticular seems to have not only ignored the ignored the
human element but also taken a very myopic and homog- An illustration from health care will serve to illustrate
enous perspective on the knowledge construct itself. the potential of combining the people perspective with
Recent research that has surveyed the literature on KM data mining. Health care is a very information-rich
indicates the need for more frameworks for knowledge industry. The collecting of data and information per-
management, particularly a metaframework to facilitate meates most if not all areas of this industry. By incor-
more successful realization of the KM steps porating a people perspective with data mining, it will
(Wickramasinghe & Mills, 2001; Holsapple & Joshi, be possible to realize the full potential of these data
2002; Alavi & Leidner, 2001; Schultze & Leidner, 2002). assets.
From a macro knowledge management perspective, the Table 1 details some specific instances of each of
knowledge spiral is the cornerstone of knowledge cre- the transformations identified in Figure 3, and Table 2
ation. From a micro data-mining perspective, one of the provides an example of explicit knowledge stored in a
key strengths of data mining as a technique is that it medication repository 2.
facilitates knowledge creation from data. Therefore, by Using the association rules data-mining algorithm,
integrating the algorithmic approach of knowledge cre- the following patterns can be discovered:
ation (in particular data mining) with the psychosocial
approach of knowledge creation (i.e., the people-driven D1 is administered to 60% of the patients (i.e., 3/5).
frameworks of knowledge creation, in particular the
knowledge spiral), it is indeed possible to develop a
D1 and D2 are administered together to 40% of
the patients (i.e., 2/5).

601

TEAM LinG
Incorporating the People Perspective into Data Mining

Figure 3. Integrating data mining with the knowledge spiral

Nonakas/Spenders/Blacklers K-Types

Tacit/Implicit/Embrained Other K Types (embodied, encultured,embedded) Explicit


KNOWLEDGE

The Knowledge Spiral


Social
(Tacit)
1. SOCIALIZATION 2. EXTERNALIZATION

Spenders Exploratory
Actors
DM

Predictive

Individual
(Explicit) 3. INTERNALIZATION 4. COMBINATION

Specific Modes of Transformations and the Role of Data Mining in Realizing Them:
1. Socialization experiential to experiential via practice group interactions on data-mining results
2. Externalization written extensions to data-mining results
3. Internalization enhancing experiential skills by learning from data-mining results
4. Combination gaining insights through applying various data mining visualization techniques

Table 1. Data mining as an enabler of the knowledge spiral

EXPLICIT1 TACIT
TO

FROM

EXPLICIT The performance of exploratory data mining, such as By assimilating and internalizing
summarization and visualization, makes it possible to knowledge discovered through data
upgrade, expand, and/or revise current facts and mining, physicians can turn this
protocols. explicit knowledge into tacit
knowledge, which they can apply to
treating new patients.
TACIT Interpretation of findings from data mining helps reveal Interpreting treatment patterns (for
tacit knowledge, which can then be articulated and stored example, for hip disease) discovered
as explicit cases in the case repository. through data mining enables
interaction among physicians, hence
making it possible to stimulate and
grow their own tacit knowledge.

602

TEAM LinG
Incorporating the People Perspective into Data Mining

Table 2. Drugs administered to patients As physicians discuss the implications of these find-

Patient ID Drug
ings, tacit knowledge from some of them is transformed I
into tacit knowledge for other physicians. Thus, during
1 D1, D2 the interaction of physicians from different specialties,
2 S3, D4, D5
an environment of existing tacit to new tacit knowledge
3 D3, D1, D2
4 D3, D5, D1 transformations occur. These knowledge transforma-
5 D5, D2 tions are summarized in Table 3. Figure 3 and Table 3
illustrate how data mining helps to realize the four modes
or transformations of the knowledge spiral (socializa-
D2 is administered to 67% of the patients who are tion, externalization, internalization, and combination).
given drug D1 (i.e., 2/3).
As the physicians try to understand these findings,
one physician could explain that D2 has to be given with FUTURE TRENDS
D1 for patients who had a heart attack at age 40 or less.
Thus, from this observation, the following rule can be The two significant ways to create knowledge are through
added to the rule repository: If a patients age is <= 40 the a) synthesis of new knowledge through socialization
years, the patient has a heart attack, and D1 is adminis- with experts (a primarily people-dominated perspec-
tered to the patient, then D2 should also be administered tive) and b) discovery by finding interesting patterns
to that patient. This is an example of existing tacit through observation and combination of explicit data (a
knowledge, because it originated from the physicians primarily technology-driven perspective) (Becerra-
head and was transformed into new explicit knowledge Fernandez, et al., 2004). In todays knowledge economy,
that is now recorded for everyone to use (as Figure 3 knowledge creation and the maximization of an
demonstrates). organizations knowledge and data assets are key strate-

Table 3. Data mining as an enabler of the knowledge spiral from the example data set

EXPLICIT1 TACIT
FROM TO

EXPLICIT D1 is administered to 60% of


Patient ID Drug the patients.
1 D1, D2 D1 and D2 are administered
2 S3, D4, D5 together to 40% of the patients.
3 D3, D1, D2 D2 is administered to 67% of
4 D3, D5, D1 the patients who are given drug
5 D5, D2 D1.

TACIT If a patients age is <= 40 years,


the patient has a heart attack,
and D1 is
administered to the patient, then
D2 should also be administered
to that patient.

INTERACTION

1
Each entry gives an example of how data mining enables the knowledge transfer from the knowledge type in the cell row to the
knowledge type in the cell column.

603

TEAM LinG
Incorporating the People Perspective into Data Mining

gic necessities. Furthermore, more techniques, such as Becerra-Fernandez, I., & Sabherwal, R. (2001). Organiza-
business intelligence and business analytics, which have tional knowledge management: A contingency perspec-
their foundations in traditional data mining, are being tive. Journal of Management Information Systems, 18(1),
embraced by organizations in order to try to facilitate 23-55.
the discovery of novel and unique patterns in data that
will lead to new knowledge and maximization of an Bendoly, E. (2003). Theory and support for process
organizations data assets. Full maximization of an frameworks of knowledge discovery and data mining
organizations data assets, however, will not be realized from ERP systems. Information & Management, 40, 639-
until the people perspective is incorporated into these 647.
data-mining techniques to enable the full potential of Boland, R., & Tenkasi, R. (1995). Perspective making
knowledge creation to occur. Thus, as organizations perspective taking. Organizational Science, 6, 350-372.
strive to survive and thrive in todays competitive busi-
ness environment, incorporating a people perspective Choi, B., & Lee, H. (2003). An empirical investigation
into their data-mining initiatives will increasingly be- of KM styles and their effect on corporate perfor-
come a competitive necessity. mance. Information & Management, 40, 403-417.
Churchman, C. (1971). The design of inquiring sys-
tems: Basic concepts of systems and organizations.
CONCLUSION New York: Basic Books.

Sustainable competitive advantage is dependent on build- Davenport, T., & Grover, V. (2001). Knowledge man-
ing and exploiting core competencies (Newell et al., agement. Journal of Management Information Sys-
2002). In order to sustain competitive advantage, re- tems, 18(1), 3-4.
sources that are idiosyncratic (and thus scarce) and Davenport, T., & Prusak, L. (1998). Working knowl-
difficult to transfer or replicate are required (Grant, edge. Boston: Harvard Business School Press.
1991). A knowledge-based view of the firm identifies
knowledge as the organizational asset that enables sus- Fayyad, Piatetsky-Shapiro, & Smyth, (1996). From data
tainable competitive advantage, especially in mining to knowledge discovery: An overview. In Fayyad,
hypercompetitive environments (Wickramasinghe, Piatetsky-Shapiro, Smyth, & Uthurusamy (Eds.), Ad-
2003; Davenport & Prusak, 1998; Zack, 1999). This is vances in knowledge discovery and data mining. Menlo
attributed to the fact that barriers exist regarding the Park, CA: AAAI Press/MIT Press.
transfer and replication of knowledge (Wickramasinghe,
2003), thus making knowledge and knowledge manage- Grant, R. (1991). The resource-based theory of competi-
ment of strategic significance (Kanter, 1999). The key tive advantage: Implications for strategy formulation.
to maximizing the knowledge asset is in finding novel California Management Review, 33(3), 114-135.
and actionable patterns and in continuously creating new Holsapple, C., & Joshi, K. (2002). Knowledge manipu-
knowledge, thereby increasing the extant knowledge lation activities: Results of a delphi study. Information
base of the organization. By incorporating a people & Management, 39, 477-419.
perspective into data-mining, it becomes truly possible
to support both major types of knowledge creation Kanter, J. (1999). Knowledge management practically
scenarios and thereby realize the synergistic effect of speaking. Information Systems Management.
the respective strengths of these approaches in enabling Malhotra, Y. (2000). Knowledge management and new
superior knowledge creation to ensue. organizational form. In Malhotra (Ed.), Knowledge
management and virtual organizations. Hershey, PA:
Idea Group Publishing.
REFERENCES
Newell, S., Robertson, M., Scarbrough, H., & Swan, J.
Alavi, M., & Leidner, D. (2001). Review: Knowledge (2002). Managing knowledge work. New York: Palgrave
management and knowledge management systems: Con- Macmillan.
ceptual foundations and research issues. MIS Quar- Nonaka, I. (1994). A dynamic theory of organizational
terly, 25(1), 107-136. knowledge creation. Organizational Science, 5, 14-37.
Becerra-Fernandez, I., Gonzalez, A., & Sabherwal, R. (2004). Nonaka, I., & Nishiguchi, T. (2001). Knowledge emer-
Knowledge management. Upper Saddle River, NJ: Prentice gence. Oxford, UK: Oxford University Press.
Hall.

604

TEAM LinG
Incorporating the People Perspective into Data Mining

Polyani, M. (1958). Personal knowledge: Towards a Externalization: A knowledge transfer mode that in-
postcritical philosophy. Chicago: University of Chi- volves new explicit knowledge being derived from exist- I
cago Press. ing tacit knowledge.
Polyani, M. (1966). The tacit dimension. London: Hegelian/Kantian Perspective of Knowledge
Routledge & Kegan Paul. Management: Refers to the subjective component of
knowledge management and can be viewed as an ongoing
Schultze, U., & Leidner, D. (2002). Studying knowledge phenomenon being shaped by social practices of commu-
management in information systems research: Dis- nities and encouraging discourse and divergence of mean-
courses and theoretical assumptions. MIS Quarterly, ing.
26(3), 212-242.
Internalization: A knowledge transfer mode that
Spiegler, I. (2003). Technology and knowledge: Bridg- involves new tacit knowledge being derived from exist-
ing a generating gap. Information & Management, 40, ing explicit knowledge.
533-539.
Knowledge Spiral: The process of transforming
Swan, J., Scarbrough, H., & Preston, J. (1999). Knowl- the form of knowledge and thus increasing the extant
edge management: The next fad to forget people? Pro- knowledge base as well as the amount and utilization of
ceedings of the Seventh European Conference in In- the knowledge within the organization.
formation Systems.
Lockean/Leibnitzian Perspective of Knowledge
Wickramasinghe, N. (2003). Do we practise what we Management: Refers to the objective aspects of knowl-
preach: Are knowledge management systems in practice edge management, where the need for knowledge is to
truly reflective of knowledge management systems in improve effectiveness and efficiency.
theory? Business Process Management Journal, 9(3),
295-316. Socialization: A knowledge transfer mode that in-
volves new tacit knowledge being derived from existing
Wickramasinghe, N., Fadlalla, A., Geisler, E., & Schaffer, tacit knowledge.
J. (2003). Knowledge management and data mining: Stra-
tegic imperatives for healthcare. Proceedings of the Third Tacit Knowledge: Also known as experiential knowl-
Hospital of the Future Conference, Warwick, UK. edge (i.e., know how) (Cabena et al., 1998) represents
knowledge that is gained through experience.
Wickramasinghe, N., & Mills, G. (2001). MARS: The
electronic medical record system the core of the kaiser
galaxy. International Journal of Healthcare Technol- ENDNOTES
ogy Management, 3(5/6), 406-423.
Zack, M. (1999). Knowledge and strategy. Boston:
1
Each entry explains how data mining enables the
Butterworth Heinemann. knowledge transfer from the type of knowledge in
the cell row to the type of knowledge in the cell
column.
2
The example is kept small and simple for illustrative
KEY TERMS purposes; naturally, in large medical databases the
data would be much larger.
Combination: A knowledge transfer mode that in- 3
Each entry gives an example of how data mining
volves new explicit knowledge being derived from ex- enables the knowledge transfer from the knowl-
isting explicit knowledge. edge type in the cell row to the knowledge type in
the cell column.
Explicit Knowledge: Also known as factual knowl-
edge (i.e., know what) (Cabena et al., 1998), repre-
sents knowledge that is well established and documented.

605

TEAM LinG
606

Incremental Mining from News Streams


Seokkyung Chung
University of Southern California, USA

Jongeun Jun
University of Southern California, USA

Dennis McLeod
University of Southern California, USA

INTRODUCTION the specified keywords are generated from users knowl-


edge space. Thus, if users are unaware of the airplane
With the rapid growth of the World Wide Web, Internet crash that occurred yesterday, then they cannot issue a
users are now experiencing overwhelming quantities of query about that accident even though they might be
online information. Since manually analyzing the data interested.
becomes nearly impossible, the analysis would be per- The first two drawbacks stated above have been
formed by automatic data mining techniques to fulfill addressed by query expansion based on domain-inde-
users information needs quickly. pendent ontologies. However, it is well known that this
On most Web pages, vast amounts of useful knowl- approach leads to a degradation of precision. That is,
edge are embedded into text. Given such large sizes of text given that the words introduced by term expansion may
collection, mining tools, which organize the text datasets have more than one meaning, using additional terms can
into structured knowledge, would enhance efficient docu- improve recall, but decrease precision. Exploiting a manu-
ment access. This facilitates information search and, at ally developed ontology with a controlled vocabulary
the same time, provides an efficient framework for docu- would be helpful in this situation (Khan, McLeod, &
ment repository management as the number of documents Hovy, 2004). However, although ontology-authoring tools
becomes extremely huge. have been developed in the past decades, manually con-
Given that the Web has become a vehicle for the structing ontologies whenever new domains are encoun-
distribution of information, many news organizations are tered is an error-prone and time-consuming process. There-
providing newswire services through the Internet. Given fore, integration of knowledge acquisition with data min-
this popularity of the Web news services, text mining on ing, which is referred to as ontology learning, becomes
news datasets has received significant attentions during a must (Maedche & Staab, 2001).
the past few years. In particular, as several hundred news To facilitate information navigation and search on a
stories are published everyday at a single Web news site, news database, clustering can be utilized. Since a collec-
triggering the whole mining process whenever a docu- tion of documents is easy to skim if similar articles are
ment is added to the database is computationally imprac- grouped together, if the news articles are hierarchically
tical. Therefore, efficient incremental text mining tools classified according to their topics, then a query can be
need to be developed. formulated while a user navigates a cluster hierarchy.
Moreover, clustering can be used to identify and deal with
near-duplicate articles. That is, when news feeds repeat
BACKGROUND stories with minor changes from hour to hour, presenting
only the most recent articles is probably sufficient. In
The simplest document access method within Web news particular, a sophisticated incremental hierarchical docu-
services is keyword-based retrieval. Although this method ment clustering algorithm can be effectively used to
seems effective, there exist at least three serious draw- address high rate of document update. Moreover, in order
backs. First, if a user chooses irrelevant keywords, then to achieve rich semantic information retrieval, an ontol-
retrieval accuracy will be degraded. Second, since key- ogy-based approach would be provided. However, one of
word-based retrieval relies on the syntactic properties of the main problems with concept-based ontologies is that
information (e.g., keyword counting), semantic gap can- topically related concepts and terms are not explicitly
not be overcome (Grosky, Sreenath, & Fotouhi, 2002). linked. That is, there is no relation between court-attor-
Third, only expected information can be retrieved since ney, kidnap-police, and etcetera. Thus, concept-based

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Incremental Mining from News Streams

ontologies have a limitation in supporting a topical search. into the database, whenever a new document is
In sum, it is essential to develop incremental text mining inserted it should perform a fast update of the I
methods for intelligent news information presentation. existing cluster structure.
6. Meaningful theme of clusters: We expect each
cluster to reflect a meaningful theme. We define
MAIN THRUST meaningful theme in terms of precision and recall.
That is, if a cluster (C) is about Turkey earth-
In the following, we will explore text mining approaches quake, then all documents about Turkey earth-
that are relevant for news streams data. quake should belong to C, and documents that do
not talk about Turkey earthquake should not
Requirements of Document Clustering belong to C.
7. Interpretability of resulting clusters: A clustering
in News Streams structure needs to be tied up with a succinct sum-
mary of each cluster. Consequently, clustering re-
Data we are considering are high dimensional, large in sults should be easily comprehensible by users.
size, noisy, and a continuous stream of documents. Many
previously proposed document clustering algorithms did
not perform well on this dataset due to a variety of
Previous Document Clustering
reasons. In the following, we define application-depen- Approaches
dent (in terms of news streams) constraints that the
clustering algorithm must satisfy. The most widely used document clustering algorithms fall
into two categories: partition-based clustering and hier-
1. Ability to determine input parameters: Many clus- archical clustering. In the following, we provide a concise
tering algorithms require a user to provide input overview for each of them, and discuss why these ap-
parameters (e.g., the number of clusters), which is proaches fail to address the requirements discussed above.
difficult to be determined in advance, in particular Partition-based clustering decomposes a collection of
when we are dealing with incremental datasets. documents, which is optimal with respect to some pre-
Thus, we expect the clustering algorithm not to need defined function (Duda, Hart, & Stork, 2001; Liu, Gong,
such kind of knowledge. Xu, & Zhu, 2002). Typical methods in this category
2. Scalability with large number of documents: The include center-based clustering, Gaussian Mixture Model,
number of documents to be processed is extremely and etcetera. Center-based algorithms identify the clus-
large. In general, the problem of clustering n objects ters by partitioning the entire dataset into a pre-deter-
into k clusters is NP-hard. Successful clustering mined number of clusters (e.g., k-means clustering). Al-
algorithms should be scalable with the number of though the center-based clustering algorithms have been
documents. widely used in document clustering, there exist at least
3. Ability to discover clusters with different shapes five serious drawbacks. First, in many center-based clus-
and sizes: The shape of document cluster can be of tering algorithms, the number of clusters needs to be
arbitrary shapes; hence we cannot assume the shape determined beforehand. Second, the algorithm is sensi-
of document cluster (e.g., hyper-sphere in k-means). tive to an initial seed selection. Third, it can model only a
In addition, the sizes of clusters can be of arbitrary spherical (k-means) or ellipsoidal (k-medoid) shape of
numbers, thus clustering algorithms should iden- clusters. Furthermore, it is sensitive to outliers since a
tify the clusters with wide variance in size. small amount of outliers can substantially influence the
4. Outliers Identification: In news streams, outliers mean value. Note that capturing an outlier document and
have a significant importance. For instance, a unique forming a singleton cluster is important. Finally, due to the
document in a news stream may imply a new technol- nature of an iterative scheme in producing clustering
ogy or event that has not been mentioned in previ- results, it is not relevant for incremental datasets.
ous articles. Thus, forming a singleton cluster for Hierarchical (agglomerative) clustering (HAC) identi-
the outlier is important. fies the clusters by initially assigning each document to
5. Efficient incremental clustering: Given different its own cluster and then repeatedly merging pairs of
ordering of a same dataset, many incremental clus- clusters until a certain stopping condition is met (Zhao &
tering algorithms produce different clusters, which Karypis, 2002). Consequently, its result is in the form of
is an unreliable phenomenon. Thus, the incremental a tree, which is referred to as a dendrogram. A dendrogram
clustering should be robust to the input sequence. is represented as a tree with numeric levels associated to
Moreover, due to the frequent document insertion its branches. The main advantage of HAC lies in its ability

607

TEAM LinG
Incremental Mining from News Streams

to provide a view of data at multiple levels of abstraction. methodology is required in terms of incremental news
Although HAC can model arbitrary shapes and different article clustering.
sizes of clusters, and can be extended to the robust version
(in outlier handling sense), it is not relevant for news Dynamic Topic Mining
streams application due to the following two reasons.
First, since HAC builds a dendrogram, a user should Dynamic topic mining is a framework that supports the
determine where to cut the dendrogram to produce actual identification of meaningful patterns (e.g., events, top-
clusters. This step is usually done by human visual inspec- ics, and topical relations) from news stream data (Chung
tion, which is a time-consuming and subjective process. & McLeod, 2003). To build a novel paradigm for an
Second, the computational complexity of HAC is expen- intelligent news database management and navigation
sive since pairwise similarities between clusters need to be scheme, it utilizes techniques in information retrieval,
computed. data mining, machine learning, and natural language
processing.
Topic Detection and Tracking In dynamic topic mining, a Web crawler downloads
news articles from a news Web site on a daily basis.
Over the past six years, the information retrieval commu- Retrieved news articles are processed by diverse infor-
nity has developed a new research area, called TDT (Topic mation retrieval and data mining tools to produce useful
Detection and Tracking) (Makkonen, Ahonen-Myka, & higher-level knowledge, which is stored in a content
Salmenkivi, 2004; Allan, 2002). The main goal of TDT is to description database. Instead of interacting with a Web
detect the occurrence of a novel event in a stream of news news service directly, by exploiting the knowledge in the
stories, and to track the known event. In particular, there database, an information delivery agent can present an
are three major components in TDT. answer in response to a user request (in terms of topic
detection and tracking, keyword-based retrieval, docu-
1. Story segmentation: It segments a news stream (e.g., ment cluster visualization, etc). Key contributions of the
including transcribed speech) into topically cohe- dynamic topic mining framework are development of a
sive stories. Since online Web news (in HTML for- novel hierarchical incremental document clustering al-
mat) is supplied in segmented form, this task only gorithm, and a topic ontology learning framework.
applies to audio or TV news. Despite the huge body of research efforts on docu-
2. First Story Detection (FSD): It identifies whether a ment clustering, previously proposed document cluster-
new document belongs to an existing topic or a new ing algorithms are limited in that it cannot address special
topic. requirements in a news environment. That is, an algo-
3. Topic tracking: It tracks events of interest based on rithm must address the seven application-dependent
sample news stories. It associates incoming news constraints discussed before. Toward this end, the dy-
stories with the related stories, which were already namic topic mining framework presents a sophisticated
discussed before. It can be also asked to monitor the incremental hierarchical document clustering algorithm
news stream for further stories on the same topic. that utilizes a neighborhood search. The algorithm was
tested to demonstrate the effectiveness in terms of the
Event is defined as some unique thing that happens seven constraints. The novelty of the algorithm is the
at some point in time. Hence, an event is different from a ability to identify meaningful patterns (e.g., news events,
topic. For example, airplane crash is a topic while Chi- and news topics) while reducing the amount of compu-
nese airplane crash in Korea in April 2002 is an event. tations by maintaining cluster structure incrementally.
Note that it is important to identify events as well as topics. In addition, to overcome the lack of topical relations
Although a user is not interested in a flood topic, the user in conceptual ontologies, a topic ontology learning frame-
may be interested in the news story on the Texas flood if work is presented. The proposed topic ontologies pro-
the users hometown is from Texas. Thus, a news recom- vide interpretations of news topics at different levels of
mendation system must be able to distinguish different abstraction. For example, regarding to a Winona Ryder
events within a same topic. court trial news topic (T), the dynamic topic mining could
Single-pass document clustering (Chung & McLeod, capture winona, ryder, actress, shoplift, beverly as
2003) has been extensively used in TDT research. How- specific terms describing T (i.e., the specific concept for
ever, the major drawback of this approach lies in order- T) while attorney, court, defense, evidence, jury, kill,
sensitive property. Although the order of documents is law, legal, murder, prosecutor, testify, trial as general
already fixed since documents are inserted into the data- terms representing T (i.e., the general concept for T).
base in chronological order, order-sensitive property im- There exists research work on extracting hierarchical
plies that the resulting cluster is unreliable. Thus, new relations between terms from a set of documents (Tseng,

608

TEAM LinG
Incremental Mining from News Streams

FUTURE TRENDS

There are many future research opportunities in news streams mining. First, although a document hierarchy can be obtained using unsupervised clustering, as shown in Aggarwal, Gates, and Yu (2004), the cluster quality can be enhanced if a pre-existing knowledge base is exploited. That is, based on this prior knowledge, we can have some control while building a document hierarchy.

Second, document representation for clustering can be augmented with phrases by employing different levels of linguistic analysis (Hatzivassiloglou, Gravano, & Maganti, 2000). That is, the representation model can be augmented by adding n-grams (Peng & Schuurmans, 2003) or frequent itemsets obtained through association rule mining (Hand, Mannila, & Smyth, 2001). Investigating how different feature selection algorithms affect the accuracy of clustering results is an interesting research direction.

Third, besides exploiting text data, other information can be utilized, since Web news articles are composed of text, hyperlinks, and multimedia data. For example, both terms and hyperlinks (which point to related news articles or Web pages) can be used for feature selection.

Finally, a topic ontology learning framework can be extended to accommodate rich semantic information extraction. For example, topic ontologies can be annotated within the Protégé (Noy, Sintek, Decker, Crubezy, Fergerson, & Musen, 2001) WordNet tab. In addition, a query expansion algorithm based on ontologies needs to be developed for intelligent news information presentation.

CONCLUSION

Incremental text mining from news streams is an emerging technology, as many news organizations are providing newswire services through the Internet. In order to accommodate dynamically changing topics, efficient incremental document clustering algorithms need to be developed. The algorithms must address the special requirements of news clustering, such as a high rate of document update and the ability to identify event-level clusters as well as topic-level clusters.

In order to achieve rich semantic information retrieval within Web news services, an ontology-based approach is provided. To overcome the problem of concept-based ontologies (i.e., topically related concepts and terms are not explicitly linked), topic ontologies are presented to characterize news topics at multiple levels of abstraction. In sum, by coupling topic ontologies with concept-based ontologies, both topical search and semantic information retrieval can be supported.

REFERENCES

Aggarwal, C.C., Gates, S.C., & Yu, P.S. (2004). On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 16(2), 245-255.
Allan, J. (2002). Detection as multi-topic tracking. Information Retrieval, 5(2-3), 139-157.
Chung, S., & McLeod, D. (2003, November). Dynamic topic mining from news stream data. In ODBASE'03 (pp. 653-670). Catania, Sicily, Italy.
Duda, R.O., Hart, P.E., & Stork, D.G. (2001). Pattern classification. New York: Wiley Interscience.
Grosky, W.I., Sreenath, D.V., & Fotouhi, F. (2002). Emergent semantics and the multimedia semantic web. SIGMOD Record, 31(4), 54-58.
Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining (adaptive computation and machine learning). Cambridge, MA: The MIT Press.
Hatzivassiloglou, V., Gravano, L., & Maganti, A. (2000, July). An investigation of linguistic features and clustering algorithms for topical document clustering. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'00) (pp. 224-231). Athens, Greece.
Khan, L., McLeod, D., & Hovy, E. (2004). Retrieval effectiveness of an ontology-based model for information selection. The VLDB Journal, 13(1), 71-85.
Liu, X., Gong, Y., Xu, W., & Zhu, S. (2002, August). Document clustering with cluster refinement and model selection capabilities. In ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'02) (pp. 91-198). Tampere, Finland.
Maedche, A., & Staab, S. (2001). Ontology learning for the semantic Web. IEEE Intelligent Systems, 16(2), 72-79.
Makkonen, J., Ahonen-Myka, H., & Salmenkivi, M. (2004). Simple semantics in topic detection and tracking. Information Retrieval, 7(3-4), 347-368.
Noy, N.F., Sintek, M., Decker, S., Crubezy, M., Fergerson, R.W., & Musen, M.A. (2001). Creating semantic Web contents with Protégé 2000. IEEE Intelligent Systems, 6(12), 60-71.

Peng, F., & Schuurmans, D. (2003, April). Combining naive Bayes and n-gram language models for text classification. European Conference on IR Research (ECIR'03) (pp. 335-350). Pisa, Italy.
Tseng, Y. (2002). Automatic thesaurus generation for Chinese documents. Journal of the American Society for Information Science and Technology, 53(13), 1130-1138.
Zhao, Y., & Karypis, G. (2002, November). Evaluations of hierarchical clustering algorithms for document datasets. In ACM International Conference on Information and Knowledge Management (CIKM'02) (pp. 515-524). McLean, VA.

KEY TERMS

Clustering: An unsupervised process of dividing data into meaningful groups such that each identified cluster can explain the characteristics of the underlying data distribution. Examples include characterization of different customer groups based on the customers' purchasing patterns, categorization of documents in the World Wide Web, or grouping of spatial locations of the earth where neighboring points in each region have similar short-term/long-term climate patterns.
Dynamic Topic Mining: A framework that supports the identification of meaningful patterns (e.g., events, topics, and topical relations) from news stream data.
First Story Detection: A TDT component that identifies whether a new document belongs to an existing topic or a new topic.
Ontology: A collection of concepts and inter-relationships.
Text Mining: A process of identifying patterns or trends in natural language text, including document clustering, document classification, ontology learning, and so forth.
Topic Detection and Tracking (TDT): A DARPA-sponsored initiative to investigate the state of the art for news understanding systems. Specifically, TDT is composed of the following three major components: (1) segmenting a news stream (e.g., including transcribed speech) into topically cohesive stories; (2) identifying novel stories that are the first to discuss a new event; and (3) tracking known events given sample stories.
Topic Ontology: A collection of terms that characterize a topic at multiple levels of abstraction.
Topic Tracking: A TDT component that tracks events of interest based on sample news stories. It associates incoming news stories with related stories that were already discussed, or it monitors the news stream for further stories on the same topic.

Inexact Field Learning Approach for Data Mining
Honghua Dai
Deakin University, Australia

INTRODUCTION

Inexact field learning (IFL) (Ciesielski & Dai, 1994; Dai & Ciesielski, 1994a, 1994b, 1995, 2004; Dai & Li, 2001) is a rough-set-theory-based (Pawlak, 1982) machine learning approach that derives inexact rules from the fields of each attribute. In contrast to a point-learning algorithm (Quinlan, 1986, 1993), which derives rules by examining individual values of each attribute, a field-learning approach (Dai, 1996) derives rules by examining the fields of each attribute. In contrast to an exact rule, an inexact rule is a rule with uncertainty. The advantages of the IFL method are its capability to discover high-quality rules from low-quality data, its tolerance of low-quality data (Dai & Ciesielski, 1994a, 2004), its high efficiency in discovery, and the high accuracy of the discovered rules.

BACKGROUND

Achieving high prediction accuracy rates is crucial for all learning algorithms, particularly in real applications. In the area of machine learning, a well-recognized problem is that the derived rules can fit the training data very well but fail to achieve a high accuracy rate on new unseen cases. This is particularly true when the learning is performed on low-quality databases. Such a problem is referred to as the Low Prediction Accuracy (LPA) problem (Dai & Ciesielski, 1994b, 2004; Dai & Li, 2001), which can be caused by several factors. In particular, overfitting low-quality data and being misled by it seem to be the most significant problems that can hamper a learning algorithm from achieving high accuracy. Traditional learning methods derive rules by examining individual values of instances (Quinlan, 1986, 1993). To generate classification rules, these methods always try to find cut-off points, such as in well-known decision tree algorithms (Quinlan, 1986, 1993).

What we present here is an approach for deriving rough classification rules from large low-quality numerical databases that appears able to overcome these two problems. The algorithm works on the fields of continuous numeric variables, that is, the intervals of possible values of each attribute in the training set, rather than on individual point values. The discovered rule is in a form called a b-rule and is somewhat analogous to a decision tree found by an induction algorithm. The algorithm is linear in both the number of attributes and the number of instances (Dai & Ciesielski, 1994a, 2004).

The advantage of this inexact field-learning approach is its capability of inducing high-quality classification rules from low-quality data, together with its high efficiency, which makes it an ideal algorithm for discovering reliable knowledge from large and very large low-quality databases in data mining, which demands greater discovery capability.

INEXACT FIELD-LEARNING ALGORITHM

A detailed description and the applications of the algorithm can be found in the listed articles (Ciesielski & Dai, 1994a; Dai & Ciesielski, 1994a, 1994b, 1995, 2004; Dai, 1996; Dai & Li, 2001). The following is a description of the inexact field-learning algorithm, the Fish_net algorithm.

Input: The input of the algorithm is a training data set with m instances and n attributes:

$$
\begin{array}{c|cccc|c}
\text{Instances} & X_1 & X_2 & \cdots & X_n & \text{Classes} \\
\hline
\text{Instance}_1 & a_{11} & a_{12} & \cdots & a_{1n} & \omega_1 \\
\text{Instance}_2 & a_{21} & a_{22} & \cdots & a_{2n} & \omega_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\text{Instance}_m & a_{m1} & a_{m2} & \cdots & a_{mn} & \omega_m
\end{array}
\tag{1}
$$

Learning Process:

Step 1: Work out the field of each attribute $\{x_j \mid 1 \le j \le n\}$ with respect to each class:

$$h_j^{(k)} = [\, h_{jl}^{(k)},\, h_{ju}^{(k)} \,] \qquad (k = 1, 2, \ldots, s;\; j = 1, 2, \ldots, n) \tag{2}$$

$$h_{ju}^{(k)} = \max\{\, a_{ij} \mid i = 1, 2, \ldots, m;\ a_{ij} \in a_j^{(k)} \,\} \qquad (k = 1, 2, \ldots, s;\; j = 1, 2, \ldots, n) \tag{3}$$

$$h_{jl}^{(k)} = \min\{\, a_{ij} \mid i = 1, 2, \ldots, m;\ a_{ij} \in a_j^{(k)} \,\} \qquad (k = 1, 2, \ldots, s;\; j = 1, 2, \ldots, n) \tag{4}$$

where $a_j^{(k)}$ denotes the set of values that attribute $x_j$ takes in the training instances of class $k$.

Step 2: Construct the contribution function based on the fields found in Step 1:

$$c_k(x_j) = \begin{cases} 0 & x_j \in \bigcup_{i \neq k} h_j^{(i)} - h_j^{(k)} \\ 1 & x_j \in h_j^{(k)} - \bigcup_{i \neq k} h_j^{(i)} \\ \dfrac{x_j - a}{b - a} & x_j \in h_j^{(k)} \cap \bigl( \bigcup_{i \neq k} h_j^{(i)} \bigr) \end{cases} \qquad (k = 1, 2, \ldots, s) \tag{5}$$

Formula (5) is given on the assumption that $[a, b] = h_j^{(k)} \cap \bigl( \bigcup_{i \neq k} h_j^{(i)} \bigr)$ and that, for any small number $\varepsilon > 0$, $b + \varepsilon \in h_j^{(k)}$ and $a + \varepsilon \in \bigcup_{i \neq k} h_j^{(i)}$ or $a - \varepsilon \notin \bigcup_{i \neq k} h_j^{(i)}$. Otherwise, formula (5) becomes

$$c_k(x_j) = \begin{cases} 0 & x_j \in \bigcup_{i \neq k} h_j^{(i)} - h_j^{(k)} \\ 1 & x_j \in h_j^{(k)} - \bigcup_{i \neq k} h_j^{(i)} \\ \dfrac{x_j - b}{a - b} & x_j \in h_j^{(k)} \cap \bigl( \bigcup_{i \neq k} h_j^{(i)} \bigr) \end{cases} \qquad (k = 1, 2, \ldots, s) \tag{6}$$

Step 3: Work out the contribution fields by applying the constructed contribution functions to the training data set. Calculate the contribution of each instance:

$$\mu(I_i) = \frac{1}{n} \sum_{j=1}^{n} c(x_{ij}) \qquad (i = 1, 2, \ldots, m) \tag{7}$$

Then work out the contribution field for each class, $h^{+} = \langle h_l^{+}, h_u^{+} \rangle$:

$$h_u^{+} = \max\{\, \mu(I_i) \mid i = 1, 2, \ldots, m;\ I_i \in \text{the positive class} \,\} \tag{8}$$

$$h_l^{+} = \min\{\, \mu(I_i) \mid i = 1, 2, \ldots, m;\ I_i \in \text{the positive class} \,\} \tag{9}$$

Similarly, we can find $h^{-} = \langle h_l^{-}, h_u^{-} \rangle$.

Step 4: Construct the belief function using the derived contribution fields:

$$Br(C) = \begin{cases} -1 & \text{Contribution} \in \text{NegativeRegion} \\ 1 & \text{Contribution} \in \text{PositiveRegion} \\ \dfrac{c - a}{b - a} & \text{Contribution} \in \text{RoughRegion} \end{cases} \tag{10}$$

Step 5: Decide the threshold. Six different cases need to be considered; the simplest is to take the threshold

$$\alpha = \text{the midpoint of } h^{+} \text{ and } h^{-} \tag{11}$$

Step 6: Form the inexact rule:

$$\text{If } \mu(I) = \frac{1}{N} \sum_{i=1}^{N} c(x_i) > \alpha \ \text{ Then } I \text{ belongs to the positive class with belief } Br(c) \tag{12}$$

This algorithm was tested on three large real observational weather data sets containing both high-quality and low-quality data. The accuracy rates of the forecasts were 86.4%, 78%, and 76.8%. These are significantly better than the accuracy rates achieved by C4.5 (Quinlan, 1986, 1993), feed-forward neural networks, discrimination analysis, k-nearest neighbor classifiers, and human weather forecasters. The fish-net algorithm exhibited significantly less overfitting than the other algorithms. The training times were shorter, in some cases by orders of magnitude (Dai & Ciesielski, 1994a, 2004; Dai, 1996).
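
To make the field-and-contribution idea concrete, the sketch below computes per-class attribute fields for a two-class problem, ramps the contribution linearly across overlapping fields, and thresholds the average contribution at a midpoint, following the simplest case described above. It is an illustration of the field-learning principle only, not the published FISH-NET implementation, and all helper names are invented for the example.

```python
def learn_fields(X, y):
    """Per class and per attribute, record the field [min, max] of the
    values that the attribute takes in the training instances of that class."""
    fields = {}
    for cls in set(y):
        cols = list(zip(*[row for row, label in zip(X, y) if label == cls]))
        fields[cls] = [(min(c), max(c)) for c in cols]
    return fields

def contribution(x, pos, neg):
    """1 in the purely positive field, 0 in the purely negative field,
    and a linear ramp across the region where the two fields overlap."""
    (p_lo, p_hi), (n_lo, n_hi) = pos, neg
    lo, hi = max(p_lo, n_lo), min(p_hi, n_hi)          # overlap, if any
    if lo <= x <= hi and hi > lo:                      # inside the overlap
        # ramp toward the side where only the positive field continues
        return (hi - x) / (hi - lo) if p_lo < n_lo else (x - lo) / (hi - lo)
    if p_lo <= x <= p_hi:                              # positive field only
        return 1.0
    return 0.0                                         # negative field or outside

def field_learning_rule(X, y, positive, negative):
    fields = learn_fields(X, y)
    def mu(row):                                       # average contribution
        return sum(contribution(v, fields[positive][j], fields[negative][j])
                   for j, v in enumerate(row)) / len(row)
    pos_scores = [mu(r) for r, lab in zip(X, y) if lab == positive]
    neg_scores = [mu(r) for r, lab in zip(X, y) if lab == negative]
    # simplest threshold: a midpoint between the two contribution fields
    threshold = (min(pos_scores) + max(neg_scores)) / 2.0
    return lambda row: positive if mu(row) > threshold else negative
```

The returned classifier depends only on the interval boundaries and the average contribution of an instance, which is what makes this style of rule relatively insensitive to individual noisy point values.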

FUTURE TRENDS

The inexact field-learning approach has led to a successful algorithm in a domain where there is a high level of noise. We believe that other algorithms based on fields can also be developed. The b-rules produced by the current FISH-NET algorithm involve linear combinations of attributes; non-linear rules may be even more accurate.

While extensive tests have been done on the fish-net algorithm with large meteorological databases, nothing in the algorithm is specific to meteorology. It is expected that the algorithm will perform equally well in other domains. In parallel to most existing exact machine-learning methods, inexact field-learning approaches can be used for mining large or very large noisy data sets, particularly where data quality is a major problem that may not be dealt with by other data-mining approaches. Various learning algorithms can be created based on the fields derived from a given training data set. There are several new applications of inexact field learning, such as Zhuang and Dai (2004) for Web document clustering, and some other inexact learning approaches (Ishibuchi et al., 2001; Kim et al., 2003). The major trends of this approach are the following:

1. Heavy application to all sorts of data-mining tasks in various domains.
2. Development of new powerful discovery algorithms in conjunction with IFL and traditional learning approaches.
3. Extension of the current IFL approach to deal with high-dimensional, non-linear, and continuous problems.

CONCLUSION

The inexact field-learning algorithm fish-net was developed for the purpose of learning rough classification/forecasting rules from large, low-quality numeric databases. It runs highly efficiently and generates robust rules that neither overfit the training data nor result in low prediction accuracy.

The inexact field-learning algorithm, fish-net, is based on the fields of the attributes rather than on individual point values. The experimental results indicate that:

1. The fish-net algorithm is linear both in the number of instances and in the number of attributes. Further, the CPU time grows much more slowly than that of the other algorithms we investigated.
2. The fish-net algorithm achieved the best prediction accuracy on new unseen cases out of all the methods tested (i.e., C4.5, feed-forward neural network algorithms, a k-nearest neighbor method, the discrimination analysis algorithm, and human experts).
3. The fish-net algorithm successfully overcame the LPA problem on the two large low-quality data sets examined. Both the absolute LPA error rate and the relative LPA error rate (Dai & Ciesielski, 1994b) of the fish-net were very low on these data sets. They were significantly lower than those of point-learning approaches, such as C4.5, on all the data sets and lower than that of the feed-forward neural network. A reasonably low LPA error rate was achieved by the feed-forward neural network, but with the high time cost of error back-propagation. The LPA error rate of the KNN method is comparable to fish-net; this was achieved after a very high-cost genetic algorithm search.
4. The FISH-NET algorithm was obviously not affected by low-quality data. It performed equally well on low-quality data and high-quality data.

REFERENCES

Ciesielski, V., & Dai, H. (1994a). FISHERMAN: A comprehensive discovery, learning and forecasting system. Proceedings of the 2nd Singapore International Conference on Intelligent Systems, Singapore.
Dai, H. (1994c). Learning of forecasting rules from large noisy meteorological data [doctoral thesis]. RMIT, Melbourne, Victoria, Australia.
Dai, H. (1996a). Field learning. Proceedings of the 19th Australian Computer Science Conference.
Dai, H. (1996b). Machine learning of weather forecasting rules from large meteorological data bases. Advances in Atmospheric Science, 13(4), 471-488.
Dai, H. (1997). A survey of machine learning [technical report]. Monash University, Melbourne, Victoria, Australia.
Dai, H., & Ciesielski, V. (1994a). Learning of inexact rules by the FISH-NET algorithm from low quality data. Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, Brisbane, Australia.
Dai, H., & Ciesielski, V. (1994b). The low prediction accuracy problem in learning. Proceedings of the Second Australian and New Zealand Conference on Intelligent Systems, Armidale, NSW, Australia.
Dai, H., & Ciesielski, V. (1995). Inexact field learning using the FISH-NET algorithm [technical report]. Monash University, Melbourne, Victoria, Australia.
Dai, H., & Ciesielski, V. (2004). Learning of fuzzy classification rules by inexact field learning approach [technical report]. Deakin University, Melbourne, Australia.
Dai, H., & Li, G. (2001). Inexact field learning: An approach to induce high quality rules from low quality data. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM-01), San Jose, California.

Ishibuchi, H., Yamamoto, T., & Nakashima, T. (2001). Fuzzy data mining: Effect of fuzzy discretization. Proceedings of the IEEE International Conference on Data Mining, San Jose, California.
Kim, M., Ryu, J., Kim, S., & Lee, J. (2003). Optimization of fuzzy rules for classification using genetic algorithm. Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Seoul, Korea.
Pawlak, Z. (1982). Rough sets. International Journal of Information and Computer Science, 11(5), 145-172.
Quinlan, R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.
Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.
Zhuang, L., & Dai, H. (2004). Maximal frequent itemset approach for Web document clustering. Proceedings of the 2004 International Conference on Computer and Information Technology (CIT'04), Wuhan, China.

KEY TERMS

b-Rule: A type of inexact rule that represents uncertainty with contribution functions and belief functions.
Exact Learning: Learning approaches that are capable of inducing exact rules.
Exact Rules: Rules without uncertainty.
Field Learning: Derives rules by looking at the field of the values of each attribute across all the instances of the training data set.
Inexact Learning: Learning by which inexact rules are induced.
Inexact Rules: Rules with uncertainty.
Low-Quality Data: Data with much noise, missing values, redundant features, mistakes, and so forth.
LPA (Low Prediction Accuracy) Problem: The problem in which derived rules can fit the training data very well but fail to achieve a high accuracy rate on new unseen cases.
Point Learning: Derives rules by looking at each individual point value of the attributes in every instance of the training data set.

Information Extraction in Biomedical Literature
Min Song
Drexel University, USA

Il-Yeol Song
Drexel University, USA

Xiaohua Hu
Drexel University, USA

Hyoil Han
Drexel University, USA

INTRODUCTION

Information extraction (IE) technology has been defined and developed through the US DARPA Message Understanding Conferences (MUCs). IE refers to the identification of instances of particular events and relationships from unstructured natural language text documents into a structured representation or relational table in databases. It has proved successful at extracting information from various domains, such as Latin American terrorism, to identify patterns related to terrorist activities (MUC-4). Another domain, in the light of exploiting the wealth of natural language documents, is to extract the knowledge or information from these unstructured plain-text files into a structured or relational form. This form is suitable for sophisticated query processing, for integration with relational databases, and for data mining. Thus, IE is a crucial step in making text files more easily accessible.

BACKGROUND

The advent of large volumes of text databases and search engines has made them readily available to domain experts and has significantly accelerated research on bioinformatics. With the size of a digital library commonly exceeding millions of documents, rapidly increasing, and covering a wide range of topics, efficient and automatic extraction of meaningful data and relations has become a challenging issue. To tackle this issue, rigorous studies have been carried out recently to apply IE to biomedical data. Such research efforts began to be called biomedical literature mining or text mining in bioinformatics (de Bruijn & Martin, 2002; Hirschman et al., 2002; Shatkay & Feldman, 2003). In this article, we review recent advances in applying IE techniques to biomedical literature.

Figure 1. Overview of a typical biomedical literature mining system

MAIN THRUST

This article attempts to synthesize the works that have been done in the field. A taxonomy helps us understand the accomplishments and challenges in this emerging field. In this article, we use the following set of criteria to classify the biomedical literature mining related studies:

1. What are the target objects that are to be extracted?
2. What techniques are used to extract the target objects from the biomedical literature?
3. How are the techniques or systems evaluated?
4. From what data sources are the target objects extracted?

Target Objects

In terms of what is to be extracted by the systems, most studies can be broken into the following two major areas: (1) named entity extraction, such as proteins or genes; and (2) relation extraction, such as relationships between proteins. Most of these studies adopt information extraction techniques using a curated lexicon or natural language processing for identifying relevant tokens such as words or phrases in text (Shatkay & Feldman, 2003).

In the area of named entity extraction, Proux et al. (2000) use single-word names only, with a test set selected from 1,200 sentences coming from Flybase. Collier et al. (2000) adopt Hidden Markov Models (HMMs) for 10 test classes with small training and test sets. Krauthammer et al. (2000) use the BLAST database with letters encoded as 4-tuples of DNA. Demetriou and Gaizauskas (2002) pipeline the mining processes, including hand-crafted components and machine learning components; for the study, they use large lexicon and morphology components. Narayanaswamy et al. (2003) use a part-of-speech (POS) tagger for tagging the parsed MEDLINE abstracts. Although Narayanaswamy and his colleagues (2003) implement an automatic protein name detection system, the number of words used is 302, and, thus, it is difficult to judge the quality of their system, since the size of the test data is too small. Yamamoto et al. (2003) use morphological analysis techniques for preprocessing protein name tagging and apply a support vector machine (SVM) for extracting protein names. They found that increasing the training data from 390 abstracts to 1,600 abstracts improved F-value performance from 70% to 75%. Lee et al. (2003) combined an SVM and dictionary lookup for named entity recognition. Their approach is based on two phases: the first phase is identification of each entity with an SVM classifier, and the second phase is post-processing to correct the errors made by the SVM with a simple dictionary lookup. Bunescu et al. (2004) studied protein name identification and protein-protein interaction. Among the several approaches used in their study, the main two are one using POS tagging and the other using generalized dictionary-based tagging; their dictionary-based tagging yields a higher F-value. Table 1 summarizes the works in the area of named entity extraction in biomedical literature.

Table 1. A summary of works in biomedical entity extraction

Author | Named Entities | Database | No. of Words | Learning Methods | F-Value
Collier, et al. (2000) | Proteins and DNA | MEDLINE | 30,000 | HMM | 73
Krauthammer, et al. (2000) | Gene and protein | Review articles | 5,000 | Character sequence mapping | 75
Demetriou and Gaizauskas (2002) | Protein, species, and 10 more | MEDLINE | 30,000 | PASTA template filling | 83
Narayanaswamy (2003) | Protein | MEDLINE | 302 | Hand-crafted rules and co-occurrence | 75.86
Yamamoto, et al. (2003) | Protein | GENIA | 1,600 abstracts | BaseNP recognition | 75
Lee, et al. (2003) | Protein, DNA, RNA | GENIA | 10,000 | SVM | 77
Bunescu (2004) | Protein | MEDLINE | 5,206 | RAPIER, BWI, TBL, k-NN, SVMs, MaxEnt | 57.86

The second target object type of biomedical literature extraction is relation extraction. Leek (1997) applies HMM techniques to identify gene names and chromosomes through heuristics. Blaschke et al. (1999) extract protein-protein interactions based on co-occurrence of the form "p1 ... I1 ... p2" within a sentence, where p1 and p2 are proteins and I1 is an interaction term; protein names and interaction terms (e.g., activate, bind, inhibit) are provided as a dictionary. Proux (2000) extracts an "interact" relation for the gene entity from the Flybase database. Pustejovsky (2002) extracts an "inhibit" relation for the gene entity from MEDLINE. Jenssen et al. (2001) extract gene-gene relations based on co-occurrence of the form "g1 ... g2" within MEDLINE abstracts, where g1 and g2 are gene names; gene names are provided as a dictionary harvested from HUGO, LocusLink, and other sources. Although their study uses 13,712 named human genes and millions of MEDLINE abstracts, no extensive quantitative results are reported and analyzed. Friedman et al. (2001) extract a "pathway" relation for various biological entities from a variety of articles. In their work, the precision of the experiments is high (from 79-96%), but the recalls are relatively low (from 21-72%). Bunescu et al. (2004) conducted protein/protein interaction identification with several learning methods, such as pattern matching rule induction (RAPIER), boosted wrapper induction (BWI), and extraction using longest common subsequences (ELCS). ELCS automatically learns rules for extracting protein interactions using a bottom-up approach.
They conducted experiments in two ways: one with manually crafted protein names and the other with protein names extracted by their name identification method. In both experiments, Bunescu et al. compared their results with human-written rules and showed that machine learning methods provide higher precision than human-written rules. Table 2 summarizes the works in the area of relation extraction in biomedical literature.

Table 2. A summary of relation extraction for biomedical data

Authors | Relation | Entity | DB | Learning Methods | Precision | Recall
Leek (1997) | Location | Gene | OMIM | HMM | 80% | 36%
Blaschke (1999) | Interact | Protein | MEDLINE | Co-occurrence | n/a | n/a
Proux (2000) | Interact | Gene | Flybase | Co-occurrence | 81% | 44%
Pustejovsky (2001) | Inhibit | Gene | MEDLINE | Co-occurrence | 90% | 57%
Jenssen (2001) | Location | Gene | MEDLINE | Co-occurrence | n/a | n/a
Friedman (2001) | Pathway | Many | Articles | Co-occurrence and thesauri | 96% | 63%
Bunescu (2004) | Interact | Protein | MEDLINE | RAPIER, BWI, ELCS | n/a | n/a

Techniques Used

The most commonly used extraction technique is co-occurrence based. The basic idea of this technique is that entities are extracted based on the frequency of co-occurrence of biomedical named entities, such as proteins or genes, within sentences. This technique was introduced by Blaschke et al. (1999). Their goal was to extract information from scientific text about protein interactions among a predetermined set of related proteins. Since Blaschke and his colleagues' study, numerous other co-occurrence-based systems have been proposed in the literature. All are associated with information extraction of biomedical entities from the unstructured text corpus. The common denominator of the co-occurrence-based systems is that they are based on co-occurrences of names or identifiers of entities, typically along with activation/dependency terms. These systems are differentiated from one another by integrating different machine learning techniques, such as syntactical analysis or POS tagging, as well as ontologies and controlled vocabularies (Hahn et al., 2002; Pustejovsky et al., 2002; Yakushiji et al., 2001). Although these techniques are straightforward and easy to develop, from the performance standpoint, their recall and precision are much lower than those of other machine-learning techniques (Ray & Craven, 2001).
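
A minimal dictionary-based co-occurrence extractor of the kind just described can be sketched as follows. The protein names, interaction terms, and sentence splitting are illustrative assumptions, not the dictionaries or rules used by Blaschke et al. (1999).

```python
import re

# Illustrative dictionaries; real systems load curated protein name
# lists and interaction vocabularies instead.
PROTEINS = {"p53", "mdm2", "bcl-2", "bax"}
INTERACTION_TERMS = {"activates", "binds", "inhibits", "phosphorylates"}

def extract_interactions(text):
    """Report (protein1, interaction_term, protein2) triples whenever two
    known proteins co-occur in a sentence with an interaction term between them."""
    triples = []
    for sentence in re.split(r"[.!?]", text):
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok in INTERACTION_TERMS:
                left = [t for t in tokens[:i] if t in PROTEINS]
                right = [t for t in tokens[i + 1:] if t in PROTEINS]
                for p1 in left:
                    for p2 in right:
                        triples.append((p1, tok, p2))
    return triples

print(extract_interactions("MDM2 binds p53 and inhibits its activity. Bax binds Bcl-2."))
# [('mdm2', 'binds', 'p53'), ('bax', 'binds', 'bcl-2')]
```

The simplicity is also the weakness: any sentence that happens to mention two proteins and an interaction word yields a triple, which is one reason the recall and precision of purely co-occurrence-based systems lag behind those of more sophisticated learners.
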
In parallel with co-occurrence-based systems, researchers began to investigate other machine learning and NLP techniques. One of the earliest studies was done by Leek (1997), who utilized Hidden Markov Models (HMMs) to extract sentences discussing gene locations on chromosomes. HMMs are applied to represent sentence structures for natural language processing, where the states of an HMM correspond to candidate POS tags and the probabilistic transitions among states represent possible parses of the sentence, according to the matches of the terms occurring in it to the POSs. In the context of biomedical literature mining, HMMs are also used to model families of biological sequences as a set of different utterances of the same word generated by an HMM technique (Baldi et al., 1994).

Ray and Craven (2001) have proposed a more sophisticated HMM-based technique to distinguish fact-bearing sentences from uninteresting sentences. The target biological entities and relations that they intend to extract are protein subcellular localizations and gene-disorder associations. With a predefined lexicon of locations and proteins and several hundred training sentences derived from the Yeast database, they trained and tested the classifiers over a manually labeled corpus of about 3,000 MEDLINE abstracts. There have also been several studies applying natural language tagging and parsing techniques to biomedical literature mining. Friedman et al. (2001) propose methods for parsing sentences and using thesauri to extract facts about genes and proteins from biomedical documents; they extract interactions among genes and proteins as part of regulatory pathways.

Evaluation

One of the pivotal issues yet to be explored further in biomedical literature mining is how to evaluate the techniques or systems. The focus of the evaluation conducted in the literature is on extraction accuracy. The accuracy measures used in IE are the precision and recall ratios, computed over a set of N items, where N is either terms, sentences, or documents, and the system needs to label each item as positive or negative according to some criterion (positive, if a term belongs to a predefined document category or a term class). Although these evaluation measures are straightforward and well accepted, recall ratios often are criticized, as in the field of information retrieval, when the total number of true positive terms is not clearly defined.
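
In code, the two accuracy measures (and the F-value that combines them, defined later in the Key Terms) reduce to a few set operations over the extracted items; the slot-as-set representation below is a simplifying assumption.

```python
def precision_recall_f(predicted, gold):
    """predicted and gold are sets of extracted items (e.g., filled slots)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_value

# Example: the system found 4 protein names, 3 of which are in the answer key of 5.
print(precision_recall_f({"p53", "mdm2", "bax", "xyz"},
                         {"p53", "mdm2", "bax", "bcl-2", "raf"}))
# (0.75, 0.6, approximately 0.667)
```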

In IE, an evaluation forum similar to TREC in information retrieval (IR) is the Message Understanding Conference (MUC). Participants in MUC tested the ability of their systems to identify entities in text, to resolve coreference, to extract and populate attributes of entities, and to perform various other extraction tasks from written text. As identified by Shatkay and Feldman (2003), an important challenge in biomedical literature mining is the creation of gold standards and critical evaluation methods for systems developed in this very active field. A framework for evaluating biomedical literature mining systems was recently proposed by Hirschman, et al. (2002). According to Hirschman, et al. (2002), the following elements are needed for a successful evaluation: (1) challenging problem; (2) task definition; (3) training data; (4) test data; (5) evaluation methodology and implementation; (6) evaluator; (7) participants; and (8) funding. In addition to these elements, existing biomedical literature mining systems encounter issues of portability and scalability, and these issues need to be taken into consideration in the framework for evaluation.

Data Sources

In terms of the data sources from which target biomedical objects are extracted, most biomedical data mining systems focus on mining MEDLINE abstracts of the National Library of Medicine. The principal reason for relying on MEDLINE is related to complexity. Abstracts occasionally are easier to mine, since many papers contain less precise and less well-supported sections in the text that are difficult for machines to distinguish from more informative sections (Andrade & Bork, 2000). The current version of MEDLINE contains nearly 12 million abstracts stored on approximately 43GB of disk space. A prominent example of methods that target entire papers is still restricted to a small number of journals (Friedman et al., 2000; Krauthammer et al., 2002). The task of unraveling information about function from MEDLINE abstracts can be approached from two different viewpoints. One approach is based on computational techniques for understanding texts written in natural language, with lexical, syntactical, and semantic analysis. In addition to indexing terms in documents, natural language processing (NLP) methods extract and index higher-level semantic structures composed of terms and relationships between terms. However, this approach is confronted with the variability, fuzziness, and complexity of human language (Andrade & Bork, 2000). The Genies system (Friedman et al., 2000; Krauthammer et al., 2002), for automatically gathering and processing knowledge about molecular pathways, and the Information Finding from Biological Papers (IFBP) transcription factor database are natural language processing based systems.

An alternative approach that may be more relevant in practice is based on the treatment of text with statistical methods. In this approach, the possible relevance of words in a text is deduced from the comparison of the frequency of different words in this text with the frequency of the same words in reference sets of text. Some of the major methods using the statistical approach are AbXtract and the automatic pathway discovery tool of Ng and Wong (1999). There are advantages to each of these approaches (i.e., grammar or pattern matching). Generally, the less syntax that is used, the more domain-specific the system is. This allows the construction of a robust system relatively quickly, but many subtleties may be lost in the interpretation of sentences. Recently, the GENIA corpus has been used for extracting biomedical named entities (Collier et al., 2000; Yamamoto et al., 2003). The reason for the recent surge in the use of the GENIA corpus is that GENIA provides an annotated corpus that can be used for all areas of NLP and IE applied to the biomedical domain that employ supervised learning. With the explosion of results in molecular biology, there is an increased need for IE to extract knowledge to build databases and to search intelligently for information in online journal collections.

FUTURE TRENDS

With the taxonomy proposed here, we now identify the research trends in applying IE to mine biomedical literature:

1. A variety of biomedical objects and relations are to be extracted.
2. Rigorous studies are conducted to apply advanced IE techniques, such as Conditional Random Fields and maximum-entropy-based HMMs, to biomedical data.
3. Collaborative efforts are made to standardize the evaluation methods and procedures for biomedical literature mining.
4. The coverage of curated databases continues to broaden, and the size of the biomedical databases continues to grow.

CONCLUSION

The sheer size of the biomedical literature triggers an intensive pursuit of effective information extraction tools. To cope with this demand, biomedical literature mining emerges as an interdisciplinary field in which information extraction and machine learning are applied to the biomedical text corpus.

In this article, we approached biomedical literature mining from an IE perspective. We attempted to synthesize the research efforts made in this emerging field. In doing so, we showed how current information extraction can be used successfully to extract and organize information from the literature. We surveyed the prominent methods used for information extraction and demonstrated their applications in the context of biomedical literature mining.

The following four aspects were used in classifying the current works done in the field: (1) what to extract; (2) what techniques are used; (3) how to evaluate; and (4) what data sources are used. The taxonomy proposed in this article should help identify the recent trends and issues pertinent to biomedical literature mining.

REFERENCES

Andrade, M.A., & Bork, P. (2000). Automated extraction of information in molecular biology. FEBS Letters, 476, 12-7.
Blaschke, C., Andrade, M.A., Ouzounis, C., & Valencia, A. (1999). Automatic extraction of biological information from scientific text: Protein-protein interactions. Proceedings of the First International Conference on Intelligent Systems for Molecular Biology.
Bunescu, R., et al. (2004). Comparative experiments on learning information extractors for proteins and their interactions [to be published]. Journal of Artificial Intelligence in Medicine on Summarization and Information Extraction from Medical Documents.
Collier, N., Nobata, C., & Tsujii, J. (2000). Extracting the names of genes and gene products with a hidden Markov model. Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000).
De Bruijn, B., & Martin, J. (2002). Getting to the (c)ore of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67, 7-18.
Demetriou, G., & Gaizauskas, R. (2002). Utilizing text mining results: The PASTA Web system. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain.
Friedman, C., Kra, P., Yu, H., Krauthammer, M., & Rzhetsky, A. (2001). GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, S74-82.
Hahn, U., Romacker, M., & Schulz, S. (2002). Creating knowledge repositories from biomedical reports: The MEDSYNDIKATE text mining system. Proceedings of the Pacific Symposium on Biocomputing.
Hirschman, L., Park, J.C., Tsujii, J., Wong, L., & Wu, C.H. (2002). Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18(12), 1553-1561.
Jenssen, T.K., Laegreid, A., Komorowski, J., & Hovig, E. (2001). A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28(1), 21-8.
Krauthammer, M., Rzhetsky, A., Morozov, P., & Friedman, C. (2000). Using BLAST for identifying gene and protein names in journal articles. Gene, 259(1-2), 245-252.
Lee, K., Hwang, Y., & Rim, H. (2003). Two-phase biomedical NE recognition based on SVMs. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine.
Leek, T.R. (1997). Information extraction using hidden Markov models [master's thesis]. San Diego, CA: Department of Computer Science, University of California.
Narayanaswamy, M., Ravikumar, K.E., & Vijay-Shanker, K. (2003). A biological named entity recognizer. Proceedings of the Pacific Symposium on Biocomputing.
Ng, S.K., & Wong, M. (1999). Toward routine automatic pathway discovery from on-line scientific text abstracts. Proceedings of the Genome Informatics Series: Workshop on Genome Informatics.
Proux, D., Rechenmann, F., & Julliard, L. (2000). A pragmatic information extraction strategy for gathering data on genetic interactions. Proceedings of the International Conference on Intelligent Systems for Molecular Biology.
Pustejovsky, J., Castano, J., Zhang, J., Kotecki, M., & Cochran, B. (2002). Robust relational parsing over biomedical literature: Extracting inhibit relations. Pacific Symposium on Biocomputing (pp. 362-73).
Ray, S., & Craven, M. (2001). Representing sentence structure in hidden Markov models for information extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, Washington.
Shatkay, H., & Feldman, R. (2003). Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 10(6), 821-855.
Yakushiji, A., Tateisi, Y., Miyao, Y., & Tsujii, J. (2001). Event extraction from biomedical papers using a full parser. Proceedings of the Pacific Symposium on Biocomputing.

Yamamoto, K., Kudo, T., Konagaya, A., & Matsumoto, Y. (2003). Protein name tagging for biomedical annotation in text. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine.

KEY TERMS

F-Value: Combines recall and precision in a single efficiency measure (it is the harmonic mean of precision and recall): F = 2 * (recall * precision) / (recall + precision).
Hidden Markov Model (HMM): A statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters, based on this assumption.
Natural Language Processing (NLP): A subfield of artificial intelligence and linguistics. It studies the problems inherent in the processing and manipulation of natural language.
Part of Speech (POS): A classification of words according to how they are used in a sentence and the types of ideas they convey. Traditionally, the parts of speech are the noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection.
Precision: The ratio of the number of correctly filled slots to the total number of slots the system filled.
Recall: The ratio of the number of slots the system found correctly to the number of slots in the answer key.
Support Vector Machine (SVM): A learning machine that can perform binary classification (pattern recognition) as well as multi-category classification and real-valued function approximation (regression estimation) tasks.

Instance Selection
Huan Liu
Arizona State University, USA

Lei Yu
Arizona State University, USA

INTRODUCTION

The amounts of data have become increasingly large in recent years as the capacity of digital data storage worldwide has significantly increased. As the size of data grows, the demand for data reduction increases for effective data mining. Instance selection is one of the effective means to data reduction. This article introduces the basic concepts of instance selection and its context, necessity, and functionality. The article briefly reviews the state-of-the-art methods for instance selection.

Selection is a necessity in the world surrounding us. It stems from the sheer fact of limited resources, and data mining is no exception. Many factors give rise to data selection: Data is not purely collected for data mining or for one particular application; there are missing data, redundant data, and errors during collection and storage; and data can be too overwhelming to handle. Instance selection is one effective approach to data selection. This process entails choosing a subset of data to achieve the original purpose of a data-mining application. The ideal outcome is a model-independent, minimum sample of data that can accomplish the tasks with little or no performance deterioration.

BACKGROUND AND MOTIVATION

When we are able to gather as much data as we wish, a natural question is, "How do we efficiently use it to our advantage?" Raw data is rarely of direct use, and manual analysis simply cannot keep pace with the fast accumulation of massive data. Knowledge discovery and data mining (KDD), an emerging field comprising disciplines such as databases, statistics, and machine learning, comes to the rescue. KDD aims to turn raw data into nuggets and create special edges in this ever-competitive world for science discovery and business intelligence. The KDD process is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy, 1996). It includes data selection, preprocessing, data mining, interpretation, and evaluation. The first two processes (data selection and preprocessing) play a pivotal role in successful data mining (Han & Kamber, 2001). Facing the mounting challenges of enormous amounts of data, much of the current research concerns itself with scaling up data-mining algorithms (Provost & Kolluri, 1999). Researchers have also worked on scaling down the data, an alternative to scaling up the algorithms. The major issue of scaling down data is to select the relevant data and then present it to a data-mining algorithm. This line of work is parallel with the work on scaling up algorithms, and the combination of the two is a two-edged sword in mining nuggets from massive data.

In data mining, data is stored in a flat file and described by terms called attributes or features. Each line in the file consists of attribute values and forms an instance, also called a record, tuple, or data point, in a multidimensional space defined by the attributes. Data reduction can be achieved in many ways (Liu & Motoda, 1998; Blum & Langley, 1997; Liu & Motoda, 2001). By selecting features, we reduce the number of columns in a data set; by discretizing feature values, we reduce the number of possible values of features; and by selecting instances, we reduce the number of rows in a data set. We focus on instance selection here.

Instance selection reduces data and enables a data-mining algorithm to function and work effectively with huge data. The data can include almost everything related to a domain (recall that data is not solely collected for data mining), but one application normally involves using only one aspect of the domain. It is natural and sensible to focus on the relevant part of the data for the application so that the search is more focused and the mining is more efficient. Cleaning data before mining is often required. By selecting relevant instances, we can usually remove irrelevant, noisy, and redundant data. The resulting high-quality data will lead to high-quality results and reduced costs for data mining.

MAJOR LINES OF RESEARCH AND DEVELOPMENT

A spontaneous response to the challenge of instance selection is, without fail, some form of sampling. Although sampling is an important part of instance selection, other approaches do not rely on sampling but resort to search or take advantage of data-mining algorithms. In this section, we start with sampling methods and proceed to other instance-selection methods associated with data-mining tasks, such as classification and clustering.

Sampling Methods

Sampling methods are useful tools for instance selection (Gu, Hu, & Liu, 2001).

Simple random sampling is a method of selecting n instances out of the N such that every one of the $\binom{N}{n}$ distinct samples has an equal chance of being drawn. If an instance that has been drawn is removed from the data set for all subsequent draws, the method is called random sampling without replacement. Random sampling with replacement is entirely feasible: At any draw, all N instances of the data set have an equal chance of being drawn, no matter how often they have already been drawn.

Stratified random sampling divides the data set of N instances into subsets of N_1, N_2, ..., N_l instances, respectively. These subsets are nonoverlapping, and together they comprise the whole data set (i.e., N_1 + N_2 + ... + N_l = N). The subsets are called strata. When the strata have been determined, a sample is drawn from each stratum, the drawings being made independently in different strata. If a simple random sample is taken in each stratum, the whole procedure is described as stratified random sampling. It is often used in applications when one wishes to divide a heterogeneous data set into subsets, each of which is internally homogeneous.

Adaptive sampling refers to a sampling procedure that selects instances depending on results obtained from the sample. The primary purpose of adaptive sampling is to take advantage of data characteristics in order to obtain more precise estimates. It takes advantage of the result of preliminary mining for more effective sampling, and vice versa.

Selective sampling is another way of exploiting data characteristics to obtain more precise estimates in sampling. All instances are first divided into partitions according to some homogeneity criterion, and then random sampling is performed to select instances from each partition. Because instances in each partition are more similar to each other than to instances in other partitions, the resulting sample is more representative than a randomly generated one. Recent methods can be found in Liu, Motoda, and Yu (2002), in which samples selected from partitions based on data variance result in better performance than samples selected with random sampling.
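
Of the schemes above, stratified random sampling is simple to implement once a stratum assignment is available. The sketch below allocates the sample across strata in proportion to stratum size, which is one common choice and an assumption made for this example.

```python
import random
from collections import defaultdict

def stratified_sample(instances, stratum_of, n):
    """Draw about n instances, allocating the sample to each stratum in
    proportion to the stratum size and using simple random sampling
    without replacement inside each stratum."""
    strata = defaultdict(list)
    for inst in instances:
        strata[stratum_of(inst)].append(inst)
    total = len(instances)
    sample = []
    for members in strata.values():
        k = max(1, round(n * len(members) / total))   # proportional allocation
        sample.extend(random.sample(members, min(k, len(members))))
    return sample

# Example: stratify a labeled data set by class label.
data = [(x, "pos" if x % 3 == 0 else "neg") for x in range(300)]
subset = stratified_sample(data, stratum_of=lambda inst: inst[1], n=30)
```
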
Methods for Labeled Data

One key data-mining application is classification: predicting the class of an unseen instance. The data for this type of application is usually labeled with class values. Instance selection in the context of classification has been attempted by researchers according to the classifiers being built. In this section, we include five types of selected instances.

Critical points are the points that matter the most to a classifier. The issue originated from the learning method of Nearest Neighbor (NN) (Cover & Thomas, 1991). NN usually does not learn during the training phase. Only when it is required to classify a new sample does NN search the data to find the nearest neighbor for the new sample, using the class label of the nearest neighbor to predict the class label of the new sample. During this phase, NN can be very slow if the data are large, and it can be extremely sensitive to noise. Therefore, many suggestions have been made to keep only the critical points, so that noisy ones are removed and the data set is reduced. Examples can be found in Yu, Xu, Ester, and Kriegel (2001) and Zeng, Xing, and Zhou (2003), in which critical data points are selected to improve the performance of collaborative filtering.

Boundary points are the instances that lie on borders between classes. Support vector machines (SVMs) provide a principled way of finding these points through minimizing structural risk (Burges, 1998). Using a nonlinear function to map data points to a high-dimensional feature space, a nonlinearly separable data set becomes linearly separable. The data points on the boundaries, which maximize the margin band, are the support vectors. Support vectors are instances in the original data set and contain all the information a given classifier needs for constructing the decision function. Boundary points and critical points differ in the ways they are found.
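
When an SVM implementation is already at hand, the boundary points it identifies can be kept directly as the selected instances. The sketch below assumes scikit-learn as the toolkit (the article itself does not prescribe one); its support_ attribute holds the indices of the support vectors found during training.

```python
import numpy as np
from sklearn.svm import SVC

def boundary_points(X, y, kernel="rbf", C=1.0):
    """Fit an SVM and keep only the training instances that become
    support vectors, i.e., the points lying on or near the class boundary."""
    model = SVC(kernel=kernel, C=C).fit(X, y)
    idx = model.support_           # indices of the support vectors
    return X[idx], y[idx]

# Example with a small synthetic data set.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_sel, y_sel = boundary_points(X, y)
print(len(X_sel), "of", len(X), "instances kept")
```
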
Prototypes are representatives of groups of instances obtained via averaging (Chang, 1974). A prototype that represents the typicality of a class is used in characterizing a class rather than in describing the differences between classes. Therefore, prototypes are different from critical points or boundary points.

Tree-based sampling is a method involving decision trees (Quinlan, 1993), which are commonly used classification tools in data mining and machine learning.
Instance selection can be done via the decision tree built. Breiman and Friedman (1984) propose delegate sampling. The basic idea is to construct a decision tree such that instances at the leaves of the tree are approximately uniformly distributed. Delegate sampling then samples instances from the leaves in inverse proportion to the density at the leaf and assigns weights to the sampled points that are proportional to the leaf density.

In real-world applications, although large amounts of data are potentially available, the majority of the data are not labeled. Manually labeling the data is a labor-intensive and costly process. Researchers investigate whether experts can be asked to label only a small portion of the data that is most relevant to the task if labeling all the data is too expensive and time-consuming, a process that is called instance labeling. Usually an expert can be engaged to label a small portion of the selected data at various stages. So we wish to select as little data as possible at each stage and use an adaptive algorithm to guess what else should be selected for labeling in the next stage. Instance labeling is closely associated with adaptive sampling, clustering, and active learning.

Methods for Unlabeled Data

When data are unlabeled, methods for labeled data cannot be directly applied to instance selection. The widespread use of computers results in huge amounts of data stored without labels, for example, Web pages, transaction data, newspaper articles, and e-mail messages (Baeza-Yates & Ribeiro-Neto, 1999). Clustering is one approach to finding regularities from unlabeled data. We discuss three types of selected instances here.

Prototypes are pseudo data points generated from the formed clusters. The idea is that after the clusters are formed, one may keep just the prototypes of the clusters and discard the rest of the data points. The k-means clustering algorithm is a good example of this sort. Given a data set and a constant k, the k-means clustering algorithm partitions the data into k subsets such that instances in each subset are similar under some measure. The k means are iteratively updated until a stopping criterion is satisfied. The prototypes in this case are the k means.
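
A prototype-based reduction of this kind takes only a few lines once a k-means routine is available. The sketch below assumes scikit-learn's KMeans (the article does not name a particular implementation) and keeps one pseudo data point, the cluster centroid, per cluster, along with the number of instances it stands for.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_prototypes(X, k):
    """Replace the data set by k prototypes: the cluster centroids found by
    k-means, together with the number of instances each one represents."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    counts = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_, counts

X = np.random.rand(10000, 5)          # 10,000 unlabeled instances
prototypes, weights = kmeans_prototypes(X, k=50)
print(prototypes.shape)               # (50, 5): the retained pseudo points
```
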
Bradley, Fayyad, and Reina (1998) extend the k-means algorithm to perform clustering in one scan of the data. By keeping some points that defy compression plus some sufficient statistics, they demonstrate a scalable k-means algorithm. From the viewpoint of instance selection, prototypes plus sufficient statistics is a method of representing a cluster by using both defiant points and pseudo points that can be reconstructed from sufficient statistics, rather than keeping only the k means.

Squashed data are pseudo data points generated from the original data. In this aspect, they are similar to prototypes, as both may or may not be in the original data set. Squashed data points differ from prototypes in that each pseudo data point has a weight, and the sum of the weights is equal to the number of instances in the original data set. Presently, the two ways of obtaining squashed data are (a) model free (DuMouchel, Volinsky, Johnson, Cortes, & Pregibon, 1999) and (b) model dependent, or likelihood based (Madigan, Raghavan, DuMouchel, Nason, Posse, & Ridgeway, 2002).

FUTURE TRENDS

As shown in this article, instance selection has been studied and employed in various tasks, such as sampling, classification, and clustering. Each task is unique, as each has different information available and different requirements. Clearly, a universal model of instance selection is out of the question. This short article provides some starting points that can hopefully lead to more concerted study and development of new methods for instance selection. Instance selection deals with scaling down data. When we better understand instance selection, we will naturally investigate whether this work can be combined with other lines of research, such as algorithm scaling-up, feature selection, and feature construction, to overcome the problem of huge amounts of data. Integrating these different techniques to achieve the common goal, effective and efficient data mining, is a big challenge.

CONCLUSION

With the constraints imposed by computer memory and mining algorithms, we experience selection pressures more than ever. The central point of instance selection is approximation. Our task is to achieve mining results that are as good as possible by approximating the whole data with the selected instances and, hopefully, to do better in data mining with instance selection, as it is possible to remove noisy and irrelevant data in the process. In this short article, we have presented an initial attempt to review and categorize the methods of instance selection in terms of sampling, classification, and clustering.

REFERENCES

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Addison-Wesley and ACM Press.
Blum, A., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245-271.
Bradley, P., Fayyad, U., & Reina, C. (1998). Scaling clustering algorithms to large databases. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 9-15).
Breiman, L., & Friedman, J. (1984). Tools for large data set analysis. In E.J. Wegman & J.G. Smith (Eds.), Statistical signal processing. New York: M. Dekker.
Burges, C. (1998). A tutorial on support vector machines. Journal of Data Mining and Knowledge Discovery, 2, 121-167.
Chang, C. (1974). Finding prototypes for nearest neighbor classifiers. IEEE Transactions on Computers, C-23.
Cover, T.M., & Thomas, J.A. (1991). Elements of information theory. Wiley.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., & Pregibon, D. (1999). Squashing flat files flatter. Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining (pp. 6-15).
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). From data mining to knowledge discovery. Advances in Knowledge Discovery and Data Mining.
Gu, B., Hu, F., & Liu, H. (2001). Sampling: Knowing whole from its part. In H. Liu & H. Motoda (Eds.), Instance selection and construction for data mining. Boston: Kluwer Academic.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.
Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Boston: Kluwer Academic.
Liu, H., & Motoda, H. (Eds.). (2001). Instance selection and construction for data mining. Boston: Kluwer Academic.
Liu, H., Motoda, H., & Yu, L. (2002). Feature selection with selective sampling. Proceedings of the 19th International Conference on Machine Learning (pp. 395-402).
Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., & Ridgeway, G. (2002). Likelihood-based data squashing: A modeling approach to instance construction. Journal of Data Mining and Knowledge Discovery, 6(2), 173-190.
Provost, F., & Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Journal of Data Mining and Knowledge Discovery, 3, 131-169.
Quinlan, R.J. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Yu, K., Xu, X., Ester, M., & Kriegel, H. (2001). Selecting relevant instances for efficient and accurate collaborative filtering. Proceedings of the 10th International Conference on Information and Knowledge Management (pp. 239-246).
Zeng, C., Xing, C., & Zhou, L. (2003). Similarity measure and instance selection for collaborative filtering. Proceedings of the 12th International Conference on World Wide Web (pp. 652-658).

KEY TERMS

Classification: A process of predicting the classes of unseen instances based on patterns learned from available instances with predefined classes.
Clustering: A process of grouping instances into clusters so that instances are similar to one another within a cluster but dissimilar to instances in other clusters.
Data Mining: The application of analytical methods and tools to data for the purpose of discovering patterns, statistical or predictive models, and relationships among massive data.
Data Reduction: A process of removing irrelevant information from data by reducing the number of features, instances, or values of the data.
Instance: A vector of attribute values in a multidimensional space defined by the attributes; also called a record, tuple, or data point.
Instance Selection: A process of choosing a subset of data to achieve the original purpose of a data-mining application as if the whole data is used.
Sampling: A procedure that draws a sample, S_i, by a random process in which each S_i receives its appropriate probability, P_i, of being selected.

Integration of Data Sources through Data Mining

Andreas Koeller
Montclair State University, USA
INTRODUCTION

Integration of data sources refers to the task of developing a common schema as well as data transformation solutions for a number of data sources with related content. The large number and size of modern data sources make manual approaches to integration increasingly impractical. Data mining can help to partially or fully automate the data integration process.

BACKGROUND

Many fields of business and research show a tremendous need to integrate data from different sources. The process of data source integration has two major components.

Schema matching refers to the task of identifying related fields across two or more databases (Rahm & Bernstein, 2001). Complications arise at several levels, for example:

• Source databases can be organized by using several different models, such as the relational model, the object-oriented model, or semistructured models (e.g., XML).
• Information stored in a single table in one relational database can be stored in two or more tables in another. This problem is common when source databases show different levels of normalization and also occurs in nonrelational sources.
• A single field in one database, such as Name, could correspond to multiple fields, such as First Name and Last Name, in another.

Data transformation (sometimes called instance matching) is a second step in which data in matching fields must be translated into a common format. Frequent reasons for mismatched data include data format (such as 1.6.2004 vs. 6/1/2004), numeric precision (3.5kg vs. 3.51kg), abbreviations (Corp. vs. Corporation), or linguistic differences (e.g., using different synonyms for the same concept across databases).

Today's databases are large both in the number of records stored and in the number of fields (dimensions) for each datum object. Database integration or migration projects often deal with hundreds of tables and thousands of fields (Dasu, Johnson, Muthukrishnan, & Shkapenyuk, 2002), with some tables having 100 or more fields and/or hundreds of thousands of rows. Methods of improving the efficiency of integration projects, which still rely mostly on manual work (Kang & Naughton, 2003), are critical for the success of this important task.

MAIN THRUST

In this article, I explore the application of data-mining methods to the integration of data sources. Although data transformation tasks can sometimes be performed through data mining, such techniques are most useful in the context of schema matching. Therefore, the following discussion focuses on the use of data mining in schema matching, mentioning data transformation where appropriate.

Schema-Matching Approaches

Two classes of schema-matching solutions exist: schema-only-based matching and instance-based matching (Rahm & Bernstein, 2001).

Schema-only-based matching identifies related database fields by taking only the schema of input databases into account. The matching occurs through linguistic means or through constraint matching. Linguistic matching compares field names, finds similarities in field descriptions (if available), and attempts to match field names to names in a given hierarchy of terms (ontology). Constraint matching matches fields based on their domains (data types) or their key properties (primary key, foreign key). In both approaches, the data in the sources are ignored in making decisions on matching. Important projects implementing this approach include ARTEMIS (Castano, de Antonellis, & de Capitani di Vemercati, 2001) and Microsoft's CUPID (Madhavan, Bernstein, & Rahm, 2001).

Instance-based matching takes properties of the data into account as well. A very simple approach is to conclude that two fields are related if their minimum and maximum values and/or their average values are equal or similar.
More sophisticated approaches consider the distribution of values in fields. A strong indicator of a relation between fields is a complete inclusion of the data of one field in another. I take a closer look at this pattern in the following section. Important instance-based matching projects are SemInt (Li & Clifton, 2000) and LSD (Doan, Domingos, & Halevy, 2001).
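A rough sketch of the simple, statistics-driven idea above is shown below. It compares summary statistics (minimum, maximum, mean) of numeric columns from two tables and flags candidate field pairs. The tolerance threshold, the pandas-based representation, and the toy tables are assumptions made for this example; they are not taken from any of the cited systems.

```python
# Sketch: flag candidate field matches when min/max/mean of two numeric columns agree.
# Thresholds and data are illustrative; real matchers (e.g., SemInt, LSD) are far richer.
import pandas as pd

def summary(col: pd.Series) -> tuple:
    return (col.min(), col.max(), col.mean())

def close(a: float, b: float, tol: float = 0.05) -> bool:
    scale = max(abs(a), abs(b), 1e-9)
    return abs(a - b) / scale <= tol

def candidate_matches(t1: pd.DataFrame, t2: pd.DataFrame, tol: float = 0.05):
    pairs = []
    for c1 in t1.select_dtypes("number").columns:
        for c2 in t2.select_dtypes("number").columns:
            s1, s2 = summary(t1[c1]), summary(t2[c2])
            if all(close(a, b, tol) for a, b in zip(s1, s2)):
                pairs.append((c1, c2))
    return pairs

# Example with made-up tables
orders = pd.DataFrame({"price": [9.9, 20.0, 15.5], "qty": [1, 2, 3]})
invoices = pd.DataFrame({"amount": [10.0, 19.8, 15.6], "items": [1, 2, 3]})
print(candidate_matches(orders, invoices))   # e.g., [('price', 'amount'), ('qty', 'items')]
```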
Some projects explore a combined approach, in which both schema-level and instance-level matching is performed. Halevy and Madhavan (2003) present a Corpus-based schema matcher. It attempts to perform schema matching by incorporating known schemas and previous matching results and to improve the matching result by taking such historical information into account.

Data-mining approaches are most useful in the context of instance-based matching. However, some mining-related techniques, such as graph matching, are employed in schema-only-based matching as well.

Instance-Based Matching through Inclusion Dependency Mining

An inclusion dependency is a pattern between two databases, stating that the values in a field (or set of fields) in one database form a subset of the values in some field (or set of fields) in another database. Such subsets are relevant to data integration for two reasons. First, fields that stand in an inclusion dependency to one another might represent related data. Second, knowledge of foreign keys is essential in successful schema matching. Because a foreign key is necessarily a subset of the corresponding key in another table, foreign keys can be discovered through inclusion dependency discovery.

The discovery of inclusion dependencies is a very complex process. In fact, the problem is in general NP-hard as a function of the number of fields in the largest inclusion dependency between two tables. However, a number of practical algorithms have been published.

De Marchi, Lopes, and Petit (2002) present an algorithm that adopts the idea of levelwise discovery used in the famous Apriori algorithm for association rule mining. Inclusion dependencies are discovered by first comparing single fields with one another and then combining matches into pairs of fields, continuing the process through triples, then 4-sets of fields, and so on. However, due to the exponential growth in the number of inclusion dependencies in larger tables, this approach does not scale beyond inclusion dependencies with a size of about eight fields.

A more recent algorithm (Koeller & Rundensteiner, 2003) takes a graph-theoretic approach. It avoids enumerating all inclusion dependencies between two tables and finds candidates for only the largest inclusion dependencies by mapping the discovery problem to a problem of discovering patterns (specifically cliques) in graphs. This approach is able to discover inclusion dependencies with several dozens of attributes in tables with tens of thousands of rows. Both algorithms rely on the antimonotonic property of the inclusion dependency discovery problem. This property is also used in association rule mining and states that patterns of size k can only exist in the solution of the problem if certain patterns of sizes smaller than k exist as well. Therefore, it is meaningful to first discover small patterns (e.g., single-attribute inclusion dependencies) and use this information to restrict the search space for larger patterns.
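The sketch below illustrates the levelwise idea in its simplest form: it first finds all single-attribute (unary) inclusion dependencies between two tables and then uses them to prune the candidate pairs tested at the next level, exploiting the antimonotonic property described above. It is a toy illustration, not the De Marchi et al. or Koeller and Rundensteiner algorithms; the table representation and the restriction to two levels are assumptions made for brevity.

```python
# Toy levelwise discovery of inclusion dependencies between two tables (dicts of columns).
# Level 1: unary INDs (column subset tests); Level 2: pairs built only from valid unary INDs.
from itertools import permutations

def unary_inds(src: dict, tgt: dict):
    """Return (a, b) where the set of values of src[a] is contained in tgt[b]."""
    return [(a, b) for a in src for b in tgt
            if set(src[a]) <= set(tgt[b])]

def binary_inds(src: dict, tgt: dict, unary):
    """Combine unary INDs into candidate 2-attribute INDs and verify them on tuples."""
    results = []
    for (a1, b1), (a2, b2) in permutations(unary, 2):
        if a1 == a2 or b1 == b2:
            continue  # attributes must be distinct on both sides
        lhs = set(zip(src[a1], src[a2]))
        rhs = set(zip(tgt[b1], tgt[b2]))
        if lhs <= rhs:
            results.append(((a1, a2), (b1, b2)))
    return results

orders = {"cust": [1, 2, 2], "city": ["NY", "LA", "LA"]}
customers = {"id": [1, 2, 3], "location": ["NY", "LA", "SF"], "age": [30, 40, 50]}

u = unary_inds(orders, customers)        # [('cust', 'id'), ('city', 'location')]
print(u)
print(binary_inds(orders, customers, u))  # [(('cust', 'city'), ('id', 'location')), ...]
```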
Instance-Based Matching in the Presence of Data Mismatches

Inclusion dependency discovery captures only part of the problem of schema matching, because only exact matches are found. If attributes across two relations are not exact subsets of each other (e.g., due to entry errors, data mismatches requiring data transformation, or partially overlapping data sets), it becomes more difficult to perform data-driven, mining-based discovery. Both false negatives and false positives are possible. For example, matching fields might not be discovered due to different encoding schemes (e.g., use of a numeric identifier in one table, where text is used to denote the same values in another table). On the other hand, purely data-driven discovery relies on the assumption that semantically related values are also syntactically equal. Consequently, fields that are discovered by a mining algorithm to be matching might not be semantically related.

Data Mining by Using Database Statistics

The problem of false negatives in mining for schema matching can be addressed by more sophisticated mining approaches. If it is known which attributes across two relations relate to one another, data transformation solutions can be used. However, automatic discovery of matching attributes is also possible, usually through the evaluation of statistical patterns in the data sources. In the classification of Kang and Naughton (2003), interpreted matching uses artificial intelligence techniques, such as Bayesian classification or neural networks, to establish hypotheses about related attributes. In the uninterpreted matching approach, statistical features, such as the unique value count of an attribute or its frequency distribution, are taken into consideration. The underlying assumption is that two attributes showing a similar distribution of unique values might be related even though the actual data values are not equal or similar.

Another approach for detecting a semantic relationship between attributes is to use information entropy measures. In this approach, the concept of mutual information, which is based on entropy and conditional entropy of the underlying attributes, measures the reduction in uncertainty of one attribute due to the knowledge of the other attribute (Kang & Naughton, 2003, p. 207).
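The snippet below is a small illustration of this statistics-based idea: it estimates the mutual information between two columns from their joint value counts. The direct use of raw symbols and the toy data are assumptions made for the example; it is not the procedure of Kang and Naughton (2003).

```python
# Sketch: mutual information I(A;B) between two attributes from co-occurrence counts.
# I(A;B) = sum over (a,b) of p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
from collections import Counter
from math import log2

def mutual_information(col_a, col_b):
    n = len(col_a)
    p_ab = Counter(zip(col_a, col_b))
    p_a, p_b = Counter(col_a), Counter(col_b)
    mi = 0.0
    for (a, b), c in p_ab.items():
        pab = c / n
        mi += pab * log2(pab / ((p_a[a] / n) * (p_b[b] / n)))
    return mi

# Toy columns: the second column is just a renaming of the first, so I(A;B) is high.
dept     = ["sales", "sales", "hr", "it", "hr", "it"]
code     = ["S", "S", "H", "I", "H", "I"]
unrelated = ["x", "y", "x", "y", "x", "y"]

print(round(mutual_information(dept, code), 3))       # ~1.585 bits (fully dependent)
print(round(mutual_information(dept, unrelated), 3))  # much lower
```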
Further Problems in Information Integration through Data Mining

In addition to the approaches mentioned previously, several other data-mining and machine-learning approaches, in particular classification and rule-mining techniques, are used to solve special problems in information integration.

For example, a common problem occurring in real-world integration projects is related to duplicate records across two databases, which must be identified. This problem is usually referred to as the record-linking problem or the merge/purge problem (Hernández & Stolfo, 1998). Similar statistical techniques as the ones described previously are used to approach this problem. The Commonwealth Scientific and Industrial Research Organisation (2003) gives an overview of approaches, which include decision models, predictive models such as support vector machines, and a Bayesian decision cost model.

In a similar context, data-mining and machine-learning solutions are used to improve the data quality of existing databases as well. This important process is sometimes called data scrubbing. Lübbers, Grimmer, and Jarke (2003) present a study of the use of such techniques and refer to the use of data mining in data quality improvement as data auditing.

Recently, the emergence of Web services standards such as XML, SOAP, and UDDI promises to open opportunities for database integration. Hansen, Madnick, and Siegel (2002) argue that Web services help to overcome some of the technical difficulties of data integration, which mostly stem from the fact that traditional databases are not built with integration in mind. On the other hand, Web services by design standardize data exchange protocols and mechanisms. However, the problem of identifying semantically related databases and achieving schema matching remains.

FUTURE TRENDS

Increasing amounts of data are being collected at all levels of business, industry, and science. Integration of data also becomes more and more important as businesses merge and research projects increasingly require interdisciplinary efforts. Evidence for the need for solutions in this area is provided by the multitude of partial software solutions for such business applications as ETL (Pervasive Software, Inc., 2003), and by the increasing number of integration projects in the life sciences, such as GenBank by the National Center for Biotechnology Information (NCBI) or Gramene by the Cold Spring Harbor Laboratory and Cornell University. Currently, the integration of data sources is a daunting task, requiring substantial human resources. If automatic methods for schema matching were more readily available, data integration projects could be completed much faster and could incorporate many more databases than is currently the case.

Furthermore, an emerging trend in data source integration is the move from batch-style integration, where a set of given data sources is integrated at one time into one system, to real-time integration, where data sources are immediately added to an integration system as they become available. Solutions to this new challenge can also benefit tremendously from semiautomatic or automatic methods of identifying database structure and relationships.

CONCLUSION

Information integration is an important and difficult task for businesses and research institutions. Although data sources can be integrated with each other by manual means, this approach is not very efficient and does not scale to the current requirements. Thousands of databases with the potential for integration exist in every field of business and research, and many of those databases have a prohibitively high number of fields and/or records to make manual integration feasible. Semiautomatic or automatic approaches to integration are needed.

Data mining provides very useful tools for automatic data integration. Mining algorithms are used to identify schema elements in unknown source databases, to relate those elements to each other, and to perform additional tasks, such as data transformation.
Essential business tasks such as extraction, transformation, and loading (ETL) and data integration and migration in general become more feasible when automatic methods are used.

Although the underlying algorithmic problems are difficult and often show exponential complexity, several interesting solutions to the schema-matching and data transformation problems in integration have been proposed. This is an active area of research, and more comprehensive and beneficial applications of data mining to integration are likely to emerge in the near future.

REFERENCES

Castano, S., de Antonellis, V., & de Capitani di Vemercati, S. (2001). Global viewing of heterogeneous data sources. IEEE Transactions on Knowledge and Data Engineering, 13(2), 277-297.

Commonwealth Scientific and Industrial Research Organisation. (2003, April). Record linkage: Current practice and future directions (CMIS Tech. Rep. No. 03/83). Canberra, Australia: L. Gu, R. Baxter, D. Vickers, & C. Rainsford. Retrieved July 22, 2004, from http://www.act.cmis.csiro.au/rohanb/PAPERS/record_linkage.pdf

Dasu, T., Johnson, T., Muthukrishnan, S., & Shkapenyuk, V. (2002). Mining database structure; or, how to build a data quality browser. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, USA (pp. 240-251).

de Marchi, F., Lopes, S., & Petit, J.-M. (2002). Efficient algorithms for mining inclusion dependencies. Proceedings of the Eighth International Conference on Extending Database Technology, Prague, Czech Republic, 2287 (pp. 464-476).

Doan, A. H., Domingos, P., & Halevy, A. Y. (2001). Reconciling schemas of disparate data sources: A machine-learning approach. Proceedings of the ACM SIGMOD International Conference on Management of Data, USA (pp. 509-520).

Halevy, A. Y., & Madhavan, J. (2003). Corpus-based knowledge representation. Proceedings of the 18th International Joint Conference on Artificial Intelligence, Mexico (pp. 1567-1572).

Hernández, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 2(1), 9-37.

Kang, J., & Naughton, J. F. (2003). On schema matching with opaque column names and data values. Proceedings of the ACM SIGMOD International Conference on Management of Data, USA (pp. 205-216).

Koeller, A., & Rundensteiner, E. A. (2003). Discovery of high-dimensional inclusion dependencies. Proceedings of the 19th IEEE International Conference on Data Engineering, India (pp. 683-685).

Li, W., & Clifton, C. (2000). SemInt: A tool for identifying attribute correspondences in heterogeneous databases using neural networks. Journal of Data and Knowledge Engineering, 33(1), 49-84.

Lübbers, D., Grimmer, U., & Jarke, M. (2003). Systematic development of data mining-based data quality tools. Proceedings of the 29th International Conference on Very Large Databases, Germany (pp. 548-559).

Madhavan, J., Bernstein, P. A., & Rahm, E. (2001). Generic schema matching with CUPID. Proceedings of the 27th International Conference on Very Large Databases, Italy (pp. 49-58).

Massachusetts Institute of Technology, Sloan School of Management. (2002, May). Data integration using Web services (Working Paper 4406-02). Cambridge, MA: M. Hansen, S. Madnick, & M. Siegel. Retrieved July 22, 2004, from http://hdl.handle.net/1721.1/1822

Pervasive Software, Inc. (2003). ETL: The secret weapon in data warehousing and business intelligence [Whitepaper]. Austin, TX: Pervasive Software.

Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. VLDB Journal, 10(4), 334-350.

KEY TERMS

Antimonotonic: A property of some pattern-finding problems stating that patterns of size k can only exist if certain patterns with sizes smaller than k exist in the same dataset. This property is used in levelwise algorithms, such as the Apriori algorithm used for association rule mining or some algorithms for inclusion dependency mining.

Database Schema: A set of names and conditions that describe the structure of a database. For example, in a relational database, the schema includes elements such as table names, field names, field data types, primary key constraints, or foreign key constraints.
Domain: The set of permitted values for a field in a database, defined during database design. The actual data in a field are a subset of the field's domain.

Extraction, Transformation, and Loading (ETL): Describes the three essential steps in the process of data source integration: extracting data and schema from the sources, transforming it into a common format, and loading the data into an integration database.

Foreign Key: A key is a field or set of fields in a relational database table that has unique values, that is, no duplicates. A field or set of fields whose values form a subset of the values in the key of another table is called a foreign key. Foreign keys express relationships between fields of different tables.

Inclusion Dependency: A pattern between two databases, stating that the values in a field (or set of fields) in one database form a subset of the values in some field (or set of fields) in another database.

Levelwise Discovery: A class of data-mining algorithms that discovers patterns of a certain size by first discovering patterns of size 1, then using information from that step to discover patterns of size 2, and so on. A well-known example of a levelwise algorithm is the Apriori algorithm used to mine association rules.

Merge/Purge: The process of identifying duplicate records during the integration of data sources. Related data sources often contain overlapping information extents, which have to be reconciled to improve the quality of an integrated database.

Relational Database: A database that stores data in tables, which are sets of tuples (rows). A set of corresponding values across all rows of a table is called an attribute, field, or column.

Schema Matching: The process of identifying an appropriate mapping from the schema of an input data source to the schema of an integrated database.
Intelligence Density
David Sundaram
The University of Auckland, New Zealand

Victor Portougal
The University of Auckland, New Zealand

INTRODUCTION

The amount of information that decision makers have to process has been increasing at a tremendous pace. A few years ago it was suggested that information in the world was doubling every 16 months. The very volume has prevented this information from being used effectively. Another problem that compounds the situation is the fact that the information is neither easily accessible nor available in an integrated manner. This has led to the oft-quoted comment that though computers have promised a fount of wisdom, they have swamped us with a flood of data. Decision Support Systems (DSS) and related decision support tools like data warehousing and data mining have been used to glean actionable information and nuggets from this flood of data.

[Figure 1. Steps for increasing intelligence density (Dhar & Stein, 1997): a pyramid of processing steps rising from Data at the base to Knowledge at the top]

BACKGROUND

Dhar and Stein (1997) define Intelligence Density (ID) as the amount of useful decision support information that a decision maker gets from using a system for a certain amount of time. Alternately, ID can be defined as the amount of time taken to get the essence of the underlying data from the output. This is done using the utility concept, initially developed in decision theory and game theory (Lapin & Whisler, 2002). Numerical utility values, referred to as utilities (sometimes called utiles), express the true worth of information. These values are obtained by constructing a special utility function. Thus intelligence density can be defined more formally as follows:

Intelligence Density = Utilities of decision-making power gleaned (quality) / Units of analytic time spent by the decision maker
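As a purely illustrative reading of this ratio (the numbers below are invented, not from Dhar and Stein), consider two routes to the same analytical insight:

```python
# Hypothetical comparison: same decision-making utility gleaned, different analyst time.
utility_gleaned = 40.0          # utiles of decision-making power (assumed)

time_raw_reports = 8.0          # hours digging through raw reports (assumed)
time_olap_tool   = 0.5          # hours with an OLAP front end (assumed)

id_raw  = utility_gleaned / time_raw_reports   # 5 utiles per hour
id_olap = utility_gleaned / time_olap_tool     # 80 utiles per hour

print(id_raw, id_olap)  # the OLAP route delivers much higher intelligence density
```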

Increasing the intelligence density of its data enables an organization to be more effective, productive, and flexible. Key processes that allow one to increase the ID of data are illustrated in Figure 1. Mechanisms that will allow us to access different types of data need to be in place first. Once we have access to the data, we need to have the ability to scrub or cleanse the data of errors. After scrubbing the data, we need to have tools and technologies that will allow us to integrate data in a flexible manner. This integration should support not only data of different formats but also data that are not of the same type.

Enterprise Systems/Enterprise Resource Planning (ERP) systems with their integrated databases have provided a clean and integrated view of a large amount of information within the organization, thus supporting the lower levels of the intelligence density pyramid (Figure 2). But even in the biggest and best organizations with massive investments in ERP systems, we still find the need for data warehouses and OLAP, even though they predominantly support the lower levels of the intelligence density pyramid. Once we have an integrated view of the data, we can use data mining and other decision support tools to transform the data and discover patterns and nuggets of information from the data.

MAIN THRUST
Three key technologies that can be leveraged to overcome the problems associated with information of low intelligence density are Data Warehousing (DW), Online Analytical Processing (OLAP), and Data Mining (DM). These technologies have had a significant impact on the design and implementation of DSS. A generic decision support architecture that incorporates these technologies is illustrated in Figure 3. This architecture highlights the complementary nature of data warehousing, OLAP, and data mining. The data warehouse and its related components support the lower end of the intelligence density pyramid by providing tools and technologies that allow one to extract, load, cleanse, convert, and transform the raw data available in an organisation into a form that then allows the decision maker to apply OLAP and data mining tools with ease. The OLAP and data mining tools in turn support the middle and upper levels of the intelligence density pyramid. In the following paragraphs we look at each of these technologies with a particular focus on their ability to increase the intelligence density of data.

[Figure 2. ERP and DSS support for increasing intelligence density (adapted from Shafiei and Sundaram, 2004): the intelligence density pyramid with the steps Access, Scrub, Integrate, Transform, Discover, and Learn rising from Data to Knowledge, the ERP system supporting the lower levels and DSS the upper levels]

Data Warehousing

Data warehouses are fundamental to most information system architectures and are even more crucial in DSS architectures. A Data Warehouse is not a DSS, but a Data Warehouse provides data that is integrated, subject-oriented, time-variant, and non-volatile in a bid to support decision making (Inmon, 2002). There are a number of processes that need to be undertaken before data can enter a Data Warehouse or be analysed using OLAP or data mining tools. Most Data Warehouses reside on relational DBMS like ORACLE, Microsoft SQL Server, or DB2. The data from which the Data Warehouses are built can exist on varied hardware and software platforms. The data quite often also needs to be extracted from a number of different sources from within as well as without the organization. This requires the resolution of many data integration issues such as homonyms and synonyms. The key steps that need to be undertaken to transform raw data to a form that can be stored in a Data Warehouse for analysis are:

• The extraction and loading of the data into the Data Warehouse environment from a number of systems on a periodic basis
• Conversion of the data into a format that is appropriate to the Data Warehouse
• Cleansing of the data to remove inconsistencies, inappropriate values, errors, etc.
• Integration of the different data sets into a form that matches the data model of the Data Warehouse
• Transformation of the data through operations such as summarisation, aggregation, and creation of derived attributes

Once all these steps have been completed, the data is ready for further processing. While one could use different programs/packages to accomplish the various steps listed above, they could also be conducted within a single environment. For example, Microsoft SQL Server (2004) provides the Data Transformation Services by which raw data from organisational data stores can be loaded, cleansed, converted, integrated, aggregated, summarized, and transformed in a variety of ways.
nologies that allow one to extract, load, cleanse, con-
vert, and transform the raw data available in an
organisation into a form that then allows the decision
maker to apply OLAP and data mining tools with ease. Figure 3. DSS architecture incorporating data
The OLAP and data mining tools in turn support the warehouses, OLAP, and data mining (Adapted from
middle and upper levels of the intelligence density Srinivasan et al., 2000)
pyramid. In the following paragraphs we look at each of
these technologies with a particular focus on their abil-
ity to increase the intelligence density of data.

Data Warehousing

Data warehouses are fundamental to most information


system architectures and are even more crucial in DSS
architectures. A Data Warehouse is not a DSS, but a Data
Warehouse provides data that is integrated, subject-
oriented, time-variant, and non-volatile in a bid to
support decision making (Inmon, 2002). There are a
number of processes that needs to be undertaken before
data can enter a Data Warehouse or be analysed using
OLAP or Data Mining Tools. Most Data Warehouses
reside on relational DBMS like ORACLE, Microsoft
SQL Server, or DB2. The data from which the Data
Warehouses are built can exist on varied hardware and
software platforms. The data quite often also needs to be
extracted from a number of different sources from
within as well as without the organization. This requires
the resolution of many data integration issues such as

631

TEAM LinG
Intelligence Density

OLAP

OLAP can be defined as the creation, analysis, ad hoc querying, and management of multidimensional data (Thomsen, 2002). Predominantly, the focus of most OLAP systems is on the analysis and ad hoc querying of the multidimensional data. Data warehousing systems are usually responsible for the creation and management of the multidimensional data. A superficial understanding might suggest that there does not seem to be much of a difference between data warehouses and OLAP. This is due to the fact that both are complementary technologies with the aim of increasing the intelligence density of data. OLAP is a logical extension to the data warehouse. OLAP and related technologies focus on providing support for the analytical, modelling, and computational requirements of decision makers. While OLAP systems provide a medium level of analysis capabilities, most of the current crop of OLAP systems do not provide the sophisticated modeling or analysis functionalities of data mining, mathematical programming, or simulation systems.

Data Mining

Data mining can be defined as the process of identifying valid, novel, useful, and understandable patterns in data through automatic or semiautomatic means (Berry & Linoff, 1997). Data mining borrows techniques that originated from diverse fields such as computer science, statistics, and artificial intelligence. Data mining is now being used in a range of industries and for a range of tasks in a variety of contexts (Wang, 2003). The complexity of the field of data mining makes it worthwhile to structure it into goals, tasks, methods, algorithms, and algorithm implementations. The goals of data mining drive the tasks that need to be undertaken, and the tasks drive the methods that will be applied. The methods that will be applied drive the selection of algorithms, followed by the choice of algorithm implementations.

The goals of data mining are description, prediction, and/or verification. Description-oriented tasks include clustering, summarisation, deviation detection, and visualization. Prediction-oriented tasks include classification and regression. Statistical analysis techniques are predominantly used for verification. Methods or techniques to carry out these tasks are many; chief among them are neural networks, rule induction, market basket, cluster detection, link, and statistical analysis. Each method may have several supporting algorithms, and in turn each algorithm may be implemented in a different manner. Data mining tools such as Clementine (SPSS, 2004) not only support the discovery of nuggets but also support the entire intelligence density pyramid by providing a sophisticated visual interactive environment.

FUTURE TRENDS

There are two key trends that are evident in the commercial as well as the research realm. The first is the complementary use of various decision support tools such as data warehousing, OLAP, and data mining in a synergistic fashion leading to information of high intelligence density. Another subtle but vital trend is the ubiquitous inclusion of data warehousing, OLAP, and data mining in most information technology architectural landscapes. This is especially true of DSS architectures.

CONCLUSION

In this chapter we first defined intelligence density and the need for decision support tools that would provide intelligence of a high density. We then introduced three emergent technologies integral in the design and implementation of DSS architectures whose prime purpose is to increase the intelligence density of data. We introduced and described data warehousing, OLAP, and data mining briefly from the perspective of their ability to increase the intelligence density of data. We also proposed a generic decision support architecture that complementarily uses data warehousing, OLAP, and data mining.

REFERENCES

Berry, M. J. A., & Linoff, G. (1997). Data mining techniques: For marketing, sales, and customer support. John Wiley & Sons Inc.

Berson, A., & Smith, S. J. (1997). Data warehousing, data mining, & OLAP. McGraw-Hill.

Dhar, V., & Stein, R. (1997). Intelligent decision support methods: The science of knowledge work. Prentice Hall.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Inmon, W. H. (2002). Building the data warehouse. John Wiley & Sons.

Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modeling. John Wiley & Sons.

Lapin, L., & Whisler, W. D. (2002). Quantitative decision making with spreadsheet applications. Belmont, CA: Duxbury/Thomson Learning.
Microsoft. (2004). Microsoft SQL Server. Retrieved from http://www.microsoft.com/

Shafiei, F., & Sundaram, D. (2004, January 5-8). Multi-enterprise collaborative enterprise resource planning and decision support systems. Thirty-Seventh Hawaii International Conference on System Sciences (CD/ROM).

SPSS. (2004). Clementine. Retrieved from http://www.spss.com

Srinivasan, A., Sundaram, D., & Davis, J. (2000). Implementing decision support systems. McGraw Hill.

Thomsen, E. (2002). OLAP solutions: Building multidimensional information systems (2nd ed.). New York; Chichester, UK: Wiley.

Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. John Wiley & Sons.

KEY TERMS

Data Mining: Can be defined as the process of identifying valid, novel, useful, and understandable patterns in data through automatic or semiautomatic means, ultimately leading to an increase of the intelligence density of the raw input data.

Data Warehouses: Provide data that are integrated, subject-oriented, time-variant, and non-volatile, thereby increasing the intelligence density of the raw input data.

Decision Support Systems/Tools: In a wider sense, can be defined as systems/tools that affect the way people make decisions. In our present context, they can be defined as systems that increase the intelligence density of data.

Enterprise Resource Planning/Enterprise Systems: Integrated information systems that support most of the business processes and information system requirements in an organization.

Intelligence Density: The useful decision support information that a decision maker gets from using a system for a certain amount of time, or alternately the amount of time taken to get the essence of the underlying data from the output.

Online Analytical Processing (OLAP): Enables the creation, management, analysis, and ad hoc querying of multidimensional data, thereby increasing the intelligence density of the data already available in data warehouses.

Utilities (Utiles): Numerical utility values expressing the true worth of information. These values are obtained by constructing a special utility function.
Intelligent Data Analysis


Xiaohui Liu
Brunel University, UK

INTRODUCTION

Intelligent Data Analysis (IDA) is an interdisciplinary study concerned with the effective analysis of data. IDA draws the techniques from diverse fields, including artificial intelligence, databases, high-performance computing, pattern recognition, and statistics. These fields often complement each other (e.g., many statistical methods, particularly those for large data sets, rely on computation, but brute computing power is no substitute for statistical knowledge) (Berthold & Hand, 2003; Liu, 1999).

BACKGROUND

The job of a data analyst typically involves problem formulation, advice on data collection (though it is not uncommon for the analyst to be asked to analyze data that have already been collected), effective data analysis, and interpretation and report of the findings. Data analysis is about the extraction of useful information from data and is often performed by an iterative process in which exploratory analysis and confirmatory analysis are the two principal components.

Exploratory data analysis, or data exploration, resembles the job of a detective; that is, understanding evidence collected, looking for clues, applying relevant background knowledge, and pursuing and checking the possibilities that clues suggest.

Data exploration is not only useful for data understanding but also helpful in generating possibly interesting hypotheses for a later study, normally a more formal or confirmatory procedure for analyzing data. Such procedures often assume a potential model structure for the data and may involve estimating the model parameters and testing hypotheses about the model.

Over the last 15 years, we have witnessed two phenomena that have affected the work of modern data analysts more than any others. First, the size and variety of machine-readable data sets have increased dramatically, and the problem of data explosion has become apparent. Second, recent developments in computing have provided the basic infrastructure for fast data access as well as many advanced computational methods for extracting information from large quantities of data. These developments have created a new range of problems and challenges for data analysts as well as new opportunities for intelligent systems in data analysis, and have led to the emergence of the field of Intelligent Data Analysis (IDA), which draws the techniques from diverse fields, including artificial intelligence (AI), databases, high-performance computing, pattern recognition, and statistics. What distinguishes IDA is that it brings together often complementary methods from these diverse disciplines to solve challenging problems with which any individual discipline would find it difficult to cope, and to explore the most appropriate strategies and practices for complex data analysis.

MAIN THRUST

In this paper, we will explore the main disciplines and associated techniques as well as applications to help clarify the meaning of intelligent data analysis, followed by a discussion of several key issues.

Statistics and Computing: Key Disciplines

IDA has its origins in many disciplines, principally statistics and computing. For many years, statisticians have studied the science of data analysis and have laid many of the important foundations. Many of the analysis methods and principles were established long before computers were born. Given that statistics is often regarded as a branch of mathematics, there has been an emphasis on mathematical rigor, a desire to establish that something is sensible on theoretical grounds before trying it out on practical problems (Berthold & Hand, 2003). On the other hand, the computing community, particularly in machine learning (Mitchell, 1997) and data mining (Wang, 2003), is much more willing to try something out (e.g., designing new algorithms) to see how they perform on real-world datasets, without worrying too much about the theory behind it.

Statistics is probably the oldest ancestor of IDA, but what kind of contributions has computing made to the subject? These may be classified into three categories.
First, the basic computing infrastructure has been put in place during the last decade or so, which enables large-scale data analysis (e.g., advances in data warehousing and online analytic processing, computer networks, and desktop technologies have made it possible to easily organize and move the data around for the analysis purpose). The modern computing processing power also has made it possible to efficiently implement some of the very computationally intensive analysis methods such as statistical resampling, visualizations, large-scale simulation and neural networks, and stochastic search and optimization methods.

Second, there has been much work on extending traditional statistical and operational research methods to handle challenging problems arising from modern data sets. For example, in Bayesian networks (Ramoni et al., 2002), where the work is based on Bayesian statistics, one tries to make the ideas work on large-scale practical problems by making appropriate assumptions and developing computationally efficient algorithms; in support vector machines (Cristianini & Shawe-Taylor, 2000), one tries to see how the statistical learning theory (Vapnik, 1998) could be utilized to handle very high-dimensional datasets in linear feature spaces; and in evolutionary computation (Eiben & Michalewicz, 1999), one tries to extend the traditional operational research search and optimization methods.

Third, new kinds of IDA algorithms have been proposed to respond to new challenges. Here are several examples of the novel methods with distinctive computing characteristics: powerful three-dimensional virtual reality visualization systems that allow gigabytes of data to be visualized interactively by teams of scientists in different parts of the world (Cruz-Neira, 2003); parallel and distributed algorithms for different data analysis tasks (Zaki & Pan, 2002); so-called any-time analysis algorithms that are designed for real-time tasks, where the system, if stopped at any time from its starting point, would be able to give some satisfactory (not optimal) solution (of course, the more time it has, the better the solution would be); inductive logic programming, which extends the deductive power of classic logic programming methods to induce structures from data (Mooney, 2004); association rule learning algorithms, which were motivated by the need in the retail industry where customers tend to buy related items (Nijssen & Kok, 2001); and work in inductive databases, which attempts to supply users with queries involving inductive capabilities (De Raedt, 2002). Of course, this list is not meant to be exhaustive, but it gives some ideas about the kind of IDA work going on within the computing community.

IDA Applications

Data analysis is performed for a variety of reasons by scientists, engineers, business communities, medical and government researchers, and so forth. The increasing size and variety of data as well as new exciting applications such as bioinformatics and e-science have called for new ways of analyzing the data. Therefore, it is a very difficult task to have a sensible summary of the type of IDA applications that are possible. The following is a partial list.

• Bioinformatics: A huge amount of data has been generated by genome-sequencing projects and other experimental efforts to determine the structures and functions of biological molecules and to understand the evolution of life (Orengo et al., 2003). One of the most significant developments in bioinformatics is the use of high-throughput devices such as DNA microarray technology to study the activities of thousands of genes in a single experiment and to provide a global view of the underlying biological process by revealing, for example, which genes are responsible for a disease process, how they interact and are regulated, and which genes are being co-expressed and participate in common biological pathways. Major IDA challenges in this area include the analysis of very high dimensional but small sample microarray data, the integration of a variety of data for constructing biological networks and pathways, and the handling of very noisy microarray image data.
• Medicine and Healthcare: With the increasing development of electronic patient records and medical information systems, a large amount of clinical data is available online. Regularities, trends, and surprising events extracted from these data by IDA methods are important in assisting clinicians to make informed decisions, thereby improving health services (Bellazzi et al., 2001). Examples of such applications include the development of novel methods to analyze time-stamped data in order to assess the progression of disease, autonomous agents for monitoring and diagnosing intensive care patients, and intelligent systems for screening early signs of glaucoma. It is worth noting that research in bioinformatics can have significant impact on the understanding of disease and consequently better therapeutics and treatments. For example, it has been found using DNA microarray technology that the current taxonomy of cancer in certain cases appears to group together molecularly distinct diseases with distinct clinical phenotypes, suggesting the discovery of subgroups of cancer (Alizadeh et al., 2000).
as well as the formation and evolution of their systems. Important progress has been made, but
astrophysical components (i.e., galaxies, quasars, further work is needed urgently to come up with
and clusters). In chemical engineering, mathemati- practical and effective methods for managing differ-
cal models have been used to describe interactions ent kinds of data quality problems in large databases.
among various chemical processes occurring in- Scalability: Currently, technical reports analyzing
side a plant. These models are typically very large really big data are still sketchy. Analysis of big,
systems of nonlinear algebraic or differential equa- opportunistic data (i.e., data collected for an unre-
tions. Challenges for IDA in this area include the lated purpose) is beset with many statistical pit-
development of scalable, approximate, parallel, or falls. Much research has been done to develop
distributed algorithms for large-scale applications. efficient, heuristic, parallel, and distributed algo-
Business and Finance: There is a wide range of rithms that are able to scale well. We will be eager
successful business applications reported, although to see more practical experience shared when ana-
the retrieval of technical details is not always easy, lyzing large, complex, real-world datasets in order
perhaps for obvious reasons. These applications to obtain a deep understanding of the IDA process.
include fraud detection, customer retention, cross Mapping Methods to Applications: Given that there
selling, marketing, and insurance. Fraud is costing are so many methods developed in different com-
industries billions of pounds, so it is not surprising munities for essentially the same task (e.g., classi-
to see that systems have been developed to combat fication), what are the important factors in choos-
fraudulent activities in such areas as credit card, ing the most appropriate method(s) for a given
health care, stock market dealing, or finance in gen- application? The most commonly used criterion is
eral. Interesting challenges for IDA include timely the prediction accuracy. However, it is not always
integration of information from different resources the only, or even the most important, criterion for
and the analysis of local patterns that represent evaluating competing methods. Credit scoring is
deviations from a background model (Hand et al., 2002). one of the most quoted applications where
misclassification cost is more important than pre-
IDA Key Issues dictive accuracy. Other important factors in decid-
ing that one method is preferable to another in-
In responding to challenges of analyzing complex data clude computational efficiency and interpretability
from a variety of applications, particularly emerging ones of methods.
such as bioinformatics and e-science, the following issues Human-Computer Collaboration: Data analysis is
are receiving increasing attention in addition to the develop- often an iterative, complex process in which both
ment of novel algorithms to solve new emerging problems. analyst and computer play an important part. An
interesting issue is how one can have an effective
Strategies: There is a strategic aspect to data analy- analysis environment where the computer will per-
sis beyond the tactical choice of this or that test, form complex and laborious operations and pro-
visualization or variable. Analysts often bring exog- vide essential assistance, while the analyst is al-
enous knowledge about data to bear when they lowed to focus on the more creative part of the data
decide how to analyze it. The question of how data analysis using knowledge and experience.
analysis may be carried out effectively should lead
us to having a close look not only at those individual
components in the data analysis process but also at FUTURE TRENDS
the process as a whole, asking what would consti-
tute a sensible analysis strategy. The strategy should There is strong evidence that IDA will continue to gen-
describe the steps, decisions, and actions that are erate a lot of interest in both academic and industrial
taken during the process of analyzing data to build communities, given the number of related conferences,
a model or answer a question. journals, working groups, books, and successful case
Data Quality: Real-world data contain errors and are studies already in existence. It is almost inconceivable
incomplete and inconsistent. It is commonly ac- that this topic will fade in the foreseeable future, since
cepted that data cleaning is one of the most difficult there are so many important and challenging real-world
and most costly tasks in large-scale data analysis problems that demand solutions from this area, and there
and often consumes most of project resources. Re- are still so many unanswered questions. The debate on
search on data quality has attracted a significant what constitutes intelligent or unintelligent data analy-
amount of attention from different communities sis will carry on for a while.
and includes statistics, computing, and information

636

TEAM LinG
Intelligent Data Analysis

More analysis tools and methods inevitably will ap- Berthold, M., & Hand, D.J. (Eds). (2003). Intelligent data
pear, but help in their proper use will not be fast enough. analysis: An introduction. Springer-Verlag. I
A tool can be used without an essential understanding of
what it can offer and how the results should be inter- Cartwright, H. (Ed). (2000). Intelligent data analysis in
preted, despite the best intentions of the user. Research science. Oxford University Press.
will be directed toward the development of more helpful Cristianini, N., & Shawe-Taylor, J. (2000). An introduction
middle-ground tools, those that are less generic than to support vector machines. Cambridge University Press
current data analysis software tools but more general than specialized data analysis applications.
Much of the current work in the area is empirical in nature, and we are still in the process of accumulating more experience in analyzing large, complex data. A lot of heuristics and trial and error have been used in exploring and analyzing these data, especially the data collected opportunistically. As time goes by, we will see more theoretical work that attempts to establish a sounder foundation for the analysts of the future.

CONCLUSION

Statistical methods have been the primary analysis tool, but many new computing developments have been applied to the analysis of large and challenging real-world datasets. Intelligent data analysis requires careful thinking at every stage of an analysis process, assessment and selection of the most appropriate approaches for the analysis tasks in hand, and intelligent application of relevant domain knowledge. This is an area with enormous potential, as it seeks to answer the following key questions. How can one perform data analysis most effectively (intelligently) to gain new scientific insights, to capture bigger portions of the market, to improve the quality of life, and so forth? What are the guiding principles that enable one to do so? How can one reduce the chance of performing unintelligent data analysis? Modern datasets are getting larger and more complex, but the number of trained data analysts is certainly not keeping up at any rate. This poses a significant challenge for the IDA and other related communities such as statistics, data mining, machine learning, and pattern recognition. The quest for bridging this gap and for crucial insights into the process of intelligent data analysis will require an interdisciplinary effort from all these disciplines.

KEY TERMS

Bioinformatics: The development and application of computational and mathematical methods for organizing, analyzing, and interpreting biological data.
E-Science: The large-scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.

Intelligent Data Analysis: An interdisciplinary study concerned with the effective analysis of data, which draws techniques from diverse fields including AI, databases, high-performance computing, pattern recognition, and statistics.

Machine Learning: A study of how computers can be used to automatically acquire new knowledge from past cases or experience, or from the computer's own experiences.

Noisy Data: Real-world data often contain errors due to the nature of data collection, measurement, or sensing procedures. They can be incomplete, inaccurate, out-of-date, or inconsistent.

Support Vector Machines (SVM): Learning machines that can perform difficult classification and regression estimation tasks. SVMs non-linearly map their n-dimensional input space into a high-dimensional feature space, in which a linear classifier is constructed.

Visualization: Visualization tools graphically display data in order to facilitate better understanding of their meaning. Graphical capabilities range from simple scatter plots to three-dimensional virtual reality systems.
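The SVM entry above can be made concrete with a small, purely illustrative sketch (not part of the original entry): scikit-learn's SVC with an RBF kernel implicitly performs the non-linear mapping into a high-dimensional feature space, in which a linear separator is then fitted. The dataset and parameter values below are arbitrary assumptions.

```python
# Hypothetical illustration of the SVM key term: kernelized SVC in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic n-dimensional input data (10 features, 2 classes).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel supplies the implicit non-linear feature map;
# a linear classifier is constructed in that feature space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```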
Intelligent Query Answering
Zbigniew W. Ras
University of North Carolina, Charlotte, USA

Agnieszka Dardzinska
Bialystok Technical University, Poland

INTRODUCTION

One way to make a query answering system (QAS) intelligent is to assume a hierarchical structure of its attributes. Such systems have been investigated by Cuppens & Demolombe (1988), Gal & Minker (1988), and Gaasterland et al. (1992), and they are called cooperative. Any attribute value listed in a query submitted to a cooperative QAS is seen as a node of the tree representing that attribute. If QAS retrieves an empty set of objects matching query q in a target information system S, then any attribute value listed in q can be generalized, and in the same way the number of objects that can possibly match q in S can increase. In cooperative systems, these generalizations are usually controlled by users.
Another way to make QAS intelligent is to use knowledge discovery methods to increase the number of queries which QAS can answer: the knowledge discovery module of QAS extracts rules from a local system S and requests their extraction from remote sites (if the system is distributed). These rules are used to construct new attributes and/or impute null or hidden values of attributes in S. By enlarging the set of attributes from which queries are built and by making information systems less incomplete, we not only increase the number of queries which QAS can handle but also increase the number of retrieved objects.
So, QAS based on knowledge discovery has two classical scenarios that need to be considered:

• In a standalone and incomplete system, association rules are extracted from that system and used to predict what values should replace null values before queries are answered.
• When the system is distributed with autonomous sites and a user needs to retrieve objects from one of these sites (called the client) satisfying query q based on attributes which are not local for that site, we search for definitions of these non-local attributes at remote sites and use them to approximate q (Ras, 2002; Ras & Joshi, 1997; Ras & Dardzinska, 2004).

The goal of this article is to provide foundations and basic results for knowledge discovery-based QAS.

BACKGROUND

The modern query answering systems area of research is related to enhancements of query answering systems into intelligent systems. The emphasis is on problems in users posing queries and systems producing answers. This becomes more and more relevant as the amount of information available from local or distributed information sources increases. We need systems that are not only easy to use but also intelligent in answering the user's needs. A query answering system often replaces a human with expertise in the domain of interest; thus it is important, from the user's point of view, to compare the system and the human expert as alternative means for accessing information.
Knowledge systems are defined as information systems coupled with a knowledge base, simplified in Ras (2002), Ras and Joshi (1997), and Ras and Dardzinska (1997) to a set of rules treated as definitions of attribute values. If the information system is distributed with autonomous sites, these rules can be extracted either from the information system which is seen as local (the query was submitted to that system) or from remote sites. Domains of attributes in the local information system S and the set of decision values used in rules from the knowledge base associated with S form the initial alphabet for the local query answering system. When the knowledge base associated with S is updated (new rules are added or some deleted), the alphabet for the local query answering system is automatically changed. In this paper we assume that knowledge bases for all sites are initially empty. A collaborative information system (Ras, 2002) learns rules describing values of incomplete attributes and attributes classified as foreign for its site, called a client. These rules can be extracted at any site, but their condition part should use, if possible, only terms that can be processed by the query-answering system associated with the client. When
the time progresses, more and more rules can be added to the local knowledge base, which means that some attribute values (decision parts of rules) foreign for the client are also added to its local alphabet. The choice of which site should be contacted first, in search for definitions of foreign attribute values, is mainly based on the number of attribute values common for the client and server sites. The solution to this problem is given in Ras (2002).

MAIN THRUST

The technology dimension will be explored to help clarify the meaning of intelligent query answering based on knowledge discovery and chase.

Intelligent Query Answering for Standalone Information Systems

QAS for an information system is concerned with identifying all objects in the system satisfying a given description. For example, an information system might contain information about students in a class and classify them using the four attributes hair color, eye color, gender, and size. A simple query might be to find all students with brown hair and blue eyes. When an information system is incomplete, students having brown hair and unknown eye color can be handled by either including or excluding them from the answer to the query. In the first case we talk about an optimistic approach to query evaluation, while in the second case we talk about a pessimistic approach. Another option to handle such a query would be to discover rules for eye color in terms of the attributes hair color, gender, and size. These rules could then be applied to students with unknown eye color to generate values that could be used in answering the query. Consider that in our example one of the generated rules said:

(hair, brown) ∧ (size, medium) → (eye, brown).

Thus, if one of the students having brown hair and medium size has no value for eye color, then the query answering system should not include this student in the list of students with brown hair and blue eyes. Attributes hair color and size are classification attributes and eye color is the decision attribute.
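As an illustration of this strategy, the following sketch (not the authors' implementation; the records and the rule are hypothetical) imputes a missing decision-attribute value with a mined rule before the query is evaluated:

```python
# Minimal sketch: use a rule learned from complete records to fill a null value
# of the decision attribute "eye" before answering the query.
students = [
    {"name": "Ann", "hair": "brown", "size": "medium", "eye": None},
    {"name": "Bob", "hair": "brown", "size": "tall",   "eye": "blue"},
]

# Rule mined from the complete part of the system:
# (hair, brown) and (size, medium) -> (eye, brown)
rules = [({"hair": "brown", "size": "medium"}, ("eye", "brown"))]

def chase_once(record, rules):
    """Fill a null decision value when some rule's condition part matches."""
    for condition, (attr, value) in rules:
        if record.get(attr) is None and all(record.get(a) == v for a, v in condition.items()):
            record[attr] = value
    return record

completed = [chase_once(dict(s), rules) for s in students]

# The query "brown hair and blue eyes" now correctly excludes Ann,
# whose unknown eye colour was imputed as brown by the rule.
answer = [s["name"] for s in completed if s["hair"] == "brown" and s["eye"] == "blue"]
print(answer)  # ['Bob']
```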
We are also interested in how to use this strategy to build intelligent QAS for incomplete information systems. If a query is submitted to information system S, the first step of QAS is to make S as complete as possible. The approach proposed in Dardzinska & Ras (2003b) is to use not only functional dependencies to chase S (Atzeni & DeAntonellis, 1992) but also rules discovered from a complete subsystem of S to do the chasing.
In the first step, intelligent QAS identifies all incomplete attributes used in a query. An attribute is incomplete in S if there is an object in S with incomplete information on this attribute. The values of all incomplete attributes are treated as concepts to be learned (in the form of rules) from S.
Incomplete information in S is then replaced by new data provided by the Chase algorithm based on these rules. When the process of removing incomplete values in the local information system is completed, QAS finds the answer to the query in the usual way.

Intelligent Query Answering for Distributed Autonomous Information Systems

Semantic inconsistencies are due to different interpretations of attributes and their values among sites (for instance, one site can interpret the concept "young" differently than other sites). Different interpretations are also due to the way each site handles null values. Null value replacement by values suggested either by statistical or knowledge discovery methods is quite common before a user query is processed by QAS.
An ontology (Guarino, 1998; Sowa, 1999, 2000; Van Heijst et al., 1997) is a set of terms of a particular information domain and the relationships among them. Currently, there is a great deal of interest in the development of ontologies to facilitate knowledge sharing among information systems.
Ontologies and inter-ontology relationships between them are created by experts in the corresponding domain, but they can also represent a particular point of view of the global information system by describing customized domains. To allow intelligent query processing, it is often assumed that an information system is coupled with some ontology. Inter-ontology relationships can be seen as semantical bridges between the ontologies built for each of the autonomous information systems so that they can collaborate and understand each other.
In Ras and Dardzinska (2004), the notion of optimal rough semantics and the method of its construction have been proposed. Rough semantics can be used to model semantic inconsistencies among sites due to different interpretations of incomplete values of attributes. Distributed chase (Ras & Dardzinska, 2004) is a chase-type algorithm, driven by a client site of a distributed information system (DIS), which is similar to the chase algorithms based on knowledge discovery presented in
Dardzinska & Ras (2003a, 2003b). Distributed chase has one extra feature in comparison to other chase-type algorithms: the dynamic creation of knowledge bases at all sites of DIS involved in the process of solving a query submitted to the client site of DIS.
The knowledge base at the client site may contain rules extracted from the client information system and also rules extracted from information systems at remote sites in DIS. These rules are dynamically updated through the incomplete values replacement process (Ras & Dardzinska, 2004).
Although the names of attributes are often the same among sites, their semantics and granularity levels may differ from site to site. As a result of these differences, the knowledge bases at the client site and at remote sites have to satisfy certain properties in order to be applicable in a distributed chase.
So, assume that system S = (X, A, V), which is a part of DIS, is queried by a user.
The Chase algorithm, to be applicable to S, has to be based on rules from the knowledge base D associated with S, which satisfies the following conditions:

1. The attribute value used in the decision part of a rule from D has a granularity level either equal to or finer than the granularity level of the corresponding attribute in S.
2. The granularity level of any attribute used in the classification part of a rule from D is either equal to or softer than the granularity level of the corresponding attribute in S.
3. The attribute used in the decision part of a rule from D either does not belong to A or is incomplete in S.

Assume again that S = (X, A, V) is an information system (Pawlak, 1991; Ras & Dardzinska, 2004), where X is a set of objects, A is a set of attributes (seen as partial functions from X into 2^(V × [0,1])), and V is a set of values of attributes from A. By [0,1] we mean the set of real numbers from 0 to 1. Let L(D) = {(t → vc) ∈ D : c ∈ In(A)} be the set of all rules (called a knowledge base) extracted initially from the information system S by ERID (Dardzinska & Ras, 2003c), where In(A) is the set of incomplete attributes in S.
Assume now that query q(B) is submitted to system S = (X, A, V), where B is the set of all attributes used in q(B), and that A ∩ B ≠ ∅. All attributes in B − [A ∩ B] are called foreign for S. If S is a part of a distributed information system, definitions of foreign attributes for S can be extracted at its remote sites (Ras, 2002). Clearly, all semantic inconsistencies and differences in granularity of attribute values among sites have to be resolved first. In Ras & Dardzinska (2004) only different granularity of attribute values and different semantics related to different interpretations of incomplete attribute values among sites have been considered.
In Ras (2002), it was shown that query q(B) can be processed at site S by discovering definitions of values of attributes from B − [A ∩ B] at the remote sites for S and next using them to answer q(B).
Foreign attributes for S in B can also be seen as attributes entirely incomplete in S, which means values (either exact or partially incomplete) of such attributes should be ascribed by chase to all objects in S before query q(B) is answered. The question remains whether the values discovered by chase are really correct.
The classical approach to this kind of problem is to build a simple DIS environment (mainly to avoid difficulties related to different granularity and different semantics of attributes at different sites). As the testing data set, Ras & Dardzinska (2005) have taken 10,000 tuples randomly selected from a database of some insurance company. This sample table, containing 100 attributes, was randomly partitioned into four subtables of equal size containing 2,500 tuples each. Next, from each of these subtables 40 attributes (columns) have been randomly removed, leaving four data tables of size 2,500 × 60 each. One of these tables was called a client and the remaining three have been called servers. Now, for all objects at the client site, the values of one of the attributes, chosen randomly, have been hidden. This attribute is denoted by d. At each server site, if attribute d was listed in its domain schema, descriptions of d have been learned using See5 software (the data are complete, so it was not necessary to use ERID). All these descriptions, in the form of rules, have been stored in the knowledge base of the client. Distributed Chase was applied to predict the real value of the hidden attribute for each object x at the client site. A threshold value of 0.125 was used to rule out all values predicted by distributed Chase with confidence below that threshold. Almost all hidden values (2,476 out of 2,500) have been discovered correctly.

Distributed Chase and the Security Problem of Hidden Attributes

Assume now that an information system S = (X, A, V) is a part of DIS and attribute b ∈ A has to be hidden. For that purpose, we construct Sb = (X, A, V) to replace S, where:

1. aS(x) = aSb(x), for any a ∈ A − {b}, x ∈ X,
2. bSb(x) is undefined, for any x ∈ X,
3. bS(x) ∈ Vb.
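A simplified sketch of the prediction step described above is given below. It is not the code used in the experiment; the rules, attribute names, and data are hypothetical, and only the idea of keeping a prediction when its confidence reaches the 0.125 threshold is reproduced.

```python
# Sketch: rules learned at remote (server) sites carry a confidence; the hidden
# client attribute d is predicted only if the best applicable rule's confidence
# reaches the threshold used in the experiment (0.125).
THRESHOLD = 0.125

# (condition, predicted value of d, confidence) -- hypothetical server-site rules
server_rules = [
    ({"a1": "x"}, "high", 0.80),
    ({"a2": "y"}, "low",  0.10),   # below the threshold, effectively ignored
]

def predict_hidden(record, rules, threshold=THRESHOLD):
    # Collect predictions from every rule whose condition part matches the object.
    candidates = [(value, conf) for cond, value, conf in rules
                  if all(record.get(a) == v for a, v in cond.items())]
    if not candidates:
        return None
    value, conf = max(candidates, key=lambda c: c[1])
    return value if conf >= threshold else None

client_object = {"a1": "x", "a2": "y"}          # the value of d is hidden here
print(predict_hidden(client_object, server_rules))  # 'high'
```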
Users are allowed to submit queries to Sb and not to S. What about the information system Chase(Sb)? How does it differ from S?
If bS(x) = bChase(Sb)(x), where x ∈ X, then values of additional attributes for object x have to be hidden in Sb to guarantee that the value bS(x) cannot be reconstructed by Chase. In Ras and Dardzinska (2005) it was shown how to identify the minimal number of such values.

FUTURE TRENDS

One of the main problems related to the semantics of an incomplete information system S is the freedom in how new values are constructed to replace incomplete values in S before any rule extraction process begins. This replacement of incomplete attribute values in some of the slots in S can be done either by chase and/or by a number of available statistical methods (Giudici, 2003). This implies that the semantics of queries submitted to S and driven (defined) by a query answering system QAS based on chase may often differ. Although rough semantics can be used by QAS to handle this problem, we still have to look for new alternative methods.
Assuming different semantics of attributes among sites in DIS, the use of a global ontology or local ontologies built jointly with inter-ontology relationships among them seems to be necessary for solving queries in DIS using knowledge discovery and chase. Still, a lot of research has to be done in this area.

CONCLUSION

Assume that the client site in DIS is represented by a partially incomplete information system S. When a query is submitted to S, its query answering system QAS will replace S by Chase(S) and next will solve the query using, for instance, the strategy proposed in Ras & Joshi (1997). Rules used by Chase can be extracted from S or from its remote sites in DIS, assuming that all differences in semantics of attributes and differences in granularity levels of attributes are resolved first. One can ask why the resulting information system obtained by Chase cannot be stored aside and reused when a new query is submitted to S. If system S is not frequently updated, we can do that by keeping a copy of Chase(S) and reusing that copy when a new query is submitted to S. But the original information system S still has to be kept, so that when a user wants to enter new data into S, they can be stored in the original system. System Chase(S), if stored aside, cannot be reused by QAS when the number of updates in the original S exceeds a given threshold value. It means that the new, updated information system S has to be chased again before any query is answered by QAS.

REFERENCES

Atzeni, P., & DeAntonellis, V. (1992). Relational database theory. The Benjamin Cummings Publishing Company.

Cuppens, F., & Demolombe, R. (1988). Cooperative answering: A methodology to provide intelligent access to databases. In Proceedings of the Second International Conference on Expert Database Systems (pp. 333-353).

Dardzinska, A., & Ras, Z.W. (2003a). Rule-based Chase algorithm for partially incomplete information systems. In Proceedings of the Second International Workshop on Active Mining, Maebashi City, Japan (pp. 42-51).

Dardzinska, A., & Ras, Z.W. (2003b). Chasing unknown values in incomplete information systems. In Proceedings of the ICDM'03 Workshop on Foundations and New Directions of Data Mining, Melbourne, Florida (pp. 24-30). IEEE Computer Society.

Dardzinska, A., & Ras, Z.W. (2003c). On rule discovery from incomplete information systems. In Proceedings of the ICDM'03 Workshop on Foundations and New Directions of Data Mining, Melbourne, Florida (pp. 31-35). IEEE Computer Society.

Gaasterland, T., Godfrey, P., & Minker, J. (1992). Relaxation as a platform for cooperative answering. Journal of Intelligent Information Systems, 1(3), 293-321.

Gal, A., & Minker, J. (1988). Informative and cooperative answers in databases using integrity constraints. In Natural language understanding and logic programming (pp. 288-300). North Holland.

Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. West Sussex, UK: Wiley.

Guarino, N. (Ed.). (1998). Formal ontology in information systems. Amsterdam: IOS Press.

Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer.

Ras, Z. (2002). Reducts-driven query answering for distributed knowledge systems. International Journal of Intelligent Systems, 17(2), 113-124.
Ras, Z., & Dardzinska, A. (2004). Ontology based distributed autonomous knowledge systems. Information Systems International Journal, 29(1), 47-58.

Ras, Z., & Dardzinska, A. (2005). Data security and null value imputation in distributed information systems. In Advances in Soft Computing, Proceedings of the MSRAS'04 Symposium (pp. 133-146). Poland: Springer-Verlag.

Ras, Z., & Joshi, S. (1997). Query approximate answering system for an incomplete DKBS. Fundamenta Informaticae, 30(3), 313-324.

Sowa, J.F. (1999). Ontological categories. In L. Albertazzi (Ed.), Shapes of forms: From Gestalt psychology and phenomenology to ontology and mathematics (pp. 307-340). Kluwer.

Sowa, J.F. (2000). Knowledge representation: Logical, philosophical, and computational foundations. Pacific Grove, CA: Brooks/Cole Publishing.

Van Heijst, G., Schreiber, A., & Wielinga, B. (1997). Using explicit ontologies in KBS development. International Journal of Human and Computer Studies, 46(2/3), 183-292.

KEY TERMS

Autonomous Information System: An information system existing as an independent entity.

Chase: A kind of recursive strategy applied to a database V, based on functional dependencies or rules extracted from V, by which a null value or an incomplete value in V is replaced by a new, more complete value.

Distributed Chase: A kind of recursive strategy applied to a database V, based on functional dependencies or rules extracted both from V and from other autonomous databases, by which a null value or an incomplete value in V is replaced by a new, more complete value. Any differences in semantics among attributes in the involved databases have to be resolved first.

Intelligent Query Answering: Enhancements of query answering systems into a sort of intelligent systems (capable of being adapted or molded). Such systems should be able to interpret incorrectly posed questions and compose an answer not necessarily reflecting precisely what is directly referred to by the question, but rather reflecting what the intermediary understands to be the intention linked with the question.

Knowledge Base: A collection of rules defined as expressions written in predicate calculus. These rules have the form of associations between conjuncts of values of attributes.

Ontology: An explicit formal specification of how to represent objects, concepts, and other entities that are assumed to exist in some area of interest, and the relationships holding among them. Systems that share the same ontology are able to communicate about a domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with the definitions in the ontology.

Query Semantics: The meaning of a query with an information system as its domain of interpretation. Application of knowledge discovery and Chase in query evaluation makes semantics operational.

Semantics: The meaning of expressions written in some language, as opposed to their syntax, which describes how symbols may be combined independently of their meaning.
Interactive Visual Data Mining


Shouhong Wang
University of Massachusetts Dartmouth, USA

Hai Wang
Saint Mary's University, Canada

INTRODUCTION

In the data mining field, people have no doubt that high-level information (or knowledge) can be extracted from the database through the use of algorithms. However, a one-shot knowledge deduction is based on the assumption that the model developer knows the structure of the knowledge to be deduced. This assumption may not be valid in general. Hence, a general proposition for data mining is that, without human-computer interaction, any knowledge discovery algorithm (or program) will fail to meet the needs of a data miner who has a novel goal (Wang, S. & Wang, H., 2002). Recently, interactive visual data mining techniques have opened new avenues in the data mining field (Chen, Zhu, & Chen, 2001; de Oliveira & Levkowitz, 2003; Han, Hu & Cercone, 2003; Shneiderman, 2002; Yang, 2003).
Interactive visual data mining differs from traditional data mining, standalone knowledge deduction algorithms, and one-way data visualization in many ways. Briefly, interactive visual data mining is human centered, and is implemented through knowledge discovery loops coupled with human-computer interaction and visual representations. Interactive visual data mining attempts to extract unsuspected and potentially useful patterns from the data for data miners with novel goals, rather than to use the data to derive certain information based on an a priori human knowledge structure.

BACKGROUND

A single generic knowledge deduction algorithm is insufficient to handle the variety of goals of data mining, since a goal of data mining is often related to its specific problem domain. In fact, knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns of data mining (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). By this definition, two aspects of knowledge discovery are important to meaningful data mining. First, the criteria of validity, novelty, and usefulness of the knowledge to be discovered could be subjective. That is, the usefulness of a data pattern depends on the data miner and does not solely depend on the statistical strength of the pattern. Second, heuristic search in combinatorial spaces built on computer and human interaction is useful for effective knowledge discovery. One strategy for effective knowledge discovery is the use of human-computer collaboration.
One technique used for human-computer collaboration in the business information systems field is data visualization (Bell, 1991; Montazemi & Wang, 1988), which is particularly relevant to data mining (Keim & Kriegel, 1996; Wang, 2002). From the human side of data visualization, graphics cognition and problem solving are the two major concepts of data visualization. It is a commonly accepted principle that visual perception is compounded out of processes in a way which is adaptive to the visual presentation and the particular problem to be solved (Kosslyn, 1980; Newell & Simon, 1972).

MAIN THRUST

Major components of interactive visual data mining and their functions that make data mining more effective are the current research theme in this field. Wang, S. and Wang, H. (2002) have developed a model of interactive visual data mining for human-computer collaboration knowledge discovery. According to this model, an interactive visual data mining system has three components on the computer side, besides the database: data visualization instrument, data and model assembly, and human-computer interface.

Data Visualization Instrument

Data visualization instruments are tools for presenting data in human understandable graphics, images, or animation. While there have been many techniques for data visualization, such as various statistical charts with colors and animations, the self-organizing map (SOM) method based on the Kohonen neural network (Kohonen, 1989) has become one of the promising techniques of data visualization in data mining. SOM is a dynamic system that can learn the topological relations and abstract
structures in the high-dimensional input vectors using a low-dimensional space for representation. These low-dimensional presentations can be viewed and interpreted by humans in discovering knowledge (Wang, 2000).
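A toy sketch of such a map is given below, written with NumPy only; the grid size, learning-rate schedule, and data are illustrative assumptions rather than details taken from the article.

```python
# Toy self-organizing map (SOM): project 4-dimensional observations onto a
# 10 x 10 grid whose nodes learn the topology of the input data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 4))            # 500 observations, 4 attributes
grid_h, grid_w = 10, 10                # 2-D output map
weights = rng.random((grid_h, grid_w, 4))

n_iter, lr0, radius0 = 2000, 0.5, 5.0
for t in range(n_iter):
    x = data[rng.integers(len(data))]
    # best-matching unit: the node whose weight vector is closest to x
    dist = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dist), dist.shape)
    # shrink the learning rate and neighbourhood radius over time
    lr = lr0 * np.exp(-t / n_iter)
    radius = radius0 * np.exp(-t / n_iter)
    rows, cols = np.indices((grid_h, grid_w))
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2 * radius ** 2))
    weights += lr * influence[..., None] * (x - weights)

# Each observation can now be plotted at its best-matching unit, giving a
# low-dimensional picture of the clusters hidden in the 4-dimensional data.
```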
Data and Model Assembly

The data and model assembly is a set of query functions that assemble the data and the data visualization instruments for data mining. Query tools are characterized by structured query language (SQL), the standard query language for relational database systems. To support human-computer collaboration effectively, query processing is necessary in data mining. As the ultimate objective of data retrieval and presentation is the formulation of knowledge, it is difficult to create a single standard query language for all purposes of data mining. Nevertheless, the following functionalities can be implemented through the design of queries that support the examination of the relevancy, usefulness, interestingness, and novelty of extracted knowledge.

1. Schematics Examination: Through this query function, the data miner is allowed to set different values for the parameters of the data visualization instrument to perceive various schematic visual presentations.
2. Consistency Examination: To cross-check the data mining results, the data miner may choose different sets of data of the database to check if the conclusion from one set of data is consistent with others. This query function allows the data miner to make such a consistency examination.
3. Relevancy Examination: It is a fundamental law that, to validate a data mining result, one must use external data, which are not used in generating this result but are relevant to the problem being investigated. For instance, the data of customer attributes can be used for clustering to identify significant market segments for the company. However, to decide whether the market segments are relevant to a particular product, one must use separate product survey data. This query function allows the data miner to use various external data to examine the data mining results.
4. Dependability Examination: The concept of dependability examination in interactive visual data mining is similar to that of factor analysis in traditional statistical analysis, but the dependability examination query function is more comprehensive in determining whether a variable contributes to the data mining results in a certain way.
5. Homogeneousness Examination: Knowledge formulation often needs to identify the ranges of values of a determinant variable so that observations with values of a certain range in this variable have a homogeneous behavior. This query function provides an interactive mechanism for the data miner to decompose variables for homogeneousness examination.

Human-Computer Interface

The human-computer interface allows the data miner to dialog with the computer. It integrates the database, data visualization instruments, and data and model assembly into a single computing environment. Through the human-computer interface, the data miner is able to access the data visualization instruments, select data sets, invoke the query process, organize the screen, set colors and animation speed, and manage the intermediate data mining results.

FUTURE TRENDS

Interactive visual data mining techniques will become key components of any data mining instruments. More theories and techniques of interactive visual data mining will be developed in the near future, followed by comprehensive comparisons of these theories and techniques. Query systems along with data visualization functions on large-scale database systems for data mining will be available for data mining practitioners.

CONCLUSION

Given the fact that a one-shot knowledge deduction may not provide an alternative result if it fails, we must provide an integrated computing environment for the data miner through interactive visual data mining. An interactive visual data mining system consists of three intertwined components, besides the database: data visualization instrument, data and model assembly instrument, and human-computer interface. In interactive visual data mining, the human-computer interaction and effective visual presentations of multivariate data allow the data miner to interpret the data mining results based on the particular problem domain, his/her perception, specialty, and creativity. The ultimate objective of interactive visual data mining is to allow the data miner to conduct the experimental process and examination simultaneously through human-computer collaboration in order to obtain a satisfactory result.
REFERENCES

Bell, P.C. (1991). Visual interactive modelling: The past, the present, and the prospects. European Journal of Operational Research, 54(3), 274-286.

Chen, M., Zhu, Q., & Chen, Z. (2001). An integrated interactive environment for knowledge discovery from heterogeneous data resources. Information and Software Technology, 43(8), 487-496.

de Oliveira, M.C.F., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. IEEE Transactions on Visualization and Computer Graphics, 9(3), 378-394.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Han, J., Hu, X., & Cercone, N. (2003). A visualization model of interactive knowledge discovery systems and its implementations. Information Visualization, 2(2), 105-112.

Keim, D.A., & Kriegel, H.P. (1996). Visualization techniques for mining large databases: A comparison. IEEE Transactions on Knowledge & Data Engineering, 8(6), 923-938.

Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.

Kosslyn, S.M. (1980). Image and mind. Cambridge, MA: Harvard University Press.

Montazemi, A., & Wang, S. (1988). The impact of information presentation modes on decision making: A meta-analysis. Journal of Management Information Systems, 5(3), 101-127.

Newell, A., & Simon, H.A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice Hall.

Shneiderman, B. (2002). Inventing discovery tools: Combining information visualization with data mining. Information Visualization, 1, 5-12.

Wang, S. (2000). Neural networks. In M. Zeleny (Ed.), IEBM handbook of IT in business (pp. 382-391). London, UK: International Thomson Business Press.

Wang, S. (2002). Nonlinear pattern hypothesis generation for data mining. Data & Knowledge Engineering, 40(3), 273-283.

Wang, S., & Wang, H. (2002). Knowledge discovery through self-organizing maps: Data visualization and query processing. Knowledge and Information Systems, 4(1), 31-45.

Yang, L. (2003). Visual exploration of large relational data sets through 3D projections and footprint splatting. IEEE Transactions on Knowledge and Data Engineering, 15(6), 1460-1471.

KEY TERMS

Data and Model Assembly: A set of query functions that assemble the data and the data visualization instruments for data mining.

Data Visualization: Presentation of data in human understandable graphics, images, or animation.

Human-Computer Interface: An integrated computing environment that allows the data miner to access the data visualization instruments, select data sets, invoke the query process, organize the screen, set colors and animation speed, and manage the intermediate data mining results.

Interactive Data Mining: A human-computer collaboration knowledge discovery process through the interaction between the data miner and the computer to extract novel, plausible, useful, relevant, and interesting knowledge from the database.

Query Tool: Structured query language that supports the examination of the relevancy, usefulness, interestingness, and novelty of extracted knowledge for interactive data mining.

Self-Organizing Map (SOM): A two-layer neural network that maps high-dimensional data onto low-dimensional pictures through an unsupervised or competitive learning process. It allows the data miner to view the clusters on the output maps.

Visual Data Mining: A data mining process through data visualization. The fundamental concept of visual data mining is the interaction between data visual presentation, human graphics cognition, and problem solving.
Interscheme Properties Role in Data Warehouses
Pasquale De Meo
Università Mediterranea di Reggio Calabria, Italy

Giorgio Terracina
Università della Calabria, Italy

Domenico Ursino
Università Mediterranea di Reggio Calabria, Italy

INTRODUCTION

In this article, we illustrate a general approach for the semi-automatic construction and management of data warehouses. Our approach is particularly suited when the number or the size of involved sources is large and/or when it changes quite frequently over time. Our approach is based mainly on the semi-automatic derivation of interschema properties (i.e., terminological and structural relationships holding among concepts belonging to different input schemas). It consists of the following steps: (1) enrichment of schema descriptions obtained by the semi-automatic extraction of interschema properties; (2) exploitation of derived interschema properties for obtaining, in a data repository, an integrated and abstracted view of available data; and (3) design of a three-level data warehouse having as its core the derived data repository.

BACKGROUND

In the last years, an enormous increase of data available in electronic form has been witnessed, as well as a corresponding proliferation of query languages, data models, and data management systems. Traditional approaches to data management do not seem to guarantee, in these cases, the needed level of access transparency to stored data while preserving the autonomy of local data sources. This situation contributed to push the development of new architectures for data source interoperability, allowing users to query preexisting autonomous data sources in a way that guarantees model, language, and location transparency.
In all the architectures for data source interoperability, components handling the reconciliation of involved information sources play a relevant role. In the construction of these components, schema integration (Chua, Chiang, & Lim, 2003; dos Santos Mello, Castano & Heuser, 2002; McBrien & Poulovassilis, 2003) plays a key role.
However, when involved systems are numerous and/or large, schema integration alone typically ends up producing a too complex global schema that may, in fact, fail to supply a satisfactory and convenient description of available data. In these cases, schema integration steps must be completed by executing schema abstraction steps (Palopoli, Pontieri, Terracina & Ursino, 2000). Carrying out a schema abstraction activity amounts to clustering objects belonging to a schema into homogeneous subsets and producing an abstracted schema obtained by substituting each subset with one single object representing it.
In order for schema integration and abstraction to be correctly carried out, the designer has to clearly understand the semantics of the involved information sources. One of the most common ways for deriving and representing schema semantics consists in detecting the so-called interschema properties (Castano, De Antonellis & De Capitani di Vimercati, 2001; Doan, Madhavan, Dhamankar, Domingos & Levy, 2003; Gal, Anaby-Tavor, Trombetta & Montesi, 2004; Madhavan, Bernstein & Rahm, 2001; Melnik, Garcia-Molina & Rahm, 2002; Palopoli, Saccà, Terracina & Ursino, 2003; Palopoli, Terracina & Ursino, 2001; Rahm & Bernstein, 2001). These are terminological and structural properties relating concepts belonging to different schemas.
In the literature, several manual methods for deriving interschema properties have been proposed (see Batini, Lenzerini, and Navathe (1986) for a survey on this topic). These methods can produce very precise and satisfactory results. However, since they require a great amount of work from the human expert, they are difficult to apply when involved sources are numerous and large.
To handle large amounts of data, various semi-automatic methods have also been proposed. These are much less resource consuming than manual ones; moreover,
interschema properties obtained by semi-automatic techniques can be updated and maintained more simply. In the past, semi-automatic methods were based on considering only structural similarities among objects belonging to different schemas. Presently, all interschema property derivation techniques also take into account the context in which schema concepts have been defined (Rahm & Bernstein, 2001).
The dramatic increase of available data sources has also led to a large variety of both structured and semi-structured data formats; in order to manage them uniformly, it is necessary to exploit a unified paradigm. In this context, one of the most promising solutions is XML. Due to its semi-structured nature, XML can be exploited as a unifying formalism for handling the interoperability of information sources characterized by heterogeneous data representation formats.

MAIN THRUST

Overview of the Approach

In this article we define a new framework for uniformly and semi-automatically constructing a data warehouse from numerous and large information sources characterized by heterogeneous data representation formats. In more detail, the proposed framework consists of the following steps:

• Translation of the involved information sources into XML ones.
• Application to the XML sources derived in the previous step of almost automatic techniques for detecting interschema properties, specifically conceived to operate on XML environments.
• Exploitation of the derived interschema properties for constructing an integrated and uniform representation of the involved information sources.
• Exploitation of this representation as the core of the reconciled data level of a data warehouse¹.

In the following subsections, we will illustrate the last three steps of this framework. The translation step is not discussed here, because it is performed by applying the translation rules from the involved source formats to XML already proposed in the literature.

Extraction of Interschema Properties

A possible taxonomy classifies interschema properties into terminological properties, subschema similarities, and structural properties.
Terminological properties are synonymies, homonymies, hyponymies, overlappings, and type conflicts. A synonymy between two concepts A and B indicates that they have the same meaning. A homonymy between two concepts A and B indicates that they have the same name but different meanings. Concept A is said to be a hyponym of concept B (which, in turn, is a hypernym of A) if A has a more specific meaning than B. An overlapping exists between concepts A and B if they are neither synonyms nor hyponyms of each other but share a significant set of properties; more formally, there exists an overlapping between A and B if there exist non-empty sets of properties {pA1, pA2, ..., pAn} of A and {pB1, pB2, ..., pBn} of B such that, for 1 ≤ i ≤ n, pAi is a synonym of pBi. A type conflict indicates that the same concept is represented by different constructs (e.g., an element and an attribute in an XML source) in different schemas.
A subschema similarity represents a similitude between fragments of different schemas.
Structural properties are inclusions and assertions between knowledge patterns. An inclusion between two concepts A and B indicates that the instances of A are a subset of the instances of B. An assertion between knowledge patterns indicates either a subsumption or an equivalence between knowledge patterns. Roughly speaking, knowledge patterns can be seen as views on the involved information sources.
Our interschema property extraction approach is characterized by the following features: (1) it is XML-based; (2) it is almost automatic; and (3) it is semantic.
Given two concepts belonging to different information sources, one of the most common ways for determining their semantics consists of examining their neighborhoods, since the concepts and the relationships in which they are involved contribute to define their meaning. In addition, our approach exploits two further indicators for defining in a more precise fashion the semantics of the involved data sources. These indicators are the types and the cardinalities of the elements, taking the attributes belonging to the XML schemas into consideration.
It is clear from this reasoning that concept neighborhood plays a crucial role in the interschema property computation. In XML schemas, concepts are expressed by elements or attributes. Since, for the interschema property extraction, it is not important to distinguish concepts represented by elements from concepts represented by attributes, we introduce the term x-component for denoting an element or an attribute in an XML schema.
In order to compute the neighborhood of an x-component, it is necessary to define a semantic distance² between two x-components of the same schema; it takes
into account how much they are related. The formulas for computing this distance are quite complex; due to space limitations we cannot show them here. However, the interested reader can refer to De Meo, Quattrone, Terracina, and Ursino (2003) for a detailed illustration of them.
We can now define the neighborhood of an x-component. In particular, given an x-component xS of an XML schema S, the neighborhood of level j of xS consists of all x-components of S whose semantic distance from xS is less than or equal to j.
In order to verify if two x-components x1j, belonging to an XML schema S1, and x2k, belonging to an XML schema S2, are synonymous, it is necessary to examine their neighborhoods. More specifically, it is first necessary to verify if their nearest neighborhoods (i.e., the neighborhoods of level 0) are similar. This decision is made by computing a suitable objective function associated with the maximum weight matching on a bipartite graph constructed from the x-components of the neighborhoods under consideration and the lexical synonymies stored in a thesaurus (e.g., WordNet)³. If these two neighborhoods are similar, then x1j and x2k are assumed to be synonymous.
However, observe that the neighborhoods of level 0 of x1j and x2k provide quite a limited vision of their contexts. If a higher certainty on the synonymy between x1j and x2k is required, it is necessary to verify the similarity not only of their neighborhoods of level 0, but also of the other neighborhoods. As a consequence, it is possible to introduce a severity level u against which interschema properties can be determined, and to say that x1j and x2k are synonymous with a severity level u if all neighborhoods of x1j and x2k of a level less than or equal to u are similar.
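The matching step can be sketched as follows; the component names, toy similarity scores, and the normalization used as the objective function are hypothetical stand-ins for the article's (unshown) formulas, and SciPy's linear_sum_assignment is used for the maximum weight matching.

```python
# Sketch: decide whether two neighbourhoods are similar via a maximum-weight
# matching on the bipartite graph of their x-components.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Similarity between x-components of the two neighbourhoods, e.g. high when a
# thesaurus such as WordNet reports a lexical synonymy, low otherwise.
neigh1 = ["customer", "address", "order"]
neigh2 = ["client", "location", "purchase", "invoice"]
sim = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
])

# Maximum-weight matching (minimise the negated weights).
rows, cols = linear_sum_assignment(-sim)
matched_weight = sim[rows, cols].sum()

# A simple objective: matched weight normalised by the larger neighbourhood;
# the neighbourhoods are judged similar if it exceeds a chosen threshold.
objective = matched_weight / max(len(neigh1), len(neigh2))
print(objective > 0.5)
```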
After all synonymies of S1 and S2 have been extracted, homonymies can be derived. In particular, there exists a homonymy between two x-components x1j and x2k with a severity level u if: (1) x1j and x2k have the same name; (2) both of them are elements or both of them are attributes; and (3) they are not synonymous with a severity level u. In other words, a homonymy indicates that two concepts having the same name represent different meanings.
Due to space constraints, we cannot describe in this article the derivation of all the other interschema properties mentioned; however, it follows the same philosophy as the detection of synonymies and homonymies. The interested reader can find a detailed description of it in Ursino (2002).

Construction of a Uniform Representation

Detected interschema properties can be exploited for constructing a global representation of the involved information sources; this becomes the core of the reconciled data level in a three-level DW.
Generally, in classical approaches, this global representation is obtained by integrating all involved data sources into a unique one. However, when involved sources are numerous and large, a unique global schema presumably encodes an enormous number and variety of objects and becomes far too complex to be used effectively.
In order to overcome the drawbacks mentioned previously, our approach does not directly integrate the involved source schemas to construct a global flat schema. Rather, it first groups them into homogeneous clusters and then integrates schemas on a cluster-by-cluster basis. Each integrated schema thus obtained is then abstracted to construct a global schema representing the cluster. The aforementioned process is iterated over the set of obtained cluster schemas, until one schema is left. In this way, a hierarchical structure is obtained, which is called a Data Repository (DR).
Each cluster of a DR represents a group of homogeneous schemas and is, in turn, represented by a schema (hereafter called C-schema). Clusters of level n of the hierarchy are obtained by grouping some C-schemas of level n-1; clusters of level 0 are obtained by grouping input source schemas. Therefore, each cluster Cl is characterized by (1) its identifier C-id; (2) its C-schema; (3) the group of identifiers of clusters whose C-schemas originated the C-schema of Cl (hereafter called O-identifiers); (4) the set of interschema properties involving objects belonging to the C-schemas that originated the C-schema of Cl; and (5) a level index.
It is clear from this reasoning that the three fundamental operations for obtaining a DR are (1) schema clustering (Han & Kamber, 2001), which takes a set of schemas as input and groups them into semantically homogeneous clusters; (2) schema integration, which produces a global schema from a set of heterogeneous input schemas; and (3) schema abstraction, which groups concepts of a schema into homogeneous clusters and, in the abstracted schema, represents each cluster with only one concept.

Exploitation of the Uniform Representation for Constructing a DW

The Data Repository can be exploited as the core structure of the reconciled level of a new three-level DW architecture. Indeed, different from classical three-level architectures, in order to reconcile data, we do not directly integrate the involved schemas to construct a flat global schema. Rather, we first collect subsets of the involved schemas into homogeneous clusters and construct a DR that is used as the core of the reconciled data level.
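A minimal data-structure sketch of such a DR cluster, with one field per characteristic listed above, might look as follows; the class and the sample values are purely illustrative, not part of the approach's actual implementation.

```python
# Sketch of a DR cluster record, one field per characteristic (1)-(5).
from dataclasses import dataclass, field

@dataclass
class Cluster:
    c_id: str                      # (1) cluster identifier
    c_schema: dict                 # (2) the C-schema representing the cluster
    o_identifiers: list[str]       # (3) clusters whose C-schemas originated it
    interschema_properties: list = field(default_factory=list)  # (4)
    level: int = 0                 # (5) level index in the hierarchy

# Level-0 clusters group the input source schemas; higher levels group C-schemas.
source_cluster = Cluster("C0_1", {"elements": ["customer", "order"]}, [], [], 0)
upper_cluster = Cluster("C1_1", {"elements": ["party", "transaction"]},
                        ["C0_1", "C0_2"], [("synonymy", "customer", "client")], 1)
```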
In order to pinpoint the differences between classical three-level DW architectures and ours, the following observations can be drawn:

• A classical three-level DW architecture is a particular case of the one proposed here, since it corresponds to a case where involved sources are all grouped into one cluster and no abstraction is carried out over the associated C-schema.
• The architecture we propose here is naturally conducive to an incremental DW construction.
• In a classical three-level architecture designed over a large number of sources, presumably hundreds of concepts are represented in the global schema. In particular, the global schema can be seen as partitioned into subschemas loosely related to each other, whereas each subschema contains objects tightly related to each other. Such a difficulty does not characterize our architecture, where each source cluster is associated with a precise semantics.
• In our architecture, data mart design is presumably simpler than with classical architectures, since each data mart will operate over a group of data sources spanned by a subtree of our core DR rooted at some C-schema of level k, for some k.
• In our architecture, reconciled data are (virtually) represented at various abstraction levels within the core DR.

It follows from these observations that, by paying a limited price in terms of required space and computation time, we obtain an architecture that retains all the strengths of classical three-level architectures but overcomes some of their limitations arising when the involved data sources are numerous and large.

FUTURE TRENDS

In the next years, interschema properties will presumably play a relevant role in various applications involving heterogeneous data sources. Among them we cite ontology matching, the semantic Web, e-services, semantic query processing, Web-based financial services, and biological data management.
As an example, in this last application field, the high-throughput techniques for data collection developed in the last years have led to an enormous increase of available biological databases; for instance, the largest public DNA database contains much more than 20 GB of data (Hunt, Atkinson & Irving, 2002). Interschema properties can play a relevant role in this context; indeed, they can allow the manipulation of biological data sources possibly relative to different, yet related and complementary, application contexts.

CONCLUSION

In this article we have illustrated an approach for the semi-automatic construction and management of data warehouses. We have shown that our approach is particularly suited when the number or the size of involved sources is large and/or when they change quite frequently over time.
Various experiments have been conducted for verifying the applicability of our approach. The interested reader can find many of them in Ursino (2002), as far as their application to Italian Central Government Offices databases is concerned, and in De Meo, Quattrone, Terracina, and Ursino (2003) for their application to semantically heterogeneous XML sources.
In the future, we plan to extend our studies on the role of interschema properties to the various application fields mentioned in the previous section.

REFERENCES

Batini, C., Lenzerini, M., & Navathe, S.B. (1986). A comparative analysis of methodologies for database scheme integration. ACM Computing Surveys, 15(4), 323-364.

Castano, S., De Antonellis, V., & De Capitani di Vimercati, S. (2001). Global viewing of heterogeneous data sources. IEEE Transactions on Data and Knowledge Engineering, 13(2), 277-297.

Chua, C.E.H., Chiang, R.H.L., & Lim, E.P. (2003). Instance-based attribute identification in database integration. The International Journal on Very Large Databases, 12(3), 228-243.

De Meo, P., Quattrone, G., Terracina, G., & Ursino, D. (2003). Almost automatic and semantic integration of XML schemas at various severity levels. Proceedings of the International Conference on Cooperative Information Systems (CoopIS 2003), Taormina, Italy.

Doan, A., Madhavan, J., Dhamankar, R., Domingos, P., & Halevy, A. (2003). Learning to match ontologies on the semantic Web. The International Journal on Very Large Databases, 12(4), 303-319.

dos Santos Mello, R., Castano, S., & Heuser, C.A. (2002). A method for the unification of XML schemata. Information & Software Technology, 44(4), 241-249.
Gal, A., Anaby-Tavor, A., Trombetta, A., & Montesi, D. (2004). A framework for modeling and evaluating automatic semantic reconciliation. The International Journal on Very Large Databases [forthcoming].

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.

Hunt, E., Atkinson, M.P., & Irving, R.W. (2002). Database indexing for large DNA and protein sequence collections. The International Journal on Very Large Databases, 11(3), 256-271.

Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic schema matching with Cupid. Proceedings of the International Conference on Very Large Data Bases (VLDB 2001), Rome, Italy.

McBrien, P., & Poulovassilis, A. (2003). Data integration by bi-directional schema transformation rules. Proceedings of the International Conference on Data Engineering (ICDE 2003), Bangalore, India.

Melnik, S., Garcia-Molina, H., & Rahm, E. (2002). Similarity flooding: A versatile graph matching algorithm and its application to schema matching. Proceedings of the International Conference on Data Engineering (ICDE 2002), San Jose, California.

Palopoli, L., Pontieri, L., Terracina, G., & Ursino, D. (2000). Intensional and extensional integration and abstraction of heterogeneous databases. Data & Knowledge Engineering, 35(3), 201-237.

Palopoli, L., Saccà, D., Terracina, G., & Ursino, D. (2003). Uniform techniques for deriving similarities of objects and subschemas in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 15(2), 271-294.

Palopoli, L., Terracina, G., & Ursino, D. (2001). A graph-based approach for extracting terminological properties of elements of XML documents. Proceedings of the International Conference on Data Engineering (ICDE 2001), Heidelberg, Germany.

Rahm, E., & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The International Journal on Very Large Databases, 10(4), 334-350.

Ursino, D. (2002). Extraction and exploitation of intensional knowledge from heterogeneous information sources: Semi-automatic approaches and tools. Springer.

KEY TERMS

Assertion Between Knowledge Patterns: A particular interschema property. It indicates either a subsumption or an equivalence between knowledge patterns. Roughly speaking, knowledge patterns can be seen as views on the involved information sources.

Data Repository: A complex catalogue of a set of sources organizing both their description and all associated information at various abstraction levels.

Homonymy: A particular interschema property. A homonymy between two concepts A and B indicates that they have the same name but different meanings.

Hyponymy/Hypernymy: A particular interschema property. Concept A is said to be a hyponym of a concept B (which, in turn, is a hypernym of A) if A has a more specific meaning than B.

Interschema Properties: Terminological and structural relationships involving concepts belonging to different sources.

Overlapping: A particular interschema property. An overlapping exists between two concepts A and B if they are neither synonyms nor hyponyms of each other but share a significant set of properties; more formally, there exists an overlapping between A and B if there exist non-empty sets of properties {pA1, pA2, ..., pAn} of A and {pB1, pB2, ..., pBn} of B such that, for 1 ≤ i ≤ n, pAi is a synonym of pBi.

Schema Abstraction: The activity that clusters objects belonging to a schema into homogeneous groups and produces an abstracted schema obtained by substituting each group with one single object representing it.

Schema Integration: The activity by which different input source schemas are merged into a global structure representing all of them.

Subschema Similarity: A particular interschema property. It represents a similitude between fragments of different schemas.

Synonymy: A particular interschema property. A synonymy between two concepts A and B indicates that they have the same meaning.

Type Conflict: A particular interschema property. It indicates that the same concept is represented by different constructs (e.g., an element and an attribute in an XML source) in different schemas.
an XML source) in different schemas.

651

TEAM LinG
Interscheme Properties Role in Data Warehouses

ENDNOTES 3
Clearly, if necessary, a more specific thesaurus,
possibly constructed with the support of a human
1
Here and in the following, we shall consider a expert, might be used.
three-level data warehouse architecture.
2
Semantic distance is often called connection cost
in the literature.

652

TEAM LinG
653

Inter-Transactional Association Analysis for I


Prediction
Ling Feng
University of Twente, The Netherlands

Tharam Dillon
University of Technology Sydney, Australia

INTRODUCTION intratransactional associations, an intertransactional as-


sociation describes the association relationships across
The discovery of association rules from large amounts of different transactions, such as, if (company) As stock
structured or semi-structured data is an important data- goes up on day one, Bs stock will go down on day two
mining problem (Agrawal et al., 1993; Agrawal & Srikant, but go up on day four. In this case, whether we treat
1994; Braga et al., 2002, 2003; Cong et al., 2002; Miyahara company or day as the unit of transaction, the associated
et al., 2001; Termier et al., 2002; Xiao et al., 2003). It has items belong to different transactions.
crucial applications in decision support and marketing
strategy. The most prototypical application of associa-
tion rules is market-basket analysis using transaction MAIN TRUSTS
databases from supermarkets. These databases contain
sales transaction records, each of which details items Extensions from Intratransaction to
bought by a customer in the transaction. Mining associa- Intertransaction Associations
tion rules is the process of discovering knowledge such
as, 80% of customers who bought diapers also bought We extend a series of concepts and terminologies for
beer, and 35% of customers bought both diapers and beer, intertransactional association analysis. Throughout the
which can be expressed as diaper beer (35%, 80%), discussion, we assume that the following notation is
where 80% is the confidence level of the rule, and 35% is used.
the support level of the rule indicating how frequently the
customers bought both diapers and beer. In general, an A finite set of literals called items I = {i 1, i2, , i n}.
association rule takes the form X Y (s, c), where X and A finite set of transaction records T = {t1, t2, , tl},
Y are sets of items, and s and c are support and confidence, where for ti T, t i I.
respectively. A finite set of attributes called dimensional at-
tributes A = {a1, a2, , am}, whose domains are finite
subsets of nonnegative integers.
BACKGROUND
An Enhanced Transactional Database
While the traditional association rules have demonstrated
strong potential in areas such as improving marketing
Model
strategies for the retail industry (Dunham, 2003; Han &
Kamer, 2001), their emphasis is on description rather than In classical association analysis, records in a transac-
prediction. Such a limitation comes from the fact that tional database contain only items. Although transac-
traditional association rules only look at association tions occur under certain contexts, such as time, place,
relationships among items within the same transactions, customers, and so forth, such contextual information has
whereas the notion of the transaction could be the items been ignored in classical association rule mining, due to
bought by the same customer, the atmospheric events the fact that such rule mining was intratransactional in
that happened at the same time, and so on. To overcome nature. However, when we talk about intertransactional
this limitation, we extend the scope of mining association associations across multiple transactions, the contexts of
rules from such traditional intra-transactional associa- occurrence of transactions become important and must be
tions to intertransactional associations for prediction taken into account.
(Feng et al., 1999, 2001; Lu et al., 2000). Compared to Here, we enhance the traditional transactional data-
base model by associating each transaction record with

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Inter-Transactional Association Analysis for Prediction

a number of attributes that describe the context within Normalized Extended Item (Transaction)
which the transaction happens. We call them dimen- Sets
sional attributes, because, together, these attributes
constitute a multi-dimensional space, and each transac- We call an extended itemset a normalized extended itemset,
tion can be mapped to a certain point in this space. if all its extended items are positioned with respect to the
Basically, dimensional attributes can be of any kind, as smallest reference point of the set. In other words, the
long as they are meaningful to applications. Time, dis- extended items in the set have the minimal relative dis-
tance, temperature, latitude, and so forth are typical di- tance 0 for each dimension. Formally, let Ie = {(d1,1, d1,2,
mensional attributes. (i ), (d2,1, d2,2, , d2,m)(i 2), , (dk,1, dk,2, , dk,m)(ik)} be an
,~d1,m) 1
extended itemset. Ie is a normalized extended itemset, if
Multidimensional Contexts and only if for j (1 j k) i (1 i m), min (dj, i) = 0.
The normalization concept can be applied to an ex-
An m-dimensional mining context can be defined through tended transaction set as well. We call an extended trans-
m dimensional attributes a1, a2, , am, each of which action set a normalized extended transaction set, if all its
represents a dimension. When m=1, we have a single- extended transactions are positioned with respect to the
dimensional mining context. Let ni = (ni.a1, ni.a2, , ni.am) smallest reference point of the set. Any non-normalized
and nj = (nj.a1, nj.a2, , nj.am) be two points in an m- extended item (transaction) set can be transformed into a
dimensional space, whose values on the m dimensions are normalized one through a normalization process, where
represented as ni.a1, ni.a2, , ni.am and nj.a1, nj.a2, , nj.am, the intention is to reposition all the involved extended
respectively. Two points ni and nj are equal, if and only if items (transactions) based on the smallest reference point
for k (1 k m), ni.ak = nj.ak. A relative distance between of this set. We use INE and TNE to denote the set of all
ni and nj is defined as ni, nj = (nj.a1-ni.a1, nj.a2-ni.a2, , possible normalized extended itemsets and normalized
nj.am-ni.am). We also use the notation (d1, d2, , dm), where extended transaction sets, respectively. According to the
dk = nj.ak-ni.a k (1 k m), to represent the relative distance above definitions, any superset of a normalized extended
between two points ni and nj in the m-dimensional space. item (transaction) set is also a normalized extended item
Besides, the absolute representation (ni.a1, ni.a2, , (transaction) set.
ni.a m) for point ni, we also can represent it by indicating its
relative distance n0, ni from a certain reference point n0, Multidimensional Intertransactional
(i.e., n0+n0, ni, where ni = n0+n0, ni). Note that ni, n0,
ni, and (ni.a1-n0.a1, ni.a2-n0.a2, , ni.am-n0.am) can be used inter-
Association Rule Framework
changeably, since each of them refers to the same point
ni in the space. Let N = {n1, n 2, , nu} be a set of points in With the above extensions, we are now in a position to
an m-dimensional space. We construct the smallest refer- formally define intertransactional association rules and
ence point of N, n*, where for k (1 k m), n*.ak = min related measurements.
(n1.ak, n 2.ak, , nu.ak).
Definition 1
Extended Items (Transactions)
A multidimensional intertransactional association rule is
The traditional concepts regarding item and transaction an implication of the form X Y, where
can be extended accordingly under an m-dimensional
context. We call an item i kI happening at the point (d1, (1) X INE and Y IE;
, (i.e., at the point (n0.a1+d1, n0.a2+d2, , n0.am+dm)), (2) The extended items in X and Y are positioned with
d2, , dm)
an extended item and denote it as (d1, d2, , dm)(ik). In a respect to the same reference point;
similar fashion, we call a transaction tkT happening at the (3) For (x1, x2, , xm)(i x) X, (y1, y2, , ym)(ix) Y, xj yj
point (d1, d2, , dm) an extended transaction and denote it as (1 j m);
D(d1, d2, , dm)(t k). The set of all possible extended items, IE, (4) X Y = .
is defined as a set of (d1, d2, , dm)(i k) for any ikI at all
possible points (d1, d2, , dm) in the m-dimensional space. TE Different from classical intratransactional association
is the set of all extended transactions, each of which rules, the intertransactional association rules capture the
contains a set of extended items, in the mining context. occurrence contexts of associated items. The first clause

654

TEAM LinG
Inter-Transactional Association Analysis for Prediction

of the definition requires the precedent and antecedent Candidate Generation


of an intertransactional association rule to be a normal- I
ized extended itemset and an extended itemset, respec- Pass 1: Let I={1, 2, ..., m} be a set of items in a
tively. The second clause of the definition ensures that database. To generate the candidate set C1 of 1-
items in X and Y are comparable in terms of their contex- itemsets, we need to associate all possible inter-
tual positions. For prediction, each of the consequent vals with each item. That is,
items in Y takes place in a context later than any of its
precedent items in X, as stated by the third clause. C1 = { 0(1), 1(1), , maxspan(1), 0(2), 1(2), ,
maxspan(2),
Definition 2
0(m), 1(m), , maxspan(m) }
Given a normalized extended itemset X and an extended
itemset Y, let |Txy| be the total number of minimal extended Starting from transaction t at the reference point s
transaction sets that contain XY, |Tx| and the total number (i.e., extended transaction s(t)), the transaction t at the
of minimal extended transaction sets that contain X, and |Te| point s+x in the dimensional space (i.e., extended trans-
be the total number of extended transactions in the data- action s+x(t)) is scanned to determine whether item i n
base. The support and confidence of an intertransactional exists. If so, the count of x(i n) increases by one. One
association rule X Y are support(X Y) = |T xy|/|T e| and scan of the database will deliver the frequent set L1.
confidence(X Y) = |Txy| / |T x|.
Pass 2: We generate a candidate 2-itemset {0(i m),
Discovery of Intertransactional x(in)} from any two frequent 1-itemsets in L1, 0(im)
Association Rules and x(i n), and obtain C2 = {{0(i m), x(i n)} | (x=0
im<in) (x
0)}.
To investigate the feasibility of mining intertransaction
rules, we extended the classical a priori algorithm and Pass k>2: Given Lk-1, the set of all frequent (k-1)-
applied it to weather forecasting. To limit the search space, itemsets, the candidate generation procedure re-
we set a maximum span threshold, maxspan, to define a turns a superset of the set of all frequent k-itemsets.
sliding window along time dimensions. Only the associa- This procedure has two parts. In the join phase, we
tions among the items that co-occurred within the window join two frequent (k-1)-itemsets in Lk-1 to derive a
are of interest. Basically, the mining process of candidate k-itemset. Let p={u1 (i 1), u2 (i2),,uk-
intertransactional association rules can be divided into (i )} and q={v1 (j 1), v2 (j2),, vk-1 (jk-1)}, where
1 k-1

two steps: frequent extended itemset discovery and asso- p, q Lk-1, we have
ciation rule generation.
insert into Ck
1. Frequent Extended Itemset Discovery
select p.u1 (i1),, , p. uk-1 (i k-1), q. vk-1 (jk-1)
In this phase, we find the set of all frequent extended
itemsets. For simplicity, in the following, we use itemset from p in Lk-1, q in Lk-1
and extended itemset, transaction and extended transac-
tion interchangeably. Let Lk represent the set of frequent where (i1=j1 u1=v1) (i k-2=j k-2 uk-2=vk-2) (uk-1<vk-
k-itemsets and Ck the set of candidate k-itemsets. The 1
(uk-1=vk-1 ik-1<jk-1))
algorithm makes multiple passes over the database. Each
pass consists of two phases. First, the set of all frequent Next, in the prune phase, we delete all those extended
(k-1)-itemsets Lk-1 found in the (k-1)th pass is used to itemsets in Ck that have some (k-1)-subsets with sup-
generate the candidate itemset Ck. The candidate genera- ports less than the support threshold.
tion procedure ensures that Ck is a superset of the set of
all frequent k-itemsets. The algorithm now scans the data- Support Counting
base. For each list of consecutive transactions, it deter-
mines which candidates in Ck are contained and increments To facilitate the efficient support counting process, a
their counts. At the end of the pass, Ck is examined to check candidate Ck of k-itemsets is divided into k groups, with
which of the candidates actually are frequent, yielding Lk. each group Go containing o number of items whose
The algorithm terminates when L k becomes empty. In the intervals are 0 (1 o k). For example, a candidate set of
following, we detail the procedures for candidate genera- 3-itemsets
tion and support counting.

655

TEAM LinG
Inter-Transactional Association Analysis for Prediction

C3={ {0(a), 1(a), 2(b)}, {0(c), (d), 2(d)}, {0(a), wind speed, temperature, relative humidity, rainfall, and
0(b), 3(h)}, {0(l), 0(m), 0(n)}, {0(p), v0(q), 0(r)} ] mean sea level pressure, and so forth every six hours each
day. In some data records, certain atmospheric observa-
is divided into three groups: tions, such as relative humidity, are missing. Since the
context constituted by the time dimension is valid for the
G1={{0(a), 1(a), 2(b)}},G2={{0(c), 0(d), 2(d)}, whole set of meteorological records (transactions), we fill
{0(a), 0(b), 3(h)}}, G3={{0(l), 0(m), 0(n)}, in these empty fields by averaging their nearby values. In
{0(p), 0(q), 0(r)} } this way, the data to be mined contains no missing fields;
in addition, no database holes (i.e., meaningless contexts)
Each group is stored in a modified hash-tree. Only exist in the mining space. Essentially, there is one dimen-
those items with interval 0 participate the construction of sion in this case; namely, time. After preprocessing the
this hash tree (e.g., in G2, only {0(c), 0(d)} and {0(a), data set, we discover intertransactional association rules
D0(b)} enter the hash-tree). The construction process is from the 1995 meteorological records and then examine
similar to that of a priori. The rest items, 2(d) and 3(h), their prediction accuracy using the 1996 meteorological
are simply attached to the corresponding itemsets, {0(c), data from the same area in Hong Kong. Considering
0(d)} and {0(a), 0(b)}, respectively, in the leaves of the seasonal changes of weather, we extract records from
tree. May to October for our experiments, totaling 736 records
Upon reading one transaction of the database, every (total_days * record_num_per_day = (31+30+31+
hash tree is tested. If one itemset is contained, its attached 31+30+31) * 4 = 736) for each year. These raw data sets,
itemsets whose intervals are larger than 0 will be checked containing continuous data, are further converted into
against the consecutive transactions. In the previous appropriate formats with which the algorithms can work.
example, if {0(a), 0(b)} exists in the current extended Each record has six meteorological elements (items). The
transaction at the point s, then the extended transaction interval of every two consecutive records is six hours. We
s+3(t) will be scanned to see whether it contains item h. set maxspan=11 in order to detect the association rules for
If so, the support of 3-itemset {0(a), 0(b), 3(h)} will a three-day horizon (i.e., (11+1) / 4 = 3).
increase by 1. At support=45% and confidence=92%, from the 1995
meteorological data, we found only one classical associa-
Association Rule Generation tion rule: if the humidity is medium wet, then there is no
rain at the same time (which is quite obvious), but we
Using sets of frequent itemsets, we can find the desired found 580 intertransactional association rules. Note that
intertransactional association rules. The generation of the number of intertransactional association rules re-
intertransaction association rules is similar to the genera- turned depends on the maxspan parameter setting. Table
tion of the classical association rules. 1 lists some significant intertransactional association
rules found from the single-station meteorological data.
Application of Intertransactional We measure their predictive capabilities using the 1996
meteorological data recorded by the same station through
Association Rules to Weather Prediction-Rate (XY) = sup(XY) / sup(X), which can
Prediction achieve more than 90% prediction rate. From the test
results, we find that, with intertransactional association
We apply the previous algorithm to studying Hong Kong rules, more comprehensive and interesting knowledge
meterological data. The database records wind direction, can be discovered from the databases.

Table 1. Some significant intertransactional association rules found from meteorological data

Extended Association Rules Sup. Conf. Pred. Rate


0(2), 3(13) 4(13) 46% 92% 90%
(If there is an east wind direction and no rain in 18 hours, then
there also will be no rain in 24 hours.)
0(2), 0(25), 2(25) 3(25) 45% 95% 91.8%
(If it is currently warm with an east wind direction, and still
warm 12 hours later, then it will be continuously warm until 18
hours later.)

656

TEAM LinG
Inter-Transactional Association Analysis for Prediction

FUTURE TRENDS Braga, D. et al. (2003). Discovering interesting information


in XML data with association rules. Proceedings of the I
Intertransactional association rule framework provides a 18th Symposium on Applied Computing.
general view of associations among events (Feng et al., Cong, G., Yi, L., Liu, B., & Wang, K. (2002). Discovering
2001). frequent substructures from hierarchical semi-structured
data. Proceedings of the SIAM DM02, Arlington, Vir-
Traditional Association Rule Mining: If the dimen- ginia.
sional attributes are not present in the knowledge to
be discovered, or those attributes are not interre- Dunham, M.H. (2003). Data mining: Introductory and
lated, then they can be ignored in analysis, and the advanced topics. Upper Saddle River, NJ: Prentice Hall.
problem becomes the traditional transaction-based
association rule problem. Feng, L. et al. (1999). Mining multi-dimensional inter-
Mining in Multidimensional Databases or Data transaction association rules with templates. Proceed-
Warehouse: Multidimensional database organizes ings of the International ACM Conference on Informa-
data into data cubes with dimensional attributes tion and Knowledge Management, Kansas City, USA.
and measures, on which multidimensional Feng, L. et al. (2002). A template model for multidimen-
intertransactional association mining can be per- sional inter-transactional association rules. International
formed. Journal of VLDB, 11(2), 153-175.
Mining Spatial Association Rules: Existence of
spatial objects can be viewed as events. The coor- Feng, L., Dillon, T., & Liu, J. (2001). Inter-transactional
dinates are dimensional attributes. Related proper- association rules for multidimensional contexts for pre-
ties can be added as other dimensional attributes. diction and their applications to studying meteorological
data. International Journal of Data and Knowledge
Engineering, 37, 85-115.
CONCLUSION Feng, L., Li, Q., & Wong, A.K.Y. (2001). Mining inter-
transactional association rules: Generalization and em-
While the classical association rules have demonstrated pirical evaluation. Proceedings of the 3rd International
strong potential values, such as improving market strat- Conference on Data Warehousing and Knowledge Dis-
egies for retail industry, they are limited to finding asso- covery, Munich, Germany.
ciations among items within a transaction. In this article,
we propose a more general form of association rules, Han, J., & Kamer, M. (2001). Data mining: Concepts and
named multi-dimensional intertransaction association techniques. San Diego: Morgan Kaufmann Publishers.
rules. These rules can represent not only the associations Lu, H., Feng, L., & Han, J. (2000). Beyond intra-transaction
of items within transactions but also associations of items association analysis: Mining multi-dimensional inter-
among different transactions. The classical association transaction association rules. ACM Transactions on In-
rule can be viewed as a special case of multi-dimensional formation Systems, 18(4), 423-454.
intertransaction association rules.
Miyahara, T. et al. (2001). Discovery of frequent tree
structured patterns in semistructured Web documents.
REFERENCES Proceedings of the 5th PAKDD, Hong Kong, China.
Termier, A., Rousset, M., & Sebag, M. (2002). Treefinder:
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining A first step towards XML data mining. Proceedings of the
associations between sets of items in massive databases. IEEE International Conference on Data Mining,
Proceedings of the ACM SIGMOD International Confer- Maebashi City, Japan.
ence on Management of Data, Washington, D.C.
Xiao, Y., Yao, J., Li, Z., & Dunham, M. (2003). Efficicent
Agrawal, R., & Srikant, R. (1994). Fast algorithms for data mining for maximal frequent subtrees. Proceedings of
mining association rules. Proceedings of the 20 th Confer- the 3rd IEEE International Conference on Data Mining,
ence on Very Large Data Bases, Santiago de Chile, Chile. Melbourne, Florida.
Braga, D. et al. (2002). Mining association rules from XML
data. Proceedings of the 4th International Conference
on Data Warehousing and Knowledge Discovery, Aix-en
Provence, France.

657

TEAM LinG
Inter-Transactional Association Analysis for Prediction

KEY TERMS Intratransactional Association: Correlations among


items within the same transactions.
Dimensional Attribute: One dimension of an m-di- Multidimensional Context: The context where data-
mensional context. base transactions, and thus the association relation-
Extended Item: An item associated with the context ships, happen.
where it happens. Normalized Extended Itemset: A set of extended items
Extended Transaction: A transaction associated with whose contextual positions have been positioned with
the context where it happens. respect to the smallest reference position of the set.

Intertransactional Association: Correlations among Normalized Extended Transaction Set: A set of ex-
items not only within the same transactions but also tended transactions whose contextual positions have
across different transactions. been positioned with respect to the largest reference
position of the set.

658

TEAM LinG
659

Interval Set Representations of Clusters I


Pawan Lingras
Saint Marys University, Canada

Rui Yan
Saint Marys University, Canada

Mofreh Hogo
Czech Technical University, Czech Republic

Chad West
IBM Canada Limited, Canada

INTRODUCTION precisely one cluster, we had to assign objects 4 and 9


to Cluster B. However, the dotted circle seems to rep-
The amount of information that is available in the new resent a more natural representation of Cluster B. An
information age has made it necessary to consider vari- ability to specify that Object 9 may either belong to
ous summarization techniques. Classification, cluster- Cluster B or Cluster C, and Object 4 may belong to
ing, and association are three important data-mining Cluster A or Cluster B, will provide a better representa-
features. Association is concerned with finding the tion of the clustering of the 12 objects in Figure 1.
likelihood of co-occurrence of two different concepts. Rough-set theory provides such an ability.
For example, the likelihood of a banana purchase given The notion of rough sets was proposed by Pawlak
that a shopper has bought a cake. Classification and (1992). Let U denote the universe (a finite ordinary set),
clustering both involve categorization of objects. Clas- and let R be an equivalence (indiscernibility) relation on
sification processes a previously known categorization U. The pair A = (U , R) is called an approximation space.
of objects from a training sample so that it can be The equivalence relation R partitions the set U into
applied to other objects whose categorization is un- disjoint subsets. Such a partition of the universe is
known. This process is called supervised learning.
denoted by U / R = {E1 , E 2 , L , Em } , where Ei is an
Clustering groups objects with similar characteristics.
As opposed to classification, the grouping process in equivalence class of R. If two elements u, v U belong
clustering is unsupervised. The actual categorization of to the same equivalence class, we say that u and v are
objects, even for a sample, is unknown. Clustering is an indistinguishable. The equivalence classes of R are called
important step in establishing object profiles. the elementary or atomic sets in the approximation
Clustering in data mining faces several additional space A = (U , R ) . Because it is not possible to differen-
challenges compared to conventional clustering appli- tiate the elements within the same equivalence class,
cations (Krishnapuram, Joshi, Nasraoui, & Yi, 2001). one may not be able to obtain a precise representation
The clusters tend to have fuzzy boundaries. There is a
for an arbitrary set X U in terms of elementary sets
likelihood that an object may be a candidate for more
than one cluster. Krishnapuram et al. argued that clus- in A. Instead, its lower and upper bounds may represent
tering operations in data mining involve modeling an the set X. The lower bound A( X ) is the union of all the
unknown number of overlapping sets. This article de- elementary sets, which are subsets of X. The upper
scribes fuzzy and interval set clustering as alternatives bound A( X ) is the union of all the elementary sets that
to conventional crisp clustering.
have a nonempty intersection with X. The pair
( A( X ), A( X )) is the representation of an ordinary set X
BACKGROUND in the approximation space A = (U , R ) , or simply the
rough set of X. The elements in the lower bound of X
Conventional clustering assigns various objects to pre- definitely belong to X, while elements in the upper
cisely one cluster. Figure 1 shows a possible clustering bound of X may or may not belong to X. The pair
of 12 objects. In order to assign all the objects to

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Interval Set Representations of Clusters

Figure 1. Conventional clustering

Cluster C

Cluster B
Cluster A

Figure 2. Rough set approximation

An equivalence class

Lower approximation

Actual set

Upper approximation

( A( X ), A( X )) also provides a set theoretic interval for sists of one gene per object. The gene for an object is a
string of bits that describes which lower and upper
the set X. Figure 2 illustrates the lower and upper
approximations the object belongs to. The string for a
approximation of a set.
gene can be partitioned into two parts, lower and upper,
as shown in Figure 4 for three clusters. Both lower and
upper parts of the string consist of three bits each. The
MAIN THRUST ith bit in the lower/upper string tells whether the object
is in the lower/upper approximation of the ith cluster.
Interval Set Clustering Figure 4 shows examples of all the valid genes for three
clusters. An object represented by g1 belongs to the
Rough sets were originally used for supervised learning. upper bounds of the first and second clusters. An object
There are an increasing number of research efforts on represented by g6 belongs to the lower and upper bounds
clustering in relation to rough-set theory (do Prado, of the second cluster. Any other value not given by g1 to
Engel, & Filho, 2002; Hirano & Tsumoto, 2003; Peters, g7 is not valid. The objective of the genetic algorithms
Skowron, Suraj, Rzasa, & Borkowski, 2002). Lingras (GAs) is to minimize the within-group-error. Lingras
(2001) developed rough-set representation of clusters. provided a formulation of within-group-error for rough-
Figure 3 shows how the 12 objects from Figure 1 could set based clustering. The resulting GAs were used to
be clustered by using rough sets. Instead of Object 9 evolve interval clustering of highway sections. Lingras
belonging to any one cluster, it belongs to the upper (2002) applied the unsupervised rough-set clustering
bounds of Clusters B and C. Similarly, Object 4 belongs based on GAs for grouping Web users. However, the
to the upper bounds of Clusters A and B. clustering process based on GAs seemed
Lingras (2001; Lingras & West, 2004; Lingras, computationally expensive for scaling to larger datasets.
Hogo, & Snorek, 2004) proposed three different ap- The K-means algorithm is one of the most popular
proaches for unsupervised creation of rough or interval statistical techniques for conventional clustering
set representations of clusters: evolutionary, statisti- (Hartigan & Wong, 1979). Lingras and West (2004)
cal, and neural. Lingras (2001) described how a rough- provided a theoretical and experimental analysis of a
set theoretic clustering scheme could be represented by modified K-means clustering based on the properties of
using a rough-set genome. The rough-set genome con- rough sets. It was used to create interval set representa-
660

TEAM LinG
Interval Set Representations of Clusters

Figure 3. Rough set based clustering


Upper bound of Group C I
12
11

10
Lower bound of Group A Boundary area of Groups B and C
9

1
Lower bound of Group B
3 8 7
4
2 5 6

Upper bound of Group A Upper bound of Group B


Boundary area of Groups A and B

Figure 4. Rough set gene d ( v, x i ) = min d ( v, x j )


Let 1 j k
and
T = { j : d ( v, x i ) / d ( v, x j ) threshold and i j}.

1. If T , v exists in the upper bounds of clusters i


and all the clusters j T .
2. Otherwise, v exists in the lower and upper bounds
of cluster i. Moreover, v does not belong to any
other lower or upper bounds.

It was shown (Lingras & West, 2004) that the above


method of assignment preserves important rough-set
properties (Pawlak, 1992; Skowron, Stepaniuk, & Peters,
2003; Szmigielski & Polkowski, 2003; Yao, 2001).

Fuzzy Set Clustering

A fuzzy generalization of clustering uses a fuzzy member-


ship function to describe the degree of membership
tions of clusters of Web users as well as a large set of (between [0, 1]) of an object in a given cluster. The sum
supermarket customers. The Kohonen (1988) neural of fuzzy memberships of an object to all clusters must be
network or self-organizing map is another popular clus- equal to 1.
tering technique. It is desirable in some applications due Fuzzy C-means (FCMs) is a landmark algorithm in
to its adaptive capabilities. Lingras et al. (2004) intro- the area of fuzzy C-partition clustering. It was first
duced interval set clustering, using a modification of the proposed by Bezdek (1981). Rhee and Hwang (2001)
Kohonen self-organizing maps, based on rough-set proposed a type-2 fuzzy C-means algorithm to solve
theory. Both the modified statistical- and neural-net- membership typicality. They point out that because the
work-based approaches used the following principle for memberships generated are relative numbers, they may
assigning objects to lower and upper bounds of clusters. not be suitable for applications in which the member-
For each object, v, let d ( v, x i ) be the distance between ships are supposed to represent typicality. Moreover,
v and the centroid of cluster i. The ratio d ( v, x i ) / d ( v, x j ) , the conventional fuzzy C-means process suffers from
noisy data, that is, when a noise point is located far from
1 j k , is used to determine the membership of v.
all the clusters, an undesirable clustering result may

661

TEAM LinG
Interval Set Representations of Clusters

occur. The algorithm of Rhee and Hwang is based on the article briefly describes several approaches to creating
fact that when updating the cluster centers, higher mem- interval and fuzzy set representations of clusters. The
bership values should contribute more than memberships interval set clustering is based on the theory of rough
with smaller values. sets. Changes in clusters can provide important clues
Rough-set theory and fuzzy-set theory complement about the changing nature of the usage of a facility as
each other (Shi, Shen, & Liu, 2003). It is possible to create well as the changing nature of its users. Use of fuzzy and
interval clusters based on the fuzzy memberships ob- interval set clustering also adds an interesting dimen-
tained by using the fuzzy C-means algorithm described in sion to cluster migration studies. Due to rough bound-
the previous section. Let 1 0 . An object v aries of interval clusters, it may be possible to get early
belongs in the lower bound of cluster i if its membership warnings of potential significant changes in clustering
in the cluster is more than . Similarly, if its membership patterns.
in cluster i is greater than , the object belongs in the
upper bound of cluster i. Because 1 0 , if an REFERENCES
object belongs to the lower bound of a cluster, it will also
belong to its upper bound. Lingras and Yan (2004) de- Bezdek, J. C. (1981). Pattern recognition with fuzzy
scribe further conditions on and that ensure satis- objective function. New York: Plenum Press.
faction of other important properties of rough sets.
Do Prado, H. A., Engel, P. M., & Filho, H. C. (2002).
Rough clustering: An alternative to finding meaningful
clusters by using the reducts from a dataset. In J. Alpigini,
FUTURE TRENDS J. F. Peters, A. Skowron, & N. Zhong, (Eds.), Proceed-
ings of the Symposium on Rough Sets and Current
Temporal data mining is an application of data-mining Trends in Computing: Vol. 2475. Springer-Verlag.
techniques to the data that takes the time dimension into
account. Temporal data mining is assuming increasing Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS136:
importance. Much of the temporal data-mining tasks are A k-means clustering algorithm. Applied Statistics, 28,
related to the use and analysis of temporal sequences of 100-108.
raw data. There is little work that analyzes the results of
data mining over a period of time. Changes in cluster Hirano, S., & Tsumoto, S. (2003). Dealing with rela-
characteristics of objects, such as supermarket customers tively proximity by rough clustering. Proceedings of
or Web users, over a period of time can be useful in data the 22nd International Conference of the North Ameri-
mining. Such an analysis can be useful for formulating can Fuzzy Information Processing Society (pp. 260-265).
marketing strategies. Marketing managers may want to Kohonen, T. (1988). Self-organization and associa-
focus on specific groups of customers. Therefore, they tive memory. Berlin: Springer-Verlag.
may need to understand the migrations of the customers
from one group to another group. The marketing strate- Krishnapuram, R., Joshi, A., Nasraoui, O., & Yi, L.
gies may depend on desirability of these cluster migra- (2001). Low-complexity fuzzy relational clustering al-
tions. The overlapping clusters created by using interval gorithms for Web mining. IEEE Transactions on Fuzzy
or fuzzy set clustering can be especially useful in such Systems, 9(4), 595-607.
studies. Overlapping clusters make it possible for an Lingras, P. (2001). Unsupervised rough set classifica-
object to transition from the core of a cluster to the core tion using GAs. Journal of Intelligent Information
of another cluster through the overlapping region. When Systems, 16(3), 215-228.
an object moves from the core of a cluster to an overlap-
ping region, it may be possible to provide an early warning Lingras, P. (2002). Rough set clustering for Web min-
that can trigger an appropriate marketing campaign. ing. Proceedings of the IEEE International Confer-
ence on Fuzzy Systems.
Lingras, P., Hogo, M., & Snorek, M. (2004). Interval set
CONCLUSION clustering of Web users using modified Kohonen self-
organizing maps based on the properties of rough sets.
Clusters in data mining tend to have fuzzy and rough Web Intelligence and Agent Systems: An International
boundaries. An object may potentially belong to more Journal.
than one cluster. Interval or fuzzy set representation
enables modeling of such overlapping clusters. This

662

TEAM LinG
Interval Set Representations of Clusters

Lingras, P., & West, C. (2004). Interval set clustering of KEY TERMS
Web users with rough k-means. Journal of Intelligent I
Information Systems, 23(1), 5-16. Clustering: A form of unsupervised learning that
Lingras, P., & Yan, R. (2004). Interval clustering using divides a data set so that records with similar content are
fuzzy and rough set theory. Proceedings of the 23rd in the same group and groups are as different from each
International Conference of the North American Fuzzy other as possible.
Information Processing Society (pp. 780-784), Canada. Evolutionary Computation: A solution approach
Pawlak, Z. (1992). Rough sets: Theoretical aspects of guided by biological evolution that begins with potential
reasoning about data. New York: Kluwer Academic. solution models, then iteratively applies algorithms to
find the fittest models from the set to serve as inputs to
Peters, J. F., Skowron, A., Suraj, Z., Rzasa, W., & the next iteration, ultimately leading to a model that best
Borkowski, M. (2002). Clustering: A rough set ap- represents the data.
proach to constructing information granules, soft com-
puting and distributed processing. Proceedings of the Fuzzy C-Means Algorithms: Clustering algorithms
Sixth International Conference (pp. 57-61). that assign a fuzzy membership in various clusters to an
object instead of assigning the object precisely to a
Rhee, F. C. H., & Hwang, C. (2001). A type-2 fuzzy c- cluster.
means clustering algorithm. Proceedings of the IFSA
World Congress and the 20th International Confer- Fuzzy Membership: Instead of specifying whether
ence of the North American Fuzzy Information Pro- an object precisely belongs to a set, fuzzy membership
cessing Society (Vol. 4, pp. 1926-1929). specifies a degree of membership between [0,1].

Shi, H., Shen, Y., & Liu, Z. (2003). Hyperspectral bands Interval Set Clustering Algorithms: Clustering
reduction based on rough sets and fuzzy C-means clus- algorithms that assign objects to lower and upper bounds
tering. Proceedings of the 20th IEEE Instrumentation of a cluster, making it possible for an object to belong
and Measurement Technology Conference (Vol. 2, pp. to more than one cluster.
1053-1056). Interval Sets: If a set cannot be precisely defined,
Skowron, A., Stepaniuk, J., & Peters, J. F. (2003). one can describe it in terms of a lower bound and an
Rough sets and infomorphisms: Towards approximation upper bound. The set will contain its lower bound and
of relations in distributed environments. Fundamental will be contained in its upper bound.
Informatica, 54(2-3), 263-277. Rough Sets: Rough sets are special types of interval
Szmigielski, A., & Polkowski, L. (2003). Computing sets created by using equivalence relations.
from words via rough mereology in mobile robot navi- Self-Organization: A system structure that often
gation. Proceedings of the IEE/RSJ International Con- appears without explicit pressure or involvement from
ference on Intelligent Robots and Systems (Vol. 4, pp. outside the system.
3498-3503).
Supervised Learning: A learning process in which
Yao, Y. Y. (2001). Information granulation and rough set the exemplar set consists of pairs of inputs and desired
approximation. International Journal of Intelligent Sys- outputs. The process learns to produce the desired
tems, 16(1), 87-104. outputs from the given inputs.
Temporal Data Mining: An application of data-
mining techniques to the data that takes the time dimen-
sion into account.
Unsupervised Learning: Learning in the absence
of external information on outputs.

663

TEAM LinG
664

Kernel Methods in Chemoinformatics


Huma Lodhi
Imperial College London, UK

INTRODUCTION KMs are new to such tasks. They have been applied for
applications in chemoinformatics since the late 1990s.
Millions of people are suffering from fatal diseases KMs and their utility for applications in
such as cancer, AIDS, and many other bacterial and viral chemoinformatics are the focus of the research pre-
illnesses. The key issue is now how to design lifesaving sented in this article. These methods possess special
and cost-effective drugs so that the diseases can be characteristics that make them very attractive for tasks
cured and prevented. It would also enable the provision such as the induction of SAR/QSAR. KMs such as SVMs
of medicines in developing countries, where approxi- map the data into some higher dimensional feature
mately 80% of the world population lives. Drug design space and train a linear predictor in this higher dimen-
is a discipline of extreme importance in sional space. The kernel trick offers an effective way to
chemoinformatics. Structure-activity relationship (SAR) construct such a predictor by providing an efficient
and quantitative SAR (QSAR) are key drug discovery method of computing the inner product between mapped
tasks. instances in the feature space. One does not need to
During recent years great interest has been shown in represent the instances explicitly in the feature space.
kernel methods (KMs) that give state-of-the-art perfor- The kernel function computes the inner product by
mance. The support vector machine (SVM) (Vapnik, implicitly mapping the instances to the feature space.
1995; Cristianini and Shawe-Taylor, 2000) is a well- These methods can handle very high-dimensional noisy
known example. The building block of these methods is data and can avoid overfitting. SVMs suffer from a
an entity known as the kernel. The nondependence of drawback that is the difficulty of interpretation of the
KMs on the dimensionality of the feature space and the models for nonlinear kernel functions.
flexibility of using any kernel function make them an
optimal choice for different tasks, especially modeling
SAR relationships and predicting biological activity or MAIN THRUST
toxicity of compounds. KMs have been successfully
applied for classification and regression in pharmaceu- I now present basic principles for the construction of
tical data analysis and drug design. SVMs and also explore empirical findings.

Support Vector Machines and Kernels


BACKGROUND
The support vector machine was proposed in 1992
Recently, I have seen a swift increase in the interest and (Boser, Guyon, & Vapnik, 1992). A detailed analysis of
development of data mining and learning methods for SVMs can be found in Vapnik (1995) and Cristianini and
problems in chemoinformatics. There exist a number of Shawe-Taylor (2000). An SVM works by embedding the
challenges for these techniques for chemometric prob- input data, d i ,K , d n , into a Hilbert space through a non-
lems. High dimensionality is one of them. There are
datasets with a large number of dimensions and a few linear mapping, , and constructing the linear function
data points. Label and feature noise are other important in this space. Mapping may not be known explicitly
problems. Despite these challenges, learning methods but be accessed via the kernel function described in the
are a common choice for applications in the domain of later section on kernels. The kernel function returns the
chemistry. This is due to automated building of predic-
tors that have strong theoretical foundations. Learning inner product between the mapped instances d i and d j in
techniques, including neural networks, decision trees, a higher dimensional space that is for any mapping
inductive logic programming, and kernel methods such
: D F , k (d i , d j ) = (d i ), (d j ) . I now briefly de-
as SVMs, kernel principal component analysis, and ker-
nel partial least square, have been applied to chemometric scribe SVMs for classification, regression, and kernel
problems with great success. Among these methods, functions.

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Kernel Methods in Chemoinformatics

Support Vector Classification n

satisfies positive (semi)definiteness, a a k (d d ) 0


i , j =1
i j i j K
The support vector machine for classification (SVC)
(Vapnik, 1995) is based on the idea of constructing the for ai , a j R . The n n matrix with entries of the form
maximal margin hyperplane in feature space. This unique
hyperplane separates the data into two categories with K ij = k (d i , d j ) is known as the kernel matrix, or the
maximum margin (hard margin). The maximal margin Gram matrix, that is a symmetric positive definite ma-
hyperplane fails to generalize well when there is a high trix.
level of noise in the data. In order to handle noise, data Linear, polynomial, and Gaussian Radial Basis Func-
margin errors are allowed, hence achieving a soft mar- tion (RBF) kernels are well-known examples of general
gin instead of a hard margin (no margin errors). The purpose kernel functions. Linear kernel function is
generalization performance is improved by maintaining given by k linear (d i , d j ) = k (d i , d j ) = d id j . Given a kernel k ,
the right balance between the margin maximization and
the polynomial construction is given by
error. To find a hyperplane, one has to solve a convex
k poly (d i , d j ) = (k (d i , d j ) + c ) . Here, p is a positive inte-
p
quadratic optimization problem. The support vector clas-
sifier is constructed by using only the inner products ger, and c is a nonnegative constant. Clearly, this incurs
between the mapped instances. The classification func- a small computational cost to define a new feature
tion for a new example d is given by space. The feature space corresponding to a degree p
n polynomial kernel includes all products of at most p
f = sgn i ci k (d i , d ) + b , where i are Lagrange mul- input features. Note that for p = 1 , I get the linear
i =1
construction. Furthermore, RBF kernel defines a fea-
tiplies, ci { 1,+1} are categories, and b R . ture space with an infinite number of dimensions. Given
a set of instances, the RBF kernel is given by
Support Vector Regression
2
d d
Support vector machines for regression (SVR) (Vapnik,
K RBF (d i , d j ) = exp
i j

1995) inherit all the main properties that characterize 2 2 .


SVC. SVR embeds the input data into a Hilbert space
through a nonlinear mapping and constructs a linear
regression function in this space. In order to apply New kernels can be designed by keeping in mind that
support vector technique to regression, a reasonable kernel functions are closed under addition and multipli-
loss function is used. Vapniks insensitive loss func- cation. Kernel functions can be defined over general
tion is a popular choice that is defined by sets (Watkins, 2000; Haussler, 1999). This important
fact has allowed successful exploration of novel kernels
c f (d ) = max (0, c f (d ) ) , where c R . This loss for discrete spaces such as strings, graphs, and trees
function allows errors below some > 0 and controls (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins,
the width of insensitive band. Regression estimation is 2002; Kashima, Tsuda, & Inokuchi, 2003; Mahe, Ueda,
performed by solving an optimization problem. The Akutsu, Perret, & Vert, 2004).
corresponding regression function f is given by
n
Applications
f = (~i i )k (d i , d ) + b , where ~i , i are Lagrange
i =1 Given that I have presented basic concepts of SVMs, I
multipliers, and b R . now describe the applications of SVMs in chemical
domains.
Kernels SAR/QSAR analysis plays a crucial role in the design
and development of drugs. It is based on the assumption
A kernel function calculates an inner product between that the chemical structure and activity of compounds
mapped instances in a feature space and allows implicit are correlated. The aim of mining molecular databases
feature extraction. The mathematical foundation of such is to select a set of important compounds, hence form-
a function was established during the first decade of the ing a small collection of useful molecules. The predic-
20th century (Mercer, 1909). A kernel function is a tion of a new compound with low error probability is an
important factor, as the false prediction can be costly
symmetric function k (d i , d j ) = k (d j , d i ) for all d i , d j and

665

TEAM LinG
Kernel Methods in Chemoinformatics

and result in a loss of information. Furthermore, accurate noncongeneric datasets (Helma, Cramer, Kramer, & De
prediction can speed up the drug design process. Raedt, in press). The performance of SVC is compared
In order to apply learning techniques such as KMs to with decision trees C4.5 and a rule learner. SVC and rule
chemometric data, the compounds are transformed into a learner showed excellent results.
form amenable to these techniques. Modeling SAR/QSAR In chemoinformatics, extracted features play a key
analysis can be viewed as comprising two stages. In the role in SAR/QSAR analysis. The feature extraction
first stage, descriptors (features) are extracted or com- module is constructed in such a way so as the loss of
puted, and molecules are transformed into vectors. The information is at a minimum. This can make feature
dimensionality of the vector space can be very high, extraction tasks as complex and expensive as solving
where, generally, molecular datasets comprise several the entire problem. Kernel methods are an effective
tens to hundred compounds. In the second stage, an alternative to explicit feature extraction. Kernels that
induction algorithm (SVC or SVR) is applied to learn an transform the compounds into feature vectors without
SAR/QSAR model. The similarity between two com- explicitly representing them can be constructed. This
pounds is measured by the inner product between two can be achieved by viewing compounds as graphs
vectors. The similarity between compounds is inversely (Kashima, Tsuda, & Inokuchi, 2003; Mahe, Ueda, Akutsu,
proportional to the cosine of the angle between the Perret, & Vert, 2004). In this way, kernel methods can be
vectors. It is worth noting that kernel methods can be utilized to generate or extract feature for SAR/QSAR
applied not only for the induction of model but also for analysis. Kashima et al. proposed a kernel function that
the feature extraction through specialized kernels. computes the similarity between compounds (labeled
I now describe the application of kernel methods to graphs). The compounds are implicitly transformed into
inducing structure activity relationship models. I first focus on the classification task for chemometric data. Trotter, Buxton, and Holden (2001) studied the efficacy of an SVC in separating the compounds that will cross the blood/brain barrier from those that will not. The SVC substantially outperformed other machine-learning techniques, including neural networks and decision trees. In another study, a classification problem was formulated in order to predict the inhibition of dihydrofolate reductase by pyrimidines (Burbidge, Trotter, Holden, & Buxton, 2001). An SVC in conjunction with an RBF kernel was used to conduct the experiments. The experimental results show that the SVC outperforms the other classification techniques, including artificial neural networks, the C5.0 decision tree, and nearest neighbor.

Structural feature extraction is an important problem in chemoinformatics. Structural feature extraction refers to the problem of computing the features that describe the chemical structure. In order to compute structural features, Kramer, Frank, and Helma (2002) viewed compounds as labeled graphs where vertices depict atoms and edges describe bonds between them. The method performs automated construction of structural features of two-dimensionally represented compounds. Features are constructed by retrieving sequences of linearly connected atoms and bonds on the basis of a statistical criterion. The authors argue that SVMs' ability to handle high-dimensional data makes them attractive in this scenario, as the dimensionality of the feature space may be very large. The authors applied an SVC to predict the mutagenicity and carcinogenicity of compounds with promising results. In another study, an SVC in conjunction with structural features was applied to model mutagenicity structure activity relationships from noncongeneric compounds (Helma, Cramer, Kramer, & De Raedt, in press).

Kashima, Tsuda, and Inokuchi (2003) proposed graph kernels in which compounds are viewed as labeled graphs and mapped to feature vectors, where each entry of the feature vector represents the number of label paths. Label paths are generated by random walks on graphs. In order to perform binary classification of compounds, a voted kernel perceptron (Freund & Schapire, 1999) is employed, with promising results. Mahe et al. (2004) have improved these graph kernels in terms of efficiency (computational time) and classification accuracy. An SVC in conjunction with graph kernels is used to predict the mutagenicity of aromatic and heteroaromatic nitro compounds. The SVC in conjunction with graph kernels validates the efficacy of KMs for feature extraction and induction of SAR models.

I now focus on the regression task for chemometric data. In the chemical domain, datasets are generally characterized by high dimensionality and few data points. Partial least squares (PLS), a regression method, is very useful for such scenarios as compared to linear least squares regression. Demiriz, Bennett, Breneman, and Embrechts (2001) have applied an SVR for QSAR analysis. The authors performed feature selection by removing high-variance variables and induced a model by using a linear programming SVR. The experimental results show that SVR in conjunction with an RBF kernel outperforms PLS for chemometric datasets. PLS's ability to handle high-dimensional data makes it a natural candidate for combination with KMs. There exist two techniques to kernelize PLS (Bennett & Embrechts, 2003; Rosipal, Trejo, Matthews, & Wheeler, 2003). One technique is based on mapping the data into a higher dimensional space and constructing a linear regression function in that space (Rosipal, Trejo, Matthews, & Wheeler, 2003). Alternatively, PLS is kernelized by obtaining a low-rank approximation of the kernel matrix and computing a regression function based
on this approximation (Bennett & Embrechts). Kernel partial least squares is used to predict the binding affinities of molecules to human serum albumin. Experimental results show that kernelized versions of PLS outperform other methods, including PLS and SVR (Bennett & Embrechts). Different principal component techniques, including the power method, singular value decomposition, and eigenvalue decomposition, have also been kernelized to show the efficiency of KMs for datasets comprising few points and a large number of variables (Wu, Massart, & de Jong, 1997). I conclude this section with a case study.

Gram-Schmidt Kernels for Prediction of Activity of Compounds

As an example, I now present the results of a novel approach for modeling structure-activity relationships (Lodhi & Guo, 2002) based on the use of a special kernel, namely the Gram-Schmidt kernel (GSK) (Cristianini, Shawe-Taylor, & Lodhi, 2002). The kernel efficiently performs Gram-Schmidt orthogonalisation in a kernel-induced feature space. It is based on the idea of building a more informative kernel matrix than the Gram matrices constructed by standard kernels. An SVC in conjunction with the GSK is used to perform QSAR analysis. The learning process of an SVC in conjunction with the GSK comprises two stages. In the first stage, highly informative features are extracted in a kernel-induced feature space. In the next stage, a soft margin classifier is trained. In order to perform the analysis, the GSK algorithm requires a set of compounds, an underlying kernel function, and a number T, which specifies the dimensionality of the feature space. For the underlying kernel function, an RBF kernel is employed. The GSK in conjunction with an SVC is applied on a benchmark dataset (King, Muggleton, Lewis, & Sternberg, 1992) to predict the inhibition of dihydrofolate reductase by pyrimidines. The dataset contains 55 compounds that are divided into 5-fold cross-validation series. For each drug there are three positions of possible substitution, and the number of attributes for each substitution is nine. The experimental results show that the SVC in conjunction with the GSK achieves a lower classification error than the best reported results (Burbidge, Trotter, Holden, & Buxton, 2001). Burbidge et al. performed SAR analysis on this dataset by using a number of learning techniques, with an SVC in conjunction with RBF kernels achieving a classification error of 0.1269. An SVC in conjunction with the GSK improved on these results, achieving a classification error of 0.1120 (Lodhi & Guo, 2002).

FUTURE TRENDS

Researchers using kernel methods have so far addressed one principal area in chemoinformatics; however, there exist other important application areas, such as structure elucidation, that require the utilization of such methods. There is a need to develop new methods for challenging tasks that are difficult to solve with existing methods. Interest in and development of KMs for applications in chemoinformatics are swiftly increasing, and I believe they will lead to progress in both fields.

CONCLUSION

Kernel methods have strong theoretical foundations. These methods combine the principles of statistical learning theory and functional analysis. SVMs and other kernel methods have been applied in different domains, including text mining, face recognition, protein homology detection, and analysis and classification of gene expression data, among many others, with great success. They have shown impressive performance in chemoinformatics. I believe that the growing popularity of machine-learning techniques, especially kernel methods, in chemoinformatics will lead to significant development in both disciplines.

REFERENCES

Bennett, K. P., & Embrechts, M. J. (2003). Advances in learning theory: Methods, models and applications. NATO Science Series III: Computer & Systems Science, 190, 227-250.

Boser, B. E., Guyon, I. M., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory (pp. 144-152).

Burbidge, R., Trotter, M., Holden, S., & Buxton, B. (2001). Drug design by machine learning: Support vector machines for pharmaceutical data. Computers and Chemistry, 26(1), 4-15.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge, UK: Cambridge University Press.

Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2002). Latent semantic kernels. Journal of Intelligent Information Systems, 18(2/3), 127-152.

Demiriz, A., Bennett, K. P., Breneman, C. M., & Embrechts, M. J. (2001). Support vector machine regression in chemometrics. Computing Science and Statistics.

Freund, Y., & Schapire, R. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277-296.

Haussler, D. (1999). Convolution kernels on discrete structures (Tech. Rep. No. UCSC-CRL-99-10). Santa Cruz: University of California, Computer Science Department.

Helma, C., Cramer, T., Kramer, S., & De Raedt, L. (in press). Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of Chemical Information and Computer Sciences.

Kashima, H., Tsuda, K., & Inokuchi, A. (2003). Marginalized kernels between labeled graphs. Proceedings of the 20th International Conference on Machine Learning.

King, R. D., Muggleton, S., Lewis, R. A., & Sternberg, M. J. E. (1992). Drug design by machine learning: The use of inductive logic programming to model the structure activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proceedings of the National Academy of Sciences, USA, 89 (pp. 11322-11326).

Kramer, S., Frank, E., & Helma, C. (2002). Fragment generation and support vector machines for inducing SARs. SAR and QSAR in Environmental Research, 13(5), 509-523.

Lodhi, H., & Guo, Y. (2002). Gram-Schmidt kernels applied to structure activity analysis for drug design. Proceedings of the Second ACM SIGKDD Workshop on Data Mining in Bioinformatics (pp. 37-42).

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419-444.

Mahe, P., Ueda, N., Akutsu, T., Perret, J.-L., & Vert, J.-P. (2004). Extension of marginalized graph kernels. Proceedings of the 21st International Conference on Machine Learning.

Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London (A), 209, 415-446.

Rosipal, R., Trejo, L., Matthews, B., & Wheeler, K. (2003). Nonlinear kernel-based chemometric tools: A machine learning approach. Proceedings of the Third International Symposium on PLS and Related Methods (pp. 249-260).

Trotter, M., Buxton, B., & Holden, S. (2001). Support vector machines in combinatorial chemistry. Measurement and Control, 34(8), 235-239.

Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag.

Watkins, C. (2000). Dynamic alignment kernels. In P. J. Bartlett, B. Schölkopf, D. Schuurmans, & A. J. Smola (Eds.), Advances in large-margin classifiers (pp. 39-50). Cambridge, MA: MIT Press.

Wu, W., Massart, D. L., & de Jong, S. (1997). The kernel PCA algorithm for wide data. Part I: Theory and algorithms. Chemometrics and Intelligent Laboratory Systems, 36, 165-172.

KEY TERMS

Chemoinformatics: Storage, analysis, and drawing of inferences from chemical information (obtained from chemical data) by using computational methods for drug discovery.

Kernel Function: A function that computes the inner product between mapped instances in a feature space. It is a symmetric, positive definite function.

Kernel Matrix: A matrix that contains almost all the information required by kernel methods. It is obtained by computing the inner products between n instances.

Machine Learning: A discipline that comprises the study of how machines learn from experience.

Margin: A real-valued function. The sign and magnitude of the margin give insight into the prediction for an instance. A positive margin indicates a correct prediction, whereas a negative margin shows an incorrect prediction.

Quantitative Structure-Activity Relationship (QSAR): Illustrates quantitative relationships between chemical structures and the biological and pharmacological activity of chemical compounds.

Support Vector Machines (SVMs): SVMs (implicitly) map input examples into a higher dimensional feature space via a kernel function and construct a linear function in this space.
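As a concrete companion to the Kernel Function and Kernel Matrix entries above, the following minimal Python sketch (the library, the toy descriptor vectors, and the gamma value are our own illustrative choices, not part of the original article) computes an RBF kernel, builds the Gram matrix for a handful of hypothetical compounds, and feeds it to a support vector classifier with a precomputed kernel.

import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x, z, gamma=0.5):
    # k(x, z) = exp(-gamma * ||x - z||^2): a symmetric, positive definite kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Toy descriptor vectors for four hypothetical compounds (illustrative only)
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
y = np.array([1, 1, -1, -1])          # 1 = active, -1 = inactive

# Kernel (Gram) matrix: inner products between all pairs of mapped instances
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

clf = SVC(kernel="precomputed").fit(K, y)
x_new = np.array([0.15, 0.85])
k_new = np.array([[rbf_kernel(x_new, b) for b in X]])
print(clf.predict(k_new))             # predicted activity of the new compound

The same pattern applies to the kernels discussed in the article: as long as a valid kernel between compounds is available (RBF, string, graph, or Gram-Schmidt), the classifier only ever sees the kernel matrix, never an explicit feature representation.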
Knowledge Discovery with Artificial Neural Networks

Juan R. Rabuñal Dopico
University of A Coruña, Spain

Daniel Rivero Cebrián
University of A Coruña, Spain

Julián Dorado de la Calle
University of A Coruña, Spain

Nieves Pedreira Souto
University of A Coruña, Spain

INTRODUCTION

The world of Data Mining (Cios, Pedrycz & Swiniarski, 1998) is in constant expansion. New information is obtained from databases thanks to a wide range of techniques, each applicable to a particular set of domains and each with its own advantages and drawbacks. The Artificial Neural Networks (ANNs) technique (Haykin, 1999; McCulloch & Pitts, 1943; Orchard, 1993) allows us to resolve complex problems in many disciplines (classification, clustering, regression, etc.), and presents a series of advantages that turn it into a very powerful technique that is easily adapted to any environment. The main inconvenience of ANNs, however, is that they cannot explain what they learn or what reasoning was followed to obtain the outputs. This implies that they cannot be used in many environments in which this reasoning is essential.

This article presents a hybrid technique that not only benefits from the advantages of ANNs in the data-mining field, but also counteracts their inconveniences by using other knowledge extraction techniques. First we extract the requested information by applying an ANN; then we apply other Data Mining techniques to the ANN in order to explain the information that is contained inside the network. We thus obtain a two-level system that offers the advantages of ANNs and compensates for their shortcomings with other Data Mining techniques.

BACKGROUND

Ever since artificial intelligence appeared, ANNs have been widely studied. Their generalisation capacity and their inductive learning convert them into a very robust technique that can be used in almost any domain. An ANN is an information processing technique that is inspired by neuronal biology and consists of a large number of interconnected computational units (neurons), usually arranged in different layers. When an input vector is presented to the input neurons, this vector is propagated and processed in the network until it becomes an output vector in the output neurons. Figure 1 shows an example of a neural network with 3 inputs, 2 outputs, and three layers with 3, 4, and 2 neurons, respectively.

Figure 1. Example of Artificial Neural Network

ANNs have proven to be a very powerful tool in a lot of applications, but they present a big problem: their reasoning process cannot be explained; that is, there is no clear relationship between the inputs that are presented to the network and the outputs it produces. This means that ANNs cannot be used in certain domains, even though several approaches have attempted to explain their behaviour and thereby solve this problem.

The reasoning of ANNs, and the extraction of rules that explains their functioning, has been addressed in various ways. One of the first attempts established an equivalence between ANNs and fuzzy rules (Benítez, Castro, & Requena, 1997; Buckley, Hayashi, & Czogala, 1993; Jang & Sun, 1992), obtaining only theoretical solutions. Other works were based on the individual analysis of each
neuron in the network and its connection with the consecutive neurons. Towell and Shavlik (1994) in particular see the connections between neurons as rules, and Andrews and Geva (1994) use networks with functions that allow a clear identification of the dominant inputs. Other approaches are the RULENEG (Rule-Extraction from Neural Networks by Step-wise Negation) (Pop, Hayward, & Diederich, 1994) and TREPAN (Craven, 1996) algorithms. The first approach, however, modifies the training set and therefore loses the generalisation capacity of the ANNs. The TREPAN approach is similar to decision tree algorithms such as CART (Classification and Regression Trees) or C4.5, and turns the ANN into an M-of-N decision tree.

The DEDEC (Decision Detection) algorithm (Tickle, Andrews, Golea, & Diederich, 1998) extracts rules by finding the minimal information sufficient to distinguish, from the neural network's point of view, between a given pattern and all other patterns. The DEDEC algorithm uses the trained ANN to create examples from which rules can be extracted. Unlike other approaches, it also uses the weight vectors of the network to obtain an additional analysis that improves the extraction of rules. This information is then used to direct the strategy for generating a (minimal) set of examples for the learning phase. It also uses an efficient algorithm for the rule extraction phase. Based on these and other already mentioned techniques, Chalup, Hayward and Diederich (1998); Visser, Tickle, Hayward and Andrews (1996); and Tickle, Andrews, Golea and Diederich (1998) also presented their solutions.

The methods for the extraction of logical rules developed by Duch, Adamczak and Grabczewski (2001) are based on multilayer perceptron networks with the MLP2LN (Multi-Layer Perceptron converted to Logical Network) method and its constructive version C-MLP2LN. MLP2LN consists in taking an already trained multilayer perceptron and simplifying it in order to obtain a network with weights 0, +1 or -1. C-MLP2LN acts in a similar way. After this process, the dominant rules are easily extracted, and the weights of the input layer allow us to deduce which parameters are relevant.

More recently, genetic algorithms (GAs) have been used to discover rules in ANNs. Keedwell, Narayanan and Savic (2000) use a GA in which the chromosomes are rules based on value intervals or ranges applied to the inputs of the ANN. The values are obtained from the training patterns.

The most recent works in rule extraction from ANNs are presented by Rivero, Rabuñal, Dorado, Pazos and Pedreira (2004) and Rabuñal, Dorado, Pazos and Rivero (2003). They extract rules by applying a symbolic regression system, based on Genetic Programming (GP) (Engelbrecht, Rouwhorst & Schoeman, 2001; Koza, Keane, Streeter, Mydlowec, Yu, & Lanza, 2003; Wong & Leung, 2000), to a set of inputs / outputs produced by the ANN. The set of network inputs / produced outputs is dynamically modified, as explained later in this article.

ARCHITECTURE

This article presents a two-level architecture for the extraction of knowledge from databases. In the first level, we apply an ANN as the Data Mining technique; in the second level, we apply a knowledge extraction technique to this network.

Data Mining with ANNs

Artificial Neural Networks constitute a Data Mining technique that has been widely used for the extraction of knowledge from databases. Their training process is based on examples, and they present several advantages that other models do not offer:

- A high generalisation level. Once ANNs are trained with a training set, they produce outputs (close to the desired or supposed outputs) for inputs that were never presented to them before.
- A high error tolerance. Since ANNs are based on the successive and parallel interconnection of many processing elements (neurons), the output of the system is not significantly affected if one of them fails.
- A high noise tolerance.

All these advantages turn ANNs into an ideal technique for the extraction of knowledge in almost any domain. They are trained with many different training algorithms. The most famous one is the backpropagation algorithm (Rumelhart, Hinton & Williams, 1986), but many other training algorithms are applied according to the topology of the network and the use that is given to it. In recent years, Evolutionary Computation techniques such as Genetic Algorithms (Holland, 1975; Goldberg, 1989; Rabuñal, Dorado, Pazos, Gestal, Rivero & Pedreira, 2004a; Rabuñal, Dorado, Pazos, Pereira & Rivero, 2004b) have been gaining ground, because they correct the defects of other training algorithms, such as the tendency to fall into local minima or to overtrain the network. Even so, and in spite of these algorithms that train the network automatically (and even search for the topology of the network), ANNs present a series of defects that make them unusable in many application fields. As already said, their main defect is the fact that in general they are not interpretable: once an input is applied to the
network and an output is obtained, we cannot follow the steps of the reasoning that was used by the network to produce this output. This is why ANNs are not applicable in those domains in which we need to know why an output is the result of a certain input. The best example of this type of domain is the medical field: if we use an ANN to obtain the diagnosis for a certain disease, we need to know why that specific diagnosis or output value is produced.

The purpose of this article is to correct this defect through the use of another knowledge extraction technique. This technique is applied to the ANN in order to extract its internal knowledge and express it in the terms proper to the technique used (i.e., decision trees, semantic networks, IF-THEN-ELSE rules, etc.).

Extracting Knowledge from ANNs

The second level of the architecture is based on the application of a knowledge extraction technique to the ANN that was used for Data Mining. This technique depends on how we wish to express the internal knowledge of the network: if we want decision trees, we could use the CART algorithm; if we want rules, we could use Genetic Programming, and so on.

In general, the second knowledge extraction technique is applied to the network to explain its behaviour when it produces a set of outputs from a set of inputs, without considering its internal functioning. This means that the network is treated as a black box: we obtain information on its behaviour by generating a set of inputs / obtained outputs to which we apply the knowledge extraction technique. Obtaining this set is complicated, because it has to be representative enough that we can extract from it all the knowledge that the network really contains, not just part of it.

A first option could be to use the inputs of the training set that was used to create the network, and the outputs generated from these inputs by the network. This method supposes that the training set is already sufficiently ample and representative of the search space in which the network will be applied. However, our aim is not to explain the behaviour of the network only on those patterns on which it was trained or on patterns that are close (i.e., the search area in which the network was trained); we want to explain the functioning of the network as a whole, including in areas in which it was not trained.

We discard the option of elaborating an exhaustive input set by taking all the combinations of all the possible values of each network input with small intervals, because the combinatorial explosion would make this set too large. For instance, for a network with 5 inputs that each take values between 0 and 1, and with intervals of 0.1, we would obtain

((1 - 0) / 0.1 + 1)^5 = 11^5 = 161,051

patterns, an amount that is excessive for any Data Mining tool.

The problem of choosing the input set that will be applied to the network is solved with a hybrid solution: the inputs of the training set are multiplied by duplicating some elements of the set with certain modifications. These inputs are presented to the network so that for each of them an output vector is obtained, and in this way we shape a set of patterns that is applied to the selected algorithm for the extraction of knowledge.

It is important to note that the duplicated inputs are modified so that the knowledge extraction algorithm already starts to explore the areas that lie beyond those that were used to train the network. These first duplicated inputs are nevertheless insufficient to explore these areas, which is why the knowledge extraction algorithm is controlled by a system for the generation of new patterns.

This new pattern generation system takes those patterns whose adjustment is sufficiently good in the knowledge extraction algorithm, replaces them with new patterns that are created from existing ones (either those that will be replaced, or others from the training set), and applies to them modifications that create patterns in areas that were not explored before. When an area is already explored by the algorithm and is already represented (by trees, rules, etc.), it is deleted and the system continues exploring new areas.

The whole process is illustrated in Figure 2.

Figure 2. ANN knowledge extraction process (the diagram shows the pattern generation system feeding input/output sets to the knowledge extraction algorithm, which produces the extracted ANN knowledge and discards patterns with good adjustment)
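To make the two-level idea of Figure 2 more tangible, the following short Python sketch treats a trained network as a black box, queries it on the original inputs plus randomly perturbed duplicates, and fits an interpretable model to the produced outputs. The libraries, toy data, perturbation size, and the choice of a decision tree as the second-level extractor are our own simplifying assumptions; the architecture described in this article uses Genetic Programming and an adaptive pattern-replacement loop rather than this fixed one-pass procedure.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# First level: train an ANN on the available data (toy two-input problem)
X = rng.random((200, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)

# Duplicate some training inputs with small modifications so the second level
# also probes regions that lie beyond the original training patterns
duplicates = X[rng.choice(len(X), 200)] + rng.normal(0, 0.15, (200, 2))
queries = np.clip(np.vstack([X, duplicates]), 0.0, 1.0)

# Second level: the network is treated as a black box; its produced outputs
# become the targets of an interpretable model (here a decision tree)
targets = ann.predict(queries)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(queries, targets)
print(export_text(tree, feature_names=["x1", "x2"]))  # IF-THEN style view of the ANN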

In this way, we create a closed system in which the pattern set is constantly updated and the search continues for new areas that are not yet covered by the knowledge that we have of the network. A detailed description of the method and of its application to concrete problems can be found in Rabuñal, Dorado, Pazos and Rivero (2003); Rabuñal, Dorado, Pazos, Gestal, Rivero and Pedreira (2004a); and Rabuñal, Dorado, Pazos, Pereira and Rivero (2004b).

FUTURE TRENDS

The proposed architecture is based on the application of a knowledge extraction technique to an ANN, where the technique to be used depends on the type of knowledge that we wish to obtain. Since there is a great variety of techniques and algorithms that generate information of the same kind (IF-THEN rules, trees, etc.), we need to study them and carry out experiments to test their functioning in the proposed system, and in particular their adequacy for the new pattern generation system.

Also, this system for the generation of new patterns involves a large number of parameters, such as the percentage of pattern change, the number of new patterns, or the maximum error with which we consider that a pattern is represented by a rule. Since there are so many parameters, we need to study not only each parameter separately but also the influence of all the parameters on the final result.

CONCLUSION

This article proposes a system architecture that makes good use of the advantages of ANNs for Data Mining and avoids their inconveniences. On a first level, we apply an ANN to extract and model a set of data. The resulting model offers all the advantages of ANNs, such as noise tolerance and generalisation capacity. On a second level, we apply another knowledge extraction technique to the ANN, and thus obtain the knowledge of the ANN, which is the generalisation of the knowledge that was used for its learning; this knowledge is expressed in the shape that is decided by the user. It is obvious that the union of various techniques in a hybrid system conveys the series of advantages that are associated with them.

REFERENCES

Andrews, R., & Geva, S. (1994). Rule extraction from a constrained error backpropagation MLP. Proceedings of the Australian Conference on Neural Networks, Brisbane, Queensland (pp. 9-12).

Benítez, J.M., Castro, J.L., & Requena, I. (1997). Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8, 1156-1164.

Buckley, J.J., Hayashi, Y., & Czogala, E. (1993). On the equivalence of neural nets and fuzzy expert systems. Fuzzy Sets and Systems, 53, 129-134.

Chalup, S., Hayward, R., & Diederich, J. (1998). Rule extraction from artificial neural networks trained on elementary number classification task. Queensland University of Technology, Neurocomputing Research Centre. QUT NRC technical report.

Cios, K., Pedrycz, W., & Swiniarski, R. (1998). Data mining methods for knowledge discovery (1st ed.). Kluwer International Series in Engineering and Computer Science. Boston: Kluwer Academic Publishers.

Craven, M. W. (1996). Extracting comprehensible models from trained neural networks. PhD Thesis, University of Wisconsin, Madison.

Duch, W., Adamczak, R., & Grabczewski, K. (2001). A new methodology of extraction, optimisation and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 12, 277-306.

Engelbrecht, A.P., Rouwhorst, S.E., & Schoeman, L. (2001). A building block approach to genetic programming for rule discovery. In H. Abbass, R. Sarkar, & C. Newton (Eds.), Data mining: A heuristic approach (pp. 175-189). Hershey: Idea Group Publishing.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Reading, MA: Addison-Wesley.

Haykin, S. (1999). Neural networks (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.

Holland, J. H. (1975). Adaptation in natural and artificial systems. University of Michigan Press.

Jang, J., & Sun, C. (1992). Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks, 4, 156-158.

Keedwell, E., Narayanan, A., & Savic, D. (2000). Creating rules from trained neural networks using genetic algorithms. International Journal of Computers, Systems and Signals (IJCSS), 1, 30-42.

Koza, J. R., Keane, M. A., Streeter, M. J., Mydlowec, W., Yu, J., & Lanza, G. (Eds.). (2003). Genetic programming
IV: Routine human-competitive machine intelligence. Dordrecht, The Netherlands: Kluwer Academic Publishers.

McCulloch, W.S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.

Orchard, G. (Ed.) (1993). Neural computing: Research and applications. London: Institute of Physics Publishing.

Pop, E., Hayward, R., & Diederich, J. (1994). RULENEG: Extracting rules from a trained ANN by stepwise negation. Queensland University of Technology, Neurocomputing Research Centre. QUT NRC technical report.

Rabuñal, J.R., Dorado, J., Pazos, A., & Rivero, D. (2003). Rules and generalization capacity extraction from ANN with GP. Lecture Notes in Computer Science, 606-613.

Rabuñal, J.R., Dorado, J., Pazos, A., Gestal, M., Rivero, D., & Pedreira, N. (2004a). Search the optimal RANN architecture, reduce the training set and make the training process by a distributed genetic algorithm. Artificial Intelligence and Applications, 1, 415-420.

Rabuñal, J.R., Dorado, J., Pazos, A., Pereira, J., & Rivero, D. (2004b). A new approach to the extraction of ANN rules and to their generalization capacity through GP. Neural Computation, 16(7), 1483-1523.

Rivero, D., Rabuñal, J.R., Dorado, J., Pazos, A., & Pedreira, N. (2004). Extracting knowledge from databases and ANNs with genetic programming: Iris flower classification problem. Intelligent agents for data mining and information retrieval (pp. 136-152).

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Tickle, A.B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9, 1057-1068.

Towell, G., & Shavlik, J.W. (1994). Knowledge-based artificial neural networks. Artificial Intelligence, 70, 119-165.

Visser, U., Tickle, A., Hayward, R., & Andrews, R. (1996). Rule-extraction from trained neural networks: Different techniques for the determination of herbicides for the plant protection advisory system PRO_PLANT. Proceedings of the Rule Extraction from Trained Artificial Neural Networks Workshop (pp. 133-139), Brighton, UK.

Wong, M.L., & Leung, K.S. (2000). Data mining using grammar based genetic programming and applications. Series in Genetic Programming, 3, 232. Boston: Kluwer Academic Publishers.

KEY TERMS

Area of the Search Space: A set of specific ranges or values of the input variables that constitutes a subset of the search space.

Artificial Neural Networks: A network of many simple processors (units or neurons) that imitates a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing or medical diagnosis.

Backpropagation Algorithm: Learning algorithm of ANNs, based on minimising the error obtained from the comparison between the outputs that the network gives after the application of a set of network inputs and the outputs it should give (the desired outputs).

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns, relationships or obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping.

Evolutionary Computation: Solution approach guided by biological evolution, which begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data.

Knowledge Extraction: Explicitation of the internal knowledge of a system or set of data in a way that is easily interpretable by the user.

Rule Induction: Process of learning, from cases or instances, if-then rule relationships that consist of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

Search Space: The set of all possible situations in which the problem that we want to solve could ever be.
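As a small companion to the Backpropagation Algorithm entry above, the following NumPy sketch performs gradient-descent training of a single-hidden-layer network with a squared-error loss. The 3-4-2 topology mirrors the shape of the network in Figure 1 of this article, but the data, learning rate, and the omission of bias terms are our own illustrative simplifications rather than anything prescribed by the article.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: 3 inputs and 2 outputs per pattern
X = rng.random((8, 3))
Y = rng.random((8, 2))

W1, W2 = rng.normal(0, 0.5, (3, 4)), rng.normal(0, 0.5, (4, 2))  # 3-4-2 topology
lr = 0.5
for _ in range(1000):
    # Forward pass: propagate the input vectors through the layers
    H = sigmoid(X @ W1)                      # hidden activations
    O = sigmoid(H @ W2)                      # network outputs
    # Backward pass: propagate the output error and update the weights
    dO = (O - Y) * O * (1 - O)               # error term at the output layer
    dH = (dO @ W2.T) * H * (1 - H)           # error term at the hidden layer
    W2 -= lr * H.T @ dO
    W1 -= lr * X.T @ dH
print("mean squared error:", np.mean((sigmoid(sigmoid(X @ W1) @ W2) - Y) ** 2))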

Learning Bayesian Networks


Marco F. Ramoni
Harvard Medical School, USA

Paola Sebastiani
Boston University School of Public Health, USA

INTRODUCTION

Born at the intersection of artificial intelligence, statistics, and probability, Bayesian networks (Pearl, 1988) are a representation formalism at the cutting edge of knowledge discovery and data mining (Heckerman, 1997). Bayesian networks belong to a more general class of models called probabilistic graphical models (Whittaker, 1990; Lauritzen, 1996) that arise from the combination of graph theory and probability theory, and their success rests on their ability to handle complex probabilistic models by decomposing them into smaller, amenable components. A probabilistic graphical model is defined by a graph, where nodes represent stochastic variables and arcs represent dependencies among such variables. These arcs are annotated by probability distributions shaping the interaction between the linked variables. A probabilistic graphical model is called a Bayesian network when the graph connecting its variables is a directed acyclic graph (DAG). This graph represents conditional independence assumptions that are used to factorize the joint probability distribution of the network variables, thus making the process of learning from a large database amenable to computations. A Bayesian network induced from data can be used to investigate distant relationships between variables, as well as for prediction and explanation, by computing the conditional probability distribution of one variable, given the values of some others.

BACKGROUND

The origins of Bayesian networks can be traced back as far as the early decades of the 20th century, when Sewall Wright developed path analysis to aid the study of genetic inheritance (Wright, 1923, 1934). In their current form, Bayesian networks were introduced in the early 1980s as a knowledge representation formalism to encode and use the information acquired from human experts in automated reasoning systems in order to perform diagnostic, predictive, and explanatory tasks (Charniak, 1991; Pearl, 1986, 1988). Their intuitive graphical nature and their principled probabilistic foundations were very attractive features to acquire and represent information burdened by uncertainty. The development of amenable algorithms to propagate probabilistic information through the graph (Lauritzen, 1988; Pearl, 1988) put Bayesian networks at the forefront of artificial intelligence research. Around the same time, the machine-learning community came to the realization that the sound probabilistic nature of Bayesian networks provided straightforward ways to learn them from data. As Bayesian networks encode assumptions of conditional independence, the first machine-learning approaches to Bayesian networks consisted of searching for conditional independence structures in the data and encoding them as a Bayesian network (Glymour, 1987; Pearl, 1988). Shortly thereafter, Cooper and Herskovitz (1992) introduced a Bayesian method that was further refined by Heckerman et al. (1995) to learn Bayesian networks from data.

These results spurred the interest of the data-mining and knowledge-discovery community in the unique features of Bayesian networks (Heckerman, 1997); that is, a highly symbolic formalism, originally developed to be used and understood by humans, well grounded on the sound foundations of statistics and probability theory, able to capture complex interaction mechanisms and to perform prediction and classification.

MAIN THRUST

A Bayesian network is a graph where nodes represent stochastic variables and (arrowhead) arcs represent dependencies among these variables. In the simplest case, variables are discrete, and each variable can take a finite set of values.

Representation

Suppose we want to represent the variable gender. The variable gender may take two possible values: male and female. The assignment of a value to a variable is called the state of the variable. So, the variable gender has two states: Gender = Male and Gender = Female. The graphical structure of a Bayesian network looks like this:

Figure 1. (The example network, with arcs from the nodes gender and obesity to the node heart condition)

The network represents the notion that obesity and gender affect the heart condition of a patient. The variable obesity can take three values: yes, borderline, and no. The variable heart condition has two states: true and false. In this representation, the node heart condition is said to be a child of the nodes gender and obesity, which, in turn, are the parents of heart condition.

The variables used in a Bayesian network are stochastic, meaning that the assignment of a value to a variable is represented by a probability distribution. For instance, if we do not know for sure the gender of a patient, we may want to encode the information so that we have better chances of having a female patient rather than a male one. This guess, for instance, could be based on statistical considerations of a particular population, but this may not be our unique source of information. So, for the sake of this example, let's say that there is an 80% chance of being female and a 20% chance of being male. Similarly, we can encode that the incidence of obesity is 10%, and 20% are borderline cases. The following set of distributions tries to encode the fact that obesity increases the cardiac risk of a patient, but this effect is more significant in men than in women:

Figure 2. (The conditional probability tables of heart condition for each combination of gender and obesity)

The dependency is modeled by a set of probability distributions, one for each combination of states of the variables gender and obesity, called the parent variables of heart condition.

Learning

Learning a Bayesian network from data consists of the induction of its two different components: (1) the graphical structure of conditional dependencies (model selection) and (2) the conditional distributions quantifying the dependency structure (parameter estimation).

There are two main approaches to learning Bayesian networks from data. The first approach, known as the constraint-based approach, is based on conditional independence tests. As the network encodes assumptions of conditional independence, along this approach we need to identify conditional independence constraints in the data by testing and then encode them into a Bayesian network (Glymour, 1987; Pearl, 1988; Whittaker, 1990).

The second approach is Bayesian (Cooper & Herskovitz, 1992; Heckerman et al., 1995) and regards model selection as a hypothesis testing problem. In this approach, we suppose we have a set M = {M0, M1, ..., Mg} of Bayesian networks for the random variables Y1, ..., Yv, and each Bayesian network represents a hypothesis on the dependency structure relating these variables. Then, we choose one Bayesian network after observing a sample of data D = {y1k, ..., yvk}, for k = 1, ..., n. If p(Mh) is the prior probability of model Mh, a Bayesian solution to the model selection problem consists of choosing the network with maximum posterior probability:

p(Mh | D) ∝ p(Mh) p(D | Mh).

The quantity p(D | Mh) is the marginal likelihood, and its computation requires the specification of a parameterization of each model Mh and the elicitation of a prior distribution for model parameters. When all the variables are discrete, or all the variables are continuous, follow Gaussian distributions, and have linear dependencies, the marginal likelihood factorizes into the product of the marginal likelihoods of each node and its parents. An important property of this likelihood modularity is that in the comparison of models that differ only in the parent structure of a variable Yi, only the local marginal likelihood matters. Thus, the comparison of two local network structures that specify different parents for Yi can be done simply by evaluating the product of the local Bayes factor BFh,k = p(D | Mhi) / p(D | Mki) and the ratio p(Mhi) / p(Mki), to compute the posterior odds of one model versus the other as p(Mhi | D) / p(Mki | D).

In this way, we can learn a model locally by maximizing the marginal likelihood node by node. Still, the space of the possible sets of parents for each variable grows exponentially with the number of parents involved, but successful heuristic search procedures (both deterministic and stochastic) exist to render the task more amenable (Cooper & Herskovitz, 1992; Larranaga et al., 1996; Singh & Valtorta, 1995).
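To make the node-by-node search concrete, the following Python sketch scores one discrete node against alternative parent sets using the closed-form marginal likelihood of Cooper and Herskovitz (1992) for complete discrete data under a uniform Dirichlet prior. The toy data, the uniform structure prior, and the restriction to complete data are our own simplifying assumptions; they are only meant to illustrate how local scores and Bayes factors drive model selection.

import numpy as np
from itertools import product
from scipy.special import gammaln

def log_marginal_likelihood(data, child, parents, arity):
    """Cooper-Herskovitz local score of one node given a parent set."""
    r = arity[child]
    parent_states = list(product(*[range(arity[p]) for p in parents]))
    score = 0.0
    for js in parent_states:
        rows = data
        for p, v in zip(parents, js):
            rows = rows[rows[:, p] == v]
        n_jk = np.array([np.sum(rows[:, child] == k) for k in range(r)])
        score += gammaln(r) - gammaln(n_jk.sum() + r) + gammaln(n_jk + 1).sum()
    return score

# Toy complete dataset over three binary variables (columns 0, 1, 2)
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = rng.integers(0, 2, 500)
x2 = ((x0 ^ x1) & (rng.random(500) < 0.9)).astype(int)   # depends on both 0 and 1
data = np.column_stack([x0, x1, x2])
arity = {0: 2, 1: 2, 2: 2}

for parents in [(), (0,), (1,), (0, 1)]:
    print(parents, log_marginal_likelihood(data, child=2, parents=parents, arity=arity))
# The log Bayes factor between two candidate parent sets is simply the
# difference of the two local scores (assuming equal prior model probabilities).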

Once the structure has been learned from a dataset, we still need to estimate the conditional probability distributions associated with each dependency in order to turn the graphical model into a Bayesian network. This process, called parameter estimation, takes a graphical structure and estimates the conditional probability distributions of each parent-child combination. When all the parent variables are discrete, we need to compute the conditional probability distribution of the child variable, given each combination of states of its parent variables. These conditional distributions can be estimated either as relative frequencies of cases or, in a Bayesian fashion, by using these relative frequencies to update some, possibly uniform, prior distribution. A more detailed description of these estimation procedures for both the discrete and the continuous case is available in Ramoni and Sebastiani (2003).

Prediction and Classification

Once a Bayesian network has been defined, either by hand or by an automated discovery process from data, it can be used to reason about new problems for prediction, diagnosis, and classification. Bayes' theorem is at the heart of the propagation process.

One of the most useful properties of a Bayesian network is the ability to propagate evidence irrespective of the position of a node in the network, contrary to standard classification methods. In a typical classification system, for instance, the variable to predict (i.e., the class) must be chosen in advance, before learning the classifier. Information about single individuals will then be entered, and the classifier will predict the class (and only the class) of these individuals. In a Bayesian network, on the other hand, the information about a single individual will be propagated in any direction in the network, so that the variable(s) to predict need not be chosen in advance.

Although the problem of propagating probabilistic information in Bayesian networks is known to be, in the general case, NP-complete (Cooper, 1990), several scalable algorithms exist to perform this task in networks with hundreds of nodes (Castillo et al., 1996; Cowell et al., 1999; Pearl, 1988). Some of these propagation algorithms have been extended, with some restrictions or approximations, to networks containing continuous variables (Cowell et al., 1999).

FUTURE TRENDS

The technical challenges of current research in Bayesian networks are focused mostly on overcoming their current limitations. Established methods to learn Bayesian networks from data work under the assumption that each variable is either discrete or normally distributed around a mean that linearly depends on its parent variables. The latter networks are termed linear Gaussian networks, which still enjoy the decomposability properties of the marginal likelihood. Imposing the assumption that continuous variables follow linear Gaussian distributions and that discrete variables can only be parent nodes in the network but cannot be children of any continuous node leads to a closed-form solution for the computation of the marginal likelihood (Lauritzen, 1992). The second technical challenge is the identification of sound methods to handle incomplete information, either in the form of missing data (Sebastiani & Ramoni, 2001) or completely unobserved variables (Binder et al., 1997). A third important area of development is the extension of Bayesian networks to represent dynamic processes (Ghahramani, 1998) and to decode control mechanisms.

The most fundamental challenge for Bayesian networks today, however, is the full deployment of their potential in groundbreaking applications and their establishment as a routine analytical technique in science and engineering. Bayesian networks are becoming increasingly popular in various fields of genomic and computational biology, from gene expression analysis (Friedman, 2004) to proteomics (Jansen et al., 2003) and genetic analysis (Lauritzen & Sheehan, 2004), but they are still far from being a received approach in these areas. Still, these areas of application hold the promise of turning Bayesian networks into a common tool of statistical data analysis.

CONCLUSION

Bayesian networks are a representation formalism born at the intersection of statistics and artificial intelligence. Thanks to their solid statistical foundations, they have been turned successfully into a powerful data-mining and knowledge-discovery tool that is able to uncover complex models of interactions from large databases. Their highly symbolic nature makes them easily understandable to human operators. Contrary to standard classification methods, Bayesian networks do not require the preliminary identification of an outcome variable of interest, but they are able to draw probabilistic inferences about any variable in the database. Notwithstanding these attractive properties and the continuous interest of the data-mining and knowledge-discovery community, Bayesian networks are still not playing a routine role in the practice of science and engineering.

REFERENCES

Binder, J. et al. (1997). Adaptive probabilistic networks with hidden variables. Mach Learn, 29(2-3), 213-244.

Castillo, E. et al. (1996). Expert systems and probabilistic network models. New York: Springer.

Charniak, E. (1991). Bayesian networks without tears. AI Magazine, 12(8), 50-63.

Cooper, G.F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artif Intell, 42(2-3), 393-405.

Cooper, G.F., & Herskovitz, G.F. (1992). A Bayesian method for the induction of probabilistic networks from data. Mach Learn, 9, 309-347.

Cowell, R.G., et al. (1999). Probabilistic networks and expert systems. New York: Springer.

Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303, 799-805.

Ghahramani, Z. (1998). Learning dynamic Bayesian networks. In C.L. Giles & M. Gori (Eds.), Adaptive processing of sequences and data structures (pp. 168-197). New York: Springer.

Glymour, C., Scheines, R., Spirtes, P., & Kelly, K. (1987). Discovering causal structure: Artificial intelligence, philosophy of science, and statistical modeling. San Diego, CA: Academic Press.

Heckerman, D. (1997). Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1), 79-119.

Heckerman, D. et al. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Mach Learn, 20, 197-243.

Jansen, R. et al. (2003). A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302, 449-453.

Larranaga, P., Kuijpers, C., Murga, R., & Yurramendi, Y. (1996). Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE T Syst Man Cyb, 26, 487-493.

Lauritzen, S.L. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J Roy Stat Soc B Met, 50, 157-224.

Lauritzen, S.L. (1992). Propagation of probabilities, means and variances in mixed graphical association models. J Amer Statist Assoc, 87, 1098-1108.

Lauritzen, S.L. (1996). Graphical models. Oxford: Clarendon Press.

Lauritzen, S.L., & Sheehan, N.A. (2004). Graphical models for genetic analysis. Statist Sci, 18(4), 489-514.

Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artif Intell, 29(3), 241-288.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Ramoni, M., & Sebastiani, P. (2003). Bayesian methods. In M. Berthold & D.J. Hand (Eds.), Intelligent data analysis: An introduction (pp. 128-166). New York: Springer.

Sebastiani, P., & Ramoni, M. (2001). Bayesian selection of decomposable models with incomplete data. J Am Stat Assoc, 96(456), 1375-1386.

Singh, M., & Valtorta, M. (1995). Construction of Bayesian network structures from data: A brief survey and an efficient algorithm. Int J Approx Reason, 12, 111-131.

Whittaker, J. (1990). Graphical models in applied multivariate statistics. New York: John Wiley & Sons.

Wright, S. (1923). The theory of path coefficients: A reply to Niles' criticisms. Genetics, 8, 239-255.

Wright, S. (1934). The method of path coefficients. Ann Math Statist, 5, 161-215.

KEY TERMS

Bayes Factor: The ratio of the probability of the observed data under one hypothesis to its probability under an alternative hypothesis.

Conditional Independence: Let X, Y, and Z be three sets of random variables; then X and Y are said to be conditionally independent given Z, if and only if p(x|z,y) = p(x|z) for all possible values x, y, and z of X, Y, and Z.

Directed Acyclic Graph (DAG): A graph with directed arcs containing no cycles; in this type of graph, for any node, there is no directed path returning to it.

Probabilistic Graphical Model: A graph with nodes representing stochastic variables annotated by probability distributions and representing assumptions of conditional independence among its variables.

Statistical Independence: Let X and Y be two disjoint sets of random variables; then X is said to be independent of Y, if and only if p(x) = p(x|y) for all possible values x and y of X and Y.
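As a small worked companion to the example network of Figures 1 and 2 and to the key terms above, the following Python sketch encodes the Gender and Obesity priors quoted in the article and an illustrative conditional table for Heart Condition, then propagates evidence by brute-force enumeration. The numerical values of the Heart Condition table are our own assumptions, chosen only so that obesity raises the risk more in men than in women as the text describes; real Bayesian network software uses far more efficient propagation algorithms than this enumeration.

from itertools import product

# Priors quoted in the article's example
p_gender = {"male": 0.2, "female": 0.8}
p_obesity = {"yes": 0.1, "borderline": 0.2, "no": 0.7}

# Illustrative (assumed) table: P(Heart Condition = true | Gender, Obesity)
p_heart_true = {
    ("male", "yes"): 0.40, ("male", "borderline"): 0.25, ("male", "no"): 0.10,
    ("female", "yes"): 0.25, ("female", "borderline"): 0.15, ("female", "no"): 0.08,
}

def joint(g, o, h):
    ph = p_heart_true[(g, o)]
    return p_gender[g] * p_obesity[o] * (ph if h else 1.0 - ph)

# Propagate evidence "Heart Condition = true" back to its parent Gender:
# P(Gender | Heart = true) = sum over Obesity of the joint, then normalize
post = {g: sum(joint(g, o, True) for o in p_obesity) for g in p_gender}
total = sum(post.values())
print({g: round(v / total, 3) for g, v in post.items()})

Note that the evidence here is entered at a child node and propagated back to a parent, which is exactly the "any direction" property contrasted with standard classifiers in the Prediction and Classification section.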

Learning Information Extraction Rules for Web Data Mining
Chia-Hui Chang
National Central University, Taiwan

Chun-Nan Hsu
Institute of Information Science, Academia Sinica, Taiwan

INTRODUCTION

The explosive growth and popularity of the World Wide Web has resulted in a huge number of information sources on the Internet. However, due to the heterogeneity and the lack of structure of Web information sources, access to this huge collection of information has been limited to browsing and keyword searching. Sophisticated Web-mining applications, such as comparison shopping, require expensive maintenance costs to deal with different data formats. The problem of translating the contents of input documents into structured data is called information extraction (IE). Unlike information retrieval (IR), which concerns how to identify relevant documents from a document collection, IE produces structured data ready for post-processing, which is crucial to many applications of Web mining and search tools.

Formally, an information extraction task is defined by its input and its extraction target. The input can be unstructured documents like free text that are written in natural language, or semi-structured documents that are pervasive on the Web, such as tables, itemized and enumerated lists, and so forth. The extraction target of an IE task can be a relation of k-tuples (where k is the number of attributes in a record), or it can be a complex object with hierarchically organized data. For some IE tasks, an attribute may have zero (missing) or multiple instantiations in a record. The difficulty of an IE task can be complicated further when various permutations of attributes or typographical errors occur in the input documents.

Programs that perform the task of information extraction are referred to as extractors or wrappers. A wrapper was originally defined as a component in an information integration system that aims at providing a single uniform query interface to access multiple information sources. In an information integration system, a wrapper is generally a program that wraps an information source (e.g., a database server or a Web server) such that the information integration system can access that information source without changing its core query answering mechanism. In the case where the information source is a Web server, a wrapper must perform information extraction in order to extract the contents of HTML documents.

Wrapper induction (WI) systems are software tools that are designed to generate wrappers. A wrapper usually performs a pattern-matching procedure (e.g., a form of finite-state machine), which relies on a set of extraction rules. Tailoring a WI system to a new requirement is a task that varies in scale, depending on the text type, domain, and scenario. To maximize reusability and minimize maintenance cost, designing a trainable WI system has been an important topic in research fields including message understanding, machine learning, pattern mining, and so forth. The task of Web IE differs largely from traditional IE tasks in that traditional IE aims at extracting data from totally unstructured free texts that are written in natural language. In contrast, Web IE processes online documents that are semi-structured and usually generated automatically by a server-side application program. As a result, traditional IE usually takes advantage of natural language processing techniques such as lexicons and grammars, while Web IE usually applies machine learning and pattern-mining techniques to exploit the syntactic patterns or layout structures of template-based documents.
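To ground the notions of wrapper and extraction rule used above, here is a toy Python sketch in which a single delimiter-based rule extracts 2-tuples of (title, price) from a template-generated page. The HTML fragment and the rule itself are invented for illustration; real WI systems learn such rules from examples, often as finite-state transducers rather than hand-written regular expressions.

import re

# A fragment of a hypothetical template-generated product listing
page = """
<li><b>Data Mining Basics</b> <span class="price">$35.00</span></li>
<li><b>Web Wrappers in Practice</b> <span class="price">$42.50</span></li>
"""

# A hand-written extraction rule: the HTML tags around each attribute act as
# left/right delimiters for the attribute values
rule = re.compile(r'<b>(?P<title>.*?)</b>\s*<span class="price">\$(?P<price>[\d.]+)</span>')

records = [m.groupdict() for m in rule.finditer(page)]
print(records)   # structured k-tuples ready for post-processing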

BACKGROUND

In the past few years, many approaches for WI systems, and ways to apply machine learning and pattern mining techniques to train WI systems, have been proposed with various degrees of automation. Kushmerick and Thomas (2003) conducted a survey that categorizes WI systems based on the wrapper program's underlying formalism (whether they are finite-state approaches or Prolog-like logic programming systems). Sarawagi (2002) further distinguished deterministic finite-state approaches from probabilistic hidden Markov models in her 2002 VLDB tutorial. Another survey can be found in Laender et al. (2002), which categorizes Web extraction tools into six classes based on their underlying techniques: declarative languages, HTML structure analysis, natural language processing, machine learning, data modeling, and ontology.

MAIN THRUST

We classify previous work in Web IE into three categories. The first category contains the systems that require users to possess programming expertise. This category of wrapper generation systems provides specialized languages or toolkits for wrapper construction, such as W4F (Sahuguet & Azavant, 2001) and XWrap (Liu et al., 2000). Such languages or toolkits were proposed as alternatives to general-purpose languages in order to allow programmers to concentrate on formulating the extraction rules without being concerned about the detailed processing of input strings. To apply these systems, users must learn the language in order to write their extraction rules; therefore, such systems also feature user-friendly interfaces for easy use of the toolkits. However, writing correct extraction rules requires significant programming expertise. In addition, since the structures of Web pages are not always obvious and change frequently, writing specialized extraction rules can be time-consuming, error-prone, and not scalable to a large number of Web sites. Therefore, there is a need for automatic wrapper induction that can generalize extraction rules for each distinct IE task.

The second category contains the WI systems that require users to label some extraction targets as training examples, from which the WI systems apply a machine-learning algorithm to learn extraction rules. No programming is needed to configure these WI systems. Many IE tasks for Web mining belong to this category; for example, IE for semi-structured text, such as RAPIER (Califf & Mooney, 1999), SRV (Freitag, 2000), and WHISK (Soderland, 1999), and IE for template-based pages, such as WIEN (Kushmerick et al., 2000), SoftMealy (Hsu & Dung, 1998), STALKER (Muslea et al., 2001), and so forth. Compared to the first category, these WI systems are preferable, since general users, instead of only programmers, can be trained to use them for wrapper construction.

However, since the learned rules only apply to Web pages from a particular Web site, labeling training examples can be laborious, especially when we need to extract contents from thousands of data sources. Therefore, researchers have focused on developing tools that can reduce the labeling effort. For instance, Muslea et al. (2002) proposed selective sampling, a form of active learning that reduces the number of training examples. Chidlovskii et al. (2000) designed a wrapper generation system that requires a small amount (one training record) of labeling by the user. Earlier annotation-based WI systems place the emphasis on the learning techniques in their papers. Recently, several works have been proposed to simplify the annotation process. For example, Lixto (Baumgartner et al., 2001), DEByE (Laender et al., 2002), and OLERA (Chang & Kuo, 2004) are three such systems that stress the importance of how annotations or examples are received from users. Note that OLERA also features a so-called semi-supervised approach, which receives rough rather than exact and perfect examples from users to reduce the labeling effort.

The third category contains the WI systems that do not require any preprocessing of the input documents by the users. We call them annotation-free WI systems. Example systems include IEPAD (Chang & Lui, 2001), RoadRunner (Crescenzi et al., 2001), DeLa (Wang & Lochovsky, 2003), and EXALG (Arasu & Garcia-Molina, 2003). Since no extraction targets are specified, such WI systems make heuristic assumptions about the data to be extracted. For example, the first three systems assume the existence of multiple tuples to be extracted in one page; therefore, the approach is to discover repeated patterns in the input page. With such an assumption, IEPAD and DeLa only apply to Web pages that contain multiple data tuples. RoadRunner and EXALG, on the other hand, try to extract structured data by deducing the template and the schema of the whole page from multiple Web pages. Their assumption is that strings that are stationary across pages are presumably template, and strings that vary are presumably schema and need to be extracted. However, as commercial Web pages often contain multiple topics, where a lot of information is embedded in a page for navigation, decoration, and interaction purposes, these systems may extract both useful and useless information from a page, and the criterion of what is useful is quite subjective and depends on the application. In summary, these approaches are, in fact, not fully automatic. Rather, post-processing is required for users to select useful data and to assign the data to the proper attributes.

Task Difficulties

A critical issue for WI systems is what types of documents and structuring variations can be handled. Documents can be classified into structured, semi-structured, and unstructured sets (Hsu & Dung, 1998). Early IE systems like RAPIER, SRV, and WHISK are designed to handle documents that contain semi-structured texts, while recent IE systems are designed mostly to handle documents that contain semi-structured data (Laender et al., 2002). In this survey, we focus on semi-structured data extraction and possible structure variations. These include missing data attributes, multi-valued attributes, attribute permutations, nested data structures, and so forth. Table 1 lists

Table 1 lists these variations and whether a WI system can correctly extract the contents from a document with a layout variation. Most WI systems can handle missing attributes except for WIEN. Multi-valued attributes can be considered a special case of nested objects. However, it is neither right nor wrong to say that annotation-free WI systems can handle multi-valued attributes, because it depends on how the values are delimited (by HTML tags or by delimiters like commas, spaces, etc.). Similarly, though RoadRunner and EXALG support the extraction of nested objects in general, whether they can handle a particular set of documents with nested objects depends, in fact, on the quality of the template-based input.

Permutations of attributes refer to multiple attribute orders in different data tuples in the target documents (see the PMI example in Chang & Kuo, 2004). Note that both missing attributes and permutations of attributes can lead to multiple attribute orders. Some approaches (e.g., STALKER and multi-pass SoftMealy) utilize multiple scans to deal with attribute permutation. Others (e.g., IEPAD, DeLa) employ a string alignment technique for this issue. However, the way they handle the extracted data results in different expressive power. EXALG combines two equivalence classes to produce disjunction rules for handling permuted attributes. However, the procedure may fail when all tokens fall in one equivalence class or when no equivalence class is formed.

Another difficult issue is what we call common delimiters (CD) for attributes and record boundaries. An example can be found in an Internet address finder document set from the repository of information sources for information extraction (RISE, http://www.isi.edu/info-agents/RISE/repository.html). In that document set, HTML tags like <TR> and <TD> are used as delimiters for both record boundaries and attributes. Such document sets are especially difficult for annotation-free systems. Even for annotation-based systems, the problem cannot be completely handled with their default extraction mechanisms. Nonexistent delimiters (ND) also cause problems for some WI systems that rely on delimiter-based extraction rules. For example, suppose we want to extract the department code and course number from the following string: COMP4016. WI systems that depend on recognizing delimiters will fail in this case, because no delimiter exists between the department code COMP and the course number 4016. Note that delimiter-related issues sometimes are caused by the tokenization/encoding method (see the next section) of the WI systems. Therefore, they do not necessarily cause problems for all WI systems. Finally, most WI systems assume that the attributes of a data object occur in a contiguous string that does not interleave with other data objects; the exception is MDR (Liu et al., 2003), which is able to handle non-contiguous (NC) data records.

Encoding Scheme, Scanning Passes, and Other Features

Table 2 compares important features of WI systems. In order to learn the set of extraction rules, WI systems need to know how to segment a string of characters (the input) into tokens. For example, SoftMealy segments a page into tokens, including HTML tags as well as words separated by spaces, and uses a token taxonomy tree for extraction rule generation. On the other hand, IEPAD and RoadRunner regard every text string between two HTML tags as one token, which leads to coarser extraction granularity. Most WI systems have a predefined feature set for rule generalization. Still, some systems, such as OLERA (Chang & Kuo, 2004), explicitly allow user-defined encoding schemes for tokenization. The extraction mechanisms of the various WI systems also play an important role in extraction efficiency. Some WI systems scan the input document once and are referred to as single-pass extractors. Others scan the input document several times to complete the extraction. Generally speaking, single-pass wrappers are more efficient than multi-pass wrappers. However, multi-pass wrappers are more effective at handling data objects with unrestricted attribute permutations or complex object extraction.

Some of the WI systems have special characteristics or requirements that deserve discussion. For example, EXALG limits extraction failure to only part of the records (a few attributes) instead of the entire page.

Table 1. The expressive power of various WI systems

                     WI systems    Nested Objects  Missing  Multi-valued  Permute  ND    CD    NC
  Annotation-based   WIEN          No              No       No            No       No    No    No
                     SoftMealy     No              Yes      Yes           Yes      Yes   Part  No
                     STALKER       Yes             Yes      Yes           Yes      Yes   Part  No
                     OLERA         Yes             Yes      Yes           Limited  Yes   No    No
  Annotation-free    IEPAD         No              Yes      --            Limited  Part  No    No
                     DeLa          Yes             Yes      --            Yes      Part  No    No
                     RoadRunner    Yes*            Yes      --            No       No    No    No
                     EXALG         Yes*            Yes      --            Limited  No    No    No
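To make the delimiter issues discussed above concrete, the following small sketch is a delimiter-based extraction rule of the kind these systems generate. It is only an illustration written for this overview, not code from any of the surveyed systems; the page fragment, field names, and regular expressions are hypothetical.

import re

# A hypothetical template-generated fragment in which <TD> serves as the only
# delimiter for both record boundaries and attributes (the "common delimiter" case).
page = ("<TR><TD>COMP4016</TD><TD>Data Mining</TD></TR>"
        "<TR><TD>MATH2010</TD><TD>Linear Algebra</TD></TR>")

# Delimiter-based rule: take whatever appears between <TD> ... </TD> pairs.
cells = re.findall(r"<TD>(.*?)</TD>", page)
records = [cells[i:i + 2] for i in range(0, len(cells), 2)]
print(records)   # [['COMP4016', 'Data Mining'], ['MATH2010', 'Linear Algebra']]

# Splitting 'COMP4016' into department code and course number cannot rely on a
# delimiter (none exists between them); a character-class rule is needed instead.
code_rule = re.compile(r"(?P<dept>[A-Z]+)(?P<number>\d+)")
for course_id, title in records:
    m = code_rule.match(course_id)
    print(m.group("dept"), m.group("number"), title)

The second rule illustrates why nonexistent delimiters are only a problem for systems whose tokenization cannot distinguish character classes within a text string.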


Table 2. Features of WI systems (C* = consecutive, E* = episodic)

                     WI Systems    Encoding     Passes           EF               ER       Requirement
  Annotation-based   WIEN          Word         Single           Entire           1 C*
                     SoftMealy     Word         Single/Multiple  Entire/Partial   Disj C*
                     STALKER       Word         Multiple         Partial          Disj E*
                     OLERA         Multiple     Both             Partial          Disj C*
  Annotation-free    IEPAD         HTML         Single           Entire           Disj C*  Only multi-record pages
                     DeLa          HTML + Word  Single           Entire           Disj C*  Only multi-record pages
                     RoadRunner    HTML         Single           Entire           1 C*     At least two input pages required
                     EXALG         Word         Single           Partial          Disj C*  At least two input pages required
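The encoding column of Table 2 can be illustrated by the rough sketch below, which contrasts word-level tokenization (as in SoftMealy-style systems) with the coarser HTML-level tokenization in which every text string between two tags is a single token. The function names, the input fragment, and the regular expressions are ours and are not taken from any of the surveyed systems.

import re

fragment = "<TR><TD>John Smith</TD><TD>jsmith@example.org</TD></TR>"

def word_level_tokens(text):
    # HTML tags are tokens; the remaining text is split on whitespace.
    return [t for t in re.split(r"(<[^>]+>)|\s+", text) if t]

def html_level_tokens(text):
    # Every maximal text string between two tags is one token.
    return [t for t in re.split(r"<[^>]+>", text) if t.strip()]

print(word_level_tokens(fragment))
# ['<TR>', '<TD>', 'John', 'Smith', '</TD>', '<TD>', 'jsmith@example.org', '</TD>', '</TR>']
print(html_level_tokens(fragment))
# ['John Smith', 'jsmith@example.org']

With the coarser encoding, "John Smith" can only be extracted as a whole, which is why finer, multi-level encodings matter for fine-grain extraction tasks.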

STALKER and multi-pass SoftMealy also feature this characteristic (see the extraction failure [EF] column). In such cases, lower-level encoding schemes may be necessary for fine-grain extraction tasks. Therefore, multi-level encodings are even more important. As for requirements, IEPAD cannot handle single-record pages, for it assumes that there are multiple records in one page. RoadRunner and EXALG require at least two pages as the input in order to generate extraction templates. In contrast, most other WI systems can potentially generate extraction rules from just one example page.

The support for disjunctive rules or for extraction rules that allow intermittent delimiters (i.e., episodic rules) in addition to consecutive delimiters is also an indication of the expressive power of a WI system (see the extraction rule [ER] column). These supports are helpful when no single consecutive extraction rule can be generalized; thus, disjunctive rules and episodic rules serve as alternatives. Finally, accuracy and automation are two of the major concerns for all WI systems. Annotation-free systems may reach a status where several patterns or grammars are induced. Some systems (e.g., IEPAD) leave the decision to users to choose the correct pattern or grammar; therefore, they are not fully automatic. Fully automatic systems may require further analysis of the derived schema to ensure good quality data extraction (Yang et al., 2003).

FUTURE TRENDS

The most critical issue for these state-of-the-art WI systems is that the generated extraction rules only apply to a small collection of documents that share the same layout structure. Usually, those rules only apply to documents from one Web site. Generalization across different layout structures is still an extremely difficult and challenging research problem. Before this problem can be resolved, information integration and data mining at the scale of billions of Web sites remain too expensive and impractical.

CONCLUSION

In this overview, we present our taxonomy of WI systems from the users' viewpoint and compare important features of WI systems that affect their effectiveness. We focus on semi-structured documents that usually are produced with templates by a server-side Web application. The extension to non-HTML documents can be achieved for some WI systems if a proper tokenization or feature set is used. Some of the WI systems surveyed here have been applied successfully in large-scale real-world applications, including Web intelligence, comparison shopping, knowledge management, and bioinformatics (Chang et al., 2003; Hsu et al., 2004).

REFERENCES

Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from Web pages. Proceedings of ACM SIGMOD International Conference on Management of Data (pp. 337-348), San Diego, CA, USA.

Baumgartner, R., Flesca, S., & Gottlob, G. (2001). Supervised wrapper generation with Lixto. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB) (pp. 715-716), Roma, Italy.

Califf, M., & Mooney, R. (1999). Relational learning of pattern-match rules for information extraction. Proceedings of the 16th National Conference on Artificial Intelligence (AAAI) (pp. 328-334), Orlando, FL, USA.

Chang, C.-H., & Kuo, S.-C. (2004). OLERA: A semi-supervised approach for Web data extraction with visual support. IEEE Intelligent Systems, 19(6), 56-64.

Chang, C.-H., & Lui, S.-C. (2001). IEPAD: Information extraction based on pattern discovery. Proceedings of the 10th International World Wide Web (WWW) Conference (pp. 681-688), Hong Kong, China.


Chang, C.-H., Siek, H., Lu, J.-J., Chiou, J.-J., & Hsu, C.-N. (2003). Reconfigurable Web wrapper agents. IEEE Intelligent Systems, 18(5), 34-40.

Chidlovskii, B., Ragetli, J., & Rijke, M. (2000). Automatic wrapper generation for Web search engines. Proceedings of WAIM (pp. 399-410), Shanghai, China.

Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: Towards automatic data extraction from large Web sites. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB) (pp. 109-118), Roma, Italy.

Freitag, D. (2000). Machine learning for information extraction in informal domains. Machine Learning, 39(2/3), 169-202.

Hsu, C.-N., Chang, C.-H., Hsieh, C.-H., Lu, J.-J., & Chang, C.-C. (2004). Reconfigurable Web wrapper agents for biological information integration. Journal of the American Society for Information Science and Technology (JASIST), 56(5), 505-517.

Hsu, C.-N., & Dung, M.-T. (1998). Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems, 23(8), 521-538.

Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence Journal, 118(1-2), 15-68.

Kushmerick, N., & Thomas, B. (2002). Adaptive information extraction: Core technologies for information agents. Intelligent Information Agents R&D in Europe: An AgentLink Perspective. Lecture Notes in Computer Science, 2586, 79-103. Springer.

Laender, A.H.F., Ribeiro-Neto, B.A., & Da Silva, A.S. (2002). DEByE: Data extraction by example. Data and Knowledge Engineering, 40(2), 121-154.

Laender, A.H.F., Ribeiro-Neto, B.A., Da Silva, A.S., & Teixeira, J.S. (2002). A brief survey of Web data extraction tools. SIGMOD Record, 31(2), 84-93.

Liu, B., Grossman, R., & Zhai, Y. (2003). Mining data records in Web pages. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 601-606), Washington, D.C., USA.

Liu, L., Pu, C., & Han, W. (2000). XWrap: An XML-enabled wrapper construction system for Web information sources. Proceedings of the 16th International Conference on Data Engineering (ICDE) (pp. 611-621), San Diego, California, USA.

Muslea, I. (1999). Extraction patterns for information extraction tasks: A survey. The AAAI-99 Workshop on Machine Learning for Information Extraction (pp. 435-442), Sydney, Australia.

Muslea, I., Minton, S., & Knoblock, C.A. (2001). Hierarchical wrapper induction for semi-structured information sources. Journal of Autonomous Agents and Multi-Agent Systems, 4, 93-114.

Muslea, I., Minton, S., & Knoblock, C. (2002). Active + semi-supervised learning = Robust multi-view learning. Proceedings of the 19th International Conference on Machine Learning (ICML), Sydney, Australia.

Sahuguet, A., & Azavant, F. (2001). Building intelligent Web applications using lightweight wrappers. Data and Knowledge Engineering, 36(3), 283-316.

Sarawagi, S. (2002). Automation in information extraction and integration. Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Tutorial, Hong Kong, China.

Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3), 233-272.

Wang, J., & Lochovsky, F.H. (2003). Data extraction and label assignment. Proceedings of the 12th International World Wide Web Conference (pp. 187-196), Budapest, Hungary.

Yang, G., Ramakrishnan, I.V., & Kifer, M. (2003). On the complexity of schema inference from Web pages in the presence of nullable data attributes. Proceedings of the 12th International Conference on Information and Knowledge Management (pp. 224-231), New Orleans, Louisiana, USA.

KEY TERMS

Hidden Markov Model: A variant of a finite state machine having a set of states, an output alphabet, transition probabilities, output probabilities, and initial state probabilities. Only the outcome, not the state, is visible to an external observer; therefore, the states are hidden to the outside, hence the name Hidden Markov Model.

Information Extraction: An information extraction task is to extract, or pull out, user-defined and pertinent information from input documents.

Logic Programming: A declarative, relational style of programming based on first-order logic. The original logic programming language was Prolog. The concept is based on Horn clauses.


Semi-Structured Documents: Semi-structured (or template-based) documents refer to documents whose format lies between that of structured and unstructured documents.

Software Agents: An artificial agent that operates in a software environment such as operating systems, computer applications, databases, networks, and virtual domains.

Transducer: A finite state machine, specifically one with a read-only input and a write-only output. The input and output cannot be reread or changed.

Wrapper: A program that extracts data from the input documents and wraps them in a user-desired, structured form.


Locally Adaptive Techniques for Pattern Classification
Carlotta Domeniconi
George Mason University, USA

Dimitrios Gunopulos
University of California, USA

INTRODUCTION

Pattern classification is a very general concept with numerous applications ranging from science, engineering, target marketing, medical diagnosis, and electronic commerce to weather forecasting based on satellite imagery. A typical application of pattern classification is mass mailing for marketing. For example, credit card companies often mail solicitations to consumers. Naturally, they would like to target those consumers who are most likely to respond. Often, demographic information is available for those who have responded previously to such solicitations, and this information may be used in order to target the most likely respondents. Another application is electronic commerce of the new economy. E-commerce provides a rich environment to advance the state of the art in classification, because it demands effective means for text classification in order to make rapid product and market recommendations.

Recent developments in data mining have posed new challenges to pattern classification. Data mining is a knowledge-discovery process whose aim is to discover unknown relationships and/or patterns from a large set of data, from which it is possible to predict future outcomes. As such, pattern classification becomes one of the key steps in an attempt to uncover the hidden knowledge within the data. The primary goal is usually predictive accuracy, with secondary goals being speed, ease of use, and interpretability of the resulting predictive model.

While pattern classification has shown promise in many areas of practical significance, it faces difficult challenges posed by real-world problems, of which the most pronounced is Bellman's curse of dimensionality, which states that the sample size required to perform accurate prediction on problems with high dimensionality is beyond feasibility. This is because, in high dimensional spaces, data become extremely sparse and are far apart from each other. As a result, severe bias that affects any estimation process can be introduced in a high-dimensional feature space with finite samples.

Learning tasks with data represented as a collection of a very large number of features abound. For example, microarrays contain an overwhelming number of genes relative to the number of samples. The Internet is a vast repository of disparate information growing at an exponential rate. Efficient and effective document retrieval and classification systems are required to turn the ocean of bits around us into useful information and, eventually, into knowledge. This is a challenging task, since a word-level representation of documents easily leads to 30,000 or more dimensions.

This paper discusses classification techniques to mitigate the curse of dimensionality and to reduce bias by estimating feature relevance and selecting features accordingly. This paper has both theoretical and practical relevance, since many applications can benefit from improvement in prediction performance.

BACKGROUND

In a classification problem, an observation is characterized by q feature measurements x = (x_1, ..., x_q) in R^q and is presumed to be a member of one of J classes, L_j, j = 1, ..., J. The particular group is unknown, and the goal is to assign the given object to the correct group, using its measured features x.

Feature relevance has a local nature. Therefore, any chosen fixed metric violates the assumption of locally constant class posterior probabilities and fails to make correct predictions in different regions of the input space. In order to achieve accurate predictions, it becomes crucial to be able to estimate the different degrees of relevance that input features may have in various locations of the feature space.

Consider, for example, the rule that classifies a new data point with the label of its closest training point in the measurement space (1-Nearest Neighbor rule).


Suppose that each instance is described by 20 features, but only three of them are relevant to classifying a given instance. In this case, two points that have identical values for the three relevant features may, nevertheless, be distant from one another in the 20-dimensional input space. As a result, the similarity metric that uses all 20 features will be misleading, since the distance between neighbors will be dominated by the large number of irrelevant features. This shows the effect of the curse of dimensionality phenomenon; that is, in high dimensional spaces, distances between points within the same class or between different classes may be similar. This fact leads to highly biased estimates. Nearest neighbor approaches (Ho, 1998; Lowe, 1995) are especially sensitive to this problem.

In many practical applications, things often are further complicated. In the previous example, the three relevant features for the classification task at hand may be dependent on the location of the query point (i.e., the point to be classified) in the feature space. Some features may be relevant within a specific region, while other features may be more relevant in a different region. Figure 1 illustrates a case in point, where class boundaries are parallel to the coordinate axes. For query a, dimension X is more relevant, because a slight move along the X axis may change the class label, while for query b, dimension Y is more relevant. For query c, however, both dimensions are equally relevant.

Figure 1. Feature relevance varies with query locations

These observations have two important implications. Distance computation does not vary with equal strength or in the same proportion in all directions in the feature space emanating from the input query. Moreover, the value of such strength for a specific feature may vary from location to location in the feature space. Capturing such information, therefore, is of great importance to any classification procedure in high-dimensional settings.

MAIN THRUST

Severe bias can be introduced in pattern classification in a high dimensional input feature space with finite samples. In the following, we introduce adaptive metric techniques for distance computation capable of reducing the bias of the estimation.

Friedman (1994) describes an adaptive approach (the Machete and Scythe algorithms) for classification that combines some of the best features of kNN learning and recursive partitioning. The resulting hybrid method inherits the flexibility of recursive partitioning to adapt the shape of the neighborhood N(x_0) of a query x_0, as well as the ability of nearest neighbor techniques to keep the points within N(x_0) close to the point being predicted. The method is capable of producing nearly continuous probability estimates, with the region N(x_0) centered at x_0 and the shape of the region separately customized for each individual prediction point.

The major limitation concerning the Machete/Scythe method is that, like recursive partitioning methods, it applies a greedy strategy. Since each split is conditioned on its ancestor split, minor changes in an early split, due to any variability in parameter estimates, can have a significant impact on later splits, thereby producing different terminal regions. This makes the predictions highly sensitive to the sampling fluctuations associated with the random nature of the process that produces the training data and, therefore, may lead to high variance predictions.

In Hastie and Tibshirani (1996a), the authors propose a discriminant adaptive nearest neighbor classification method (DANN), based on linear discriminant analysis. Earlier related proposals appear in Myles and Hand (1990) and Short and Fukunaga (1981). The method in Hastie and Tibshirani (1996a) computes a local distance metric as a product of weighted within- and between-sum-of-squares matrices. The authors also describe a method of performing global dimensionality reduction by pooling the local dimension information over all points in the training set (Hastie & Tibshirani, 1996a, 1996b).

While sound in theory, DANN may be limited in practice. The main concern is that, in high dimensions, one may never have sufficient data to fill in the q x q (within and between sum-of-squares) matrices (where q is the dimensionality of the problem). Also, the fact that the distance metric computed by DANN approximates the weighted Chi-squared distance only when class densities are Gaussian and have the same covariance matrix may cause a performance degradation in situations where data do not follow Gaussian distributions or are corrupted by noise, which is often the case in practice.

A different adaptive nearest neighbor classification method (ADAMENN) has been introduced to try to minimize bias in high dimensions (Domeniconi, Peng & Gunopulos, 2002) and to overcome the previously mentioned limitations.


ADAMENN performs a Chi-squared distance analysis to compute a flexible metric for producing neighborhoods that are highly adaptive to query locations. Let x be the nearest neighbor of a query x_0 computed according to a distance metric D(x, x_0). The goal is to find a metric D(x, x_0) that minimizes E[r(x_0, x)], where r(x_0, x) = \sum_{j=1}^{J} Pr(j | x_0)(1 - Pr(j | x)). Here, Pr(j | x) is the class conditional probability at x. That is, r(x_0, x) is the finite sample error risk given that the nearest neighbor to x_0 by the chosen metric is x. It can be shown (Domeniconi, Peng & Gunopulos, 2002) that the weighted Chi-squared distance

D(x, x_0) = \sum_{j=1}^{J} [Pr(j | x) - Pr(j | x_0)]^2 / Pr(j | x_0)     (1)

approximates the desired metric, thus providing the foundation upon which the ADAMENN algorithm computes a measure of local feature relevance, as shown in the following.

The first observation is that Pr(j | x) is a function of x. Therefore, one can compute the conditional expectation of Pr(j | x), denoted by Pr(j | x_i = z), given that x_i assumes value z, where x_i represents the i-th component of x. That is,

Pr(j | x_i = z) = E[Pr(j | x) | x_i = z] = \int Pr(j | x) p(x | x_i = z) dx.

Here, p(x | x_i = z) is the conditional density of the other input variables, defined as

p(x | x_i = z) = p(x) \delta(x_i - z) / \int p(x) \delta(x_i - z) dx,

where \delta(x - z) is the Dirac delta function having the properties \delta(x - z) = 0 if x != z and \int \delta(x - z) dx = 1. Let

r_i(z) = \sum_{j=1}^{J} [Pr(j | z) - Pr(j | x_i = z_i)]^2 / Pr(j | x_i = z_i).     (2)

r_i(z) represents the ability of feature i to predict the Pr(j | z)'s at x_i = z_i. The closer Pr(j | x_i = z_i) is to Pr(j | z), the more information feature i carries for predicting the class posterior probabilities locally at z.

We now can define a measure of feature relevance for x_0 as

\bar{r}_i(x_0) = (1/K) \sum_{z \in N(x_0)} r_i(z),     (3)

where N(x_0) denotes the neighborhood of x_0 containing the K nearest training points, according to a given metric. \bar{r}_i measures how well, on average, the class posterior probabilities can be approximated along input feature i within a local neighborhood of x_0. A small \bar{r}_i implies that the class posterior probabilities will be well approximated along dimension i in the vicinity of x_0. Note that \bar{r}_i(x_0) is a function of both the test point x_0 and the dimension i, thereby making \bar{r}_i(x_0) a local relevance measure in dimension i.

The relative relevance, as a weighting scheme, then can be given by w_i(x_0) = R_i(x_0)^t / \sum_{l=1}^{q} R_l(x_0)^t, where t = 1, 2, giving rise to linear and quadratic weightings, respectively, and R_i(x_0) = max_{j=1,...,q} {\bar{r}_j(x_0)} - \bar{r}_i(x_0) (i.e., the larger the R_i, the more relevant dimension i). We propose the following exponential weighting scheme:

w_i(x_0) = exp(c R_i(x_0)) / \sum_{l=1}^{q} exp(c R_l(x_0)),     (4)

where c is a parameter that can be chosen to maximize (minimize) the influence of \bar{r}_i on w_i. When c = 0, we have w_i = 1/q, which has the effect of ignoring any difference among the \bar{r}_i's. On the other hand, when c is large, a change in \bar{r}_i will be exponentially reflected in w_i. The exponential weighting is more sensitive to changes in local feature relevance and, in general, gives rise to better performance improvement. In fact, it is more stable, because it prevents neighborhoods from extending infinitely in any direction (i.e., zero weight); this can occur when either linear or quadratic weighting is used. Thus, equation (4) can be used to compute the weight associated with each feature, resulting in the weighted distance computation:

D(x, y) = [ \sum_{i=1}^{q} w_i (x_i - y_i)^2 ]^{1/2}.     (5)

The weights w_i enable the neighborhood to elongate along less important feature dimensions and, at the same time, to constrict along the most influential ones.
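A minimal sketch of how the weighting scheme of equations (3)-(5) can be coded is given below. It is only an illustration written for this entry, not the ADAMENN implementation of Domeniconi, Peng and Gunopulos (2002): the estimation of the local relevances required by equations (2)-(3) is abstracted into a user-supplied function r_bar_at, and the parameter values are arbitrary.

import numpy as np

def exponential_weights(r_bar, c=5.0):
    # Equation (4): turn averaged relevances (one per feature, smaller = more
    # relevant) into positive weights that sum to one.
    R = r_bar.max() - r_bar              # R_i = max_j r_bar_j - r_bar_i
    e = np.exp(c * R)
    return e / e.sum()

def weighted_distance(x, y, w):
    # Equation (5): weighted Euclidean distance.
    return np.sqrt(np.sum(w * (x - y) ** 2))

def classify(x0, X_train, y_train, r_bar_at, K=5):
    # Weighted K-NN prediction at query x0; r_bar_at(x0) must return the
    # vector of averaged local relevances of equation (3) at x0.
    w = exponential_weights(r_bar_at(x0))
    d = np.array([weighted_distance(x0, x, w) for x in X_train])
    nearest = np.argsort(d)[:K]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

With a constant r_bar_at, all weights equal 1/q and the rule reduces to ordinary K-NN; a relevance estimate that varies with the query yields the locally adaptive behavior described above.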


Note that the technique is query-based, because the weights depend on the query (Aha, 1997; Atkeson, Moore & Schaal, 1997).

An intuitive explanation for (2) and, hence, (3) goes as follows. Suppose that the value of r_i(z) is small, which implies a large weight along dimension i. Consequently, the neighborhood is shrunk along that direction. This, in turn, penalizes points along dimension i that are moving away from z_i. Now, r_i(z) can be small only if the subspace spanned by the other input dimensions at x_i = z_i likely contains samples similar to z in terms of the class conditional probabilities. Then, a large weight assigned to dimension i based on (4) says that moving away from that subspace and, hence, from the data similar to z is not a good thing to do. Similarly, a large value of r_i(z) and, hence, a small weight indicate that, in the vicinity of z_i along dimension i, one is unlikely to find samples similar to z. This corresponds to an elongation of the neighborhood along dimension i. Therefore, in this situation, in order to better predict the query, one must look farther away from z_i.

One of the key differences between the relevance measure (3) and Friedman's is the first term in the squared difference. While the class conditional probability is used in (3), its expectation is used in Friedman's. This difference is driven by two different objectives: in the case of Friedman's, the goal is to seek a dimension along which the expected variation of Pr(j | x) is maximized, whereas in (3) a dimension is found that minimizes the difference between the class probability distribution for a given query and its conditional expectation along that dimension (2). Another fundamental difference is that the Machete/Scythe methods, like recursive partitioning, employ a greedy peeling strategy that removes a subset of data points permanently from further consideration. As a result, changes in an early split, due to any variability in parameter estimates, can have a significant impact on later splits, thereby producing different terminal regions. This makes predictions highly sensitive to the sampling fluctuations associated with the random nature of the process that produces the training data, thus leading to high variance predictions. In contrast, ADAMENN employs a patient averaging strategy that takes into account not only the test point x_0, but also its K_0 nearest neighbors. As such, the resulting relevance estimates (3) are, in general, more robust and have the potential to reduce the variance of the estimates.

In Hastie and Tibshirani (1996a), the authors show that the resulting metric approximates the weighted Chi-squared distance (1) by a Taylor series expansion, given that class densities are Gaussian and have the same covariance matrix. In contrast, ADAMENN does not make such assumptions, which are unlikely to hold in real-world applications. Instead, it attempts to approximate the weighted Chi-squared distance (1) directly. The main concern with DANN is that, in high dimensions, we may never have sufficient data to fill in the q x q matrices. It is interesting to note that the ADAMENN algorithm potentially can serve as a general framework upon which to develop a unified adaptive metric theory that encompasses both Friedman's work and that of Hastie and Tibshirani.

FUTURE TRENDS

Almost all problems of practical interest are high dimensional. With the recent technological trends, we can expect an intensification of research efforts in the area of feature relevance estimation and selection. In bioinformatics, the analysis of micro-array data poses challenging problems. Here, one has to face the problem of dealing with more dimensions (genes) than data points (samples). Biologists want to find marker genes that are differentially expressed in a particular set of conditions. Thus, methods that simultaneously cluster genes and samples are required to find distinctive checkerboard patterns in matrices of gene expression data. In cancer data, these checkerboards correspond to genes that are up- or down-regulated in patients with particular types of tumors. Increased research efforts in this area are needed and expected.

Clustering is not exempt from the curse of dimensionality. Several clusters may exist in different subspaces, comprised of different combinations of features. Since each dimension could be relevant to at least one of the clusters, global dimensionality reduction techniques are not effective. We envision further investigation of this problem with the objective of developing robust techniques in the presence of noise.

Recent developments on kernel-based methods suggest a framework to make the locally adaptive techniques discussed previously more general. One can perform feature relevance estimation in an induced feature space and then use the resulting kernel metrics to compute distances in the input space. The key observation is that kernel metrics may be non-linear in the input space but are still linear in the induced feature space. Hence, the use of suitable non-linear features allows the computation of locally adaptive neighborhoods with arbitrary orientations and shapes in input space. Thus, more powerful classification techniques can be generated.


CONCLUSION

Pattern classification faces a difficult challenge in finite settings and high dimensional spaces, due to the curse of dimensionality. In this paper, we have presented and compared techniques to address data exploration tasks such as classification and clustering. All methods design adaptive metrics or parameter estimates that are local in input space in order to dodge the curse of dimensionality phenomenon. Such techniques have been demonstrated to be effective for the achievement of accurate predictions.

REFERENCES

Aha, D. (1997). Lazy learning. Artificial Intelligence Review, 11, 1-5.

Atkeson, C., Moore, A.W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11, 11-73.

Domeniconi, C., Peng, J., & Gunopulos, D. (2002). Locally adaptive metric nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1281-1285.

Friedman, J.H. (1994). Flexible metric nearest neighbor classification. Technical Report. Stanford University.

Hastie, T., & Tibshirani, R. (1996a). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 607-615.

Hastie, T., & Tibshirani, R. (1996b). Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, 58, 155-176.

Ho, T.K. (1998). Nearest neighbors in random subspaces. Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recognition.

Lowe, D.G. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7(1), 72-85.

Myles, J.P., & Hand, D.J. (1990). The multi-class metric problem in nearest neighbor discrimination rules. Pattern Recognition, 23(11), 1291-1297.

Short, R.D., & Fukunaga, K. (1981). Optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, 27(5), 622-627.

KEY TERMS

Classification: The task of inferring concepts from observations. It is a mapping from a measurement space into the space of possible meanings, viewed as finite and discrete target points (class labels). It makes use of training data.

Clustering: The process of grouping objects into subsets, such that those within each cluster are more closely related to one another than objects assigned to different clusters, according to a given similarity measure.

Curse of Dimensionality: Phenomenon that refers to the fact that, in high-dimensional spaces, data become extremely sparse and are far apart from each other. As a result, the sample size required to perform an accurate prediction in problems with high dimensionality is usually beyond feasibility.

Kernel Methods: Pattern analysis techniques that work by embedding the data into a high-dimensional vector space and by detecting linear relations in that space. A kernel function takes care of the embedding.

Local Feature Relevance: Amount of information that a feature carries to predict the class posterior probabilities at a given query.

Nearest Neighbor Methods: Simple approach to the classification problem. It finds the K nearest neighbors of the query in the training set and then predicts the class label of the query as the most frequent one occurring among the K neighbors.

Pattern: A structure that exhibits some form of regularity, able to serve as a model representing a concept of what was observed.

Recursive Partitioning: Learning paradigm that employs local averaging to estimate the class posterior probabilities for a classification problem.

Subspace Clustering: Simultaneous clustering of both row and column sets in a data matrix.

Training Data: Collection of observations (characterized by feature measurements), each paired with the corresponding class label.


Logical Analysis of Data


Endre Boros
RUTCOR, Rutgers University, USA

Peter L. Hammer
RUTCOR, Rutgers University, USA

Toshihide Ibaraki
Kwansei Gakuin University, Japan

INTRODUCTION

The logical analysis of data (LAD) is a methodology aimed at extracting or discovering knowledge from data in logical form. The first paper in this area was published as Crama, Hammer, & Ibaraki (1988) and precedes most of the data mining papers appearing in the 1990s. Its primary target is a set of binary data belonging to two classes, for which a Boolean function that classifies the data into the two classes is built. In other words, the extracted knowledge is embodied as a Boolean function, which then will be used to classify unknown data. As Boolean functions that classify the given data into two classes are not unique, various methodologies have been investigated in LAD to obtain compact and meaningful functions. As will be mentioned later, numerical and categorical data also can be handled, and more than two classes can be represented by combining more than one Boolean function.

Many of the concepts used in LAD have much in common with those studied in other areas, such as data mining, learning theory, pattern recognition, possibility theory, switching theory, and so forth, where different terminologies are used in each area. A more detailed description of LAD can be found in some of the references listed in the following and, in particular, in the forthcoming book (Crama & Hammer, 2004).

BACKGROUND

Let us consider a binary dataset (T, F), where T is a subset of {0,1}^n (resp., F is a subset of {0,1}^n) containing the positive (resp., negative) data (i.e., those belonging to the positive (resp., negative) class). The pair (T, F) is called a partially defined Boolean function (or, in short, a pdBf), which is the principal mathematical notion behind the theory of LAD. A Boolean function f: {0,1}^n -> {0,1} is called an extension of the pdBf (T, F) if f(x)=1 for all vectors x in T, and f(y)=0 for all vectors y in F. The construction of extensions that carry the essential information of the given dataset is the main theme of LAD. In principle, any Boolean function that agrees with the given data is a potential extension and is considered in LAD. However, as in general learning theory, simplicity and generalization power of the chosen extensions are the main objectives. To achieve these goals, LAD breaks up the problem of finding a most promising extension into a series of optimization problems, each with its own objective: first, finding a smallest subset of the variables needed to distinguish the vectors in T from those in F (finding a so-called support set); next, finding all the monomials that have the highest agreement with the given data (finding the strongest patterns); and finally, finding a best combination of the generated patterns (finding a theory). In what follows, we shall explain briefly each of these steps. More details about the theory of partially defined Boolean functions can be found in Boros, et al. (1998, 1999).

An example of a binary dataset (or pdBf) is shown in Table 1; Table 2 gives three extensions of it, among many others. Extension f1 may be considered a most compact one, as it contains only two variables, while extension f3 also may be interesting, as it reveals a monotone dependence on the involved variables (complemented variables are written with the negation sign ¬).

Table 1. An example of a pdBf (T, F)

        x1  x2  x3  x4  x5  x6  x7  x8
   T    0   1   0   1   0   1   1   0
        1   1   0   1   1   0   0   1
        0   1   1   0   1   0   0   1
   F    1   0   1   0   1   0   1   0
        0   0   0   1   1   1   0   0
        1   1   0   1   0   1   0   1
        0   0   1   0   1   0   1   0

Table 2. Some extensions of the pdBf (T, F) given in Table 1

   f1 = x5 x8 v ¬x5 ¬x8
   f2 = ¬x1 ¬x5 v x3 ¬x7 v x1 x5 ¬x7
   f3 = x5 x8 v x6 x7


MAIN THRUST

In many applications, the available data is not binary, which necessitates the generation of relevant binary features, to which the three-staged LAD analysis then can be applied. Following are some details about the main stages of LAD, including the generation of binary features, finding support sets, generating patterns, and constructing theories.

Binarization of Numerical and Categorical Data

Many of the existing datasets contain numerical and/or categorical variables. Such data also can be handled by LAD after converting them into binary data (i.e., binarization). A typical method for this is to introduce a cut point a_i for a numerical variable x_i; if x_i >= a_i holds, then the binary attribute is set to 1, and to 0 otherwise. It is possible to use more than one cut point for a variable to create one or more regions of value 1. Similar ideas also can be used for a categorical variable, defining a region converted into 1 by a subset of values of the variable. Binarization aims at finding the cut points for numerical attributes and the subsets of values for categorical attributes such that the given positive and negative observations can be distinguished with the resulting binary variables, and such that their number is as small as possible. Mathematical properties of and algorithms for binarization are studied in Boros, et al. (1997). Efficient methods for the binarization of categorical data are proposed in Boros and Mekov (2004).

Support Sets

It is desirable, from the viewpoint of simplicity, to build an extension by using as small a set of variables as possible. A subset S of the n original variables is called a support set if the projections of T and F on S still have an extension. The pdBf in Table 1 has the following minimal support sets: S1={5,8}, S2={6,7}, S3={1,2,5}, S4={1,2,6}, S5={2,5,7}, S6={2,6,8}, S7={1,3,5,7}, S8={1,4,5,7}. For example, f1 in Table 2 is constructed from S1. Several methods to find small support sets are discussed and compared in Boros, et al. (2003).

Pattern Generation

As a basic tool to construct an extension f, a conjunction of a set of literals is called a pattern of (T, F) if there is at least one vector in T satisfying this conjunction, but no vector in F satisfies it, where a literal is either a variable or its complement. In the example of Tables 1 and 2, x5 x8, ¬x5 ¬x8, ¬x1 ¬x5, ... are patterns. The notion of a pattern is closely related to the association rule, which is commonly used in data mining. Each pattern captures a certain characteristic of (T, F) and forms a part of knowledge about the data set. Several types of patterns (prime, spanned, strong) have been analyzed in the literature (Alexe et al., 2002), and efficient algorithms for enumerating large sets of patterns have been described (Alexe & Hammer, 2004; Alexe & Hammer, 2004; Eckstein et al., 2004).

Theories of a pdBf

A set of patterns that together cover T (i.e., for any x in T, there is at least one pattern that x satisfies) is called a theory of (T, F). Note that a theory defines an extension of (T, F), since the disjunction of the patterns yields such a Boolean function. The Boolean functions f_i in Table 2 correspond to theories constructed for the dataset of Table 1. Exchanging the roles of T and F, we can define co-patterns and co-theories, which play a similar role in LAD. Efforts in LAD have been directed to the construction of compact patterns and theories from (T, F), as such theories are expected to have better performance in classifying unknown data. An implementation in this direction and its results can be found in Boros, et al. (2000). The tradeoff between comprehensive and comprehensible theories is studied in Alexe, et al. (2002).

Extensions by Special Types of Boolean Functions

In many cases, it is known in advance that the given dataset has certain properties, such as a monotone dependence on the variables. To utilize such information, extensions by Boolean functions with the corresponding properties are important. For special classes of Boolean functions, such as monotone (or positive), Horn, k-DNF, decomposable, and threshold, the algorithms and complexity of finding extensions were investigated (Boros et al., 1995; Boros et al., 1998). If there is no extension in the specified class of Boolean functions, we may still want to find an extension in the class with the minimum number of errors. Such an extension is called the best-fit extension and is studied in Boros, et al. (1998).
of a set of literals is called a pattern of (T, F), if there is at


Imperfect Datasets

In practice, the available data may contain missing or erroneous bits, due to various reasons (e.g., obtaining their values is dangerous or expensive). More typically, the binary attributes obtained by binarization can carry some uncertainty; for example, for a numerical variable x_i, the binary attribute x_i >= a_i may not be evaluated with high confidence if the actual value of x_i is too close to the value of the cut point a_i (e.g., if it is within the error rate of the measurement procedure used to obtain the value of x_i). For binary datasets involving such missing/uncertain bits, three types of extensions (robust, consistent, and most robust) were defined. For example, a robust extension is one that remains an extension of (T, F) for any assignment of 0, 1 values to the missing bits. Theoretical and algorithmic studies in this direction are made in Boros, et al. (1999, 2003).

CONCLUSION

The LAD approach has been applied successfully to a number of real-world datasets. Among the published results, Boros, et al. (2000) deals with the datasets from the repository of the University of California at Irvine (http://www.ics.uci.edu/~mlearn/MLRepository.html), as well as some other pilot studies. LAD has been applied successfully to several medical problems, including the detection of ovarian cancer (Alexe et al., 2004), stratification of risk among cardiac patients (Alexe et al., 2003), and polymer design for artificial bones (Abramson et al., 2002). Among other applications of LAD, we mention those concerning economic problems, including an analysis of labor productivity in China (Hammer et al., 1999) and a recent analysis of country risk ratings (Hammer et al., 2003).

REFERENCES

Abramson, S., Alexe, G., Hammer, P.L., Knight, D., & Kohn, J. (2002). A computational approach to predicting cell growth on polymeric biomaterials. To appear in J. Biomed. Mater. Res.

Alexe, G. et al. (2004). Ovarian cancer detection by logical analysis of proteomic data. Proteomics, 4(3), 766-783.

Alexe, G., Alexe, S., Hammer, P.L., & Kogan, A. (2002). Comprehensive vs. comprehensible classifiers in LAD. To appear in Discrete Applied Mathematics.

Alexe, G., & Hammer, P.L. (2004). Spanned patterns in the logical analysis of data. Discrete Applied Mathematics.

Alexe, S. et al. (2003). Coronary risk prediction by logical analysis of data. Annals of Operations Research, 119, 15-42.

Alexe, S., & Hammer, P.L. (2004). Accelerated algorithm for pattern detection in logical analysis of data. Discrete Applied Mathematics.

Boros, E. et al. (2000). An implementation of logical analysis of data. IEEE Trans. on Knowledge and Data Engineering, 12, 292-306.

Boros, E., Gurvich, V., Hammer, P.L., Ibaraki, T., & Kogan, A. (1995). Decompositions of partially defined Boolean functions. Discrete Applied Mathematics, 62, 51-75.

Boros, E., Hammer, P.L., Ibaraki, T., & Kogan, A. (1997). Logical analysis of numerical data. Mathematical Programming, 79, 163-190.

Boros, E., Horiyama, T., Ibaraki, T., Makino, K., & Yagiura, M. (2003). Finding essential attributes from binary data. Annals of Mathematics and Artificial Intelligence, 39, 223-257.

Boros, E., Ibaraki, T., & Makino, K. (1998). Error-free and best-fit extensions of a partially defined Boolean function. Information and Computation, 140, 254-283.

Boros, E., Ibaraki, T., & Makino, K. (1999). Logical analysis of data with missing bits. Artificial Intelligence, 107, 219-264.

Boros, E., Ibaraki, T., & Makino, K. (2003). Variations on extending partially defined Boolean functions with missing bits. Information and Computation, 180(1), 53-70.

Boros, E., & Mekov, V. (2004). Exact and approximate discrete optimization algorithms for finding useful disjunctions of categorical predicates in data analysis. Discrete Applied Mathematics, 144(1-2), 43-58.

Crama, Y., & Hammer, P.L. (2004). Boolean Functions, forthcoming.

Crama, Y., Hammer, P.L., & Ibaraki, T. (1988). Cause-effect relationship and partially defined Boolean functions. Annals of Operations Research, 16, 299-326.

Eckstein, J., Hammer, P.L., Liu, Y., Nediak, M., & Simeone, B. (2004). The maximum box problem and its application to data analysis. Computational Optimization and Applications, 23(3), 285-298.

Hammer, A., Hammer, P.L., & Muchnik, I. (1999). Logical analysis of Chinese labor productivity patterns. Annals of Operations Research, 87, 165-176.


Hammer, P.L., Kogan, A., & Lejeune, M.A. (2003). A non-recursive regression model for country risk rating. RUTCOR Research Report RRR 9-2003.

KEY TERMS

Binarization: The process of deriving a binary representation for numerical and/or categorical attributes.

Boolean Function: A function from {0,1}^n to {0,1}. A function from a subset of {0,1}^n to {0,1} is called a partially defined Boolean function (pdBf). A pdBf is defined by a pair of datasets (T, F), where T (resp., F) denotes a set of data vectors belonging to the positive (resp., negative) class.

Extension: A Boolean function f that satisfies f(x)=1 for x in T and f(x)=0 for x in F, for a given pdBf (T, F).

LAD (Logical Analysis of Data): A methodology that tries to extract and/or discover knowledge from datasets by utilizing the concept of Boolean functions.

Pattern: A conjunction of literals that is true for some data vectors in T but is false for all data vectors in F, where (T, F) is a given pdBf. A co-pattern is similarly defined by exchanging the roles of T and F.

Support Set: A set of variables S for a data set (T, F) such that the projections of T and F on S still have an extension.

Theory: A set of patterns of a pdBf (T, F), such that each data vector in T has a pattern satisfying it.


The Lsquare System for Mining Logic Data


Giovanni Felici
Istituto di Analisi dei Sistemi ed Informatica (IASI-CNR), Italy

Klaus Truemper
University of Texas at Dallas, USA

INTRODUCTION

The method described in this chapter is designed for data mining and learning on logic data. This type of data is composed of records that can be described by the presence or absence of a finite number of properties. Formally, such records can be described by variables that may assume only the values true or false, usually referred to as logic (or Boolean) variables. In real applications, it may also happen that the presence or absence of some property cannot be verified for some record; in such a case, we consider that variable to be unknown (the capability to treat formally data with missing values is a feature of logic-based methods). For example, to describe patient records in medical diagnosis applications, one may use the logic variables healthy, old, has_high_temperature, among many others. A very common data mining task is to find, based on training data, the rules that separate two subsets of the available records, or explain the belonging of the data to one subset or the other. For example, one may desire to find a rule that, based on the many variables observed in patient records, is able to distinguish healthy patients from sick ones. Such a rule, if sufficiently precise, may then be used to classify new data and/or to gain information from the available data. This task is often referred to as machine learning or pattern recognition and accounts for a significant portion of the research conducted in the data mining community. When the data considered is in logic form, or can be transformed into it by some reasonable process, it is of great interest to determine explanatory rules in the form of combinations of logic variables, or logic formulas. In the example above, a rule derived from data could be:

if (has_high_temperature is true) and (running_nose is true) then (the patient is not healthy).

Clearly, such rules convey a lot of information and can be easily understood and interpreted by domain experts. Despite the apparent simplicity of this setting, the problem of determining, if possible, a logic formula that holds true for all the records in one set while it is false for all records in another set can become extremely difficult when the dimension involved is not trivial, and many different techniques and approaches have been proposed in the literature. In this article, we describe one of them, the Lsquare System, developed in collaboration between IASI-CNR and UTD and described in detail in Felici & Truemper (2002), Felici, Sun, & Truemper (2004), and Truemper (2004). The system is freely distributed for research and study purposes at www.leibnizsystem.com.

Data mining in logic domains is becoming a very interesting topic for both research and applications. The motivations for the study of such models are frequently found in real life situations where one wants to extract usable information from data expressed in logic form. Besides medical applications, these types of problems often arise in marketing, production, banking, finance, and credit rating. A quick scan of the updated Irvine Repository (see Murphy & Aha, 1994) is sufficient to show the relevance of logic-based models in data mining and learning applications. The literature describes several methods that address learning in logic domains, for example, the very popular decision trees (Breiman et al., 1984), the highly combinatorial approach of Boros et al. (1996), the interior point method of Kamath et al. (1992), or the partial enumeration scheme proposed by Triantaphyllou & Soyster (1996). While the problem formulation adopted by Lsquare is somewhat related to the work in Kamath et al. (1992) and Triantaphyllou et al. (1994), substantial differences are found in the solution method adopted. Most of the methods considered in this area are of an intrinsically deterministic nature, being based on the formal description of a problem in mathematical form and on its solution by a specific algorithm. Nevertheless, some real life situations present uncertainty and errors in the data that are often successfully dealt with by the use of fuzzy set and fuzzy membership theory. In such cases, the proposed system may embed the uncertainty and the fuzziness of the data in a pre-processing step, providing fuzzy functions that determine the value of the Boolean variables.


BACKGROUND

This section provides the needed definitions and concepts.

Propositional Logic, SAT and MINSAT

A Boolean variable may take on the values true or false. One or more Boolean variables can be assembled in a Boolean formula using the operators ∧, ∨, and ¬. A Boolean formula where Boolean variables, possibly negated, are joined by the operator ∧ (resp. ∨) is called a conjunction (resp. disjunction). A conjunctive normal form system (CNF) is a Boolean formula where some disjunctions are joined with the operator ∧. A disjunctive normal form system (DNF) is a Boolean formula where some conjunctions are joined with the operator ∨. Each disjunction (resp. conjunction) of a CNF (resp. DNF) system is called a clause. A Boolean formula is satisfiable if there exists an assignment of true/false values to its Boolean variables so that the value of the formula is true. The problem of deciding whether a CNF formula is satisfiable is known as the satisfiability problem (SAT). In the affirmative case, one also must find an assignment of true/false values for the Boolean variables of the CNF that make the formula true. A variation of SAT is the minimum cost satisfiability problem (MINSAT), where rational costs are associated with the true value of each logic variable and the solution to be determined has minimum sum of costs for the variables with value true.

Logic Data and Logic Separation

We consider {0, +/-1} vectors of given length n, each of which has an associated outcome with value true or false. We call these vectors records of logic data and view them as an encoding of logic information. A +1 in a record means that a certain Boolean variable has value true, and a -1 that the variable has value false. The value 0 is used for unknown. The outcome is considered to be the value of a Boolean variable t that we want to explain or predict. We collect the records for which the property t is present in a set A, and those for which t is not present in set B. For ease of recognition, we usually denote a member of A by a, and of B by b. The Lsquare system deduces {0, +/-1} separating vectors that effectively represent logic formulas and may be used to compute for each record the associated outcome, that is, to separate the records in A from the records in B. A separating set is a collection of separating vectors. The separation of A and B makes sense only when both A and B are non-empty, and when each record of A or B contains at least one {+/-1} entry. Consider two records of logic data, say g and f. We say that f is nested in g if for any entry fi of f equal to +1 or -1, the corresponding entry gi of g satisfies gi = fi. It is easy to show that if A and B are sets of {0, +/-1} records of the same length, a separating set S exists if and only if no record a ∈ A is nested in any record b ∈ B. A clear characterization of separating vectors can thus be given: a {0, +/-1} vector s separates a record a ∈ A from B if:

s is not nested in any b ∈ B (1)
s is nested in a (2)

Accordingly, we say that a separating set S separates B from A if, for each a ∈ A, there exists a separating vector s ∈ S that separates a from B.

MAIN THRUST

The problem of finding a separating set for A and B is decomposed into a sequence of subproblems, each of which identifies a vector s that separates a nonempty subset of A from B by solving two minimization problems. To formulate these problems we introduce Boolean variables linked with the elements si of the vector s to be found. More precisely, we introduce Boolean variables pi and qi and state that si = +1 if pi = True and qi = False, si = -1 if pi = False and qi = True, and si = 0 if pi = qi = False. The case pi = qi = True is ruled out by enforcing the following conditions:

¬pi ∨ ¬qi, i = 1, 2, ..., n (3)

We can express the separation conditions (1) and (2) with the Boolean variables pi and qi. For (1) we have that s must not be nested in any b ∈ B. Defining b+ as the set of indices i for which bi of b is equal to +1, that is, b+ = {i | bi = +1}, and b- = {i | bi = -1}, b0 = {i | bi = 0}, we summarize condition (1) writing that

(∨ i∈(b+ ∪ b0) qi) ∨ (∨ i∈(b- ∪ b0) pi), ∀ b ∈ B (4)

For condition (2) we have to enforce that, if s separates a from B, then s is nested in a. In order to do so we introduce a new Boolean variable da that determines whether s must separate a from B. That is, da = true means that s need not separate a from B, while da = false requires that separation. For a ∈ A, the separation condition is:

¬qi ∨ da, i ∈ (a+ ∪ a0)
¬pi ∨ da, i ∈ (a- ∪ a0) (5)
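As a small illustration (not part of the Lsquare implementation; the function names and the tiny data sets below are ours), the following Python sketch encodes {0, +/-1} records and checks the nestedness conditions (1) and (2) for a candidate separating vector.

```python
# Records of logic data are vectors over {0, +1, -1}:
# +1 = variable true, -1 = variable false, 0 = unknown.

def is_nested(f, g):
    """f is nested in g if every nonzero entry of f matches the same entry of g."""
    return all(fi == 0 or fi == gi for fi, gi in zip(f, g))

def separates(s, a, B):
    """s separates record a (from set A) from the set B:
    (1) s is not nested in any b in B, and (2) s is nested in a."""
    return all(not is_nested(s, b) for b in B) and is_nested(s, a)

# Tiny example with n = 3 variables.
A = [(+1, -1, 0), (+1, 0, +1)]   # records with property t
B = [(-1, -1, 0), (0, +1, -1)]   # records without property t

s = (+1, 0, 0)                   # candidate separating vector
print([separates(s, a, B) for a in A])   # which records of A does s separate?
```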


We now formulate the problem of determining a vector s that separates as many a ∈ A from B as possible (this amounts to a satisfying solution for (3)-(5) that assigns value true to as few variables da as possible). For each a ∈ A, define a rational cost ca that is equal to 1 if da is true, and equal to 0 otherwise. Using these costs and (3)-(5), the desired s may be found by solving the following MINSAT problem, with variables da, a ∈ A, and pi, qi, i = 1, 2, ..., n.

min Σ a∈A ca
s.t. ¬pi ∨ ¬qi, i = 1, 2, ..., n
     (∨ i∈(b+ ∪ b0) qi) ∨ (∨ i∈(b- ∪ b0) pi), ∀ b ∈ B        (6)
     ¬qi ∨ da, ∀ a ∈ A, i ∈ (a+ ∪ a0)
     ¬pi ∨ da, ∀ a ∈ A, i ∈ (a- ∪ a0)

The solution of problem (6) identifies an s and a largest subset A′ = {a ∈ A | da = False} that is separated from B by s. Restricting the problem to A′ and using costs c(pi) and c(qi) associated with the variables pi and qi, we define a second objective function and solve the problem again, obtaining a separating vector whose properties depend on the c(pi) and c(qi). A simple example of the role of cost values c(pi) and c(qi) is the following. Assume that c(pi) and c(qi) assign cost of 1 when pi and qi are true and cost 0 when they are false. The separating vector will then use the minimum number of logic variables, that is, the minimum amount of information contained in the data, to separate the sets. On the opposite, if we assign cost 0 for true and 1 for false, it will use the maximum amount of information to define the separating sets. If one separating vector is not sufficient to separate all of set A, we set A ← A - A′ and iterate. The disjunction of all separating vectors constitutes the final logic formula that separates A from B.

The Leibniz System

The MINSAT problems considered belong to the class of NP-hard problems, which means, shortly, that the time needed for their solution cannot, unless P = NP, be bounded by a polynomial function in the size of the input, as explained by the modern theory of computational complexity (Garey & Johnson, 1979). We solve the MINSAT instances with the Leibniz System, a logic programming solver developed at the University of Texas at Dallas. Such solver is based on decomposition and combinatorial optimization results described in Truemper (1998).

Error Control and Voting System

In learning problems one views the sets A and B as training sets, and considers them to be subsets of sets Ā and B̄, where Ā consists of all {0, +/-1} records of length n with property t, and B̄ consists of all such records without property t. One then determines a set S that separates B from A, and uses that set to guess whether a new vector r is in Ā or B̄. That is, we guess r to be in A if at least one s ∈ S is nested in r, and to be in B otherwise. Of course, the classification of r based on S is correct if r is in A or B, but otherwise need not be correct. Specifically, we may guess a record of Ā - A to be in B, and a record of B̄ - B to be in A. Usually such errors are referred to as type I errors and type II errors, respectively. The utility of S depends on which type of error is made how many times. In some settings, an error of one of the two types is bad, but an error of the other type is worse. For example, a non-invasive diagnostic system for cancer that claims a case to be benign when a malignancy is present has failed badly. On the other hand, prediction of a malignancy for an actually benign case triggers additional tests, and thus is annoying but not nearly as objectionable as an error of the first type. We can influence the extent of type I and type II errors by an appropriate choice of the objective function c(pi) and c(qi). As anticipated, when c(pi) = c(qi) = 1 for all i = 1, ..., n, the formulas determined by the solution of the sequence of MINSAT problems will have minimal support, that is, will try to use the minimum number of variables to separate A from B. On the other hand, setting c(pi) = c(qi) = -1 for all i = 1, ..., n, we will obtain the opposite effect, that is, a formula with maximum support. If we use a single vector s to classify a vector r, then we guess r to be in A if s is nested in r. The latter condition tends to become less stringent when the number of nonzero entries in s is reduced. Hence, we heuristically guess that a solution vector s with minimum support tends to avoid type I errors. Conversely, an s with maximum support tends to avoid type II errors. We apply this heuristic argument to the separating set S produced under one of the two choices of objective functions, and thus expect that a set S with minimum (resp. maximum) support tends to avoid type I (resp. II) errors. The above considerations are combined with a sophisticated voting scheme embedded in Lsquare, partly inspired by the notion of stacked generalization originally described in Wolpert (1992) (see also Breiman, 1996). Such scheme refines the separating procedure described in the previous section by a resampling technique of the logic records available for training. For each subsample we determine 4 separating formulas, by switching the roles of A and B in the MINSAT formulation and by switching the signs of the cost coefficients. Then, each formula so determined is used to produce a vote for each element of A and B, the vote being +1 (resp. -1) if the element is recognized to belong to A (resp. B). The sum of all the votes so obtained is the vote total V. If the vote V is positive (resp. negative), then the record belongs to A (resp. to B). If A and B are representative subsets of Ā and B̄, then the


votes V for records of Ā (resp. of B̄) tend to be positive (resp. negative). To assess the reliability of the classification of records of Ā and B̄, Lsquare computes two estimated probability distributions. For any odd integer k, the first (resp. second) distribution supplies an estimate α (resp. β) of the probability that the vote V for a record of Ā (resp. B̄) is less (resp. greater) than k. For example, if one declares records with votes V greater than a given k to be in B, then α is an estimate of the probability that a record of Ā will be misclassified. Such threshold k can be chosen in such a way that the total estimated error (α + β) is small but also balanced in the desired direction, according to what error is more serious for the specific application. An extensive treatment of this feature and of its statistical and mathematical background may be found in Felici, Sun, & Truemper (2004).

FUTURE TRENDS

Data mining in the logic setting presents several interesting features, amongst which is the interpretability of its results, that are making it a more and more appreciated tool. The mathematical problems generated by this type of data mining are typically very complex and require very powerful computing machines and sophisticated solution approaches. All efforts made to design and test new efficient algorithmic frameworks to solve mathematical problems related with data mining in logic will thus be a rewarding topic for future research. Moreover, the size of the available databases is increasing at a high rate, and these methods suffer from both the computational complexity of the algorithm and from the explosion of the dimensions often related with the conversion of rational variables into Boolean ones. The solution of data mining in the logic setting in an effective and scalable way is thus an exciting challenge. Only recently, researchers have started to consider distributed and grid computing as a potential key feature to win it.

CONCLUSION

In this chapter we have considered a particular type of data mining and learning problem and have described one method to solve such problems. Despite their intrinsic difficulty, these problems are of high interest for their applicability and for their theoretical properties. For brevity, we have omitted many details and also results and comparisons with other methods on real life and test instances. Extensive information on the behavior of Lsquare on many data sets coming from the Irvine Repository (Murphy & Aha, 1994) is available in Felici & Truemper (2002) and Felici, Sun, and Truemper (2004).

REFERENCES

Boros, E. et al. (1996). An implementation of logical analysis of data. RUTCOR Research Report 29-96. NJ: Rutgers University.

Breiman, L. (1996). Stacked regressions. Machine Learning, 24, 49-64.

Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Wadsworth International.

Felici, G., Sun, F-S., & Truemper, K. (2004). A method for controlling errors in two-class classifications. IASI Technical Report n. 980, November, 1998.

Felici, G., & Truemper, K. (2002). A Minsat approach for learning in logic domains. INFORMS Journal on Computing, 14(1), 20-36.

Garey, M.R., & Johnson, D.S. (1979). Computers and intractability: A guide to the theory of NP-completeness. San Francisco: Freeman.

Kamath, A.P., Karmarkar, N.K., Ramakrishnan, K.J., & Resende, M.G.C. (1992). A continuous approach to inductive inference. Mathematical Programming, 57, 215-238.

Murphy, P.M., & Aha, D.W. (1994). UCI repository of machine learning databases: Machine readable data repository. Department of Computer Science, University of California, Irvine.

Triantaphyllou, E., Allen, L., Soyster, L., & Kumara, S.R.T. (1994). Generating logical expressions from positive and negative examples via a branch-and-bound approach. Computers and Operations Research, 21, 185-197.

Triantaphyllou, E., & Soyster, L. (1996). On the minimum number of logical clauses inferred from examples. Computers and Operations Research, 21, 783-799.

Truemper, K. (1998). Effective logic computation. New York: Wiley-Interscience.

Truemper, K. (2004). Design of logic-based intelligent systems. New York: Wiley.

Truemper, K. (2004). The Leibniz system for logic programming. Version 7.1 Leibniz. Plano, Texas.

Wolpert, D.H. (1992). Stacked generalization. Neural Networks, 5, 241-259.


KEY TERMS

Classification Error: Number of elements that are classified in the wrong class by a classification rule. In two-class problems, the classification error is divided into the so-called false positive and false negative.

Combinatorial Optimization: Branch of optimization in applied mathematics and computer science related to algorithm theory and computational complexity theory.

Error Control: Selection of a classification rule in such a way to obtain a desired proportion of false positive and false negative.

Logic Data Mining: The application of data mining techniques where both data and extracted information are expressed by logic variables.

MINSAT: Given a Boolean expression E and a cost function on the variables, determine, if it exists, an assignment to the variables in E such that E is true and the sum of the costs is minimum.

Propositional Logic: A system of symbolic logic based on the symbols and, or, not to stand for whole propositions and logical connectives. It only considers whether a proposition is true or false.

Voting: Classification method based on the combination of several different classifiers obtained by different methods and/or different data.

Marketing Data Mining


Victor S.Y. Lo
Fidelity Personal Investments, USA

INTRODUCTION

Data mining has been widely applied over the past two decades. In particular, marketing is an important application area. Many companies collect large amounts of customer data to understand their customers' needs and predict their future behavior. This article discusses selected data mining problems in marketing and provides solutions and research opportunities.

BACKGROUND

Analytics are heavily used in two marketing areas: market research and database marketing. The former addresses strategic marketing decisions through the analysis of survey data, and the latter handles campaign decisions through the analysis of behavioral and demographic data. Due to the limited sample size of a survey, market research normally is not considered data mining. This article focuses on database marketing, where data mining is used extensively to maximize marketing return on investment by finding the optimal targets. A typical application is developing response models to identify likely campaign responders. As summarized in Figure 1, a previous campaign provides data on the dependent variable (responded or not), which is merged with individual characteristics, including behavioral and demographic variables, to form an analyzable data set. A response model is then developed to predict the response rate given the individual characteristics. The model is then used to score the population and predict response rates for all individuals. Finally, the best list of individuals will be targeted in the next campaign in order to maximize effectiveness and minimize expense.

Response modeling can be applied in the following activities (Peppers & Rogers, 1997, 1999):

(1) Acquisition: Which prospects are most likely to become customers?
(2) Development: Which customers are most likely to purchase additional products (cross-selling) or to add monetary value (up-selling)?
(3) Retention: Which customers are most retainable? This can be relationship or value retention.

MAIN THRUST

Standard database marketing problems have been described in literature (e.g., Hughes, 1994; Jackson & Wang, 1996; and Roberts & Berger, 1999). In this article, I describe the problems that are infrequently mentioned in the data mining literature as well as their solutions and research opportunities. These problems are embedded in various components of the campaign process, from campaign design to response modeling to campaign optimization; see Figure 2. Each problem is described in the Problem-Solution-Opportunity format.

CAMPAIGN DESIGN

The design of a marketing campaign is the starting point of a campaign process. It often does not receive enough attention in data mining. If a campaign is not designed properly, postcampaign learning can be infeasible (e.g., insufficient sample size). On the other hand, if a campaign is scientifically designed, learning opportunities can be maximized.

The design process includes activities such as determining the sample size for treatment and control groups (both have to be sufficiently large such that measurement and modeling, when required, are feasible), deciding on sampling methods (pure or stratified random

Figure 1. Response modeling process (Collect data → Develop model → Score population → Select best targets)


Figure 2. Database marketing campaign process (stages: Design, Execute, Measure, Model, Optimize)

sampling), and creating a cell design structure (testing SAS with estimable main effects, all two-way interac-
various offers and also by age, income, or other vari- tion effects, and quadratic effects on quantitative vari-
ables). I focus on the latter here. ables generates a design of 256 cells, which may still be
considered large. An optimal design using PROC OPTEX
Problem 1 with the same estimable effects generates a design of
only 37 cells (see Table 2; refer to Table 1 for attribute
Classical designs often test one variable at a time. For level definitions).
example, in a cell phone direct mail campaign, you may Fractional factorial design has been used in credit
test a few price levels of the phone. After launching the card acquisition but is not widely used in other indus-
campaign and uncovering the price level that led to the tries for marketing. Two reasons are (a) a lack of experi-
highest revenue, another campaign is launched to test a mental design knowledge and experience and (b) busi-
monthly fee, a third campaign tests the direct mail ness process requirement tight coordination with list
message, and so forth. A more efficient way is to struc- selection and creative design professionals is required.
ture the cell design such that all these variables are
testable in one campaign. Consider an example: A credit Opportunity 1
card company would like to determine the best combi-
nation of treatments for each prospect; the treatment Mayer and Sarkissien (2003) proposed using individual
attributes and attribute levels are summarized in Table 1. characteristics as attributes in the optimal design, where
The number of all possible combinations = 44 x 22 = individuals are chosen optimally. Using both indi-
1024 cells, which is not practical to test. vidual characteristics and treatment attributes as design
attributes is theoretically interesting. In practice, we
Solution 1 should compare this optimal selection of individuals with
stratified random sampling. Simulation and theoretical
To reduce the number of cells, a fractional factorial and empirical studies are required to evaluate this idea.
design can be applied (full factorial refers to the design Additionally, if many individual variables (say, hundreds)
that includes all possible combinations); see Montgom- are used in the design, then constructing an optimal
ery (1991) and Almquist and Wyner (2001). Two types design may be very computationally intensive due to the
of fractional factorials are a) an orthogonal design large design matrix and thus the design may require a
where all attributes are made orthogonal (uncorrelated) unique optimization technique to solve.
with each other and b) an optimal design where a certain
criterion related to the variance-covariance matrix of
parameter estimates is optimized; see Kuhfeld (1997, RESPONSE MODELING
2004) for the applications of SAS PROC FACTEX and
PROC OPTEX in market research. (Kuhfelds market Problem 2
research applications are also applicable to database
marketing). As stated in the introduction, response modeling uses
For the preceding credit card problem, an orthogo- data from a previous marketing campaign to identify
nal fractional factorial design using PROC FACTEX in

Table 1. Example of design attributes and attribute levels

Attribute Attribute level


APR 5.9% (0), 7.9% (1), 10.9% (2), 12.9% (3)
Credit limit $2,000 (0), $5,000 (1), $7,000 (2), $10,000 (3)
Color Platinum (0), Titanium (1), Ruby (2), Diamond (3)
Rebate None (0), 0.5% (1), 1% (2), 1.5% (3)
Brand SuperCard (0), AdvantagePlus (1)
Creative A (0), B (1)
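To see why the full factorial behind Table 1 is impractical to field, the short Python sketch below (an illustration added here, not taken from the article; the attribute lists simply transcribe Table 1) enumerates all 4 x 4 x 4 x 4 x 2 x 2 = 1,024 cells and draws a small random fraction. A real orthogonal or optimal fraction would instead be constructed with design-of-experiments software such as the SAS procedures cited in Solution 1.

```python
import itertools
import random

# Attribute levels from Table 1 (level codes shown in parentheses in the table).
attributes = {
    "APR": ["5.9%", "7.9%", "10.9%", "12.9%"],
    "Credit limit": ["$2,000", "$5,000", "$7,000", "$10,000"],
    "Color": ["Platinum", "Titanium", "Ruby", "Diamond"],
    "Rebate": ["None", "0.5%", "1%", "1.5%"],
    "Brand": ["SuperCard", "AdvantagePlus"],
    "Creative": ["A", "B"],
}

# Full factorial: every possible treatment combination.
full_factorial = list(itertools.product(*attributes.values()))
print(len(full_factorial))        # 1024 cells, far too many to test directly

# A naive fraction: a simple random subset of cells (illustrative only; it does
# not guarantee the orthogonality or optimality discussed in Solution 1).
random.seed(0)
fraction = random.sample(full_factorial, 37)  # same size as the optimal design in Table 2
for cell in fraction[:3]:
    print(dict(zip(attributes, cell)))
```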


Table 2. An optimal design example


Cell Apr Limit Rebate Color Brand Creative Cell Apr Limit Rebate Color Brand Creative
1 0 0 0 3 0 1 21 2 1 3 1 1 1
2 0 0 3 3 1 1 22 2 3 0 1 0 1
3 0 0 3 3 0 0 23 3 0 0 3 1 0
4 0 0 0 2 1 1 24 3 0 3 3 0 1
5 0 0 0 2 0 0 25 3 0 3 2 1 1
6 0 0 0 1 0 1 26 3 0 3 2 0 0
7 0 0 0 0 1 0 27 3 0 0 1 1 1
8 0 0 3 0 0 1 28 3 0 0 0 0 0
9 0 1 3 2 1 0 29 3 0 3 0 1 0
10 0 3 0 3 1 0 30 3 1 0 2 0 1
11 0 3 3 3 0 1 31 3 1 2 1 0 0
12 0 3 3 2 0 1 32 3 3 0 3 1 1
13 0 3 2 1 1 1 33 3 3 0 3 0 0
14 0 3 3 1 0 0 34 3 3 3 3 1 0
15 0 3 0 0 0 1 35 3 3 0 2 1 0
16 0 3 3 0 1 1 36 3 3 0 0 1 0
17 1 0 3 1 1 0 37 3 3 3 0 0 1
18 1 2 0 1 1 0
19 1 3 2 0 0 0
20 2 0 1 0 1 1

likely responders given individual characteristics (Berry & Solution 2


Linoff, 1997, 2000; Jackson & Wang, 1996; Rud, 2001).
Treatment and control groups were set up in the campaign The key issue is that fitting Model 1 and applying it to
for measurement. The methodology typically uses treat- find the best individuals is inappropriate. The correct
ment data to identify characteristics of customers who are way is to find those who are most positively influenced
likely to respond. In other words, for individual iW, by the campaign. Doing so requires you to predict lift:
P(Y i = 1| X i; treatment) P(Y i = 1| Xi; control) for each
P(Y i = 1| Xi; treatment) = f(Xi), (1) i and then select those with the highest values of esti-
mated lift. Alternative solutions to this problem are:
where W represents all the individuals in the campaign,
Yi is the dependent variable (defined by the campaigns (1) Fitting two separate treatment and control
call to action, indicating whether or not those people models: This solution is relatively straightfor-
responded), Xi is a vector of independent variables that ward except that estimated lift can be sensitive to
represent individual characteristics, and f(.) is the func- statistically insignificant differences in param-
tional form. For examples, in logistic regression, f(.) is a eters of the treatment and control models.
logistic function of Xi (Hosmer & Lemeshow, 1989); in (2) Using dummy variable and interaction effects:
decision trees, f(.) is a step function of Xi (Zhang & Singer, Lo (2002) proposed using the independent vari-
1999); and in neural network, f(.) is a nonlinear function of ables X i, Ti, and Xi*Ti, where Ti equals 1 if i is in
Xi (Haykin, 1999). the treatment group but otherwise equals 0, and
A problem may arise when we apply Model (1) to the modeling the response rate by using a logistic
next campaign. Depending on the industry and product, regression:
the model may select individuals who are likely to respond
regardless of the marketing campaign. The results of the
new campaign may show equal or similar (treatment minus control) performance in model and random cells; see Table 3 for an example.

Table 3. Sample campaign measurement of model effectiveness (response rates)

         Treatment (e.g., mail)   Control (e.g., no mail)   Incremental (treatment minus control)
Model    1%                       0.8%                      0.2%
Random   0.5%                     0.3%                      0.2%

Pi = exp(α + β'Xi + γTi + δ'XiTi) / [1 + exp(α + β'Xi + γTi + δ'XiTi)]   (2)

In Equation 2, α denotes the intercept, β is a vector of parameters measuring the main effects of the independent variables, γ denotes the main treatment effect, and δ measures additional effects of the independent variables due to treatment. (See Russek-Cohen and Simon (1997) for similar problems in biomedicine.)
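To make this formulation concrete, the sketch below (our own illustration on synthetic data, not code from the article; all names and numbers are invented) fits the single logistic regression of Equation (2) with main effects, a treatment dummy, and X*T interactions, and then scores each individual's predicted lift, the treatment/control difference that Equation (3) below formalizes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic campaign data: X = individual characteristics, T = treatment flag,
# y = response indicator (purely illustrative).
n = 5000
X = rng.normal(size=(n, 3))
T = rng.integers(0, 2, size=n)
logit = -2.0 + 0.5 * X[:, 0] + 0.3 * T + 0.8 * T * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Design matrix with main effects, treatment dummy, and X*T interactions (Equation 2).
design = np.column_stack([X, T, X * T[:, None]])
model = LogisticRegression(max_iter=1000).fit(design, y)

# Predicted lift: P(response | treated) - P(response | not treated).
as_treated = np.column_stack([X, np.ones(n), X])
as_control = np.column_stack([X, np.zeros(n), np.zeros_like(X)])
lift = model.predict_proba(as_treated)[:, 1] - model.predict_proba(as_control)[:, 1]

# Target the individuals with the largest predicted lift.
top_targets = np.argsort(lift)[::-1][:500]
print(lift[top_targets][:5])
```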

To predict the lift, use the following equation:

Pi|treatment - Pi|control
= exp(α + γ + β'Xi + δ'Xi) / [1 + exp(α + γ + β'Xi + δ'Xi)] - exp(α + β'Xi) / [1 + exp(α + β'Xi)]   (3)

That is, we can use the parameter estimates from Model (2) in Equation (3) to predict the lift for each i in a new data set. Individuals with high positive predicted lift values will be selected for the next campaign. We can apply a similar technique of using a dummy variable and interactions in other supervised learning algorithms, such as neural networks. See Lo (2002) for a detailed description of the methodology.

(3) Decision tree approach: To maximize lift, the only known commercial software is Quadstone, based on decision trees where each split at every parent node is processed to maximize the difference between lift (treatment minus control response rates) in the left and right children nodes; see Radcliffe and Surry (1999) for their Uplift Analysis.

Opportunity 2

(1) Alternative modeling methods for this lift-based problem should exist, and readers are encouraged to uncover them. (2) As in typical data mining, the number of potential predictors (independent variables) is large. Variable reduction techniques are available for standard problems (i.e., finding the best subset associated with the dependent variable), but a unique variable reduction technique for the lift-based problem is needed (i.e., finding the best subset associated with the lift).

Problem 3

An extension of Problem 2 is the presence of multiple treatments. Data can be obtained from an experimentally designed campaign where multiple treatment attributes were tested. The question is how to incorporate treatment attributes and individual characteristics in the same response model.

Solution 3

An extension of Equation (2) is to incorporate both treatment attributes and individual characteristics in the same lift-based response model:

Pi = exp(α + β'Xi + γTi + δ'XiTi + λ'Zi + θ'ZiXi) / [1 + exp(α + β'Xi + γTi + δ'XiTi + λ'Zi + θ'ZiXi)]   (4)

where Zi is the vector of treatment attributes, and λ and θ are additional parameters (Zi = 0 if Ti = 0). In practice, Equation (4) can have many variables including many interaction effects, and thus multicollinearity may be a concern. Developing two separate treatment and control models may be more practical. Also, unlike the single-treatment problem, no commercial software is known to solve this problem directly.

Opportunity 3

Similar to Opportunity 2, alternative methods to handle multiple-treatment response modeling should exist.

OPTIMIZATION

Marketing optimization refers to delivering the best treatment combination to the right people so as to optimize certain criteria subject to constraints. The step naturally follows campaign design and response modeling, especially when multiple treatment combinations were tested in the previous campaign(s), and the response model score for each treatment combination and each individual is available (see Problem 3).

Problem 4

The main issue is that the number of decision variables equals the number of individuals in the targets (n) times the number of treatment combinations (m). For example, in an acquisition campaign the target population may have 30 million individuals and the number of treatment combinations may be 100; then the number of decision variables is n times m, which equals 3 billion. Consider the following binary integer programming problem (Bertsimas & Tsitsiklis, 1997):

Maximize Σi Σj πij xij
subject to:
Σi xij ≤ max. # of individuals receiving treatment combination j,
Σi Σj cij xij ≤ expense budget, plus other relevant constraints,
xij = 0 or 1,
where πij = incremental value received by sending treatment combination j to individual i,
cij = cost of sending treatment combination j to individual i.   (5)
cij = cost of sending treatment comb. j to individual i. (5)


In Model (5), ij is linked to the predicted incremental Solution 5


response rate of individual i and treatment combination j.
For example, if ij is the predicted incremental response The simplest way is to perform sensitivity analysis of
rate (i.e., lift), the objective function is to maximize the total response rates on optimization. The downside is that if the
number of incremental responders; if ij is incremental variability is high and/or the number of treatment combi-
revenue, it may be a constant times the incremental re- nations is large, the large number of possibilities will make
sponse rate. Similar forms of Model 5 appear in Storey and it difficult. An alternative is to solve the stochastic ver-
Cohen (2002) and Ansari and Mela (2003); here, however, sion of the optimization problem by using stochastic
I am maximizing incremental value or responders. In the programming (see Birge & Louveaux, 1997). Solving such
literature, similar formulations for aggregate-level mar- a large stochastic programming problem typically relies
keting optimization, such as Stapleton, Hanna, and on the Monte Carlo simulation method (Ahmed & Shapiro,
Markussen (2003) and Larochelle and Sanso (2000), are 2002) and can easily become a large project.
common, but not for individual-level optimization.
Opportunity 5
Solution 4
The two alternatives in Solution 5 may not be practical
One way to address such a problem is through heuristics, for large optimization problems. Thus, researchers have
which do not guarantee global optimum. For example, to an opportunity to develop a practical solution. You may
reduce the size of the problem, you may group the large consider using robust optimization, which incorporates
number of individuals in the target population into clus- the trade-off between optimal solution and conserva-
ters and then apply linear programming to solve at the tism in linear programming; see Ben-Tal, Margalit, and
cluster level. This is exactly the first stage outlined in Nemirovski (2000) and Bertsimas and Sim (2004).
Storey and Cohen (2002). The second stage of their
approach is to optimize at the individual level for each
of the optimal clusters obtained from the first stage. FUTURE TRENDS
SAS has implemented this methodology in their new
software known as Marketing Optimization. For exact (1) Experimental design is expected to be more uti-
solutions, consider Marketswitch at lized to maximize learning.
www.marketswitch.com, where the problem is solved (2) More marketers are now aware of the lift problem
with an exact solution through a mathematical transfor- in response modeling (see Solution 2), and alter-
mation. The software has been applied at various compa- native methods will be developed to address the
nies (Leach, 2001). problem.
(3) Optimization can be performed across campaigns,
Opportunity 4 channels, customer segments, and business initia-
tives so that a more global optimal solution is
(1) Simulation and empirical studies can be used to achieved. This requires not only data and response
compare existing heuristics to exact global optimiza- models but also coordination and cultural shift.
tion. (2) General heuristics such as simulated annealing,
genetic algorithms, and tabu search can be attempted
(Goldberg, 1989; Michalewicz & Fogel, 2002). CONCLUSION
Problem 5 Several problems with suggested solutions and opportu-
nities for research have been described in this article.
Predictive models never produce exact results. Random They are not frequently mentioned in standard literature
variation of predicted response rates can be estimated in but are very valuable in marketing. The solutions involve
various ways. For example, in the holdout sample, the multiple areas, including data mining, statistics, opera-
modeler can compute the standard deviation of lift by tions research, database marketing, and marketing sci-
decile. More advanced methods, such as bootstrapping, ence. Researchers are encouraged to evaluate current
can also be applied to assess variability (Efron & solutions and develop new methodologies.
Tibshirani, 1998). Then how should the variability be
accounted for in the optimization stage?


REFERENCES Kuhfeld, W. (2004). Experimental design, efficiency, cod-


ing, and choice designs. SAS Technical Note, TS-694C. M
Ahmed, S., & Shapiro, A. (2002). The sample average Larochelle, J., & Sanso, B. (2000). An optimization model
approximation method for stochastic programs with inte- for the marketing-mix problem in the banking industry.
ger recourse (Tech. Rep.). Georgia Institute of Technol- INFOR, 38(4), 390-406.
ogy, School of Industrial & Systems Engineering.
Leach, P. (2001, August). Capital One optimizes customer
Almquist, E., & Wyner, G. (2001, October). Boost your relationships. 1to1 Magazine.
marketing ROI with experimental design. Harvard Busi-
ness Review, 135-141. Lo, V. S. Y. (2002). The true-lift model a novel data
mining approach to response modeling in database mar-
Ansari, A., & Mela, C. (2003). E-customization. Journal of keting. SIGKDD Explorations, 4(2), 78-86.
Marketing Research, XL, 131-145.
Mayer, U. F., & Sarkissien, A. (2003). Experimental design
Ben-Tal, A., Margalit, T., & Nemirovski, A. (2000). Robust for solicitation campaigns. Proceedings of the Ninth ACM
modeling of multi-stage portfolio problems. In H. Frenk, K. SIGKDD International Conference on Knowledge Dis-
Roos, T. Terlaky, & S. Zhang (Eds.), High performance covery and Data Mining (pp. 717-722).
optimization. Kluwer.
Michalewicz, Z., & Fogel, D. B. (2002). How to solve it:
Berry, M. J. A., & Linoff, G. S. (1997). Data mining Modern heuristics. Springer.
techniques: For marketing, sales, and customer sup-
port. Wiley. Montgomery, D. C. (1991). Design and analysis of
experiments (3rd ed.). Wiley.
Berry, M. J. A., & Linoff, G. S. (2000). Mastering data
mining. Wiley. Peppers, D., & Rogers, M. (1997). Enterprise one-to-
one. Doubleday.
Bertsimas, D., & Sim, M. (2004). The price of robust-
ness. Operations Research, 52(1), 35-53. Peppers, D., & Rogers, M. (1999). The one-to-one
fieldbook. Doubleday.
Bertsimas, D., & Tsitsiklis, J. N. (1997). Introduction to
linear optimization. Athena Scientific. Radcliffe, N. J., & Surry, P. (1999). Differential re-
sponse analysis: Modeling true response by isolating
Birge, J. R., & Louveaux, F. (1997). Introduction to sto- the effect of a single action. Proceedings of Credit
chastic programming. Springer. Scoring and Credit Control VI, Credit Research Cen-
Efron, B., & Tibshirani, R. J. (1998). An introduction to the tre, University of Edinburgh Management School.
bootstrap. CRC. Roberts, M. L., & Berger, P. D. (1999). Direct market-
Goldberg, D. E. (1989). Genetic algorithms in search, ing management. Prentice Hall.
optimization & machine learning. Addison-Wesley. Rud, O. P. (2001). Data mining cookbook. Wiley.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Russek-Cohen, E., & Simon, R. M. (1997). Evaluating
elements of statistical learning. Springer. treatments when a gender by treatment interaction may
Haykin, S. (1999). Neural networks: A comprehensive exist. Statistics in Medicine, 16(4), 455-464.
foundation. Prentice Hall. Stapleton, D. M., Hanna, J. B., & Markussen, D. (2003).
Hosmer, D. W., & Lemeshow, S. (1989). Applied logis- Marketing strategy optimization: Using linear program-
tic regression. Wiley. ming to establish an optimal marketing mixture. Ameri-
can Business Review, 21(2), 54-62.
Hughes, A. M. (1994). Strategic database marketing.
McGraw-Hill. Storey, A., & Cohen, M. (2002). Exploiting response
models: Optimizing cross-sell and up-sell opportuni-
Jackson, R., & Wang, P. (1996). Strategic database ties in banking. Proceedings of the Eighth ACM
marketing. NTC. SIGKDD International Conference on Knowledge Dis-
Kuhfeld, W. (1997). Efficient experimental designs covery and Data Mining (pp. 325-331).
using computerized searches. Sawtooth Software Re- Zhang, H., & Singer, B. (1999). Recursive partitioning in
search Paper Series. the health sciences. Springer.


KEY TERMS

Campaign Design: The art and science of designing a marketing campaign, including cell structure, sampling, and sizing.

Control Group: Individuals who look like those in the treatment group but are not contacted.

Database Marketing: A branch of marketing that applies database technology and analytics to understand customer behavior and improve effectiveness in marketing campaigns.

Fractional Factorial: A subset of the full factorial design; that is, a subset of all possible combinations.

Marketing Optimization: Delivering the best treatments to the right individuals.

Response Modeling: Predicting a response given individual characteristics by using data from a previous campaign.

Treatment Group: Individuals who are contacted (e.g., mailed) in a campaign.

Material Acquisitions Using Discovery Informatics Approach
Chien-Hsing Wu
National University of Kaohsiung, Taiwan, ROC

Tzai-Zang Lee
National Cheng Kung University, Taiwan, ROC

INTRODUCTION tion can explore patterns in databases that are meaningful,


interpretable, and decision supportable (Wang, 2003).
Material acquisition is a time-consuming but important The discovery informatics in circulation databases using
task for a library, because the quality of a library is not induction mechanism are used in the decision of library
in the number of materials that are available but in the acquisition budget allocation (Kao, Chang, & Lin, 2003;
number of materials that are actually utilized. Its goal is Wu, 2003). The circulation database is one of the impor-
to predict the users needs for information with respect tant knowledge assets for library managerial decisions.
to the materials that will most likely be used (Bloss, For example, information such as 75.4% of patrons who
1995). Discovery informatics using the technology of made use of Organizations also made use of Financial
knowledge discovery in databases can be used to inves- Economics via association relationship discovery is
tigate in-depth how the acquired materials are being supportive for the material acquisition operation. Conse-
used and, in consequence, can be a predictive sign of quently, the data-mining technique with an association
information needs. mechanism can be utilized to explore informatics that are
useful.

BACKGROUND
MAIN THRUST
Material searchers regularly spend large amounts of
time to acquire resources for enormous numbers of Utilization discovery as a base of material acquisitions,
library users. Therefore, something significant should comprising a combination of association utilization and
be relied on to produce the acquisition recommendation statistics utilization, is discussed in this article. The
list for the limited budget (Whitmire, 2002). Major association utilization is derived by a data-mining tech-
resources for material acquisitions are, in general, the nique. Systemically, when data mining is applied in the
personal collections of the librarians and recommenda- field of material acquisitions, it follows five stages:
tions by users, departments, and vendors (Stevens, 1999). collecting datasets, preprocessing collected datasets,
The collections provided by these collectors are usually mining preprocessed datasets, gathering discovery
determined by their individual preferences, rather than informatics, interpreting and implementing discovered
by a global view, and thus may not be adequate for the informatics, and evaluating discovered informatics. The
material acquisitions to rely on. Information in the statistics utilization is simply the sum of numeric values
usage data may show something different from the of strength for all different types of categories in pre-
collectors recommendations (Hamaker, 1995). For processed circulation data tables (Kao et al., 2003).
example, knowing which materials were most utilized These need both domain experts and data miners to
by the patrons would be highly useful for material accomplish the tasks successfully.
acquisitions.
First, circulation statistics is one of the most sig- Collecting Datasets
nificant references for library material acquisition de-
cisions (Budd & Adams, 1989; Tuten & Lones, 1995; Most libraries have employed computer information
Pu, Lin, Chien, & Juan, 1999). It is a reliable factor by systems to collect circulation data that mainly includes
which to evaluate the success of material utilization users identifier, name, address, and department for a
(Wise & Perushek, 2000). Second, the data-mining user; identifier, material category code, name, author,
technique with a capability of description and predic- publisher, and publication date for a material; and users


identifier, material identifier, material category code, date and are support and confidence, respectively (Meo,
borrowed, and date returned for a transaction. In order to Pseila, & Ceri, 1998). The P is regarded as the condition,
consider the importance of a material category, a data and Q as the conclusion, meaning that P can produce Q
table must be created to define the degree of importance implicitly. For example, an association rule Systems
that a material presents to a department (or a group of =>Organizations & Management (0.25, 0.33) means, If
users). For example, five scales of degree can be abso- materials in the category of Systems were borrowed in a
lutely matching, highly matching, matching, likely transaction, materials in Organization & Management
matching, and absolutely not matching, and their were also borrowed in the same transaction with a support
importance strength can be defined as 0.4, 0.3, 0.2, 0.1, and of 0.25 and a confidence of 0.33. Support is defined as the
0.0, respectively (Kao et al., 2003). ratio of the number of transactions observed to the total
number of transactions, whereas confidence is the ratio of
Preprocessing Data the number of transactions to the number of conditions.
Although association rules having the form of P=>Q (,
Preprocessing data may have operations of refinement ) can be generated in a transaction, the inverse associa-
and reconstruction of data tables, consistency of tion rules and single material category in a transaction
multityped data tables, elimination of redundant (or also need to be considered.
unnecessary) attributes, combination of highly corre- When two categories (C1 and C2) are utilized in a
lated attributes, and discretization of continuous at- transaction, it is difficult to determine the association
tributes. Two operations in this stage for material acqui- among C1=>C2, C2=>C1, and both. A suggestion from
sitions are the elimination of unnecessary attributes and librarians is to take the third one (both) as the decision
the reconstruction of data tables. For the elimination of of this problem (Wu, Lee, & Kao, 2004). This is also
unnecessary attributes, four data tables are preprocessed supported by the study of Meo et al. (1998), which deals
to derive the material utilization. They are users tables with association rule generation in customer purchasing
(two attributes: department identifier and user identi- transactions. The number of support and confidence of
fier), category tables (two attributes: material identifi- C1=>C2 may be different from those of C2=>C1. As a
ers and material category code), circulation table (three result, the inverse rules are considered as an extension
attributes: user identifier, material category code, and for the transactions that contain more than two catego-
date borrowed), and importance table (three attributes: ries to determine the number of association rules. The
department identifier, material category identifier, and number of rules can be determined via 2*[n*(n-1)/2],
importance). For the reconstruction of data tables, a where n is the number of categories in a transaction. For
new table can be generated that contains attributes of example, {C1, C2, C3} are the categories of a transac-
department identifier, user identifier, material category tion, and 6 association rules are then produced to be
code, strength, and date borrowed. {C1=>C2, C1=>C3, C2=>C3, C2=>C1, C3=>C1,
C3=>C2}. Unreliable association rules may occur be-
Mining Data cause their supports and confidences are too small.
Normally, there is a predefined threshold that defines
Mining mechanisms can perform knowledge discovery the value of support and confidence to filter the unreli-
with the form of association, classification, regression, able association rules. Only when the support and con-
clustering, and summarization/generalization (Hirota & fidence of a rule satisfy the defined threshold is the rule
Pedrycz, 1999). The association with a form of If regarded as a reliable rule. However, no evidence exists
Condition Then Conclusion captures relationships be- so far is reliable determining the threshold. It mostly
tween variables. The classification is to categorize a set depends on how reliable the management would like the
of data based on their values of the defined attributes. discovered rules to be. For a single category in a trans-
The regression is to derive a prediction model by alter- action, only the condition part without support and
ing the independent variables for dependent one(s) in a confidence is considered, because of the computation
defined database. Clustering is to put together the physi- of support and confidence for other transactions.
cal or abstract objects into a class based on similar Another problem is the redundant rules in a transac-
characteristics. The summarization/generalization is tion. It is realized that an association rule is to reveal the
to abridge the general characteristics over a set of company of a certain kind of material category, inde-
defined attributes in a database. pendent of the number of its occurrences. Therefore, all
The association informatics can be employed in redundant rules are eliminated. In other words, there is
material acquisitions. Like a rule, it takes the form of only one rule for a particular condition and only one conclu-
P=>Q (, ), where P and Q are material categories, and sion in a transaction. Also, the importance of a material to a


department is omitted. However, the final material utilization Evaluating Discovered Informatics
will take into account this concern when the combination with M
statistics utilization is performed. Performance of the discovered informatics needs to be
tested. Criteria used can be validity, significance/unique-
Gathering Discovery Informatics ness, effectiveness, simplicity, and generality (Hirota &
Pedrycz, 1999). The validity looks at whether the discov-
The final material utilization as the discovery informatics ered informatics is practically applicable. Uniqueness/
contains two parts. One is statistics utilization, and the significance deals with how different the discovered
other is association utilization (Wu et al., 2004). It is informatics are to the knowledge that library manage-
expressed as Formula 1 for a material category C. ment already has. Effectiveness is to see the impact the
discovered informatics has on the decision that has been
MatU(C) = nC + Σk nk * (α * support + β * confidence)   (1)
of scalability. The criteria used to evaluate the discov-
ered material utilization for material acquisitions can be
Where in particular the uniqueness/significance and effective-
ness. The uniqueness/significance can show that mate-
MatU(C): material utilization for category C rial utilization is based not only on statistics utilization,
nC: statistics utilization but also on association utilization. The solution of effec-
nk: statistics utilization of the kth category that can tiveness evaluation can be found by answering the
produce C questions Do the discovered informatics significantly
: intensity of support help reflect the information categories and subject areas
support: number of support of materials requested by users? and Do the discov-
: intensity of confidence ered informatics significantly help enhance material uti-
confidence: number of confidence lizations for next year?

Interpreting and Implementing


Discovery Informatics FUTURE TRENDS

Interpretation of discovery informatics can be performed Digital libraries using innovative Internet technology
by any visualization techniques, such as table, figure, promise a new information service model, where li-
graph, animation, diagram, and so forth. The main discov- brary materials are digitized for users to access any-
ery informatics for material acquisition have three tables time from anywhere. In fact, it is almost impossible for
indicating statistics utilization, association rules, and a library to provide patrons with all the materials avail-
material utilization. The statistics utilization table lists able because of budget limitations. Having the collec-
each material category and its utilization. The associa- tions that closely match the patrons needs is a primary
tion rule table has four attributes, including condition, goal for material acquisitions. Libraries must be cen-
conclusion, supports, and confidence. Each tuple in this tered on users and based on contents while building a
table represents an association rule. The association global digital library (Kranich, 1999). This results in
utilization is computed according to this table. The ma- the increased necessity of discovery informatics tech-
terial utilization table has five attributes, including ma- nology. Advanced research tends to the integrated stud-
terial category code, statistics utilization, association ies that may have requests for information for different
utilization, material utilization, and percentage. In this subjects (material categories). The material associa-
table, the value of the material utilization is the sum of tions discovered in circulation databases may reflect
statistics utilization and association utilization. For each these requests.
material category, the percentage is the ratio of its Library management has paid increased attention to
utilization to the total utilization. Implementation deals easing access, filtering and retrieving knowledge
with how to utilize the discovered informatics. The ma- sources, and bringing new services onto the Web, and
terial utilization can be used as a base of material acqui- users are industriously looking for their needs and
sitions by which informed decisions about allocating figuring out what is really good for them. Personalized
budget is made. information service becomes urgent. The availability


The availability of accessing materials via the Internet is rapidly changing the strategy from print to digital forms for libraries. For example, what can be relied on while making the decision on which electronic journals or e-books are required for a library, how do libraries deal with the number of login names and the number of users entering when analysis of arrival is concerned, and how do libraries create personalized virtual shelves for patrons by analyzing their transaction profiles? Furthermore, data collection via daily circulation operation may be greatly impacted by the way a user makes use of the online materials and, as a consequence, makes the material acquisitions operation even more difficult. Discovery informatics technology can help find the solutions for these issues.

CONCLUSION

Material acquisition is an important operation for a library that needs both technology and management involved. Circulation data are more than data that keep material usage records. Discovery informatics technology is an active domain that is connected to data processing, machine learning, information representation, and management, particularly when it has shown a substantial aid in decision making. Data mining is an application-dependent issue, and applications in a given domain need adequate techniques to deal with them. Although discovery informatics depends highly on the technologies used, its use with respect to applications in a domain still needs more effort to concurrently benefit management capability.

REFERENCES

Bloss, A. (1995). The value-added acquisitions librarian: Defining our role in a time of change. Library Acquisitions: Practice & Theory, 19(3), 321-330.

Budd, J. M., & Adams, K. (1989). Allocation formulas in practice. Library Acquisitions: Practice & Theory, 13, 381-390.

Hamaker, C. (1995). Time series circulation data for collection development; Or, you can't intuit that. Library Acquisitions: Practice & Theory, 19(2), 191-195.

Hirota, K., & Pedrycz, W. (1999). Fuzzy computing for data mining. Proceedings of the IEEE, 87(9), 1575-1600.

Kao, S. C., Chang, H. C., & Lin, C. H. (2003). Decision support for the academic library acquisition budget allocation via circulation database mining. Information Processing & Management, 39(1), 133-147.

Kranich, N. (1999). Building a global digital library. In C.-C. Chen (Ed.), IT and global digital library development (pp. 251-256). West Newton, MA: MicroUse Information.

Lu, H., Feng, L., & Han, J. (2000). Beyond intratransaction association analysis: Mining multidimensional intertransaction association rules. ACM Transactions on Information Systems, 18(4), 423-454.

Meo, R., Psaila, G., & Ceri, S. (1998). An extension to SQL for mining association rules. Data Mining & Knowledge Discovery, 2, 195-224.

Pu, H. T., Lin, S. C., Chien, L. F., & Juan, Y. F. (1999). Exploration of practical approaches to personalized library and networked information services. In C.-C. Chen (Ed.), IT and global digital library development (pp. 333-343). West Newton, MA: MicroUse Information.

Stevens, P. H. (1999). Who's number one? Evaluating acquisitions departments. Library Collections, Acquisitions, & Technical Services, 23, 79-85.

Tuten, J. H., & Lones, B. (1995). Allocation formulas in academic libraries. Chicago, IL: Association of College and Research Libraries.

Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group.

Whitmire, E. (2002). Academic library performance measures and undergraduates' library use and educational outcomes. Library & Information Science Research, 24, 107-128.

Wise, K., & Perushek, D. E. (2000). Goal programming as a solution technique for the acquisition allocation problem. Library & Information Science Research, 22(2), 165-183.

Wu, C. H. (2003). Data mining applied to material acquisition budget allocation for libraries: Design and development. Expert Systems with Applications, 25(3), 401-411.

Wu, C. H., Lee, T. Z., & Kao, S. C. (2004). Knowledge discovery applied to material acquisitions for libraries. Information Processing & Management, 40(4), 709-725.

KEY TERMS

Association Rule: The implication of connections for variables that are explored in databases, having the form A → B, where A and B are disjoint subsets of a dataset of binary attributes.

Circulation Database: The information of material usages that are stored in a database, including user identifier, material identifier, date the material is borrowed and returned, and so forth.

Digital Library: A library that provides the resources to select, structure, offer, access, distribute, preserve, and maintain the integrity of the collections of digital works.

Discovery Informatics: Knowledge explored in databases with the form of association, classification, regression, summarization/generalization, and clustering.

Material Acquisition: A process of material information collection by recommendations of users, vendors, colleges, and so forth. Information explored in databases can also be used. The collected information is, in general, used in purchasing materials.

Material Category: A set of library materials with similar subjects.

Materialized Hypertext View Maintenance


Giuseppe Sindoni
ISTAT - National Institute of Statistics, Italy

INTRODUCTION

A hypertext view is a hypertext containing data from an underlying database. The materialization of such hypertexts, that is, the actual storage of their pages in the site server, is often a valid option¹. Suitable auxiliary data structures and algorithms must be designed to guarantee consistency between the structures and contents of each heterogeneous component where base data is stored and those of the derived hypertext view.

This topic covers the maintenance features required by the derived hypertext to enforce consistency between page content and database status (Sindoni, 1998). Specifically, the general problem of maintaining hypertexts after changes in the base data and how to incrementally and automatically maintain the hypertext view are discussed, and a solution using a Definition Language for Web page generation and an algorithm and auxiliary data structure for automatic and incremental hypertext view maintenance is presented.

BACKGROUND

Some additional maintenance features are required by a materialized hypertext to enforce consistency between page contents and the current database status. In fact, every time a transaction is issued on the database, its updates must be efficiently and effectively extended to the derived hypertext. In particular, (i) updates must be incremental, that is, only the hypertext pages dependent on database changes must be updated, and (ii) all database updates must propagate to the hypertext.

The principle of incremental maintenance has been previously explored by several authors in the context of materialized database views (Blakeley et al., 1986; Gupta et al., 2001; Paraboschi et al., 2003; Vista, 1998; Zhuge et al., 1995). Paraboschi et al. (2003) give a useful overview of the materialized view maintenance problem in the context of multidimensional databases. Blakeley et al. (1986) propose a method in which all database updates are first filtered to remove those that cannot possibly affect the view. For the remaining updates, they apply a differential algorithm to re-evaluate the view expression. This exploits the knowledge provided by both the view definition expression and the database update operations. Gupta et al. (2001) consider a variant of the view maintenance problem: to keep a materialized view up-to-date when the view definition itself changes. They try to adapt the view in response to changes in the view definition. Vista (1998) reports on the integration of view maintenance policies into a database query optimizer. She presents the design, implementation and use of a query optimizer responsible for the generation of both maintenance expressions to be used for view maintenance and execution plans. Zhuge et al. (1995) show that decoupling of the base data (at the sources) from the view definition and view maintenance machinery (at the warehouse) can lead the warehouse to compute incorrect views. They introduce an algorithm that eliminates the anomalies.

Fernandez et al. (2000), Sindoni (1998) and Labrinidis & Roussopoulos (2000) have brought these principles to the Web hypertext field. Fernandez et al. (2000) provide a declarative query language for hypertext view specification and a template language for specification of its HTML representation. Sindoni (1998) deals with the maintenance issues required by a derived hypertext to enforce consistency between page content and database state. Hypertext views are defined as nested oid-based views over the set of base relations. A specific logical model is used to describe the structure of the hypertext, and a nested relational algebra extended with an oid invention operator is proposed, which allows views and view updates to be defined. Labrinidis & Roussopoulos (2000) analytically and quantitatively compare three materialization policies (inside the DBMS, at the Web server and virtual). Their results indicate that materialization at the Web server is a more scalable solution and can facilitate an order of magnitude more users than the other two policies, even under high update workloads.

The orthogonal problem of deferring maintenance operations, thus allowing the definition of different policies, has been studied by Bunker et al. (2001), who provide an overview of the view maintenance subsystem of a commercial data warehouse system. They describe optimizations and discuss how the system's focus on star schemas and data warehousing influences the maintenance subsystem.

MAIN THRUST

With respect to incremental view maintenance, the heterogeneity caused by the semistructured nature of views, different data models and formats between base data and derived views makes the materialized hypertext view different to the database view context and introduces some new issues.

Hypertexts are normally modeled by object-like models, because of their nested and network-like structure. Thus, with respect to relational materialized views, where base tables and views are modeled using the same logical model, maintaining a hypertext view derived from relational base tables involves the additional challenge of taking into account for the materialized views a different data model to that of base tables.

In addition, each page is physically stored as a marked-up text file, possibly on a remote server. Direct access to single values on the page is thus not permitted. Whenever a page needs to be updated, it must therefore be completely regenerated from the new database status. Furthermore, consistency between pages must be preserved, which is an operation analogous to the one of preserving consistency between nested objects.

The problem of dynamically maintaining consistency between base data and derived hypertext is the hypertext view maintenance problem. It has been addressed in the framework of the STRUDEL project (Fernandez et al., 2000) as the problem of incremental view updates for semistructured data, by Sindoni (1998) and by Labrinidis & Roussopoulos (2000).

There are a number of related issues:

• different maintenance policies should be allowed (immediate or deferred);
• this implies the design of auxiliary data structures to keep track of database updates, but their management overloads the system and they must therefore be as light as possible;
• finally, due to the particular network structure of a hypertext, consistency must be maintained not only between the single pages and the database, but also between page links.

To deal with such issues, a manipulation language for derived hypertexts; an auxiliary data structure for (i) representing the dependencies between the database and hypertext and (ii) logging database updates; and an algorithm for automatic hypertext incremental maintenance can be introduced. For example, hypertext views and view updates can be defined using a logical model and an algebra (Sindoni, 1998).

An auxiliary data structure allows information on the dependencies between database tables and hypertext pages to be maintained. It may be based on the concept of view dependency graph, for the maintenance of the hypertext class described by the logical model. A view dependency graph stores information about the base tables, which are used in the hypertext view definitions.

Finally, incremental page maintenance can be performed by a maintenance algorithm that takes as its input a set of changes on the database and produces a minimal set of update instructions for hypertext pages. The algorithm can be used whenever hypertext maintenance is required.

A Manipulation Language for Materialized Hypertexts

Once a derived hypertext has been designed with a logical model and an algebra has been used to define its materialization as a view on base tables, a manipulation language is of course needed to populate the site with page scheme instances² and maintain them when database tables are updated. The language is based on invocations of algebra expressions.

The languages used for page creation and maintenance may be very simple, such as that composed of only two instructions: GENERATE and REMOVE. They allow manipulation of hypertext pages and can refer to the whole hypertext, to all instances of a page scheme, or to pages that satisfy a condition.

The GENERATE statement has the following general syntax:

    GENERATE ALL | <PAGESCHEME>
    [WHERE <CONDITION>]

Its semantics essentially create the proper set of pages, taking data from the base tables as specified by the page scheme definitions. The ALL keyword allows generation of all instances of each page scheme. The REMOVE statement has a similar syntax and allows the specified sets of pages to be removed from the hypertext.
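As an illustration of how the two statements might be used (the AuthorPage scheme and the Author table referenced in the conditions are hypothetical names introduced only for this example, not part of a specific system):

    GENERATE AuthorPage
    WHERE Author.Country = 'Italy'

    REMOVE AuthorPage
    WHERE Author.Active = 'N'

The first statement materializes only the AuthorPage instances whose underlying Author tuples satisfy the condition; the second removes the pages of authors flagged as inactive. A plain GENERATE ALL would instead rebuild every instance of every page scheme.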
Incremental Maintenance of Materialized Hypertexts

Whenever new data are inserted in the database, the status of all affected tables changes and the hypertexts whose content derives from those tables no longer reflect the database's current status. These pages must therefore be updated in line with the changes. An extension to the system is thus needed to incrementally enforce consistency between database and hypertext.

The simple "brute force" approach to the problem would simply regenerate the whole hypertext from the new database status, thus performing a huge number of unnecessary page creation operations. A more sophisticated approach would regenerate only the instances of the involved page schemes, again however unnecessarily regenerating all page instances not actually containing the newly inserted data. The optimal solution is hence to regenerate only the page instances containing the newly inserted data. To perform such an incremental maintenance process, the database transaction must be extended by a GENERATE statement with a suitable WHERE condition restricting the generation of pages to only those containing new data. This is done by using a list of URLs of pages affected by the database change, which is produced by the maintenance algorithm.

Let us now consider a deletion of data from the database. In principle, this corresponds to the deletion and/or modification of derived hypertext instances. Pages will be deleted whenever all the information they contain has been deleted from the database, and re-generated whenever only part of their information has been deleted. Deletion transactions from the database are therefore extended by either REMOVE or GENERATE statements on the derived hypertext.

The main maintenance problems in this framework are: (i) to produce the proper conditions for the GENERATE and REMOVE statements automatically and (ii) to distinguish database updates corresponding to page removals from those corresponding to page replacements. These problems are normally solved by allowing the system to log deletions from each base table and maintain information on which base tables are involved in the generation of each page scheme instance.

When the underlying database is relational, this can be done by systems that effectively manage the sets of inserted and deleted tuples of each base relation since the last hypertext maintenance operation. These sets will be addressed as Δ+ and Δ-. Δ relations were first introduced in the database programming language field, while Sindoni (1998) and Labrinidis & Roussopoulos (2001) used them in the context of hypertext views. In this framework, for a database relation, a Δ is essentially a relation with the same scheme, storing the set of inserted (Δ+) or deleted (Δ-) tuples since the last maintenance operation.

A view dependency graph is also implemented, allowing information on dependencies between base relations and derived views to be maintained. It also maintains information on oid attributes, that is, the attributes of each base relation whose values are used to generate hypertext URLs.

The maintenance algorithm takes Δ+, Δ- and the view dependency graph as inputs and produces as output the sequence of GENERATE and REMOVE statements necessary for hypertext maintenance. It allows maintenance to be postponed and performed in batch mode, by using the table update logs and implementing proper triggering mechanisms. For more details of the algorithm see Sindoni (1998).
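The Python sketch below illustrates the flavor of such processing. It is only an illustration under simplifying assumptions made here (a single key attribute per base table, a dictionary-shaped dependency graph, and a crude rule for deciding between regeneration and removal); it is not the actual algorithm of Sindoni (1998).

    def hypertext_maintenance(dependency_graph, delta_plus, delta_minus):
        """Return a list of (statement, page_scheme, page_keys) commands.

        dependency_graph: {page_scheme: {base_table: key_attribute}} -- for each
            page scheme, the base tables it depends on and the attribute whose
            values identify the affected pages (the values used in page URLs).
        delta_plus, delta_minus: {base_table: [tuple_as_dict, ...]} -- the
            inserted and deleted tuples logged since the last maintenance run.
        """
        commands = []
        for scheme, sources in dependency_graph.items():
            to_generate, to_remove = set(), set()
            for table, key_attr in sources.items():
                to_generate |= {t[key_attr] for t in delta_plus.get(table, [])}
                to_remove |= {t[key_attr] for t in delta_minus.get(table, [])}
            # A page that loses some tuples but also gains new ones is simply
            # regenerated; deciding whether *all* of a page's data is gone would
            # require a lookup on the base tables, which is omitted in this sketch.
            to_remove -= to_generate
            if to_generate:
                commands.append(("GENERATE", scheme, sorted(to_generate)))
            if to_remove:
                commands.append(("REMOVE", scheme, sorted(to_remove)))
        return commands

Running such a function over the Δ+ and Δ- tables logged since the last refresh yields the batch of GENERATE and REMOVE statements to be executed, so maintenance can be deferred and performed in batch mode as described above.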
FUTURE TRENDS

There are a number of connected problems that may merit further investigation.

• Defining the most suitable maintenance policy for each page class involves a site analysis process to extract data on page access and, more generally, on site quality and impact.
• If a deferred policy is chosen for a given class of pages, this causes part of the hypertext to become temporarily inconsistent with its definition. Consequently, transactions reading multiple page schemes may not perform efficiently. The concurrency control problem needs to be extended to the context of hypertext views and suitable algorithms must be developed. There is a particular need to address the problem of avoiding dangling links, that is, links pointing nowhere because the former target page has been removed.
• Performance analysis is needed to show how transaction overhead and view refresh time are affected in the above approaches. Its results should be used to define a set of system-tuning parameters, to be used by site administrators for optimization purposes.

CONCLUSION

We have shown that one of the most important problems in managing information repositories containing structured and semistructured data is to provide applications able to manage their materialization.

This approach to the definition and generation of a database-derived Web site is particularly suitable for defining a framework for its incremental updates. In fact, the ability to formally define mappings between base data and derived hypertext allows easy definition of hypertext updates in the same formalism.

To enable the system to propagate database updates to the relevant pages, a Data Manipulation Language must be defined, which allows automatic generation or removal of a derived site or a specific subset of its pages. The presence of a DML for pages allows an algorithm for their automatic maintenance to be defined. This can use a specific data structure that keeps track of view-table dependencies and logs table updates. Its output is a set of the Data Manipulation Language statements to be executed in order to update the hypertext.

The algorithm may be used to implement different view maintenance policies. It allows maintenance to be deferred and performed in batch mode, by using the table update logs and implementing proper triggering mechanisms.
These techniques, like many others defined previously, are now being applied to XML data by various researchers (Braganholo et al., 2003; Chen et al., 2002; Alon et al., 2003; Zhang et al., 2003; Shanmugasundaram et al., 2001). The most challenging issue for long-term research is probably that of extending hypertext incremental maintenance to the case where data come from many heterogeneous, autonomous and distributed databases.

REFERENCES

Alon, N. et al. (2003). Typechecking XML views of relational databases. ACM Transactions on Computational Logic, 4(3), 315-354.

Blakeley, J. et al. (1986). Efficiently updating materialized views. In ACM SIGMOD International Conf. on Management of Data (SIGMOD'86) (pp. 61-71).

Braganholo, V. P. et al. (2003). On the updatability of XML views over relational databases. In WebDB (pp. 31-36).

Bunker, C. J. et al. (2001). Aggregate maintenance for data warehousing in Informix Red Brick Vista. In VLDB 2001 (pp. 659-662).

Chen, Y. B. et al. (2000). Designing valid XML views. In Entity Relationship Conference (pp. 463-478).

Fernandez, M. F. et al. (2000). Declarative specification of Web sites with Strudel. The VLDB Journal, 9(1), 38-55.

Gupta, A. et al. (2001). Adapting materialized views after redefinitions: Techniques and a performance study. Information Systems, 26(5), 323-362.

Labrinidis, A., & Roussopoulos, N. (2000). WebView materialization. In SIGMOD'00 (pp. 367-378).

Labrinidis, A., & Roussopoulos, N. (2001). Update propagation strategies for improving the quality of data on the Web. In VLDB (pp. 391-400).

Paraboschi, S. et al. (2003). Materialized views in multidimensional databases. In Multidimensional databases (pp. 222-251). Hershey, PA: Idea Group Publishing.

Shanmugasundaram, J. et al. (2001). Querying XML views of relational data. In VLDB (pp. 261-270).

Sindoni, G. (1998). Incremental maintenance of hypertext views. In Proceedings of the Workshop on the Web and Databases (WebDB'98) (in conjunction with EDBT'98). LNCS 1590 (pp. 98-117). Berlin: Springer-Verlag.

Vista, D. (1998). Integration of incremental view maintenance into query optimizers. In EDBT (pp. 374-388).

Zhang, X. et al. (2003). Rainbow: Multi-XQuery optimization using materialized XML views. SIGMOD Conference (p. 671).

Zhuge, Y. et al. (1995). View maintenance in a warehousing environment. In SIGMOD Conference (pp. 316-327).

KEY TERMS

Database Status: The structure and content of a database at a given time stamp. It comprises the database object classes, their relationships and their object instances.

Deferred Maintenance: The policy of not performing database maintenance operations when their need becomes evident, but postponing them to a later moment.

Dynamic Web Pages: Virtual pages dynamically constructed after a client request. The request is usually managed by a specific program or described using a specific query language whose statements are embedded into pages.

Immediate Maintenance: The policy of performing database maintenance operations as soon as their need becomes evident.

Link Consistency: The ability of a hypertext network's links to always point to an existing and semantically coherent target.

Materialized Hypertext: A hypertext dynamically generated from an underlying database and physically stored as a marked-up text file.

Semistructured Data: Data with a structure not as rigid, regular, or complete as that required by traditional database management systems.

ENDNOTES

1. For more details, see the paper "Materialized Hypertext Views".
2. A page scheme is essentially the abstract representation of pages with the same structure, and a page scheme instance is a page with the structure described by the page scheme. For the definition of page scheme and instance, see the paper "Materialized Hypertext Views".

Materialized Hypertext Views


Giuseppe Sindoni
ISTAT - National Institute of Statistics, Italy

INTRODUCTION

A materialized hypertext view can be defined as a hypertext containing data coming from a database and whose pages are stored in files (Sindoni, 1999). A Web site presenting data from a data warehouse is an example of such a view. Even if the most popular approach to the generation of such sites is based on dynamic Web pages, a rationale for the materialized approach has produced many research efforts. The topic will cover logical models to describe the structure of the hypertext.

BACKGROUND

Hypertext documents in the Web are in essence collections of HTML (HyperText Markup Language) or XML (the eXtensible Markup Language) files and are delivered to users by an HTTP (HyperText Transfer Protocol) server. Hypertexts are very often used to publish very large amounts of data on the Web, in what are known as data-intensive Web sites. These sites are characterized by a large number of pages sharing the same structure, such as in a University Web site, where there are numerous pages containing staff information, that is, Name, Position, Department, and so on. Each staff page is different, but they all share the types of information and their logical organization. A group of pages sharing the same structure is called a page class. Similarly to databases, where it is possible to distinguish between the intensional (database structure) and extensional (the database records) levels, in data-intensive Web sites it is possible to distinguish between the site structure (the structure of the different page classes and page links) and site pages (instances of page classes).

Pages of a data-intensive site may be generated dynamically, that is, on demand, or be materialized, as will be clarified in the following. In both approaches, each page corresponds to an HTML file and the published data normally come from a database, where they can be updated more efficiently than in the hypertext files themselves. The database is queried to extract records relevant to the hypertext being generated and page instances are filled with values according to a suitable hypertext model describing page classes (Agosti et al., 1995; Aguilera et al., 2002; Beeri et al., 1998; Crestani & Melucci, 2003; Baresi et al., 2000; Balasubramanian et al., 2001; Merialdo et al., 2003; Rossi & Schwabe, 2002; Simeon & Cluet, 1998). The hypertext can then be accessed by any Internet-enabled machine running a Web browser.

In such a framework, hypertexts can be regarded as database views, but in contrast with classic databases, such as relational databases, the model describing the view cannot be the same as the one describing the database storing the published data.

The most relevant hypertext logical models proposed so far can be classified into three major groups, according to the purpose of the hypertext being modeled. Some approaches are aimed at building hypertext as integration views of distributed data sources (Aguilera et al., 2002; Beeri et al., 1998; Simeon & Cluet, 1998), others as views of an underlying local database (Fernandez et al., 2000; Merialdo et al., 2003). There are also proposals for models and methods to build hypertext independently from the source of the data that they publish (Agosti et al., 1995; Baresi et al., 2000; Balasubramanian et al., 2001; Rossi & Schwabe, 2002; Simeon & Cluet, 1998).

Models can be based on graphs (Fernandez et al., 2000; Simeon & Cluet, 1998), on XML Data Type Definitions (Aguilera et al., 2002), extensions of the Entity Relationship model (Balasubramanian et al., 2001), logic rules (Beeri et al., 1998) or object-like paradigms (Merialdo et al., 2003; Rossi & Schwabe, 2002).

MAIN THRUST

The most common way to automatically generate a derived hypertext is based principally on the dynamic construction of virtual pages following a client request. Usually, the request is managed by a specific program (for example a Common Gateway Interface, CGI, called as a link in HTML files) or described using a specific query language, whose statements are embedded into pages. These pages are often called pull pages, because it is up to the client browser to pull out the interesting information. Unfortunately, this approach has some major drawbacks:

• it involves a degree of Data Base Management System overloading, because every time a page is requested by a client browser, a query is issued to the database in order to extract the relevant data;
• it introduces some platform-dependence, because the embedded queries are usually written in a proprietary language and the CGIs must be compiled on the specific platform;
• it hampers site mirroring, because if the site needs to be moved to another server, either the database needs to be replicated, or some network overload is introduced due to remote queries;
• it doesn't allow the publication of some site metadata, more specifically information about the structure of the site, which may be very useful to querying applications.

An alternative approach is based on the concept of materialized hypertext view: a derived hypertext whose pages are actually stored by the system on a server or directly on the client machine, using a mark-up language like HTML. This approach overcomes the above disadvantages because: (i) pages are static, so the HTTP server can work on its own; (ii) there is no need to embed queries or script calls in the pages, as standard sites are generated; (iii) due to their standardization, sites can be mirrored more easily, as they are not tied to a specific technology; and finally, (iv) metadata can be published by either embedding them into HTML comments or directly generating XML files.

A data model, preferably object-oriented, is used to describe a Web hypertext. This allows the system to manage nested objects by decomposition. This means that each hypertext page is seen as an object with attributes, which can be atomic or complex, such as a list of values. Complex attributes are also modeled as nested objects into the page object. These objects can also have both atomic and complex attributes (objects) and the nesting mechanism is virtually unlimited.

Below we will describe the Araneus Data Model (ADM) (Merialdo et al., 2003), as an example of a hypertext data model. Different models can be found in (Fernandez et al., 2000; Fraternali & Paolini, 1998).

ADM is a page-oriented model, as page is the main concept. Each hypertext page is seen as an object having an identifier (its Uniform Resource Locator - URL) and a number of attributes. Its structure is abstracted by its page scheme and each page is an instance of a page scheme. The notion of page scheme may be compared to that of relation scheme, in the relational data model, or object class, in object-oriented databases. The following example describes the page of an author in a bibliographic site, described by the AUTHORPAGE page scheme.

    PAGE SCHEME AuthorPage
      Name: TEXT;
      WorkList: LIST OF
        (Authors: TEXT;
         Title: TEXT;
         Reference: TEXT;
         Year: TEXT;
         ToRefPage: LINK TO ConferencePage UNION JournalPage;
         AuthorList: LIST OF
           (Name: TEXT;
            ToAuthorPage: LINK TO AuthorPage OPTIONAL;););
    END PAGE SCHEME

Each AUTHORPAGE instance has a simple attribute (NAME). Pages can also have complex attributes: lists, possibly nested at an arbitrary level, and links to other pages. The example shows the page scheme with a list attribute (WORKLIST). Its elements are tuples, formed by three simple attributes (AUTHORS, TITLE and YEAR), a link to an instance of either a CONFERENCEPAGE or a JOURNALPAGE and the corresponding anchor (REFERENCE), and a nested list (AUTHORLIST) of other authors of the same work.

Once a description of the hypertext is available, its materialization is made possible using a mapping language, such as those described in Merialdo et al. (2003).
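To make the materialization step concrete, the following Python sketch writes one static HTML file per AuthorPage instance from two illustrative base tables. It is not the Araneus mapping language itself; the table layouts, file names and HTML markup are assumptions made only for this example.

    import html, pathlib

    def materialize_author_pages(authors, works, out_dir="site"):
        """Write one HTML file per AuthorPage instance.

        authors: list of dicts with keys "id" and "name" (assumed base relation).
        works:   list of dicts with keys "author_id", "authors", "title",
                 "year", "ref", "ref_url" (assumed base relation).
        """
        out = pathlib.Path(out_dir)
        out.mkdir(exist_ok=True)
        for a in authors:
            rows = [w for w in works if w["author_id"] == a["id"]]
            items = "\n".join(
                "<li>{au}. <i>{ti}</i> ({yr}). <a href='{url}'>{ref}</a></li>".format(
                    au=html.escape(w["authors"]),
                    ti=html.escape(w["title"]),
                    yr=html.escape(str(w["year"])),
                    url=html.escape(w["ref_url"]),
                    ref=html.escape(w["ref"]))
                for w in rows)
            page = ("<html><body><h1>{name}</h1>\n<ul>\n{items}\n</ul>"
                    "</body></html>").format(name=html.escape(a["name"]),
                                              items=items)
            # Storing each page as a plain file is the materialization step:
            # the resulting site needs no database access at request time.
            (out / "author_{0}.html".format(a["id"])).write_text(page)

The file written for each author corresponds to one instance of the AUTHORPAGE scheme above; the URL of the page is derived from the author identifier, as discussed for oid attributes in the companion maintenance article.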
FUTURE TRENDS

One of the topics currently attracting the interest of many researchers and practitioners of the Web and databases fields is XML. Most efforts are aimed at modeling XML repositories and defining query languages for querying and transforming XML sources (World Wide Web Consortium, 2004).

One of the current research directions is to explore XML as both a syntax for metadata publishing and a document model to be queried and restructured. Mecca, Merialdo, & Atzeni (1999) show that XML modeling primitives may be considered as a subset of the Object Data Management Group standard enriched with union types, and XML repositories may in principle be queried using a language like the Object Query Language.

CONCLUSION

Data-intensive hypertext can be published by assuming that the pages contain data coming from an underlying database and that their logical structure is described according to a specific model. Pages may be mapped on the database and automatically generated using a programming language.

To allow external applications to access these metadata, a materialized approach to page generation can be adopted.

The massive diffusion of XML as a preferred means for describing a Web page's structure and publishing it on the Internet is facilitating integrated access to heterogeneous, distributed data sources: the Web is rapidly becoming a repository of global knowledge. The research challenge for the 21st century will probably be to provide global users with applications to efficiently and effectively find the required information. This could be achieved by utilizing models, methods and tools which have already been developed for knowledge discovery and data warehousing in more controlled and local environments.

REFERENCES

Agosti, M. et al. (1995). Automatic authoring and construction of hypertext for information retrieval. Multimedia Systems, 3, 15-24.

Aguilera, V. et al. (2002). Views in a large-scale XML repository. Very Large DataBase Journal, 11(3), 238-255.

Balasubramanian, V. et al. (2001). A case study in systematic hypertext design. Information Systems, 26(4), 295-320.

Baresi, L. et al. (2000). From Web sites to Web applications: New issues for conceptual modeling. In Entity Relationship (Workshops) (pp. 89-100).

Beeri, C. et al. (1998). WebSuite: A tools suite for harnessing Web data. In Proceedings of the Workshop on the Web and Databases (WebDB'98) (in conjunction with Extending DataBase Technology '98). Lecture Notes in Computer Science (Vol. 1590) (pp. 152-171).

Crestani, F., & Melucci, M. (2003). Automatic construction of hypertexts for self-referencing: The hyper-text book project. Information Systems, 28(7), 769-790.

Fernandez, M. F. et al. (2000). Declarative specification of Web sites with Strudel. Very Large DataBase Journal, 9(1), 38-55.

Fraternali, P., & Paolini, P. (1998). A conceptual model and a tool environment for developing more scalable, dynamic, and customizable Web applications. In VI Intl. Conference on Extending Database Technology (EDBT '98) (pp. 421-435).

Mecca, G. et al. (1999). Araneus in the era of XML. IEEE Data Engineering Bulletin, 22(3), 19-26.

Merialdo, P. et al. (2003). Design and development of data-intensive Web sites: The Araneus approach. ACM Transactions on Internet Technology, 3(1), 49-92.

Rossi, G., & Schwabe, D. (2002). Object-oriented design structures in Web application models. Annals of Software Engineering, 13(1-4), 97-110.

Simeon, G., & Cluet, S. (1998). Using YAT to build a Web server. In Proceedings of the Workshop on the Web and Databases (WebDB'98) (in conjunction with Extending DataBase Technology '98). Lecture Notes in Computer Science (Vol. 1590) (pp. 118-135).

Sindoni, G. (1999). Maintenance of data and metadata in Web-based information systems. PhD Thesis. Università degli studi di Roma "La Sapienza".

World Wide Web Consortium. (2004). XML Query (XQuery). Retrieved August 23, 2004, from http://www.w3.org/XML/Query

KEY TERMS

Dynamic Web Pages: Virtual pages dynamically constructed after a client request. Usually, the request is managed by a specific program or is described using a specific query language whose statements are embedded into pages.

HTML: The HyperText Markup Language. A language based on labels to describe the structure and layout of a hypertext.

HTTP: The HyperText Transfer Protocol. An Internet protocol, used to implement communication between a Web client, which requests a file, and a Web server, which delivers it.

Knowledge Management: The practice of transforming the intellectual assets of an organization into business value.

Materialized Hypertext View: A hypertext containing data coming from a database and whose pages are stored in files.

Metadata: Data about data. Structured information describing the nature and meaning of a set of data.

XML: The eXtensible Markup Language. An evolution of HTML, aimed at separating the description of the hypertext structure from that of its layout.

Materialized View Selection for Data Warehouse Design
Dimitri Theodoratos
New Jersey Institute of Technology, USA

Alkis Simitsis
National Technical University of Athens, Greece

INTRODUCTION

A data warehouse (DW) is a repository of information retrieved from multiple, possibly heterogeneous, autonomous, distributed databases and other information sources for the purpose of complex querying, analysis, and decision support. Data in the DW are selectively collected from the sources, processed in order to resolve inconsistencies, and integrated in advance (at design time) before data loading. DW data are usually organized multi-dimensionally to support online analytical processing (OLAP). A DW can be seen abstractly as a set of materialized views defined over the source relations. During the initial design of a DW, the DW designer faces the problem of deciding which views to materialize in the DW. This problem has been addressed in the literature for different classes of queries and views, and with different design goals.

BACKGROUND

Figure 1 shows a simplified DW architecture. The DW contains a set of materialized views. The users address their queries to the DW. The materialized views are used partially or completely for the evaluation of the user queries. This is achieved through partial or complete rewritings of the queries using the materialized views.

Figure 1. A simplified DW architecture (users exchange queries and answers with the Data Warehouse, which in turn issues maintenance expressions against the Data Sources)

When the source relations change, the materialized views need to be updated. The materialized views are usually maintained using an incremental strategy. In such a strategy, the changes to the source relations are propagated to the DW. The changes to the materialized views are computed using the changes of the source relations and are eventually applied to the materialized views. The expressions used to compute the view changes involve the changes of the source relations and are called maintenance expressions. Maintenance expressions are issued by the DW against the data sources, and the answers are sent back to the DW. When the source relation changes affect more than one materialized view, multiple maintenance expressions need to be evaluated. The techniques of multi-query optimization can be used to detect common subexpressions among maintenance expressions in order to derive an efficient global evaluation plan for all the maintenance expressions.

MAIN THRUST

When selecting views to materialize in a DW, one attempts to satisfy one or more design goals. A design goal is either the minimization of a cost function or a constraint. A constraint can be classified as user-oriented or system-oriented. Attempting to satisfy the constraints can result in no feasible solution to the view selection problem. The design goals determine the design of the algorithms that select views to materialize from the space of alternative view sets.

Minimization of Cost Functions

Most approaches comprise in their design goals the minimization of a cost function.

• Query Evaluation Cost: Often, the queries that the DW has to satisfy are given as input to the view selection problem. The overall query evaluation cost is the sum of the cost of evaluating each input query rewritten (partially or completely) over the materialized views. This sum also can be weighted, each weight indicating the frequency or importance of the corresponding query. Several approaches aim at minimizing the query evaluation cost (Gupta & Mumick, 1999; Harinarayan et al., 1996; Shukla et al., 1998).
• View Maintenance Cost: The view maintenance cost is the sum of the cost of propagating each source relation change to the materialized views. This sum can be weighted, each weight indicating the frequency of propagation of the changes of the corresponding source relation. The maintenance expressions can be evaluated more efficiently if they can be partially rewritten over views already materialized at the DW; the evaluation of parts of the maintenance expression is avoided since their materializations are present at the DW. Moreover, access of the remote data sources and expensive data transmissions are reduced. Materialized views that are added to the DW for reducing the view maintenance cost are called auxiliary views (Ross et al., 1996; Theodoratos & Sellis, 1999). Obviously, maintaining the auxiliary views incurs additional maintenance cost. However, if this cost is less than the reduction to the maintenance cost of the initially materialized views, it is worth keeping the auxiliary views in the DW. Ross et al. (1996) derive auxiliary views to materialize in order to minimize the view maintenance cost.
• Operational Cost: Minimizing the query evaluation cost and the view maintenance cost are conflicting requirements. Low view maintenance cost can be obtained by replicating source relations at the DW. In this case, though, the query evaluation cost is high, since queries need to be computed from the replicas of the source relations. Low query evaluation cost can be obtained by materializing at the DW all the input queries. In this case, all the input queries can be answered by a simple lookup, but the view maintenance cost is high, since complex maintenance expressions over the source relations need to be computed. The input queries may overlap; that is, they may share many common subexpressions. By materializing common subexpressions and other views over the source relations, it is possible, in general, to reduce the view maintenance cost. These savings must be balanced against higher query evaluation cost. For this reason, one can choose to minimize a linear combination of the query evaluation and view maintenance cost, which is called operational cost (one way of writing it is sketched after this list). Most approaches endeavor to minimize the operational cost (Baralis et al., 1997; Gupta, 1997; Theodoratos & Sellis, 1999; Yang et al., 1997).
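For concreteness, the operational cost of a set M of materialized views can be written as follows; the weights and the balancing factor are generic notation introduced here, not the formulation of any specific paper:

\mathrm{OC}(M) \;=\; \sum_{q \in Q} w_q \,\mathrm{QC}(q, M) \;+\; \lambda \sum_{r \in R} f_r \,\mathrm{MC}(r, M)

where Q is the query workload, R the set of source relations, QC(q, M) the cost of evaluating query q rewritten over M, MC(r, M) the cost of propagating the changes of relation r to M, w_q and f_r query and update frequencies, and λ the factor balancing querying against maintenance.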
System-Oriented Constraints

System-oriented constraints are dictated by the restrictions of the system and are transparent to the users.

• Space Constraint: Although the degradation of the cost of disk space allows for massive storage of data, one cannot consider that the disk space is unlimited. The space constraint restricts the space occupied by the selected materialized views not to exceed the space allocated to the DW for this end. Space constraints are adopted in many works (Gupta, 1997; Golfarelli & Rizzi, 2000; Harinarayan et al., 1996; Theodoratos & Sellis, 1999).
• View Maintenance Cost Constraint: In many practical cases, the refraining factor in materializing all the views in the DW is not the space constraint but the view maintenance cost. Usually, DWs are updated periodically (e.g., at nighttime) in a large batch update transaction. Therefore, the update window must be sufficiently short so that the DW is available for querying and analysis during the daytime. The view maintenance cost constraint states that the total view maintenance cost should be less than a given amount of view maintenance time. Gupta and Mumick (1999), Golfarelli and Rizzi (2000), and Lee and Hammer (2001) consider a view maintenance cost constraint in selecting materialized views.
• Self-Maintainability: A materialized view is self-maintainable if it can be maintained for any instance of the source relations over which it is defined and for all source relation changes, using only these changes, the view definition, and the view materialization. The notion is extended to a set of views in a straightforward manner. By adding auxiliary views to a set of materialized views, one can make the whole view set self-maintainable. There are different reasons for making a view set self-maintainable: (a) the remote source relations need not be contacted in turn for evaluating maintenance expressions during view updating; (b) anomalies due to concurrent changes are eliminated, and the view maintenance process is simplified; (c) the materialized views can be maintained efficiently even if the sources are not able to answer queries (e.g., legacy systems), or if they are temporarily unavailable (e.g., in mobile systems). Self-maintainability can be trivially achieved by replicating at the DW all the source relations used in the view definitions. Self-maintainability viewed as a constraint requires that the set of materialized views taken together is self-maintainable.

Quass et al. (1996), Akinde et al. (1998), Liang et al. (1999), and Theodoratos (2000) aim at making the DW self-maintainable.
• Answering the Input Queries Using Exclusively the Materialized Views: This constraint requires the existence of a complete rewriting of the input queries, initially defined over the source relations, over the materialized views. Clearly, if this constraint is satisfied, the remote data sources need not be contacted for evaluating queries. This way, expensive data transmissions from the DW to the sources, and conversely, are avoided. Some approaches assume a centralized DW environment, where the source relations are present at the DW site. In this case, the answerability of the queries from the materialized views is trivially guaranteed by the presence of the source relations. The answerability of the queries also can be trivially guaranteed by appropriately defining select-project views on the source relations and replicating them at the DW. This approach assures also the self-maintainability of the materialized views. Theodoratos and Sellis (1999) do not assume a centralized DW environment or replication of part of the source relations at the DW and explicitly impose this constraint in selecting views for materialization.

User-Oriented Constraints

User-oriented constraints express requirements of the users.

• Answer Data Currency Constraints: An answer data currency constraint sets an upper bound on the time elapsed between the point in time the answer to a query is returned to the user and the point in time the most recent changes of a source relation that are taken into account in the computation of this answer are read (this time reflects the currency of answer data). Currency constraints are associated with every source relation in the definition of every input query. The upper bound in an answer data currency constraint (minimal currency required) is set by the users according to their needs. This formalization of data currency constraints allows stating currency constraints at the query level and not at the materialized view level, as is the case in some approaches. Therefore, currency constraints can be exploited by DW view selection algorithms, where the queries are the input, while the materialized views are the output (and, therefore, are not available). Furthermore, it allows stating different currency constraints for different relations in the same query.
• Query Response Time Constraints: A query response time constraint states that the time needed to evaluate an input query using the views materialized at the DW should not exceed a given bound. The bound for each query is given by the users and reflects their needs for fast answers. For some queries, fast answers may be required, while for others, the response time may not be predominant.

Search Space and Algorithms

Solving the problem of selecting views for materialization involves addressing two main tasks: (a) generating a search space of alternative view sets for materialization and (b) designing optimization algorithms that select an optimal or near-optimal view set from the search space.

A DW is usually organized according to a star schema where a fact table is surrounded by a number of dimension tables. The dimension tables define hierarchies of aggregation levels. Typical OLAP queries involve star joins (key/foreign key joins between the fact table and the dimension tables) and grouping and aggregation at different levels of granularity. For queries of this type, the search space can be formed in an elegant way as a multidimensional lattice (Baralis et al., 1997; Harinarayan et al., 1996).

Gupta (1997) states that the view selection problem is NP-hard. Most of the approaches on view selection problems avoid exhaustive algorithms. The adopted algorithms fall into two categories: deterministic and randomized. In the first category belong greedy algorithms with performance guarantee (Gupta, 1997; Harinarayan et al., 1996), 0-1 integer programming algorithms (Yang et al., 1997), A* algorithms (Gupta & Mumick, 1999), and various other heuristic algorithms (Baralis et al., 1997; Ross et al., 1996; Shukla et al., 1998; Theodoratos & Sellis, 1999). In the second category belong simulated annealing algorithms (Kalnis et al., 2002; Theodoratos et al., 2001), iterative improvement algorithms (Kalnis et al., 2002) and genetic algorithms (Lee & Hammer, 2001). Both categories of algorithms exploit the particularities of the specific view selection problem and the restrictions of the class of queries considered.
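As an illustration of the first category, the Python sketch below performs a greedy, benefit-per-unit-space selection under a space constraint in the spirit of Harinarayan et al. (1996) and Gupta (1997). The data structures (an explicit "answers" set per view and a designated base view) are simplifications introduced here for self-containment; real implementations derive them from the multidimensional lattice.

    def greedy_view_selection(views, base, space_budget, queries):
        """Greedily pick views to materialize under a space constraint.

        views:   {view_name: {"size": rows, "answers": set of query names}}.
                 In an OLAP lattice, "answers" would be the group-bys derivable
                 from the view; here it is given explicitly.
        base:    name of the view (e.g. the fact table) that answers every query;
                 it is assumed already materialized and not charged to the budget.
        queries: {query_name: frequency}.
        The cost of a query is the size of the smallest selected view that
        answers it (the linear cost model of Harinarayan et al.).
        """
        def cost(q, selected):
            return min(views[v]["size"] for v in selected
                       if q in views[v]["answers"])

        selected, used = {base}, 0
        while True:
            best, best_ratio = None, 0.0
            for v, info in views.items():
                if v in selected or used + info["size"] > space_budget:
                    continue
                benefit = sum(freq * (cost(q, selected) - cost(q, selected | {v}))
                              for q, freq in queries.items())
                ratio = benefit / info["size"]
                if ratio > best_ratio:
                    best, best_ratio = v, ratio
            if best is None:        # nothing fits, or nothing still reduces cost
                return selected
            selected.add(best)
            used += views[best]["size"]

This is the flavor of "greedy algorithm with performance guarantee" cited above; the randomized algorithms explore the same search space through moves such as adding, dropping or swapping views.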
FUTURE TRENDS

The view selection problem has been addressed for different types of queries. Research has focused mainly on queries over star schemas. Newer applications (e.g., XML or Web-based applications) require different types of queries. This topic has only been partially investigated (Golfarelli et al., 2001; Labrinidis & Roussopoulos, 2000).

A relevant issue that needs further investigation is the construction of the search space of alternative view sets for materialization. Even though the construction of such a search space for grouping and aggregation queries is straightforward (Harinarayan et al., 1996), it becomes an intricate problem for general queries (Golfarelli & Rizzi, 2001).

Indexes can be seen as special types of views. Gupta et al. (1997) show that a two-step process that divides the space available for materialization and picks views first and then indexes can perform very poorly. More work needs to be done on the problem of automating the selection of views and indexes together.

DWs are dynamic entities that evolve continuously over time. As time passes, new queries need to be satisfied. A dynamic version of the view selection problem chooses additional views for materialization and avoids the design of the DW from scratch (Theodoratos & Sellis, 2000). A system that dynamically materializes views in the DW at multiple levels of granularity in order to match the workload (Kotidis & Roussopoulos, 2001) is a current trend in the design of a DW.

CONCLUSION

A DW can be seen as a set of materialized views. A central problem in the design of a DW is the selection of views to materialize in it. Depending on the requirements of the prospective users of the DW, the materialized view selection problem can be formulated with various design goals that comprise the minimization of cost functions and the satisfaction of user- and system-oriented constraints. Because of its importance, different versions of it have been the focus of attention of many researchers in recent years. Papers in the literature deal mainly with the issue of determining a search space of alternative view sets for materialization and with the issue of designing optimization algorithms that avoid examining exhaustively the usually huge search space. Some results of this research have already been used in commercial database management systems (Agrawal et al., 2000).

REFERENCES

Agrawal, S., Chaudhuri, S., & Narasayya, V.R. (2000). Automated selection of materialized views and indexes in SQL databases. International Conference on Very Large Data Bases (VLDB), Cairo, Egypt.

Akinde, M.O., Jensen, O.G., & Böhlen, M.H. (1998). Minimizing detail data in data warehouses. International Conference on Extending Database Technology (EDBT), Valencia, Spain.

Baralis, E., Paraboschi, S., & Teniente, E. (1997). Materialized views selection in a multidimensional database. International Conference on Very Large Data Bases, Athens, Greece.

Golfarelli, M., & Rizzi, S. (2000). View materialization for nested GPSJ queries. International Workshop on Design and Management of Data Warehouses (DMDW), Stockholm, Sweden.

Golfarelli, M., Rizzi, S., & Vrdoljak, B. (2001). Data warehouse design from XML sources. ACM International Workshop on Data Warehousing and OLAP (DOLAP), Atlanta, Georgia.

Gupta, H. (1997). Selection of views to materialize in a data warehouse. International Conference on Database Theory (ICDT), Delphi, Greece.

Gupta, H., Harinarayan, V., Rajaraman, A., & Ullman, J.D. (1997). Index selection for OLAP. IEEE International Conference on Data Engineering, Birmingham, UK.

Gupta, H., & Mumick, I.S. (1999). Selection of views to materialize under a maintenance cost constraint. International Conference on Database Theory (ICDT), Jerusalem, Israel.

Harinarayan, V., Rajaraman, A., & Ullman, J. (1996). Implementing data cubes efficiently. ACM SIGMOD International Conference on Management of Data (SIGMOD), Montreal, Canada.

Kalnis, P., Mamoulis, N., & Papadias, D. (2002). View selection using randomized search. Data & Knowledge Engineering, 42(1), 89-111.

Kotidis, Y., & Roussopoulos, N. (2001). A case for dynamic view management. ACM Transactions on Database Systems, 26(4), 388-423.

Labrinidis, A., & Roussopoulos, N. (2000). WebView materialization. ACM SIGMOD International Conference on Management of Data (SIGMOD), Dallas, Texas.

Lee, M., & Hammer, J. (2001). Speeding up materialized view selection in data warehouses using a randomized algorithm. International Journal of Cooperative Information Systems (IJCIS), 10(3), 327-353.

Liang, W. (1999). Making multiple views self-maintainable in a data warehouse. Data & Knowledge Engineering, 30(2), 121-134.

Quass, D., Gupta, A., Mumick, I.S., & Widom, J. (1996). Making views self-maintainable for data warehousing. International Conference on Parallel and Distributed Information Systems (PDIS), Florida Beach, Florida.

Ross, K., Srivastava, D., & Sudarshan, S. (1996). Materialized view maintenance and integrity constraint checking: Trading space for time. ACM SIGMOD International Conference on Management of Data (SIGMOD), Montreal, Canada.

Shukla, A., Deshpande, P., & Naughton, J. (1998). Materialized view selection for multidimensional datasets. International Conference on Very Large Data Bases (VLDB), New York.

Theodoratos, D. (2000). Complex view selection for data warehouse self-maintainability. International Conference on Cooperative Information Systems (CoopIS), Eilat, Israel.

Theodoratos, D., Dalamagas, T., Simitsis, A., & Stavropoulos, M. (2001). A randomized approach for the incremental design of an evolving data warehouse. International Conference on Conceptual Modeling (ER), Yokohama, Japan.

Theodoratos, D., & Sellis, T. (1999). Designing data warehouses. Data & Knowledge Engineering, 31(3), 279-301.

Theodoratos, D., & Sellis, T. (2000). Incremental design of a data warehouse. Journal of Intelligent Information Systems (JIIS), 15(1), 7-27.

Yang, J., Karlapalem, K., & Li, Q. (1997). Algorithms for materialized view design in data warehousing environment. International Conference on Very Large Data Bases, Athens, Greece.

KEY TERMS

Auxiliary View: A view materialized in the DW exclusively for reducing the view maintenance cost.

Materialized View: A view whose answer is stored in the DW.

Operational Cost: A linear combination of the query evaluation and view maintenance cost.

Query Evaluation Cost: The sum of the cost of evaluating each input query rewritten over the materialized views.

Self-Maintainable View: A materialized view that can be maintained, for any instance of the source relations, and for all source relation changes, using only these changes, the view definition, and the view materialization.

View: A named query.

View Maintenance Cost: The sum of the cost of propagating each source relation change to the materialized views.

Methods for Choosing Clusters in Phylogenetic Trees
Tom Burr
Los Alamos National Laboratory, USA

INTRODUCTION

One data mining activity is cluster analysis, of which there are several types. One type deserving special attention is clustering that arises due to evolutionary relationships among organisms. Genetic data is often used to infer evolutionary relations among a collection of species, viruses, bacteria, or other taxonomic units (taxa). A phylogenetic tree (Figure 1, top) is a visual representation of either the true or the estimated branching order of the taxa, depending on the context. Because the taxa often cluster in agreement with auxiliary information, such as geographic or temporal isolation, a common activity associated with tree estimation is to infer the number of clusters and cluster memberships, which is also a common goal in most applications of cluster analysis. However, tree estimation is unique because of the types of data used and the use of probabilistic evolutionary models which lead to computationally demanding optimization problems. Furthermore, novel methods to choose the number of clusters and cluster memberships have been developed and will be described here. The methods include a unique application of model-based clustering, a maximum likelihood plus bootstrap method, and a Bayesian method based on obtaining samples from the posterior probability distribution on the space of possible branching orders.

BACKGROUND

Tree estimation is frequently applied to genetic data of various types; we focus here on applications involving DNA data, such as that from HIV. Trees are intended to convey information about the genealogy of such viruses, and the most genetically similar viruses are most likely to be most related. However, because the evolutionary process includes random effects, there is no guarantee that "closer in genetic distance" implies "closer in time" for every pair of sequences.

Sometimes the cluster analysis must be applied to large numbers of taxa, or applied repeatedly to the same number of taxa. For example, Burr, Myers, and Hyman (2001) recently investigated how many subtypes (clusters) arise under a simple model of how the env (gp120) region of HIV-1, group M sequences (Figure 1, top) are evolving. One question was whether the subtypes of group M could be explained by the past population dynamics of the virus. For each of many simulated data sets, each having approximately 100 taxa, model-based clustering was applied to automate the process of choosing the number of clusters. The novel application of model-based clustering and its potential for scaling to large numbers of taxa will be described, along with the two methods mentioned in the introduction section.
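One generic way to automate the choice of the number of clusters with model-based clustering is to fit finite mixture models for a range of candidate cluster counts and keep the count with the best BIC score. The sketch below does this with scikit-learn Gaussian mixtures fitted to a Euclidean embedding (multidimensional scaling) of the pairwise genetic distances; it is only an illustration of the general idea, not the specific procedure used by Burr, Myers, and Hyman (2001), and the embedding dimension and the BIC criterion are assumptions made here.

    import numpy as np
    from sklearn.manifold import MDS
    from sklearn.mixture import GaussianMixture

    def choose_number_of_clusters(dist_matrix, max_k=10, n_dims=5, seed=0):
        """Pick the number of clusters by BIC over Gaussian mixture fits.

        dist_matrix: (n_taxa x n_taxa) array of pairwise genetic distances.
        The distances are first embedded in n_dims Euclidean dimensions.
        """
        coords = MDS(n_components=n_dims, dissimilarity="precomputed",
                     random_state=seed).fit_transform(np.asarray(dist_matrix))
        best_k, best_bic, best_labels = 1, np.inf, None
        for k in range(1, max_k + 1):
            gm = GaussianMixture(n_components=k, covariance_type="full",
                                 random_state=seed).fit(coords)
            bic = gm.bic(coords)
            if bic < best_bic:
                best_k, best_bic, best_labels = k, bic, gm.predict(coords)
        return best_k, best_labels

Such a procedure can be run automatically over many simulated data sets, which is what makes model-based clustering attractive when the analysis must be repeated for large numbers of taxa.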
It is well-known that cluster analysis results can depend strongly on the metric. There are at least three unique metric-related features of DNA data. First, the DNA data is categorical. Second, a favorable trend in phylogenetic analysis of DNA data is to choose the evolutionary model using goodness of fit or likelihood ratio tests (Huelsenbeck & Rannala, 1997). For nearly all of the currently used evolutionary models, there is an associated distance measure. Therefore, there is the potential to make an objective metric choice. Third, the evolutionary model is likely to depend on the region of the genome. DNA regions that code for amino acids are more constrained over time due to selective pressure and therefore are expected to have a smaller rate of change than non-coding sequences.
obtaining samples from the posterior probability distribu- A common evolutionary model is as follows (readers
tion on the space of possible branching orders. who are uninterested in the mathematical detail should
skip this paragraph). Consider a pair of taxa denoted x and
y. Define Fxy as
BACKGROUND
n AA n AC n AG n AT
Tree estimation is frequently applied to genetic data of
nCA nCC nCG nCT ,
various types; we focus here on applications involving NFxy =
nGA nGC nGG nGT
DNA data, such as that from HIV. Trees are intended to

convey information about the genealogy of such viruses nTA nTC nTG nTT
and the most genetically similar viruses are most likely to
be most related. However, because the evolutionary pro-
where N is the number of base pairs (sites) in set of aligned
cess includes random effects, there is no guarantee that
sequences, nAA is the number of sites with taxa x and y
closer in genetic distance implies closer in time, for
both having an A, n AC is the number of sites with taxa x
every pair of sequences.
having an A and taxa y having a C, etc. The most general
Sometimes the cluster analysis must be applied to
time-reversible model (GTR) for which a distance measure
large numbers of taxa, or applied repeatedly to the same
has been defined (Swofford, Olsen, Waddell, & Hillis,
number of taxa. For example, Burr, Myers, and Hyman
1996) defines the distance between taxa x and y as d xy =
(2001) recently investigated how many subtypes (clus-
-trace{ log( -1Fxy)} where is a diagonal matrix of the
ters) arise under a simple model of how the env (gp120)
average base frequencies in taxa x and y and the trace is
region of HIV-1, group M sequences (Figure 1, top) are

the sum of diagonal elements. The GTR is fully specified by 5 relative rate parameters (a, b, c, d, e) and 3 relative frequency parameters (πA, πC, and πG, with πT determined via πA + πC + πG + πT = 1) in the rate matrix Q defined as

Q/μ =
| ·     aπC   bπG   cπT |
| aπA   ·     dπG   eπT |
| bπA   dπC   ·     fπT |
| cπA   eπC   fπG   ·   |

where μ is the overall substitution rate. The rate matrix Q is related to the substitution probability matrix P via Pij(t) = e^(Qt), where Pij(t) is the probability of a change from nucleotide i to j in time t and Pij(t) satisfies the time reversibility and stationarity criteria: πiPij = πjPji. Commonly used models such as Jukes-Cantor (Swofford et al., 1996) assume that a = b = c = d = e = 1 and πA = πC = πG = πT = 0.25. For the Jukes-Cantor model, it follows that Pij(t) = 0.25 + 0.75e^(-μt) and that the distance between taxa x and y is -3/4 log(1 - 4/3 D), where D is the percentage of sites where x and y differ (regardless of what kind of difference, because all relative substitution rates and base frequencies are assumed to be equal). Important generalizations include allowing unequal relative frequencies and/or rate parameters, and allowing the rate to vary across DNA sites. Allowing the rate to vary across sites via a gamma-distributed rate parameter is one way to model the fact that sites often have different observed rates. If the rate is assumed to follow a gamma distribution with shape parameter α, then these gamma distances can be obtained from the original distances by replacing the function log(x) with α(1 - x^(-1/α)) in the dxy = -trace{Π log(Π⁻¹Fxy)} formula (Swofford et al., 1996). Generally, this rate heterogeneity and the fact that multiple substitutions at the same site tend to saturate any distance measure make it a practical challenge to find a metric such that the distance between any two taxa increases linearly with time.
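To make the simplest case concrete, the short Python sketch below (added here for illustration; the sequences and the function name are hypothetical and not part of the original study) computes D, the proportion of differing sites between two aligned sequences, and converts it to the Jukes-Cantor distance -3/4 log(1 - 4/3 D).

import math

def jukes_cantor_distance(seq_x, seq_y):
    """Jukes-Cantor distance between two aligned DNA sequences (a sketch).

    D is the proportion of sites at which the sequences differ; the
    correction -3/4 * ln(1 - 4/3 * D) accounts for multiple substitutions
    at the same site, as described in the text.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned (equal length)")
    n_sites = len(seq_x)
    n_diff = sum(1 for a, b in zip(seq_x, seq_y) if a != b)
    d_raw = n_diff / n_sites
    if d_raw >= 0.75:  # distance is undefined (saturated) beyond this point
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 / 3.0 * d_raw)

# Hypothetical aligned fragments for two taxa:
print(jukes_cantor_distance("ACGTACGTAC", "ACGTTCGAAC"))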

Figure 1. HIV Data (env region). (Top) Hierarchical clustering; (Middle) principal coordinate plot; (Bottom) results of model-based clustering under six different assumptions regarding volume (V), shape (S), and orientation (O). E denotes equal among clusters and V denotes varying among clusters, for V, S, and O respectively. For example, case 6 has varying V, equal S, and varying O among clusters. Models 1 and 2 each assume a spherical shape (I denotes the identity matrix, so S and O are equal among clusters, while V is equal for case 1 and varying for case 2). Note that the B and D subtypes tend to be merged.
[Figure 1 consists of three panels: a hierarchical clustering tree with taxa labeled by HIV-1 subtype (A-G), a principal coordinate scatter plot of the taxa on axes x1 and x2, and a plot of BIC versus number of clusters (1-20) for the six model parameterizations EI, VI, EEE, VVV, EEV, and VEV.]

MAIN THRUST

Much more can be said about evolutionary models, but the background should suffice to convey the notion that relationships among taxa determine the probability (albeit in a complicated way via the substitution probability matrix P) of observing various DNA character states among the taxa. Therefore, it is feasible to turn things around and estimate taxa relationships on the basis of observed DNA. The quality of the estimate depends on the adequacy of the model, the number of observed DNA sites, the number of taxa, and the complexity of the true tree. These important issues are discussed for example in Swofford et al. (1996), and nearly all published methods now are related to varying degrees to the fundamental result of Felsenstein (1981) for calculating the likelihood of a given set of character states for a given tree. Likelihood-based estimation is typically computationally demanding, and rarely (only for a small number of taxa, say 15 or less) is the likelihood evaluated for all possible branching orders. Instead, various search strategies are used that begin with a subset of the taxa. More recently, branch rearrangement strategies are also employed that allow exploration of the space of likelihoods (Li, Pearl, & Doss, 2000; Simon & Larget, 1998) and approximate probabilities via Markov Chain Monte Carlo (MCMC, see below) of each of the most likely trees. Finding the tree that optimizes some criteria such as the likelihood is a large topic that is largely outside our scope here (see Salter, 2000).

Our focus is on choosing groups of taxa because knowledge of group structure adds considerably to our understanding of how the taxa are evolving (Korber et al., 2000) and perhaps also leads to efficient ways to estimate trees containing a large number of taxa (Burr, Skourikhine, Macken, & Bruno, 1999). Here we focus on choosing groups of taxa without concern for the intended use of the estimated group structure, and consider three strategies.

One option for clustering taxa is to construct trees that have maximum likelihood (not necessarily a true maximum likelihood because of incomplete searches used), identify groups, and then repeat using resampled data sets. The resampled (bootstrap) data sets are obtained from the original data sets by sampling DNA sites with replacement. This resampling captures some of the variation inherent in the evolutionary process. If, for example, 999 of 1000 bootstrapped ML trees each show a particular group being monophyletic (coalescing to a common ancestor before any taxa from outside the group), then the consensus is that this group is strongly supported (Efron, Halloran, & Holmes, 1996).
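The site-resampling step behind this bootstrap procedure can be sketched as follows (Python; the toy alignment and function name are assumptions made for illustration, and each replicate would then be passed to whatever tree-estimation program is in use).

import random

def bootstrap_alignment(alignment, seed=None):
    """Return one bootstrap replicate by resampling alignment columns (sites) with replacement."""
    rng = random.Random(seed)
    n_sites = len(alignment[0])
    # Draw site indices with replacement (one draw per original site).
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[j] for j in cols) for seq in alignment]

# Hypothetical toy alignment (rows = taxa, columns = sites):
alignment = ["ACGTACGT",
             "ACGTACGA",
             "ACGAACGT"]
replicate = bootstrap_alignment(alignment, seed=1)
print(replicate)  # each replicate is re-analyzed and group support is tallied over replicates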
A second option is to evaluate the probability of each of the most likely branching orders. This is computationally demanding and relies heavily on efficient branch rearrangement methods (Li, Pearl, & Doss, 2000) to implement MCMC as a way to evaluate the likelihood of many different branching orders. In MCMC, the likelihood ratio of branching orders (and branch lengths) is used to generate candidate branching orders according to their relative probabilities, and therefore to evaluate the relative probability of many different branching orders.

A third option (Figure 1, middle) is to represent the DNA character data using: (a) the substitution probability matrix P to define distances between each pair of taxa (described above); (b) multi-dimensional scaling to reduce the data dimension and represent each taxon in two to four or five dimensions in such a way that the pairwise distances are closely approximated by distances computed using the low-dimensional representation; and (c) model-based clustering of the low-dimensional data. Several clustering methods could be applied to the data if the data (A, C, T, and G) were coded so that distances could be computed. An effective way to do this is to represent the pairwise distance data via multidimensional scaling. Multidimensional scaling represents the data in new coordinates such that distances computed in the new coordinates very closely approximate the original distances computed using the chosen metric. For n taxa with an n-by-n distance matrix, the result of multidimensional scaling is an n-by-p matrix (p coordinates) that can be used to closely approximate the original distances. Therefore, multidimensional scaling provides a type of data compression, with the new coordinates being suitable for input to clustering methods. For example, one could use the cmdscale function in S-PLUS (2003) to implement classical multidimensional scaling. The implementation of model-based clustering we consider includes kmeans as a special case, as we will describe.

In model-based clustering, it is assumed that the data are generated by a mixture of probability distributions in which each component of the mixture represents a cluster. Given n p-dimensional observations x = (x1, x2, ..., xn), assume there are G clusters and let fk(xi | θk) be the probability density for cluster k. The model for the composite of clusters is typically formulated in one of two ways. The classification likelihood approach maximizes

LC(θ1, ..., θG; γ1, ..., γn | x) = Πi fγi(xi | θγi),

where the γi are discrete labels satisfying γi = k if xi belongs to cluster k. The mixture likelihood approach maximizes

LM(θ1, ..., θG; τ1, ..., τG | x) = Πi Σk τk fk(xi | θk),

where τk is the probability that an observation belongs to cluster k.
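A minimal sketch of steps (a)-(c) is given below. It is illustrative only: the analysis described in this article used the cmdscale function in S-PLUS and the MCLUST implementation of model-based clustering, whereas the sketch assumes NumPy and scikit-learn. Classical multidimensional scaling is obtained from the eigendecomposition of the doubly centered squared-distance matrix, and the number of clusters is chosen by comparing the BIC of fitted Gaussian mixtures.

import numpy as np
from sklearn.mixture import GaussianMixture

def classical_mds(dist_matrix, n_coords=3):
    """Classical (Torgerson) MDS: low-dimensional coordinates approximating the distances."""
    d2 = np.asarray(dist_matrix) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    b = -0.5 * j @ d2 @ j                         # doubly centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_coords]  # keep the largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(lam)

def choose_clusters_by_bic(coords, max_clusters=10):
    """Fit Gaussian mixtures with 1..max_clusters components; return the BIC-best model.

    Note: scikit-learn's BIC is defined so that smaller values are better,
    the opposite sign convention from the BIC plotted by MCLUST.
    """
    best = None
    for g in range(1, max_clusters + 1):
        gm = GaussianMixture(n_components=g, covariance_type="full", random_state=0)
        gm.fit(coords)
        if best is None or gm.bic(coords) < best.bic(coords):
            best = gm
    return best

# dist is assumed to be an n-by-n matrix of pairwise evolutionary distances:
# coords = classical_mds(dist, n_coords=3)
# model = choose_clusters_by_bic(coords)
# labels = model.predict(coords)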

Fraley and Raftery (1999) describe their latest version of model-based clustering, where the fk are assumed to be multivariate Gaussian with mean μk and covariance matrix Σk. Banfield and Raftery (1993) developed a model-based framework by parameterizing the covariance matrix in terms of its eigenvalue decomposition in the form Σk = λk Dk Ak Dk^T, where Dk is the orthonormal matrix of eigenvectors, Ak is a diagonal matrix with elements proportional to the eigenvalues of Σk, and λk is a scalar, which under one convention is the largest eigenvalue of Σk. The orientation of cluster k is determined by Dk, Ak determines the shape, while λk specifies the volume. Each of the volume, shape, and orientation (VSO) can be variable among groups, or fixed at one value for all groups. One advantage of the mixture-model approach is that it allows the use of approximate Bayes factors to compare models, giving a means of selecting the model parameterization (which of V, S, and O are variable among groups) and the number of clusters (Figure 1, bottom). The Bayes factor is the posterior odds for one model against another model, assuming that neither model is favored a priori (uniform prior). When the EM algorithm (estimation-maximization; Dempster, Laird, & Rubin, 1977) is used to find the maximum mixture likelihood, the most reliable approximation to twice the log Bayes factor (called the Bayesian Information Criterion, BIC) is BIC = 2 lM(x, θ) - mM log(n), where lM(x, θ) is the maximized mixture loglikelihood for the model and mM is the number of independent parameters to be estimated in the model. A convention for calibrating BIC differences is that differences less than 2 correspond to weak evidence, differences between 2 and 6 are positive evidence, differences between 6 and 10 are strong evidence, and differences more than 10 are very strong evidence.

The two clustering methods ML + bootstrap and model-based clustering have been compared on the same data sets (Burr, Gattiker, & LaBerge, 2002b), and the differences were small. For example, ML + bootstrap suggested 7 clusters for 95 HIV env sequences and 6 clusters in HIV gag (p17) sequences, while model-based clustering suggests 6 for env (it tends to merge the so-called B and D subtypes; see Figure 1) and 6 for gag. Note from Figure 1 (top) that only 7 of the 10 recognized subtypes were included among the 95 sequences. However, it is likely that case-specific features will determine the extent of difference between the methods. Also, model-based clustering provides a more natural and automatic way to identify candidate groups. Once these candidate groups have been identified, then either method is reasonable for assigning confidence measures to the resulting cluster assignments. The two clustering methods ML + bootstrap and an MCMC-based estimate of the posterior probability on the space of branching orders have also been compared on the same data sets (Burr, Doak, Gattiker, & Stanbro, 2002a) with respect to the confidence that each method assigns to particular groups that were chosen in advance of the analysis. The plot of the MCMC-based estimate versus the ML + bootstrap-based estimate was consistent with the hypothesis that both methods assign (on average) the same probability to a chosen group, unless the number of DNA sites was very small, in which case there can be a non-negligible bias in the ML method, resulting in bias in the ML + bootstrap results. Because the group was chosen in advance, the method for choosing groups was not fully tested, so there is a need for additional research.

FUTURE TRENDS

The recognized subtypes of HIV-1 were identified using informal observations of tree shapes followed by ML + bootstrap (Korber & Myers, 1992). Although identifying such groups is common in phylogenetic trees, there have been only a few attempts to formally evaluate clustering methods for the underlying genetic data. ML + bootstrap remains the standard way to assign confidence to hypothesized groups, and the group structure is usually hypothesized either by using auxiliary information (geographic, temporal, or other) or by visual inspection of trees (which often display distinct groups). A more thorough evaluation could be performed using realistic simulated data with known branching orders. Having known branching order is almost the same as having known groups; however, choosing the number of groups is likely to involve arbitrary decisions even when the true branching order is known.

CONCLUSION

One reason to cluster taxa is that evolutionary processes can sometimes be revealed once the group structure is recognized. Another reason is that phylogenetic trees are complicated objects that can often be effectively summarized by identifying the major groups together with a description of the typical between- and within-group variation. Also, if we correctly choose the number of clusters present in the tree for a large number of taxa (100 or more), then we can use these groups to rapidly construct a good approximation to the true tree. One strategy for doing this is to repeatedly apply model-based clustering to relatively small numbers of taxa (100 or fewer) and check for consistent indications for the number of groups. We described two other clustering strategies (ML + bootstrap, and MCMC-based) and note that two studies have made limited comparisons of these three methods on the same genetic data.

REFERENCES

Banfield, J., & Raftery, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.

Burr, T., Charlton, W., & Stanbro, W. (2000). Comparison of signature pattern analysis methods in molecular epidemiology. Proc. Mathematical and Engineering Methods in Medicine and Biological Sciences, 1 (pp. 473-479).

Burr, T., Doak, J., Gattiker, J., & Stanbro, W. (2002a). Assessing confidence in phylogenetic trees: Bootstrap versus Markov Chain Monte Carlo. Mathematical and Engineering Methods in Medicine and Biological Sciences, 1, 181-187.

Burr, T., Gattiker, J., & LaBerge, G. (2002b). Genetic subtyping using cluster analysis. Special Interest Group on Knowledge Discovery and Data Mining Explorations, 3, 33-42.

Burr, T., Myers, G., & Hyman, J. (2001). The origin of AIDS: Darwinian or Lamarckian? Phil. Trans. R. Soc. Lond. B, 356, 877-887.

Burr, T., Skourikhine, A.N., Macken, C., & Bruno, W. (1999). Confidence measures for evolutionary trees: Applications to molecular epidemiology. Proc. of the 1999 IEEE Inter. Conference on Information, Intelligence and Systems (pp. 107-114).

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood for incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Efron, B., Halloran, E., & Holmes, S. (1996). Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. USA, 93, 13429.

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17, 368-376.

Fraley, C., & Raftery, A. (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification, 16, 297-306.

Huelsenbeck, J., & Rannala, B. (1997). Phylogenetic methods come of age: Testing hypotheses in an evolutionary context. Science, 276, 227-232.

Korber, B., Muldoon, N., Theiler, J., Gao, R., Gupta, R., Lapedes, A., Hahn, B., Wolinsky, W., & Bhattacharya, T. (2000). Timing the ancestor of the HIV-1 pandemic strains. Science, 288, 1788-1796.

Korber, B., & Myers, G. (1992). Signature pattern analysis: A method for assessing viral sequence relatedness. AIDS Research and Human Retroviruses, 8, 1549-1560.

Li, S., Pearl, D., & Doss, H. (2000). Phylogenetic tree construction using Markov Chain Monte Carlo. Journal of the American Statistical Association, 95(450), 493-508.

Salter, L. (2000). Algorithms for phylogenetic tree reconstruction. Proc. Mathematical and Engineering Methods in Medicine and Biological Sciences, 2 (pp. 459-465).

S-PLUS Statistical Programming Language. (2003). Insightful Corp., Seattle, Washington.

Swofford, D.L., Olsen, G.J., Waddell, P.J., & Hillis, D.M. (1996). Phylogenetic inference. In Hillis et al. (Eds.), Molecular systematics (2nd ed.) (pp. 407-514). Sunderland, Massachusetts: Sinauer Associates.

KEY TERMS

Bayesian Information Criterion: An approximation to the Bayes factor which can be used to estimate the Bayesian posterior probability of a specified model.

Bootstrap: A resampling scheme in which surrogate data is generated by resampling the original data or sampling from a model that was fit to the original data.

Coalesce: In the context of phylogenetic trees, two lineages coalesce at the time that they most recently share a common ancestor (and hence, come together in the tree).

Estimation-Maximization Algorithm: An algorithm for computing maximum likelihood estimates from incomplete data. In the case of fitting mixtures, the group labels are the missing data.

HIV: Human Immunodeficiency Virus.

Markov Chain Monte Carlo: A stochastic method to approximate probabilities, available in many situations for which analytical methods are not available. The method involves generating observations from the probability distribution by evaluating the likelihood ratio of any two candidate solutions.

Mixture of Distributions: A combination of two or more distributions in which observations are generated from distribution i with probability pi, and Σpi = 1.

Model-Based Clustering: A clustering method with relatively flexible assumptions regarding the volume, shape, and orientation of each cluster.

Phylogenetic Tree: A representation of the branching order and branch lengths of a collection of taxa, which, in its most common display form, looks like the branches of a tree.

Probability Density Function: A function that can be summed (for discrete-valued random variables) or integrated (for interval-valued random variables) to give the probability of observing values in a specified set.

Substitution Probability Matrix: A matrix whose i, j entry is the probability of substituting DNA character j (C, G, T or A) for character i over a specified time period.

Microarray Data Mining


Li M. Fu
University of Florida, USA

INTRODUCTION

Based on the concept of simultaneously studying the expression of a large number of genes, a DNA microarray is a chip on which numerous probes are placed for hybridization with a tissue sample. Biological complexity encoded by a deluge of microarray data is being translated into all sorts of computational, statistical, or mathematical problems bearing on biological issues ranging from genetic control to signal transduction to metabolism. Microarray data mining is aimed at identifying biologically significant genes and finding patterns that reveal molecular network dynamics for reconstruction of genetic regulatory networks and pertinent metabolic pathways.

BACKGROUND

The idea of microarray-based assays seemed to emerge as early as the 1980s (Ekins & Chu, 1999). In that period, a computer-based scanning and image-processing system was developed to quantify the expression level in tissue samples of each cloned complementary DNA sequence spotted in a two-dimensional array on strips of nitrocellulose, which could be the first prototype of the DNA microarray. The microarray-based gene expression technology was actively pursued in the mid-1990s (Schena, Heller, & Theriault, 1998) and has seen rapid growth since then.

Microarray technology has catalyzed the development of the field known as functional genomics by offering high-throughput analysis of the functions of genes on a genomic scale (Schena et al., 1998). There are many important applications of this technology, including elucidation of the genetic basis for health and disease, discovery of biomarkers of therapeutic response, identification and validation of new molecular targets and modes of action, and so on. The accomplishment of decoding the human genome sequence together with recent advances in biochip technology has ushered in genomics-based medical therapeutics, diagnostics, and prognostics.

MAIN THRUST

The laboratory information management system (LIMS) keeps track of and manages data produced from each step in a microarray experiment, such as hybridization, scanning, and image processing. As microarray experiments generate a vast amount of data, the efficient storage and use of the data require a database management system. Although some databases are designed to be data archives only, other databases such as ArrayDB (Ermolaeva, Rastogi, & Pruitt, 1998) and Argus (Comander, Weber, Gimbrone, & Garcia-Cardena, 2001) allow information storage, query, and retrieval, as well as data processing, analysis, and visualization. These databases also provide a means to link microarray data to other bioinformatics databases (e.g., NCBI Entrez systems, Unigene, KEGG, and OMIM). The integration with external information is instrumental to the interpretation of patterns recognized in the gene-expression data. To facilitate the development of microarray databases and analysis tools, there is a need to establish a standard for recording and reporting microarray gene expression data. The MIAME (Minimum Information about Microarray Experiments) standard includes a description of experimental design, array design, samples, hybridization, measurements, and normalization controls (Brazma, Hingamp, & Quackenbush, 2001).

Data Mining Objectives

Data mining addresses the question of how to discover a gold mine from historical or experimental data, particularly in a large database. The goal of data mining and knowledge discovery algorithms is to extract implicit and previously unknown nontrivial patterns, regularities, or knowledge from large data sets that can be used to improve strategic planning and decision making. The discovered knowledge capturing the relations among the variables of interest can be formulated as a function for making prediction and classification or as a model for understanding the problem in a given domain. In the context of microarray data, the objectives are identifying significant genes and finding gene expression
patterns associated with known or unknown categories. Microarray data mining is an important topic in bioinformatics, dealing with information processing on biological data, particularly genomic data.

Practical Factors Prior to Data Mining

Some practical factors should be taken into account prior to microarray data mining. First of all, microarray data produced by different platforms vary in their formats and may need to be processed differently. For example, one type of microarray with cDNA as probes produces ratio data from two channel outputs, whereas another type of microarray using oligonucleotide probes generates nonratio data from a single channel. Not only may different platforms pick up gene expression activity with different levels of sensitivity and specificity, but also different data processing techniques may be required for different data formats.

Normalizing data to allow direct array-to-array comparison is a critical issue in array data analysis, because several variables in microarray experiments can affect measured mRNA levels (Schadt, Li, Ellis, & Wong, 2001; Yang, Dudoit, & Luu, 2002). Variations may occur during sample handling, slide preparation, hybridization, or image analysis. Normalization is essential for correct microarray data interpretation. In simple ways, data can be normalized by dividing or subtracting expression values by a representative value (e.g., the mean or median in an array) or by taking a linear transformation to zero mean and unit variance. As an example, data normalization in the case of cDNA arrays may proceed as follows: The local background intensity is subtracted from the value of each spot on the array; the two channels are normalized against the median values on that array; and the Cy5/Cy3 fluorescence ratios and log10-transformed ratios are calculated from the normalized values. In addition, genes that do not change significantly can be removed through a filter in a process called data filtration.
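As a hedged illustration of the per-array steps just described (this is not code from any cited system; the intensity arrays and background values are made up), the following Python sketch subtracts local background, scales each channel to its median, and forms Cy5/Cy3 log-ratios.

import numpy as np

def normalize_cdna_array(cy5, cy3, cy5_bg, cy3_bg):
    """Simple cDNA-array normalization: background subtraction, median scaling, log-ratios."""
    # Background-corrected intensities (clipped at a small positive floor).
    r = np.clip(np.asarray(cy5, float) - cy5_bg, 1e-6, None)
    g = np.clip(np.asarray(cy3, float) - cy3_bg, 1e-6, None)
    # Normalize each channel against its median value on the array.
    r /= np.median(r)
    g /= np.median(g)
    ratio = r / g
    return ratio, np.log10(ratio)

# Hypothetical spot intensities for one array:
cy5 = [1200.0, 800.0, 150.0, 4000.0]
cy3 = [1000.0, 820.0, 140.0, 1000.0]
ratio, log_ratio = normalize_cdna_array(cy5, cy3, cy5_bg=100.0, cy3_bg=90.0)
print(log_ratio)  # genes with small |log10 ratio| could later be removed by a data filter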
Differential Gene Expression

To identify genes differentially expressed across two conditions is one of the most important issues in microarray data mining. In cancer research, for example, we wish to understand what genes are abnormally expressed in a certain type of cancer, so we conduct a microarray experiment and collect the gene expression profiles of normal and cancer tissues, respectively, as the control and test samples. The information regarding differential expression is derived from comparing the test against the control sample.

To determine which genes are differentially expressed, a common approach is based on fold-change; in this approach, we simply decide a fold-change threshold (e.g., twofold) and select genes associated with changes greater than that threshold. If a cDNA microarray is used, the ratio of the test over control expression in a single array can be converted easily to fold change in both cases of up-regulation (induction) and down-regulation (suppression). For oligonucleotide chips, fold-change is computed from two arrays, one for the test and the other for the control sample. In this case, if multiple samples in each condition are available, the statistical t-test or Wilcoxon tests can be applied, but the catch is that the Bonferroni adjustment to the level of significance on hypothesis testing would be necessary to account for the presence of multiple genes. The t-test determines the difference in mean expression values between two conditions and identifies genes with significant difference. The nonparametric Wilcoxon test is a good alternative in the case of non-Gaussian data distribution. SAM (Significance Analysis of Microarrays) (Tusher, Tibshirani, & Chu, 2001) is a state-of-the-art technique based on balanced perturbation of repeated measurements and minimization of the false discovery rate.
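A minimal sketch of the fold-change and t-test screens, assuming NumPy and SciPy and using made-up expression matrices and thresholds, is shown below; it illustrates the general approach rather than SAM or any other specific cited method.

import numpy as np
from scipy import stats

def differential_genes(test, control, fold_threshold=2.0, alpha=0.05):
    """Flag genes by fold-change and by a t-test with a Bonferroni-adjusted significance level.

    test, control: arrays of shape (n_genes, n_samples) of expression values.
    """
    test = np.asarray(test, float)
    control = np.asarray(control, float)
    fold = test.mean(axis=1) / control.mean(axis=1)
    by_fold = (fold >= fold_threshold) | (fold <= 1.0 / fold_threshold)

    t_stat, p_val = stats.ttest_ind(test, control, axis=1)
    n_genes = test.shape[0]
    by_ttest = p_val < alpha / n_genes   # Bonferroni adjustment for multiple genes
    return by_fold, by_ttest

# Hypothetical data: 3 genes, 4 test and 4 control samples each.
test = [[10, 12, 11, 13], [5, 6, 5, 6], [2, 2, 3, 2]]
control = [[5, 5, 6, 5], [5, 6, 5, 6], [8, 9, 8, 9]]
print(differential_genes(test, control))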
Coordinated Gene Expression

Identifying genes that are co-expressed across multiple conditions is an issue with significant implications in microarray data mining. For example, given gene expression profiles measured over time, we are interested in knowing what genes are functionally related. The answer to this question also leads us to deduce the functions of unknown genes from their correlation with genes of known functions. Equally important is the problem of organizing samples based on their gene expression profiles so that distinct phenotypes or disease processes may be recognized or discovered.

The solutions to both problems are based on so-called cluster analysis, which is meant to group objects into clusters according to their similarity. For example, genes are clustered by their expression values across multiple conditions; samples are clustered by their expression values across genes. The issue is the question of how to measure the similarity between objects. Two popular measures are the Euclidean distance and Pearson's correlation coefficient. Clustering algorithms can be divided into hierarchical and nonhierarchical (partitional). Hierarchical clustering is either agglomerative (starting with singletons and progressively merging) or divisive (starting with a single cluster and progressively breaking).
Hierarchical agglomerative clustering is most commonly used in the cluster analysis of microarray data. In this method, two most similar clusters are merged at each stage until all the objects are included in a single cluster. The result is a dendrogram (a hierarchical tree) that encodes the relationships among objects by showing how clusters merge at each stage. Partitional clustering algorithms are best exemplified by k-means and self-organization maps (SOMs).
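The following sketch (Python with SciPy; the small expression matrix is invented for illustration) shows hierarchical agglomerative clustering of genes with a correlation-based dissimilarity; the linkage result encodes the dendrogram and can be cut into flat clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical expression matrix: rows = genes, columns = conditions.
expr = np.array([[1.0, 2.0, 3.0, 4.0],
                 [1.1, 2.1, 2.9, 4.2],
                 [4.0, 3.0, 2.0, 1.0],
                 [3.9, 3.1, 1.8, 1.2]])

# 1 - Pearson correlation as the pairwise dissimilarity between genes.
dist = pdist(expr, metric="correlation")

# Agglomerative clustering with average linkage; the result encodes the dendrogram.
tree = linkage(dist, method="average")

# Cut the tree into two flat clusters (candidate co-expressed groups).
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)   # e.g., [1, 1, 2, 2]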
Gene Selection for Discriminant Analysis

Taking an action based on the category of the pattern recognized in microarray gene expression data is an increasingly important approach to medical diagnosis and management (Furey, Cristianini, & Duffy, 2000; Golub, Slonim, & Tamayo, 1999; Khan, Wei, & Ringner, 2001). A class predictor derived on this basis can automatically discover the distinction between different classes of samples, independent of previous biological knowledge (Golub et al., 1999). Gene expression information appears to be a more reliable indicator than phenotypic information for categorizing the underlying causes of diseases. The microarray approach has offered hope for clinicians to arrive at more objective and accurate cancer diagnoses and hence choose more appropriate forms of treatment (Tibshirani, Hastie, Narasimhan, & Chu, 2002).

The central question is how to construct a reliable classifier that predicts the class of a sample on the basis of its gene expression profile. This is a pattern recognition problem, and the type of analysis involved is referred to as discriminant analysis. In practice, given a limited number of samples, correct discriminant analysis must rely on the use of an effective gene selection technique to reduce the gene number and, hence, the data dimensionality. The objective of gene selection is to select genes that most contribute to classification as well as provide biological insight. Approaches to gene selection range from statistical analysis (Golub et al., 1999) and a Bayesian model (Lee, Sha, Dougherty, Vannucci, & Mallick, 2003) to Fisher's linear discriminant analysis (Xiong, Li, Zhao, Jin, & Boerwinkle, 2001) and support vector machines (SVMs) (Guyon, Weston, Barnhill, & Vapnik, 2002). This is one of the most challenging areas in microarray data mining. Despite good progress, the reliability of selected genes should be further improved. Table 1 summarizes some of the most important microarray data-mining problems and their solutions.
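As one hedged illustration of the gene-selection-plus-classification pipeline summarized under Problem 3 in Table 1 below, the following sketch ranks genes with a univariate F-test and trains a linear support vector machine on the selected subset; it assumes scikit-learn, and the simulated data and parameter choices are arbitrary (any of the selection approaches cited above could be substituted).

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Hypothetical expression matrix: rows = samples, columns = genes; y = class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 1.5          # make the first five "genes" informative

# Select the 20 genes with the largest F-statistics, then fit a linear SVM.
# Doing the selection inside the pipeline keeps it within each cross-validation fold.
classifier = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="linear"))
scores = cross_val_score(classifier, X, y, cv=5)
print(scores.mean())          # cross-validated accuracy of the class predictor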
Microarray Data-Mining Applications

Microarray technology permits a large-scale analysis of gene functions in a genomic perspective and has brought about important changes in how we conduct basic research and practice clinical medicine. There have existed an increasing number of applications with this technology. Here, the role of data mining in discovering biological and clinical knowledge from microarray data is examined.

Consider that only the minority of all the yeast (Saccharomyces cerevisiae) open reading frames in the genome sequence could be functionally annotated on the

Table 1. Three common computational problems in microarray data mining

Problem 1: To identify differentially expressed genes, given microarray gene expression


data collected in two conditions, types, or states.

[Solutions:]
Fold change
t-test or Wilcoxon rank sum test (with Bonferroni's correction)
Significance analysis of microarrays

Problem 2: To identify genes expressed in a coordinated manner, given microarray gene


expression data collected across a set of conditions or time points.

[Solutions:]
Hierarchical clustering
Self-organization
k-means clustering

Problem 3: To select genes for discriminant analysis, given microarray gene expression data
of two or more classes.

[Solutions:]
Neighborhood analysis
Support vector machines
Principal component analysis
Bayesian analysis
Fisher's linear discriminant analysis

basis of sequence information alone (Zweiger, 1999), although microarray results showed that nearly 90% of all yeast mRNAs (messenger RNAs) are observed to be present (Wodicka, Dong, Mittmann, Ho, & Lockhart, 1997). Functional annotation of a newly discovered gene based on sequence comparison with other known gene sequences is sometimes misleading. Microarray-based genome-wide gene expression analysis has made it possible to deduce the functions of novel or poorly characterized genes from co-expression with already known genes (Eisen, Spellman, Brown, & Botstein, 1998). The microarray technology is a valuable tool for measuring whole-genome mRNA and enables system-level exploration of transcriptional regulatory networks (Cho, Campbell, & Winzeler, 1998; DeRisi, Iyer, & Brown, 1997; Laub, McAdams, Feldblyum, Fraser, & Shapiro, 2000; Tavazoie, Hughes, Campbell, Cho, & Church, 1999). Hierarchical clustering can help us recognize genes whose cis-regulatory elements are bound by the same proteins (transcription factors) in vivo. Such a set of coregulated genes is known as a regulon. Statistical characterization of known regulons is used to derive criteria for inferring new regulatory elements. To identify regulatory elements and associated transcription factors is fundamental to building a global gene regulatory network essential for understanding the genetic control and biology in living cells. Thus, determining gene functions and gene networks from microarray data is an important application of data mining.

The limitation of the morphology-based approach to cancer classification has led to molecular classification. Techniques such as immunohistochemistry and RT-PCR are used to detect cancer-specific molecular markers, but pathognomonic molecular markers are unfortunately unavailable for most solid tumors (Ramaswamy, Tamayo, & Rifkin, 2001). Furthermore, molecular markers do not guarantee a definitive diagnosis, owing to possible failure of detection or presence of marker variants. The approach of constructing a classifier based on gene expression profiles has gained increasing interest, following the success in demonstrating that microarray data differentiated between two types of leukemia (Golub et al., 1999). In this application, the two data-mining problems are to identify gene expression patterns or signatures associated with each type of leukemia and to discover subtypes within each. The first problem is dealt with by gene selection, and the second one by cluster analysis. Table 2 illustrates some applications of microarray data mining.

FUTURE TRENDS

The future challenge is to realize biological networks that provide qualitative and quantitative understanding of molecular logic and dynamics. To meet this challenge, recent research has begun to focus on leveraging prior biological knowledge and integration with biological analysis in quest of biological truth. In addition, there is increasing interest in applying statistical bootstrapping and data permutation techniques to mining microarray data for appraising the reliability of learned patterns.

CONCLUSION

Microarray technology has rapidly emerged as a powerful tool for biological research and clinical investigation. However, the large quantity and complex nature of data produced in microarray experiments often plague researchers who are interested in using this technology. Microarray data mining uses specific data processing and normalization strategies and has its own objectives,

Table 2. Examples of microarray data mining applications


Classical Work:

v Identified functionally related genes and their genetic control upon metabolic shift from fermentation to respiration
(DeRisi et al., 1997).
v Explored co-expressed or coregulated gene families by cluster analysis (Eisen et al., 1998).
v Determined genetic network architecture based on coordinated gene expression analysis and promoter motif analysis
(Tavazoie et al., 1999).
v Differentiated acute myeloid leukemia from acute lymphoblastic leukemia by selecting genes and constructing a
classifier for discriminant analysis (Golub et al., 1999).
v Selected genes differentially expressed in response to ionizing radiation based on significance analysis (Tusher et al.,
2001).

Recent Work:

v Analyzed gene expression in the Arabidopsis genome (Yamada, Lim, & Dale, 2003).
v Discovered conserved genetic modules (Stuart, Segal, Koller, & Kim, 2003).
v Elucidated functional properties of genetic networks and identified regulatory genes and their target genes (Gardner, di
Bernardo, Lorenz, & Collins, 2003).
v Identified genes associated with Alzheimer's disease (Roy Walker, Smith, & Liu, 2004).

requiring effective computational algorithms and statistical techniques to arrive at valid results. The microarray technology has been perceived as a revolutionary technology in biomedicine, but the hardware device does not pay off unless backed up with sound data-mining software.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation under Grant IIS-0221954.

REFERENCES

Brazma, A., Hingamp, P., & Quackenbush, J. (2001). Minimum information about a microarray experiment (MIAME): Toward standards for microarray data. Nat Genet, 29(4), 365-371.

Cho, R. J., Campbell, M. J., & Winzeler, E. A. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 2(1), 65-73.

Comander, J., Weber, G. M., Gimbrone, M. A., Jr., & Garcia-Cardena (2001). Argus: A new database system for Web-based analysis of multiple microarray data sets. Genome Res, 11(9), 1603-1610.

DeRisi, J. L., Iyer, V. R., & Brown, P. O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278(5338), 680-686.

Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Acad Sci, USA, 95(25), 14863-14868.

Ekins, R., & Chu, F. W. (1999). Microarrays: Their origins and applications. Trends Biotechnol, 17(6), 217-218.

Ermolaeva, O., Rastogi, M., & Pruitt, K. D. (1998). Data management and analysis for gene expression arrays. Nat Genet, 20(1), 19-23.

Furey, T. S., Cristianini, N., & Duffy, N. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), 906-914.

Gardner, T. S., di Bernardo, D., Lorenz, D., & Collins (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301(5629), 102-105.

Golub, T. R., Slonim, D. K., & Tamayo, P. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.

Guyon, I., Weston, J., Barnhill, S., & Vapnik (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1/3), 389-422.

Khan, J., Wei, J. S., & Ringner, M. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, 7(6), 673-679.

Laub, M. T., McAdams, H. H., Feldblyum, T., Fraser & Shapiro (2000). Global analysis of the genetic network controlling a bacterial cell cycle. Science, 290(5499), 2144-2148.

Lee, K. E., Sha, N., Dougherty, E. R., Vannucci & Mallick (2003). Gene selection: A Bayesian variable selection approach. Bioinformatics, 19(1), 90-97.

Ramaswamy, S., Tamayo, P., & Rifkin, R. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Acad Sci, USA, 98(26), 15149-15154.

Roy Walker, P., Smith, B., & Liu, Q. Y. (2004). Data mining of gene expression changes in Alzheimer brain. Artif Intell Med, 31(2), 137-154.

Schadt, E. E., Li, C., Ellis, B., & Wong (2001). Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. Journal of Cell Biochemistry, (Suppl. 37), 120-125.

Schena, M., Heller, R. A., & Theriault, T. P. (1998). Microarrays: Biotechnology's discovery platform for functional genomics. Trends Biotechnol, 16(7), 301-306.

Stuart, J. M., Segal, E., Koller, D., & Kim (2003). A gene-coexpression network for global discovery of conserved genetic modules. Science, 302(5643), 249-255.

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho & Church (1999). Systematic determination of genetic network architecture. Nat Genet, 22(3), 281-285.

Tibshirani, R., Hastie, T., Narasimhan, B., & Chu (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Acad Sci, USA, 99(10), 6567-6572.

Tusher, V. G., Tibshirani, R., & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Acad Sci, USA, 98(9), 5116-5121.

Wodicka, L., Dong, H., Mittmann, M., Ho & Lockhart (1997). Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat Biotechnol, 15(13), 1359-1367.

Xiong, M., Li, W., Zhao, J., Jin & Boerwinkle (2001). Feature (gene) selection in gene expression-based tumor classification. Mol Genet Metab, 73(3), 239-247.

Yamada, K., Lim, J., & Dale, J. M. (2003). Empirical analysis of transcriptional activity in the Arabidopsis genome. Science, 302(5646), 842-846.

Yang, Y. H., Dudoit, S., & Luu, P. (2002). Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res, 30(4), e15.

Zweiger, G. (1999). Knowledge discovery in gene-expression-microarray data: Mining the information output of the genome. Trends Biotechnol, 17(11), 429-436.

KEY TERMS

Bioinformatics: All aspects of information processing on biological data, in particular genomic data. The rise of bioinformatics is driven by the genomic projects.

Cis-Regulatory Element: The genetic region that affects the activity of a gene on the same DNA molecule.

Clustering: The process of grouping objects according to their similarity. This is an important approach to microarray data mining.

Functional Genomics: The study of gene functions on a genomic scale, especially based on microarrays.

Gene Expression: Production of mRNA from DNA (a process known as transcription) and production of protein from mRNA (a process known as translation). Microarrays are used to measure the level of gene expression in a tissue or cell.

Genomic Medicine: Integration of genomic and clinical data for medical decision.

Microarray: A chip on which numerous probes are placed for hybridization with a tissue sample to analyze its gene expression.

Postgenome Era: The time after the complete human genome sequence is decoded.

Transcription Factor: A protein that binds to the cis-element of a gene and affects its expression.

Microarray Databases for Biotechnology


Richard S. Segall
Arkansas State University, USA

INTRODUCTION

Microarray informatics is a rapidly expanding discipline in which large amounts of multi-dimensional data are compressed into small storage units. Data mining of microarrays can be performed using techniques such as drill-down analysis rather than classical data analysis on a record-by-record basis. Both data and metadata can be captured in microarray experiments. The latter may be constructed by obtaining data samples from an experiment. Extractions can be made from these samples and formed into homogeneous arrays that are needed for higher level analysis and mining.

Biologists and geneticists find microarray analysis as both a practical and appropriate method of storing images, together with pixel or spot intensities and identifiers, and other information about the experiment.

BACKGROUND

A Microarray has been defined by Schena (2003) as an ordered array of microscopic elements in a planar substrate that allows the specific binding of genes or gene products. Schena (2003) claims microarray databases as a widely recognized next revolution in molecular biology that enables scientists to analyze genes, proteins, and other biological molecules on a genomic scale.

According to an article (2004) on the National Center for Biotechnology Information (NCBI) Web site, because microarrays can be used to examine the expression of hundreds or thousands of genes at once, it promises to revolutionize the way scientists examine gene expression, and this technology is still considered to be in its infancy.

The following Figure 1 is from a presentation by Kennedy (2003) of CSIRO (Commonwealth Scientific & Industrial Research Organisation) in Australia as available on the Web, and illustrates an overview of the microarray process starting with sequence data of individual clones that can be organized into libraries. Individual samples are taken from the library as spots and arranged by robots onto slides that are then scanned by lasers. The image scanned by lasers is then quantified according to the color generated by each individual spot that are then organized into a results set as a text file that can then be subjected to analyses such as data mining.

Figure 1. Overview of the microarray process (Kennedy, 2003)
[The figure shows the GENAdb (Genomics Array Database) view of the microarray process, flowing from plant sequences and libraries through samples, slides, and scans to result sets and analyses.]

Jagannathan (2002) of the Swiss Institute of Bioinformatics (SIB) described databases for microarrays including their construction from microarray experiments, such as gathering data from cells subjected to more than one condition. The latter are hybridized to a microarray that is stored after the experiment by methods such as scanned images. Hence data is to be stored both before and after the experiments, and the software used must be capable of dealing with large volumes of both numeric and image data. Jagannathan (2002) also discussed some of the most promising existing non-commercial microarray databases of ArrayExpress, which is a public microarray gene expression repository, the Gene Express Omnibus (GEO), which is a gene expression database hosted at the National Library of Medicine, and GeneX, which is an open source database and integrated tool set released by the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico.

Grant (2001) wrote an entire thesis on microarray databases describing the scene for its application to genetics and the human genome and its sequence of the three billion-letter sequences of genes. Kim (2002) presented improved analytical methods for microarray-based genome composition analysis by selecting a signal value that is used as a cutoff to discriminate present
and divergent genes. Do et al. (2003) provided comparative evaluation of microarray-based gene expression databases by analyzing the requirements for microarray data management, and Sherlock (2003) discussed storage and retrieval of microarray data for molecular biology. Kemmeren (2001) described a bioinformatics pipeline for supporting microarray analysis with an example of production and analysis of DNA (Deoxyribonucleic Acid) microarrays that require informatics support. Gonclaves & Marks (2002) discussed roles and requirements for a research microarray database.

An XML description language called MAML (Microarray Annotation Markup Language) has been developed to allow communication with other databases worldwide (Cover Pages, 2002). Liu (2004) discusses microarray databases and MIAME (Minimal Information about a Microarray Experiment), which defines what information at least should be stored. For example, the MIAME for array design would be the definite structure and definition of each array used and their elements. The Microarray Gene Expression Database Group (MGED) composed and developed the recommendations for microarray data annotations for both MIAME and MAML in 2000 and 2001, respectively, in Cambridge, United Kingdom.

Jonassen (2002) presents a microarray informatics resource Web page that includes surveys and introductory papers on informatics aspects, and database and software links. Another resourceful Web site is that from the Lawrence Livermore National Labs (2003) entitled Microarray Links, which provides an extensive list of active Web links for the categories of databases, microarray labs, and software and tools including data mining tools.

University-wide database systems have been established, such as at Yale as the Yale Microarray Database (YMD) to support large-scale integrated analysis of large amounts of gene expression data produced by a wide variety of microarray experiments for different organisms as described by Cheung (2004), and similarly at Stanford with the Stanford Microarray Database (SMD) as described by both Sherlock (2001) and Selis (2003). Microarray image analysis is currently included in university curricula, such as in Rouchka (2003) Introduction to Bioinformatics graduate course at the University of Louisville.

In relation to the State of Arkansas, the medical school is situated in Little Rock and is known as the University of Arkansas for Medical Sciences (UAMS). A Bioinformatics Center is housed within UAMS that is involved with the management of microarray data. The software utilized at UAMS for microarray analysis includes BASE (BioArray Software Environment) and AMAD, which is a Web-driven database system written entirely in PERL and JavaScript (UAMS, Bioinformatics Center, 2004).

MAIN THRUST

The purpose of this article is to help clarify the meaning of microarray informatics. The latter is addressed by summarizing some illustrations of applications of data mining to microarray databases specifically for biotechnology.

First, it needs to be stated which data mining tools are useful in data mining of microarrays. SAS Enterprise Miner, which was used in Segall et al. (2003, 2004a, 2004b) as discussed below, contains the major data mining tools of decision trees, regression, neural networks, and clustering, and also other data mining tools such as association rules, variable selection, and link analysis. All of these are useful data mining tools for microarray databases regardless of whether SAS Enterprise Miner is used or not. In fact, an entire text has been written by Draghici (2003) on data analysis tools for DNA microarrays that includes these data mining tools as well as numerous other tools such as analysis of functional categories and statistical procedures for corrections for multiple comparisons.

Scientific and Statistical Data Mining and Visual Data Mining for Genomes

Data mining of microarray databases has been discussed by Deyholos (2002) for bioinformatics by methods that include correlation of patterns and identifying the significance analysis of microarrays (SAM) for genes within DNA. Visual data mining was utilized to distinguish the intensity of data filtering and the effect of normalization of the data using regression plots.

Tong (2002) discusses supporting microarray studies for toxicogenomic databases through data integration with public data and applying visual data mining such as a ScatterPlot viewer.

Chen et al. (2003) presented a statistical approach using a Gene Expression Analysis Refining System (GEARS).

Piatetsky-Shapiro and Tamayo (2003) discussed the main types of challenges for microarray data mining as including gene selection, classification, and clustering. According to Piatetsky-Shapiro and Tamayo (2003), one of the important challenges for data mining of microarrays is that the difficulty of collecting microarray samples causes the number of samples to remain small, and while the number of fields corresponding to the number of genes is typically in the thousands, this creates a high likelihood of finding false positives.

Piatetsky-Shapiro and Tamayo (2003) identify areas in which microarrays and data mining tools can be improved that include better accuracy, more robust models and estimators, as well as more appropriate biological interpretation of the computational or statistical results for those microarrays constructed from biomedical or DNA data.

Piatetsky-Shapiro and Tamayo (2003) summarize the areas in which microarray and microarray data mining tools can be improved by stating:

Typically a computational researcher will apply his or her favorite algorithm to some microarray dataset and quickly obtain a voluminous set of results. These results are likely to be useful but only if they can be put in context and followed up with more detailed studies, for example by a biologist or a clinical researcher. Often this follow up and interpretation is not done carefully enough because of the additional significant research involvement, the lack of domain expertise or proper collaborators, or due to the limitations of the computational analysis itself.

Draghici (2003) discussed in depth other challenges in using microarrays specifically for gene expression studies, such as being very noisy or prone to error after the scanning and image processing steps, consensus as to how to perform normalization, and the fact that microarrays are not necessarily able to substitute completely other biological factors or tools in the realm of the molecular biologist.

Mamitsuka et al. (2003) mined biologically active patterns in metabolic pathways using microarray expression profiles. Mamitsuka (2003) utilized microarray data sets of gene expressions on yeast proteins. Curran et al. (2003) performed statistical methods for joint data mining of gene expressions and DNA sequence databases. The statistical methods used include a linear mixed effect model, cluster analysis, and logistic regression.

Zaki et al. (2003) reported on an overview of the papers on data mining in bioinformatics as presented at the International Conference on Knowledge Discovery and Data Mining held in Washington, DC in August 2003. Some of the novel data mining techniques discussed in papers at this conference included gene expression analysis, protein/RNA (ribonucleic acid) structure prediction, and gene finding.

Scientific and Statistical Data Mining and Visual Data Mining for Plants

Segall et al. (2003, 2004a, 2004b) performed data mining for assessing the impact of environmental stresses on plant genomics and specifically for plant data from the Osmotic Stress Microarray Information Database (OSMID). The latter databases are considered to be representative of those that could be used for biotech applications such as the manufacture of plant-made pharmaceuticals (PMP) and genetically modified (GM) foods.

The Osmotic Stress Microarray Information Database (OSMID) database that was used in the data mining in Segall et al. (2003, 2004a, 2004b) contains the results of approximately 100 microarray experiments performed at the University of Arizona as part of a National Science Foundation (NSF) funded project named The Functional Genomics of Plant Stress, whose data constitutes a data warehouse.

The OSMID microarray database is available for public access on the Web hosted by Universite Montpellier II (2003) in France, and the OSMID contains information about the more than 20,000 ESTs (Experimental Stress Tolerances) that were used to produce these arrays. These 20,000 ESTs could be considered as components of a data warehouse of plant microarray databases that was subjected to data mining in Segall et al. (2003, 2004a, 2004b). The data mining was performed using SAS Enterprise Miner and its cluster analysis module, which yielded both scientific and statistical data mining as well as visual data mining. The conclusions of Segall et al. (2003, 2004a, 2004b) included the facts about the twenty-five different variations or levels of the environmental factor of salinity on the corn plant, as also evidenced by the visualization of the clusters formed as a result of the data mining.
data sets of gene expressions on yeast proteins.
Curran et al. (2003) performed statistical methods Other Useful Sources of Tools and
for joint data mining of gene expressions and DNA Projects for Microarray Informatics
sequence databases. The statistical methods used in-
clude linear mixed effect model, cluster analysis, and A bibliography on microarray data analysis cre-
logistic regression. ated as available on the Web by Li (2004) that
Zaki et al. (2003) reported on an overview of the includes book and reprints for the last ten years.
papers on data mining in bioinformatics as presented at The Rosalind Franklin Centre for Genomics Re-
the International Conference on Knowledge Discovery search (RFCGR) of the Medical Research Council
and Data Mining held in Washington, DC in August 2003. (MRC) (2004) in the UK provides a Web site with
Some of the novel data mining techniques discussed in links for data mining tools and descriptions of
papers at this conference included gene expression analy- their specific applications to gene expressions and
sis, protein/RNA (ribonucleic acid) structure predic- microarray databases for genomics and genetics.
tion, and gene finding. Reviews of data mining software as applied to
genetic microarray databases are included in an

Web links for the statistical analysis of microarray data are provided by van Helden (2004).

Reid (2004) provides Web links of software tools for microarray data analysis, including image analysis.

The Bio-IT World Journal Web site has a Microarray Resource Center that includes a link to extensive resources for microarray informatics at the European Bioinformatics Institute (EBI).

FUTURE TRENDS

The wealth of resources available on the Web for microarray informatics only supports the premise that microarray informatics is a rapidly expanding field. This growth is in both software and methods of analysis that include techniques of data mining.

Future research opportunities in microarray informatics include the biotech applications for the manufacture of plant-made pharmaceuticals (PMP) and genetically modified (GM) foods.

CONCLUSION

Because data within genome databases is composed of micro-level components such as DNA, microarray databases are a critical tool for analysis in biotechnology. Data mining of microarray databases opens up this field of microarray informatics as a multi-faceted tool for knowledge discovery.

ACKNOWLEDGMENT

The author wishes to acknowledge the funding provided by a block grant from the Arkansas Biosciences Institute (ABI) as administered by Arkansas State University (ASU) to encourage development of a focus area in Biosciences Institute Social and Economic and Regulatory Studies (BISERS), for which he served as Co-Investigator (Co-I) in 2003, and with which funding the analyses of the Osmotic Stress Microarray Information Database (OSMID) discussed within this article were performed.

The author also wishes to acknowledge a three-year software grant from SAS Incorporated to the College of Business at Arkansas State University for SAS Enterprise Miner that was used in the data mining of the OSMID microarrays discussed within this article.

Finally, the author also wishes to acknowledge the useful reviews of the three anonymous referees of the earlier version of this article, without whose constructive comments the final form of this article would not have been possible.

REFERENCES

Bio-IT World Inc. (2004). Microarray resources and articles. Retrieved from http://www.bio-itworld.com/resources/microarray/

Chen, C.H. et al. (2003). Gene expression analysis refining system (GEARS) via statistical approach: A preliminary report. Genome Informatics, 14, 316-317.

Cheung, K.H. et al. (2004). Yale Microarray Database System. Retrieved from http://crcjs.med.utah.edu/bioinfo/abstracts/Cheung,%20Kei.doc

Curran, M.D., Liu, H., Long, F., & Ge, N. (2003, December). Machine learning in low-level microarray analysis. SIGKDD Explorations, 5(2), 122-129.

Deyholos, M. (2002). An introduction to exploring genomes and mining microarrays. In O'Reilly Bioinformatics Technology Conference, January 28-31, 2002, Tucson, AZ. Retrieved from http://conferences.oreillynet.com/cs/bio2002/view/e_sess/1962

Do, H., Toralf, K., & Rahm, E. (2003). Comparative evaluation of microarray-based gene expression databases. Retrieved from http://www.btw2003.de/proceedings/paper/96.pdf

Draghici, S. (2003). Data analysis tools for DNA microarrays. Boca Raton, FL: Chapman & Hall/CRC.

Goncalves, J., & Marks, W.L. (2002). Roles and requirements for a research microarray database. IEEE Engineering in Medicine and Biology Magazine, 21(6), 154-157.

Grant, E. (2001, September). A microarray database. Thesis for Masters of Science in Information Technology, The University of Glasgow.

Jagannathan, V. (2002). Databases for microarrays. Presentation at Swiss Institute of Bioinformatics (SIB), University of Lausanne, Switzerland. Retrieved from http://www.ch.embnet.org/CoursEMBnet/CHIP02/ppt/Vidhya.ppt

Jonassen, I. (2002). Microarray informatics resource page. Retrieved from http://www.ii.uib.no/~inge/micro

Kemmeren, P.C., & Holstege, F.C. (2001). A bioinformatics pipeline for supporting microarray analysis. Retrieved from http://www.genomics.med.uu.nl/presentations/Bioinformatics-2001-Patrick2.ppt

Kennedy, G. (2003). GENAdb: Genomics Array Database. CSIRO (Commonwealth Scientific & Industrial Research Organisation) Plant Industry, Australia. Retrieved from http://www.pi.csiro.au/gena/repository/GENAdb.ppt

Kim, C.K., Joyce, E.A., Chan, K., & Falkow, S. (2002). Improved analytical methods for microarray-based genome composition analysis. Genome Biology, 3(11). Retrieved from http://genomebiology.com/2002/3/11/research/0065

Lawrence Livermore National Labs. (2003). Microarray links. Retrieved from http://microarray.llnl.gov/links.html

Leung, Y.F. (2002). Microarray software review. In D. Berrar, W. Dubitzky & M. Granzow (Eds.), A practical approach to microarray data analysis (pp. 326-344). Boston: Kluwer Academic Publishers.

Li, W. (2004). Bibliography on microarray data analysis. Retrieved from http://www.nslij-genetics.org/microarray/2004.html

Liu, Y. (2004). Microarray databases and MIAME (Minimum Information About a Microarray Experiment). Retrieved from http://titan.biotec.uiuc.edu/cs491jh/slides/cs491jh-Yong.ppt

Mamitsuka, H., Okuno, Y., & Yamaguchi, A. (2003, December). Mining biologically active patterns in metabolic pathways using microarray expression profiles. SIGKDD Explorations, 5(2), 113-121.

Medical Research Council. (2004). Genome Web: Gene expression and microarrays. Retrieved from http://www.rfcgr.mrc.ac.uk/GenomeWeb/nuc-genexp.html

Microarray Markup Language (MAML). (2002, February 8). Cover Pages Technology Reports. Retrieved from http://xml.coverpages.org/maml.html

National Center for Biotechnology Information (NCBI). (2004, March 30). Microarrays: Chipping away at the mysteries of science and medicine. National Library of Medicine (NLM), National Institutes of Health (NIH). Retrieved from http://www.ncbi.nlm.nih.gov/About/primer/microarrys.html

Piatetsky-Shapiro, G., & Tamayo, P. (2003, December). Microarray data mining: Facing the challenges. SIGKDD Explorations, 5(2), 1-5.

Reid, J.F. (2004). Software tools for microarray data analysis. Retrieved from http://www.ifom-firc.it/MICROARRAY/data_analysis.htm

Rouchka, E. (2003). CECS 694 Introduction to Bioinformatics. Lecture 12: Microarray Image Analysis. University of Louisville. Retrieved from http://kbrin.a-bldg.louisville.edu/~rouchka/CECS694_2003/Week12.html

Schena, M. (2003). Microarray analysis. New York: John Wiley & Sons.

Segall, R.S., Guha, G.S., & Nonis, S. (2003). Data mining for analyzing the impact of environmental stress on plants: A case study using OSMID. Manuscript in preparation for journal submission.

Segall, R.S., Guha, G.S., & Nonis, S. (2004b, May). Data mining for assessing the impact of environmental stresses on plant geonomics. In Proceedings of the Thirty-Fifth Meeting of the Southwest Decision Sciences Institute (pp. 23-31). Orlando, FL.

Segall, R.S., & Nonis, S. (2004a, February). Data mining for analyzing the impact of environmental stress on plants: A case study using OSMID. Accepted for publication in Acxiom Working Paper Series of Acxiom Laboratory of Applied Research (ALAR) and presented at Acxiom Conference on Applied Research and Information Technology, University of Arkansas at Little Rock (UALR).

Selis, S. (2003, February 15). Stanford researcher advocates far-reaching microarray data exchange. News release of Stanford University School of Medicine. Retrieved from http://www.stanfordhospital.com/newsEvents/mewsReleases/2003/02/aaasSherlock.html

Sherlock, G. et al. (2001). The Stanford microarray database. Nucleic Acids Research, 29(1), 152-155.

Sherlock, G., & Ball, C.A. (2003). Microarray databases: Storage and retrieval of microarray data. In M.J. Brownstein & A. Khodursky (Eds.), Functional genomics: Methods and protocols (pp. 235-248). Methods in Molecular Biology Series (Vol. 224). Totowa, NJ: Humana Press.

Tong, W. (2002, December). ArrayTrack: Supporting microarray studies through data integration. U.S. Food and Drug Administration (FDA)/National Center for Toxicological Research (NCTR) Toxioinformatics Workshop: Toxicogenomics Database, Study Design and Data Analysis.

Universite Montpellier II. (2003). The Virtual Library of Plant-Array: Databases. Retrieved from http://www.univ-montp2.fr/~plant_arrays/databases.html

University of Arkansas for Medical Sciences (UAMS) Bioinformatics Center. (2004). Retrieved from http://bioinformatics.uams.edu/microarray/database.html

Van Helden, J. (2004). Statistical analysis of microarray data: Links. Retrieved from http://www.scmbb.ulb.ac.be/~jvanheld/web_course_microarrays/links.html

Zaki, M.J., Wang, H.T., & Toivonen, H.T. (2003, December). Data mining in bioinformatics. SIGKDD Explorations, 5(2), 198-199.

KEY TERMS

Data Warehouses: A huge collection of consistent data that is both subject-oriented and time variant, and used in support of decision making.

Genomic Databases: Organized collection of data pertaining to the genetic material of an organism.

Metadata: Data about data; for example, data that describes the properties or characteristics of other data.

MIAME (Minimal Information about a Microarray Experiment): Defines what information at least should be stored.

Microarray Databases: Store large amounts of complex data as generated by microarray experiments (e.g., DNA).

Microarray Informatics: The study of the use of microarray databases to obtain information about experimental data.

Microarray Markup Language (MAML): An XML (Extensible Markup Language)-based format for communicating information about data from microarray experiments.

Scientific and Statistical Data Mining: The use of data and image analyses to investigate knowledge discovery of patterns in the data.

Visual Data Mining: The use of computer-generated graphics in both 2-D and 3-D for use in knowledge discovery of patterns in data.

Mine Rule
Rosa Meo
Università degli Studi di Torino, Italy

Giuseppe Psaila
Università degli Studi di Bergamo, Italy

INTRODUCTION

Mining of association rules is one of the most adopted techniques for data mining in the most widespread application domains. A great deal of work has been carried out in the last years on the development of efficient algorithms for association rules extraction. Indeed, this problem is a computationally difficult task, known as NP-hard (Calders, 2004), which has been augmented by the fact that normally association rules are being extracted from very large databases. Moreover, in order to increase the relevance and interestingness of obtained results and to reduce the volume of the overall result, constraints on association rules are introduced and must be evaluated (Ng et al., 1998; Srikant et al., 1997).

However, in this contribution, we do not focus on the problem of developing efficient algorithms but on the semantic problem behind the extraction of association rules (see Tsur et al. [1998] for an interesting generalization of this problem).

We want to put in evidence the semantic dimensions that characterize the extraction of association rules; that is, we describe in a more general way the classes of problems that association rules solve. In order to accomplish this, we adopt a general-purpose query language designed for the extraction of association rules from relational databases. The operator of this language, MINE RULE, allows the expression of constraints, constituted by standard SQL predicates, that make it suitable to be employed with success in many diverse application problems. For a comparison between this query language and other state-of-the-art languages for data mining, see Imielinski, et al. (1996); Han, et al. (1996); Netz, et al. (2001); Botta, et al. (2004).

In Imielinski, et al. (1996), a new approach to data mining is proposed, which is constituted by a new generation of databases called Inductive Databases (IDBs). With an IDB, the user/analyst can use advanced query languages for data mining in order to interact with the knowledge discovery (KDD) system, extract data mining descriptive and predictive patterns from the database, and store them in the database. Boulicaut, et al. (1998) and Baralis, et al. (1999) discuss the usage of MINE RULE in this context.

We want to show that, thanks to a highly expressive query language, it is possible to exploit all the semantic possibilities of association rules and to solve very different problems with a unique language, whose statements are instantiated along the different semantic dimensions of the same application domain. We discuss examples of statements solving problems in different application domains that nowadays are of great importance. The first application is the analysis of retail data, whose aim is market basket analysis (Agrawal et al., 1993) and the discovery of user profiles for customer relationship management (CRM). The second application is the analysis of data registered in a Web server on the accesses to Web sites by users. Cooley, et al. (2000) present a study on the same application domain. The last domain is the analysis of genomic databases containing data on micro-array experiments (Fayyad, 2003). We show many practical examples of MINE RULE statements and discuss the application problems that can be solved by analyzing the association rules that result from those statements.

BACKGROUND

An association rule has the form B → H, where B and H are sets of items, respectively called body (the antecedent) and head (the consequent). An association rule (also denoted for short with rule) intuitively means that items in B and H often are associated within the observed data. Two numerical parameters denote the validity of the rule: support is the fraction of source data for which the rule holds; confidence is the conditional probability that H holds, provided that B holds. Two minimum thresholds for support and confidence are specified before rules are extracted, so that only significant rules are extracted.

This very general definition, however, is incomplete and very ambiguous. For example, what is the meaning of "fraction of source data for which the rule holds"? Or what are the items associated by a rule? If we do not answer these basic questions, an association rule does not have a precise meaning.

Consider, for instance, the original problem for which association rules were initially proposed in Agrawal, et al. (1993): the market baskets analysis. If we have a database collecting single purchase transactions (i.e., transactions performed by customers in a retail store), we might wish to extract association rules that associate items sold within the same transactions. Intuitively, we are defining the semantics of our problem: items are associated by a rule if they appear together in the same transaction. Support denotes the fraction of the total transactions that contain all the items in the rule (both B and H), while confidence denotes the conditional probability that, having found B in a transaction, H is also found in the same transaction. Thus a rule

{pants, shirt} → {socks, shoes}   support=0.02 confidence=0.23

means that the items pants, shirt, socks, and shoes appear together in 2% of the transactions, while, having found items pants and shirt in a transaction, the probability that the same transaction also contains socks and shoes is 23%.

Semantic Dimensions

MINE RULE puts in evidence the semantic dimensions that characterize the extraction of association rules from within relational databases and forces users (typically analysts) to understand these semantic dimensions. Indeed, extracted association rules describe the most recurrent values of certain attributes that occur in the data (in the previous example, the names of the purchased products). This is the first semantic dimension that characterizes the problem. These recurrent values are observed within sets of data grouped by some common features (i.e., the transaction identifier in the previous example but, in general, the date, the customer identifier, etc.). This constitutes the second semantic dimension of the association rule problem. Therefore, extracted association rules describe the observed values of the first dimension, which are recurrent in entities identified by the second dimension.

When values belonging to the first dimension are associated, it is possible that not every association is suitable, but only a subset of them should be selected, based on a coupling condition on attributes of the analyzed data (e.g., a temporal sequence between events described in B and H). This is the third semantic dimension of the problem; the coupling condition is called mining condition.

It is clear that MINE RULE is not tied to any particular application domain, since the semantic dimensions allow the discovery of significant and unexpected information in very different application domains.

The main features and clauses of MINE RULE are as follows (see Meo, et al. [1998] for a detailed description):

- Selection of the relevant set of data for a data mining process: this feature is specified by the FROM clause.
- Selection of the grouping features w.r.t. which data are observed: these features are expressed by the GROUP BY clause.
- Definition of the structure of rules and cardinality constraints on body and head, specified in the SELECT clause: elements in rules can be single values or tuples.
- Definition of coupling constraints: these are constraints applied at the rule level (the mining condition, instantiated by a WHERE clause associated to SELECT) for coupling values.
- Definition of rule evaluation measures and minimum thresholds: these are support and confidence (even if, theoretically, other statistical measures also would be possible). Support of a rule is computed on the total number of groups in which it occurs and satisfies the given constraints. Confidence is the ratio between the rule support and the support of the body satisfying the given constraints. Thresholds are specified by the clause EXTRACTING RULES WITH.

MAIN THRUST

In this section, we introduce MINE RULE in the context of the three application domains. We describe many examples of queries that can be conceived as a sort of template, because they are instantiated along the relevant dimensions of an application domain and solve some frequent, similar, and critical situations for users of different applications.

First Application: Retail Data Analysis

We consider a typical data warehouse gathering information on customers' purchases in a retail store:

FactTable (TransId, CustId, TimeId, ItemId, Num, Discount)
Customer (CustId, Profession, Age, Sex)

Rows in FactTable describe sales. The dimensions of data are the customer (CustId), the time (TimeId), and the purchased item (ItemId); each sale is characterized by the number of sold pieces (Num) and the discount (Discount); the transaction identifier (TransId) is reported, as well.
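To make the two evaluation measures concrete on this schema, the following plain-SQL sketch (an illustration only, not a MINE RULE statement) computes the support and confidence of one fixed rule, {pants, shirt} → {socks, shoes} from the earlier example, taking purchase transactions as the groups; the literal item names are hypothetical values of ItemId.

-- Illustrative only: support and confidence of {pants, shirt} => {socks, shoes},
-- with one group per purchase transaction (TransId).
SELECT SUM(hasBody * hasHead) * 1.0 / COUNT(*)                AS support,
       SUM(hasBody * hasHead) * 1.0 / NULLIF(SUM(hasBody), 0) AS confidence
FROM (SELECT TransId,
             CASE WHEN COUNT(DISTINCT CASE WHEN ItemId IN ('pants', 'shirt')
                                            THEN ItemId END) = 2
                  THEN 1 ELSE 0 END AS hasBody,
             CASE WHEN COUNT(DISTINCT CASE WHEN ItemId IN ('socks', 'shoes')
                                            THEN ItemId END) = 2
                  THEN 1 ELSE 0 END AS hasHead
      FROM FactTable
      GROUP BY TransId) AS t;

The MINE RULE statements below express the same kind of computation declaratively, for all rules at once and over whichever grouping attribute is chosen.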

We also report table Customer.

Example 1: We want to extract a set of association rules, named FrequentItemSets, that finds the associations between sets of items (first dimension of the problem) purchased together in a sufficient number of dates (second dimension), with no specific coupling condition (third dimension). These associations provide the business-relevant sets of items, because they are the most frequent in time. The MINE RULE statement is now reported.

MINE RULE FrequentItemSets AS
SELECT DISTINCT 1..n ItemId AS BODY, 1..n ItemId AS HEAD, SUPPORT, CONFIDENCE
FROM FactTable
GROUP BY TimeId
EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4

The first dimension of the problem is specified in the SELECT clause, which specifies the schema of each element in association rules, the cardinality of body and head (in terms of lower and upper bound), and the statistical measures for the evaluation of association rules (support and confidence); in the example, body and head are non-empty sets of items, and their upper bound is unlimited (denoted as 1..n).

The GROUP BY clause provides the second dimension of the problem: since attribute TimeId is specified, rules denote that associated items have been sold on the same date (intuitively, rows are grouped by values of TimeId, and rules associate values of attribute ItemId appearing in the same group).

Support of an association rule is computed in terms of the number of groups in which the elements of the rule co-occur; confidence is computed analogously. In this example, support is computed over the different instants of time, since grouping is made according to the time identifier. Support and confidence of rules must not be lower than the values in the EXTRACTING clause (respectively 0.2 and 0.4).

Example 2: Customer profiling is a key problem in CRM applications. Association rules allow one to obtain a description of customers (e.g., w.r.t. age and profession) in terms of frequently purchased products. To do that, values coming from two distinct dimensions of data must be associated.

MINE RULE CustomerProfiles AS
SELECT DISTINCT 1..1 Profession, Age AS BODY, 1..n Item AS HEAD, SUPPORT, CONFIDENCE
FROM FactTable JOIN Customer ON FactTable.CustId=Customer.CustId
GROUP BY CustId
EXTRACTING RULES WITH SUPPORT:0.6, CONFIDENCE:0.9

The observed entity is the customer (first dimension of data), described by a single pair in the body (cardinality constraint 1..1); the head associates products frequently purchased by customers (second dimension of data) with the profile reported in the body (see the SELECT clause). Thus a rule

{(employee, 35)} → {socks, shoes}   support=0.7 confidence=0.96

means that customers who are employees and 35 years old often (in 96% of cases) buy socks and shoes. Support tells about the absolute frequency of the profile in the customer base (GROUP BY clause). This solution can be generalized easily for any profiling problem.

Second Application: Web Log Analysis

Typically, Web servers store information concerning the accesses to Web sites in a standard log file. This is a relational table (WebLogTable) that typically contains at least the following attributes:

- RequestID: identifier of the request;
- IPcaller: IP address from which the request originated;
- Date: date of the request;
- TS: time stamp;
- Operation: kind of operation (for instance, get or put);
- PageUrl: URL of the requested page;
- Protocol: transfer protocol (such as TCP/IP);
- Return Code: code returned by the Web server;
- Dimension: dimension of the page (in Bytes).

Example 1: To discover Web communities of users on the basis of the pages they visited frequently, we might find associations between sets of users (first dimension) that have all visited a certain number of pages (second dimension); no coupling conditions are necessary (third dimension). Users are observed by means of their IP address, IPcaller, whose values are associated by rules (see SELECT). In this case, support and confidence of association rules are computed based on the number of pages visited by the users in the rules (see GROUP BY).
Thus, rule

{Ip1, Ip2} → {Ip3, Ip4}   support=0.4 confidence=0.45

means that users operating from Ip1, Ip2, Ip3, and Ip4 visited the same set of pages, which constitutes 40% of the total pages in the site.

MINE RULE UsersSamePages AS
SELECT DISTINCT 1..n IPcaller AS BODY, 1..n IPcaller AS HEAD, SUPPORT, CONFIDENCE
FROM WebLogTable
GROUP BY PageUrl
EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4

Example 2: In Web log analysis, it is interesting to discover the most frequent crawling paths.

MINE RULE FreqSeqPages AS
SELECT DISTINCT 1..n PageUrl AS BODY, 1..n PageUrl AS HEAD, SUPPORT, CONFIDENCE
WHERE BODY.Date < HEAD.Date
FROM WebLogTable
GROUP BY IPcaller
EXTRACTING RULES WITH SUPPORT:0.3, CONFIDENCE:0.4

Rows are grouped by user (IPcaller), and sets of pages frequently visited by a sufficient number of users are associated. Furthermore, pages are associated only if they denote a sequential pattern (third dimension); in fact, the mining condition WHERE BODY.Date < HEAD.Date constrains the temporal ordering between pages in the antecedent and the consequent of rules. Consequently, rule

{P1, P2} → {P3, P4, P5}   support=0.5 confidence=0.6

means that 50% of users visit pages P3, P4, and P5 after pages P1 and P2. This solution can be generalized easily for any problem requiring the search for sequential patterns.

Many other examples are possible, such as rules that associate users to frequently visited Web pages (highlighting the fidelity of the users to the service provided by a Web site), or frequent requests of a page by a browser that cause an error in the Web server (interesting because it constitutes a favorable situation for hacker attacks).

Third Application: Genes Classification by Micro-Array Experiments

We consider information on a single micro-array experiment containing data on several samples of biological tissue tied to correspondent probes on a silicon chip. Each sample is treated (or hybridized) in various ways and under different experimental conditions; these can determine the over-expression of a set of genes. This means that the sets of genes are active in the experimental conditions (or inactive if, on the contrary, they are under-expressed). Biologists are interested in discovering which sets of genes are expressed similarly and under what conditions.

A micro-array typically contains hundreds of samples, and for each sample, several thousands of genes are measured. Thus, the input relation, called MicroArrayTable, contains the following information:

- SampleID: identifier of the sample of biological tissue tied to a probe on the microchip;
- GeneId: identifier of the gene measured in the sample;
- TreatmentConditionId: identifier of the experimental conditions under which the sample has been treated;
- LevelOfExpression: measured value; if higher than a threshold T2, the genes are over-expressed; if lower than another threshold T1, genes are under-expressed.

Example: This analysis discovers sets of genes (first dimension of the problem) that, in the same experimental conditions (second dimension), are expressed similarly (third dimension).

MINE RULE SimilarlyCorrelatedGenes AS
SELECT DISTINCT 1..n GeneId AS BODY, 1..n GeneId AS HEAD, SUPPORT, CONFIDENCE
WHERE (BODY.LevelOfExpression < T1 AND HEAD.LevelOfExpression < T1) OR
      (BODY.LevelOfExpression > T2 AND HEAD.LevelOfExpression > T2)
FROM MicroArrayTable
GROUP BY SampleId, TreatmentConditionId
EXTRACTING RULES WITH SUPPORT:0.95, CONFIDENCE:0.8

The mining condition introduced by WHERE constrains both sets of genes to be similarly expressed in the same experimental conditions (i.e., samples of tissue treated in the same conditions).
The support threshold (0.95) determines the proportion of samples in which the sets of genes must be expressed similarly; the confidence determines how strongly the two sets of genes are correlated.

This statement might help biologists to discover the sets of genes that are involved in the production of proteins implicated in the development of certain diseases (e.g., cancer).
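As a small, hedged illustration of how the thresholds T1 and T2 could be applied before mining, the following plain-SQL sketch (not part of the MINE RULE statement above) discretizes the measured values of MicroArrayTable into expression classes; the alias ExpressionClass and the host variables :T1 and :T2 are assumptions of this example.

-- Illustrative pre-processing: classify each measurement as over-expressed,
-- under-expressed, or normal with respect to the thresholds T1 and T2.
SELECT SampleId, TreatmentConditionId, GeneId,
       CASE WHEN LevelOfExpression > :T2 THEN 'over'
            WHEN LevelOfExpression < :T1 THEN 'under'
            ELSE 'normal'
       END AS ExpressionClass
FROM MicroArrayTable;

Rules over such a discretized relation could then couple genes on equal ExpressionClass values instead of comparing the raw measurements twice in the mining condition.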
mining queries. Proceedings of the First International
Conference on Data Warehousing and Knowledge Dis-
FUTURE TRENDS covery, Florence, Italy.
Botta, M., Boulicaut, J.-F., Masson, C., & Meo, R. (2004).
This contribution wants to evaluate the usability of a Query languages supporting descriptive rule mining: A
mining query language and its resultsassociation comparative study. In R. Meo, P. Lanzi, & M. Klemettinen
rulesin some practical applications. We identified (Eds.), Database support for data mining applications.
many useful patterns that corresponded to concrete user (pp. 24-51). Berlin: Springer-Verlag.
problems. We showed that the exploitation of the nug-
gets of information embedded in the databases and of Boulicaut, J.-F., Klemettinen, M., & Mannila, H. (1998).
the specialized mining constructs provided by the query Querying inductive databases: A case study on the MINE
languages enables the rapid customization of the mining RULE operator Proceedings of the International Con-
procedures leading to the real users needs. Given our ference on Principles of Data Mining and Knowledge
experience, we also claim that, independently of the Discovery, Nantes, France.
application domain, the use of queries in advanced lan- Calders, T. (2004). Computational complexity of itemset
guages, as opposed to ad-hoc heuristics, eases the speci- frequency satisfiability. Proceedings of the Sympo-
fication and the discovery of a large spectrum of pat- sium on Principles Of Database Systems, Paris, France.
terns. This motivates the need for powerful query lan-
guages in KDD systems. Cooley, R., Tan, P.N., & Srivastava, J. (2000). Discovery
For the future, we believe that a critical point will be of interesting usage patterns from Web data. In Proceed-
the availability of powerful query optimizers, such as ings of WEBKDD-99 International Workshop on Web
the one proposed in Meo (2003). This one is able to Usage Analysis and User Profiling, San Diego, California.
solve data mining queries incrementally; that is, by Berlin: Springer Verlag.
modification of the previous queries results, material-
ized in the database. Fayyad, U.M. (2003). Special issue on microarray data
mining. SIGKDD Explorations, 5(2), 1-139.
Han, J., Fu, Y., Wang, W., Koperski, K., & Zaiane, O.
CONCLUSION (1996). DMQL: A data mining query language for rela-
tional databases. Proceedings of the Workshop on Re-
In this contribution, we focused on the semantic prob- search Issues on Data Mining and Knowledge Discov-
lem behind the extraction of association rules. We put ery, Montreal, Canada.
in evidence the semantic dimensions that characterize
the extraction of association rules; we did this by apply- Imielinski, T., & Mannila, H. (1996). A database per-
ing a general purpose query language designed for the spective on knowledge discovery. Communications of
extraction of association rules, named MINE RULE, to the ACM, 39(11), 58-64.
three important application domains. Imielinski, T., Virmani, A., & Abdoulghani, A. (1996).
The query examples we provided show that the min- DataMine: Application programming interface and query
ing language is powerful and, at the same time, versatile language for database mining. Proceedings of the Inter-
because its operational semantics seems to be the basic national Conference on Knowledge Discovery and
one. Indeed, these experiments allow us to claim that Data Mining, Portland, Oregon.
Imielinski and Mannilas (1996) initial view on induc-
tive databases was correct: <<There is no such thing as Meo, R. (2003). Optimization of a language for data
real discovery, just a matter of the expressive power of mining. Proceedings of the Symposium on Applied
the query languages>>. Computing, Melbourne, Florida.

744

TEAM LinG
Mine Rule

Meo, R., Psaila, G., & Ceri, S. (1998). An extension to SQL for mining association rules. Journal of Data Mining and Knowledge Discovery, 2(2), 195-224.

Netz, A., Chaudhuri, S., Fayyad, U.M., & Bernhardt, J. (2001). Integrating data mining with SQL databases: OLE DB for data mining. Proceedings of the International Conference on Data Engineering, Heidelberg, Germany.

Ng, R.T., Lakshmanan, V.S., Han, J., & Pang, A. (1998). Exploratory mining and pruning optimizations of constrained associations rules. Proceedings of the International Conference on Management of Data, Seattle, Washington.

Srikant, R., Vu, Q., & Agrawal, R. (1997). Mining association rules with item constraints. Proceedings of the International Conference on Knowledge Discovery from Databases, Newport Beach, California.

Tsur, D. et al. (1998). Query flocks: A generalization of association-rule mining. Proceedings of the International Conference on Management of Data, Seattle, Washington.

KEY TERMS

Association Rule: An association between two sets of items co-occurring frequently in groups of data.

Constraint-Based Mining: Data mining obtained by means of evaluation of queries in a query language allowing predicates.

CRM: Management, understanding, and control of data on the customers of a company for the purposes of enhancing business and minimizing the customers' churn.

Inductive Database: Database system integrating in the database source data and data mining patterns defined as the result of data mining queries on source data.

KDD: Knowledge Discovery Process from the database, performing tasks of data pre-processing, transformation and selection, and extraction of data mining patterns and their post-processing and interpretation.

Semantic Dimension: Concept or entity of the studied domain that is being observed in terms of other concepts or entities.

Web Log: File stored by the Web server containing data on users' accesses to a Web site.
Mining Association Rules on a NCR Teradata System
Soon M. Chung
Wright State University, USA

Murali Mangamuri
Wright State University, USA

INTRODUCTION

Data mining from relations is becoming increasingly important with the advent of parallel database systems. In this paper, we propose a new algorithm for mining association rules from relations. The new algorithm is an enhanced version of the SETM algorithm (Houtsma & Swami, 1995), and it reduces the number of candidate itemsets considerably. We implemented and evaluated the new algorithm on a parallel NCR Teradata database system. The new algorithm is much faster than the SETM algorithm, and its performance is quite scalable.

BACKGROUND

Data mining, also known as knowledge discovery from databases, is the process of finding useful patterns from databases. One of the useful patterns is the association rule, which is formally described in Agrawal, Imielinski, and Swami (1993) as follows: Let I = {i1, i2, . . . , im} be a set of items. Let D represent a set of transactions, where each transaction T contains a set of items, such that T ⊆ I. Each transaction is associated with a unique identifier, called transaction identifier (TID). A set of items X is said to be in transaction T if X ⊆ T. An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X => Y holds in the database D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X => Y has a support s if s% of the transactions in D contain X ∪ Y. For example, beer and disposable diapers are items such that beer => diapers is an association rule mined from the database if the co-occurrence rate of beer and disposable diapers (in the same transaction) is not less than the minimum support, and the occurrence rate of diapers in the transactions containing beer is not less than the minimum confidence.

The problem of mining association rules is to find all the association rules that have support and confidence greater than or equal to the user-specified minimum support and minimum confidence, respectively. This problem can be decomposed into the following two steps:

1. Find all sets of items (called itemsets) that have support above the user-specified minimum support. These itemsets are called frequent itemsets or large itemsets.
2. For each frequent itemset, all the association rules that have minimum confidence are generated as follows: For every frequent itemset f, find all non-empty subsets of f. For every such subset a, generate a rule of the form a => (f - a) if the ratio of support(f) to support(a) is at least the minimum confidence.
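As a rough illustration of the second step, the following SQL sketch derives 2-item rules from frequent-itemset tables of the kind built later in this article. The table and column names (F1 with columns item and cnt, F2 with columns item1, item2, and cnt, and the output table Rules2) are assumptions made for this example, not part of the algorithms described below.

-- Hypothetical sketch: every frequent pair {x, y} in F2 yields the rule x => y
-- when support({x, y}) / support({x}) reaches the minimum confidence.
INSERT INTO Rules2 (body_item, head_item, confidence)
SELECT f2.item1, f2.item2, (f2.cnt * 1.0) / f1.cnt
FROM   F2 f2, F1 f1
WHERE  f1.item = f2.item1
  AND  (f2.cnt * 1.0) / f1.cnt >= :minimum_confidence;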
Finding all the frequent itemsets is a very resource-consuming task, but generating all the valid association rules from the frequent itemsets is quite straightforward.

There are many association rule-mining algorithms proposed (Agarwal, Aggarwal & Prasad, 2000; Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994; Bayardo, 1998; Burdick, Calimlim & Gehrke, 2001; Gouda & Zaki, 2001; Holt & Chung, 2001, 2002; Houtsma & Swami, 1995; Park, Chen & Yu, 1997; Savasere, Omiecinski & Navathe, 1995; Zaki, 2000). However, most of these algorithms are designed for data stored in file systems. Considering that relational databases are used widely to manage the corporation data, integrating the data mining with the relational database system is important. A methodology for tightly coupling a mining algorithm with a relational database using user-defined functions is proposed in Agrawal and Shim (1996), and a detailed study of various architectural alternatives for coupling mining with database systems is presented in Sarawagi, Thomas, and Agrawal (1998).

The SETM algorithm proposed in Houtsma and Swami (1995) was expressed in the form of SQL queries. Thus, it can be applied easily to relations in relational databases and can take advantage of the functionalities provided by the SQL engine, such as query optimization, efficient execution of relational algebra operations, and indexing. SETM also can be implemented easily on a parallel database system that can execute the SQL queries in parallel on different processing nodes.

By processing the relations directly, we can easily relate the mined association rules to other information in the same database, such as the customer information.

In this paper, we propose a new algorithm named Enhanced SETM (ESETM), which is an enhanced version of the SETM algorithm. We implemented both ESETM and SETM on a parallel NCR Teradata database system and evaluated and compared their performance for various cases. It has been shown that ESETM is considerably faster than SETM.

MAIN THRUST

NCR Teradata Database System

The algorithms are implemented on an NCR Teradata database system. It has two nodes, where each node consists of 4 Intel 700MHz Xeon processors, 2GB shared memory, and 36GB disk space. The nodes are interconnected by a dual BYNET interconnection network supporting 960Mbps of data bandwidth for each node. Moreover, the nodes are connected to an external disk storage subsystem configured as a level-5 RAID (Redundant Array of Inexpensive Disks) with 288GB disk space.

The relational DBMS used here is Teradata RDBMS (version 2.4.1), which is designed specifically to function in the parallel environment. The hardware that supports the Teradata RDBMS software is based on off-the-shelf Symmetric Multiprocessing (SMP) technology. The hardware is combined with a communication network (BYNET) that connects the SMP systems to form Massively Parallel Processing (MPP) systems, as shown in Figure 1 (NCR Teradata Division, 2002).

The versatility of the Teradata RDBMS is based on virtual processors (vprocs) that eliminate the dependency on specialized physical processors. Vprocs are a set of software processes that run on a node, isolated from other vprocs but sharing some of the physical resources of the node, such as memory and CPUs (NCR Teradata Division, 2002).

Vprocs and the tasks running under them communicate using unique-address messaging, as if they were physically isolated from one another. The Parsing Engine (PE) and the Access Module Processor (AMP) are two types of vprocs. Each PE executes the database software that manages sessions, decomposes SQL statements into steps, possibly parallel, and returns the answer rows to the requesting client. The AMP is the heart of the Teradata RDBMS. The AMP is a vproc that performs many database and file-management tasks. The AMPs control the management of the Teradata RDBMS and the disk subsystem. Each AMP manages a portion of the physical disk space and stores its portion of each database table within that disk space, as shown in Figure 2 (NCR Teradata Division, 2002).

SETM Algorithm

The SETM algorithm proposed in Houtsma and Swami (1995) for finding frequent itemsets, and the corresponding SQL queries used, are as follows:

// SALES = <trans_id, item>
k := 1;
sort SALES on item;
F1 := set of frequent 1-itemsets and their counts;
R1 := filter SALES to retain supported items;
repeat
    k := k + 1;
    sort R'k-1 on trans_id, item1, . . . , itemk-1;
    Rk := merge-scan R'k-1, R1;
    sort Rk on item1, . . . , itemk;
    Fk := generate frequent k-itemsets from the sorted Rk;
    R'k := filter Rk to retain supported k-itemsets;
until Rk = {}

In this algorithm, initially, all frequent 1-itemsets and their respective counts (F1 = <item, count>) are generated

TEAM LinG
Mining Association Rules on a NCR Teradata System

and R1 tables. R k table can be viewed as the set of candidate generate candidate 2-itemsets and directly generates
k-itemsets coupled with their transaction identifiers. frequent 2-itemsets. This view or subquery is used also
to create candidate 3-itemsets.
SQL query for generating Rk:
INSERT INTO R k CREATE VIEW R 2 (trans_id, item 1, item 2) AS
SELECT p.trans_id, p.item 1, . . . , p.item k-1, q.item SELECT P1.trans_id, P1.item, P2.item
FROM (SELECT p.trans_id, p.item FROM SALES p, F 1 q
FROM Rk-1 p, R1 q WHERE p.item = q.item) AS P1,
WHERE q.trans_id = p.trans_id AND q.item > p.itemk-1 (SELECT p.trans_id, p.item FROM SALES p, F 1 q WHERE p.item = q.item) AS P2
WHERE P1.trans_id = P2.trans_id AND
P1.item < P2.item
Frequent k-itemsets are generated by a sequential scan
over Rk and selecting only those itemsets that meet the Note that R1 is not created, since it will not be used for
minimum support constraint. the generation of Rk. The set of frequent 2-itemsets, F 2,
can be generated directly by using this R2 view.
SQL query for generating Fk:
INSERT INTO F k
INSERT INTO F 2
SELECT p.item1, . . . , p.itemk, COUNT(*)
SELECT item 1 , item 2 , COUNT(*)
FROM R k p
FROM R2
GROUP BY p.item 1, . . . , p.itemk
GROUP BY item 1, item2
HAVING COUNT(*) >= :minimum_support
HAVING COUNT(*) >= :minimum_support

Rk table is created by filtering Rk table using Fk. Rk table The second modification is to generate Rk+1 using the
can be viewed as a set of frequent k-itemsets coupled with join of Rk with itself, instead of the merge-scan of Rk with
their transaction identifiers. This step is performed to R1.
ensure that only the candidate k-itemsets (R k) relative to
frequent k-itemsets are used to generate the candidate SQL query for generating Rk+1:
(k+1)-itemsets. INSERT INTO R k+1
SELECT p.trans_id, p.item1, . . . , p.item k, q.itemk
SQL query for generating R k: FROM R k p, Rk q
INSERT INTO R k WHERE p.trans_id = q.trans_id AND
SELECT p.trans_id, p.item 1, . . . , p.itemk p.item 1 = q.item 1 AND
FROM R k p, Fk q .
WHERE p.item 1 = q.item 1 AND .
. p.item k-1 = q.itemk-1 AND
. p.item k < q.itemk
p.item k-1 = q.item k-1 AND
p.item k = q.item k
This modification reduces the number of candidates
ORDER BY p.trans_id, p.item1, . . . , p.itemk
(k+1)-itemsets generated compared to the original SETM
algorithm. The performance of the algorithm can be
A loop is used to implement the procedure described improved further if candidate (k+1)-itemsets are gener-
above, and the number of iterations depends on the size of ated directly from candidate k-itemsets using a subquery
the largest frequent itemset, as the procedure is repeated as follows:
until Fk is empty.
SQL query for R k+1 using R k:
Enhanced SETM (ESETM) INSERT INTO R k+1
SELECT P1.trans_id, P1.item 1, . . . , P1.itemk, P2.item k
FROM
The Enhanced SETM (ESETM) algorithm has three modi- (SELECT p.* FROM Rk p, Fk q WHERE p.item 1 = q.item1
fications to the original SETM algorithm: AND . . . AND p.item k = q.item k) AS P1,
(SELECT p.* FROM Rk p, Fk q WHERE p.item 1 = q.item1
AND . . . AND p.item k = q.item k) AS P2
1. Create frequent 2-itemsets without materializing R1
WHERE P1.trans_id = P2.trans_id AND
and R2. P1.item 1 = P2.item 1 AND
2. Create candidate (k+1)-itemsets in Rk+1 by joining Rk .
with itself. .
P1.itemk-1 = P2.itemk-1 AND
3. Use a subquery to generate Rk rather than material-
P1.item k < P2.itemk
izing it, thereby generating Rk+1 directly from Rk.

The number of candidate 2-itemsets can be very large, Rk is generated as a derived table using a subquery,
so it is inefficient to materialize R2 table. Instead of creat- thereby saving the cost of materializing R k table.
ing R2 table, ESETM creates a view or a subquery to

748

TEAM LinG
Mining Association Rules on a NCR Teradata System

ESETM with Pruning (PSETM) in the Subquery Q1 is too high when there are not many
candidates to be pruned. In our implementation, the prun- M
In the ESETM algorithm, candidate (k+1)-itemsets in Rk+1 ing is performed until the number of rows in Fk becomes
are generated by joining Rk with itself on the first k-1 items, less than 1,000, or up to five passes. The difference
as described previously. For example, a 4-itemset {1, 2, 3, between the total execution times with and without prun-
9} becomes a candidate 4-itemset only if {1, 2, 3} and {1, ing was very small for most of the databases we tested.
2, 9} are frequent 3-itemsets. It is different from the subset-
infrequency-based pruning of the candidates used in the Performance Analysis
Apriori algorithm, where a (k+1)-itemset becomes a can-
didate (k+1)-itemset, only if all of its k-subsets are fre- In this section, the performance of the Enhanced SETM
quent. So, {2, 3, 9} and {1, 3, 9} also should be frequent (ESETM), ESETM with pruning (PSETM), and SETM are
for {1, 2, 3, 9} to be a candidate 4-itemset. The above SQL- evaluated and compared. We used synthetic transaction
query for generating Rk+1 can be modified such that all the databases generated according to the procedure de-
k-subsets of each candidate (k+1)-itemset can be checked. scribed in (Agrawal & Srikant, 1994).
To simplify the presentation, we divided the query into The total execution times of ESETM, PSETM and
subqueries. Candidate (k+1)-itemsets are generated by SETM are shown in Figure 3 for the database T10.I4.D100K,
the Subquery Q1 using Fk. where Txx.Iyy.DzzzK indicates that the average number of
items in a transaction is xx, the average size of maximal
Subquery Q0: potential frequent itemset is yy, and the number of trans-
SELECT item 1,item2, . . . , itemk FROM F k
actions in the database is zzz in thousands.
Subquery Q1: ESETM is more than three times faster than SETM for
SELECT p.item1, p.item 2, . . . , p.item k, q.item k all minimum support levels, and the performance gain
FROM Fk p, Fk q increases as the minimum support level decreases. ESETM
WHERE p.item1 = q.item 1 AND
and PSETM have almost the same total execution time,
.
. because the effect of the reduced number of candidates in
p.itemk-1 = q.item k-1 AND PESTM is offset by the extra time required for the pruning.
p.item k < q.item k AND The time taken for each pass by the algorithms for the
(p.item2, . . . , p.itemk, q.itemk) IN (Subquery Q0) AND
T10.I4.D100K database with the minimum support of
.
. 0.25% is shown in Figure 4. The second pass execution
(p.item1, . . . , p.itemj-1, p.itemj+1, . . . , p.item k, q.itemk) IN time of ESTM is much smaller than that of SETM, because
(Subquery Q0 ) AND R2 table (containing candidate 2-itemsets together with
.
the transaction identifiers) and R2 table (containing fre-
.
(p.item1, . . . , p.item k-2, p.item k, q.item k) IN (Subquery Q0) quent 2-itemsets together with the transaction identifiers)
are not materialized. In the later passes, the performance
of ESETM is much better than that of SETM, because
Subquery Q2:
ESTM has much less candidate itemsets generated and
SELECT p.* FROM R k p, F k q
WHERE p.item 1 = q.item 1 AND . . . . AND p.itemk = q.item k does not materialize Rk tables, for k > 2.
In Figure 5, the size of Rk table containing candidate
INSERT INTO R k+1 k-itemsets is shown for each pass when the T10.I4.D100K
SELECT p.trans_id, p.item1, . . . , p.item k, q.itemk
database is used with the minimum support of 0.25%. From
FROM (Subquery Q 2) p, (Subquery Q 2) q
WHERE p.trans_id = q.trans_id AND the third pass, the size of Rk table for ESETM is much
p.item 1 = q.item1 AND
. Figure 3. Total execution times (for T10.I4.D100K)
.
p.itemk-1 = q.item k-1 AND SETM ESETM PSETM
p.itemk < q.itemk AND
1200
(p.item1, . . . , p.itemk, q.item k) IN (Subquery Q1)
1000

The Subquery Q1 joins Fk with itself to generate the 800


Time (sec)

candidate (k+1)-itemsets, and all candidate (k+1)-itemsets 600

having any infrequent k-subset are pruned. The Subquery 400

Q2 derives Rk, and Rk+1 is generated as: Rk+1 = (Rk JOIN Rk) 200

JOIN (Subquery Q1). 0


1.00% 0.50% 0.25% 0.10%
However, it is not efficient to prune all the candidates
Minimum Support
in all the passes, since the cost of pruning the candidates

749

TEAM LinG
Mining Association Rules on a NCR Teradata System

smaller than that of SETM because of the reduced number because the number of candidate itemsets generated in
of candidate itemsets. PSETM performs additional prun- the later passes is small.
ing of candidate itemsets, but the difference in the number
of candidates is very small in this case.
The scalability of the algorithms is evaluated by in- FUTURE TRENDS
creasing the number of transactions and the average size
of transactions. Figure 6 shows how the three algorithms Relational database systems are used widely, and the size
scale up as the number of transactions increases. The of existing relational databases grows quite rapidly. Thus,
database used here is T10.I4, and the minimum support is mining the relations directly without transforming them
0.5%. The number of transactions ranges from 100,000 to into certain file structures is very useful. However, due to
400,000. SETM performs poorly as the number of transac- the high operation complexity of the mining processes,
tions increases, because it generates much more candi- parallel data mining is essential for very large databases.
date itemsets than others. Currently, we are developing an algorithm for mining
The effect of the transaction size on the performance association rules across multiple relations using our par-
is shown in Figure 7. In this case, the size of the database allel NCR Teradata database system.
wasnt changed by keeping the product of the average
transaction size and the number of transactions constant.
The number of transactions was 20,000 for the average CONCLUSION
transaction size of 50 and 100,000 for the average transac-
tion size of 10. We used the fixed minimum support count In this paper, we proposed a new algorithm, named En-
of 250 transactions, regardless of the number of transac- hanced SETM (ESETM) for mining association rules from
tions. The performance of SETM deteriorates as the relations. ESETM is an enhanced version of the SETM
transaction size increases, because the number of candi- algorithm (Houtsma & Swami, 1995), and its performance
date itemsets generated is very large. On the other hand, is much better than SETM, because it generates much less
the total execution times of ESETM and PSETM are stable,

Figure 4. Per pass execution times (for T10.I4.D100K) Figure 5. Size of R'k (for T10.I4.D100K)

SETM ESETM PSETM


SETM ESETM PSETM

250 2500
No. of Tuples (in 1000s)

200 2000
Time (sec)

150 1500

100 1000

50 500

0
0
R3 R4 R5 R6 R7 R8 R9
1 2 3 4 5 6 7 8 9
Number of Passes

Figure 6. Effect of the number of transactions Figure 7. Effect of the transaction size

SETM ESETM PSETM SETM ESETM PSETM

1200 5000

1000 4000
800
Time (sec)
Time (sec)

3000
600
2000
400
200 1000

0 0
100 200 300 400 10 20 30 40 50
Number of Transactions (in 1000s) Average Transaction Size

750

TEAM LinG
Mining Association Rules on a NCR Teradata System

CONCLUSION

In this paper, we proposed a new algorithm, named Enhanced SETM (ESETM), for mining association rules from relations. ESETM is an enhanced version of the SETM algorithm (Houtsma & Swami, 1995), and its performance is much better than SETM, because it generates far fewer candidate itemsets to count. ESETM and SETM are implemented on a parallel NCR database system, and we evaluated their performance in various cases. ESETM is at least three times faster than SETM in most of our test cases, and its performance is quite scalable.

ACKNOWLEDGMENTS

This research was supported in part by NCR, LexisNexis, Ohio Board of Regents (OBR), and AFRL/Wright Brothers Institute (WBI).

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C., USA.

Agrawal, R., & Shim, K. (1996). Developing tightly-coupled data mining applications on a relational database system. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the VLDB Conference.

Agarwal, R.C., Aggarwal, C.C., & Prasad, V.V.V. (2000). Depth first generation of long patterns. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA.

Bayardo, R.J. (1998). Efficiently mining long patterns from databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA.

Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. Proceedings of the International Conference on Data Engineering, Heidelberg, Germany.

Gouda, K., & Zaki, M.J. (2001). Efficiently mining maximal frequent itemsets. Proceedings of the 1st IEEE International Conference on Data Mining, San Jose, CA, USA.

Holt, J.D., & Chung, S.M. (2001). Multipass algorithms for mining association rules in text databases. Knowledge and Information Systems, 3(2), 168-183.

Holt, J.D., & Chung, S.M. (2002). Mining association rules using inverted hashing and pruning. Information Processing Letters, 83(4), 211-220.

Houtsma, M., & Swami, A. (1995). Set-oriented mining for association rules in relational databases. Proceedings of the International Conference on Data Engineering, Taipei, Taiwan.

NCR Teradata Division (2002). Introduction to Teradata RDBMS.

Park, J.S., Chen, M.S., & Yu, P.S. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Trans. on Knowledge and Data Engineering, 9(5), 813-825.

Sarawagi, S., Thomas, S., & Agrawal, R. (1998). Integrating association rule mining with relational database systems: Alternatives and implications. Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA.

Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. Proceedings of the VLDB Conference, Zurich, Switzerland.

Zaki, M.J. (2000). Scalable algorithms for association mining. IEEE Trans. on Knowledge and Data Engineering, 12(3), 372-390.

KEY TERMS

Association Rule: Implication of the form X => Y, meaning that database tuples satisfying the conditions of X are also likely to satisfy the conditions of Y.

Data Mining: Process of finding useful data patterns hidden in large data sets.

Parallel Database System: Database system supporting the parallel execution of the individual basic database operations, such as relational algebra operations and aggregate operations.


Mining Association Rules Using Frequent Closed Itemsets

Nicolas Pasquier
Université de Nice-Sophia Antipolis, France

INTRODUCTION

In the domain of knowledge discovery in databases and its computational part called data mining, many works addressed the problem of association rule extraction, which aims at discovering relationships between sets of items (binary attributes). An example association rule fitting in the context of market basket data analysis is cereal ∧ sugar => milk (support 10%, confidence 60%). This rule states that 60% of customers who buy cereals and sugar also buy milk, and that 10% of all customers buy all three items. When an association rule's support and confidence exceed some user-defined thresholds, the rule is considered relevant to support decision making. Association rule extraction has proved useful to analyze large databases in a wide range of domains, such as marketing decision support; diagnosis and medical research support; telecommunication process improvement; Web site management and profiling; spatial, geographical, and statistical data analysis; and so forth.
The first phase of association rule extraction is the data selection from data sources and the generation of the data mining context, which is a triplet D = (O, I, R), where O and I are finite sets of objects and items respectively, and R ⊆ O × I is a binary relation. An item is most often an attribute value or an interval of attribute values. Each couple (o, i) ∈ R denotes the fact that the object o ∈ O is related to the item i ∈ I. If an object o is in relation with all items of an itemset I (a set of items), we say that o contains I.
This phase helps to improve the extraction efficiency and enables the treatment of all kinds of data, often mixed in operational databases, with the same algorithm. Data-mining contexts are large relations that do not fit in main memory and must be stored in secondary memory. Consequently, each context scan is very time consuming.

Table 1. Example context

OID   Items
1     ACD
2     BCE
3     ABCE
4     BE
5     ABCE
6     BCE

BACKGROUND

The support of an itemset I is the proportion of objects containing I in the context. An itemset is frequent if its support is greater than or equal to the minimal support threshold defined by the user. An association rule r is an implication of the form r: I1 => I2 - I1, where I1 and I2 are frequent itemsets such that I1 ⊂ I2. The confidence of r is the number of objects containing I2 divided by the number of objects containing I1. An association rule is generated if its support and confidence are at least equal to the minsupport and minconfidence thresholds. Association rules with 100% confidence are called exact association rules; others are called approximate association rules.
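As a concrete illustration of these definitions, the following minimal Python sketch (my own illustration, not part of the original article; the function and variable names are chosen for the example) computes the support of an itemset and the confidence of a rule over the example context of Table 1.

```python
# Example context of Table 1: object identifier -> set of items.
CONTEXT = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
    5: {"A", "B", "C", "E"},
    6: {"B", "C", "E"},
}

def support(itemset):
    """Proportion of objects of the context that contain the itemset."""
    itemset = set(itemset)
    count = sum(1 for items in CONTEXT.values() if itemset <= items)
    return count / len(CONTEXT)

def confidence(antecedent, consequent):
    """Confidence of antecedent => consequent:
    support(antecedent union consequent) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

if __name__ == "__main__":
    # {B, C} is contained in objects 2, 3, 5 and 6, so its support is 4/6.
    print(support({"B", "C"}))            # 0.666...
    # Rule BC => E: every object containing BC also contains E.
    print(confidence({"B", "C"}, {"E"}))  # 1.0, an exact rule
    # Rule B => A holds in only 2 of the 5 objects containing B.
    print(confidence({"B"}, {"A"}))       # 0.4, an approximate rule
```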
The natural decomposition of the association rule-mining problem is:

1. Extracting frequent itemsets and their support from the context.
2. Generating all valid association rules from frequent itemsets and their support.

The first phase is the most computationally expensive part of the process, since the number of potential frequent itemsets, 2^|I|, is exponential in the size of the set of items, and several context scans are required. A trivial approach would consider all potential frequent itemsets at the same time, but this approach cannot be used for large databases where |I| is large. Then, the set of potential frequent itemsets, which constitutes a lattice called the itemset lattice, must be decomposed into several subsets considered one at a time.

Figure 1. Itemset lattice: the lattice of all itemsets over the items A, B, C, D, E of the example context

Level-Wise Algorithms for Extracting Frequent Itemsets

These algorithms consider all itemsets of a given size (i.e., all itemsets of a level in the itemset lattice) at a time. They are based on the properties that all supersets of an infrequent itemset are infrequent and all subsets of a frequent itemset are frequent (Agrawal et al., 1995). Using this property, the candidate k-itemsets (itemsets of size k) of the kth iteration are generated by joining two frequent (k-1)-itemsets discovered during the preceding iteration, if their k-1 first items are identical. Then, one database scan is performed to count the supports of the candidates, and infrequent ones are pruned. This process is repeated until no new candidate can be generated.
This approach is used in the well known APRIORI and OCD algorithms. Both carry out a number of context scans equal to the size of the largest frequent itemsets. Several optimizations have been proposed to improve the efficiency by avoiding several context scans. The COFI* (El-Hajj & Zaïane, 2004) and FP-GROWTH (Han et al., 2004) algorithms use specific data structures for that, and the PASCAL algorithm (Bastide et al., 2000) uses a method called pattern counting inference to avoid counting all supports.
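The join-and-prune step just described can be sketched as follows. This is a minimal illustration written for the small example context, not the actual APRIORI or OCD implementation; the minimum support value and helper names are my own choices.

```python
from itertools import combinations

# Example context of Table 1 (one set of items per object).
CONTEXT = [
    {"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
    {"B", "E"}, {"A", "B", "C", "E"}, {"B", "C", "E"},
]
MINSUP = 2  # absolute minimal support (number of objects)

def count(itemset):
    """Number of objects of the context containing the itemset."""
    return sum(1 for obj in CONTEXT if itemset <= obj)

def frequent_itemsets():
    """Level-wise search: join frequent (k-1)-itemsets sharing their
    first k-2 items, count candidates in one pass over the context,
    prune infrequent ones, and repeat until no candidate remains."""
    items = sorted({i for obj in CONTEXT for i in obj})
    level = [(i,) for i in items if count({i}) >= MINSUP]
    result = {c: count(set(c)) for c in level}
    while level:
        candidates = []
        for a, b in combinations(level, 2):
            # Join step: the two itemsets must agree on all but their last item.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.append(a + (b[-1],))
        # Prune step: one scan of the context counts candidate supports.
        level = [c for c in candidates if count(set(c)) >= MINSUP]
        result.update({c: count(set(c)) for c in level})
    return result

if __name__ == "__main__":
    for itemset, sup in sorted(frequent_itemsets().items()):
        print("".join(itemset), sup)
```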


Algorithms for Extracting Maximal Frequent Itemsets

Maximal and minimal itemsets are defined according to the inclusion relation. Maximal frequent itemsets are frequent itemsets of which all supersets are infrequent. They form a border under which all itemsets are frequent; knowing all maximal frequent itemsets, we can deduce all frequent itemsets, but not their support. Then, the following approach for mining association rules was proposed:

1. Extracting maximal frequent itemsets and their supports from the context.
2. Deriving frequent itemsets from maximal frequent itemsets and counting their support in the context during one final scan.
3. Generating all valid association rules from frequent itemsets.

These algorithms perform an iterative search in the itemset lattice, advancing during each iteration by one level from the bottom upwards, as in APRIORI, and by one or more levels from the top downwards. Compared to preceding algorithms, both the number of iterations and, thus, the number of context scans and the number of CPU operations carried out are reduced. The most well known algorithms based on this approach are PINCER-SEARCH (Lin & Kedem, 1998) and MAX-MINER (Bayardo, 1998).

Relevance of Extracted Association Rules

For many datasets, a huge number of association rules is extracted, even for high minsupport and minconfidence values. This problem is crucial with correlated data, for which several million association rules sometimes are extracted. Moreover, a majority of these rules bring the same information and, thus, are redundant. To illustrate this problem, nine rules extracted from the mushroom dataset (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/mushroom/) are presented in the following. All have the same support (51%) and confidence (54%), and the item free gills in the antecedent:

1. free_gills => edible
2. free_gills => edible, partial_veil
3. free_gills => edible, white_veil
4. free_gills => edible, partial_veil, white_veil
5. free_gills, partial_veil => edible
6. free_gills, partial_veil => edible, white_veil
7. free_gills, white_veil => edible
8. free_gills, white_veil => edible, partial_veil
9. free_gills, partial_veil, white_veil => edible

The most relevant rule from the viewpoint of the user is rule 4, since all other rules can be deduced from this one, including support and confidence. This rule is a non-redundant association rule with minimal antecedent and maximal consequent, or minimal non-redundant rule, for short.

Association Rules Reduction Methods

Several approaches for reducing the number of rules and selecting the most relevant ones have been proposed.
The application of templates (Baralis & Psaila, 1997) or Boolean operators (Bayardo, Agrawal & Gunopulos, 2000) allows selecting rules according to the user's preferences.
When taxonomies of items exist, generalized association rules (Han & Fu, 1999) (i.e., rules between items of different levels of taxonomies) can be extracted. This produces fewer but more general associations. Other statistical measures, such as Pearson's correlation or the χ² statistic, also can be used instead of the confidence to determine the rule precision (Silverstein, Brin & Motwani, 1998).


Several methods to prune similar rules by analyzing their structures also have been proposed. This allows the extraction of rules only with maximal antecedents among those with the same support and the same consequent (Bayardo & Agrawal, 1999), for instance.

MAIN THRUST

Algorithms for Extracting Frequent Closed Itemsets

In contrast with the (maximal) frequent itemsets-based approaches, the frequent closed itemsets approach (Pasquier et al., 1998; Zaki & Ogihara, 1998) is based on the closure operator of the Galois connection. This operator, denoted γ, associates with an itemset I the maximal set of items common to all the objects containing I (i.e., the intersection of these objects). The frequent closed itemsets are frequent itemsets with γ(I) = I. An itemset C is a frequent closed itemset if no other item i ∉ C is common to all objects containing C.
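One way to make the closure operator concrete is the short sketch below (my own illustration over the example context of Table 1, not the CLOSE implementation): the closure of an itemset is the intersection of all objects containing it, and an itemset is closed when it equals its own closure.

```python
# Example context of Table 1 (object id -> set of items).
CONTEXT = {
    1: {"A", "C", "D"}, 2: {"B", "C", "E"}, 3: {"A", "B", "C", "E"},
    4: {"B", "E"}, 5: {"A", "B", "C", "E"}, 6: {"B", "C", "E"},
}
ALL_ITEMS = set().union(*CONTEXT.values())

def closure(itemset):
    """Galois closure: intersection of all objects containing the itemset.
    If no object contains it, the closure is the whole set of items."""
    covering = [obj for obj in CONTEXT.values() if set(itemset) <= obj]
    if not covering:
        return set(ALL_ITEMS)
    result = set(ALL_ITEMS)
    for obj in covering:
        result &= obj
    return result

def is_closed(itemset):
    return closure(itemset) == set(itemset)

if __name__ == "__main__":
    print(sorted(closure({"B"})))      # ['B', 'E']: {B} is a generator of BE
    print(sorted(closure({"A"})))      # ['A', 'C']: {A} is a generator of AC
    print(is_closed({"B", "C", "E"}))  # True: BCE is a closed itemset
    print(is_closed({"B", "C"}))       # False: its closure is BCE
```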
The frequent closed itemsets, together with their supports, constitute a generating set for all frequent itemsets and their supports and, thus, for all association rules, their supports, and their confidences (Pasquier et al., 1999a). This property relies on the properties that the support of a frequent itemset is equal to the support of its closure and that the maximal frequent itemsets are maximal frequent closed itemsets. Using these properties, a new approach for mining association rules was proposed:

1. Extracting frequent closed itemsets and their supports from the context.
2. Deriving frequent itemsets and their supports from frequent closed itemsets.
3. Generating all valid association rules from frequent itemsets.

The search space in the first phase is reduced to the closed itemset lattice, which is a sublattice of the itemset lattice.

Figure 2. Closed itemset lattice: the closed itemsets of the example context (C, BE, AC, BCE, ACD, ABCE and ABCDE)

The first algorithms proposed based on this approach are CLOSE (Pasquier et al., 1999a) and A-CLOSE (Pasquier et al., 1999b). To improve the extraction efficiency, both perform a level-wise search for generators of frequent closed itemsets. The generators of a closed itemset C are the minimal itemsets whose closure is C; an itemset G is a generator of C if there is no other itemset G' ⊂ G whose closure is C.
During an iteration k, CLOSE considers a set of candidate k-generators. One context scan is performed to compute their supports and closures; for each generator G, the intersection of all objects containing G gives its closure, and counting them gives its support. Then, infrequent generators and generators of frequent closed itemsets previously discovered are pruned. During the (k+1)th iteration, candidate (k+1)-generators are constructed by joining two frequent k-generators having identical k-1 first items.
In the A-CLOSE algorithm, generators are identified by comparing supports only, since the support of a generator is different from the supports of all its subsets. Then, one more context scan is performed at the end of the algorithm to compute closures of all frequent generators discovered.
Recently, the CHARM (Zaki & Hsiao, 2002), CLOSET+ (Wang, Han & Pei, 2003) and BIDE (Wang & Han, 2004) algorithms have been proposed. These algorithms efficiently extract frequent closed itemsets but not their generators. The TITANIC algorithm (Stumme et al., 2002) can extract frequent closed sets according to different closures, such as functional dependencies or Galois closures, for instance.

Comparing Execution Times

Experiments conducted on both synthetic and operational datasets showed that (maximal) frequent itemsets-based approaches are more efficient than closed itemsets-based approaches on weakly correlated data, such as market-basket data. In such data, nearly all frequent itemsets also are frequent closed itemsets (i.e., the closed itemset lattice and the itemset lattice are nearly identical), and closure computations add execution times.


Correlated data constitute a challenge for efficiently extracting association rules, since the number of frequent itemsets is most often very important, even for high minsupport values. On these data, few frequent itemsets are also frequent closed itemsets. Thus, the closure helps to reduce the search space; fewer itemsets are tested, and the number of context scans is reduced. On such data, maximal frequent itemsets-based approaches suffer from the time needed to compute frequent itemset supports, which requires accessing the dataset. With the closure, these supports are derived from the supports of frequent closed itemsets without accessing the dataset.

Extracting Bases for Association Rules

Bases are minimal sets, with respect to some criteria, from which all rules can be deduced with support and confidence. The Duquenne-Guigues and the Luxenburger bases for global and partial implications were adapted to the association rule framework in Pasquier et al. (1999c) and Zaki (2000). These bases are minimal regarding the number of rules; no smaller set allows the deduction of all rules with support and confidence. However, they do not contain the minimal non-redundant rules.
An association rule is redundant if it brings the same information, or less general information, than that conveyed by another rule with identical support and confidence. Then, an association rule r is a minimal non-redundant association rule if there is no association rule r' with the same support and confidence whose antecedent is a subset of the antecedent of r and whose consequent is a superset of the consequent of r. An inference system based on this definition was proposed in Cristofor and Simovici (2002).
The Min-Max basis for exact association rules contains all rules G => γ(G) - G between a generator G and its closure γ(G) such that γ(G) ≠ G. The Min-Max basis for approximate association rules contains all rules G => C - G between a generator itemset G and a frequent closed itemset C that is a strict superset of its closure: γ(G) ⊂ C. These bases, also called informative bases, contain, respectively, the minimal non-redundant exact and approximate association rules. Their union constitutes a basis for all association rules: They all can be deduced with their support and confidence (Bastide et al., 2000). The objective is to capture the essential knowledge in a minimal number of rules without information loss.
Algorithms for determining generators, frequent closed itemsets, and the min-max bases from frequent itemsets and their supports are presented in Pasquier et al. (2004).
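The sketch below is my own illustration of how the min-max exact basis could be assembled once generators and closures are known; it is not the algorithm of Pasquier et al. (2004). Generators are found here by brute force on the tiny example context, and the minsupport filter is omitted for brevity: one exact rule G => γ(G) - G is emitted for each generator whose closure differs from it.

```python
from itertools import chain, combinations

CONTEXT = [
    {"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
    {"B", "E"}, {"A", "B", "C", "E"}, {"B", "C", "E"},
]
ITEMS = sorted(set().union(*CONTEXT))

def closure(itemset):
    """Intersection of all objects containing the itemset."""
    covering = [o for o in CONTEXT if set(itemset) <= o] or [set(ITEMS)]
    result = set(ITEMS)
    for o in covering:
        result &= o
    return frozenset(result)

def powerset(items):
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

def min_max_exact_basis():
    """One exact rule G => closure(G) - G per generator G with closure(G) != G.
    A generator is a minimal itemset for its closure (brute force here)."""
    rules = []
    subsets = [frozenset(s) for s in powerset(ITEMS)]
    for g in subsets:
        c = closure(g)
        if c == g:
            continue  # a closed generator yields an empty consequent, hence no rule
        if all(closure(s) != c for s in subsets if s < g):
            rules.append((g, c - g))
    return rules

if __name__ == "__main__":
    for antecedent, consequent in min_max_exact_basis():
        print("".join(sorted(antecedent)), "=>", "".join(sorted(consequent)))
```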
Comparing Sizes of Association Rule Sets

Results of experiments conducted on both synthetic and operational datasets show that the generation of the bases can reduce substantially the number of rules. For weakly correlated data, very few exact rules are extracted, and the reduction for approximate rules is in the order of five for both the min-max and the Luxenburger bases.
For correlated data, the Duquenne-Guigues basis reduces exact rules to a few tens; for the min-max exact basis, the reduction factor is about some tens. For approximate association rules, both the Luxenburger and the min-max bases reduce the number of rules by a factor of some hundreds.
Even if the number of rules can be reduced from several million to a few hundred or a few thousand, visualization tools such as templates and/or generalization tools such as taxonomies are required to explore so many rules.

FUTURE TRENDS

Most recent research on association rule extraction concerns applications to natural phenomena modeling, gene expression analysis (Creighton & Hanash, 2003), biomedical engineering (Gao Cong et al., 2003), and geospatial, telecommunications, Web and semi-structured data analysis (Han et al., 2002). These applications most often require extending existing methods; for instance, to extract only rules with low support and high confidence in semi-structured (Cohen et al., 2001) or medical data (Ordonez et al., 2001), to extract temporal association rules in Web data (Yang & Parthasarathy, 2002), or to extract adaptive sequential association rules in long-term medical observation data (Brisson et al., 2004). Frequent closed itemsets extraction also is applied as a conceptual analysis technique to explore biological (Pfaltz & Taylor, 2002) and medical data (Cremilleux, Soulet & Rioult, 2003).
These domains are promising fields of application for association rules and frequent closed itemsets-based techniques, particularly in combination with other data-mining techniques, such as clustering and classification.

CONCLUSION

Next-generation data-mining systems should answer the analysts' requirements for high-level, ready-to-use knowledge that will be easier to exploit. This implies the integration of data-mining techniques in DBMS and domain-specific applications (Ansari et al., 2001). This integration should incorporate the use of knowledge visualization and exploration techniques, knowledge consolidation by cross-analysis of results of different techniques, and the incorporation of background knowledge, such as taxonomies or gene annotations for gene expression data, for example, in the process.


REFERENCES

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A.I. (1995). Fast discovery of association rules. In Advances in knowledge discovery and data mining. AAAI/MIT Press.

Ansari, S., Kohavi, R., Mason, L., & Zheng, Z. (2001). Integrating e-commerce and data mining: Architecture and challenges. Proceedings of the ICDM Conference.

Baralis, E., & Psaila, G. (1997). Designing templates for mining association rules. Journal of Intelligent Information Systems, 9(1), 7-32.

Bastide, Y., Pasquier, N., Taouil, R., Lakhal, L., & Stumme, G. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. Proceedings of the DOOD Conference.

Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., & Lakhal, L. (2000). Mining frequent closed itemsets with counting inference. SIGKDD Explorations, 2(2), 66-75.

Bayardo, R.J. (1998). Efficiently mining long patterns from databases. Proceedings of the SIGMOD Conference.

Bayardo, R.J., & Agrawal, R. (1999). Mining the most interesting rules. Proceedings of the KDD Conference.

Bayardo, R.J., Agrawal, R., & Gunopulos, D. (2000). Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3), 217-240.

Brisson, L., Pasquier, N., Hebert, C., & Collard, M. (2004). HASAR: Mining sequential association rules for atherosclerosis risk factor analysis. Proceedings of the PKDD Discovery Challenge.

Cohen, E. et al. (2001). Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 13(1), 64-78.

Creighton, C., & Hanash, S. (2003). Mining gene expression databases for association rules. Bioinformatics, 19(1), 79-86.

Cremilleux, B., Soulet, A., & Rioult, F. (2003). Mining the strongest emerging patterns characterizing patients affected by diseases due to atherosclerosis. Proceedings of the PKDD Discovery Challenge.

Cristofor, L., & Simovici, D.A. (2002). Generating an informative cover for association rules. Proceedings of the ICDM Conference.

El-Hajj, M., & Zaïane, O.R. (2004). COFI approach for mining frequent itemsets revisited. Proceedings of the SIGMOD/DMKD Workshop.

Gao Cong, F.P., Tung, A., Yang, J., & Zaki, M.J. (2003). CARPENTER: Finding closed patterns in long biological datasets. Proceedings of the KDD Conference.

Han, J., & Fu, Y. (1999). Mining multiple-level association rules in large databases. IEEE Transactions on Knowledge and Data Engineering, 11(5), 798-804.

Han, J., Pei, J., Yin, Y., & Mao, R. (2004). Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1), 53-87.

Han, J., Russ, B., Kumar, V., Mannila, H., & Pregibon, D. (2002). Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54-58.

Lin, D., & Kedem, Z.M. (1998). PINCER-SEARCH: A new algorithm for discovering the maximum frequent set. Proceedings of the EDBT Conference.

Ordonez, C. et al. (2001). Mining constrained association rules to predict heart disease. Proceedings of the ICDM Conference.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1998). Pruning closed itemset lattices for association rules. Proceedings of the BDA Conference.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999a). Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1), 25-46.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999b). Discovering frequent closed itemsets for association rules. Proceedings of the ICDT Conference.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999c). Closed set based discovery of small covers for association rules. Proceedings of the BDA Conference.

Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., & Lakhal, L. (2004). Generating a condensed representation for association rules. Journal of Intelligent Information Systems.

Pfaltz, J., & Taylor, C. (2002, July). Closed set mining of biological data. Proceedings of the KDD/BioKDD Conference.

Silverstein, C., Brin, S., & Motwani, R. (1998). Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1), 39-68.

Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., & Lakhal, L. (2002). Computing iceberg concept lattices with TITANIC. Data and Knowledge Engineering, 42(2), 189-222.


Wang, J., & Han, J. (2004). BIDE: Efficient mining of frequent closed sequences. Proceedings of the ICDE Conference.

Wang, J., Han, J., & Pei, J. (2003). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. Proceedings of the KDD Conference.

Yang, H., & Parthasarathy, S. (2002). On the use of constrained associations for Web log mining. Proceedings of the KDD/WebKDD Conference.

Zaki, M.J. (2000). Generating non-redundant association rules. Proceedings of the KDD Conference.

Zaki, M.J., & Hsiao, C.-J. (2002). CHARM: An efficient algorithm for closed itemset mining. Proceedings of the SIAM International Conference on Data Mining.

Zaki, M.J., & Ogihara, M. (1998). Theoretical foundations of association rules. Proceedings of the SIGMOD/DMKD Workshop.

KEY TERMS

Association Rules: An implication rule between two itemsets with statistical measures of range (support) and precision (confidence).

Basis for Association Rules: A set of association rules that is minimal with respect to some criteria and from which all association rules can be deduced with support and confidence.

Closed Itemset: An itemset that is a maximal set of items common to a set of objects. An itemset is closed if it is equal to the intersection of all objects containing it.

Frequent Itemset: An itemset contained in a number of objects at least equal to some user-defined threshold.

Itemset: A set of binary attributes, each corresponding to an attribute value or an interval of attribute values.


Mining Chat Discussions

Stanley Loh
Catholic University of Pelotas, Brazil, and Lutheran University of Brasil, Brazil

Daniel Licthnow
Catholic University of Pelotas, Brazil

Thyago Borges
Catholic University of Pelotas, Brazil

Tiago Primo
Catholic University of Pelotas, Brazil

Rodrigo Branco Kickhöfel
Catholic University of Pelotas, Brazil

Gabriel Simões
Catholic University of Pelotas, Brazil

Gustavo Piltcher
Catholic University of Pelotas, Brazil

Ramiro Saldaña
Catholic University of Pelotas, Brazil

INTRODUCTION

According to Nonaka and Takeuchi (1995), the majority of the organizational knowledge comes from interactions between people. People tend to reuse solutions from other persons in order to gain productivity.
When people communicate to exchange information or acquire knowledge, the process is named Collaboration. Collaboration is one of the most important tasks for innovation and competitive advantage within learning organizations (Senge, 2001). It is important to record knowledge for later reuse and analysis. If knowledge is not adequately recorded, organized and retrieved, the consequence is re-work, low productivity and loss of opportunities.
Collaboration may be realized through synchronous interactions (e.g., exchange of messages in a chat), asynchronous interactions (e.g., electronic mailing lists or forums), direct contact (e.g., two persons talking) or indirect contact (when someone stores knowledge and others can retrieve this knowledge in a remote place or time).
In particular, chat rooms are becoming important tools for collaboration among people and knowledge exchange. Intelligent software systems may be integrated into chat rooms in order to help people in this collaboration task. For example, systems can identify the theme being discussed and then offer new information, or they can remind people of existing information sources. This kind of system is named a recommender system.
Furthermore, chat sessions have implicit knowledge about what the participants know and how they are viewing the world. Analyzing chat discussions allows understanding what people are looking for and how people collaborate with each other. Intelligent software systems can analyze discussions in chats to extract knowledge about the group or about the subject being discussed.
Mining tools can analyze chat discussions to understand what is being discussed and help people. For example, a recommender system can analyze textual messages posted in a web chat, identify the subject of the discussion and then look for items stored in a Digital Library to recommend individually to each participant of the discussion. Items can be electronic documents, web pages and bibliographic references stored in a digital library, past discussions and authorities (people with expertise in the subject being discussed). Besides that, mining tools can analyze the whole discussion to map the knowledge exchanged among the chat participants. The benefits of such technology include supporting learning environments, knowledge management efforts within organizations, advertisement and support to decisions.


BACKGROUND

Some works have investigated the analysis of online discussions. Brutlag and Meek (2000) have studied the identification of themes in e-mails. The work compares the identification by analyzing only the subject of the e-mails against analyzing the message bodies. One conclusion is that e-mail headers perform as well as message bodies, with the additional advantage of reducing the number of features to be analyzed.
Busemann, Schmeier and Arens (2000) investigated the special case of messages registered in call centers. The work proved it possible to identify themes in this kind of message, despite the informality of the language used in the messages. This informality causes mistakes due to jargon, misspellings and grammatical inaccuracy.
The work of Durbin, Richter, and Warner (2003) has shown it possible to identify affective opinions about products and services in e-mails sent by customers, in order to alert responsible people or to evaluate the organization and customers' satisfaction. Furthermore, the work identifies the intensity of the rating, allowing the separation of moderate or intensive opinions.
Tong (2001) investigated the analysis of online discussions about movies. Messages represent comments about movies. This work proved it feasible to find positive and negative opinions, by analyzing key or cue words. Furthermore, the work also extracts information about the movies, like directors and actors, and then examines opinions about these particular characteristics.
The only work found in the scientific literature that analyzes chat messages is the one from Khan, Fisher, Shuler, Wu, and Pottenger (2002). They apply mining techniques over chat messages in order to find social interactions among people. The goal is to find who is related to whom inside a specific area, by analyzing the exchange of messages in a chat and the subject of the discussion.

MAIN THRUST

In the following, the chapter explains how messages can be mined, how recommendations can be made and how the whole discussion (an entire chat session) can be analyzed.

Identifying Themes in Chat Messages

To provide people with useful information during a collaboration session, the system has to identify what is being discussed. Textual messages sent by the users in the chat can be analyzed for this purpose. Texts can lead to the identification of the subject discussed because the words and the grammar present in the texts represent knowledge from people, expressed in written formats (Sowa, 2000).
An ontology or thesaurus can be used to help to identify cue words for each subject. The ontology or thesaurus has concepts of a domain or knowledge area, including relations between concepts and the terms used in written languages to express these concepts (Gilchrist, 2003). The ontology can be created by machine learning methods (supervised learning), where human experts select training cases for each subject (e.g., texts of positive and negative examples) and an intelligent software system identifies the keywords that define each subject. The TFIDF method from Salton and McGill (1983) is the most used in this kind of task.
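As an illustration of how such cue words might be weighted, the sketch below computes TF-IDF scores over a handful of made-up training messages. The training texts and subject labels are invented for the example; this is not the system described in the chapter.

```python
import math
from collections import Counter

# Invented training messages labelled with a subject (for illustration only).
TRAINING = {
    "databases": ["the query optimizer rewrites the sql query",
                  "indexes speed up query processing in the database"],
    "networks":  ["the router forwards each packet to the next hop",
                  "packet loss increases the latency of the network"],
}

def tfidf_keywords(training, top_n=3):
    """Return the top-n TF-IDF weighted terms for each subject.
    TF is the term frequency inside the subject's training texts;
    IDF penalizes terms that occur in many subjects."""
    docs = {subject: Counter(" ".join(texts).split())
            for subject, texts in training.items()}
    n_subjects = len(docs)
    keywords = {}
    for subject, counts in docs.items():
        scores = {}
        for term, tf in counts.items():
            df = sum(1 for other in docs.values() if term in other)
            scores[term] = tf * math.log(n_subjects / df)
        keywords[subject] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return keywords

if __name__ == "__main__":
    # Common words such as "the" appear in every subject and get a zero weight,
    # while subject-specific terms (e.g. "query", "packet") rank highest.
    print(tfidf_keywords(TRAINING))
```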
If the terms that compose the messages are treated as a bag of words (with no difference in importance), probabilistic techniques can be used to identify the subject. On the other hand, natural language processing techniques can identify syntactic elements and relations, thus supporting more precise subject identification.
The identification of themes should consider the context of the messages to determine whether the concept identified is really present in the discussion. A group of messages is better for inferring the subject than a single message. That avoids misunderstandings due to word ambiguity and the use of synonyms.

Making Recommendations in a Chat Discussion

A recommender system is software whose main goal is to aid in the social collaborative process of indicating or receiving indications (Resnick & Varian, 1997). Recommender systems are broadly used in electronic commerce for suggesting products or providing information about products and services, helping people to decide in the shopping process (Lawrence et al., 2001; Schafer et al., 2001). The offered gain is that people do not need to request a recommendation or to perform a query over an information base; the system decides what and when to suggest. The recommendation is usually based on user profiles and reuse of solutions.


When a subject is identified in a message, the recommender searches for items classified in this subject. Items can come from different databases. For example, a Digital Library may provide electronic documents, links to Web pages and bibliographic references. A profile database may contain information about people, including the interest areas of each person, as well as an associated degree, informing the user's knowledge level on the subject or how much is his/her competence in the area (his/her expertise). This can be used to indicate the most active user in the area or who is the authority in the subject.
A database of past discussions records everything that occurs in the chat, during every discussion session. Discussions may be stored by sessions, identified by date and themes discussed, and can include who participated in the session, all the messages exchanged (with a label indicating who sent it), the concept identified in each message, the recommendations made during the session for each user and documents downloaded or read during the session. Past discussions may be recommended during a chat session, reminding the participants that other similar discussions have already happened. This database also allows users to review the whole discussion later, after the session. The great benefit is that users do not re-discuss the same question.

Mining a Chat Session

Analyzing the themes discussed in a chat session can bring an important overview of the discussion and also of the subject. Statistical tools applied over the messages sent and the subjects identified in each message can help users to understand which were the themes most discussed. By counting the messages associated with each subject, it is possible to infer the central point of the discussion and the peripheral themes.
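A minimal sketch of this counting step follows (illustration only; the message subjects are invented): tally how many messages were classified under each subject and take the most frequent one as the central point.

```python
from collections import Counter

# Subjects already identified for each message of a chat session (invented data).
MESSAGE_SUBJECTS = [
    "databases", "databases", "networks", "databases",
    "data mining", "databases", "data mining", "networks",
]

def summarize_session(subjects):
    """Count messages per subject; the most frequent subject is the central
    point of the discussion, the remaining ones are peripheral themes."""
    counts = Counter(subjects)
    central, _ = counts.most_common(1)[0]
    peripheral = [s for s in counts if s != central]
    return counts, central, peripheral

if __name__ == "__main__":
    counts, central, peripheral = summarize_session(MESSAGE_SUBJECTS)
    print(dict(counts))   # message count per subject
    print(central)        # central point of the discussion
    print(peripheral)     # peripheral themes; their number indicates coverage
```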
The list of subjects identified during the chat session composes an interesting order, allowing users to analyze the path followed by the participants during the discussion. For example, it is possible to observe which was the central point of the discussion, whether the discussion deviated from the main subject and whether the subjects present in the beginning of the discussion were also present at the end. The coverage of the discussion may be identified by the number of different themes discussed. Furthermore, this analysis allows identifying the depth of the discussion, that is, whether more specific themes were discussed or whether the discussion occurred superficially at a higher conceptual level.
Analyzing the messages sent by every participant allows determining the degree of participation of each person in the discussion: who participated more and who did less. Furthermore, it is possible to observe which are the interesting areas for each person and in some way to determine the expertise of the group and of the participants (which are the areas where the group is more competent).
Association techniques can be used to identify correlations between themes or between themes and persons. For example, it is possible to find that some theme is always present when another theme is also present, or to find that every discussion where some person participated had a certain theme as the principal one.

FUTURE TRENDS

Recommender systems are still an emerging area. There are some doubts and open issues: for example, whether it is good or bad to recommend items already suggested in past discussions (re-recommend, as if reminding the person). Besides that, it is important to analyze the level of the participants in order to recommend only basic or advanced items.
Collaborative filtering techniques can be used to recommend items already seen by other users (Resnick et al., 1994; Terveen & Hill, 2001). Grouping people with similar characteristics allows for crossing of recommended items, for example, to offer documents read by one person to others.
In the same way, software systems can capture relevance feedback from users to narrow the list of recommendations. Users should read some items of the list and rate them, so that the system can use this information to eliminate items from the list or to reorder the items in a new ranking.
The context of the messages needs to be studied further. To infer the subject being discussed, the system can analyze a group of messages, but it is necessary to determine how many (a fixed number, or all messages sent in the past N minutes?).
An orthographic corrector is necessary to clean the messages posted to the chat. Many linguistic mistakes are expected, since people use chats in a hurry, with little attention to the language, without revisions and in an informal way. Furthermore, the text mining tools must analyze special signs like novel abbreviations, emoticons and slang expressions. Special words may be added to the domain ontology in order to accommodate these differences in the language.

CONCLUSION

An example of such a system discussed in this chapter is available at http://gpsi.ucpel.tche.br/sisrec. Currently, the system uses a domain ontology for computer science, but others can be used. Similarly, the current digital library only has items related to Computer Science.
The recommendation system facilitates organizational learning because people receive suggestions of information sources during online discussions. The main advantage of the system is to free the user of the burden of searching information sources during the online discussion. Users do not have to choose attributes or requirements from a menu of options in order to retrieve items from a database; the system decides when and what information to recommend to the user.


This proactive approach is useful for non-experienced users, who receive hints about what to read in a specific subject. Users' information needs are discovered naturally during the conversation.
Furthermore, when the system indicates people who are authorities in each subject, naïve users can meet these authorities to get more knowledge.
Another advantage of the system is that part of the knowledge shared in the discussion can be made explicit through the record of the discussion for future retrieval. Besides that, the system allows the posterior analysis of each discussion, presenting the subjects discussed, the messages exchanged, the items recommended and the order in which the subjects were discussed.
An important feature is the statistical analysis of the discussion, allowing understanding of the central point, the peripheral themes, the order of the discussion, its coverage and depth.
The benefit of mining chat sessions is of special interest for Knowledge Management efforts. Organizations can store tacit knowledge formatted as discussions. The discussions can be retrieved, so that knowledge can be reused. In the same way, the contents of a Digital Library (or Organizational Memory) can be better used through recommendations. People do not have to search for contents nor to remember items in order to suggest them to others. Recommendations play this role in a proactive way, examining what people are discussing and users' profiles, and selecting interesting new contents.
In particular, such systems (that mine chat sessions) can be used in e-learning environments, supporting the construction of knowledge by individuals or groups. Recommendations help the learning process, suggesting complementary contents (documents and sites stored in the Digital Library). Recommendations also include authorities in the topics being discussed, that is, people with high degrees of knowledge.

ACKNOWLEDGMENTS

This research group is partially supported by CNPq, an entity of the Brazilian government for scientific and technological development.

REFERENCES

Brutlag, J.D., & Meek, C. (2000). Challenges of the email domain for text classification. In Proceedings of the 7th International Conference on Machine Learning (ICML 2000) (pp. 103-110), Stanford University, Stanford, CA, USA.

Busemann, S., Schmeier, S., & Arens, R.G. (2000). Message classification in the call center. In Proceedings of the Applied Natural Language Processing Conference ANLP2000 (pp. 159-165), Seattle, WA.

Durbin, S.D., Richter, J.N., & Warner, D. (2003). A system for affective rating of texts. In Proceedings of the 3rd Workshop on Operational Text Classification, 9th ACM International Conference on Knowledge Discovery and Data Mining (KDD-2003), Washington, DC.

Khan, F.M., Fisher, T.A., Shuler, L., Wu, T., & Pottenger, W.M. (2002). Mining chat-room conversations for social and semantic interactions. Technical Report LU-CSE-02-011, Lehigh University, Bethlehem, Pennsylvania, USA.

Gilchrist, A. (2003). Thesauri, taxonomies and ontologies: An etymological note. Journal of Documentation, 59(1), 7-18.

Lawrence, R.D. et al. (2001). Personalization of supermarket product recommendations. Journal of Data Mining and Knowledge Discovery, 5(1/2), 11-32.

Nonaka, I., & Takeuchi, T. (1995). The knowledge-creating company: How Japanese companies create the dynamics of innovation. Cambridge: Oxford University Press.

Resnick, P. et al. (1994). GroupLens: An open architecture for collaborative filtering of Netnews. In Proceedings of the Conference on Computer Supported Cooperative Work (pp. 175-186).

Resnick, P., & Varian, H. (1997). Recommender systems. Communications of the ACM, 40(3), 56-58.

Salton, G., & McGill, M.J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.

Schafer, J.B. et al. (2001). E-commerce recommendation applications. Journal of Data Mining and Knowledge Discovery, 5(1/2), 115-153.

Senge, P.M. (2001). The fifth discipline: The art and practice of the learning organization (9th ed.). São Paulo: Best Seller (in Portuguese).

Sowa, J.F. (2000). Knowledge representation: Logical, philosophical, and computational foundations. Pacific Grove, CA: Brooks/Cole Publishing Co.

Terveen, L., & Hill, W. (2001). Human-computer collaboration in recommender systems. In J. Carroll (Ed.), Human computer interaction in the new millennium. Boston: Addison-Wesley.

Tong, R. (2001). Detecting and tracking opinions in online discussions. In Proceedings of the Workshop on Operational Text Classification, SIGIR, New Orleans, Louisiana, USA.


KEY TERMS

Chat: A software system that enables real-time communication among users through the exchange of textual messages.

Collaboration: The process of communication among people with the goal of sharing information and knowledge.

Digital Library: A set of electronic resources (usually documents) combined with a software system which allows storing, organizing and retrieving the resources.

Knowledge Management: Systems and methods for storing, organizing and retrieving explicit knowledge.

Mining: The application of statistical techniques to infer implicit patterns or rules in a collection of data, in order to discover new and useful knowledge.

Ontology: A formal and explicit definition of concepts (classes or categories) and their attributes and relations.

Recommendations: Results of the process of providing useful resources to a user, like products, services or information.

Recommender System: A software system that makes recommendations to a user, usually analyzing the user's interest or need.

Text Mining: The process of discovering new information by analyzing textual collections.


Mining Data with Group Theoretical Means

Gabriele Kern-Isberner
University of Dortmund, Germany

INTRODUCTION

Knowledge discovery refers to the process of extracting new, interesting, and useful knowledge from data and presenting it in an intelligible way to the user. Roughly, knowledge discovery can be considered a three-step process: preprocessing data; data mining, in which the actual exploratory work is done; and interpreting the results to the user. Here, I focus on the data-mining step, assuming that a suitable set of data has been chosen properly.
The patterns that we search for in the data are plausible relationships, which agents may use to establish cognitive links for reasoning. Such plausible relationships can be expressed via association rules. Usually, the criteria to judge the relevance of such rules are either frequency based (Bayardo & Agrawal, 1999) or causality based (for Bayesian networks, see Spirtes, Glymour, & Scheines, 1993). Here, I will pursue a different approach that aims at extracting what can be regarded as structures of knowledge relationships that may support the inductive reasoning of agents and whose relevance is founded on information theory. The method that I will sketch in this article takes numerical relationships found in data and interprets these relationships as structural ones, using mostly algebraic techniques to elaborate structural information.

BACKGROUND

Common sense and expert knowledge is most generally expressed by rules, connecting a precondition and a conclusion by an if-then construction. For example, you avoid puddles on sidewalks because you are aware of the fact that if you step into a puddle, then your feet might get wet; similarly, a physician would likely expect a patient showing the symptoms of fever, headache, and a sore throat to suffer from a flu, basing his diagnosis on the rule that if a patient has a fever, headache, and sore throat, then the ailment is a flu, equipped with a sufficiently high probability.
If-then rules are more formally denoted as conditionals. The crucial point with conditionals is that they carry generic knowledge that is applicable to different situations. This fact makes them most interesting objects in artificial intelligence, in a theoretical as well as in a practical respect. For instance, a sales assistant who has a general knowledge about the preferences of his or her customers can use this knowledge when consulting any new customer.
Typically, two central problems have to be solved in practical applications: First, where do the rules come from? How can they be extracted from statistical data? And second, how should rules be represented? How should conditional knowledge be propagated and combined for further inferences? Both of these problems can be dealt with separately, but it is most rewarding to combine them, that is, to discover rules that are most relevant with respect to some inductive inference formalism and to build up the best model from the discovered rules that can be used for queries.

MAIN THRUST

This article presents an approach to discover association rules that are most relevant with respect to the maximum entropy methods. Because entropy is related to information, this approach can be considered as aiming to find the most informative rules in data. The basic idea is to exploit numerical relationships that are observed by comparing (relative) frequencies, or ratios of frequencies, and so forth, as manifestations of interactions of underlying conditional knowledge.
My approach differs from usual knowledge discovery and data-mining methods in various respects:

• It explicitly takes the instrument of inductive inference into consideration.
• It is based on statistical information but not on probabilities close to 1; actually, it mostly uses only structural information obtained from the data.
• It is not based on observing conditional independencies (as for learning causal structures), but aims at learning relevant conditional dependencies in a nonheuristic way.
• As a further novelty, it does not compute single, isolated rules, but yields a set of rules by taking into account highly complex interactions of rules.
• Zero probabilities computed from data are interpreted as missing information, not as certain knowledge.


The resulting set of rules may serve as a basis for maximum entropy inference. Therefore, the method described in this article addresses minimality aspects, as in Padmanabhan and Tuzhilin (2000), and makes use of inference mechanisms, as in Cristofor and Simovici (2002). Different from most approaches, however, it exploits the inferential power of the maximum entropy methods in full consequence and in a structural, nonheuristic way.

Modelling Conditional Knowledge by Maximum Entropy (ME)

Suppose a set R* = {(B1|A1)[x1], ..., (Bn|An)[xn]} of probabilistic conditionals is given. For instance, R* may describe the knowledge available to a physician when he has to make a diagnosis. Or R* may express common sense knowledge, such as "Students are young with a probability of (about) 80%" and "Singles (i.e., unmarried people) are young with a probability of (about) 70%", the latter knowledge being formally expressed by R* = { (young|student)[0.8], (young|single)[0.7] }.
Usually, these rule bases represent incomplete knowledge, in that a lot of probability distributions are apt to represent them. So learning, or inductively representing the rules, respectively, means to take them as a set of conditional constraints and to select a unique probability distribution as the best model that can be used for queries and further inferences. Paris (1994) investigates several inductive representation techniques in a probabilistic framework and proves that the principle of maximum entropy (ME-principle) yields the only method to represent incomplete knowledge in an unbiased way, satisfying a set of postulates describing sound common sense reasoning. The entropy H(P) of a probability distribution P is defined as

H(P) = - Σ_w P(w) log P(w),

where the sum is taken over all possible worlds w, and measures the amount of indeterminateness inherent to P. Applying the principle of maximum entropy, then, means to select the unique distribution P* = ME(R*) that maximizes H(P) among all distributions P that satisfy the rules in R*. In this way, the ME-method ensures that no further information is added, so the knowledge R* is represented most faithfully.
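To make the ME-principle tangible, the following sketch numerically maximizes H(P) over the four worlds spanned by the variables student and young, subject to the single conditional constraint (young|student)[0.8]. This is my own illustration using scipy; it is not one of the ME-systems referenced in this article, and the variable names are chosen for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Worlds over (student, young):
# index 0: student & young, 1: student & not young,
#       2: not student & young, 3: not student & not young.
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)           # avoid log(0)
    return float(np.sum(p * np.log(p)))  # this is -H(P)

constraints = [
    # probabilities must sum to 1
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
    # P(young | student) = 0.8  <=>  P(y,s) = 0.8 * (P(y,s) + P(~y,s))
    {"type": "eq", "fun": lambda p: p[0] - 0.8 * (p[0] + p[1])},
]

result = minimize(
    neg_entropy,
    x0=np.full(4, 0.25),                 # start from the uniform distribution
    bounds=[(0.0, 1.0)] * 4,
    constraints=constraints,
    method="SLSQP",
)

p = result.x
print(np.round(p, 3))                                          # ME-distribution
print("P(young | student) =", round(p[0] / (p[0] + p[1]), 3))  # approx. 0.8
print("H(P) =", round(-neg_entropy(p), 3))
```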
Indeed, the ME-principle provides a most convenient and well-founded method to represent incomplete probabilistic knowledge (efficient implementations of ME-systems are described in Roedder & Kern-Isberner, 2003). In an ME-environment, the expert has to list only whatever relevant conditional probabilities he or she is aware of. Furthermore, ME-modelling preserves the generic nature of conditionals by minimizing the amount of information being added, as shown in Kern-Isberner (2001).
Nevertheless, modelling ME-rule bases has to be done carefully so as to ensure that all relevant dependencies are taken into account. This task can be difficult and troublesome. Usually, the modelling rules are based somehow on statistical data. So, a method to compute rule sets appropriate for ME-modelling from statistical data is urgently needed.

Structures of Knowledge

The most typical approach to discover interesting rules from data is to look for rules with a significantly high (conditional) probability and a concise antecedent (Bayardo & Agrawal, 1999; Agarwal, Aggarwal, & Prasad, 2000; Fayyad & Uthurusamy, 2002; Coenen, Goulbourne, & Leng, 2001). Basing relevance on frequencies, however, is sometimes unsatisfactory and inadequate, particularly in complex domains such as medicine. Further criteria to measure the interestingness of the rules or to exclude redundant rules have also been brought forth (Jaroszewicz & Simovici, 2001; Bastide, Pasquier, Taouil, Stumme, & Lakhal, 2000; Zaki, 2000). Some of these algorithms also make use of optimization criteria that are based on entropy (Jaroszewicz & Simovici, 2002).
Mostly, the rules are considered as isolated pieces of knowledge; no interaction between rules can be taken into account. In order to obtain more structured information, one often searches for causal relationships by investigating conditional independencies and thus noninteractivity between sets of variables (Spirtes et al., 1993).
Although causality is undoubtedly most important for human understanding, the concept seems to be too rigid to represent human knowledge in an exhaustive way. For instance, a person suffering from a flu is certainly sick (P(sick | flu) = 1), and he or she often will complain about headaches (P(headache | flu) = 0.9). Then you have P(headache | flu) = P(headache | flu & sick), but you would surely expect that P(headache | not flu) is different from P(headache | not flu & sick)! Although the first equality suggests a conditional independence between sick and headache, due to the causal dependency between headache and flu, the second inequality shows this to be (of course) false. Furthermore, a physician might also state some conditional probability involving sickness and headache, so you obtain a complex network of rules. Each of these rules will be considered relevant by the expert, but none will be found when searching for conditional independencies!


So what, exactly, are the structures of knowledge by which conditional dependencies (not independencies! See also Simovici, Cristofor, D., & Cristofor, L., 2000) manifest themselves in data?
To answer this question, the theory of conditional structures has been presented in Kern-Isberner (2000). Conditional structures are an algebraic means to make the effects of conditionals on possible worlds (i.e., possible combinations or situations) transparent, in that they reflect whether the corresponding world verifies the conditional or falsifies it, or whether the conditional cannot be applied to the world because the if-condition is not satisfied. Consider, for instance, the conditional "If you step in a puddle, then your feet might get wet." In a particular situation, the conditional is applicable (you actually step into a puddle) or not (you simply walk around it), and it can be found verified (you step in a puddle and indeed, your feet get wet) or falsified (you step in a puddle, but your feet remain dry because you are wearing rain boots).
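These three cases can be written down directly. The sketch below is my own illustration (not the formalism of Kern-Isberner, 2000): it classifies each world as verifying (+), falsifying (-), or not applying to (0) a conditional (B|A), here for the puddle example.

```python
def evaluate(world, conditional):
    """Classify a world with respect to a conditional (B|A):
    '+' the world verifies it (A and B hold),
    '-' the world falsifies it (A holds, B does not),
    '0' the conditional is not applicable (A does not hold)."""
    antecedent, consequent = conditional
    if not world[antecedent]:
        return "0"
    return "+" if world[consequent] else "-"

# Conditional: if you step in a puddle (p), then your feet get wet (w).
conditional = ("p", "w")   # read as (w | p): antecedent p, consequent w

worlds = [
    {"p": True,  "w": True},    # stepped in, feet wet      -> verifies (+)
    {"p": True,  "w": False},   # stepped in, rain boots on -> falsifies (-)
    {"p": False, "w": False},   # walked around the puddle  -> not applicable (0)
]
for world in worlds:
    print(world, evaluate(world, conditional))
```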
This intuitive idea of considering a conditional as a So the basic idea of this algorithm is to start with
three-valued event is generalized in Kern-Isberner (2000) long rules and to shorten them in accordance with the
to handle the simultaneous impacts of a set of condition- probabilistic information provided by P without losing
als by using algebraic symbols for positive and negative information.
impact, respectively. Then for each world, a word of Group theory actually provides an elegant frame-
these symbols can be computed, which shows immedi- work, on the one hand, to disentangle highly complex
ately how the conditionals interact on this world. The conditional interactions in a systematic way, and on the
proper mathematical structure for building words are other hand, to make operations on the conditionals
(semi)groups, and indeed, group theory provides the basis computable, which is necessary to make information
for connecting numerical to structural information in an more concise.
elegant way. In short, a probability (or frequency) distri-
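The ME-representation step that this discovery procedure inverts is itself easy to prototype. The following sketch is my own illustration rather than the implementation referenced above (efficient ME-systems use dedicated iterative scaling procedures); the toy attributes, the two rules, and the use of a general-purpose optimizer are assumptions made only for the example. It computes ME(R*) for a small rule set R* by maximizing H(P) over all distributions on the possible worlds that satisfy the probabilistic conditionals (B|A)[x].

# Minimal sketch (not the author's implementation) of the ME-principle:
# complete a set of probabilistic conditionals (B|A)[x] to the unique
# maximum-entropy distribution over all possible worlds.
import itertools
import numpy as np
from scipy.optimize import minimize

ATTRS = ["flu", "sick", "headache"]          # hypothetical toy domain
WORLDS = list(itertools.product([0, 1], repeat=len(ATTRS)))

def holds(world, cond):
    # cond maps attribute name -> required value; empty dict = tautology.
    return all(world[ATTRS.index(a)] == v for a, v in cond.items())

# Probabilistic conditionals (B|A)[x], e.g. P(headache | flu) = 0.9 (invented numbers)
RULES = [({"flu": 1}, {"headache": 1}, 0.9),
         ({}, {"flu": 1}, 0.2)]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))          # minimized, so H(P) is maximized

constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
for A, B, x in RULES:
    a = np.array([holds(w, A) for w in WORLDS], dtype=float)
    ab = np.array([holds(w, {**A, **B}) for w in WORLDS], dtype=float)
    # P(A and B) - x * P(A) = 0 encodes the conditional probability constraint.
    constraints.append({"type": "eq",
                        "fun": lambda p, a=a, ab=ab, x=x: float(ab @ p - x * (a @ p))})

p0 = np.full(len(WORLDS), 1.0 / len(WORLDS))
result = minimize(neg_entropy, p0, method="SLSQP",
                  bounds=[(0.0, 1.0)] * len(WORLDS), constraints=constraints)
p_star = result.x        # ME(R*): maximum-entropy distribution satisfying the rules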
How to Handle Sparse Knowledge

The frequency distributions calculated from data are mostly not positive; just to the contrary, they would be sparse, full of zeros, with only scattered clusters of nonzero probabilities. This overload of zeros is also a problem with respect to knowledge representation, because a zero in such a frequency distribution often merely means that such a combination has not been recorded. The strict probabilistic interpretation of zero probabilities, however, is that such a combination does not exist, which does not seem to be adequate.

The method sketched in the preceding section is also able to deal with that problem in a particularly adequate way: The zero values in frequency distributions are taken to be unknown but equal probabilities, and this fact can be exploited by the algorithm. So they actually help to start with a tractable set B of rules right from the beginning (see also Kern-Isberner & Fisseler, 2004).

In summary, zeros occurring in the frequency distribution computed from data are considered as missing information, and in my algorithm, they are treated as non-knowledge without structure.


FUTURE TRENDS

Although by and large, the domain of knowledge discovery and data mining is dominated by statistical techniques and the problem of how to manage vast amounts of data, the increasing need for and popularity of human-machine interactions will make it necessary to search for more structural knowledge in data that can be used to support (humanlike) reasoning processes. The method described in this article offers an approach to realize this aim. The conditional relationships that my algorithm reveals can be considered as kind of cognitive links of an ideal agent, and the ME-technology takes the task of inductive reasoning to make use of this knowledge. Combined with clustering techniques in large databases, for example, it may turn out a useful method to discover relationships that go far beyond the results provided by other, more standard data-mining techniques.

CONCLUSION

In this article, I have developed a new method for discovering conditional dependencies from data. This method is based on information-theoretical concepts and group-theoretical techniques, considering knowledge discovery as an operation inverse to inductive knowledge representation. By investigating relationships between the numerical values of a probability distribution P, the effects of conditionals are analyzed and isolated, and conditionals are joined suitably so as to fit the knowledge structures inherent to P.

REFERENCES

Agarwal, R. C., Aggarwal, C. C., & Prasad, V. V. V. (2000). Depth first generation of long patterns. Proceedings of the Sixth ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 108-118).

Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., & Lakhal, L. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. Proceedings of the First International Conference on Computational Logic (pp. 972-986).

Bayardo, R. J., & Agrawal, R. (1999). Mining the most interesting rules. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Coenen, F., Goulbourne, G., & Leng, P. H. (2001). Computing association rules using partial totals. Proceedings of the Fifth European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 54-66).

Cristofor, L., & Simovici, D. (2002). Generating an informative cover for association rules. Proceedings of the IEEE International Conference on Data Mining (pp. 597-600).

Fayyad, U., & Uthurusamy, R. (2002). Evolving data mining into solutions for insights. Communications of the ACM, 45(8), 28-61.

Jaroszewicz, S., & Simovici, D. A. (2001). A general measure of rule interestingness. Proceedings of the Fifth European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 253-265).

Jaroszewicz, S., & Simovici, D. A. (2002). Pruning redundant association rules using maximum entropy principle. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining.

Kern-Isberner, G. (2000). Solving the inverse representation problem. Proceedings of the 14th European Conference on Artificial Intelligence (pp. 581-585).

Kern-Isberner, G. (2001). Conditionals in nonmonotonic reasoning and belief revision. Lecture Notes in Artificial Intelligence.

Kern-Isberner, G., & Fisseler, J. (2004). Knowledge discovery by reversing inductive knowledge representation. Proceedings of the Ninth International Conference on the Principles of Knowledge Representation and Reasoning.

Padmanabhan, B., & Tuzhilin, A. (2000). Small is beautiful: Discovering the minimal set of unexpected patterns. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 54-63).

Paris, J. B. (1994). The uncertain reasoner's companion: A mathematical perspective. Cambridge University Press.

Roedder, W., & Kern-Isberner, G. (2003). From information to probability: An axiomatic approach. International Journal of Intelligent Systems, 18(4), 383-403.

Simovici, D. A., Cristofor, D., & Cristofor, L. (2000). Mining for purity dependencies in databases (Tech. Rep. No. 00-2). Boston: University of Massachusetts.

Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction and search. Lecture Notes in Statistics, 81.


Zaki, M. J. (2000). Generating non-redundant association rules. Proceedings of the Sixth ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 34-43).

KEY TERMS

Conditional: The formal algebraic term for a rule that need not be strict, but also can be based on plausibility, probability, and so forth.

Conditional Independence: A generalization of plain statistical independence that allows you to take a context into account. Conditional independence is often associated with causal effects.

Conditional Structure: An algebraic expression that makes the effects of conditionals on possible worlds transparent and computable.

Entropy: Measures the indeterminateness inherent to a probability distribution and is dual to information.

Possible World: Corresponds to the statistical notion of an elementary event. Probabilities over possible worlds, however, have a more epistemic, subjective meaning, in that they are assumed to reflect an agent's knowledge.

Principle of Maximum Entropy: A method to complete incomplete probabilistic knowledge by minimizing the amount of information added.

Probabilistic Conditional: A conditional that is assigned a probability. To match the notation of conditional probabilities, a probabilistic conditional is written as (B|A)[x] with the meaning "If A holds, then B holds with probability x."


Mining E-Mail Data


Steffen Bickel
Humboldt-Universität zu Berlin, Germany

Tobias Scheffer
Humboldt-Universität zu Berlin, Germany

INTRODUCTION

E-mail has become one of the most important communication media for business and private purposes. Large amounts of past e-mail records reside on corporate servers and desktop clients. There is a huge potential for mining this data. E-mail filing and spam filtering are well-established e-mail mining tasks. E-mail filing addresses the assignment of incoming e-mails to predefined categories to support selective reading and organize large e-mail collections. First research on e-mail filing was conducted by Green and Edwards (1996) and Cohen (1996). Pantel and Lin (1998) and Sahami, Dumais, Heckerman, and Horvitz (1998) first published work on spam filtering. Here, the goal is to filter unsolicited messages. Recent research on e-mail mining addresses automatic e-mail answering (Bickel & Scheffer, 2004) and mining social networks from e-mail logs (Tyler, Wilkinson, & Huberman, 2004).

In Section Background we will categorize common e-mail mining tasks according to their objective, and give an overview of the research literature. Our Main Thrust Section addresses e-mail mining with the objective of supporting the message creation process. Finally, we discuss Future Trends and conclude.

BACKGROUND

There are two objectives for mining e-mail data: supporting communication and discovering hidden properties of communication networks.

Support of Communication

The problems of filing e-mails and filtering spam are text classification problems. Text classification is a well studied research area; a wide range of different methods is available. Most of the common text classification algorithms have been applied to the problem of e-mail classification and their performance has been compared in several studies. Because publishing an e-mail data set involves disclosure of private e-mails, there are only a small number of standard e-mail classification data sets. Since there is no study that compares large numbers of data sets, different classifiers and different types of extracted features, it is difficult to judge which text classifier performs best specifically for e-mail classification. Against this background we try to draw some conclusions on the question which is the best text classifier for e-mail. Cohen (1996) applies rule induction to the e-mail classification problem and Provost (1999) finds that Naïve Bayes outperforms rule induction for e-mail filing. Naïve Bayes classifiers are widely used for e-mail classification because of their simple implementation and low computation time (Pantel & Lin, 1998; Rennie, 2000; Sahami, Dumais, Heckerman, & Horvitz, 1998). Joachims (1997, 1998) shows that Support Vector Machines (SVMs) are superior to the Rocchio classifier and Naïve Bayes for many text classification problems. Drucker, Wu, and Vapnik (1999) compares SVM with boosting on decision trees. SVM and boosting show similar performance but SVM proves to be much faster and has a preferable distribution of errors.

The performance of an e-mail classifier is dependent on the extraction of appropriate features. Joachims (1998) shows that applying feature selection for text classification with SVM does not improve performance. Hence, using SVM one can bypass the expensive feature selection process and simply include all available features. Features that are typically used for e-mail classification include all tokens in the e-mail body and header in bag-of-words representation using TF- or TFIDF-weighting. HTML tags and single URL elements also provide useful information (Graham, 2003).
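As a concrete illustration of this feature representation, the sketch below is not taken from any of the studies cited here; the library choice and the two training messages are placeholders. It builds a TFIDF bag-of-words representation and trains a linear SVM on it:

# Minimal sketch of a TFIDF bag-of-words e-mail classifier with a linear SVM.
# The training texts and labels are invented; in practice they would be e-mail
# bodies/headers and "spam" / "non-spam" or folder names.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = ["cheap meds online, click here",
               "meeting moved to 3pm, see agenda attached"]
train_labels = ["spam", "non-spam"]

classifier = make_pipeline(TfidfVectorizer(lowercase=True), LinearSVC())
classifier.fit(train_texts, train_labels)

print(classifier.predict(["please review the attached agenda"]))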
Boykin and Roychowdhury (2004) propose a spam filtering method that is not based on text classification but on graph properties of message sub-graphs. All addresses that appear in the headers of the inbound mails are graph nodes; an edge is added between all pairs of addresses that jointly appear in at least one header. The resulting sub-graphs exhibit graph properties that differ significantly for spam and non-spam sub-graphs. Based on this finding black- and whitelists can be constructed for spam and non-spam addresses. While this idea is appealing, it should be noted that the approach is not immediately practical since most headers of spam e-mails do not


contain other spam recipients' addresses, and most senders' addresses are used only once.
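The graph construction itself is straightforward to sketch. The following toy example uses hypothetical addresses, and networkx is my choice rather than the authors'; it connects addresses that co-occur in a header and then inspects a simple per-component graph property of the kind on which such black- and whitelists could be based:

# Sketch of the co-recipient graph described by Boykin and Roychowdhury (2004):
# header addresses become nodes; addresses appearing in the same header are linked.
import itertools
import networkx as nx

headers = [["alice@a.org", "bob@b.org", "carol@c.org"],   # recipients of one mail
           ["alice@a.org", "bob@b.org"],
           ["x123@spam.example", "y456@spam.example"]]

g = nx.Graph()
for addresses in headers:
    g.add_nodes_from(addresses)
    g.add_edges_from(itertools.combinations(addresses, 2))

# Graph properties such as the clustering coefficient of a component can then be
# compared for spam-like and non-spam-like sub-graphs.
for component in nx.connected_components(g):
    sub = g.subgraph(component)
    print(sorted(component), nx.average_clustering(sub))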
Additionally, the semantic e-mail approach (McDowell, Etzioni, Halevy, & Levy, 2004) aims at supporting communication by allowing automatic e-mail processing and facilitating e-mail mining; it is the equivalent of semantic web for e-mail. The goal is to make e-mails human- and machine-understandable with a standardized set of e-mail processes. Each e-mail has to follow a standardized process definition that includes specific process relevant information. An example for a semantic e-mail process is meeting coordination. Here, the individual process tasks (corresponding to single e-mails) are issuing invitations and collecting responses. In order to work, semantic e-mail would require a global agreement on standardized semantic processes, special e-mail clients and training for all users. Additional mining tasks for support of communication are automatic e-mail answering and sentence completion. They are described in Section Main Thrust.

Discovering Hidden Properties of Communication Networks

E-mail communication patterns reveal much information about hidden social relationships within organizations. Conclusions about informal communities and informal leadership can be drawn from e-mail graphs. Differences between informal and formal structures in business organizations can provide clues for improvement of formal structures which may lead to enhanced productivity. In the case of terrorist networks, the identification of communities and potential leaders is obviously helpful as well. Additional potential applications lie in marketing, where companies, especially communication providers, can target communities as a whole.

In social science, it is common practice for studies on electronic communication within organizations to derive the network structure by means of personal interviews or surveys (Garton, Haythornthwaite, & Wellman, 1997; Hinds & Kiesler, 1995). For large organizations, this is not feasible. Building communication graphs from e-mail logs is a very simple and accurate alternative provided that the data is available. Tyler, Wilkinson, and Huberman (2004) derive a network structure from e-mail logs and apply a divisive clustering algorithm that decomposes the graph into communities. Tyler, Wilkinson, and Huberman verify the resulting communities by interviewing the communication participants; they find that the derived communities correspond to informal communities. Tyler et al. also apply a force-directed spring algorithm (Fruchterman & Rheingold, 1991) to identify leadership hierarchies. They find that with increasing distance of vertices from the spring (center) there is a tendency of decreasing real hierarchy depth.

E-mail graphs can also be used for controlling virus attacks. Ebel, Mielsch, and Bornholdt (2002) show that vertex degrees of e-mail graphs are governed by power laws. By equipping the small number of highly connected nodes with anti-virus software the spreading of viruses can be prevented easily.
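A minimal sketch of this kind of analysis is given below; the log entries are invented, and the Girvan-Newman routine merely stands in for the divisive, betweenness-based procedure used by Tyler, Wilkinson, and Huberman:

# Sketch: derive a communication graph from an e-mail log and split it into
# communities with a divisive (edge-removal) algorithm.
import networkx as nx
from networkx.algorithms.community import girvan_newman

log = [("ann", "ben"), ("ann", "cid"), ("ben", "cid"),     # (sender, recipient)
       ("dan", "eva"), ("eva", "fay"), ("dan", "fay"),
       ("cid", "dan")]

g = nx.Graph()
for sender, recipient in log:
    if g.has_edge(sender, recipient):
        g[sender][recipient]["weight"] += 1
    else:
        g.add_edge(sender, recipient, weight=1)

# First split produced by iteratively removing high-betweenness edges.
first_split = next(girvan_newman(g))
print([sorted(c) for c in first_split])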
MAIN THRUST

In the last section we categorized e-mail mining tasks regarding their objective and gave a short explanation on the single tasks. We will now focus on the ones that we consider to be most interesting and potentially most beneficial for users and describe them in greater detail. These tasks aim at supporting the message creation process. Many e-mail management systems allow the definition of message templates that simplify the message creation for recurring topics. This is a first step towards supporting the message creation process, but past e-mails that are available for mining are disregarded. We describe two approaches for supporting the message creation process by mining historic data: mining question-answer pairs and mining sentences.

Mining Question-Answer Pairs

We consider the problem of learning to answer incoming e-mails from records of past communication. We focus on environments in which large amounts of similar answers to frequently asked questions are sent, such as call centers or customer support departments. In these environments, it is possible to manually identify equivalence classes of answers in the records of outbound communication. Each class then corresponds to a set of semantically equivalent answers sent in the past; it depends strongly on the application context which fraction of the outbound communication falls into such classes. Mapping inbound messages to one of the equivalence classes of answers is now a multi-class text classification problem that can be solved with text classifiers.

This procedure requires a user to manually group previously sent answers into equivalence classes which can then serve as class labels for training a classifier. This substantial manual labeling effort reduces the benefit of the approach. Even though it can be reduced by employing semi-supervised learning (Nigam, McCallum, Thrun, & Mitchell, 2000; Scheffer, 2004), it would still be much preferable to learn from only the available data: stored inbound and outbound messages. Bickel and Scheffer (2004) discuss an algorithm that learns to answer


questions from only the available data and does not require additional manual labeling. The key idea is to replace the manual assignment of outbound messages to equivalence classes by a clustering step.

The algorithms for training (learning from message pairs) and answering a new question are shown in Table 1. In the training phase, a clustering algorithm identifies groups of similar outbound messages. Each cluster then serves as class label; the corresponding questions which have been answered by a member of the cluster are used as training examples for a multi-class text classifier. The medoid of each cluster (the outbound message closest to the center) is used as an answer template. The classifier maps a newly incoming question to one of the clusters; this cluster's medoid is then proposed as answer to the question. Depending on the user interface, high confidence messages might be answered automatically, or an answer is proposed which the user may then accept, modify, or reject (Scheffer, 2004).

Table 1. Algorithms for learning from message pairs and answering new questions

Learning from message pairs.
Input: Message pairs, variance threshold, pruning parameter.
1. Recursively cluster answers of message pairs with a bisecting partitioning cluster algorithm; end the recursion when the cluster variance lies below the variance threshold.
2. Prune all clusters with fewer elements than the pruning parameter. Combine all pruned clusters into one miscellaneous cluster. Let n be the number of resulting clusters.
3. For all n clusters:
   a. Construct an answer template by choosing the answer that is most similar to the centroid of this cluster in vector space representation and remove the salutation line.
   b. Let the inbound mails that have been answered by a mail in the current cluster be the positive training examples for this answer class.
4. Train an SVM classifier that classifies an inbound message into one of the n answer classes or the miscellaneous class from these training examples. Return this classifier.

Answering new questions.
Input: New question message, message answering hypothesis, confidence threshold.
1. Classify the new message into one of the n answer classes and remember the SVM decision function value.
2. If the confidence exceeds the confidence threshold, propose the answer template that corresponds to the classification result. Perform instantiation operations that typically include formulating a salutation line.

The approach can be extended in many ways. Multiple topics in a question can be identified to mix different corresponding answer templates and generate a multi-topic answer. Question specific information can be extracted in an additional information extraction step and automatically inserted into answer templates. In this extraction step also customer identifications can be extracted and used for a database lookup that provides customer and order specific information for generating more customized answers.

Bickel and Scheffer (2004) analyze the relationship of answer classes regarding the separability of the corresponding questions using e-mails sent by the service department of an online shop. By analyzing this relationship one can draw conclusions about the amount of additional information that is needed for answering specific types of questions. This information can be visualized in an inseparability graph, where each class of equivalent answers is represented by a vertex, and an edge is drawn when a classifier that discriminates between these classes achieves only a low AUC performance (the AUC performance is the probability that, when a positive and a negative example are drawn at random, a discriminator assigns a higher value to the positive than to the negative one). Typical examples of inseparable answers are "your order has been shipped this morning" and "your order will be shipped tomorrow." Intuitively, it is not possible to predict which of these answers a service employee will send, based on only the question "when will I receive my shipment?"
Mining Sentences

The message creation process can also be supported on a sentence level. Given an incomplete sentence, the task of sentence completion is to propose parts or the total rest of the current sentence, based on an application specific document collection. A sentence completion user interface can, for instance, display a proposed completion in a micro window and insert the proposed text when the user presses the tab key.

The sentence completion problem poses new challenges for data mining and information retrieval, including the problem of finding sentences whose initial fragment is similar to a given fragment in a very large text corpus. To this end, Grabski and Scheffer (2004) provide a retrieval algorithm that uses a special inverted indexing structure to find the sentence whose initial fragment is most similar to a given fragment, where similarity is


defined in terms of the greatest cosine similarity of the TFIDF vectors. In addition, they study an approach that compresses the data further by identifying clusters of the most frequently used similar sets of sentences. In order to evaluate the accuracy of sentence completion algorithms, Grabski and Scheffer (2004) measure how frequently the algorithm, when given a sentence fragment drawn from a corpus, provides a prediction with confidence above the threshold, and how frequently this prediction is semantically equivalent to the actual sentence in the corpus. They find that for the sentence mining problem higher precision and recall values can be obtained than for the problem of mining question answer pairs; depending on the threshold and the fragment length, precision values of between 80% and 100% and recall values of about 40% can be observed.
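The retrieval step can be approximated with standard tools, as in the following sketch; the corpus is invented, and the brute-force TFIDF comparison over sentence prefixes merely stands in for the inverted indexing structure of Grabski and Scheffer (2004):

# Sketch: propose the completion of the corpus sentence whose initial fragment
# is most similar (cosine similarity of TFIDF vectors) to the typed fragment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["please find the requested documents attached to this message",
          "please find enclosed our current price list",
          "we apologize for the delay in answering your request"]

prefix_len = 4   # treat the first few words of every sentence as its initial fragment
prefixes = [" ".join(s.split()[:prefix_len]) for s in corpus]

vectorizer = TfidfVectorizer().fit(prefixes)
prefix_matrix = vectorizer.transform(prefixes)

def complete(fragment, threshold=0.3):
    sims = cosine_similarity(vectorizer.transform([fragment]), prefix_matrix)[0]
    best = sims.argmax()
    # Only propose a completion if the confidence (similarity) is high enough.
    return corpus[best] if sims[best] >= threshold else None

print(complete("please find the"))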
FUTURE TRENDS

Spam filtering and e-mail filing based on message text can be reduced to the well studied problem of text classification. The challenges that e-mail classification faces today concern technical aspects, the extraction of spam-specific features from e-mails, and an arms race between spam filters and spam senders adapting to known filters. By comparison, research in the area of automatic e-mail answering and sentence completion is in an earlier stage; we see a substantial potential for algorithmic improvements to the existing methods. The technical integration of these approaches into existing e-mail clients or call-center automation software provides an additional challenge. Some of these technical challenges have to be addressed before mining algorithms that aim at supporting communication can be evaluated under realistic conditions.

Construction of social network graphs from e-mail logs is much easier than by surveys and there is a huge interest in mining social networks; see, for instance, the DARPA program on Evidence Extraction and Link Discovery (EELD). While social networks have been studied intensely in the social sciences and in physics, we see a considerable potential for new and better mining algorithms for social networks that computer scientists can contribute.

CONCLUSION

Some methods that can form the basis for effective spam filtering have reached maturity (text classification), additional foundations are being worked on (social network analysis). Today, technical challenges dominate the development of spam filters. The development of methods that support and automate communication processes is a research topic and first solutions to some of the problems involved have been studied. Mining social networks from e-mail logs is a new challenge; research on this topic in computer science is in an early stage.

ACKNOWLEDGMENT

The authors are supported by the German Science Foundation DFG under grant SCHE540/10-1. We would like to thank the anonymous reviewers.

REFERENCES

Bickel, S., & Scheffer, T. (2004). Learning from message pairs for automatic email answering. Proceedings of the European Conference on Machine Learning.

Boykin, P., & Roychowdhury, V. (2004). Personal e-mail networks: An effective anti-spam tool. Preprint, arXiv id 0402143.

Cohen, W. (1996). Learning rules that classify e-mail. Proceedings of the IEEE Spring Symposium on Machine Learning for Information Access, Palo Alto, California, USA.

Drucker, H., Wu, D., & Vapnik, V. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), 1048-1055.

Ebel, H., Mielsch, L., & Bornholdt, S. (2002). Scale-free topology of e-mail networks. Physical Review, E 66.

Fruchterman, T. M., & Rheingold, E. M. (1991). Force-directed placement. Software Experience and Practice, 21(11).

Garton, L., Haythornthwaite, C., & Wellman, B. (1997). Studying online social networks. Journal of Computer-Mediated Communication, 3(1).

Grabski, K., & Scheffer, T. (2004). Sentence completion. Proceedings of the SIGIR International Conference on Information Retrieval, Sheffield, UK.

Graham, P. (2003). Better Bayesian filtering. Proceedings of the First Annual Spam Conference, MIT. Retrieved from http://www.paulgraham.com/better.html

Green, C., & Edwards, P. (1996). Using machine learning to enhance software tools for internet information management. Proceedings of the AAAI Workshop on Internet Information Management.

Hinds, P., & Kiesler, S. (1995). Communication across boundaries: Work, structure, and use of communication


technologies in a large organization. Organization Science, 6(4), 373-393.

Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Proceedings of the International Conference on Machine Learning.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of the European Conference on Machine Learning.

McDowell, L., Etzioni, O., Halevy, A., & Levy, H. (2004). Semantic e-mail. Proceedings of the WWW Conference.

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3).

Pantel, P., & Lin, D. (1998). Spamcop: A spam classification and organization program. Proceedings of the AAAI Workshop on Learning for Text Categorization.

Provost, J. (1999). Naïve Bayes vs. rule-learning in classification of e-mail. Technical Report AI-TR-99-284, University of Texas at Austin.

Rennie, J. (2000). iFILE: An application of machine learning to e-mail filtering. Proceedings of the SIGKDD Text Mining Workshop.

Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. Proceedings of AAAI Workshop on Learning for Text Categorization.

Scheffer, T. (2004). E-mail answering assistance by semi-supervised text classification. Intelligent Data Analysis, 8(5).

Tyler, J. R., Wilkinson, D. M., & Huberman, B. A. (2003). E-mail as spectroscopy: Automated discovery of community structure within organizations. Proceedings of the International Conference on Communities and Technologies (pp. 81-95). Kluwer Academic Publishers.

KEY TERMS

Community: A group of people having mutual relationships among themselves or having common interests. Clusters in social network graphs are interpreted as communities.

Mining E-Mails: The application of analytical methods and tools to e-mail data for a) support of communication by filing e-mails into folders, filtering spam, answering e-mails automatically, or proposing completions to sentence fragments, b) discovery of hidden properties of communication networks by e-mail graph analysis.

Mining Question-Answer Pairs: Analytical method for automatically answering question e-mails using knowledge that is discovered in question-answer pairs of past e-mail communication.

Mining Sentences: Analytical method for interactively completing incomplete sentences using knowledge that is discovered in a document collection.

Semantic E-Mail: E-mail framework in which the semantics of e-mails is understandable by both human and machine. A standardized definition of semantic e-mail processes is required.

Spam E-Mail: Unsolicited and unwanted bulk e-mail. Identifying spam e-mail is a text classification task.

Text Classification: The task of assigning documents expressed in natural language to one or more categories (classes) of a predefined set.

TFIDF: Weighting scheme for document and query representation in the vector space model. Each dimension represents a term; its value is the product of the frequency of the term in a document (TF) and the inverse document frequency (IDF) of the term. The inverse document frequency of a term is the logarithmic proportion of documents in which the term occurs. The TFIDF scheme assigns a high weight to terms which occur frequently in the focused document, but are infrequent in average documents.


Mining for Image Classification Based on Feature Elements
Yu-Jin Zhang
Tsinghua University, Beijing, China

INTRODUCTION

Motivation: Image Classification in Web Search

The growth of the Internet and storage capability not only increasingly makes images a widespread information format on the World Wide Web (WWW), but it also dramatically expands the number of images on WWW and makes the search of required images more complex and time-consuming. To efficiently search images on the WWW, effective image search engines need to be developed.

The classification of images plays an important role both for Web image searching and retrieving, as it is time-consuming for users to browse through the huge amount of data on the Web. Classification has been used to provide access of large image collections in a more efficient manner, because the classification can reduce search space by filtering out images in an unrelated category (Hirata, 2000).

The heterogeneous nature of Web images makes their classification a challenging task. A functional classification scheme should take the contents of images into consideration. The association rule mining, first proposed by Agrawal (1993), is an appropriate tool for pattern detection in knowledge discovery and data mining. Its objective is to extract useful information from very large databases (Renato, 2002). By using rules extracted from images, the content of images can be suitably analyzed, and the information required for image classification can be obtained.

Highlights of the Article

A novel method for image classification based on feature element through association rule mining is presented. The feature elements can capture well the visual meanings of images according to the subjective perception of human beings. In addition, feature elements are discrete entities and are suitable for working with rule-based classification models. Different from traditional image classification methods, the proposed classification approach, based on a feature element, does not compute the distance between two vectors in the feature space. This approach just tries to find associations between the feature elements and class attributes of the image. Techniques for mining the association rules are adapted, and the mined rules are applied to image classifications. Experiments with real images show that the new approach not only reduces the classification errors but also diminishes the time complexity.

The remaining parts of this article are structured as follows:

• Background: (1) Feature Elements vs. Feature Vectors; (2) Association Rules and Rule Mining; (3) Classification Based on Association
• Main Thrust: (1) Extracting Various Types of Feature Elements; (2) Feature Element Based Image Classification; (3) Database Used in Test; (4) Classification and Comparison Results
• Direction of Future Research
• Conclusion

BACKGROUND

Feature Elements vs. Feature Vectors

Traditionally, feature vectors are used for object identification and classification as well as for content-based image retrieval (CBIR). In object identification and classification, different features representing the characteristics of objects are extracted first. These features mark out an object to a point in the feature space. By detecting this point in the space, the object can be identified or classified. In CBIR, the procedure is similar. Features such as color, texture, and shape are extracted from images and grouped into feature vectors (Zhang, 2003). The similarity among images is measured by distances between corresponding vectors.

However, these feature vectors often are different from the representation and description adapted by human beings. For example, when people look at a colorful image, they hardly figure out its color histogram but rather are concerned about what particular colors are contained in certain components of the image. In fact,
image classification methods, the proposed classifica- rather are concerned about what particular colors are
tion approach, based on a feature element, does not contained in certain components of the image. In fact,


these color components play a great role in perception and represent useful visual meanings of images. The pixels belonging to these visual components can be taken to form perceptual primitive units, by which human beings could identify the content of images (Xu, 2001).

The feature elements are defined on the basis of these primitive units. They are discrete quantities, relatively independent of each other, and have obvious intuitive visual senses. In addition, they can be considered as sets of items. Based on feature elements, image classification becomes a process of counting the existence of representative components in images. For this purpose, it is required to find some association rules between the feature elements and the class attributes of image.

Association Rules and Rule Mining

The association rule can be represented by an expression X → Y, where X and Y can be any discrete entity. As we discuss image database, X and Y can be some feature elements extracted from images. The meaning of X → Y is: Given an image database D, for each image I ∈ D, X → Y expresses that whenever an image I contains X then I probably will also contain Y. The support of the association rule is defined as the probability p(X ⊆ I, Y ⊆ I), and the confidence of the association rule is defined as the conditional probability p(Y ⊆ I | X ⊆ I). A rule with support bigger than a specified minimum support and with confidence bigger than a specified minimum confidence is considered as a significant association rule.
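With images represented as sets of feature elements, both quantities are easy to compute directly; the following small example (with invented feature elements) illustrates the definitions:

# Support and confidence of a rule X -> Y over an image database, where each
# image is represented by the set of feature elements it contains.
images = [{"red_blob", "green_texture"},
          {"red_blob", "green_texture", "sky_region"},
          {"red_blob", "sky_region"},
          {"green_texture"}]

def support(X, Y, db):
    return sum(1 for img in db if X <= img and Y <= img) / len(db)

def confidence(X, Y, db):
    covered = [img for img in db if X <= img]
    return sum(1 for img in covered if Y <= img) / len(covered) if covered else 0.0

X, Y = {"red_blob"}, {"green_texture"}
print(support(X, Y, images), confidence(X, Y, images))   # 0.5 and 2/3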
Since the introduction of the association rule mining by Agrawal (1993), many researches have been conducted to enhance its performance. Most works can be grouped into the following categories:

1. Works for mining of different rules, such as multi-dimensional rules (Yang, 2001).
2. Works for taking advantage of particular techniques, such as tree projection (Guralnik, 2004), multiple minimum supports (Tseng, 2001), constraint-based clustering (Tung, 2001), and association (Cohen, 2001).
3. Works for developing fast algorithms, such as algorithm based on anti-skew partitioning (Lin, 1998).
4. Works for discovering a temporal database, such as discovering temporal association rules (Guimaraes, 2000; Li, 2003).

Currently, the association rule mining (Lee, 2003; Harms, 2004) is one of the most popular pattern discovery methods in knowledge discovery and data mining. In contrast to the classification rule mining (Pal, 2003), the purpose of association rule mining is to find all significant rules in the database that satisfy some minimum support and minimum confidence constraints (Hipp, 2000). It is known that rule-based classification models often have difficulty dealing with continuous variables. However, as a feature element is just a discrete entity, association rules can easily be used for treating images represented and described by feature elements. In fact, a decision about whether an image I contains feature element X and/or feature element Y can be properly defined and detected.

Classification Based on Association

Classification based on associations (CBA) is an algorithm for integrating classification and association rule mining (Liu, 1998). Assume that the data set is a normal relational table that consists of N cases described by distinct attributes and classified into several known classes. All the attributes are treated uniformly. For a categorical attribute, all the possible values are mapped to a set of consecutive positive integers. With these mappings, a data case can be treated as a set of (attribute, integer value) pairs plus a class label. Each (attribute, integer value) pair is called an item. Let D be the data set, I the set of all items in D, and Y the class labels. A class association rule (CAR) is an implication of the form X → y, where X ⊆ I and y ∈ Y. A data case d ∈ D means d contains a subset of items; that is, X ⊆ d and X ⊆ I. A rule X → y holds in D with confidence C if C percent of the cases in D that contain X are labeled with class y. The rule X → y has support S in D if S percent of the cases in D contain X and are labeled with class y.

The objective of CBA is to generate the complete set of CARs that satisfy the specified minimum supports and minimum confidence constraints, and to build a classifier from CARs. It is easy to see that if the right-hand-side of the association rules is restricted to the (classification) class attributes, then such rules can be regarded as classification rules to build classifiers.

MAIN THRUST

Extracting Various Types of Feature Elements

Various types of feature elements that put emphasis on different properties will be employed in different applications. The extractions of feature elements can be carried out first by locating the perceptual elements and then by determining their main properties and giving them suitable descriptions. Three typical examples are described in the following.
significant rules in the database that satisfy some mini-


One process for obtaining feature elements primarily based on color properties can be described by the following steps (Xu, 2001):

1. Images are divided into several clusters with a perceptual grouping based on hue histogram.
2. For each cluster, the central hue value is taken as its color cardinality named as Androutsos-cardinality (AC). In addition, color-coherence-vector (CCV) and color-auto-correlogram (CAC) also are calculated.
3. Additional attributes such as the center coordinates and area of each cluster are recorded to represent the position and size information of clusters.

One type of feature element highlighting the form property of clusters is obtained with the help of Zernike moments (Xu, 2003). They are invariant to similarity transformations, such as translation, rotation, and scaling of the planar shape (Wee, 2003). Based on Zernike moments of clusters, different descriptors for expressing circularity, directionality, eccentricity, roundness, symmetry, and so forth, can be directly obtained, which provides useful semantic meanings of clusters with respect to human perception.

The wavelet feature element is based on wavelet modulus maxima and invariant moments (Zhang, 2003). Wavelet modulus maxima can indicate the location of edges in images. A set of seven invariant moments (Gonzalez, 2002) is used to represent the multi-scale edges in wavelet-transformed images. Three steps are taken first:

1. Images are decomposed, using dyadic wavelet, into a multi-scale modulus image.
2. Pixels in the wavelet domain whose moduli are locally maxima are used to form multi-scale edges.
3. The seven invariant moments at each scale are computed and combined to form the feature vector of images.

Figure 1. Splitting and grouping feature vectors to construct feature elements

m11 m12 m13 m14 m15 m16 m17
m21 m22 m23 m24 m25 m26 m27
m31 m32 m33 m34 m35 m36 m37
m41 m42 m43 m44 m45 m46 m47
m51 m52 m53 m54 m55 m56 m57
m61 m62 m63 m64 m65 m66 m67

Then, a process of discretization is followed (Li, 2002). Suppose the wavelet decomposition is performed in six levels; for each level, seven moments are computed. This gives a 42-D vector. It can be split into six groups, each of them being a 7-D vector that represents seven moments on one level. On the other side, the whole vector can be split into seven groups; each of them is a 6-D vector that represents one moment on all six levels. This process can be described with the help of Figure 1.
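The splitting itself amounts to reading a 6 x 7 array of moments by rows or by columns, as the following toy illustration shows (the numbers are placeholders, not real moment values):

# Toy illustration of Figure 1: a 42-D wavelet feature vector viewed per level
# (rows) or per moment (columns).
import numpy as np

m = np.arange(1, 43, dtype=float).reshape(6, 7)   # m[i, j]: moment j+1 at level i+1

per_level  = [m[i, :] for i in range(6)]   # six 7-D vectors, one per level
per_moment = [m[:, j] for j in range(7)]   # seven 6-D vectors, one per moment
print(per_level[0], per_moment[0])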
In all these examples, the feature elements have property represented by numeric values. As not all of the feature elements have the same status in the visual sense, an evaluation of feature elements is required to select suitable feature elements according to the subjective perception of human beings (Xu, 2002).

Feature Element Based Image Classification

Feature Element Based Image Classification (FEBIC) uses CBA to find association rules between feature elements and class attributes of the images, while the class attributes of unlabeled images could be predicted with such rules. In case an unlabeled image satisfies several rules, which might make this image be classified into different classes, the support values and confidence values can be used to make the final decision.

In accordance with the assumption in CBA, each image is considered as a data case, which is described by a number of attributes. The components of the feature element are taken as attributes. The labeled image set can be considered as a normal relational table that is used to mine association rules for classification. In the same way, feature elements from unlabeled images are extracted and form another relational table without class attributes, on which the classification rules to predict the class attributes of each unlabeled image will be applied.

The whole procedure can be summarized as follows:

1. Extract feature elements from images.
2. Form relational table for mining association rules.
3. Use mined rules to predict the class attributes of unlabeled images.
4. Classify images using the association of feature elements.
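The following self-contained sketch illustrates the CBA-style core of this procedure: it mines class association rules "feature elements -> class" that meet minimum support and confidence and labels a new image by the best matching rule. The feature elements, class names, and thresholds are invented, and the brute-force enumeration merely stands in for a real CAR miner:

# Sketch of class association rule mining and rule-based labeling, FEBIC-style.
from itertools import combinations

labeled = [({"red_blob", "round_shape"}, "flower"),
           ({"red_blob", "round_shape", "green_texture"}, "flower"),
           ({"skin_tone", "face_shape"}, "person"),
           ({"skin_tone", "round_shape"}, "person")]

MIN_SUP, MIN_CONF = 0.25, 0.6
items = sorted(set().union(*(fe for fe, _ in labeled)))
classes = sorted({c for _, c in labeled})

rules = []   # (antecedent, class, support, confidence)
for size in (1, 2):
    for X in combinations(items, size):
        X = frozenset(X)
        covered = [c for fe, c in labeled if X <= fe]
        if not covered:
            continue
        for y in classes:
            sup = sum(1 for c in covered if c == y) / len(labeled)
            conf = sum(1 for c in covered if c == y) / len(covered)
            if sup >= MIN_SUP and conf >= MIN_CONF:
                rules.append((X, y, sup, conf))

def classify(feature_elements):
    matching = [r for r in rules if r[0] <= feature_elements]
    if not matching:
        return None
    # Conflicts between rules are resolved by confidence, then support.
    return max(matching, key=lambda r: (r[3], r[2]))[1]

print(classify({"red_blob", "round_shape", "sky_region"}))   # -> "flower"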
Database Used in Test

The image database for testing consists of 2,558 real-color images that can be grouped into five different classes: (1) 485 images with (big) flowers; (2) 565 images with person pictures; (3) 505 images with


autos; (4) 500 images with different sceneries (e.g., sunset, sunrise, beach, mountain, forest, etc.); and (5) 503 images with flower clusters. Among these classes, the first three have prominent objects, while the other two normally have no dominant items. Two typical examples from each class are shown in Figure 2.

Figure 2. Typical image examples from different classes: (a) flower image; (b) person picture; (c) auto image; (d) scenery; (e) flower cluster

Among these images, one-third have been used in the test set and the rest in the training set. The images in the training set are labeled manually and then used in the mining of association rules, while the images in the testing set will be labeled automatically by these mined rules.

Classification and Comparison Results

Classification experiments using two methods with the previously mentioned database are carried out. The proposed method FEBIC is compared to another state-of-the-art method, nearest feature line (NFL) (Li, 2000). NFL is a classification method based on feature vectors. In comparison, the color feature (i.e., AC, CCV, CAC) and the wavelet feature based on wavelet modulus maxima and invariant moments are used.

Two tests are performed. For each test, both methods use the same training set and testing set. The results of these experiments are summarized in Table 1, where the classification error rates for each class and for the average over the five classes are listed.

The results in Table 1 show that the classification error rate of NFL is about 34.5%, while the classification error rate of FEBIC is about 25%. The difference is evident.

Table 1. Comparison of classification errors

                    Test set 1           Test set 2
  Error rate        FEBIC     NFL        FEBIC     NFL
  Flower            32.1%     48.8%      36.4%     46.9%
  Person            22.9%     25.6%      20.7%     26.1%
  Auto              21.3%     23.1%      18.3%     23.1%
  Scenery           30.7%     38.0%      32.5%     34.3%
  Flower cluster    26.8%     45.8%      20.2%     37.0%
  Average           26.6%     35.8%      25.4%     33.2%

Besides the classification error, the time complexity is another important factor to be counted in Web application, as the number of images on the WWW is huge. The computation times for the two methods are compared during the test experiments. The time needed for FEBIC is only about 1/100 of the time needed for NFL. Since NFL requires many arithmetic operations to compute distance functions, while FEBIC needs only a few operations for judging the existence of feature elements, such a big difference in computation is well expected.

FUTURE TRENDS

The detection and description of feature elements play an important role in providing suitable information and a basis for association rule mining. How to adaptively design feature elements that can capture the user's intention based on perception and interpretation needs further research.

The proposed techniques also can be extended to the content-based retrieval of images over the Internet. As feature elements are discrete entities, the similarity between images described by feature elements can be computed according to the number of common elements.

CONCLUSION

A new approach for image classification that uses feature elements and employs association rule mining is proposed. It provides lower classification error and higher computation efficiency. These advantages make it quite suitable to be included into a Web search engine for images over the Internet.


ACKNOWLEDGMENTS

This work has been supported by the Grants NNSF-60172025 and TH-EE9906.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD.

Cohen, E. et al. (2001). Finding interesting associations without support pruning. IEEE Trans. Knowledge and Data Engineering, 13(1), 64-78.

Gonzalez, R. C., & Woods, R. E. (2002). Digital image processing. Prentice Hall.

Guimaraes, G. (2000). Temporal knowledge discovery for multivariate time series with enhanced self-organizing maps. Proceedings of the International Joint Conference on Neural Networks.

Guralnik, V., & Karypis, G. (2004). Parallel tree-projection-based sequence mining algorithms. Parallel Computing, 30(4), 443-472.

Harms, S. K., & Deogun, J. S. (2004). Sequential association rule mining with time lags. Journal of Intelligent Information Systems, 22(1), 7-22.

Hipp, J., Guntzer, U., & Nakhaeizadeh, G. (2000). Algorithms for association rule mining: A general survey and comparison. ACM SIGKDD, 2(1), 58-64.

Hirata, K. et al. (2000). Integration of image matching and classification for multimedia navigation. Multimedia Tools and Applications, 11, 295-309.

Lee, C. H., Chen, M. S., & Lin, C. R. (2003). Progressive partition miner: An efficient algorithm for mining general temporal association rules. IEEE Trans. Knowledge and Data Engineering, 15(4), 1004-1017.

Li, Q., Zhang, Y. J., & Dai, S. Y. (2002). Image search engine with selective filtering and feature element based classification. Proceedings of the SPIE of Internet Imaging III.

Li, S. Z., Chan, K. L., & Wang, C. L. (2000). Performance evaluation of the nearest feature line method in image classification and retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11), 1335-1339.

Li, Y. J. et al. (2003). Discovering calendar-based temporal association rules. Data and Knowledge Engineering, 44(2), 193-218.

Lin, J. L., & Dunham, M. H. (1998). Mining association rules: Antiskew algorithms. Proceedings of the International Conference on Data Engineering.

Liu, B., Hsu, W., & Ma, Y. M. (1998). Integrating classification and association rule mining. Proceedings of the International Conference on Knowledge Discovery and Data Mining.

Pal, S. K. (2003). Soft computing pattern recognition, case generation and data mining. Proceedings of the International Conference on Active Media Technology.

Renato, C. (2002). A theoretical framework for data mining: The informational paradigm. Computational Statistics and Data Analysis, 38(4), 501-515.

Tseng, M. C., Lin, W., & Chien, B. C. (2001). Maintenance of generalized association rules with multiple minimum supports. Proceedings of the Annual Conference of the North American Fuzzy Information Processing Society.

Tung, A. K. H. et al. (2001). Constraint-based clustering in large databases. Proceedings of International Conference on Database Theory.

Wee, C. Y. (2003). New computational methods for full and subset Zernike moments. Information Sciences, 159(3-4), 203-220.

Xu, Y., & Zhang, Y. J. (2001). Image retrieval framework driven by association feedback with feature element evaluation built in. Proceedings of the SPIE Storage and Retrieval for Media Databases.

Xu, Y., & Zhang, Y. J. (2002). Feature element theory for image recognition and retrieval. Proceedings of the SPIE Storage and Retrieval for Media Databases.

Xu, Y., & Zhang, Y. J. (2003). Semantic retrieval based on feature element constructional model and bias competition mechanism. Proceedings of the SPIE Storage and Retrieval for Media Databases.

Yang, C., Fayyad, U., & Bradley, P. S. (2001). Efficient discovery of error-tolerant frequent itemsets in high dimensions. Proceedings of the International Conference on Knowledge Discovery and Data Mining.

Zhang, Y. J. (2003). Content-based visual information retrieval. Science Publisher.


KEY TERMS

Classification Error: Error produced by incorrect classifications, which consists of two types: false negative (wrongly classify an item belonging to one class into another class) and false positive (wrongly classify an item from other classes into the current class).

Classification Rule Mining: A technique/procedure aiming to discover a small set of rules in the database to form an accurate classifier for classification.

Content-Based Image Retrieval (CBIR): A process framework for efficiently retrieving images from a collection by similarity. The retrieval relies on extracting the appropriate characteristic quantities describing the desired contents of images. In addition, suitable querying, matching, indexing, and searching techniques are required.

Multi-Resolution Analysis: A process to treat a function (i.e., an image) at various levels of resolutions and/or approximations. In such a way, a complicated function could be divided into several simpler ones that can be studied separately.

Pattern Detection: Concerned with locating patterns in the database to maximize/minimize a response variable or minimize some classification error (i.e., supervised pattern detection), or with not only locating occurrences of the patterns in the database but also deciding whether such an occurrence is a pattern (i.e., unsupervised pattern detection).

Pattern Recognition: Concerned with the classification of individual patterns into pre-specified classes (i.e., supervised pattern recognition), or with the identification and characterization of pattern classes (i.e., unsupervised pattern recognition).

Similarity Transformation: A group of transformations that will preserve the angles between any two curves at their intersecting points. It is also called equiform transformation, because it preserves the form of curves. A planar similarity transformation has four degrees of freedom and can be computed from a two-point correspondence.

Web Image Search Engine: A kind of search engine that starts from several initially given URLs and extends from complex hyperlinks to collect images on the WWW. A Web search engine is also known as a Web crawler.

Web Mining: Concerned with the mechanism for discovering the correlations among the references to various files that are available on the server by a given client visit to the server.


Mining for Profitable Patterns in the Stock Market
Yihua Philip Sheng
Southern Illinois University, USA

Wen-Chi Hou
Southern Illinois University, USA

Zhong Chen
Shanghai JiaoTong University, PR China

INTRODUCTION

The stock market, like other economic phenomena, is a very complex system. Many factors, such as company news, interest rates, macro economic data, and investors' hopes and fears, all affect its behavior (Pring, 1991; Sharpe, Alexander, & Bailey, 1999). Investors have longed for tools and algorithms to analyze and predict stock market movement. In this study, we combine a financial theory, the market efficiency theory, and a data mining technique to explore profitable trading patterns in the stock market. To observe the price oscillation of several consecutive trading days, we examine the K-lines, each of which represents a stock's one-day movement. We will use a data mining technique with a heuristic rating algorithm to mine for reliable patterns indicating price rise or fall in the near future.

BACKGROUND

Methods of Stock Technical Analysis

Conventional stock market technical analysis is often done through visually identifying patterns or indicators on the stock price and volume charts. Indicators like moving averages and support and resistance level are easy to implement algorithmically. Patterns like head-and-shoulder, inverse head-and-shoulder, broadening tops and bottoms, and so forth, are easy for humans to visually identify but difficult for computers to recognize. For such patterns, methods like smoothing estimators and kernel regression can be applied to increase their machine-readability (Dawson & Steely, 2003; Lo, Mamaysky, & Wang, 2000).

The advances in data mining technology have pushed the technical analysis of the stock market from simple indicators, visually recognizable patterns, and linear statistical models to more complicated nonlinear models. A great deal of research has focused on the applications of artificial intelligence (AI) algorithms, such as artificial neural networks (ANNs) and genetic algorithms (e.g., Allen & Karjalainen, 1999; Chenoweth, Obradovic, & Stephenlee, 1996; Thawornwong, Enke, & Dagli, 2003). ANNs equipped with effective learning algorithms can use different kinds of inputs, handle noisy data, and identify highly nonlinear models. Genetic algorithms constitute a class of search, adaptation, and optimization techniques to emulate the principles of natural evolution. More recent studies tend to embrace multiple AI techniques in one approach. Tsaih, Hsu, & Lai (1998) integrated the rule-based systems technique and the neural networks technique to predict the direction of daily price changes in S&P 500 stock index futures. Armano, Murru, & Roli (2002) employed ANNs with a genetic algorithm to predict the Italian stock market. Fuzzy logic, a relatively newer AI algorithm, has also been used in the stock market prediction literature (e.g., Dourra & Siy, 2001). The increasing popularity of fuzzy logic is due to its simplicity in constructing the models and less computation load. Fuzzy algorithms provide a fairly straightforward translation of the qualitative/linguistic statements of rules.

Market Efficiency Theory

According to the market efficiency theory, a stock's price is a full reflection of market information about that stock (Fama, 1991; Malkiel, 1996). Therefore, if there is information out on the market about a stock, the stock's price will adjust accordingly. Interestingly, evidence shows that price adjustment in response to news usually does not settle down in one day; it actually takes some time for the whole market to digest the news. If the stock's price really adjusted to relevant events in a timely manner, the stock price chart would have looked more like what Figure 1 shows.

Mining for Profitable Patterns in the Stock Market

Figure 1. Ideal stock price movement curve under the market efficiency theory

The flat periods indicate that there were no events occurring during those periods, while the sharp edges indicate sudden stock price movements in response to event announcements. However, in reality, most stocks' daily prices resemble the curve shown in Figure 2. As the figure shows, there is no obvious flat period for a stock, and the stock price seems to keep on changing. In some cases, the stock price continuously moves down or up for a relatively long period, for example, the period of May 17, 2002 to July 2, 2002, and the period of October 16, 2002 to November 6, 2002. This could be because there were negative (or positive) events for the company every day for a long period of time, or because the stock price adjustment to events actually spans a period of time rather than happening instantly. The latter means that stock price adjustment to event announcements is not efficient and the semi-strong form of the market efficiency theory does not hold. Furthermore, we think the first few days' price adjustments of the stock are crucial, and the price movements in these early days might contain enough information to predict whether the rest of the price adjustment in the near future is upwards or downwards.

Knowledge Representation

Knowledge representation holds the key to the success of data mining. A good knowledge representation should be able to include all possible phenomena of a problem domain without complicating it (Liu, 1998). Here, we use K-lines, a widely used representation of daily stock prices in Asian stock markets, to describe the daily price change of a stock. Figure 3 shows examples of K-Lines.

Figure 3(a) is a price-up K-Line, denoted by an empty rectangle, indicating that the closing price is higher than the opening price. Figure 3(b) is a price-down K-Line, denoted by a solid rectangle, indicating that the closing price is lower than the opening price. Figures 3(c) and 3(d) are 3-day K-Lines. Figure 3(c) shows that the price was up for two consecutive days and the second day's opening price continued on the first day's closing price. This indicates that the news was very positive. The price came down a little bit on the third day, which might be due to a correction of the over-valuation of the good news in the prior two days. Actually, the long price shadow above the closing price of the second day already shows some degree of price correction. Figure 3(d) is the opposite of Figure 3(c). When an event about a stock happens, such as rumors of a merger/acquisition or a change of dividend policy, the price adjustments might last for several days until the price finally settles down. As a result, the stock's price might keep rising or falling or stay the same during the price adjustment period.

A stock has a K-Line for every trading day, but not every K-Line is of interest to us. Our goal is to identify a stock's K-Line patterns that reflect investors' reactions to market events, such as the release of good or bad corporate news, a major stock analyst's upgrade of the stock, and so on. Such market events usually can cause the stock's price to oscillate for a period of time. Certainly, a stock's price sometimes might change with large magnitude for just a day or two due to transient market rumors. These types of price oscillations are regarded as market noise and are therefore ignored.

Whether a stock's daily price oscillates is determined by examining whether the price change on that day is greater than the average price change of the year. If a stock's price oscillates for at least three consecutive days, we regard it as a signal of the occurrence of a market event. The market's response to the event is recorded in a 3-day K-Line pattern. Then, we examine whether this pattern is followed by an up or down trend of the stock's price a few days later.
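As a rough illustration of the oscillation test just described, the sketch below flags oscillating days and reports runs of at least three consecutive oscillating days as candidate market events. The helper names, the toy closing-price list, and the use of the series' own average absolute change in place of the yearly average are illustrative assumptions, not part of the original study.

```python
def oscillation_days(closes):
    """Flag each day whose absolute price change exceeds the average
    absolute daily change of the series (standing in for the yearly average)."""
    changes = [abs(closes[i] - closes[i - 1]) for i in range(1, len(closes))]
    avg_change = sum(changes) / len(changes)
    # Day 0 has no prior close, so it is never flagged.
    return [False] + [c > avg_change for c in changes]

def event_windows(flags, min_run=3):
    """Return (start, end) index pairs of runs of at least min_run consecutive
    oscillating days; each such run is read as the signature of a market event."""
    runs, start = [], None
    for i, f in enumerate(flags + [False]):   # trailing sentinel closes a final run
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    return runs

closes = [20.0, 20.1, 21.5, 22.9, 21.4, 21.5, 21.4, 23.0]
print(event_windows(oscillation_days(closes)))   # [(2, 4)] for this toy series
```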

Figure 2. The daily stock price curve of Intel Corporation (NasdaqNM Symbol: INTC)


Figure 3. K-Line examples



The relative positions of the K-Lines, such as one day's opening/closing prices relative to the prior day's closing/opening prices, the length of the price body, and so on, reveal market reactions to the events. The following bit-representation method, called Relative Price Movement, or RPM for simplicity, is used to describe the positional relationship of the K-lines over three days.

Day 1

bit 0: 1 if the day's price is up, 0 otherwise
bit 1: 1 if the up shadow is longer than the price body, 0 otherwise
bit 2: 1 if the down shadow is longer than the price body, 0 otherwise

Day 2

bits 0-2: The same as Day 1's representation
bits 3-5:
001, if the price body covers the day 1's price body
010, if the price body is covered by the day 1's price body
011, if the whole price body is higher than the day 1's price body
100, if the whole price body is lower than the day 1's price body
101, if the price body is partially higher than the day 1's price body
110, if the price body is partially lower than the day 1's price body

Day 3

bits 0-2: The same as Day 1's representation
bits 3-7:
00001-00111, reserved
01000, if the price body covers the day 1 and day 2's price bodies
01001, if the price body covers the day 1's price body only
01010, if the price body covers the day 2's price body only
01011, if the price body is covered by the day 1 and day 2's price bodies
01100, if the price body is covered by the day 2's price body only
01101, if the price body is covered by the day 1's price body only
01110, if the whole price body is higher than the day 1 and day 2's price bodies
01111, if the whole price body is higher than the day 1's price body only
10000, if the whole price body is higher than the day 2's price body only
10001, if the whole price body is lower than the day 1 and day 2's price bodies
10010, if the whole price body is lower than the day 1's price body only
10011, if the whole price body is lower than the day 2's price body only
10100, if the price body is partially higher than the day 1 and day 2's price bodies
10101, if the price body is partially lower than the day 1 and day 2's price bodies
10110, if the price body is partially higher than the day 2's price body only
10111, if the price body is partially higher than the day 1's price body only
11000, if the price body is partially lower than the day 2's price body only
11001, if the price body is partially lower than the day 1's price body only
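To make the day-1 part of the RPM encoding concrete, the sketch below derives the three day-1 bits from one day's open/high/low/close prices. The KLine tuple and function name are hypothetical assumptions; the bit meanings follow the list above.

```python
from collections import namedtuple

# One day's candle: the price body spans open and close; the up shadow is the
# range above the body, the down shadow the range below it.
KLine = namedtuple("KLine", "open high low close")

def day1_bits(k):
    body = abs(k.close - k.open)
    up_shadow = k.high - max(k.open, k.close)
    down_shadow = min(k.open, k.close) - k.low
    bit0 = 1 if k.close > k.open else 0      # bit 0: price up
    bit1 = 1 if up_shadow > body else 0      # bit 1: long upper shadow
    bit2 = 1 if down_shadow > body else 0    # bit 2: long lower shadow
    return bit0 | (bit1 << 1) | (bit2 << 2)  # pack with bit 0 in the lowest position

print(day1_bits(KLine(open=10.0, high=10.6, low=9.9, close=10.4)))   # 1: up day, short shadows
print(day1_bits(KLine(open=10.0, high=10.05, low=9.2, close=9.9)))   # 4: down day, long lower shadow
```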


Mining for Rules

The rules we mine for are similar to those of Liu (1998), Siberschatz and Tuzhilin (1996), and Zaki, Parthasarathy, Ogihara, and Li (1997). They have the following format:

Rule type (1): a 3-day K-Line pattern → the stock's price rises 10% in 10 days
Rule type (2): a 3-day K-Line pattern → the stock's price falls 10% in 10 days

The search algorithm for finding 3-day K-Line patterns that lead to a stock price rise or fall is as follows:

1. For every 3-day K-Line pattern in the database:
2. Encode it by using the RPM method to get every day's bit representation, c1, c2, c3;
3. Increase pattern_occurrence[c1][c2][c3] by 1;
4. base_price = the 3rd day's closing price;
5. If the stock's price rises 10% or more, as compared to the base_price, in the 10 days after the occurrence of this pattern, increase Pup[c1][c2][c3] by 1;
6. If the stock's price falls 10% or more, as compared to the base_price, in the 10 days after the occurrence of this pattern, increase Pdown[c1][c2][c3] by 1.

We used the daily trading data from January 1, 1994, through December 31, 1998, of the 82 stocks shown in Table 1 as the base data set to mine for the price-up and price-down patterns. After applying the above search algorithm to the base data set, the Pup and Pdown arrays contained the counts of all the patterns that led the price to rise or fall by 10% in 10 days. In total, the up-patterns occurred 1,377 times, among which there were 870 different types of up-patterns, and the down-patterns occurred 1,001 times, among which there were 698 different types of down-patterns.

Table 1. 82 selected stocks

ADBE BA CDN F KO MWY S WAG
ADSK BAANF CEA FON LGTO NETG SAPE WCOM
ADVS BEAS CHKP GATE LU NKE SCOC WMT
AGE BEL CLGY GE MACR NOVL SNPS XOM
AIT BTY CNET GM MERQ ORCL SUNW YHOO
AMZN BVEW CSCO HYSL MO PRGN SYBS
AOL CA DD IBM MOB PSDI SYMC
ARDT CAL DELL IDXC MOT PSFT T
AVNT CBS DIS IFMX MRK RATL TSFW
AVTC CBTSY EIDSY INTU MSFT RMDY TWX
AWRE CCRD ERTS ITWO MUSE RNWK VRSN

A heuristic, stated below, was applied to all found patterns to reduce the ambiguity of the patterns. Using the price-up pattern as an example, for a pattern to be labeled a price-up pattern, we require that the number of times it appeared in Pup be at least twice the number of times it appeared in Pdown. All the patterns labeled as price-up patterns were then sorted based on the ratio of the square root of the pattern's total occurrences plus its occurrence as a price-up pattern over its occurrence as a price-down pattern.

For a price-up pattern:

Preference = (Pup/Pdown + √PO/Pdown) × Pup,  if Pup/Pdown > 2
Preference = −(Pup/Pdown + √PO/Pdown) × Pup, if Pup/Pdown ≤ 2

For a price-down pattern:

Preference = (Pdown/Pup + √PO/Pup) × Pdown,  if Pdown/Pup > 2
Preference = −(Pdown/Pup + √PO/Pup) × Pdown, if Pdown/Pup ≤ 2

where PO is the pattern's total number of occurrences.
The final winning patterns with a positive Preference score are listed in Table 2.

Table 2. Final winning patterns sorted by preference

Pattern Code      PO   Pup  Pdown  Preference
Up[00][20][91]    46   15   4      81.68
Up[01][28][68]    17   7    1      77.86
Up[07][08][88]    11   7    1      72.22
Up[00][24][88]    10   7    1      71.14
Up[00][30][8E]    9    7    1      70.00
Up[01][19][50]    28   12   3      69.17
Up[00][30][90]    39   21   9      63.57
Up[00][31][81]    26   8    2      52.40
Up[00][20][51]    18   8    2      48.97
Up[01][19][60]    24   9    3      41.70
Down[01][1D][71]  10   0    6      66.00
Down[00][11][71]  17   1    6      60.74
Down[01][19][79]  35   3    10     53.05
Down[00][20][67]  18   2    5      23.11

Performance Evaluation

To evaluate the performance of the winning patterns listed in Table 2, we applied them to the prices of the same 82 stocks for the period from January 1, 1999, through December 31, 1999.

782

TEAM LinG
Mining for Profitable Patterns in the Stock Market

A stop-loss of 5% was set to reduce the risk imposed by a wrong signal; this is a common practice in the investment industry. If a buy signal is generated, we buy that stock and hold it. The stock is sold when it reaches the 10% profit target, when the 10-day holding period ends, or when its price goes down 5%. The same rules were applied to the sell signals, but in the opposite way. Table 3 shows the number of buy and short-sell signals generated by these patterns.
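A minimal sketch of the exit rule applied to a single buy signal, under the assumptions that prices is a list of daily closes and that the position is closed at the 10% target, the 5% stop-loss, or the end of the 10-day holding period, whichever comes first. The function and sample data are illustrative, not the authors' code.

```python
def evaluate_buy_signal(prices, entry_index, target=0.10, stop=0.05, max_hold=10):
    """Realised return of one buy signal under the exit rule described above."""
    entry = prices[entry_index]
    window = prices[entry_index + 1:entry_index + 1 + max_hold]
    for price in window:
        ret = (price - entry) / entry
        if ret >= target:        # 10% profit target reached
            return ret
        if ret <= -stop:         # 5% stop-loss triggered
            return ret
    # Otherwise close the position at the end of the holding period
    # (or at the last available price).
    last = window[-1] if window else entry
    return (last - entry) / entry

prices = [50, 51, 52, 54, 56, 55.5, 57]
print(round(evaluate_buy_signal(prices, 0), 2))   # 0.12: target hit on the fourth day
```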
As seen from Table 3, the price-up winning patterns worked very well: 42.86% of the predictions were perfectly correct. Also, 20 of the 84 buy signals assured a 6.7% gain after the signals. If we regard a 5% increase also as making money, then in total we had a 70.24% chance to win money and an 85.71% chance of not losing money. The price-down patterns did not work as well as the price-up patterns, probably because there were not as many down trends as up trends in the U.S. stock market in 1999. Still, by following the sell signals, there was a 43% chance of gaining money and an 87.5% chance of not losing money in 1999. The final return for the year 1999 was 153.8%, which was superior compared to the 84% return of the Nasdaq composite and the 25% return of the Dow Industrial Average.

FUTURE TRENDS

Being able to identify price rise or drop patterns can be exciting for frequent stock traders. By following the buy or sell signals generated by these patterns, frequent stock traders can earn excessive returns over the simple buy-and-hold strategy (Allen & Karjalainen, 1999; Lo & MacKinlay, 1999). Data mining techniques combined with financial theories can be a powerful approach for discovering price movement patterns in the financial market. Unfortunately, researchers in the data mining field often focus exclusively on the computational part of market analysis, not paying attention to the theories of the target area. In addition, the knowledge representation methods and variables chosen are often based on common sense rather than theories. This article borrows the market efficiency theory to model the problem, and the out-of-sample performance was quite pleasing. We believe there will be more studies integrating theories from multiple disciplines to achieve better results in the near future.

CONCLUSION

This paper combines a knowledge discovery technique with a financial theory, the market efficiency theory, to solve a classic problem in stock market analysis, that is, finding stock trading patterns that lead to superior financial gains. This study is one of a few efforts that go across multiple disciplines to study the stock market, and the results were quite good.

There are also some future research opportunities in this direction. For example, the trading volume is not considered in this research; it is an important factor of the stock market, and we believe it is worth further investigation. Using four or more days' K-Line patterns, instead of just 3-day K-Line patterns, is also worth exploring.

Table 3. The performance of the chosen winning patterns

Times  Accumulated Percentage
Total Buy Signals 84
Price is up at least 10% after the signal 36 42.86%
Price is up 2/3 of 10%, i.e. 6.7%, after the signal 20 66.67%
Price is up 1/2 of 10%, i.e. 5%, after the signal 3 70.24%
Price is up only 1/10 of 10% after the signal 13 85.71%
Price drops after the signal 12 100.00%

Times  Accumulated Percentage
Total Sell Signals 16
Price is down at least 10% after the signal 4 25.00%
Price is down 2/3 of 10%, i.e. 6.7%, after the signal 2 37.50%
Price is down 1/2 of 10%, i.e. 5%, after the signal 1 43.75%
Price is down only 1/10 of 10% after the signal 7 87.50%
Price rises after the signal 2 100.00%


REFERENCES

Allen, F., & Karjalainen, R. (1999). Using genetic algorithms to find technical trading rules. Journal of Financial Economics, 51, 245-271.

Armano, G., Murru, A., & Roli, F. (2002). Stock market prediction by a mixture of genetic-neural experts. International Journal of Pattern Recognition and Artificial Intelligence, 16, 501-526.

Chenoweth, T., Obradovic, Z., & Stephenlee, S. (1996). Embedding technical analysis into neural network based trading systems. Applied Artificial Intelligence, 10, 523-541.

Dawson, E.R., & Steeley, J.M. (2003). On the existence of visual technical patterns in the UK stock market. Journal of Business Finance and Accounting, 20, 263-293.

Dourra, H., & Siy, P. (2001). Stock evaluation using fuzzy logic. International Journal of Theoretical and Applied Finance, 4, 585-602.

Fama, E.F. (1991). Efficient capital markets: II. Journal of Finance, 46, 1575-1617.

Liu, H. (1998). Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers.

Lo, A.W., & MacKinlay, A.C. (1999). A non-random walk down Wall Street. Princeton, NJ: Princeton University Press.

Lo, A.W., Mamaysky, H., & Wang, J. (2000). Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation. The Journal of Finance, 55, 1705-1765.

Pring, M.J. (1991). Technical analysis explained: The successful investor's guide to spotting investment trends and turning points (3rd ed.). McGraw-Hill Inc.

Sharpe, W.F., Alexander, G.J., & Bailey, J.V. (1999). Investments (6th ed.). Prentice-Hall.

Siberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8, 970-974.

Thawornwong, S., Enke, D., & Dagli, C. (2003). Neural networks as a decision maker for stock trading: A technical analysis approach. International Journal of Smart Engineering System Design, 5, 313-325.

Tsaih, R., Hsu, Y., & Lai, C.C. (1998). Forecasting S&P 500 stock index futures with a hybrid AI system. Decision Support Systems, 23, 161-174.

Zaki, M.J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (pp. 283-286).

KEY TERMS

Buy-And-Hold Strategy: An investment strategy of buying portfolios of stocks or mutual funds with solid, long-term growth potential. The underlying value and stability of the investments are important, rather than the short- or medium-term volatility of the market.

Fuzzy Logic: Fuzzy logic provides an approach to approximate reasoning in which the rules of inference are approximate rather than exact. Fuzzy logic is useful in manipulating information that is incomplete, imprecise, or unreliable.

Genetic Algorithm: An optimization algorithm based on the mechanisms of Darwinian evolution, which uses random mutation, crossover, and selection procedures to breed better models or solutions from an originally random starting population or sample.

K-Line: An Asian version of the stock price bar chart in which a day period with a lower close than open is shaded dark and a day period with a higher close is shaded light.

Market Efficiency Theory: A financial theory that states that stock market prices reflect all available, relevant information.

Neural Networks: Algorithms simulating the functioning of human neurons; they may be used for pattern recognition problems, for example, to establish a quantitative structure-activity relationship.
establish a quantitative structure-activity relationship.


Mining for Web-Enabled E-Business Applications
Richi Nayak
Queensland University of Technology, Australia

INTRODUCTION

A small shop owner builds a relationship with his customers by observing their needs, preferences, and buying behaviour. A Web-enabled e-business would like to accomplish something similar. It is an easy job for the small shop owner to serve his customers better in the future by learning from past interactions. But this may not be easy for Web-enabled e-businesses, where most customers may never interact personally and the number of customers is much higher than that of the small shop owner.

Data mining techniques can be applied to understand and analyse e-business data and turn it into actionable information that can support a Web-enabled e-business in improving its marketing, sales, and customer support operations. This is all the more appealing when data is produced and stored with advanced electronic data interchange methods, the computing power is affordable, the competitive pressure among businesses is strong, and efficient commercial data mining tools are available for data analysis.

BACKGROUND

Data mining is the process of searching for trends, clusters, valuable links, and anomalies in the entire data. The process benefits from the availability of a large amount of data with rich description. Rich descriptions of data, such as wide customer records with many potentially useful fields, allow data mining algorithms to search beyond obvious correlations. Examples of data mining in Web-enabled e-business applications are the generation of user profiles, enabling customer relationship management, and targeting Web advertising based on user access patterns extracted from the Web data. With the use of data mining techniques, e-business companies can improve the sales and quality of their products by anticipating problems before they occur.

When dealing with Web-enabled e-business data, a data mining task is decomposed into many subtasks (Figure 1). The discovered knowledge is presented to the user in an understandable and useable form. The analysis may reveal how a Web site is useful in making decisions for a user, resulting in improvements to the Web site. The analysis may also lead to business strategies for acquiring new customers and retaining the existing ones.

DATA MINING OPPORTUNITIES

Data obtained from Web-enabled e-business transactions can be categorised into (1) primary data, which includes the actual Web contents, and (2) secondary data, which includes Web server access logs, proxy server logs, browser logs, registration data if any, user sessions and queries, cookies, etc. (Cooley, 2003; Kosala & Blockeel, 2000).

The goal of mining the primary Web data is to effectively interpret the searched Web documents. Web search engines discover resources on the Web but have many problems, such as (1) the abundance problem, where hundreds of irrelevant data are returned in response to a search query, (2) the limited coverage problem, where only a few sites are searched for the query instead of the entire Web, (3) the limited query interface, where the user can only interact by providing a few keywords, and (4) limited customization to individual users (Garofalakis, Rastogi, Seshadri, & Hyuseok, 1999). Mining of Web contents can assist e-businesses in improving the organization of retrieved results and increasing the precision of information retrieval.

Figure 1. A mining process for Web-enabled e-business data (data gathering → data processing → data modelling → information extraction → information analysis and knowledge assimilation)


Some of the data mining applications appropriate for this type of data are:

• Trend prediction within the retrieved information to indicate future values. For example, an e-auction company provides information about items to auction, previous auction details, and so on. Predictive modelling can analyse the existing information and, as a result, estimate the values of auctioned items or the number of people participating in future auctions.

• Text clustering within the retrieved information. For example, structured relations can be extracted from unstructured text collections by finding the structure of Web documents and presenting a hierarchical structure to represent the relations among the text data in Web documents (Wong & Fu, 2000).

• Monitoring a competitor's Web site to find unexpected information, e.g., the offering of unexpected services and products. Because of the large number of competitors' Web sites and the huge amount of information in them, automatic discovery is required. For instance, association rule mining can discover frequent word combinations in a page that will lead a company to learn about its competitors (Liu, Ma, & Yu, 2001).

• Categorization of Web pages by discovering similarity and relationships among various Web sites using clustering or classification techniques. This leads to effectively searching the Web for the requested Web documents within the categories rather than across the entire Web. Cluster hierarchies of hypertext documents can be created by analysing semantic information embedded in link structures and document contents (Kosala & Blockeel, 2000). Documents can also be given classification codes according to the keywords present in them.

• Providing a higher level of organization for semi-structured or unstructured data available on the Web. Users do not scan the entire Web site to find the required information; instead, they use Web query languages to search within the document or to obtain structural information about Web documents. A Web query language restructures extracted information from Web information sources that are heterogeneous and semi-structured (Abiteboul, Buneman, & Suciu, 2000). An agent-based approach involving artificial intelligence systems can also organize Web-based information (Dignum & Cortes, 2001).

The goal of mining the secondary Web data is to capture the buying and traversing habits of customers in an e-business environment. Secondary Web data includes Web transaction data extracted from Web logs. Some of the data mining applications appropriate for this type of data are:

• Promoting cross-marketing strategies across products. Data mining techniques can analyse logs of different sales indicating customers' buying patterns (Cooley, 2003). Classification and clustering of Web access logs can help a company target its marketing (advertising) strategies to a certain group of customers. For example, classification rule mining is able to discover that a certain age group of people from a certain locality are likely to buy a certain group of products. Web-enabled e-businesses can also benefit from link analysis for repeat-buying recommendations. Schulz, Hahsler, and Jahn (1999) applied link analysis in traditional retail chains and found that 70% cross-selling potential exists. Association rule mining can find frequent products bought together. For example, association rule mining can discover rules such as: 75% of customers who place an order for product1 from the /company/product1/ page also place an order for product2 from the /company/product2/ page (a small counting sketch of this idea follows this list).

• Maintaining or restructuring Web sites to better serve the needs of customers. Data mining techniques can assist in Web navigation by discovering authority sites of a user's interest and overview sites for those authority sites. For instance, association rule mining can discover correlations between documents in a Web site and thus estimate the probability of documents being requested together (Lan, Bressan, & Ooi, 1999). An example association rule resulting from the analysis of a travelling e-business company's Web data is: 79% of visitors who browsed pages about Hotel also browsed pages on visitor information: places to visit. This rule can be used in redesigning the Web site by directly linking the authority and overview Web sites.

• Personalization of Web sites according to each individual's taste. Data mining techniques can assist in facilitating the development and execution of marketing strategies such as dynamically changing a particular Web site for a visitor (Mobasher, Cooley, & Srivastave, 1999). This is achieved by building a model representing correlations of Web pages and users. The goal is to find groups of users performing similar activities. The built model is capable of categorizing Web pages and users, and matching between and across Web pages and/or users (Mobasher et al., 1999). According to the clusters of user profiles, recommendations can be made to a visitor on a return visit or to new visitors (Spiliopoulou, Pohle, & Faulstich, 1999). For example, people accessing educational products on a company Web site between 6 and 8 p.m. on Friday can be considered academics and can be focused on accordingly.
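As a rough sketch of the page- and product-association idea used in the cross-marketing and site-restructuring applications above, one can count how often pairs of pages co-occur in user sessions and keep the high-confidence pairs as candidate rules. The session data, page names, and threshold below are illustrative assumptions, not output of the cited systems.

```python
from itertools import combinations
from collections import Counter

# Each session is the set of pages one visitor requested (hypothetical URLs).
sessions = [
    {"/company/product1/", "/company/product2/", "/hotel"},
    {"/company/product1/", "/company/product2/"},
    {"/hotel", "/visitor-info/places-to-visit"},
    {"/company/product1/", "/company/product2/", "/visitor-info/places-to-visit"},
    {"/hotel", "/visitor-info/places-to-visit"},
]

page_count, pair_count = Counter(), Counter()
for s in sessions:
    page_count.update(s)
    pair_count.update(combinations(sorted(s), 2))   # co-occurrence counts per page pair

min_conf = 0.7
for (a, b), joint in pair_count.items():
    for x, y in ((a, b), (b, a)):                   # try the rule in both directions
        conf = joint / page_count[x]
        if conf >= min_conf:
            print(f"{x} -> {y}  (support={joint}/{len(sessions)}, confidence={conf:.0%})")
```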


DIFFICULTIES IN APPLYING DATA MINING

The idea of discovering knowledge in large amounts of data with rich description is both appealing and intuitive, but technically it is challenging. Strategies should be implemented for better analysis of data collected from Web-enabled e-business sources.

• Data Format: Data collected from Web-enabled e-business sources is semi-structured and hierarchical. The data has no absolute schema fixed in advance, and the extracted structure may be irregular or incomplete. This type of data requires additional processing before it can be fed to traditional mining algorithms, whose source is mostly confined to structured data. This pre-processing includes transforming unstructured data to a format suitable for traditional mining methods. Web query languages can be used to obtain structural information from semi-structured data. Based on this structural information, data appropriate for mining techniques are generated. Web query languages that combine path expressions with an SQL-style syntax, such as Lorel or UnQL (Abiteboul et al., 2000), are a good choice for extracting structural information.

• Data Volume: Collected e-business data sets are large in volume. The mining techniques should be able to handle such large data sets. Enumeration of all patterns may be expensive and unnecessary. Instead, selecting representative patterns that capture the essence of the entire data set and using them for mining may prove a more effective approach, but then the selection of such a data set becomes a problem. A more efficient approach would be to use an iterative and interactive technique that takes real-time responses and feedback into account in the calculation. An interactive process involves a human analyst, so instant feedback can be included in the process. An iterative process first considers a selected number of attributes chosen by the user for analysis and then keeps adding other attributes for analysis until the user is satisfied. This iterative method reduces the search space significantly.

• Data Quality: Web server logs may not contain all the data needed. Also, noisy and corrupt data can hide patterns and make predictions harder (Kohavi, 2001). Nevertheless, the quality of data is increased with the use of electronic interchange: there is less noise present in the data due to electronic storage and processing, in comparison to manual processing of data. Data warehousing provides a capability for good-quality data storage. A warehouse integrates data from operational systems, e-business applications, and demographic data providers, and handles issues such as data inconsistency, missing values, and so on. A Web warehouse may be used as a data source. There have been some initiatives to warehouse the Web data generated from e-business applications, but there is still a long way to go in terms of data mining (Bhowmick, Madria, & Ng, 2003). Another solution for collecting good-quality Web data is the use of (1) a dedicated server recording all activities of each user individually, or (2) cookies or scripts in the absence of such a server (Chan, 1999; Kohavi, 2001). Agent-based approaches that involve artificial intelligence systems can also be used to discover such Web-based information.

• Data Adaptability: Data on the Web is ever changing. Data mining models and algorithms should be adapted to deal with real-time data such that new data is incorporated for analysis. The constructed data model should be updated as new data arrives. User-interface agents can be used to maximize the productivity of current users' interactions with the system by adapting behaviours. Another solution can be to dynamically modify mined information as the database changes (Cheung & Lee, 2000) or to incorporate user feedback to modify the actions performed by the system.

• XML Data: It is assumed that in a few years XML will be the most highly used language of the Internet for exchanging information. Assuming the metadata is stored in XML, the integration of two disparate data sources becomes much more transparent, field names are matched more easily, and semantic conflicts are described explicitly (Abiteboul et al., 2000). As a result, the types of data input to and output from the learned models and the detailed form of the models are determined. XML documents may not be completely in the same format, thus resulting in missing values when integrated. Various techniques, e.g., tag recognition, can be used to fill missing information created by the mismatch in attributes or tags (Abiteboul et al., 2000).


Moreover, many query languages, such as XML-QL, XSL and XML-GL (Abiteboul et al., 2000), are designed specifically for querying XML and getting structured information from these documents.

• Privacy Issues: There are always concerns about properly balancing a company's desire to use personal information against an individual's desire to protect it (Piatetsky-Shapiro, 2000). The possible solution is to (1) assure users of secure and reliable data transfer by using high-speed, high-value data encryption procedures, and/or (2) give users a choice about which information they reveal and give some benefit in exchange for revealing their information, such as a discount on certain shopping products.

FUTURE TRENDS

Earlier data mining tools such as C5 (http://www.rulequest.com) and several neural network packages (QuickLearn, Sompack, etc.) were limited to individual researchers. These individual algorithms are capable of solving a single data mining task. Now, however, second-generation data mining systems produced by commercial companies, such as Clementine (http://www.spss.com/clementine/), AnswerTree (http://www.spss.com/answertree/), SAS (http://www.sas.com/), IBM Intelligent Miner (http://www.ibm.com/software/data/iminer/) and DBMiner (http://db.cs.sfu.ca/DBMiner), incorporate multiple discovery tasks (classification, clustering, etc.), pre-processing (data cleaning, transformation, etc.) and post-processing (visualization) tasks, and are becoming known to the public and successful. Moreover, tools that combine ad hoc query or OLAP (online analytical processing) with data mining have also been developed (Wu, 2000). Faster CPUs, bigger disks, and easy network connectivity make these tools able to analyse large volumes of data.

CONCLUSION

It is easy to collect data from Web-enabled e-business sources, as visitors to a Web site leave a trail that is automatically stored in log files by the Web server. Data mining tools can process and analyse such Web server log files or Web contents to discover meaningful information. This analysis uncovers for companies the previously unknown buying habits of their online customers. More importantly, the fast feedback the companies obtain using data mining is very helpful in increasing the company's benefit.

REFERENCES

Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web: From relations to semistructured data and XML. California: Morgan Kaufmann.

Bhowmick, S.S., Madria, S.K., & Ng, W.K. (2003). Web data management: A warehouse approach. Springer Computing & Information Science.

Chan, P.K. (1999). A non-invasive learning approach to building Web user profiles. In Masand & Spiliopoulou (Eds.), WEBKDD99.

Cheung, D.W., & Lee, S.D. (2000). Maintenance of discovered association rules. In Knowledge Discovery for Business Information Systems (The Kluwer International Series in Engineering and Computer Science, 600). Boston: Kluwer Academic Publishers.

Cooley, R. (2003, May). The use of Web structure and content to identify subjectively interesting Web usage patterns. ACM Transactions on Internet Technology, 3(2).

Dignum, F., & Cortes, U. (Eds.). (2001). Agent-mediated electronic commerce III: Current issues in agent-based electronic commerce systems. Lecture Notes in Artificial Intelligence, Springer Verlag.

Garofalakis, M.N., Rastogi, R., Seshadri, S., & Hyuseok, S. (1999). Data mining and the Web: Past, present and future. In Proceedings of the Second International Workshop on Web Information and Data Management (pp. 43-47).

Kohavi, R. (2001). Mining e-commerce data: The good, the bad and the ugly. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001).

Kosala, R., & Blockeel, H. (2000, July). Web mining research: A survey. SIGKDD Explorations, 2(1), 1-15.

Lan, B., Bressan, S., & Ooi, B.C. (1999). Making Web servers pushier. In Masand & Spiliopoulou (Eds.), WEBKDD99.

Liu, B., Ma, Y., & Yu, P.H. (2001, August). Discovering unexpected information from your competitors' Web sites. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), San Francisco, USA.

Masand, B., & Spiliopoulou, M. (1999, August). KDD99 workshop on Web usage analysis and user profiling (WEBKDD99), San Diego, CA. ACM.

Mobasher, B., Cooley, R., & Srivastave, J. (1999). Automatic personalization based on Web usage mining. In Masand & Spiliopoulou (Eds.), WEBKDD99.

Piatetsky-Shapiro, G. (2000, January). Knowledge discovery in databases: 10 years after. SIGKDD Explorations, 1(2), 59-61, ACM SIGKDD.

Schulz, A.G., Hahsler, M., & Jahn, M. (1999). A customer purchase incidence model applied to recommendation service. In Masand & Spiliopoulou (Eds.), WEBKDD99.

Spiliopoulou, M., Pohle, C., & Faulstich, L.C. (1999). Improving the effectiveness of a Web site with Web usage mining. In Masand & Spiliopoulou (Eds.), WEBKDD99.

Wong, W.C., & Fu, A.W. (2000, July). Finding structure and characteristics of Web documents for classification. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM.

Wu, J. (2000, August). Business intelligence: What is data mining? In Data Mining Review Online.

KEY TERMS

Clustering Data Mining Task: To identify items with similar characteristics, thus creating a hierarchy of classes from the existing set of events. A data set is partitioned into segments of (homogeneous) elements that share a number of properties.

Data Mining (DM) or Knowledge Discovery in Databases: The extraction of interesting, meaningful, implicit, previously unknown, valid and actionable information from a pool of data sources.

Link Analysis Data Mining Task: Establishes internal relationships to reveal hidden affinity among items in a given data set. Link analysis exposes samples and trends by predicting correlations of items that are otherwise not obvious.

Mining of Primary Web Data: Assists in effectively interpreting the searched Web documents. The output of this mining process can help e-business customers improve the organization of retrieved results and increase the precision of information retrieval.

Mining of Secondary Web Data: Assists in capturing the buying and traversing habits of customers in an e-business environment. The output of this mining process can help an e-business predict customer behaviour in the future, personalize Web sites, and promote campaigns with cross-marketing strategies across products.

Predictive Modelling Data Mining Task: Makes predictions based on essential characteristics of the data. The classification task of data mining builds a model to map (or classify) a data item into one of several predefined classes. The regression task of data mining builds a model to map a data item to a real-valued prediction variable.

Primary Web Data: Includes actual Web contents.

Secondary Web Data: Includes Web transaction data extracted from Web logs. Examples are Web server access logs, proxy server logs, browser logs, registration data if any, user sessions, user queries, cookies, product correlations, and feedback from the customer companies.

Web-Enabled E-Business: A business transaction or interaction in which participants operate or transact business or conduct their trade electronically on the Web.


Mining Frequent Patterns via Pattern Decomposition
Qinghua Zou
University of California - Los Angeles, USA

Wesley Chu
University of California - Los Angeles, USA

INTRODUCTION

Pattern decomposition is a data-mining technology that uses known frequent or infrequent patterns to decompose a long itemset into many short ones. It finds frequent patterns in a dataset in a bottom-up fashion and reduces the size of the dataset in each step. The algorithm avoids the process of candidate set generation and decreases the time for counting supports due to the reduced dataset.

BACKGROUND

A fundamental problem in data mining is the process of finding frequent itemsets (FI) in a large dataset; frequent itemsets enable essential data-mining tasks, such as discovering association rules, mining data correlations, and mining sequential patterns. Three main classes of algorithms have been proposed:

• Candidates Generation and Test (Agrawal & Srikant, 1994; Heikki, Toivonen & Verkamo, 1994; Zaki et al., 1997): Starting at k=0, it first generates candidate k+1 itemsets from known frequent k itemsets and then counts the supports of the candidates to determine the frequent k+1 itemsets that meet a minimum support requirement.

• Sampling Technique (Toivonen, 1996): Uses a sampling method to select a random subset of a dataset for generating candidate itemsets and then tests these candidates to identify frequent patterns. In general, the accuracy of this approach is highly dependent on the characteristics of the dataset and the sampling technique that has been used.

• Data Transformation: Transforms an original dataset to a new one that contains a smaller search space than the original dataset. FP-tree-based mining (Han, Pei & Yin, 2000) first builds a compressed data representation from a dataset, and then mining tasks are performed on the FP-tree rather than on the dataset. It offers performance improvements over Apriori (Agrawal & Srikant, 1994), since infrequent items do not appear on the FP-tree and, thus, the FP-tree has a smaller search space than the original dataset. However, the FP-tree cannot reduce the search space further by using infrequent 2-item or longer itemsets.

What distinguishes pattern decomposition (Zou et al., 2002) from most previous works is that it reduces the search space of a dataset in each step of its mining process.

MAIN THRUST

Both the technology and an application will be discussed to help clarify the meaning of pattern decomposition.

Search Space Definition

Let N=X:Y be a transaction, where X, called the head of N, is the set of required items, and Y, called the tail of N, is the set of optional items. The set of possible subsets of Y is called the power set of Y, denoted by P(Y).

Definition 1

For N=X:Y, the set of all the itemsets obtained by concatenating X with the itemsets in P(Y) is called the search space of N, denoted as {X:Y}. That is,

{X:Y} = {X ∪ V | V ∈ P(Y)}.

For example, the search space {b:cd} includes four itemsets: b, bc, bd, and bcd. The search space {:abcde} includes all subsets of abcde.
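A small sketch of Definition 1 (the string-based itemset representation and helper name are assumptions): it enumerates the search space {X:Y} by joining the head with every subset of the tail.

```python
from itertools import chain, combinations

def search_space(head, tail):
    """All itemsets X ∪ V with V ⊆ Y (Definition 1); items are single characters here."""
    subsets = chain.from_iterable(combinations(tail, r) for r in range(len(tail) + 1))
    return [set(head) | set(v) for v in subsets]

# {b:cd} -> b, bc, bd, bcd: the four itemsets of the example above
for itemset in search_space("b", "cd"):
    print("".join(sorted(itemset)))
```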


By Definition 1, we have {X:Y}={X:Z}, where Z=Y−X refers to the set of items contained in Y but not in X. Thus, we will assume that Y does not contain any item in X when {X:Y} is mentioned in this article.

Definition 2

Let S, S1, and S2 be search spaces. The set {S1, S2} is a partition of S if and only if S = S1 ∪ S2 and S1 ∩ S2 = ∅. The relationship is denoted by S=S1+S2, or S1=S−S2, or S2=S−S1. We say S is partitioned into S1 and S2. Similarly, a set {S1, S2, …, Sk} is a partition of S if and only if S = S1 ∪ S2 ∪ … ∪ Sk and Si ∩ Sj = ∅ for i, j ∈ [1..k] and i ≠ j. We denote it as S=S1+S2+…+Sk.

Let a be an item, where aX is the itemset obtained by concatenating a with X.

Theorem 1

For a ∉ X, Y, the search space {X:aY} can be partitioned into {Xa:Y} and {X:Y} by item a (i.e., {X:aY}={Xa:Y}+{X:Y}).

Proof

It follows from the fact that each itemset of {X:aY} either contains a (i.e., {Xa:Y}) or does not contain a (i.e., {X:Y}). For example, we have {b:cd}={bc:d}+{b:d}.

Theorem 2

Partition Search Space: Let a1, a2, …, ak be distinct items and a1a2…akY be an itemset; the search space {X:a1a2…akY} can be partitioned into

∑i=1..k {Xai : ai+1…akY} + {X:Y}, where ai ∉ X, Y.

Proof

It follows by partitioning the search space via items a1, a2, …, ak sequentially as in Theorem 1. For example, we have {b:cd}={bc:d}+{bd:}+{b:} and {a:bcde}={ab:cde}+{ac:de}+{a:de}.

Let {X:Y} be a search space and Z be a known frequent itemset. Since Z is frequent, all subsets of Z will be frequent (i.e., every itemset of {:Z} is frequent). Theorem 3 shows how to prune the space {X:Y} by Z.

Theorem 3

Pruning Search Space: If Z does not contain the head X, the space {X:Y} cannot be pruned by Z (i.e., {X:Y}−{:Z}={X:Y}). Otherwise, the space can be pruned as

{X:Y}−{:Z} = ∑i=1..k {Xai : ai+1…ak(Y∩Z)}, where a1a2…ak = Y−Z.

Proof

If Z does not contain X, no itemset in {X:Y} is subsumed by Z. Therefore, knowing that Z is frequent, we cannot prune any part of the search space {X:Y}.

Otherwise, when X is a subset of Z, we have {X:Y} = ∑i=1..k {Xai : ai+1…akV} + {X:V}, where V=Y∩Z. The head in the first part is Xai, where ai is a member of Y−Z. Since Z does not contain ai, the first part cannot be pruned by Z. For the second part, we have {X:V}−{:Z}={X:V}−{X:(Z−X)}. Since X∩Y=∅, we have V ⊆ Z−X. Therefore, {X:V} can be pruned away entirely.

For example, we have {:bcde}−{:abcd} = {:bcde}−{:bcd} = {e:bcd}. Here, a is irrelevant and is removed in the first step. Another example is {e:bcd}−{:abe} = {e:bcd}−{:be} = {e:bcd}−{e:b} = {ec:bd}+{ed:b}.

Pattern Decomposition

Given a known frequent itemset Z, we are able to decompose the search space of a transaction N=X:Y to N′=Z:Y′ if X is a subset of Z, where Y′ is the set of items that appear in Y but not in Z; this is denoted by PD(N=X:Y|Z) = Z:Y′.

For example, if we know that an itemset abc is frequent, we can decompose a transaction N=a:bcd into N′=abc:d; that is, PD(a:bcd|abc)=abc:d.

Given a known infrequent itemset Z, we can also decompose the search space of a transaction N=X:Y. For simplicity, we use three examples to show the decomposition by known infrequent itemsets and leave out the formal mathematical formula for general cases. Interested readers can refer to Zou, Chu, and Lu (2002) for details. For example, if N=d:abcef, then for known infrequent itemsets we have:

For the infrequent 1-itemset ~a, PD(d:abcef|~a) = d:bcef, by dropping a from the tail.
For the infrequent 2-itemset ~ab, PD(d:abcef|~ab) = d:bcef + da:cef, by excluding ab.


For the infrequent 3-itemset ~abc, PD(d:abcef|~abc) = d:bcef + da:cef + dab:ef, by excluding abc.

By decomposing a transaction t, we reduce the number of items in its tail and thus reduce its search space. For example, the search space of a:bcd contains the following eight itemsets: {a, ab, ac, ad, abc, abd, acd, abcd}. Its decomposition result, abc:d, contains only two itemsets, {abc, abcd}, which is only 25% of the original search space.

When using pattern decomposition, we find frequent patterns in a step-wise fashion, starting at step 1 for 1-item itemsets. At step k, we first count the support of every possible k-item itemset contained in the dataset Dk to find the frequent k-item itemsets Lk and the infrequent k-item itemsets ~Lk. Then, using Lk and ~Lk, Dk can be decomposed into Dk+1, which has a smaller search space than Dk. These steps continue until the search space Dk becomes empty.
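To illustrate the decomposition step just described, the sketch below applies PD to one transaction with a known frequent itemset and with a known infrequent 2-itemset. The tuple representation and function names are assumptions, and the infrequent case only handles the situation shown in the examples, where both items of the pair sit in the tail.

```python
def pd_frequent(head, tail, z):
    """PD(head:tail | frequent z): grow the head to z and drop z's items from the tail."""
    head, tail, z = set(head), set(tail), set(z)
    if not head <= z:
        return [(head, tail)]                  # z does not cover the head: no change
    return [(z, tail - z)]

def pd_infrequent_pair(head, tail, a, b):
    """PD(head:tail | infrequent ab), assuming a and b are both in the tail:
    split the space so that a and b never occur together."""
    head, tail = set(head), set(tail)
    return [(head, tail - {a}),                # itemsets without a
            (head | {a}, tail - {a, b})]       # itemsets with a but without b

def show(ts):
    return " + ".join("%s:%s" % ("".join(sorted(h)), "".join(sorted(t))) for h, t in ts)

print(show(pd_frequent("a", "bcd", "abc")))              # abc:d, i.e. PD(a:bcd|abc)
print(show(pd_infrequent_pair("d", "abcef", "a", "b")))  # d:bcef + ad:cef (the da:cef above)
```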
An Application

The motivation for our work originates from the problem of finding multi-word combinations in a group of medical report documents, where sentences can be viewed as transactions and words can be viewed as items. The problem is to find all multi-word combinations that occur in at least two sentences of a document. As a simple example, consider the following text:

Aspirin greatly underused in people with heart disease. DALLAS (AP) Too few heart patients are taking aspirin, despite its widely known ability to prevent heart attacks, according to a study released Monday. The study, published in the American Heart Association's journal Circulation, found that only 26% of patients who had heart disease and could have benefited from aspirin took the pain reliever. "This suggests that there's a substantial number of patients who are at higher risk of more problems because they're not taking aspirin," said Dr. Randall Stafford, an internist at Harvard's Massachusetts General Hospital, who led the study. "As we all know, this is a very inexpensive medication, very affordable." The regular use of aspirin has been shown to reduce the risk of blood clots that can block an artery and trigger a heart attack. Experts say aspirin also can reduce the risk of a stroke and angina, or severe chest pain. Because regular aspirin use can cause some side effects, such as stomach ulcers, internal bleeding, and allergic reactions, doctors too often are reluctant to prescribe it for heart patients, Stafford said. "There's a bias in medicine toward treatment, and within that bias, we tend to underutilize preventative services, even if they've been clearly proven," said Marty Sullivan, a professor of cardiology at Duke University in Durham, North Carolina. Stafford's findings were based on 1996 data from 10,942 doctor visits by people with heart disease. The study may underestimate aspirin use; some doctors may not have reported instances in which they recommended patients take over-the-counter medications, he said. He called the data "a wake-up call" to doctors who focus too much on acute medical problems and ignore general prevention.

We can find frequent one-word, two-word, three-word, four-word, and five-word combinations. For instance, we found 14 four-word combinations:

heart aspirin use regul, aspirin they take not, aspirin patient take not, patient doct use some, aspirin patient study take, patient they take not, aspirin patient use some, aspirin doct use some, aspirin patient they not, aspirin patient they take, aspirin patient doct some, heart aspirin patient too, aspirin patient doct use, heart aspirin patient study.

Multi-word combinations are effective for document indexing and summarization. The work of Johnson et al. (2002) shows that multi-word combinations can index documents more accurately than single-word indexing terms. Multi-word combinations can delineate the concepts or content of a domain-specific document collection more precisely than single words. For example, from the frequent one-word table, we may infer that heart, aspirin, and patient are the most important concepts in the text, since they occur more often than others. From the frequent two-word table, we see a large number of two-word combinations with aspirin (i.e., aspirin patient, heart aspirin, aspirin use, aspirin take, etc.). This suggests that the document emphasizes aspirin and aspirin-related topics more than any other words.

FUTURE TRENDS

There is a growing need for mining frequent sequence patterns from human genome datasets. There are 23 pairs of human chromosomes, approximately 30,000 genes, and more than 1,000,000 proteins. The previously discussed pattern decomposition method can be used to capture sequential patterns with some small modifications.

When the frequent patterns are long, mining frequent itemsets (FI) is infeasible because of the exponential number of frequent itemsets. Thus, algorithms mining frequent closed itemsets (FCI) (Pasquier, Bastide, Taouil & Lakhal, 1999; Pei, Han & Mao, 2000; Zaki & Hsiao, 1999) have been proposed, since FCI is enough to generate association rules.


However, FCI also could be as exponentially large as the FI. As a result, many algorithms for mining maximal frequent itemsets (MFI) have been proposed, such as Mafia (Burdick, Calimlim & Gehrke, 2001), GenMax (Gouda & Zaki, 2001), and SmartMiner (Zou, Chu & Lu, 2002). The main idea of pattern decomposition is also used in SmartMiner, except that SmartMiner uses tail information (frequent itemsets) to decompose the search space of a dataset rather than the dataset itself. While pattern decomposition avoids candidate set generation, SmartMiner avoids superset checking, which is a time-consuming process.

CONCLUSION

We propose to use pattern decomposition to find frequent patterns in large datasets. The PD algorithm shrinks the dataset in each pass so that the search space of the dataset is reduced. Pattern decomposition avoids the costly candidate set generation procedure, and using reduced datasets greatly decreases the time for support counting.

ACKNOWLEDGMENT

This research is supported by NSF IIS ITR Grant # 6300555.

REFERENCES

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 1994 International Conference on Very Large Data Bases.

Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. Proceedings of the International Conference on Data Engineering.

Gouda, K., & Zaki, M.J. (2001). Efficiently mining maximal frequent itemsets. Proceedings of the IEEE International Conference on Data Mining, San Jose, California.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. Proceedings of the 2000 ACM International Conference on Management of Data, Dallas, Texas.

Heikki, M., Toivonen, H., & Verkamo, A.I. (1994). Efficient algorithms for discovering association rules. Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, Seattle, Washington.

Johnson, D., Zou, Q., Dionisio, J.D., Liu, Z., & Chu, W.W. (2002). Modeling medical content for automated summarization. Annals of the New York Academy of Sciences.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999). Discovering frequent closed itemsets for association rules. Proceedings of the 7th International Conference on Database Theory.

Pei, J., Han, J., & Mao, R. (2000). Closet: An efficient algorithm for mining frequent closed itemsets. Proceedings of the SIGMOD International Workshop on Data Mining and Knowledge Discovery.

Toivonen, H. (1996). Sampling large databases for association rules. Proceedings of the 22nd International Conference on Very Large Data Bases, Bombay, India.

Zaki, M.J., & Hsiao, C. (1999). Charm: An efficient algorithm for closed association rule mining. Technical Report 99-10. Rensselaer Polytechnic Institute.

Zaki, M.J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. Proceedings of the Third International Conference on Knowledge Discovery in Databases and Data Mining.

Zou, Q., Chu, W., Johnson, D., & Chiu, H. (2002). Using pattern decomposition (PD) methods for finding all frequent patterns in large datasets. Journal of Knowledge and Information Systems (KAIS).

Zou, Q., Chu, W., & Lu, B. (2002). SmartMiner: A depth-first algorithm guided by tail information for mining maximal frequent itemsets. Proceedings of the IEEE International Conference on Data Mining, Japan.

KEY TERMS

Frequent Itemset (FI): An itemset whose support is greater than or equal to the minimal support.

Infrequent Pattern: An itemset that is not a frequent pattern.

Minimal Support (minSup): A user-given number that specifies the minimal number of transactions in which a pattern of interest should be contained.

Pattern Decomposition: A technique that uses known frequent or infrequent patterns to reduce the search space of a dataset.

Search Space: The union of the search space of every transaction in a dataset.


Search Space of a Transaction N=X:Y: The set of unknown frequent itemsets contained by N. Its size is determined by the number of items in the tail of N, i.e., Y.

Support of an Itemset x: The number of transactions that contain x.

Transaction: An instance that usually contains a set of items. In this article, we extend a transaction to a composition of a head and a tail (i.e., N=X:Y), where the head represents a known frequent itemset, and the tail is the set of items for extending the head toward new frequent patterns.

794

TEAM LinG
795

Mining Group Differences M


Shane M. Butler
Monash University, Australia

Geoffrey I. Webb
Monash University, Australia

INTRODUCTION Rule discovery is the process of finding rules that best


describe a dataset. A dataset is a collection of records in
Finding differences among two or more groups is an which each record contains one or more discrete attribute-
important data-mining task. For example, a retailer might value pairs (or items). A rule is simply a combination of
want to know what the different is in customer purchas- conditions that, if true, can be used to predict an outcome.
ing behaviors during a sale compared to a normal trading A hypothetical rule about consumer purchasing behav-
day. With this information, the retailer may gain insight iors, for example, might be IF buys_milk AND
into the effects of holding a sale and may factor that into buys_cookies THEN buys_cream.
future campaigns. Another possibility would be to in- Association rule discovery (Agrawal, Imielinski &
vestigate what is different about customers who have a Swami, 1993; Agrawal & Srikant, 1994) is a popular
loyalty card compared to those who dont. This could rule-discovery approach. In association rule mining,
allow the retailer to better understand loyalty rules are sought specifically in the form of where the
cardholders, to increase loyalty revenue, or to attempt antecedent group of items (or itemset), A, implies the
to make the loyalty program more appealing to non- consequent itemset, C. An association rule is written as
cardholders. A C . Of particular interest are the rules where the
This article gives an overview of such group mining probability of C is increased when the items in A also
techniques. First, we discuss two data-mining methods occur. Often association rule-mining systems restrict
designed specifically for this purposeEmerging Pat- the consequent itemset to hold only one item as it
terns and Contrast Sets. We will discuss how these two reduces the complexity of finding the rules.
methods relate and how other methods, such as explor- In association rule mining, we often are searching
atory rule discovery, can also be applied to this task. for rules that fulfill the requirement of a minimum
Exploratory data-mining techniques, such as the tech- support criteria, minsup, and a minimum confidence
niques used to find group differences, potentially can criteria, minconf. Where support is defined as the fre-
result in a large number of models being presented to the quency with which A and C co-occur:
user. As a result, filter mechanisms can be a useful way to
automatically remove models that are unlikely to be of
support( A C ) = frequency(A C )
interest to the user. In this article, we will examine a
number of such filter mechanisms that can be used to
and confidence is defined as the frequency with which A
reduce the number of models with which the user is
and C co-occur, divided by the frequency with which A
confronted.
occurs throughout all the data:

BACKGROUND support ( A C )
confidence(A C ) =
frequency( A)
There have been two main approaches to the group
discovery problem from two different schools of The association rules discovered through this pro-
thought. The first, Emerging Patterns, evolved as a clas- cess then are sorted according to some user-specified
sification method, while the second, Contrast Sets, interestingness measure before they are displayed to
grew as an exploratory method. The algorithms of both the user.
approaches are based on the Max-Miner rule discovery Another type of rule discovery is k-most interesting
system (Bayardo Jr., 1998). Therefore, we will briefly rule discovery (Webb, 2000). In contrast to the support-
describe rule discovery. confidence framework, there is no minimum support or

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Mining Group Differences

confidence requirement. Instead, k-most interesting rule exploratory method for finding differences between one
discovery focuses on the discovery of up to k rules that group and another that the user can utilize, rather than as
maximize some user-specified interestingness measure. a classification system focusing on prediction accuracy.
To this end, they present filtering and pruning methods
to ensure only the most interesting and optimal number
MAIN THRUST rules are shown to the user, from what is potentially a large
space of possible rules.
Emerging Patterns Contrast Sets are discovered using STUCCO, an algo-
rithm that is based on the Max-Miner search algorithm
Emerging Pattern analysis is applied to two or more (Bayardo Jr., 1998). Initially, only Contrast Sets are sought
datasets, where each dataset contains data relating to a that have supports that are both significant and the
different group. An Emerging Pattern is defined as an difference large (i.e., the difference is greater than a user-
itemset whose support increases significantly from one defined parameter, mindev). Significant Contrast Sets
group to another (Dong & Li, 1999). This support in- (cset), therefore, are defined as those that meet the criteria:
crease is represented by the growth ratethe ratio of
support of an itemset in group 1 over that of group 2. The P(cset | Gi ) P (cset | G j )
support of a group G is given by:
Large Contrast Sets are those for which:
count G ( X )
supp G (X ) =
G support (cset, Gi ) support (cset , G j ) mindev

As Bay and Pazzani have noted, the user is likely to be


The GrowthRate( X ) is defined as 0 if supp1 (X ) = 0 and
overwhelmed by the number of results. Therefore, a filter
supp 2 (X ) = 0 ; if supp1 (X ) = 0 and supp 2 ( X ) 0 ; or method is applied to reduce the number of Contrast Sets
else supp 2 ( X ) supp1 (X ) . The special case where presented to the user and to control the risk of type-1 error
(i.e., the risk of reporting a Contrast Set when no differ-
GrowthRate(X ) = is called a Jumping Emerging Pattern,
ence exists). The filter method employed involves a chi-
as it is said to have jumped from not occurring in one group square test of statistical significance between the itemset
to occurring in another group. This also can be thought on one group to that Contrast Set on the other group(s).
of as an association rule having a confidence equaling 1.0. A correction for multiple comparisons is applied that
Emerging Patterns are not presented to the user, as lowers the value of as the size of the Contrast Set
models are in the exploratory discovery framework. Rather, (number of attribute value pairs) increases.
the Emerging Pattern discovery research has focused on Further pruning mechanisms also are used to filter
using the mined Emerging Patterns for classification, Contrast Sets that are purely specializations of other more
similar to the goals of Liu et al. (1998, 2001). Emerging general Contrast Sets. This is done using another chi-
Pattern mining-based classification systems include CAEP square test of significance to test the difference between
(Dong, Zhang, Wong & Li, 1999), JEP-C (Li, Dong & the parent Contrast Set and its specialization Contrast Set.
Ramamohanarao, 2001), BCEP (Fan & Ramamohanarao,
2003), and DeEP (Li, Dong, Ramamohanarao & Wong,
Mining Group Differences Using Rule
2004). Since the Emerging Patterns are classification based,
the focus is on classification accuracy. This means no Discovery
filtering method is used, other than the infinite growth rate
constraint used during discovery by some the classifiers Webb, Butler, and Newlands (2003) studied how Con-
(e.g., JEP-C and DeEP). This constraint discards any trast Sets relate to generic rule discovery approaches.
They used the OPUS_AR algorithm-based Magnum Opus
Emerging Pattern X for which GrowthRate(X ) .
software to discover rules and to compare them to those
discovered by the STUCCO algorithm.
Contrast Sets OPUS_AR (Webb, 2000) is a rule-discovery algo-
rithm based on the OPUS (Webb, 1995) efficient search
Contrast Sets (Bay & Pazzani, 1999, 2001) are similar to technique, to which the Max-Miner algorithm is closely
Emerging Patterns, in that they are also itemsets whose related. By limiting the consequent to a group variable,
support differs significantly across datasets. However, this rule discovery framework is able to be adapted for
the focus of Contrast Set research has been to develop an group discovery.

796

TEAM LinG
Mining Group Differences

While STUCCO and Magnum Opus specify different conducted a study with a retailer to find interesting pat-
support conditions in the discovery phase, their condi- terns between transactions from two different days. This M
tions were proven to be equivalent (Webb et al., 2003). data was traditional market-basket transactional data, con-
Further investigation found that the key difference be- taining the purchasing behaviors of customers across the
tween the two techniques was the filtering technique. many departments. Magnum Opus was used with the
Magnum Opus uses a binomial sign test to filter spurious group, encoded as a variable and the consequent re-
rules, while STUCCO uses a chi-square test. STUCCO stricted to that variable only.
attempts to control the risk of type-1 error by applying a In this experiment, Magnum Opus discovered all of
correction for multiple comparisons. However, such a the Contrast Sets that STUCCO found, and more. This is
correction, when given a large number of tests, will reduce indicative of the more lenient filtering method of Mag-
the value to an extremely low number, meaning that the num Opus. It was also interesting that, while all of the
risk of type-2 error (i.e., the risk of not accepting a non- Contrast Sets discovered by STUCCO were only of size
spurious rule) is substantially increased. Magnum Opus 1, Magnum Opus discovered conjunctions of sizes up to
does not apply such corrections so as not to increase the three department codes.
risk of type-2 error. This information was presented to the retail market-
While a chi-square approach is likely to be better suited ing manager in the form of a survey. For each rule, the
to Contrast Set discovery, the correction for multiple com- manager was asked if the rule was surprising and if it was
parisons, combined with STUCCOs minimum difference, is potentially useful to the organization. For ease of under-
a much stricter filter than that employed by Magnum Opus. standing, the information was transformed into a plain
As a result of Magnum Opus much more lenient filter text statement.
mechanisms, many more rules are being presented to the end The domain expert judged a greater percentage of the
user. After finding that the main difference between the Magnum Opus rules of surprise than the STUCCO con-
systems was their control of type-1 and type-2 errors via trasts; however, the result was not statistically signifi-
differing statistical test methods, Webb, et al. (2003) con- cant. The percentage of rules found that potentially were
cluded that Contrast Set mining is, in fact, a special case of useful were similar for both systems. In this case, Mag-
the rule discovery task. num Opus probably found some rules that were spurious,
Experience has shown that filters are important for and STUCCO probably failed to discover some rules that
removing spurious rules, but it is not obvious which of the were potentially interesting.
filtering methods used by systems like Magnum Opus and
STUCCO is better suited to the group discovery task.
Given the apparent tradeoff between type-1 and type-2 FUTURE TRENDS
error in these data-mining systems, recent developments
(Webb, 2003) have focused on a new filter method to avoid Mining differences among groups will continue to grow
introducing type-1 and type-2 errors. This approach di- as an important research area. One area likely to be of
vides the dataset into exploratory and holdout sets. Like future interest is improving filter mechanisms. Experi-
the training and test set method of statistically evaluating ence has shown that the use of filter is important, as it
a model within the classification framework, one set is used reduces the number of rules, thus avoiding overwhelm-
for learning (the exploratory set) and the other is used for ing the user. There is a need to develop alternative filters
evaluating the models (the holdout set). A statistical test as well as to determine which filters are best suited to
then is used for the filtering of spurious rules, and it is different types of problems.
statistically sound, since the statistical tests are applied An interestingness measure is a user-generated speci-
using a different set. A key difference between the tradi- fication of what makes a rule potentially interesting.
tional training and test set methodology of the classifica- Interestingness measures are another important issue,
tion framework and the new holdout technique is that many because they attempt to reflect the users interest in a
models are being evaluated in the exploratory framework model during the discovery phase. Therefore, the devel-
rather than only one model in the classification framework. opment of new interestingness measures and determina-
We envisage the holdout technique will be one area of tion of their appropriateness for different tasks are both
future research, as it is adapted by exploratory data-mining expected to be areas of future study.
techniques as a statistically sound filter method. Finally, while the methods discussed in this article
focus on discrete attribute-value data, it is likely that
Case Study there will be future research on how group mining can
utilize quantitative, structural, and sequence data. For
In order to evaluate STUCCO and the more lenient Magnum example, group mining of sequence data could be used
Opus filter mechanisms, Webb, Butler, and Newlands (2003) to investigate what is different about the sequence of

797

TEAM LinG
Mining Group Differences

events between fraudulent and non-fraudulent credit Bayardo, Jr., R.J. (1998). Efficiently mining long patterns
card transactions. from databases. Proceedings of the 1998 ACM SIGMOD
International Conference on Management of Data, 85-93,
Seattle, Washington, USA.
CONCLUSION Dong, G., & Li, J. (1999). Efficient mining of emerging
patterns: Discovering trends and differences. Proceed-
We have presented an overview of techniques for mining ings of the Fifth International Conference on Knowledge
differences among groups, discussing Emerging Pattern Discovery and Data Mining, San Diego, California, USA.
discovery, Contrast Set discovery, and association rule
discovery approaches. Emerging Patterns are useful in a Dong, G., Zhang, X., Wong, L., & Li, J. (1999). CAEP:
classification system where prediction accuracy is the Classification by aggregating emerging patterns. Pro-
focus but are not designed for presenting the group ceedings of the Second International Conference on
differences to the user and thus dont have any filters. Discovery Science, Tokyo, Japan.
Exploratory data mining can result in a large number of
rules. Contrast Set discovery is an exploratory technique Fan, H., & Ramamohanarao, K. (2003). A Bayesian
that includes mechanisms to filter spurious rules, thus approach to use emerging patterns for classification.
reducing the number of rules presented to the user. By Proceedings of the 14th Australasian Database Confer-
forcing the consequent to be the group variable during ence, Adelaide, Australia.
rule discovery, generic rule discovery software like Mag- Li, J., Dong, G., & Ramamohanarao, K. (2001). Making
num Opus can be used to discover group differences. The use of the most expressive jumping emerging patterns
number of differences reported to the user by STUCCO for classification. Knowledge and Information Sys-
and Magnum Opus are related to the different filter mecha- tems, 3(2), 131-145.
nisms for controlling the output of potentially spurious
rules. Magnum Opus uses a more lenient filter than Li, J., Dong, G., Ramamohanarao, K., & Wong, L. (2004).
STUCCO and thus presents more rules to the user. A new DeEPs: A new instance-based lazy discovery and classi-
method, the holdout technique, will be an improvement fication system. Machine Learning, 54(2), 99-124.
over other filter methods, since the technique is statisti- Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classifi-
cally sound. cation and association rule mining. Proceedings of the
Fourth International Conference on Knowledge Dis-
covery and Data Mining, New York, New York.
REFERENCES
Liu, B., Ma, Y., & Wong, C.K. (2001). Classification
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining using association rules: Weaknesses and enhancements.
association rules between sets of items in large data- In V. Kumar et al. (Eds.), Data mining for scientific and
bases. Proceedings of the 1993 ACM SIGMOD Interna- engineering applications (pp. 506-605). Boston: Kluwer
tional Conference on Management of Data, Washing- Academic Publishing.
ton, D.C., USA. Webb, G.I. (1995). An efficient admissible algorithm for
Agrawal, R., & Srikant, R. (1994). Fast algorithms for unordered search. Journal of Artificial Intelligence Re-
mining association rules. Proceedings of the 20th Inter- search, 3, 431-465.
national Conference on Very Large Data Bases, Santiago, Webb, G.I. (2000). Efficient search for association
Chile. rules. Proceedings of the Sixth ACM SIGKDD Interna-
Bay, S.D., & Pazzani, M.J. (1999). Detecting change in tional Conference on Knowledge Discovery and Data
categorical data: Mining contrast sets. Proceedings of Mining, Boston, Massachesetts, USA.
the Fifth ACM SIGKDD International Conference on Webb, G.I. (2003). Preliminary investigations into sta-
Knowledge Discovery and Data Mining, San Diego, tistically valid exploratory rule discovery. Proceedings
USA. of the Australasian Data Mining Workshop, Canberra,
Bay, S.D., & Pazzani, M.J. (2001). Detecting group differ- Australia.
ences: Mining contrast sets. Data Mining and Knowl- Webb, G.I., Butler, S.M., & Newlands, D. (2003). On
edge Discovery, 5(3), 213-246. detecting differences between groups. Proceedings of
the Ninth ACM SIGKDD International Conference on

798

TEAM LinG
Mining Group Differences

Knowledge Discovery and Data Mining, Washington, Growth Rate: The ratio of the proportion of data
D.C., USA. covered by the Emerging Pattern in one group over the M
proportion of the data it covers in another group.
Holdout Technique: A filter technique that splits the
KEY TERMS data into exploratory and holdout sets. Rules discovered
from the exploratory set then can be evaluated against the
Association Rule: A rule relating two itemsets holdout set using statistical tests.
the antecedent and the consequent. The rule indicates Itemset: A conjunction of items (attribute-value pairs)
that the presence of the antecedent implies that the
consequent is more probable in the data. Written as (e.g., age = teen hair = brown ).
AC. k-Most Interesting Rule Discovery: The process of
Contrast Set: Similar to an Emerging Pattern, it is also finding k rules that optimize some interestingness mea-
an itemset whose support differs across groups. The main sure. Minimum support and/or confidence constraints are
difference is the methods application as an exploratory not used.
technique rather than as a classification one.
Market Basket: An itemset; this term is sometimes
Emerging Pattern: An itemset that occurs signifi- used in the retail data-mining context, where the itemsets
cantly more frequently in one group than another. Uti- are collections of products that are purchased in a single
lized as a classification method by several algorithms. transaction.

Filter Technique: Any technique for reducing the Rule Discovery: The process of finding rules that
number of models with the aim of avoiding overwhelm- then can be used to predict some outcome (e.g., IF 13
ing the user. <= age <= 19 THEN teenager).

799

TEAM LinG
800

Mining Historical XML


Qiankun Zhao
Nanyang Technological University, Singapore

Sourav Saha Bhowmick


Nanyang Technological University, Singapore

INTRODUCTION BACKGROUND

Nowadays the Web poses itself as the largest data re- The historical XML mining research is largely inspired
pository ever available in the history of humankind by two research communities: XML data mining and
(Reis et al., 2004). However, the availability of huge XML data change detection. The XML data mining
amount of Web data does not imply that users can get community has looked at developing novel algorithms
whatever they want more easily. On the contrary, the to mine snapshots of XML data. The database commu-
massive amount of data on the Web has overwhelmed nity has focused on detecting, representing, and query-
their abilities to find the desired information. It has ing changes to XML data.
been claimed that 99% of the data reachable on the Web Some of the initial work for XML data mining is
is useless to 99% of the users (Han & Kamber, 2000, pp. based on the use of the XPath language as the main
436). That is, an individual may be interested in only a component to query XML documents (Braga et al.,
tiny fragment of the Web data. However, the huge and 2002; Braga et al., 2003). In Braga et al. (2002, 2003),
diverse properties of Web data do imply that Web data the authors presented the XMINE operator, which is a
provides a rich and unprecedented data mining source. tool developed to extract XML association rules for
Web mining was introduced to discover hidden knowl- XML documents. The operator is based on XPath and
edge from Web data and services automatically (Etzioni, inspired by the syntax of XQuery. It allows us to express
1996). According to the type of Web data, Web mining complex mining tasks, compactly and intuitively. XMINE
can be classified into three categories: Web content can be used to specify indifferently (and simultaneously)
mining, Web structure mining, and Web usage mining mining tasks both on the content and on the structure of
(Madria et al., 1999). Web content mining is to extract the data, since the distinction in XML is slight.
patterns from online information such as HTML files, Other works for XML data mining focus on extract-
e-mails, or images (Dumais & Chen, 2000; Ester et al., ing the frequent tree patterns from the structure of XML
2002). Web structure mining is to analysis the link data such as TreeFinder (Termier et al., 2002) and
structures of Web data, which can be inter-links among TreeMiner (Zaki, 2002). TreeFinder uses an Inductive
different Web documents (Kleinberg 1998) or intra- Logic Programming approach. Notice that TreeFinder
links within individual Web document (Arasu & Hector, cannot produce complete results. It may miss many
2003; Lerman et al., 2004). Web usage mining is defined frequent subtrees, especially when the support thresh-
as to discover interesting usage patterns from the sec- old is small or trees in the database have common node
ondary data derived from the interaction of users while labels. TreeMiner can produce the complete results by
surfing the Web (Srivastava et al., 2000; Cooley, 2003). using a novel vertical representation for fast subtree
Recently, XML is widely used as a standard for data support counting.
exchanging in the Internet. Existing work on XML data Different from the above techniques, which focus on
mining includes frequent substructure mining (Inokuchi designing ad-hoc algorithms to extract structures that
et al., 2000; Kuramochi & Karypis, 2001; Zaki, 2002, occur frequently in the snapshot data collections, his-
Yan & Han, 2003; Huan et al., 2003), classification torical XML mining focus on the sequence of changes
(Zaki & Aggarwal, 2003; Huan et al., 2004), and associa- among XML versions.
tion rule mining (Braga et al., 2002). As data in different Considering the dynamic nature of XML data, many
domains can be represented as XML documents, XML efforts have been directed into the research of change
data mining can be useful in many applications such as detection for XML data. XML TreeDiff (Curbera &
bioinformatics, chemistry, network analysis (Deshpande Epstein, 1999) computes the difference between two
et al., 2003; Huan et al., 2004) and etcetera. XML documents using hash values and simple tree

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Mining Historical XML

comparison algorithm. XyDiff (Cobena et al., 2002) is are inserted and deleted frequently under others.
proposed to detect changes of ordered XML docu- Such change patterns can be critical for monitoring M
ments. Besides insertion, deletion, and updating, and predicting trends in e-commerce Web sites.
XyDiff also support move operation. X-Diff (Wang et They may imply certain underlying semantic mean-
al., 2003) is designed to detect changes of unordered ings; and can be exploited by strategy makers.
XML documents. In our historical XML mining, we
extend the XML change detection techniques to dis- Applications
cover hidden knowledge from the history of changes to
XML data with data mining techniques. Such novel knowledge can be useful in different applica-
tions, such as intelligent change detection for very large
XML documents, Web usage mining, dynamic XML
MAIN THRUST indexing, association rule mining, evolutionary pattern
based classification, and etcetera. We only elaborate on
Overview the first two applications due to the limitation of space.
Suppose one can discover substructures that change
Consider the dynamic property of XML data and exist- frequently and those that do not (frozen structures),
ing XML data mining research; it can be observed that then he/she can use this knowledge to detect changes to
the dynamic nature of XML leads to two challenging relevant portions of the documents at different fre-
problems in XML data mining. First is the maintenance quency based on their change patterns. That is, one can
of the mining results for existing mining techniques. As detect changes to frequently changing content and struc-
the data source changes, new knowledge may be found ture at a different frequency compared to structures that
and some old ones may not be valid. Second, is the do not change frequently. Moreover, one may ignore
discovery of novel knowledge hidden behind the histori- frozen substructures during change detection, as most
cal changes, some of them are difficult or impossible to likely they are not going to change. As one of the major
be discovered from snapshot data. In this paper, we limitations of existing XML change detection systems
focus on the second issue. That is, to discover novel (Cobena et al., 2002; Wang et al., 2003) is that they are
hidden knowledge from the historical changes to XML not scalable for very large XML documents. Knowledge
data. Suppose there is a sequence of XML documents, extracted from historical changes can be used to im-
which are different versions of the same documents. prove the scalability of XML change detection systems.
Then, following novel knowledge can be discovered. Recently, a lot of work has been done in Web usage
Note that, by no means we claim that the list is exhaus- mining. However, most of the existing works focus on
tive. We use them as representatives for the various snapshot Web usage data, while usage data is dynamic in
types of knowledge behind the history of changes. real life. Knowledge hidden behind historical changes
of Web usage data, which reflects how Web access
Frequently changing/Frozen structures/con- patterns (WAP) change, is critical to adaptive Web,
tents: Some parts of the structure or content Web site maintenance, business intelligence, and etcet-
change more frequently and significantly com- era. The Web usage data can be considered as a set of
pared to other structures. Such structures and con- trees, which have the similar structures as XML docu-
tents reflect the relatively more dynamic parts of ments. By partitioning Web usage data according to the
the XML document. Frozen structures/contents user-defined calendar pattern, we can obtain a sequence
represent the most stable part of the XML docu- of changes from the historical Web access patterns.
ment. Identifying such structure is useful for vari- From the changes, useful knowledge, such as how cer-
ous applications such as trend monitoring and tain Web access patterns changed, which parts changes
change detection of very large XML documents. more frequently and which parts do not, can be ex-
Association rules: Some structures/contents are tracted. Some preliminary results of mining the changes
associated in terms of their changes. The associa- to historical Web access patterns have been shown in
tion rules imply the concurrence of changes among Zhao and Bhowmick (2004).
different parts of the XML document. Such knowl-
edge can be used for XML change prediction, Research Issues
XML index maintenance, and XML based multi-
media annotation. To the best of our knowledge, existing state-of-the-art
Change patterns: From the historical changes, XML (structure related) data mining techniques (Yan &
one may observe that more and more nodes are Han, 2002; Yan & Han, 2003) cannot extract such novel
inserted under certain substructures, while nodes knowledge. Even if we apply such techniques repeatedly

801

TEAM LinG
Mining Historical XML

to a sequence of snapshots of XML structure data, they patterns as predicted based on the history. For
cannot discover such knowledge efficiently and com- instance, some of the frozen structures that are
pletely. It is because that the historical XML mining not supposed to change may change while some
research focuses on the structural changes between ver- of the frequently changing structures that are
sions of XML documents, which is generated by XML supposed to change may not change. Such changes
data change detection tools (Zhao & Bhowmick, 2004). may happen for various reasons. For example,
Given a sequence of XML documents (which are some of the structures may be modified by mis-
versions of the same XML document), the objective of takes; some of them may be the result of some
historical XML data mining is to extract hidden and intrusion or fraud actions; and others may be
useful knowledge from the sequence of changes to the caused by intentional or highly occurrence of
XML versions. In this article, we focus on the structural underling events.
changes of XML documents. We proposed three major
issues for historical XML mining. They are identifying Besides the interesting structures, association rules
interesting structures, mining XML delta association between structures can be extracted as well. There are
rules, and classification/clustering evolutionary XML two types of association rules. One is structural delta
data. Based on the historical change behaviors of the association rule and another is semantic delta associa-
XML data, the first issue is to discover the interesting tion rule.
structures. We elaborate on some of the representatives.
Structural delta association rule: Among those
Frozen/Frequently changing structure: Frozen interesting structures, we may observe that some
structure refers to structure that does not change of the structures change together frequently with
frequently and significantly in the history. There certain confidence. For example, whenever struc-
are different possible reasons that a structure does ture A changes, structure B also changes with a
not change in the sequence of XML document probability of 80%. Structural delta association
versions. First, the structure is so well designed rule is used to represent such kind of knowledge.
that it does not change even if the content has It can be used for different applications such as
changed. Second, some parts of the XML docu- version and concurrency control systems.
ment may be ignored or redundant so that they Semantic delta association rule: By incorpo-
never change. Also, some data in the XML docu- rating some metadata, such as types of changes,
ment may never change by nature. ontology, and content summaries of leaf nodes,
Frequently changing structure: Refers to sub- semantic delta association rule can be extracted.
structures in the XML document may change more For example, with the history of changes in an e-
frequently and significantly compared to others. commerce Web site, we may discover that when
They may reflect the relatively more dynamic parts product A becomes more popular, product B will
of the XML document. become less popular. Such semantic delta asso-
Periodic dynamic structure: Among the history ciation rules can be useful for competitor moni-
of changes for some structures, there may be cer- toring and strategy analysis in e-commerce.
tain fixed patterns. Such change patterns may occur
repeatedly in the history. Those structures, where
there exist certain patterns in their change histo- FUTURE TRENDS
ries, are called periodic dynamic structure.
Increasing/Decreasing dynamic structure: Considering the different types of mining we proposed
Among those frequently changing structures, some and the types of data we are going to mine, there are
of them change according to certain patterns. For many challenges ahead. We elaborate on some of the
instance, the changes of some structures may be- representatives here.
come more significant and more frequent. Such
structures will be defined as increasing dynamic Real Data Collection
structures. Similarly, decreasing dynamic struc-
tures denote for structures whose frequency and To collect real data of historical XML structural ver-
significance of changes are decreasing. sions is a challenge. To get the structural data from
Outlier structures: From the historical change XML documents, a parser is needed. Currently, there
behavior of the structures, one may observe that are many XML parsers available. However, it has been
some of them may not comply with their change verified that the parsing process is the most expensive

802

TEAM LinG
Mining Historical XML

process of XML data management (Nicola & John, 2003). detection, the temporal and dynamic property of XML
Moreover, to get the historical structural information, data is incorporated with the semistructure property in M
every time the XML document changes the entire docu- historical XML mining. Examples shown that the re-
ment has to be parsed again. Consequently, to extract sults of historical XML mining can be useful in a variety
historical structural information is a very expensive task. of applications, such as intelligent change detection for
Especially for very large XML documents, usually only very large XML documents, Web usage mining, dy-
some parts of the document change frequently. In order namic XML indexing, association rule mining, evolu-
to get knowledge that is more useful, a longer sequence tionary pattern based classification, and etcetera. By
of historical structural data and larger datasets are desir- exploring the characteristics of XML changes, we present
able. However, to get a longer and larger historical XML a framework of historical XML mining with a list of
structural datasets, a longer time is needed to collect the major research issues and challenges.
corresponding XML documents.

Frequency of Data Gathering REFERENCES

With the dynamic nature of XML data, the process of Arasu, A., & Hector, G.-M. (2003). Extracting struc-
gathering all versions of the historical XML structural tured data from Web pages. In Proceedings of ACM
data becomes a challenging task. The most naive method SIGMOD (pp. 337-348).
is to keep checking corresponding XML documents
continuously, but this approach is very expensive and Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi,
may overwhelm the network bandwidth. An alternative P.L. (2002). A tool for extracting XML association
method is to determine the frequency of checking by rules. In Proceedings of IEEE ICTAI (pp. 57-65).
analyzing the historical changes of similar XML docu- Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi,
ments. However, this method does not guarantee that all P.L. (2003). Discovering interesting information in
versions of the historical data can be detected. In our XML data with association rules. In Proceeding of ACM
research, we propose to extract appropriate frequency for SAC (pp. 450-454).
data gathering based on the changes of the historical data.
Cobena, G., Abiteboul, S., & Marian, A. (2002). Detect-
Incremental Mining ing changes in XML documents. In Proceedings of
IEEE ICDE (pp. 41-52).
The high cost of some data mining processes and the Cooley, R. (2003). The use of Web structure and con-
dynamic nature of XML data make it desirable to mine tent to identify subjectively interesting Web usage pat-
the data incrementally based on the part of data that terns. In ACM Transactions on Internet Technology
changed rather than mining the entire data again from (pp. 93-116).
scratch. Such algorithms are based on the previous
mining results. Most recently, there is a survey of Curbera, & Epstein, D.A. (1999). Fast dierence and
incremental mining of sequential patterns in large data- update of XML documents. In Proceeding of XTech.
base (Parthasarathy et al., 1999). Integrated with a change Deshpande, M., Kuramochi, M., & Karypis, G. (2003).
detection system, our incremental mining of historical Frequent sub-structure-based approaches for classify-
XML structural data can keep the discovered knowledge ing chemical compounds. In Proceedings of IEEE ICDM
up to date and valid. However, the challenge is that (pp. 35-42).
existing incremental mining techniques are for rela-
tional and transactional data, while XML structural is Dumais, S.T., & Chen, H. (2000). Hierarchical
semistructure. How to modify those approaches so that classication of Web content. In Proceedings of An-
they can be used for incremental mining historical XML nual International ACM SIGIR (pp. 256-263).
will be one of our research issues.
Ester, M., Kriegel, H.-P., & Schubert, M. (2002). Web
site mining: A new way to spot competitors, customers
and suppliers in the World Wide Web. In Proceedings
CONCLUSION of the eighth ACM SIGKDD (pp. 249-258).
In this article, we present a novel research direction: Etzioni, O. (1996). The World-Wide Web: Quagmire or
mining historical XML documents. Different from ex- gold mine? Communications of the ACM, 39(11), 65-68.
isting research of XML data mining and XML change

803

TEAM LinG
Mining Historical XML

Han, J., & Kamber, M. (2000). Data mining concepts and Wang, Y., DeWitt, D.J., & Cai, J.-Y. (2003). X-difi: An
techniques. Morgan Kaufmann. effective change detection algorithm for XML documents.
In Proceedings of IEEE ICDE (pp. 519-530).
Huan, J., Wang, W., & Prins, J. (2003). Efficient mining
of frequent subgraph in the presence of isomorphism. In Yang, L.H., Lee, M.L., & Hsu, W. (2003). Efficient mining
Proceedings of IEEE ICDM (pp. 549-552). of XML query patterns for caching. In Proceedings of
VLDB (pp. 6980).
Huan, J., Wang, W., Washington, A., Prins, J., Shah, R.,
& Tropshas, A. (2004). Accurate classification of protein Zaki, M.J. (2002). Efficiently mining frequent trees in a
structural families using coherent subgraph analysis. In forest. In Proceedings of ACM SIGKDD (pp. 71-80).
Proceedings of PSB (pp. 411-422).
Zaki, M.J., & Aggarwal, C.C. (2003). XRules: An effective
Kleinberg, J.M. (1998). Authoritative sources in a structural classifier for XML data. In Proceedings of ACM
hyperlinked environment. Journal of the ACM, 46(5), SIGKDD (pp. 316-325).
604-632.
Zhao, Q., & Bhowmick, S.S. (2004). Mining history of
Lerman, K., Getoor, L., Minton, S., & Knoblock, C. changes to Web access patterns. In Proceeding of PKDD.
(2004). Using the structure of Web sites for automatic
segmentation of tables. In Proceedings of ACM SIGMOD
(pp. 119-130).
KEY TERMS
Madria, S.K., Bhowmick, S.S., Ng, W.K., & Lim, E.-P.
(1999). Research issues in Web data mining. In DMKD Changes to XML: Given two XML document, the set
(pp. 303-312). of edit operations that transform one document to another
Matthias, N., & Jasmi, J. (2003). XML parsing: A threat to is called changes to XML.
database performance. In Proceedings of CIKM (pp. 175- Historical XML: A sequence of XML documents,
178). which are different versions of the same XML document.
McHugh, J., Abiteboul, S., Goldman, R., Quass, D., & It records the change history of the XML document.
Widom, J. (1997). Lore: A database management sys- Mining Historical XML: It is the process of knowl-
tem for semistructured data. ACM SIGMOD Record, edge discovery from the historical changes to versions
26(3), 54-66. of XML documents. It is the integration of XML change
Reis, D. de C., Golgher, P.B, da Silva Altigran, S., & detection systems and XML data mining techniques.
Laender, A.H.F. (2004). Automatic Web news extraction Semi-Structured Data Mining: Semi-structured data
using tree edit distance. In Proceedings of WWW (pp. 502- mining is a sub-field of data mining, where the data
511). collections are semi-structured, such as Web data, chemi-
Srinivasan, P., Zaki, M.J., Ogihara, M. & Dwarkadas, S.. cal data, biological data, network data, and etcetera.
(1999). Incremental and interactive sequence mining. In Web Mining: Web data mining is to use data mining
Proceedings of CIKM (pp. 251-258). techniques to automatically discovery and extract infor-
Shearer, K., Dorai, C., & Venkatesh, S. (2000). Incorpo- mation from Web data and services.
rating domain knowledge with video and voice data XML: Extensible Markup Language (XML) is a
analysis in news broadcasts. In Proceedings of ACM simple, very flexible text format derived from SGML.
SIGKDD (pp. 46-53).
XML Structural Changes: Among the changes to
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. XML, not all the changes may cause the structure changes
(2000). Web usage mining: Discovery and applications of the XML when it is represented as a tree structure. In
of usage patterns from Web data. ACM SIGKDD Explo- this case, only insertion and deletion are called XML
rations, 1(2), 12-23. structural changes.

804

TEAM LinG
805

Mining Images for Structure M


Terry Caelli
Australian National University, Australia

INTRODUCTION ods for binding image content with semantics. In turn,


this reduces to the need for models and algorithms that
Most data warehousing and mining involves storing and are capable of efficiently encoding and matching rela-
retrieving data either in numerical or symbolic form, tional properties of images and associating these rela-
varying from tables of numbers to text. However, when tional properties with semantic descriptions of what is
it comes to everyday images, sounds, and music, the being sensed. To illustrate this approach, we briefly
problem turns out to be far more complex. The major discuss two representative examples of such methods:
problem with image data mining is not so much image (1) Bayesian Networks (Bayesian Nets) for SBR, based
storage, per se, but rather how to automatically index, first on multi-scaled image and then on image feature
extract, and retrieve image content (content-based re- models; (2) principal components analysis (also termed
trieval [CBR]). Most current image data-mining tech- latent semantic indexing or spectral methods).
nologies encode image content by means of image
feature statistics such as color histograms, edge, tex- Bayesian Network Approaches
ture, or shape densities. Two well- known examples of
CBR are IBMs QBIC system used in the State Heritage Bayesian Nets have recently proved to be a powerful
Museum and PICASSO (Corridoni, Del Bimbo & Pala, method for SBR, since semantics are defined in terms
1999) used for the retrieval of paintings. More recently, of the dependencies between image features (nodes),
there have been some developments in indexing and their labels, and known states of what is being sensed.
retrieving images based on the semantics, particularly Inference is performed by propagation probabilities
in the context of multimedia, where, typically, there is through the network. For example, Benitez et al. (2003)
a need to index voice and video (semantic-based re- have developed MediaNet, a knowledge representation
trieval [SBR]). Recent examples include the study by network and inference model for the retrieval of con-
Lay and Guan (2004) on artistry-based retrieval of ceptually defined scene properties integrated with natu-
artworks and that of Benitez and Chang (2002) on com- ral language processing. In a similar way, Hidden Markov
bining semantic and perceptual information in multime- Random Fields (HMRFs) have become a common class
dia retrieval for sporting events. of image models for binding images with symbolic
However, this type of concept or semantics-based image descriptions. In particular, Hierarchical Hidden Markov
indexing and retrieval requires new methods for encoding Random Fields (HHMRF) provide a powerful SBR rep-
and matching images, based on how content is structured, resentation. HHMRFs are defined over multi-scaled
and here we briefly review two approaches to this. image pixel or features defined by Gaussian or Laplacian
pyramids (Bouman & Shapiro, 1994). Each feature, or
pixel, x , at a given scale is measured (observed) to
BACKGROUND evidence scene properties, states, s, corresponding to
semantic entities such as ground, buildings, and so
Generally speaking, image structure is defined in terms forth, as schematically illustrated in Figure 1. The rela-
of image features and their relations. For SBR, such tionships between states serves to define the grammar.
features and relations reference scene information. The link between observations and states defines, in this
These features typically are multi-scaled, varying from approach, the image semantics. Accordingly, at each
pixel attributes derived from localized image windows scale, l, we have a set of observations and states, where
to edges, regions, and even larger image area properties. p(ol (x) /sl (x)) defines the dependency of the observation
at scale, l, on the state of the world (scene).
Specifically, the HHMRF assumes that the state at
MAIN THRUST a pixel, x , is dependent on the states of its neighboring
pixels at the same or neighboring levels of the pyramid.
In recent years there has been an increasing interest in A simple example of the expressiveness of this model is
SBR. However, this requires the development of meth- a forestry scene. This could be an image region (label:

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Mining Images for Structure

Figure 1. The hierarchical hidden Markov random case, the HMRF is defined over graphs that depict fea-
field (HHMRF) model for image understanding. Here, tures and their relations. That is, consider two attributed
the hidden state variables, X, at each scale are graphs, Gs and G x , representing the image and the query,
evidenced by observations, Y, at the same scale and
respectively. We want to determine just how, if at all, the
the state dependencies within and between levels of
query (graph) structure is embedded somewhere in the
the hierarchy. The HHMRF is defined over pixels and/
image (graph). We define HMRF over the query graph,
or feature graphs.
G x . A single node in G x is defined by xi , and in the graph
Gs , by s . Each node in each graph has vertex and edge
attributes, and the query corresponds to solving a sub-
graph isomorphism problem that involves the assignment
of each xi a unique s , assuming that there is only one
instance of the query structure embedded in the image,
although this can be generalized. In this formulation, the
HMRF model considers each node xi in G x as a random
forest) dependent on a set of regions labeled tree at the variable that can assume any of S possible values corre-
next scale, which, in turn, are dependent on trunk,
branches, and leaves labels at the next scale, and so sponding to the nodes of Gs .
forth. These labels have specific positional relations
over scales, and the observations for each label are The Observation Component: Using HMRF for-
supported by their compatibilities and observations. malities, the similarity (distance: dist) between
Consequently, the SBR query for finding forestry vertex attributes of both graphs is consequently
images is translated into finding the posterior maximum defined as the observation matrix model
likelihood (MAP) of labeling the image regions, given
the forestry model. Using Bayes rule, this reduces to Bi = p ( y xi / xi = s ) = dist ( y xi , y s ) .
the following optimization problem:

s*l (x)) argmax{p(sl (x) /ol (x)) p(sl v (x u))} The Markov Component: Here, we use the bi-
S u,v nary (relational) attributes to construct the com-
patibility functions between states of neighboring
where l v corresponds to the states above and below nodes. Assume that xi and x j are neighbors in the
level l of the hierarchy. In other words, the derived
labeling MAP probability is equivalent to a probabilistic HMRF (being connected in G x ). Similar to the
answer to the query if this image is a forestry scene. previously described unary attributes, we have
There are many approaches to approximate solutions to
this problem, including relaxation labeling, expectation A ji; = p( x j = s / xi = s ) = dist ( y xij , y s ) .
maximization (EM), loopy belief propagation and the
junction tree algorithm (see definitions in Terms and
Definitions). All of these methods are concerned with
optimal propagation of evidence over different layers of Optimization Problem and Solutions: Given
the graphical representation of the image, given the this general HMRF formulation for graph match-
model and the observations, and all have their limita- ing, the optimal solution reduces to that of deriv-
tions. When the HHMRF model is approximated by a ing a state vector s * = ( s 1 ,.., s T ) where s i G s for
triangulated image state model, the junction tree algo-
each vertex x i G x such that the MAP criterion is
rithm is optimal. However, triangulating such hierarchi-
cal meshes is computationally expensive. On the other satisfied, given the model = ( A, B) and data
hand, the other approaches mentioned previously are
not optimal, converging to local minima (Caetano & s * = arg max{ p ( x1 = s ,...., xT = s / )} .
Caelli, 2004). s .. s

When the structural information in the query and


image is defined in terms of features (i.e., regions or Specifically, we introduce two sets of HMRF mod-
edge segments) and their relational attributes, again, els for solving this problem.
HMRFs can be applied to image feature matching. In this

806

TEAM LinG
Mining Images for Structure

First, probabilistic relaxation labeling (PRL) is a parallel adjacency matrix where relations are defined in terms of
iterative update method for deriving the most consistent the matrix off-diagonal elements. These matrices are de- M
labels, having the form composed into their eigenvalues and eigenvectors, as

p t 1 (d i ) p( yi / d i ) p ( d i / d i ) p t (d i ) G = P P ' , G = T
1,..i 1,i 1,..,n i i

where T corresponds to the covariance matrix of


where pt (d i ) is typically factored into pair-wise condi- the vertex and edge attributes. In the case where only
tionals. This algorithm is known to converge to local adjacencies (1,0) are used, then G is simply the adja-
minima and is particularly sensitive to initial values. Al- cency matrix. The core idea is that vertices are then
beit, this is an example of an inexact (suboptimal) model projected into a selected eigen-subspace, and corre-
applied to the complete data model. The following models spondences are determined by proximities in such
are the oppositeexact inference models over approxi- spaces. The method, in different variations, has been
mate data modelsand so their optimality is critically explored by Scott and Longuet-Higgins (1991), Shapiro
dependent on how well the data model approximation is. and Brady (1992), and Caelli and Kosinov (2004).
Single Path Dynamic Programming (SPDP) con- Normalization is required when graphs are not of the
sists of approximating the fully connected graph into a same size. Sometimes, only the eigenspectra (i.e., list
single chain, which traverses each vertex exactly once. of eigenvalues) are used to compare graphs (Siddiqi et
This model allows us to use dynamic programming in al., 1999; Luo & Hancock, 2001).
order to find the optimal match between the graphs. In the Such spectral methods offer insightful ways to
second type of model, we improve over this simple chain define image structures, which often can correspond to
model by proposing HMRFs with maximal cliques of semantically corresponding properties between the
size 3 and 4, which still retain the model feasibility for query and the image. For example, comparing two
accomplishing optimal inference via junction tree (JT) different views of the same animal often can result in
methods (Lauritzen, 1996). These cliques encode more the predicted correspondences having semantically
information than the single path models insofar as verti- correct labels, as shown in Figure 2 and Table 1, even
ces are connected to 2 (JT3) and 3 (JT4) other neighbor- when only basic adjacencies (simply whether two fea-
ing vertices, as compared to only 1 (the previous) in the tures are connected or not, without attributes such as
SPDP case. Such larger cliques, their associated vertex distances, angles, etc.) are used to encode the rela-
labels, then are used to derive the optimal labels for a tional structures.
given vertex. The JT algorithm is similar to dynamic
programming involving a forward and backward pass for
determining the most likely labels, given the evidence Figure 2. Two sample shapesboxer-10 and boxer-
and neighborhood consistencies (Caetano & Caelli, 2004). 24and their shock graph representations (Siddiqi
Another recent approach to encoding and retrieving et al., 1999)
image structures in ways that can index their semantics is
the method of Latent Semantic Indexing (LSI), as used in
many other areas of data mining (Berry et al., 1995).
Here, the basic idea is to use Principle Components
Analysis (PCA) to determine the basic dimensionality of
a relational data structure (i.e., items by attribute ma-
trix), project, and cluster items, and to retrieve examples
via proximity in the projection subspaces. For SBR and
for images, what is needed is a form of LSI that functions
on the relational attributes of image features, and re-
cently, the LSI method has been extended to incorporate
such structures. Here, relationships between features
are encoded as a graph, and the LSI approach offers a
different method for solving graph-matching problems.
In this way, SBR consists of matching the query (a graph)
with observed image data encoded also as a graph.
The LSI approach, when applied to relational fea-
tures, uses standard matrix models for graphs, such as the

807

TEAM LinG
Mining Images for Structure

Table 1. Cluster decomposition obtained from and robust way. Machine learning, Bayesian networks,
correspondence clustering of the two shock graphs and even new uses of matrix algebra all open up new
for shapes BOXER-10 and BOXER-24. Each cluster possibilities. Unless computers can store, evaluate and
enumerates the vertices from the two graphs that are interpret images the way we do, image data warehousing
grouped together and lists in brackets the part of the and mining will remain a doubtful and certainly labor-
body from which the vertex approximately comes. The intensive technology with limited use without significant
percentage of variance explained by the first three human intervention.
dimensions for shapes Boxer-10 and Boxer-24 was
32% and 33%, respectively (Caelli & Kosinov, 2004).
REFERENCES

Benitez, A., & Chang, S. (2002). Multimedia knowledge


interrelations: Summarization and evaluation. Proceed-
ings of the MDM/KDD-2002, Edmonton, Alberta,
Canada.
Berry, M., Dumais, S., & OBien, G. (1994). SIAM Review,
37(4), 573-595.
Bouman, C., & Shapiro, M. (1994). A multiscale random
field model for Bayesian image segmentation. IEEE:
Transactions On Image Processing, 3(2), 162-177.
Caelli, T., Cheng, L., & Feng, Q. (2003). A Bayesian
approach to image understanding: From images to vir-
tual forests. Proceedings of the 16 th International
Conference on Vision Interface (pp. 1-12), Halifax, Nova
Scotia, Canada.
Caelli, T., & Kosinov, S. (2004). Inexact graph matching
FUTURE TRENDS using eigenspace projection clustering. International
Journal of Pattern Recognition and Artificial Intelli-
As can be seen from the previous discussion, SBR re- gence, 18(3), 329-354.
quires the development of new methods for relational
Caetano, T., & Caelli, T. (2004). Graphical models for
queries, ideally to pose such querying as an optimization
graph matching. Proceedings of the IEEE International
problem that results in a degree of match that, in some
Conference on Computer Vision and Pattern Recogni-
sense, is derived optimally rather than by using heuristic
tion: CVPR (pp. 466-473).
methods. Here, we have discussed briefly two such meth-
ods: one based on Bayesian methods and the other based Corridoni, J., Del Bimbo, A., & Pala, P. (1999). Retrieval of
on matrix methods. What is clearly needed is an integra- paintings using effects induced by colour features. IEEE
tion of both approaches in order to capitalize on the Multimedia, 6(3), 38-53.
benefits of both views. Recent developments in both
deterministic and stochastic kernel methods (Schlkopf Lauritzen, S. (1996). Graphical models. Oxford: Clarendon
& Smola, 2002) may provide the basis for such an integra- Press.
tion. Lay, J., & Guan, L. (2004). Retrieval for color artistry
concepts. IEEE Transactions on Image Processing,
13(3), 326-339.
CONCLUSION
Luo, B., & Hancock, E. (2001). Structural graph match-
Image data mining is still an emerging area where ulti- ing using the EM algorithm and singular value decompo-
mate success and use can only come about if we can sition. IEEE Transaction of Pattern Analysis and Ma-
solve the difficult problem of querying images and chine Intelligence, 23(10), 1120-1136.
searching for content as humans do. This must involve Schlkopf, B., & Smola, A. (2002). Learning with
the creation of new SBR algorithms that clearly are based kernels. Cambridge, MA: MIT Press.
on the ability to support relational queries in an optimal

808

TEAM LinG
Mining Images for Structure

Scott, G., & Longuet-Higgins, H. (1991). An algorithm for Hidden Markov Random Field (HMRF): A Markov
associating the features of two patterns. Proceedings of Random Field with additional observation variables at M
the Royal Society of London. each node, whose values are dependent on the node
states. Additional pyramids of MRFs defined over the
Shapiro, L., & Brady, J. (1992). Feature-based correspon- HMRF give rise to hierarchical HMRFs (HHMRFs).
denceAn eigenvector approach. Image and Vision
Computing, 10, 268-281. Image Features: Discrete properties of images that
can be local or global. Examples of local features include
Siddiqi, K., Shokoufandeh, P., Dickinson, S., & Zucker, S. edges, contours, textures, and regions. Examples of glo-
(1999). Shock graphs and shape matching. International bal features include color histograms and Fourier compo-
Journal of Computer Vision, 30, 1-24. nents.
Image Understanding: The process of interpreting
images in terms of what is being sensed.
KEY TERMS
Junction Tree Algorithm: A two-pass method for
Attributed Graph: Graphs whose vertices and edges updating probabilities in Bayesian networks. For trian-
have attributes typically defined by vectors and matri- gulated networks the inference procedure is optimal.
ces, respectively. Loopy Belief Propagation: A parallel method for
Bayesian Networks: A graphical model defining update beliefs or probabilities of states of random vari-
the dependencies between random variables. ables in a Bayesian network. It is a second-order exten-
sion of probabilistic relaxation labeling.
Dynamic Programming: A method for deriving
the optimal path through a mesh. For hidden Markov Markov Random Field (MRF): A set of random
models (HMMs), it is also termed the Viterbi algorithm variables defined over a graph, where dependencies
and involves a method for deriving the optimal state between variables (nodes) are defined by local cliques.
sequence, given a model and an observation sequence. Photogrammetry: The science or art of obtaining
Expectation Maximization: Process of updating reliable measurements or information from images.
model parameters from new data where the new param- Relaxation Labeling: A parallel method for updat-
eter values constitute maximum posterior probability ing beliefs or probabilities of states of random variables
estimates. Typically used for mixture models. in a Bayesian network. Node probabilities are updated in
Grammars: A set of formal rules that define how to terms of their consistencies with neighboring nodes and
perform inference over a dictionary of terms. the current evidence.

Graph: A set of vertices (nodes) connected in various Shape-From-X: The process of inferring surface
ways by edges. depth information from image features such as stereo,
motion, shading, and perspective.
Graph Spectra: The plot of the eigenvalues of the
graph adjacency matrix.

809

TEAM LinG
810

Mining Microarray Data


Nanxiang Ge
Aventis, USA

Li Liu
Aventis, USA

INTRODUCTION mats. Sensible organization of such diverse data


types will ease the data mining process.
During the last 10 years and in particularly within the Data Standards: the microarray data analysis com-
last few years, there has been a data explosion associ- munity has developed a standard for microarray
ated with the completion of the human genome project data: the Minimum Information About a Microarray
(HGP) (IHGMC and Venter et al., 2001) in 2001 and the Experiment (MIAME: http://www.mged.org). The
many sophisticated genomics technologies. The human MIAME standard is needed to enable the consis-
genome (and genome from other species) now provides tent interpretation of experiment results and po-
an enormous amount of data waiting to be transformed tentially to reproduce the experiment.
into useful information and scientific knowledge. The
availability of genome sequence data also sparks the
development of many new technology platforms. Among MAIN THRUST
the available different technology platforms, microarray
is one of the technologies that is becoming more and Mining microarray data is a very challenging task. The
more mature and has been widely used as a tool for raw data from microarray experiments usually comes in
scientific discovery. The major application of microarray image format. Images are then quantified using image
is for simultaneously measuring the expression level of analysis software. Such quantified data are then subject
thousands of genes in the cell. It has been widely used in to three steps of analysis:
drug discovery and starts to impact the drug develop-
ment process. Pre-processing microarray data
The mining microarray database is one of the many Mining microarray data.
challenges facing the field of bioinformatics, computa- Joint mining of microarray data and sequence
tional biology and biostatistics. We review the issues in database.
microarray data mining.
We review the data analysis methods in these three
aspects. We focus our discussion on Affymetrix (http:/
BACKGROUND /www.affymetrix.com) technology, but many of the
methods are applicable to data from other platforms.
The quantity of data generated from a microarray ex-
periment is usually large. Besides the actual gene ex- Pre-Processing Microarray Data
pression measurement, there are many other data avail-
able such as the corresponding sequence information, While effectively managing microarray and related data
genes chromosome location information, gene ontol- is an important first step, microarray data have to be pre-
ogy classification and gene pathway information. Man- processed so that downstream analysis can be performed.
agement of such data is a very challenging issue for Array-to-array and sample-to-sample variations are the
bioinformatics. The following is a list of requirements main reason for the requirement of pre-processing.
for data management. Typically, pre-processing involves the following four
steps:
Data Organization: different microarray platforms
generate data with different formats, different gene Image Analysis: image analysis in microarray
identifiers, etc. Sequence data, gene ontology data, experiment involves gridding of image, signal ex-
and gene pathway information all with diverse for- traction and back ground adjustment. Affymetrixs

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Mining Microarray Data

microarray analysis suite (MAS) provides the quan- ratios were colored red, cells with negative log
tification software. ratios were colored green, and the intensities of M
Data Normalization: normalization is a step neces- reds or greens were proportional to the absolute
sary for remove systematic array to array variations. values of the log ratios. The clustering analysis
Different normalization methods have been pro- efficiently grouped genes with similar functions
posed. Cyclic loess method (Dudoit et al., 2002), together, and the colored image provided an over-
quantile normalization method and contrast-based all pattern in the data. Clustering analysis can also
method normalize the probe level data; Scaling help us understand the novel genes if they were co-
method and non-linear method normalize expres- expressed with genes with known functions.
sion intensity level data. Bolstad, Irizarry, Astrand, Standard clustering analysis, such as hierarchical
and Speed (2003) provides a comparison of all these clustering, k-means clustering, self-organizing
methods and suggests that simple quantile normal- maps, are very useful in mining the microarray
ization performs relatively stable. data. However, these data tables are often cor-
Estimation of Expression Intensities: different rupted with extreme values (outliers), missing
methods to summarize expression intensity from values, and non-normal distributions that preclude
probe level data have appeared in literature. Among standard analysis. Liu, Hawkins, Ghosh, and Young
them, Li and Wong (2001) proposed a model (2003) proposed a robust analysis method, called
based expression intensity (MEBI), Irizarrys rSVD (Robust Singular Value Decomposition), to
(2003) robust multi-array (RMA) methods and the address these problems. The method applies a
method provided in Affymetrix MAS5 software. combination of mathematical and statistical meth-
Data Transformation: data transformation is a ods to progressively take the data set apart so that
very important step and it enables the data to fit to different aspects can be examined for both general
many of the assumptions behind statistical meth- patterns and for very specific effects. The benefits
ods. Log transformation and glog (Durbin, Hardin, of this robust analysis will be both the understand-
Hawkins, & Rocke, 2002) transformation are the ing of large-scale shifts in gene effects and the
two commonly used methods. isolation of particular sample-by-gene effects that
might be either unusual interactions or the result
Mining Microarray Data of experimental flaws. The method requires a
single pass, and does not resort to complex clean-
In principal, the current data mining activities in ing or imputation of the data table before analy-
microarray data can be grouped into two types of stud- sis. The method rSVD was applied to a micro array
ies: unsupervised and supervised. Unsupervised analy- data, revealed different aspects of the data, and
sis has been used widely for mining microarray experi- gave some interesting findings.
ment. Cluster analysis has been the dominant method
for unsupervised mining. Examples for supervised data mining:
Examples of unsupervised data mining:
Golub et al. (1999) studied the gene expression of
Eisen, Spellman, Brown, and Botstein (1998) stud- two types of acute leukemias, acute myeloid leu-
ied the gene expression of budding yeast Saccha- kemia (AML) and acute lymphoblastic leukemia
romyces cerevisiae spotted on cDNA microarrays (ALL), and demonstrated the feasibility of cancer
during the diauxic shift, the mitotic cell division prediction based on gene expression data. The data
cycle, sporulation, and temperature and reducing comes from Affymetrix arrays with 6817 genes,
shocks. Hierarchical clustering was applied to this and consists of 47 cases of ALL and 25 cases of
gene expression data, and the result was repre- AML. 38 samples (27 ALL, 11 AML) were used as
sented by a tree whose branch lengths reflected training data. A set of 50 genes with the highest
the degree of similarity between genes, which was correlations with an idealized expression pat-
assessed by a pair-wise similarity function. The tern vector, where the expression level is uni-
computed tree was then used to order genes in the formly high for AML and uniformly low for ALL,
original data table, and genes with similar expres- were selected. The prediction of a new sample was
sion pattern were grouped together. The ordered based on weighted votes of these 50 genes. The
gene expression table can be displayed graphically method made strong prediction for 29 of the 34
in a colored image, where cells with log ratios of test samples, and the accuracy was 100%. Golubs
0s were colored black, cells with positive log method is actually a minor variant of the maximum
likelihood linear discriminate analysis for two

811

TEAM LinG
Mining Microarray Data

classes. Instead of using variances in computing discovering enriched biological themes within gene
weights, Golubs method uses standard deviations. lists. Blalock et al. (2004) studied the gene expres-
Dudoit el al. (2002) compared the prediction perfor- sion of Alzheimers disease (AD), and found some
mance of a variety of classification methods, such as interesting over-represented gene categories
linear/quadratic discriminate analysis, nearest neigh- among the regulated genes using EASE. The find-
bor method, classification trees, aggregated classi- ings suggest a new model of AD pathogenesis in
fiers, bagging and boosting, on three microarray which a genomically orchestrated up-regulation of
datasets, lymphoma data (Alizadeh et al.), leukemia tumor suppressor-mediated differentiation and in-
data (Golub et al.) and NCI60 data (Ross et al.). volution processes induces the spread of pathol-
Based on their comparisons, the rankings of the ogy along myelinated axons.
classifiers were similar across datasets and the main Transcriptional factor binding factors are essen-
conclusion, for the three datasets, is that simple tial elements in the gene regulatory networks.
classifiers such as diagonal linear discriminate analy- Directly experimental identification of such tran-
sis and nearest neighbors perform remarkably well scriptional factors is not practical or efficient in
compared to more sophisticated methods such as many situations. Conlon, Liu, Lieb, and Liu (2003)
aggregated classification trees. provided one method based on least regression to
Dimension reduction techniques, such as principal integrating microarray data and transcriptional
component analysis (PCA) and partial least squares factor binding sites (TFBS) patterns. Curran, Liu,
(PLS) can be used to reduce the dimension of the Long, and Ge (2004) provided a logistic regres-
microarray data before certain classifier is used. sion approach to joint mining an internal
For example, Nguyen et al. (2002) proposed an microarray database and a corresponding TFBS
analysis procedure to predict tumor samples. The database (http://transfac.gbf.de/).
procedure first reduces the dimension of the
microarray data using PLS, and then applies logis-
tic discrimination (LD) or quadratic discriminate FUTURE TRENDS
analysis (QDA). See also West et al. (2001) and
Huang et al. (2003). As more and more new technologies platform are in-
troduced for medical research, it is imaginable that
Joint Data Mining of Microarray, Gene data will continue to grow. For example, Affymetrix
Ontology and TFBS Data recently introduced its SNP chip which contains
100,000 human SNPs. If such a technology were ap-
Combining microarray data and other available data such plied to a clinical trial with 10000 subjects, the SNP
as DNA sequence data, gene ontology data has attracted data alone will be a 10000 by 100000 table in addition
lots of attention in recent years. From a system biology to data from potential other technologies such as
view, combining different data types provides added proteomics, metabonomics and bio-imaging.
power than looking microarray data alone. This is still a Associated with the growth of data will be the in-
very active research area for data mining and we review creasing need for effective data management and data
some of the current work done in two aspects: combining integration. More efficient data retrieval system will
with gene ontology database and combining with (TFBS) be needed as well as system that can accommodate
database. large scale and diversified data.
Besides the growth of data management system, it is
Combining microarray data with gene ontology foreseeable that integrated data analysis will become
data is important in interpreting the microarray more and more routine. At this point, most of the data
data. Expression Analysis Systematic Explorer analysis software is only capable of analyzing small
(EASE, http://david.niaid.nih.gov/david/ease.htm) is scale, isolated data sets. So, the challenges to the
developed by Hosack, Dennis, Sherman, Lane, and informatics field and statistics field will continue to
Lempicki (2003), and is a customizable, standalone grow dramatically.
software application that facilitates the biological
interpretation of gene lists derived from the results
of microarray, proteomics, and SAGE experiments. CONCLUSION
EASE can generate the annotations of a list of
genes in one shot, automatically link to online Microarray technology has generated vast amount of
analysis tools, and provide statistical methods for data for very interesting data mining research. How-

812

TEAM LinG
Mining Microarray Data

ever, as the central dogma indicates, microarray only Durbin, B.P., Hardin, J.S., Hawkins, D.M., & Rocke, D.M.
provides one snapshot of the biological system. Se- (2002). A variance-stabilizing transformation of gene ex- M
quencing, proteomics technology, metabolite profiling pression microarray data. Bioinformatics, 18(Supplement
provide different view about the biological system and it 1), S105-S110.
will remain a challenge to deal with the joint data mining
of data from such diverse technology platform. Eisen, M.B., Spellman, P.T., Brown, P.O., & Botstein, D.
There are significant challenges with respect to devel- (1998). Cluster analysis and display of genome-wide ex-
opment of data mining technology to deal with data pression patterns. PNAS, 95, 14863-14868.
generated from different technology platform. The chal- Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C.,
lenges to deal with the combined data will be even more Gaasenbeek, M., Mesirov, J.P., et al. (1999). Molecular
difficult. Besides methodological challenges, how to or- classification of cancer: Class discovery and class pre-
ganize the data, how to develop standards for data gen- diction by gene Expression monitoring. Science,
erated from different platforms and how to develop com- 286(5439), 531-537.
mon terminologies, all remain to be answered.
Hosack, D.A., Dennis, G. Jr., Sherman, B.T., Lane, H.C., &
Lempicki, R.A. (2003). Identifying biological themes within
REFERENCES lists of genes with EASE. Genome Biology, 4(9), R60.
Huang, E., Cheng, S.H., Dressman, H., Pittman, J., Tsou,
Alizadeh, A.A. et al. (2000). Different types of diffuse M.H., Horng, C.F. et al. (2003). Gene expression predictors of
large b-cell lymphoma identified by gene expression breast cancer outcomes. Lancet, 361, 1590-1596.
profiling. Nature, 403(6769), 503-511.
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D.,
Blalock, E.M., Geddes, J.W., Chen, K.C., Porter, N.M., Antonellis, K.J., Scherf, U. et al. (2003) Exploration, nor-
Markesbery, W.R., & Landfield, P.W. (2004). Incipient malization, and summaries of high density oligonucle-
Alzheimers disease: Microarray correlation analyses otide array probe level data. Biostatistics, 4(2), 249-264.
reveal major transcriptional and tumor suppressor re-
sponses. PNAS, 101(7), 2173-8. Lee, H.K., Hsu, A.K., Sajdak, J., Qin, J., & Pavlidis, P. (2004).
Coexpression analysis of human genes across many microarray
Bolstad, B.M., Irizarry, R.A., Astrand, M., & Speed, T.P. data sets. Genome Res, 14(6), 1085-1094.
(2003). A comparison of normalization methods for
high density oligonucleotide array data based on bias Li, C., & Wong, W.H. (2001). Model-based analysis of
and variance. Bioinformatics, 19(2), 185-193. oligonucleotide arrays: Expression index computation
and outlier detection. PNAS, 98(1), 31-36.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D.,
Brown, P.O. et al. (1998). The transcriptional program of Liu, L., Hawkins, D.M., Ghosh, S., & Young, S.S. (2003).
sporulation in budding yeast. Science, 282(5389), 699- Robust singular value decomposition analysis of
705. microarray data. PNAS, 100(23), 13167-13172.

Conlon, E.M., Liu, X.S., Lieb, J.D., & Liu, J.S. (2003). The International Human Genome Mapping Consortium
Integrating regulatory motif discovery and genome- (IHGMC). (2001). A physical map of the human ge-
wide expression analysis. PNAS, 100(6), 3339-3344. nome. Nature, 409(6822), 934-941.

Curran, M., Liu, H., Long, F., & Ge, N. (2003). Statisti- Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees,
cal methods for joint data mining of gene expression and C., Spellman, P., et al. (2000). Systematic variation in
DNA sequence database. SIGKDD Explorations Spe- gene expression patterns in human cancer cell lines.
cial Issue on Microarray Data Mining, 5, 122-129. Nature Genetics, 24, 227-234.

Dudoit, S., Fridlyand, J., & Speed, T.P. (2002). Com- Venter, J. et al. (2001). The human genome. Science,
parison of discrimination methods for the classifica- 291(5507), 1304-51.
tion of tumors using gene expression data. Journal of West, M., Blanchette, C., Dressman, H., Huang, E.,
the American Statistical Association, 97(457), 77-87. Ishida, S., Spang, R., et al. (2001). Predicting the clini-
Dudoit, S., Yang, Y.H., Callow, M.J., & Speed, T.P. cal status of human breast cancer by using gene expres-
(2002). Statistical methods for identifying genes with sion profiles. PNAS, 98, 11462-11467.
differential expression in replicated cDNA microarray
experiments. Statistical Sinica, 12(1), 111-139.

813

TEAM LinG
Mining Microarray Data

KEY TERMS MIAME: Minimum Information About a Microarray


Experiment. It is a standard for data from a microarray
Bioinformatics: The analysis of biological informa- experiment.
tion using computers and statistical techniques; the sci- Normalization: A process to remove systematic varia-
ence of developing and utilizing computer databases and tion in microarray experiment. Examples of such system-
algorithms to accelerate and enhance biological research atic variation includes between array variation, dye bias,
Gene Ontology (GO): GO is three structured, con- etc.
trolled vocabularies that describe gene products in terms Proteomics: is the large-scale study of proteins, par-
of their associated biological processes, cellular com- ticularly their structures and functions.
ponents and molecular functions in a species-indepen-
dent manner (http://www.geneontology.org ). SNP: Single Nucleotide Polymorphism. The same
sequence from two individuals can often be single base
Metabonomics: The evaluation of tissues and biologi- pair changes. Such change can be useful genetic markers
cal fluids for changes in metabolite levels that result from and might explain why certain individual respond to
toxicant-induced exposure. certain drug while others dont.

814

TEAM LinG
815

Mining Quantitative and Fuzzy M


Association Rules
Hong Shen
Japan Advanced Institute of Science and Technology, Japan

Susumu Horiguchi
Tohoku University, Japan

INTRODUCTION tinuous attribute value range to a set of discrete intervals


and then map all the values in each interval to an item of
The problem of mining association rules from databases binary values (Dougherty, Kohavi, & Sahami, 1995).
was introduced by Agrawal, Imielinski, & Swami (1993). In Two classical discretization methods are equal-width
this problem, we give a set of items and a large collection discretization, which divides the attribute value range
of transactions, which are subsets (baskets) of these into N intervals of equal width without considering the
items. The task is to find relationships between the occur- population (number of instances) within each interval,
rences of various items within those baskets. Mining and equal-depth (or equal-cardinality) discretization,
association rules has been a central task of data mining, which divides the attribute value range into N intervals of
which is a recent research focus in database systems and equal populations without considering the similarity of
machine learning and shows interesting applications in instances within each interval. Examples of using these
various fields, including information management, query methods are given in Fukuda, Morimoto, Morishita, and
processing, and process control. Tokuyama (1996); Miller and Yang (1997); and Srikant and
When items contain quantitative and categorical val- Agrawal (1996).
ues, association rules are called quantitative association To overcome the problems of sharp boundaries
rules. For example, a quantitative association rule derived (Gyenesei, 2001) and expressiveness (Kuok, Fu, & Wong,
from a regional household living standard investigation 1998) in traditional discretization methods, methods for
database has the following form: mining fuzzy association rules were suggested. Many
traditional algorithms, such as Apriori, MAFIA, CLOSET,
and CHARM, were employed to discover fuzzy associa-
age [50,55] married house. tion rules. However, the number of fuzzy attributes is
usually at least double the number of attributes; therefore,
Here, the first item, age, is numeric, and the second these algorithms require huge computational times. To
item is categorical. Categorical attributes can be con- reduce the computation cost of association mining, vari-
verted to numerical attributes in a straightforward way by ous parallel algorithms based on count, data, and candi-
enumerating all categorical values and mapping them to date distribution, along with other suitable strategies,
numerical values. were suggested (Han, Karypis, & Kumar, 1997; Shen,
An association rule becomes a fuzzy association rule Liang, & Ng, 1999; Shen, 1999a; Zaki, Parthasarathy, &
if its items contain probabilistic values that are defined by Ogihara, 2001). Recently, a parallel algorithm for mining
fuzzy sets. fuzzy association rules, which divides the set of fuzzy
attributes into independent partitions based on the natu-
ral independence among fuzzy sets defined by the same
BACKGROUND attribute, was proposed (Phan & Horiguchi, 2004b).

Many results on mining quantitative association rules


can be found in Li, Shen, and Topor (1999) and Miller and MAIN THRUST
Yang (1997). A common method for quantitative associa-
tion rule mining is to map numerical attributes to binary This article introduces recent results of our research on
attributes, then use algorithms of binary association rule this topic in two prospects: mining quantitative associa-
mining. A popular technique for mapping numerical at- tion rules and mining fuzzy association rules.
tributes is to attribute discretization that converts a con-

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Mining Quantitative and Fuzzy Association Rules

Mining Quantitative Association Rules tal thus has its representative center given by the average
weighted value of (cu, nu) and (cv, nv):
One method that was proposed is an adaptive numerical
value discretization method that considers both value v 1 w 1
cu i =u ni + cv i =u ni cu N u + cv N v
density and value distance of numerical attributes and cu = w 1
=
produces better quality intervals than the two classical i =u
ni Nu + Nv
methods (Li et al., 1999). This method repeatedly selects
a pair of adjacent intervals that have the minimum differ- An optimal interval merge scheme produces a mini-
ence to merge until a given criterion is met. It requires mum number of intervals whose maximal intrainterval
quadratic time on the number of attribute values, because distances are each within a given threshold and whose
each interval initially contains only a single value, and all populations are as equal as possible. Assume that the
the intervals may be merged into one large interval con- threshold for the maximal intrainterval difference is d,
taining all the values in the worst case. A method of linear- which can be the average interinterval differences of all
scan merging for quantizing numeric attribute values that the adjacent interval pairs or can be given by the system.
can be implemented in linear time sequentially and linear For k intervals, let the average population (support) of
cost in parallel was also proposed (Shen, 2001). This
method takes into consideration the maximal intrainterval each interval be N k = N / k and the population deviation
distance and load balancing simultaneously to improve of interval Iu be u =| N u N k | , where Nu is the actual
the quality of merging. In comparison with the existing population of Iu. Initially, Iu={xu} and 0<u<m-1. Our strat-
results in the same category, the algorithm achieves a egy leads to the following algorithm for interval merging:
linear time speedup.
Suppose that a numerical attribute has m distinct 1. Partition {I0, I1, , Im-1} into a minimum number of
values, I = {x0 , x1 ,...x m1 } , where attribute value xi has ni intervals such that each interval has a maximal
m 1 intrainterval distance not greater than d.
occurrences in the database (weight). Let N = i =0 ni be
2. Assume that Step 1 produces k inter-
the total number of attribute value occurrences, called
instances. Without loss of generality, we further assume vals: {I u , I u ,..., I u } and 0=u0< u1< uk-1<m-1. For
0 1 k 1

that xi < xi+1 for all 0<i<m-2 (otherwise, we can simply sort I u j = [ X u j : cu j ] ,where X u j = {xu j , xu j +1 ,..., xu j +1 1 } and
these values). Define P to be a set of maximal disjoint
intervals on I, where interval IuP contains a sequence of cu j is the representative center of I u j , check to see

if moving xu 1 to I u will result in a better load


v 1
attribute values {xu, xu+1,,x v-1} and Nu = i = u ni instances, j +1 j +1

v is the index of the next interval after Iu in P, and 0<u<v<m. balance while preserving the maximal intrainterval
We also assume that Iu has a representative center, cu. distance property, and do so if it will.
Initially, Iu contains only xu, which is also its representa-
tive center. Noticing that x 0<x1 < < xm-1, we can implement Step
We define the maximal intrainterval distance, denoted 1 simply by using a linear scan to form appropriate seg-
by D*(Iu; cu), as follows: ments of intervals after a single pass. Starting from I0 ,
merge Iu with Iu+j for j = 1,2, , until the next merge would
result in Ius maximal intrainterval distance greater than
D * (cu , cv ) = max | xi cu | the threshold; continue this process until no interval to be
i
merged remains. This process requires time O(m).
Assume that two adjacent Step 2 examines every adjacent pair of intervals after
the merge, requiring, at most, m-1 steps. Each step checks
intervals, I u = {xu ,...xv 1} and I v = {xv ,...xw1} , contain
the changes of population deviation by moving uj+1-1
v 1 w 1
N u = i =u ni and N v = i = v ni attribute value occurrences instances from I u to I u . That is, it considers whether the
j j +1

v 1
and have representative centers cu = i =u xi ni / N u and following condition holds:
w 1
cv = i =v xi ni / N v , respectively, and 0<u<v<m. The union,
| u j u j |<| u j +1 +u j +1 | ,
I u = I u U I v , of the two intervals containing (v-u)+(w-
w 1
v)=w-u attribute values and N u + N v = i =u ni instances to- where u =| N u nu N k | and +u j =| N u j+1 nu j+1 1 N k | .
j j j +1

816

TEAM LinG
Mining Quantitative and Fuzzy Association Rules

Do the move only when the condition holds. Clearly, this There are two main reasons for this assumption. First,
step requires O(m) time as well. fuzzy attributes sharing a common original attribute are M
It was shown that the above linear-scan implementa- usually mutually exclusive in meaning, so they would
tion indeed produces the minimum number of intervals largely reduce the support of rules in which they are
satisfying the intrainterval distance property. contained together. Second, such a rule would not be
worthwhile and would carry little meaning. Hence, we can
Mining Fuzzy Association Rules conclude that all fuzzy attributes in the same rule are
independent with respect to the fact that no pair of fuzzy
Let D be a relational database with I ={i1, , in} as the set attributes whose original attributes are identical exists.
of n attributes and T={t1, ,tm} as the set of m records. This observation is the foundation of our new parallel
Using the fuzzification method, we associate each at- algorithm.
The idea of partitioning algorithm is to divide the
tribute, iu, with a set of fuzzy sets: Fiu = { f i1u ,..., f iuk } . A fuzzy
original set of fuzzy attributes into separated parts (each
association rule (Gyenesei, 2001; Kuok et al., 1998) is an for a processor), so that every part retains at least one
implication as follows: fuzzy attribute for each original attribute. To do so, we
divide according to several original attributes, so that the
X is A = Y is B , number of fuzzy attributes reduced at each processor is
maximized. After dividing the set of all the fuzzy at-
where X,Y I are disjointedly frequent itemsets. A and B tributes for parallel processors, we can use any tradi-
are sets of disjoint fuzzy sets corresponding to attributes tional algorithms, such as Apriori, CHARM, and so forth,
to mine local association rules. Finally, the local results
in X and Y: A= { f x ,... f x } , B = { f y ,... f y } , f x Fx , f y Fy .
1 p 1 p i i j j
at processors are gathered to constitute the overall
A fuzzy itemset is now defined as a pair (<X, A>) of an result. Space limitations prevent a detailed discussion of
itemset X and a set A of fuzzy sets associated with at- the partitioning algorithm (FDivision) and the parallel
tributes in X. The support factor, denoted fs(<X, A>), is algorithm (PFAR) (Phan & Horiguchi, 2004b)
determined by the formula The PFAR algorithm was implemented using MPI
standard on the SP2 parallel system, which has a total of
m 64 nodes. Each node consists of four 322MHz PowerPC
{
v =1
x1 (tv x1 ) ... x p (tv x p )} / T , 604e processors, 512 MB local memory, and 9.1GB local
disks. In the experiments, we used only 24 processors
(PEs) on 24 nodes. The testing data included synthetical
where X ={x1,, xp} and tv is the vth record in T; is the T- data and real world databases of heart disease (created
by George John, 1994, statlog-adm@ncc.up.pt,
norm operator in fuzzy logic theory; x (tv xu ) is calculated
u
bob@stams.strathclyde.ac.uk). The experimental results
as showed that the performance of PFAR is satisfactory
(Phan & Horiguchi, 2004a).
mxu (tv xu ), mxu (tv xu ) wxu

0 otherwise,
FUTURE TRENDS
where mx is the membership function of fuzzy set f x ;
u u
Quantitative and fuzzy association rules are two types of
and wx is a user-specified threshold of membership func-
u important association rules that exist in many real-life
tion mx . u
applications. Future research trends include mining them
Each fuzzy attribute is a pair of the original attribute among data that have structural properties such as se-
name accompanied by the fuzzy set name. We perceive that quence, time series, and spatiality. Mining these types of
any fuzzy association rule never contains two fuzzy at- rules in multidimensional databases is also a challenging
tributes sharing a common original attribute in I. For problem. These complex data mining tasks require tech-
example, the rule Age_Old< niques not only for data mining but also for data represen-
tation, reduction, transformation, and visualization.
Cholesterol_High<Age_Young HeartDisease_Yes is
Completion of these tasks depends on successful integra-
invalid because it contains Age_Old and Age_Young,
tion of all the relevant techniques and effective applica-
both of which derive from a common original attribute, Age.
tion of those techniques.

817

TEAM LinG
Mining Quantitative and Fuzzy Association Rules

CONCLUSION Phan, X. H., & Horiguchi, S. (2004a). A parallel algorithm


for mining fuzzy association rules (Tech. Rep. No. IS-RR-
We have summarized research activities in mining quan- 2004-014, JAIST).
titative and fuzzy association rules and some recent Phan, X. H., & Horiguchi, S. (2004b). Parallel data mining
results of our research on these topics. These results, for fuzzy association rules (Tech. Rep).
together with those given in Shen (1999b, 2005), show a
complete picture of the major achievements made within Shen, H. (1999a). Fast subset statistics and its applica-
our data mining research projects. We hope that this tions in data mining. Proceedings of the International
article serves as a useful reference to the researchers Conference on Parallel and Distributed Processing Tech-
working in this area and that its contents provide interest- niques and Applications (pp. 1476-1480).
ing techniques and tools to solve these two challenging
problems, which are significant in various applications. Shen, H. (1999b). New achivements in data mining. In J.
Sun (Ed.), Science: Advancing into the new millenium
(pp. 162-178). Beijing: Peoples Education.
REFERENCES Shen, H. (2001). Optimal attribute quantization for mining
quantitative association rules. Proceedings of the 2001
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining Asian High-Performance Computing Conference, Aus-
associations between sets of items in massive databases. tralia.
Proceedings of the ACM SIGMOD International Confer-
ence on Management of Data (pp. 207-216). Shen, H. (2005). Advances in mining association rules. In
John Wang (Ed.), Encyclopedia of data warehousing
Dehaspe, L., & Raedt, L. D. (1997). Mining association and mining. Hershey, PA: Idea Group Reference.
rules in multiple relations. Proceedings of the Seventh
International Workshop on Inductive Logic Program- Shen, H., Liang, W., & Ng, J. K-W. (1999). Efficient
ming, 1297 (pp. 125-132). computation of frequent itemsets in a subcollection of
multiple set families. Informatica, 23(4), 543-548.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Super-
vised and unsupervised discretization of continuous fea- Srikant, R., & Agrawal, R. (1996). Mining quantitative
tures. Proceedings of the 12th International Conference association rules in large relational tables. ACM SIGMOD
on Machine Learning (pp. 194-202). Record, 25(2), 1-8.

Fukuda, T., Morimoto, Y., Morishita, S., & Tokuyama, T. Zaki, M. J., Parthasarathy, S., & Ogihara, M. (2001).
(1996). Mining optimized association rules for numeric Parallel data mining for association rules on shared-
attributes. Proceedings of the 15th ACM SIGACT- memory systems. Knowledge and Information Systems,
SIGMOD (pp. 182-191). 3(1), 1-29.

Gyenesei, A. (2001). A fuzzy approach for mining quanti-


tative association rules. Acta Cybernetica, 15(2), 305-
320. KEY TERMS
Han, E-W., Karypis, G., & Kumar, V. (1997). Scalable Attribute Discretization: The process of converting
parallel data mining for association rules. ACM SIGMOD a (continuous) attribute value range to a set of discrete
Record, 26(2), 277-288. intervals and representing all the values in each interval
Kuok, C. M., Fu, A., & Wong, M. H. (1998). Mining fuzzy with a single (interval) label.
association rules in databases. ACM SIGMOD Record, Attribute Fuzzification: A process that converts
27(1), 41-46. numeric attribute values into membership values of fuzzy
Li, J., Shen, H., & Topor, R. (1999). An adaptive method of sets.
numerical attribute merging for quantitative association Equal-Depth Discretization: A technique that divides
rule mining. Proceedings of the 1999 International Com- the attribute value range into intervals of equal popula-
puter Science Conference (pp. 31-40), Hong Kong. tion (value density).
Miller, R. J., & Yang, Y. (1997). Association rules over Equal-Width Discretization: A technique that di-
interval data. ACM SIGMOD Record, 26(2), 452-460. vides the attribute value range into intervals of equal
width (length).

818

TEAM LinG
Mining Quantitative and Fuzzy Association Rules

Fuzzy Association Rule: An implication rule showing


the conditions of co-occurrence of itemsets that are M
defined by fuzzy sets whose elements have probabilistic
values.
Quantitative Association Rule: An implication rule
containing items of quantitative and/or categorical val-
ues that shows in a given database.

819

TEAM LinG
820

Model Identification through Data Mining


Diego Liberati
Consiglio Nazionale delle Ricerche, Italy

INTRODUCTION tion, and later to infer logical relationships. Finally, it


allows the identification of a simple input-output model of
In many fields of research as in everyday life, one has to the involved process that also could be used for control-
face a huge amount of data, often not completely homo- ling purposes.
geneous and many times without an immediate grasp of
an underlying simple structure. A typical example is the
growing field of bio-informatics, where new technologies, BACKGROUND
like the so-called micro-arrays, provide thousands of
gene expression data on a single cell in a simple and The two tasks of selecting salient variables and identify-
quickly integrated way. Further, the everyday consumer ing their relationships from data may be sequentially
is involved in a process not so different from a logical accomplished with various degrees of success in a variety
point of view, when the data associated with the of ways. Principal components order the variables from
consumers identity contributes to a large database of the most salient to the least, but only under a linear
many customers whose underlying consumer trends are framework. Partial least squares allow an extension to
of interest to the distribution market. non-linear models, provided that one has prior informa-
The large number of variables (i.e., gene expressions, tion on the structure of the involved non-linearity; in fact,
goods) for so many records (i.e., patients, customers) the regression equation needs to be written before iden-
usually are collected with the help of wrapping or ware- tifying its parameters. Clustering may operate in an unsu-
housing approaches in order to mediate among different pervised way without the a priori correct classification of
repositories, as described elsewhere in the encyclopedia. a training set (Booley, 1998). Neural networks are known
Then, the problem arises of reconstructing a synthetic to use embedded rules with the indirect possibility (Taha
mathematical model, capturing the most important rela- & Ghosh, 1999)of making rules explicit or to underline the
tions between variables. This will be the focus of the salient variables. Decision trees (Quinlan, 1994), a popu-
present contribution. lar framework, can provide a satisfactory answer to both
A possible approach to blindly building a simple linear questions.
approximating model is to resort to piece-wise affine
(PWA) identification (Ferrari-Trecate et al., 2003). The
rationale for this model will be explored in the second part MAIN THRUST
of the methodological section of this article.
In order to reduce the dimensionality of the problem, Recently, a different approach has been suggestedHam-
thus simplifying both the computation and the subse- ming clustering. This approach is related to the classical
quent understanding of the solution, the critical problems theory exploited in minimizing the size of electronic circuits,
of selecting the most salient variables must be solved. A with the additional ability to obtain a final function able to
possible approach is to resort to a rule induction method, generalize everything from the training dataset to the most
like the one described in Muselli and Liberati (2002) and likely framework describing the actual properties of the
recalled in the first methodological part of this contribu- data. In fact, the Hamming metric tends to cluster samples
tion. Such a strategy offers the advantage of extracting with code that is less distant. This is likely to be natural if
underlying rules and implying conjunctions and/or dis- variables are redundantly coded via thermometer (for nu-
junctions between such salient variables. The first idea of meric variables) or used for only-one (for logical variables)
their non-linear relationships is provided as a first step to code (Muselli & Liberati, 2000). The Hamming clustering
designing a representative model,using the selected vari- approach reflects the following remarkable properties:
ables.
The joint use of the two approaches recalled in this It is fast, exploiting (after the mentioned binary
article, starting from data without known background coding) just logical operations instead of floating
information about their relationships, first allows a reduc- point multiplications.
tion in dimensionality without significant loss in informa-

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Model Identification through Data Mining

It directly provides a logical understandable expres- Table 1. The three steps executed by Hamming clustering
sion (Muselli & Liberati, 2002), which is the final to build the set of rules embedded in the mined data M
synthesized function directly expressed as the OR
of ANDs of the salient variables, possibly negated. 1. The input variables are converted into binary strings
via a coding designed to preserve distance and, if
When the variables are selected, a mathematical model relevant, ordering.
of the underlying generating framework still must be 2. The OR of ANDs expression of a logical function
produced. At this point, a first hypothesis of linearity may is derived from the training examples coded in the
be investigated (usually being only a very rough approxi- binary form of step 1.
mation) where the values of the variables are not close to 3. In the OR final expression, each logical AND pro-
the functioning point around which the linear approxima- vides intelligible conjunctions or disjunctions of the
involved variables, ruling the analyzed problem.
tion is computed.
Building a non-linear model is far from easy; the
structure of the non-linearity needs to be a priori knowl-
is one reason for the algorithm speed and for its
edge, which is not usually the case. A typical approach
insensitivity to precision.
consists of exploiting a priori knowledge, when available,
to define a tentative structure, then refining and modify- Step 2: Classical techniques of logical synthesis are
ing it on the training subset of data, and finally retaining specifically designed to obtain the simplest AND-
the structure that best fits a cross-validation on the OR expression able to satisfy all the available input-
testing subset of data. The problem is even more complex output pairs without an explicit attitude to generali-
when the collected data exhibit hybrid dynamics (i.e., their zation. To generalize and infer the underlying rules
evolution in time is a sequence of smooth behaviors and at every iteration, by Hamming clustering groups
abrupt changes). together in a competitive way, binary strings have
An alternative approach is to infer the model directly the same output and are close to each other. A final
from the data without a priori knowledge via an identifica- pruning phase does simplify the resulting expres-
tion algorithm capable of reconstructing a very general sion, further improving its generalization ability.
class of a piece-wise affine model (Ferrari-Trecate et al., Moreover, the minimization of the involved vari-
2003). This method also can be exploited for the data- ables intrinsically excludes the redundant ones,
driven modeling of hybrid dynamic systems, where logic thus enhancing the very salient variables for the
phenomena interact with the evolution of continuous- investigated problem. The low (quadratic) compu-
valued variables. Such an approach will be described tational cost allows for managing quite large
concisely later, after a more detailed drawing of the rules- datasets.
oriented mining procedure, and some applications will be Step 3: Each logical product directly provides an
discussed briefly. intelligible rule, synthesizing a relevant aspect of
the underlying system that is believed to generate
Binary Rule Generation and Variable the available samples.
Selection While Mining Data
Identification of Piece-wise Affine
The approach followed by Hamming clustering in mining Systems Through a Clustering
the available data to select the salient variables and to Technique
build the desired set of rules consists of the three steps
in Table 1. Once the salient variables have been selected, it may be
of interest to capture a model of their dynamical interac-
Step 1: A critical issue is the partition of a possibly tion. Piece-wise affine identification exploits K-means
continuous range in intervals, whose number and clustering that associates data points in multivariable
limits may affect the final result. The thermometer space in such a way that jointly determines a sequence
code then may be used to preserve ordering and of linear submodels and their respective regions of opera-
distance (in the case of nominal input variables, for tion without even imposing continuity at each change in
which a natural ordering cannot be defined, instead the derivative. In order to obtain such a result, the five
adopting the only-one). The simple metric used is steps reported in Table 2 are executed.
the Hamming distance, computed as the number of
different bits between binary strings. In this way, Step 1: The model is locally linear; small sets of data
the training process does not require floating point points close to each other likely belong to the same
computation but rather basic logic operations. This submodel. For each data point, a local set is built,

821

TEAM LinG
Model Identification through Data Mining

collecting the selected points together with a given ation for ongoing research projects. The field of life
number of its neighbors (whose cardinality is one of science plays a central role because of its relevance to
the parameters of the algorithm). Each local set will science and to society.
be pure if made of points really belonging to the same A growing acceptance of international relevance, in
single linear subsystem; otherwise, it is mixed. which the described approaches are being used to
Step 2: For each local dataset, a linear model is provide a contribution, is the so-called field of systems
identified through a usual least squares procedure. biologya feedback model of how proteins interact with
Pure sets belonging to the same submodel give each other and with nucleic acids within the cell, both of
similar parameter sets, while mixed sets yield isolated which are needed to better understand the control mecha-
vectors of coefficients, looking as outliers in the nisms of the cellular cycle. This is especially evident with
parameter space. If the signal-to-noise ratio is good respect to duplication (such as cancer, when such
enough, and if there are not too many mixed sets (i.e., mechanisms become out of control). Such an under-
the number of data points is more than the number of standing will hopefully encourage a drive to personal
submodels to be identified, and the sampling is fair therapy, when everybodys gene expression will be cor-
in every region), then the vectors will cluster in the related to the amount of corresponding proteins in-
parameter space around the values pertaining to each volved in the cellular cycle. Moreover, a new computa-
submodel, apart from a few outliers. tional paradigm could arise by exploiting biological com-
Step 3: A modified version of the classical K-means, ponents like cells instead of the usual silicon hardware,
whose convergence is guaranteed in a finite number thus overcoming some technological issues and possi-
of steps (Ferrari-Trecate et al., 2003), takes into bly facilitating neuroinformatics.
account the confidence on pure and mixed local sets The study of systems biology as a leading edge of the
in order to cluster the parameter vectors. large field of bioinformatics, begins by analyzing data
from so-called micro-arrays. They are little standard
Step 4: Data points are then classified, each belong-
chips, where thousands of gene expressions may be
ing to a local dataset one-to-one related to its gen- obtained from the same cell material, thus providing a
erating data point, which is classified according to large amount of data whose handling with the usual
the cluster to which its parameter vector belongs. deterministic approaches is not conceivable, and whose
Step 5: Both the linear submodels and their regions fault is its inability to obtain significant synthetic infor-
are estimated from the data in each subset. The mation. Thus, matrices of as many subjects as available,
coefficients are estimated via weighted least squares, possibly grouped in homogeneous categories for super-
taking into account the confidence measures. The vised training, each one carrying thousands of gene
shape of the polyhedral region characterizing the expressions, are the natural input to algorithms like
domain of each model may be obtained via linear Hamming clustering. The desired output are rules able to
support vector machines (Vapnik, 1998), easily solved classify; for instance, patients affected by different tu-
via linear/quadratic programming. mors from healthy subjects, on the basis of a few iden-
tified genes, whose set is the candidate basis for the
A Few Applications piece-wise linear model describing their complex interac-
tion in such a particular class of subjects.
The field of application of the proposed approach is Also, without deeper insight into the cell, the iden-
intrinsically wide (both tools are most general and quite tification of the prognostic factors in oncology is already
powerful, especially if combined together). Here, only a occuring with Hamming clustering (Paoli et al., 2000) and
few suggestions will be drawn with reference to already also providing a hint about their interaction, which is not
obtained results or to some application under consider- explicit in the outcome of a simple neural network (Drago
et al., 2002).
Table 2. The five steps for piece-wise affine identification

1. The local datasets neighboring each sample are built. FUTURE TRENDS
2. The local linear models are identified through least squares.
3. The parameters vectors are clustered through a modified Beside improvements in the algorithmic approaches (also
K-means. implying the possibility of taking into account potential
4. The data points are classified according to the clustered a priori knowledge), a wide range of applications will
parameter vectors.
benefit from the proposed ideas in the near future; some
5. The submodels are estimated together with their domains.
of which are outlined here and partially recalled in Table
3.

822

TEAM LinG
Model Identification through Data Mining

Drug design would benefit from the a priori forecasting provided by Hamming clustering about the hydrophilic behavior of the not yet experimented pharmacological molecule, on the basis of the known properties of some possible radicals and in the track of Manga et al. (2003), within the frame of computational methods for the prediction of drug-likeness (Clark & Pickett, 2000). In principle, such in silico predictions, not quite different from the systems biology expectation, are of paramount relevance to pharmaceutical companies to save money in designing minimally invasive drugs controlling kidney metabolism that would have less of an effect on the liver the more hydrophilic they are.

Compartmental behavior of drugs then may be analysed via piece-wise identification by collecting in vivo data samples and clustering them within the more active compartment at each stage, instead of the classical linear system identification that requires non-linear algorithms. The same is true for metabolism, such as glucose in diabetes or urea in renal failure. Dialysis can be modeled as a compartmental process in a sufficiently accurate way. A piece-wise linear approach is able to simplify the identification even on a single patient, when population data are not available to allow a simple linear deconvolution approach (Liberati & Turkheimer, 1999), whose result is only an average knowledge of the overall process, without taking into special account the very subject under analysis. Moreover, compartmental models are pervasive, like in ecology, wildlife, and population studies; potential applications in that direction are almost never ending.

Many physiological processes are switching naturally or intentionally from an active to a quiescent state, like hormone pulses whose identification (Sartorio et al., 2002) is important in growth and fertility diseases as well as in doping assessment. In that respect, the most fascinating human organ is probably the brain, whose study may be undertaken today either in the sophisticated frame of functional nuclear magnetic imaging or in the simple way of EEG recording. Multidimensional (three space variables plus time) images or multivariate time series provide an abundance of raw data to mine in order to understand which kind of activation is produced in correspondence with an event (Maieron et al., 2002) or a decision. A brain computer-interfacing device then may be built that is able to reproduce one's intention to perform a move (not directly possible for the subject in some impaired physical conditions) and command a proper actuator. Both Hamming clustering and piece-wise affine identification would improve the only partially successful approach based on artificial neural networks (Babiloni et al., 2000). A simple drowsiness detector based on the EEG may be designed, as well as a flexible anaesthesia/hypotension level detector, without needing a time-varying, more precise, but more costly, identification. A psychological stress indicator may be inferred, outperforming the approach of Pagani et al. (1991). The multivariate analysis (Liberati et al., 1997), possibly taking into account the input stimulation so useful in approaching difficult neurological tasks (like modeling electroencephalographic coherence in Alzheimer's disease patients (Locatelli et al., 1998)) or non-linear effects in muscle contraction (Orizio et al., 1996), would be outperformed by piece-wise linear identification, even in time-varying contexts like epilepsy.

Industrial applications, of course, are not excluded from the field of possible interests. In Ferrari-Trecate et al. (2003), for instance, the classification and identification of the dynamics of an industrial transformer are performed via the piece-wise approach, with a much simpler cost and no really significant reduction of performances with respect to the non-linear modeling described in Bittanti et al. (2001).

Table 3. A summary of selected applications of the proposed approaches

System Biology and Bioinformatics: To identify the interaction of the main proteins involved in cell cycle.
Drug Design: To forecast desired and undesired behavior of the final molecule from components.
Compartmental Modeling: To identify number, dimensions, and exchange rates of communicating reservoirs.
Hormone Pulses Detection: To detect true pulses among the stochastic oscillations of the baseline.
Sleep Detector: To forecast drowsiness as well as to identify sleep stages and transitions.
Stress Detector: To identify the psychological state of a subject from multivariate analysis of biological signals.
Prognostic Factors: To identify the interaction between selected features (e.g., in oncology).
Pharmacokinetics and Metabolism: To identify diffusion and metabolic time constants from time series
sampled in blood.
Switching Processes: To identify abrupt or possibly smoother commutations within the process duty cycle.
Brain Computer Interfacing: To identify a decision taken within the brain and propagate it directly from brain waves.
Anesthesia Detector: To monitor the level of anesthesia and provide closed-loop control.
Industrial Applications: A wide spectrum of logical or dynamic-logical hybrid problems may be faced (e.g., the tracking of a transformer).


CONCLUSION

The proposed approaches are very powerful tools for quite a wide spectrum of applications in and beyond data mining, providing an up-to-date answer to the quest of formally extracting knowledge from data and sketching a model of the underlying process.

ACKNOWLEDGMENTS

Marco Muselli had the key role in conceiving and developing the recalled clustering algorithms, while Giancarlo Ferrari-Trecate has been determinant in extending them to model identification as well as in kindly revising the present contribution. Their stimulating and friendly interaction over the past few years is warmly acknowledged. An anonymous reviewer is also gratefully acknowledged for indicating how to improve the writing in several critical points.

REFERENCES

Babiloni, F. et al. (2000). Comparison between human and ANN detection of laplacian-derived electroencephalographic activity related to unilateral voluntary movements. Comput Biomed Res, 33, 59-74.

Bittanti, S., Cuzzola, F.A., Lorito, F., & Poncia, G. (2001). Compensation of nonlinearities in a current transformer for the reconstruction of the primary current. IEEE T Control System Techn, 9(4), 565-573.

Boley, D.L. (1998). Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 325-344.

Clark, D.E., & Pickett, S.D. (2000). Computational methods for the prediction of drug-likeness. Drug Discovery Today, 5(2), 49-58.

Drago, G.P., Setti, E., Licitra, L., & Liberati, D. (2002). Forecasting the performance status of head and neck cancer patient treatment by an interval arithmetic pruned perceptron. IEEE T Bio-Med Eng, 49(8), 782-787.

Ferrari-Trecate, G., Muselli, M., Liberati, D., & Morari, M. (2003). A clustering technique for the identification of piecewise affine systems. Automatica, 39, 205-217.

Liberati, D., Cursi, M., Locatelli, T., Comi, G., & Cerutti, S. (1997). Total and partial coherence of spontaneous and evoked EEG by means of multi-variable autoregressive processing. Med Biol Eng Comput, 35(2), 124-130.

Liberati, D., & Turkheimer, F. (1999). Linear spectral deconvolution of catabolic plasma concentration decay in dialysis. Med Biol Eng Comput, 37, 391-395.

Locatelli, T., Cursi, M., Liberati, D., Franceschi, M., & Comi, G. (1998). EEG coherence in Alzheimer's disease. Electroenceph Clin Neurophys, 106(3), 229-237.

Maieron, M. et al. (2002). An ARX model-based approach to the analysis of single-trial event-related fMRI data. Proceedings of the 8th International Conference on Functional Mapping of the Human Brain, Sendai, Japan.

Manga, N., Duffy, J.C., Rowe, P.H., & Cronin, M.T.D. (2003). A hierarchical quantitative structure-activity relationship model for urinary excretion of drugs in humans as predictive tool for biotransformation. Quantitative Structure-Activity Relationship Comb Sci, 22, 263-273.

Muselli, M., & Liberati, D. (2000). Training digital circuits with Hamming clustering. IEEE T Circuits and Systems I: Fundamental Theory and Applications, 47, 513-527.

Muselli, M., & Liberati, D. (2002). Binary rule generation via Hamming clustering. IEEE T Knowledge and Data Eng, 14, 1258-1268.

Orizio, C., Liberati, D., Locatelli, C., DeGrandis, D., & Veicsteinas, A. (1996). Surface mechanomyogram reflects muscle fibres twitches summation. J Biomech, 29(4), 475-481.

Pagani, M. et al. (1991). Sympatho-vagal interaction during mental stress: A study employing spectral analysis of heart rate variability in healthy controls and in patients with a prior myocardial infarction. Circulation, 83(4), 43-51.

Paoli, G. et al. (2000). Hamming clustering techniques for the identification of prognostic indices in patients with advanced head and neck cancer treated with radiation therapy. Med Biol Eng Comput, 38, 483-486.

Quinlan, J.R. (1994). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.

Sartorio, A., De Nicolao, G., & Liberati, D. (2002). An improved computational method to assess pituitary responsiveness to secretagogue stimuli. Eur J Endocrinol, 147(3), 323-332.

Taha, I., & Ghosh, J. (1999). Symbolic interpretation of artificial neural networks. IEEE T Knowledge and Data Eng, 11, 448-463.


Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

KEY TERMS

Bioinformatics: The processing of the huge amount of information pertaining to biology.

Hamming Clustering: A fast binary rule generator and variable selector able to build understandable logical expressions by analyzing the Hamming distance between samples.

Hybrid Systems: Their evolution in time is composed by both smooth dynamics and sudden jumps.

Micro-Arrays: Chips where thousands of gene expressions may be obtained from the same biological cell material.

Model Identification: Definition of the structure and computation of its parameters best suited to mathematically describe the process underlying the data.

Rule Generation: The extraction from the data of the embedded synthetic logical description of their relationships.

Salient Variables: The real players among the many apparently involved in the true core of a complex business.

Systems Biology: The quest of a mathematical model of the feedback regulation of proteins and nucleic acids within the cell.


Modeling Web-Based Data in a Data Warehouse


Hadrian Peter
University of the West Indies, Barbados

Charles Greenidge
University of the West Indies, Barbados

INTRODUCTION

Good database design generates effective operational databases through which we can track customers, sales, inventories, and other variables of interest. The main reason for generating, storing, and managing good data is to enhance the decision-making process. The tool used during this process is the decision support system (DSS). The information requirements of the DSS have become so complex that it is difficult for it to extract all the necessary information from the data structures typically found in operational databases. For this reason, a new storage facility called a data warehouse has been developed. Data in the data warehouse have characteristics that are quite distinct from those in the operational database (Rob & Coronel, 2002).

The data warehouse extracts or obtains its data from operational databases as well as from other sources, thus providing a comprehensive data pool from which useful information can be extracted. The other sources represent the external data component of the data warehouse. External information in corporate data warehouses has increased during the past few years because of the wealth of information available, the desire to work more closely with third parties and business partners, and the Internet.

In this paper, we focus on the Internet as the source of the external data and propose a model for representing such data. The model represents a timely intervention by two important technologies (i.e., data warehousing and search engine technology) and is a departure from current thinking of major software vendors (e.g., IBM, Oracle, Microsoft) that seeks to integrate multiple tools into one environment with one administration. The paper is organized as follows: background on the topic is discussed in the following section, including a literature review on the main areas of the paper; the main thrust section discusses the detailed model, including a brief analysis of the model; finally, we identify and explain the future trends of the topic.

BACKGROUND

Data Warehousing

Data warehousing is a process concerned with the collection, organization, and analysis of data typically from several heterogeneous sources with an aim to augment end-user business functions (Hackathorn, 1998; Strehlo, 1996). It is intended to provide a working model for the easy flow of data from operational systems to decision support systems. The data warehouse structure includes three major levels of information (granular data, archival data, and summary data) and the metadata to support them (Inmon, Zachman & Geiger, 1997).

Data warehousing:

1. Is centered on ad hoc end-user queries posed by business users rather than by information system experts.
2. Is concerned with off-line, historical data rather than online, volatile, operational type data.
3. Must efficiently handle larger volumes of data than those handled by normal operational systems.
4. Must present data in a form that coincides with the expectations and understanding of the business users of the system rather than that of the information system architects.
5. Must consolidate data elements. Diverse operational systems will refer to the same data in different ways.
6. Relies heavily on meta-data. The role of meta-data is particularly vital, as data must remain in its proper context over time.

The main issues in data warehousing design are:

1. performance versus flexibility;
2. cost;
3. maintenance.
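As a minimal illustration of the granular and summary levels of information described above, the sketch below aggregates invented granular sales records into a monthly summary table. The column names, the sample values, and the use of pandas are assumptions for illustration only and are not taken from the cited works.

```python
import pandas as pd

# Granular data: one row per individual sale (lowest level of detail).
granular = pd.DataFrame({
    "sale_date": pd.to_datetime(["2004-01-03", "2004-01-17", "2004-02-05"]),
    "store":     ["North", "North", "South"],
    "amount":    [120.0, 80.0, 200.0],
})

# Transformation step: derive the month, then aggregate to summary data.
granular["month"] = granular["sale_date"].dt.to_period("M")
summary = (granular.groupby(["month", "store"], as_index=False)["amount"]
                   .sum()
                   .rename(columns={"amount": "total_sales"}))

print(summary)  # summary data kept in the warehouse alongside the granular rows
```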


Details of data warehousing/warehouses issues are provided in Berson and Smith (1997), Bischoff and Yevich (1996), Devlin (1998), Hackathorn (1995, 1998), Inmon (2002), Mattison (1996), Rob and Coronel (2002), Becker (2002), Kimball and Ross (2002), and Lujan-Mora and Trujillo (2003).

External Data and Search Engines

External data represent an essential ingredient in the production of decision-support analysis in the data warehouse environment (Inmon, 2002). The ability to make effective decisions often requires the use of information that is generated outside the domain of the person making the decision (Parsaye, 1996). External data sources provide data that cannot be found within the operational database but are relevant to the business (e.g., stock prices, market indicators, and competitors' data).

Once in the warehouse, the external data go through processes that enable them to be interrogated by the end user and business analysts seeking answers to strategic questions. This external data may augment, revise, or even challenge the data residing in the operational systems (Barquin & Edelstein, 1997).

Our model provides a cooperative nexus between the data warehouse and search engine, and is aimed at enhancing the collection of external data for end-user queries originating in the data warehouse (Bischoff & Yevich, 1996). The model grapples with the complex design differences between data warehouse and search engine technologies by introducing an independent, intermediate, data-staging layer.

Search engines have been used since the 1990s to grant to millions access to global external data via the Internet (Goodman, 2000; Sullivan, 2000). In spite of their shortcomings (Chakrabarti et al., 2001; Soudah, 2000), search engines still represent the best approach to the retrieval of Internet-based external data (Goodman, 2000; Sullivan, 2000). Importantly, under our model, queries originating with the business expert on the warehouse side can be modified at an intermediate stage before being acted on by the search engine. Results coming from the search engine also can be processed prior to being relayed to the warehouse. This intermediate, independent meta-data bridge is an important concept in this model.

THE DETAILED MODEL

Motivation/Justification

To better understand the importance of the new model, we must examine the search engine and meta-data engine components. The meta-data engine cooperates closely with both the data warehouse and search engine architectures to facilitate queries and efficiently handle results. In this paper, we propose a special-purpose search engine, which is a hybrid between the traditional, general-purpose search engine and the meta-search engine.

The hybrid approach we have taken is an effort to maximize the efficiency of the search process. Although the engine does not need to produce results in a real-time mode, the throughput of the engine may be increased by using multi-threading.

The polling of the major commercial engines by our engine ensures the widest possible access to Internet information (Bauer et al., 2002; Pfaffenberger, 1996; Ray et al., 1998; Sullivan, 2000).

The special-purpose hybrid search engine in Figure 1 goes through the following steps each time a query is processed (a sketch of this loop follows the list):

1. Retrieve the HTML documents of the major search engines' home pages;
2. Parse the files and submit GET-type queries to the major engines;
3. Retrieve the results and use the links found in the results to form starting seed links;
4. Perform a depth/breadth-first traversal using the seed links; and
5. Capture the results from step 4 to local disk.
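The sketch below illustrates steps 1-5 of this loop in Python (the article itself envisages Java for the search engine). The engine endpoints, the query parameter, the regular-expression link extraction, and the breadth-first crawl policy are assumptions for illustration; the home-page retrieval and parsing of steps 1-2 are collapsed here into pre-configured query URLs, and a production crawler would also respect robots.txt and use the multi-threading mentioned above.

```python
import re
import requests
from urllib.parse import urljoin, quote_plus

# Illustrative engine endpoints; real engines and their query parameters differ.
ENGINES = ["https://www.example-engine-a.com/search?q=",
           "https://www.example-engine-b.com/search?q="]

HREF = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

def hybrid_search(query, max_pages=50, max_depth=2):
    # Steps 1-2: submit a GET-type query to each major engine.
    seeds = []
    for engine in ENGINES:
        html = requests.get(engine + quote_plus(query), timeout=10).text
        # Step 3: links found in the result pages become starting seed links.
        seeds.extend(HREF.findall(html))

    # Step 4: breadth-first traversal from the seed links.
    frontier, seen, saved = [(url, 0) for url in seeds], set(seeds), 0
    while frontier and saved < max_pages:
        url, depth = frontier.pop(0)
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        # Step 5: capture the downloaded result to local disk.
        with open(f"result_{saved:04d}.html", "w", encoding="utf-8") as f:
            f.write(page)
        saved += 1
        if depth < max_depth:
            for link in HREF.findall(page):
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append((absolute, depth + 1))
    return saved
```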

Figure 1. Query and retrieval in hybrid search engine

(Diagram: the user query is submitted to the major engines; the returned HTML result files are parsed, seed links are stored and followed, further HTML files are downloaded, the results are filtered, and the meta-data engine is invoked on the final results.)


Architecture

It is evident that our proposed system uses the Internet to supplement the external data requirements of the data warehouse. To function effectively, the system must address the issues surrounding the architectures of the data warehouse and search engine. The differences are in the following areas:

1. Subject Area Specificity
2. Granularity
3. Time Basis (Time Stamping)
4. Frequency of Updates
5. Purity of Data

To address these areas, we need a flexible architecture that can provide a bridge over which relevant data can be transmitted. This bridge needs to be a communication channel as well, so that an effective relationship between the data warehouse and search engine domains is maintained. One approach is to integrate the search engine into the warehouse. This approach, though possible, is not expedient, because the warehouse environment is not an operational type environment but one optimized for the analysis of enterprise-wide historical data.

Typical commercial search engines are composed of a crawler (spider) and an indexer (mite). The indexer is used to codify results into a database for easy querying. Our approach reduces the complexity of the search engine by moving the indexing tasks to the meta-data engine. The purpose of the meta-data engine is to facilitate an automatic information retrieval mode. Figure 2 illustrates the concept of the bridge between the data warehouse and search engine environments.

The meta-data engine provides a useful transaction point for both queries and results. It is useful architecturally and practically to keep the search engine and data warehouse separate. The separation of the three architectures means that they can be optimized independently. The meta-data engine performs a specialized task of interfacing between two distinct environments, and, as such, we believe that it should be developed separately.

We now briefly examine each of the four components of the meta-data engine operation as shown in Figure 2. First, queries originating from the warehouse will have to be filtered to ensure they are actionable; that is, non-redundant and meeting the timeliness and other processing requirements of the search engine component. Second, incoming queries may have to be modified in order to meet the goal of being actionable or in order to meet specific efficiency/success criteria for the search engine. For example, the recall of results may be enhanced if certain terms are included/excluded from the original query coming from the warehouse. Third, after results are retrieved, some minimal analysis of those results will have to occur to determine either the success or failure of the retrieval process or to verify the content type or structure (e.g., HTML, XML, .pdf) of the retrieved results. Due to the ability of search engines to return large volumes of irrelevant results, some preliminary analysis is warranted. More advanced versions of the engine would tackle semantic issues. The fourth component (labeled "format" in the diagram) will be invoked to produce plain text from the recognizable document formats or, alternatively, to convert from one known document format to another.

Off-the-shelf tools exist online to parse HTML, XML, and other Internet document formats, as well as language toolkits with which these four important areas of meta-data engine construction can be addressed. The Web provides a sprawling proving ground of semi-structured, structured, and unstructured data that must be handled using multiple syntactic, structural, and semantic techniques. The meta-data engine seeks to architecturally isolate these techniques from the data warehouse and search engine domains.

Also, in our proposed model, the issue of relevance must be addressed. Therefore, a way must be found to sort out relevant from non-relevant documents. For example, we are interested in questions such as the following: Does every page containing the word "Java" specifically address Java programming issues or tools?

Figure 2. Meta-data engine operation

(Diagram: the meta-data engine sits between the data warehouse and the search engine; queries from the data warehouse are filtered and modified before being submitted to the search engine, and results from the search engine are analyzed and formatted before being provided to the data warehouse.)
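Below is a minimal Python sketch of the four stages shown in Figure 2. The filtering criterion, the excluded terms, the recognized content types, and the submit_to_search_engine callable (expected here to return dictionaries with content_type and body fields, for example an adapted version of the hybrid_search sketch above) are all illustrative assumptions rather than part of the published model.

```python
import re

SUBMITTED = set()                                  # memory of past queries (stage 1)
EXCLUDED = {"home", "login"}                       # illustrative terms stripped in stage 2
RECOGNIZED = {"text/html", "application/xml", "application/pdf"}

def metadata_engine(query, submit_to_search_engine):
    """Bridge one warehouse query to the search engine and back (Figure 2)."""
    # Stage 1 (filter): only actionable, non-redundant queries go forward.
    if not query.strip() or query in SUBMITTED:
        return []
    SUBMITTED.add(query)

    # Stage 2 (modify): adjust the query to the engine's success criteria.
    modified = " ".join(t for t in query.split() if t.lower() not in EXCLUDED)

    # Stage 3 (analyze): keep only results with a recognizable content type.
    results = [r for r in submit_to_search_engine(modified)
               if r.get("content_type") in RECOGNIZED]

    # Stage 4 (format): reduce recognized documents to plain text for the warehouse.
    return [re.sub(r"<[^>]+>", " ", r["body"]) for r in results]
```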


The Need for Data Synchronization

Data synchronization is an important issue in data warehousing. Data warehouses are designed to provide a consistent view of data emanating from several sources. The data from these sources must be combined, in a staging area outside of the warehouse, through a process of transformation and integration. Achieving this goal requires that the data drawn from the multiple sources be synchronized. Data synchronization can only be guaranteed by safeguarding the integrity of the data and ensuring that the timeliness of the data is preserved.

We propose that the warehouse transmit a request for external data to the search engine. Once the eventual results are obtained, they are left for the warehouse to gather as it completes a new refresh cycle. This process is not intended to provide an immediate response (as required by a Web surfer) but to lodge a request that will be answered at some convenient time in the future.

At their simplest, data originating externally may be synchronized by their date of retrieval as well as by system statistics such as last time of update, date of posting, and page expiration date. More sophisticated schemes could include verification of dates of origin, use of Internet archives, and automated analysis of content to establish dates. For example, online newspaper articles often carry the date of publication close to the headlines. An analysis of such a document's content would reveal a date field in a position adjacent to the heading for the story.
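As a small illustration of the simpler synchronization options just mentioned, the sketch below stamps an externally retrieved page with its retrieval date and, where possible, with a publication date found near the top of its content. The date patterns and the 2,000-character window are illustrative assumptions; a real implementation would also consult HTTP headers such as Last-Modified and Expires.

```python
import re
from datetime import datetime, timezone

# Illustrative publication-date patterns (e.g., "January 5, 2004" or "2004-01-05").
DATE_PATTERNS = [
    (r"(January|February|March|April|May|June|July|August|September|"
     r"October|November|December)\s+\d{1,2},\s+\d{4}", "%B %d, %Y"),
    (r"\d{4}-\d{2}-\d{2}", "%Y-%m-%d"),
]

def synchronize_external_page(text):
    """Attach retrieval and (if found) publication dates to an external page."""
    record = {
        "retrieved": datetime.now(timezone.utc),  # date of retrieval
        "published": None,
        "text": text,
    }
    # Look only near the top of the document, where newspaper articles
    # typically carry the publication date close to the headline.
    head = text[:2000]
    for pattern, fmt in DATE_PATTERNS:
        match = re.search(pattern, head)
        if match:
            record["published"] = datetime.strptime(match.group(0), fmt)
            break
    return record
```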
Analysis of the Model

The major strengths of our model are its:

Independence;
Value-added approach; and
Flexibility and security.

This model exhibits (a) logical independence (ensures that each component of the model retains its integrity); (b) physical independence (allows summarized results of external data to be transferred via network connections to any remote physical machines); and (c) administrative independence (administration of each of the three components in the model is a specialized activity; hence, the need for three separate administrators).

In terms of its value-added approach, the model maximizes the utility of acquired data by adding value to this data long after their day-to-day usefulness has expired. The ability to add pertinent external data to volumes of aging internal data extends the usefulness of the internal data. With careful selection of external data, organizations with data warehouses can reap disproportionate benefits over competitors who do not use their historical or external data to maximum effect.

Our model also allows for flexibility. By allowing development to take place in three languages (the query language SQL for the data warehouse, Perl for the meta-data engine, and Java for the search engine), the strengths of each can be utilized effectively. By allowing flexibility in the actual implementation of the model, we can increase efficiency and decrease development time.

Our model allows for security and integrity to be maintained. The warehouse need not expose itself to dangerous integrity questions or to increasingly malicious threats on the Internet. By isolating the module that interfaces with the Internet from the module that carries vital internal data, the prospects for good security are improved.

Our model also relieves information overload. By using the meta-data engine to manipulate external data, we are indirectly removing time-consuming filtering activities from users of the data warehouse.

FUTURE TRENDS

It is our intention to incorporate portals (Hall, 1999; Raab, 2000) and vortals (Chakrabarti et al., 2001; Rupley, 2000) in our continued research in data warehousing. A goal of our model is to provide a subject-oriented bias of the warehouse when satisfying queries involving the identification and extraction of external data from the Internet.

The following three potential problems are still to be addressed in our model:

1. It relies on fast-changing search engine technologies;
2. It introduces more system overhead and the need for more administrators;
3. It results in the storage of large volumes of irrelevant/worthless information.

Other issues to be addressed by our model are:

1. The need to investigate how the search engine component can make smarter choices of which links to follow;
2. The need to test the full prototype in a live data warehouse environment;
3. Providing a more detailed comparison between our model and the traditional external data approaches;
4. Determining how the traditional Online Transaction Processing (OLTP) systems will funnel data into this model.


We are confident that further research with our model will provide the techniques for handling the aforementioned issues. One possible approach is to incorporate languages such as WebSQL (Mihaila, 1996) and Squeal (Spertus, 1999) in our model.

CONCLUSION

The model presented in this paper is an attempt to bring the data warehouse and search engine into cooperation to satisfy the external data requirements for the warehouse. However, the new model anticipates the difficulty of simply merging two conflicting architectures. The approach we have taken is the introduction of a third intermediate layer called the meta-data engine. The role of this engine is to coordinate data flows to and from the respective environments. The ability of this layer to filter extraneous data retrieved from the search engine, as well as compose domain-specific queries, will determine its success.

In our model, the data warehouse is seen as the agent that initiates the search for external data. Once a search has been started, the meta-data engine component takes over and eventually returns results. The model provides for querying of general-purpose search engines, but a simple configuration allows for subject-specific engines (vortals) to be used in the future. When specific engines designed for specific niches become widely available, the model stands ready to accommodate this change.

We believe that WebSQL and Squeal can be used in our model to perform the structured queries that will facilitate speedy identification of relevant documents. In so doing, we think that these tools will allow for faster development and testing of a prototype and, ultimately, a full-fledged system.

REFERENCES

Barquin, R., & Edelstein, H. (Eds.). (1997). Planning and designing the data warehouse. Upper Saddle River, NJ: Prentice-Hall.

Bauer, A., Hümmer, W., Lehner, W., & Schlesinger, L. (2002). A decathlon in multidimensional modeling: Open issues and some solutions. DaWaK, 274-285.

Becker, S.A. (Ed.). (2002). Data warehousing and Web engineering. Hershey, PA: IRM Press.

Berson, A., & Smith, S.J. (1997). Data warehousing, data mining and OLAP. New York: McGraw-Hill.

Bischoff, J., & Yevich, R. (1996). The superstore: Building more than a data warehouse. Database Programming and Design, 9(9), 220-229.

Chakrabarti, S. (2002). Mining the Web: Analysis of hypertext and semi-structured data. New York: Morgan Kaufmann.

Chakrabarti, S., van den Berg, M., & Dom, B. (2001). Focused crawling: A new approach to topic-specific Web resource discovery. Retrieved January, 2004, from http://www8.org/w8-papers/5a-search-query/crawling/

Day, A. (2004). Data warehouses. American City and County, 119(1), 18.

Devlin, B. (1998). Meta-data: The warehouse atlas. DB2 Magazine, 3(1), 8-9.

Goodman, A. (2000). Searching for a better way. Retrieved July 25, 2002, from http://www.traffick.com

Hackathorn, R. (1995). Data warehousing energizes your enterprise. Datamation, 38-45.

Hackathorn, R. (1998). Web farming for the data warehouse. New York: Morgan Kaufmann.

Hall, C. (1999). Enterprise information portals: Hot air or hot technology [Electronic version]. Business Intelligence Advisor, 111(11).

Imhoff, C., Galemmo, N., & Geiger, J.G. (2003). Mastering data warehouse design: Relational and dimensional techniques. New York: John Wiley and Sons.

Inmon, W.H. (2002). Building the data warehouse. New York: John Wiley and Sons.

Inmon, W.H., Zachman, J.A., & Geiger, J.G. (1997). Data stores, data warehousing, and the Zachman framework: Managing enterprise knowledge. New York: McGraw-Hill.

Kimball, R. (1996). Dangerous preconceptions. Retrieved June 13, 2002, from http://pwp.starnetinc.com/larryg/index.html

Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modeling. New York: John Wiley and Sons.

Lujan-Mora, S., & Trujillo, J. (2003). A comprehensive method for data warehouse design. DMDW.

Marakas, G.M. (2003). Modern data warehousing, mining, and visualization: Core concepts. Upper Saddle River, NJ: Prentice-Hall College Division.


Mattison, R. (1996). Data warehousing: Strategies, technologies and techniques. New York: McGraw-Hill.

McElreath, J. (1995). Data warehouses: An architectural perspective. CA: Computer Sciences Corp.

Mihaila, G.A. (1996). WebSQL: An SQL-like query language for the World Wide Web [master's thesis]. University of Toronto.

Parsaye, K. (1996). Data mines for data warehouses. Database Programming and Design, 9(9).

Peralta, V., & Ruggia, R. (2003). Using design guidelines to improve data warehouse logical design. DMDW.

Pfaffenberger, B. (1996). Web search strategies. MIS Press.

Raab, D.M. (1999). Enterprise information portals [Electronic version]. Relationship Marketing Report.

Ray, E.J., Ray, D.S., & Seltzer, R. (1998). The Alta Vista search revolution. CA: Osborne/McGraw-Hill.

Rob, P., & Coronel, C. (2002). Database systems: Design, implementation, and management. New York: Thomson Learning.

Rupley, S. (2000). From portals to vortals. PC Magazine.

Sander-Beuermann, F., & Schomburg, M. (1998). Internet information retrieval: The further development of meta-search engine technology. Proceedings of the Internet Summit, Internet Society.

Schneider, M. (2003). Well-formed data warehouse structures. Design and management of data warehouses 2003, Proceedings of the 5th International Workshop DMDW2003, Berlin, Germany.

Soudah, T. (2000). Search, and you shall find. Retrieved July 17, 2002, from http://searchenginesguides.com

Spertus, E., & Stein, L.A. (1999). Squeal: A structured query language for the Web. Retrieved August 10, 2002, from http://www9.org/w9cdrom/222/222.html

Strehlo, K. (1996). Data warehousing: Avoid planned obsolescence. Datamation, 38-45.

Sullivan, D. (2000). Search engines review chart. Retrieved June 10, 2002, from http://searchenginewatch.com

Zghal, H.B., Faiz, S., & Ghezala, H.B. (2003). Casme: A case tool for spatial data marts design and generation. Design and management of data warehouses 2003, Proceedings of the 5th International Workshop DMDW2003, Berlin, Germany.

WEBSITES OF INTEREST

www.perl.com
www.cpan.com
www.xml.com
http://java.sun.com
http://semanticweb.org

KEY TERMS

Decision Support System (DSS): An interactive arrangement of computerized tools tailored to retrieve and display data regarding business problems and queries.

External Data: Data originating from other than the operational systems of a corporation.

Granular Data: Data representing the lowest level of detail that resides in the data warehouse.

Metadata: Data about data; in the data warehouse, it describes the contents of the data warehouse.

Operational Data: Data used to support the daily processing a company does.

Refresh Cycle: The frequency with which the data warehouse is updated (e.g., once a week).

Transformation: The conversion of incoming data into the desired form.
into the desired form.


Moral Foundations of Data Mining


Kenneth W. Goodman
University of Miami, USA

INTRODUCTION

It has become a commonplace observation that scientific progress often, if not usually, outstrips or precedes the ethical analyses and tools that society increasingly relies on and even demands. In the case of data mining and knowledge discovery in databases, such an observation would be mistaken. There are, in fact, a number of useful ethical precedents, strategies, and principles available to guide those who request, pay for, design, maintain, use, share, and sell databases used for mining and knowledge discovery. These conceptual tools and the need for them will vary as one is using a database to, say, analyze cosmological data, identify potential customers, or run a hospital. But these differences should not be allowed to mask the ability of applied ethics to provide practical guidance to those who work in an exciting and rapidly growing new field.

BACKGROUND

Data mining is itself a hybrid discipline, embodying aspects of computer science, artificial intelligence, cryptography, statistics, and logic. In greater or lesser degree, each of these disciplines has noted and addressed the ethical issues that arise in their practice. In statistics, for instance, leading professional organizations have ratified codes of ethics that address issues ranging from safeguarding privileged information and avoiding conflicts of interest or sponsor bias (International Statistical Institute, 1985) to the avoidance of any tendency to slant statistical work toward predetermined outcomes (American Statistical Association, 1999).

It is in computer ethics, however, that one finds the earliest, sustained, and most thoughtful literature (Bynum, 1985; Johnson & Snapper, 1985; Ermann, Williams & Gutierrez, 1990; Forester & Morrison, 1994; Johnson, 1994), in addition to ethics codes by professional societies (Association for Computing Machinery, 1992; IEEE, 1990). Traditional issues in computer ethics include privacy and workplace monitoring, hacking, intellectual property, and appropriate uses and users. The intersection of computing and medicine has also begun to attract interest (Goodman, 1998a).

What is clear about this landscape is that the terms we attach to issues (privacy, for instance) can mask significant differences depending on whether one uses a computer to keep track of warehouse stock or arrest records or real estate transactions or sexually transmitted diseases. Moreover, ethical issues take on somewhat different aspects depending on whether a computer and its data storage media are used by an individual, a business, a university, or a government. Atop this is the general purpose to which the machine is put: science, business, law enforcement, public health, or national security. This triad (content, user, and purpose) frames the space in which ethical issues arise.

MAIN THRUST

One can identify a suite of ethical issues that arise in data mining. All are tethered in one way or another to issues encountered in computing, statistics, and kindred fields. The question of whether data mining, or any discipline for that matter, presents unique or unprecedented issues is open to dispute. Issues between or among disciplines often vary by degree more than by kind. If it is learned or inferred from a database that Ms. Garcia prefers blue frocks, it might be the case that her privacy has been violated. But if it is learned or inferred that she has HIV, the stakes are altogether different. The ability to make ever-more-fine-grained inferences from very large databases increases the importance of ethics in data mining.

Privacy, Confidentiality, and Consent

It is common to distinguish between privacy and confidentiality by applying the former to humans' claim or desire to control access to themselves or information about them, and the latter, more narrowly, to specific units of that information. Privacy, if you will, is about people; confidentiality is about information. Privacy is broader, and it includes interest in information protection and control.

An important correlate of privacy is consent. One cannot control information without being asked for permission, or at least informed of information use. A business that is creating a customer database might collect data surreptitiously, arguably infringing on privacy.


Or it might publicly (though not necessarily individually) disclose the collection. Such disclosures are often key components of privacy policies. Privacy policies sometimes seek permission in advance to obtain and archive personal data or, more frequently, disclose that data are being collected and then provide a mechanism for individuals to opt out. The question whether such opt-out policies are adequate to give individuals opportunities to control use of their data is subject to widespread debate. In another context, a government will collect data for vital statistics or public health databases. Such uses, at least in democratic societies, may be justified on grounds of the implied consent of those to whom the information applies and who would benefit from its collection.

It is not clear how much or what kind of consent would be necessary to provide ethical warrant for data mining of personal information. The problem of adequate consent is complicated by what may be hypothesized to be widespread ignorance about data mining and its capabilities. As elsewhere, some solutions to this ethical problem might be identified or clarified by empirical research related to public understanding of data-mining technology, individuals' preference for (levels of) control over use of their information, and similar considerations. The U.S. Department of Health and Human Services has, for instance, supported research on the process of informed consent for biomedical research. Data mining warrants a kindred research program. Among the issues to be clarified by such research are the following:

To what extent do individuals want to control access to their information?
What are the differences between consent to acquire data for one purpose and consent for secondary uses of that data?
How do individual preferences or inclinations to consent vary along with the data miner? (That is, it may be hypothesized that some or many individuals will be sanguine about data mining by trusted public health authorities, but opposed to data mining by [certain] business entities or governments.)

It should be noted that many information exchanges, especially including those generally involving the most sensitive or personal data, are at least partly governed by professionalism standards. Thus, doctor-patient and lawyer- or accountant-client relationships traditionally, if not legally, impose high standards for the protection of information acquired during the course of a professional relationship.

Appropriate Uses and Users of Data Mining Technology

It should be uncontroversial to point out that not all data mining or knowledge discovery is done by appropriate users, and not all uses enjoy equal moral warrant. A data-mining police state may not be said to operate with the same moral traction as a government public health service in a democracy. (We may one day need to inquire whether use of data-mining technology by a government is itself grounds for identifying it as repressive.) Similarly, given two businesses (insurance companies, say), it is straightforward to report that the one using data-mining technology to identify trends in accidents to better offer advice about preventing accidents is on firm moral footing, as opposed to one that identifies trends in accidents to discriminate against minorities.

One way to carve the world of data mining is at the public/private joint. Public uses are generally by governments or their proxies, which can include universities and corporate contractors, and can employ data from private sources (such as credit card information). Public data mining can, at least in principle, claim to be in the service of some collective good. The validity of such a claim must be assessed and then weighed against damage or threats to other public goods or values.

In the United States, the General Accounting Office, a research and investigative branch of Congress, identified 199 federal data-mining projects and found that of these, 54 mined private sector data, with 36 involving personal information. There were 77 projects using data from other federal agencies, and, of these, 46 involve personal information from the private sector. The personal information, apparently used in these projects without explicit consent, is said to include student loan application data, bank account numbers, credit card information, and taxpayer identification numbers. The projects served a number of purposes, the top six of which are given as improving service or performance (65 projects), detecting fraud, waste and abuse (24), analyzing scientific and research information (23), managing human resources (17), detecting criminal activities or patterns (15), and analyzing intelligence and detecting terrorist activities (14) (General Accounting Office, 2004).

Is losing confidentiality in credit card transactions a fair exchange for improved government service? Research? National security? These questions are the focus of sustained debate.

In the private sphere, data miners enjoy fewer opportunities to claim that their work will result in collective benefit.


The strongest warrant for private or for-profit data mining is that corporations have fiduciary duties to shareholders and investors, and these duties can only or are best discharged by using data-mining technology. In many jurisdictions, the use of personal information, including health and credit data, is governed by laws that include requirements for privacy policies and notices. Because these often include opt-out provisions, it is clear that individuals must, as a matter of strategy if not moral self-defense, take some responsibility for safeguarding their own information.

A further consideration is that for the foreseeable future, most people will continue to have no idea of what data mining is or what it is capable of. For this reason, the standard bar for consent to use data must be set very high indeed. It ought to include definitions of data mining and must explicitly disclose that the technology is able to infer or discover things about individuals that are not obvious from the kinds of data that are willingly shared: that databases can be linked; that data can be inaccurate; that decisions based on this technology can affect credit eligibility and employment; and so on (O'Leary, 1995). Indeed, the very point of data mining is the discovery of trends, patterns, and (putative) facts that are not obvious or knowable on the surface. This must become a central feature of privacy notice disclosure.

Further, all data-mining entities ought to develop or adopt and then adhere to policies and guidelines governing their practice (O'Leary, 1995).

Error and Uncertainty

Data analysis is probabilistic. Very large datasets are buggy and contain inaccuracies. It follows that patterns, trends, and other inferences based on algorithmic analysis of databases are themselves open to challenge on grounds of accuracy and reliability. Moreover, when databases are proprietary, there are disincentives to operate with standards for openness and transparency that are morally required and increasingly demanded in government, business, and science. What emerges is an imperative to identify and follow best practices for data acquisition, transmission, and analysis. Although some errors are forgivable, those caused by sloppiness, low standards, haste, or inattention can be blameworthy. When the stakes are high, errors can cause extensive harm. It would indeed be a shame to solve problems of privacy, consent, and appropriate use only to fail to do the job one set out to do in the first place.

Perhaps the most interesting kind of data-mining error is that which identifies or relies on illusory patterns or flawed inferences. A business data-mining operation that errs in predicting the blue frock market may disappoint, but an error by data miners for a manned space program, biomedical research project, public health surveillance system, or national security agency could be catastrophic. This means that good ethics requires good science. That is, it is not adequate to collect and analyze data for worthy purposes; one must also seek and stand by standards for doing so well.

One feature of data mining that makes this exceptionally difficult is its very status as a science. It is customary in most empirical inquiries to frame and then test hypotheses. Although philosophers of science disagree about the best or most productive methods for conducting such inquiries, there is broad agreement that experiments and tests generally offer more rigor and produce more reliable results and insights than (mere) observation of patterns between or among variables. This is a well-known problem in domains in which experiments are impossible or unethical. In epidemiology and public health, for instance, it is often not possible to test a hypothesis by using the experimental tools of control groups, double blinding, and placebos (one could not intentionally attempt to give people a disease by different means in order to identify the most dangerous routes of transmission). For this reason, epidemiologists and public health scientists are usually mindful of the fact that their studies might fail to reveal a causal relation, but rather, no more than a statistical correlation.

Put differently, "when there is no hypothesis to test, it is not possible to know what will be found until it is discovered" (Fule & Roddick, 2004, p. 160).

There is a precedent for addressing this kind of challenge. In meta-analysis, or the concatenation and reanalysis of others' results, statisticians endured withering criticism of questionable methods, bias introduced by initial data selection, and over-broad conclusions. They have responded by refining their techniques and improving their processes. Meta-analysis remains imperfect, but scientists in numerous disciplines (perhaps most noteworthily in the biomedical sciences, where in some instances meta-analysis has altered standards of patient care) have come to rely on it (Goodman, 1998b).

FUTURE TRENDS

Data mining has, in a short time, become ubiquitous. Like many hybrid sciences, it was never clear who owned it and therefore who should assume responsibility for its practice. Attention to data mining has increased in part because of concern about its use by governments and the extent to which such use will infringe on civil liberties. This has, in turn, led to renewed scrutiny, a positive development.


It must, however, be emphasized that while issues related to privacy and confidentiality are of fundamental concern, they are by no means the only ethical issues raised by data mining and, indeed, attention to them at the expense of the other issues identified here would be a dangerous oversight.

The task of identifying ethical issues and proposing solutions, approaches, and best (ethical) practices itself requires additional research. It is even possible to evaluate data mining's ethical sensitivity, that is, the extent to which rule generation itself takes ethical issues into account (Fule & Roddick, 2004).

A broad-based research program will use the content, user, purpose triad to frame and test hypotheses about the best ways to protect widely shared values; it might be domain specific, emphasizing those areas that raise the most interesting issues: business and economics, public health (including bioterrorism preparedness), scientific (including biomedical) research, government operations (including national security), and so on. Some issues/domains raise keenly important issues and concerns. These include genetics and national security surveillance.

Genetics and Genomics

The completion of the Human Genome Project in 2001 is correctly seen as a scientific watershed. Biology and medicine will never be the same. Although it has been recognized for some time that the genome sciences are dependent on information technology, the ethical issues and challenges raised by this symbiosis have only been sketched (Goodman, 1996; Goodman, 1999). The fact that vast amounts of human and other genetic information can be digitized and analyzed for clinical, research, and public health purposes has led to the creation of a variety of databases and, consequently, establishment of the era of data mining in genomics. It could not be otherwise: "[T]he amount of new information is so great that data mining techniques are essential in order to obtain knowledge from the experimental data" (Larrañaga, Menasalvas, & Peña, 2004).

The ethical issues that arise at the intersection of genetics and data mining should be anticipated: privacy and confidentiality, appropriate uses and users, and accuracy and uncertainty. Each of these embodies special features. Confidentiality of genetic information applies not only to the source of genetic material but also, in one degree or another, to his or her relatives. The concepts of appropriate use and user are strained given extensive collaborations between and among corporate, academic, and government entities. And challenges of accuracy and uncertainty are magnified in case the results of data mining are applied in medical practice.

National Security Surveillance

Although it would be blameworthy for vulnerable societies to fail to use all appropriate tools at their disposal to prevent terrorist attacks, data-mining technology poses new and interesting challenges to the concept of an "appropriate tool." There are concerns at the outset that data acquisition for whatever purpose may be inappropriate. This is true whether police, military, or other government officials are collecting information on notecards or in huge, machine-tractable databases. Knowledge discovery ups the ante considerably.

The privacy rights of citizens in a democracy are not customarily bartered for other benefits, such as security. The rights are what constitute the democracy in the first place. On the other hand, rights are not absolute, and there might be occasions on which morality permits or even requires infringements. Moreover, when a database is created, appropriate use and error/uncertainty become linked: appropriate use includes (at least tacitly) the notion that such use will achieve its specified ends. To the degree that error propagation or uncertainty can impeach data-mining results, the use itself is correspondingly impeached.

CONCLUSION

The intersection of computing and ethics has been a fertile ground for scholars and policymakers. The rapid growth and extraordinary power of data mining and knowledge discovery in databases bids fair to be a source of new and interesting ethical challenges. Precedents in kindred domains, such as health informatics, offer a platform on which to begin to prepare the kind of thoughtful and useful conceptual tools that will be required to meet those challenges. By identifying the key nodes of content, user, and purpose, and the leading ethical issues of privacy/confidentiality, appropriate use(r), and error/uncertainty, we can enjoy some optimism that the future of data mining will be guided by thoughtful rules and policies and not the narrower (but not invalid) interests of entrepreneurs, police, or scientists who hesitate or fail to weigh the goals of their inquiries against the needs and expectations of the societies that support their work.

REFERENCES


American Statistical Association. (1999). Ethical guidelines for statistical practice. Retrieved from http://www.amstat.org/profession/index.cfm?fuseaction=ethicalstatistics

Association for Computing Machinery. (1992). Code of ethics and professional conduct. Retrieved from http://www.acm.org/constitution/code.html

Bynum, T. W. (Ed.). (1985). Computers and ethics. New York: Blackwell.

Ermann, M. D., Williams, M. B., & Gutierrez, C. (Eds.). (1990). Computers, ethics, and society. New York: Oxford University Press.

Forester, T., & Morrison, P. (1994). Computer ethics: Cautionary tales and ethical dilemmas in computing (2nd ed.). Cambridge, MA: MIT Press.

Fule, P., & Roddick, J. F. (2004). Detecting privacy and ethical sensitivity in data mining results. In V. Estivill-Castro (Ed.), Proceedings of the 27th Australasian Computer Science Conference (pp. 159-166).

General Accounting Office. (2004). Data mining: Federal efforts cover a wide range of uses (Rep. No. GAO-04-548). Washington, DC: U.S. General Accounting Office. Retrieved from http://www.gao.gov/new.items/d04548.pdf

Goodman, K. W. (1996). Ethics, genomics, and information retrieval. Computers in Biology and Medicine, 26, 223-229.

Goodman, K. W. (Ed.). (1998a). Ethics, computing and medicine: Informatics and the transformation of health care. Cambridge, MA: Cambridge University Press.

Goodman, K. W. (1998b). Meta-analysis: Conceptual, ethical and policy issues. In K. W. Goodman (Ed.), Ethics, computing and medicine: Informatics and the transformation of health care (pp. 139-167). Cambridge, MA: Cambridge University Press.

Goodman, K. W. (1999). Bioinformatics: Challenges revisited. MD Computing, 16(3), 17-20.

IEEE. (1990). IEEE code of ethics. Retrieved from http://www.ieee.org/portal/index.jsp?pageID=corp_level1&path=about/whatis&file=code.xml&xsl=generic.xsl

International Statistical Institute. (1985). Declaration on professional ethics. Retrieved from http://www.cbs.nl/isi/ethics.htm

Johnson, D. O. (1994). Computer ethics. Englewood Cliffs, NJ: Prentice Hall.

Johnson, D. O., & Snapper, J. W. (Eds.). (1985). Ethical issues in the use of computers. Belmont, CA: Wadsworth.

Larrañaga, P., Menasalvas, E., & Peña, J. M. (2004). Data mining in genomics and proteomics. Artificial Intelligence in Medicine, 31, iii-iv.

O'Leary, D. E. (1995, April). Some privacy issues in knowledge discovery: The OECD personal privacy guidelines. IEEE Expert, 48-52.

KEY TERMS

Applied Ethics: The branch of ethics that emphasizes not theories of morality but ways of analyzing and resolving issues and conflicts in daily life, the professions and public affairs.

Codes of Ethics: More or less detailed sets of principles, standards, and rules aimed at guiding the behavior of groups, usually of professionals in business, government, and the sciences.

Confidentiality: The claim, right, or desire that personal information about individuals should be kept secret or not disclosed without permission or informed consent.

Ethics: The branch of philosophy that concerns itself with the study of morality.

Informed (or Valid) Consent: The process by which people make decisions based on adequate information, voluntariness, and capacity to understand and appreciate the consequences of whatever is being consented to.

Morality: A common or collective set of rules, standards, and principles to guide behavior and by which individuals and society can gauge an action as being blameworthy or praiseworthy.

Privacy: Humans' claim or desire to control access to themselves or information about them; privacy is broader than confidentiality, and it includes interest in information protection and control.

Mosaic-Based Relevance Feedback for Image Retrieval

Odej Kao
University of Paderborn, Germany

Ingo la Tendresse
Technical University of Clausthal, Germany
INTRODUCTION

A standard approach for content-based image retrieval (CBIR) is based on the extraction and comparison of features usually related to dominant colours, shapes, textures and layout (Del Bimbo, 1999). These features are defined a priori and extracted when the image is inserted into the database. At query time the user submits a similar sample image (query-by-sample-image) or draws a sketch (query-by-sketch) of the sought archived image. The similarity degree of the current query image and the target images is determined by calculation of a multidimensional distance between the corresponding features. The computed similarity values allow the creation of an image ranking, where the first k images, usually k=32 or k=64, are considered retrieval hits. These are chained in a list called a ranking and then presented to the user. Each of these images can be used as a starting point for a refined search in order to improve the obtained results.

The assessment of the retrieval result is based on a subjective evaluation of whole images and their position in the ranking. An important disadvantage of retrieval with content-based features and the presentation of the resulting images as a ranking is that the user is usually not aware why certain images are shown on the top positions and why certain images are ranked low or not presented at all. Furthermore, users are also interested in which sketch properties are decisive for the consideration or rejection of the images, respectively. In case of primitive features like colour these questions can often be answered intuitively. Retrieval with complex features considering, for example, texture and layout creates rankings where the similarity between the query and the target images is not always obvious. Thus, the user is not satisfied with the displayed results and would like to improve the query, but it is not clear to him/her which parts of the querying sketch or the sample image should be modified and improved according to the desired targets. Therefore, a suitable feedback mechanism is necessary.


BACKGROUND

Relevance feedback techniques are often used in image databases in order to gain additional information about the sought image set (Rui, Huang, & Mehrotra, 1998). The user evaluates the retrieval results and selects, for example, positive and negative instances; thus, in the subsequent retrieval steps the search parameters are optimised and the corresponding images/features are supplied with higher weights (Müller & Pun, 2000; Rui & Huang, 2000). Moreover, additional query images can be considered and allow a more detailed specification of the target image (Baudisch, 2001). However, the more complex the learning model is, the more difficult are the analysis and the evaluation of the retrieval results. Users, in particular those with limited expertise in image processing and retrieval, are not able to detect misleading areas in the image/sketch with respect to the applied retrieval algorithms and to modify the sample image/sketch appropriately. Consequently, an iterative search is often reduced to a random process of parameter optimisation.

The consideration of the existing user knowledge is the main objective of many feedback techniques. In case of user profiling (Cox, Miller, Minka, Papathomas, & Yianilos, 2000) the previously completed retrieval sessions are analysed in order to obtain additional information about users' preferences. Furthermore, the selection actions during the current session are monitored and images similar to those are given higher weights. A hierarchical approach named multi-feedback ranking separates the querying image in several regions and allows a more detailed search (Mirmehdi & Periasamy, 2001).

Techniques for permanent feedback guide the user through the entire retrieval process. An example of this approach is implemented by the system Image Retro: based on a selected sample image, a number of images are removed from the possible image set. By analysing these images, the user develops an intuition about promising starting points (Vendrig, Worring, & Smeulders, 2001). In case of the fast feedback, the user receives the

current ranking after every modification of the query sketch. The correspondence between the presented ranking and the last modification helps the user to identify sketch improvements leading in the desired retrieval directions and to undo inappropriate sketch drawings immediately (Veltkamp, Tanase, & Sent, 2001).


FEEDBACK METHOD FOR QUERY-BY-SKETCH

This section describes a feedback method which helps the user to improve the query sketch and to receive the desired results. Subsequently, the query sketch is separated into regions and each area is compared with the corresponding subsection of all images in the database. The most similar sections are grouped in a mosaic; thus the user can detect well-defined and misleading areas of his/her query sketch quickly. The creation of a mosaic consisting of k areas with the most similar sections is shown in Figure 1.

Figure 1. Compilation of a mosaic-based ranking

The corresponding pseudo-code algorithm for mosaic-based retrieval and ranking presentation is given in Algorithm 1. The criterion cm for the similarity computation of the corresponding sections can be identical with the main retrieval criterion (cm = c) or be adapted to specific applications, images or users (cm ≠ c). The size of the mosaic sections can also be selected by the user. Here grids consisting of 16x16 and 32x32 pixel blocks are evaluated.

Algorithm 1. Pseudo code of the mosaic build-up

procedure mosaic(sketch s, ranking R, criterion cm, grid g)
begin
  divide s using a uniform grid g into k fields s1 ... sk
  initialise similarity[1 ... k] = 0, image[1 ... k] = none
  for each image i in R
  begin
    scale i to i* with dim[i*] = dim[s]
    divide i* using grid g into k fields i1 ... ik
    for j = 1 ... k do
      if similarity[j] < cm(sj, ij)
      then similarity[j] = cm(sj, ij), image[j] = ij
    od
  end
  mosaic M = union of image[j], j = 1 ... k
end
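For readers who prefer running code, a small Python sketch of the same build-up is given below. It is an illustration of Algorithm 1 rather than the authors' implementation: the helper divide_into_sections, the toy similarity criterion cm, and the assumption that database images have already been scaled to the sketch size are all assumptions made here.

# Illustrative sketch of the mosaic build-up (Algorithm 1). Assumes a
# caller-supplied section similarity cm(a, b) -> float and images already
# scaled to the sketch dimensions; numpy arrays stand in for image data.
import numpy as np

def divide_into_sections(img, block):
    """Split an image array into non-overlapping block x block sections."""
    h, w = img.shape[:2]
    return [img[r:r + block, c:c + block]
            for r in range(0, h - block + 1, block)
            for c in range(0, w - block + 1, block)]

def build_mosaic(sketch, images, cm, block=16):
    """For each sketch section, keep the most similar section found in any
    database image, together with its similarity score."""
    sketch_secs = divide_into_sections(sketch, block)
    best_score = [float("-inf")] * len(sketch_secs)
    best_sec = [None] * len(sketch_secs)
    for img in images:                      # each (pre-scaled) database image
        img_secs = divide_into_sections(img, block)
        for j, (s_j, i_j) in enumerate(zip(sketch_secs, img_secs)):
            score = cm(s_j, i_j)
            if score > best_score[j]:       # keep the best section hit so far
                best_score[j], best_sec[j] = score, i_j
    return best_sec, best_score

# Toy similarity criterion: negative mean grey-level difference (illustrative).
cm = lambda a, b: -abs(float(a.mean()) - float(b.mean()))

In practice cm would be replaced by the feature-based similarity used for the main retrieval, exactly as the pseudo-code allows.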

Figure 2. An example for a mosaic feedback

An example of the resulting mosaics is found in Figure 2. On the left-hand side the sought image is shown, which was approximated by the user-defined sketch in the middle. On the right-hand side the resulting 16x16 mosaic containing the best section hits of all images in the database is presented. A manual post-processing of the mosaic clarifies the significance of individual mosaic blocks: those sections which are parts of the sought image are framed with a white box. Furthermore, sections from images of the same category (in this case other pictures with animals) are marked with white circles.

An extension of the fixed grids is realised by adaptive mosaics, where neighbouring sections are merged into larger, homogeneous areas depending on a given similarity threshold; thus the user can evaluate individual sections more easily. Figure 3 shows a sample image and the corresponding mosaic-based ranking, which is gained using an adaptive grid.

Figure 3. Mosaic based on an adaptive grid

In these manually post-processed cases additional information is provided to the user: which sketch regions are already sufficiently prepared (boxes) and which regions have the right direction but still need minor corrections (circles). Finally, all other regions do not satisfy the similarity criteria and have to be re-drawn. With this information the user can focus on the most promising areas and modify these in a suitable manner.

For the usage of this feedback method in real-world applications it is necessary to evaluate whether typical users are able to detect suitable and misleading regions intuitively and thus to improve the original sketch.
For the performance evaluation of the developed mosaic technique and for measuring the retrieval quality we created a test image database with pictures of various categories and selected a wavelet-based feature for retrieval (Kao & Joubert, 2001). Wavelet-based features have in common that the semantic interpretation of the extracted coefficients, and thus of the produced rankings, is rather difficult, in particular for users without significant background in image processing. Nevertheless, such features perform well on general image sets and are applied in a variety of systems (Jacobs, Finkelstein, & Salesin, 1995). The application of the mosaic-based feedback could improve the usability of these powerful features by helping the user to find and to modify the decisive image regions.

The experiments considered fine-grained mosaics with a 16x16 grid, a 32x32 grid, and adaptive grids. Simultaneously, a traditional ranking for the current query was created, which contained the best k images for the given sketch and the selected similarity criterion. Both rankings were subsequently distributed to 16 users, who analysed them and, based on the derived knowledge, improved the sketch appropriately.

The obtained results were evaluated manually in order to resolve the following basic questions related to possible applications of the mosaic feedback:

• Can a mosaic be used as a sole presentation form for image ranking and thus replace the traditional list/table-based ranking?
• Are users able to recognise which parts of the mosaic represent sections of relevant images, and which parts should be modified?
• Can we expect a significant improvement of the sketch quality if the relevant sections are correctly marked/recognised?

A query of an image database is considered successful if the sought image is returned within the first k retrieval hits. Analogously, in case of mosaic-based result presentation, image retrieval is evaluated as successful if at least one of the mosaic blocks is a part of the sought image. An analysis of the produced rankings shows that the recall rate improved, dependent on the used colour space, by up to 32.75% (la Tendresse & Kao, 2003; la Tendresse, Kao, & Skubowius, 2002). However, it is not clear whether the user will be able to establish a relation between the presented section and the image he/she is looking for. Therefore, we asked 16 test persons to study a number of images and the corresponding mosaics. The similarity of the images to the given sketch was not obvious, as the images were found during the mosaic-based retrieval, but not in case of the traditional retrieval. The test persons had 180 seconds to mark all mosaic sections which in their opinion match an original image section. They found approximately 65% of the relevant sections per mosaic and needed on average 63 seconds of study time.

In order to evaluate the mosaic as a feedback technique we selected 26 sketches which did not lead to the desired images during the traditional and mosaic-based retrieval and asked 16 test persons to improve the quality of the sketches. Eight test persons received the target image, the old sketch as well as the corresponding mosaic and had 300 seconds to correct the colours, shapes, layout, etc. The other eight persons had to improve the sketch without the mosaic feedback. All modified sketches were submitted to the CBIR system and produced the following results: 46.59% of all sketches changed according to the mosaic feedback led to a successful traditional retrieval, i.e. the sought image was returned within the 16 most similar images. In contrast, only 34.24% of the sketches modified without mosaic feedback enabled a successful retrieval. If a mosaic-based ranking was presented, then the recall rate increased up to 68.75% (feedback allowed) and up to 54.35% otherwise. Moreover, the desired image is placed on average two positions above the ranking position prior to modification.


FUTURE TRENDS

The introduction of novel and powerful methods for relevance feedback in image, video and audio databases is an emerging and fast on-going process. Improvements related to integration of semantic information and combination of multiple attributes in the query (Heesch & Rüger, 2003; Santini, Gupta, & Jain, 2001), automatic detection/modification of misleading areas, 3D result visualisation techniques beyond the simple ranking list (Cinque, Levialdi, Malizia, & Olsen, 1998), queries considering multiple instances (Santini & Jain, 2000) and many more are investigated in a number of research groups.

The presented mosaic-based feedback technique eases the modification process by showing significant sketch regions to the user. A re-definition of the entire sketch is however time-consuming and usually avoided by the users. Instead, minor modifications are applied to the sketch and another query run is started. By taking this into account, the feedback process and, more decisively, the next ranking can be improved if the user is guided to the modification of those regions which exhibit the largest chances of success. Thus, the autonomous detection of misleading sketch sections and system-based suggestions for modification are most desirable in the short-term development focus. For this purpose the convergence behaviour of individual sketch sections towards the desired target is analysed. All results

can be combined in a so-called convergence map, which is subsequently processed by a number of rules. Based on the output of the analysis process, several actions can be proposed. For example, if the given sketch is completely on the wrong route, the user is asked to re-draw the sketch entirely. If only parts of the sketch are misleading, the user should be asked to pay more attention to those. Finally, some of the necessary modifications can also be performed or proposed automatically, for example if intuitive features such as colours are used for the retrieval. In this case the user is advised to use more red or green colour in some sections or to change the average illumination of the sketch.

An additional visual help for the user during creation of the query sketch is given by a permanent evaluation of the archived images and presentation of the most promising images in a three-dimensional space. The current sketch builds the centre of this 3D space and the archived images are arranged according to their similarity distance around the sketch. The nearest images are the most similar images; thus the user can copy some aspects of those images in order to move the retrieval in the desired direction. After each new line the distance to all images is re-computed and the images in the 3D space are re-arranged. Thus, the user can see the effects of the performed modification immediately and proceed in the same direction or remove the latest modifications and try another path.


CONCLUSIONS

This article described a mosaic-based feedback method which helps image database users to recognise and to modify misleading areas in the querying sketch and to improve the retrieval results. Performance measurements showed that the proposed method eases the querying process and allows the users to modify their sketches in a suitable manner.


REFERENCES

Baudisch, P. (2001). Dynamic information filtering. PhD thesis, GMD Research Series 2001, No. 16. GMD Forschungszentrum.

Cinque, L., Levialdi, S., Malizia, A., & Olsen, K. A. (1998). A multi-dimensional image browser. Journal of Visual Languages and Computing, 9(1), 103-117.

Cox, I. J., Miller, M. L., Minka, T. P., Papathomas, T., & Yianilos, P. N. (2000). The Bayesian image retrieval system, PicHunter: Theory, implementation and psychophysical experiments. IEEE Transactions on Image Processing, 9(1), 20-37.

Del Bimbo, A. (1999). Visual information retrieval. Morgan Kaufmann Publishers.

Heesch, D., & Rüger, S. M. (2003). Relevance feedback for content-based image retrieval: What can three mouse clicks achieve? In Advances in Information Retrieval, LNCS 2633 (pp. 363-376).

Jacobs, C. E., Finkelstein, A., & Salesin, D. H. (1995). Fast multiresolution image querying. In Proceedings of ACM Siggraph (pp. 277-286).

Kao, O., & Joubert, G. R. (2001). Efficient dynamic image retrieval using the à trous wavelet transformation. In Advances in Multimedia Information Processing, LNCS 2195 (pp. 343-350).

La Tendresse, I., & Kao, O. (2003). Mosaic-based sketching interface for image databases. Journal of Visual Languages and Computing, 14(3), 275-293.

La Tendresse, I., Kao, O., & Skubowius, M. (2002). Mosaic feedback for sketch training and retrieval improvement. In Proceedings of the IEEE Conference on Multimedia and Expo (ICME 2002) (pp. 437-440).

Mirmehdi, M., & Periasamy, R. (2001). CBIR with perceptual region features. In Proceedings of the 12th British Machine Vision Conference (pp. 51-520).

Müller, H., Müller, W., Squire, D., Marchand-Maillet, S., & Pun, T. (2000). Learning feature weights from user behavior in content-based image retrieval. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Workshop on Multimedia Data Mining, MDM/KDD2000).

Rui, Y., & Huang, T. S. (2000). Optimizing learning in image retrieval. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (pp. 236-245).

Rui, Y., Huang, T. S., & Mehrotra, S. (1998). Relevance feedback techniques in interactive content-based image retrieval. In Proceedings of SPIE 3312 (pp. 25-36).

Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.

Santini, S., & Jain, R. (2000). Integrated browsing and querying for image databases. IEEE MultiMedia, 7(3), 26-39.

Veltkamp, R. C., Tanase, M., & Sent, D. (2001). Features in content-based image retrieval systems: A survey. Kluwer Academic Publishers.

Vendrig, J., Worring, M., & Smeulders, A. W. M. (2001). Filter image browsing: Interactive image retrieval by using database overviews. Multimedia Tools and Applications, 15(1), 83-103.


KEY TERMS

Content-Based Image Retrieval: Search for suitable images in a database by comparing extracted features related to color, shape, layout and other specific image characteristics.

Feature Vector: Data that describes the content of the corresponding image. The elements of the feature vector represent the extracted descriptive information with respect to the utilised analysis.

Query By Sketch, Query By Painting: A widespread query type where the user is not able to present a similar sample image. Therefore the user sketches the looked-for image with a few drawing tools. It is not necessary to do this correctly in all aspects.

Query-By-Pictorial-Example: The query is formulated by using a user-provided example image for the desired retrieval. Both query and stored image are analysed in the same way.

Ranking: List of the most similar images extracted from the database according to the querying sketch/image or other user-defined criteria. The ranking displays the retrieval results to the user.

Relevance Feedback: The user evaluates the quality of the individual items in the ranking based on the subjective expectations/specification of the sought item. Subsequently, the user supplies the system with the evaluation results (positive/negative instances, weights, ...) and re-starts the query. The system considers the user knowledge while computing the similarity.

Similarity: Correspondence of two images. The similarity is determined by comparing the extracted feature vectors, for example by a metric or distance function.

Multimodal Analysis in Multimedia Using Symbolic Kernels

Hrishikesh B. Aradhye
SRI International, USA

Chitra Dorai
IBM T. J. Watson Research Center, USA


INTRODUCTION

The rapid adoption of broadband communications technology, coupled with ever-increasing capacity-to-price ratios for data storage, has made multimedia information increasingly more pervasive and accessible for consumers. As a result, the sheer volume of multimedia data available has exploded on the Internet in the past decade in the form of Web casts, broadcast programs, and streaming audio and video. However, indexing, search, and retrieval of this multimedia data is still dependent on manual, text-based tagging (e.g., in the form of a file name of a video clip). Manual tagging of media content is often bedeviled by an inadequate choice of keywords, incomplete and inconsistent terms used, and the subjective biases of the annotator introduced in his or her descriptions of content, adversely affecting accuracy in the search and retrieval phase. Moreover, manual annotation is extremely time-consuming, expensive, and unscalable in the face of ever-growing digital video collections. Therefore, as multimedia get richer in content, become more complex in format and resolution, and grow in volume, the urgency of developing automated content analysis tools for indexing and retrieval of multimedia becomes easily apparent.

Recent research towards content annotation, structuring, and search of digital media has led to a large collection of low-level feature extractors, such as face detectors and recognizers, videotext extractors, speech and speaker identifiers, people/vehicle trackers, and event locators. Such analyses are increasingly processing both visual and aural elements to result in large sets of multimodal features. For example, the results of these multimedia feature extractors can be

• Real-valued, such as shot motion magnitude, audio signal energy, trajectories of tracked entities, and scene tempo
• Discrete or integer-valued, such as the number of faces detected in a video frame and existence of a scene boundary (yes/no)
• Ordinal, such as shot rhythm, which exhibits partial neighborhood properties (e.g., metric, accelerated, decelerated)
• Nominal, such as identity of a recognized face in a frame and text recognized from a superimposed caption¹

Multimedia metadata based on such a multimodal collection of features pose significant difficulties to subsequent tasks such as classification, clustering, visualization, and dimensionality reduction, all of which traditionally deal with only continuous-valued data. Common data-mining algorithms employed for these tasks, such as Neural Networks and Principal Component Analysis (PCA), often assume a Euclidean distance metric, which is appropriate only for real-valued data. In the past, these algorithms could be applied to symbolic domains only after representing the symbolic labels as integers or real values, or after a feature space transformation to map each symbolic feature as multiple binary features. These data transformations are artificial. Moreover, the original feature space may not reflect the continuity and neighborhood imposed by the integer/real representation.

This paper discusses mechanisms that extend tasks traditionally limited to continuous-valued feature spaces, such as (a) dimensionality reduction, (b) de-noising, (c) visualization, and (d) clustering, to multimodal multimedia domains with symbolic and continuous-valued features. To this end, we present four kernel functions based on well-known distance metrics that are applicable to each of the four feature types. These functions effectively define a linear or nonlinear dot product of real or symbolic feature vectors and therefore fit within the generic framework of kernel space machines. The framework of kernel functions and kernel space machines provides classification techniques that are less susceptible to overfitting when compared with several data-driven learning-based classifiers. We illustrate the usefulness of such symbolic kernels within the context of Kernel PCA and Support Vector Machines (SVMs), particularly in temporal clustering and tracking of videotext in multimedia. We show that such

analyses help capture information from symbolic feature spaces, visualize symbolic data, and aid tasks such as classification and clustering, and therefore are eminently useful in multimodal analysis of multimedia.


BACKGROUND

Early approaches to multimedia content analysis dealt with multimodal feature data in two primary ways. Either a learning technique such as Neural Nets was used to find patterns in the multimodal data after mapping symbolic values into integers, or the multimodal features were segregated into different groups according to their modes of origin (e.g., into audio and video features), processed separately, and the results from the separate processes were merged by using some probabilistic mechanism or evidence combination method. The first set of methods implicitly assumed the Euclidean distance as an underlying metric between feature vectors. Although this may be appropriate for real-valued data, it imposes a neighborhood property on symbolic data that is artificial and is often inappropriate. The second set of methods essentially dealt with each category of multimodal data separately and fused the results. They were thus incapable of leading to novel patterns that can arise if the data were treated together as a whole. As audiovisual collections of today provide multimodal information, they need to be examined and interpreted together, not separately, to make sense of the composite message (Bradley, Fayyad, & Mangasarian, 1998).

Recent advances in machine-learning and data analysis techniques, however, have enabled more sophisticated means of data analyses. Several researchers have attempted to generalize the existing PCA-based framework. For instance, Tipping (1999) presented a probabilistic latent-variable framework for data visualization of binary and discrete data types. Collins and co-workers (Collins, Dasgupta, & Schapire, 2001) generalized the basic PCA framework, which inherently assumes Gaussian features and noise, to other members of the exponential family of functions. In addition to these research efforts, Kernel PCA (KPCA) has emerged as a new data representation and analysis method that extends the capabilities of the classical PCA, which is traditionally restricted to linear feature spaces, to feature spaces that may be nonlinearly correlated (Scholkopf, Smola, & Muller, 1999). In this method, the input vectors are implicitly projected on a high-dimensional space by using a nonlinear mapping. Standard PCA is then applied to this high-dimensional space. KPCA avoids explicit calculation of high-dimensional projections with the use of kernel functions, such as radial basis functions (RBF), high-degree polynomials, or the sigmoid function. KPCA has been successfully used to capture important information from large, nonlinear feature spaces into a smaller set of principal components (Scholkopf et al., 1999). Operations such as clustering or classification can then be carried out in this reduced dimensional space. Because noise is eliminated as projections on eigenvectors with low eigenvalues, the final reduced space of larger principal components contains less noise and yields better results with further data analysis tasks such as classification.

Although many conventional methods have been previously developed for extraction of principal components from nonlinearly correlated data, none allowed for generalization of the concepts to dimensionality reduction of symbolic spaces. The kernel-space representation of KPCA presents such an opportunity. However, since its inception, applications of KPCA have been primarily limited to domains with real-valued, nonlinearly correlated features, despite the recent literature on defining kernels over several discrete objects such as sequences, trees, graphs, as well as many other types of objects. Moreover, recent techniques like the Fisher kernel approach by Jaakkola and Haussler (1999) can be used to systematically derive kernels from generative models, which have been demonstrated quite successfully in the rich symbolic feature domain of bioinformatics. Against the backdrop of these emerging collections of research, the work presented in this paper uses the ideas of Kernel PCA and symbolic kernel functions to investigate the yet unexplored problem of symbolic domain principal component extraction in the context of multimedia. The kernels used here are designed based on well-known distance metrics, namely Hamming distance, Cityblock distance, and the Edit distance metric, and have been previously used for string comparisons in several domains, including gene sequencing.

With these and other symbolic kernels, multimodal data from multimedia analysis containing real and symbolic values can be handled in a uniform fashion by using, say, an SVM classifier employing a kernel function that is a combination of Euclidean, Hamming, and Edit Distance kernels. Applications of the proposed kernel functions to temporal analysis of videotext data demonstrate the utility of this approach.


MAIN THRUST

Distance Kernels for Multimodal Data

Kernel-based classifiers such as SVMs and Neural Networks use linear, Radial Basis Function (RBF), or polynomial
mials, or the sigmoid function. KPCA has been success- then process them in this space. Many of the common

kernels. Many of these kernels assume a Euclidean distance to compare feature vectors. Symbolic kernels, in contrast, have been less commonly used in the published literature. We use the following distance-based kernel functions in our analysis.

1. Linear (Euclidean) Kernel Function for Real-Valued Features: This is the most commonly used kernel function for linear SVMs and other kernel-based algorithms. Let x and z be two feature vectors. Then the function

   K_l(x, z) = x^T z    (1)

defines a Euclidean distance-based kernel function. KPCA with the linear kernel function reduces to standard PCA. The linear kernel trivially follows Mercer's condition for kernel validity; that is, the matrix comprising pairwise kernel function values for any finite subset of feature vectors selected from the feature space is guaranteed to be positive semidefinite.

2. Hamming Kernel Function for Nominal Features: Let the number of features be N. Let x and z be two feature vectors, that is, N-dimensional symbolic vectors from a finite, symbolic feature space X^N. Then the function

   K_h(x, z) = N − Σ_{i=1}^{N} δ(x_i, z_i)    (2)

where δ(x_i, z_i) = 0 if x_i = z_i and 1 otherwise, defines a Hamming distance-based kernel function and follows Mercer's condition for kernel validity (Aradhye & Dorai, 2002). The equality x_i = z_i refers to a symbol/label match.

3. Cityblock Kernel Function for Discrete Features: Using the preceding notation, the function

   K_c(x, z) = Σ_{i=1}^{N} (M_i − |x_i − z_i|)    (3)

where the ith feature is M_i-ary, defines a Cityblock distance-based kernel function and follows Mercer's condition for kernel validity (Aradhye & Dorai, 2002).

4. Edit Kernel Function for String-Like Features: We define an edit kernel between two strings as

   K_e(x, z) = max(len(x), len(z)) − E(x, z)    (4)

where E(x, z) is the edit distance between the two strings, defined conventionally as the minimum number of change, delete, and insert operations required to convert one string into another, and len(x) is the length of string x. In theory, the Edit distance kernel does not obey Mercer validity, as has been recently proved by Cortes and coworkers (Cortes, Haffner, & Mohri, 2002, 2003). However, empirically, the kernel matrices generated by the edit kernel are often positive definite, justifying the practical use of Edit distance-based kernels.

Hybrid Multimodal Kernel

Having defined these four basic kernel functions for different modalities of features, we are now in a position to define a multimodal kernel function that encompasses all types of common multimedia features. Let any given feature vector x be comprised of a real-valued feature set x_r, a nominal feature set x_h, a discrete-valued feature set x_c, and a string-style feature x_e, such that x = [x_r x_h x_c x_e]. Then, because a linear combination of valid kernel functions is a valid kernel function, we define

   K_m(x, z) = α K_l(x_r, z_r) + β K_h(x_h, z_h) + γ K_c(x_c, z_c) + δ K_e(x_e, z_e)    (5)

where K_m(x, z) is our multimodal kernel and α, β, γ, and δ are constants. Such a hybrid kernel can now be seamlessly used to analyze multimodal feature vectors that have real and symbolic values, without imposing any artificial integer mapping of symbolic labels and further obtaining the benefits of analyzing disparate data together as one. The constants α, β, γ, and δ can be determined in practice either by a knowledge-based analysis of the relative importance of the different types of features or by empirical optimization.
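A compact Python sketch of Equations 1 through 5 may help make the definitions concrete. It is an illustration rather than the original implementation; the feature layout passed to k_multimodal, the default unit weights, and all function names are assumptions made here.

# Illustrative implementations of the kernels in Equations 1-5.
import numpy as np

def k_linear(x, z):
    # Eq. 1: ordinary dot product for real-valued features
    return float(np.dot(x, z))

def k_hamming(x, z):
    # Eq. 2: N minus the number of mismatching nominal features
    return sum(1 for a, b in zip(x, z) if a == b)

def k_cityblock(x, z, M):
    # Eq. 3: sum over features of (M_i - |x_i - z_i|) for M_i-ary discrete features
    return sum(m - abs(a - b) for a, b, m in zip(x, z, M))

def edit_distance(s, t):
    # Standard dynamic-programming (Levenshtein) edit distance
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def k_edit(s, t):
    # Eq. 4: maximum string length minus the edit distance
    return max(len(s), len(t)) - edit_distance(s, t)

def k_multimodal(x, z, M, weights=(1.0, 1.0, 1.0, 1.0)):
    # Eq. 5: weighted sum of the four kernels; each argument is assumed to be a
    # tuple (real_part, nominal_part, discrete_part, string_part)
    a, b, c, d = weights
    return (a * k_linear(x[0], z[0]) + b * k_hamming(x[1], z[1])
            + c * k_cityblock(x[2], z[2], M) + d * k_edit(x[3], z[3]))

In practice the weights would be tuned on held-out data, mirroring the empirical optimization mentioned above.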
Example Multimedia Application: Videotext Postprocessing

Video sequences contain a rich combination of images, sound, motion, and text. Videotext, which refers to superimposed text on images and video frames, serves as an important source of semantic information in video streams, besides speech, closed captions, and visual content in video. Recognizing text superimposed on video frames yields important information such as the identity of the speaker, his/her location, topic under discussion, sports scores, product names, associated shopping data, and so forth, allowing for automated content description,

search, event monitoring, and video program categorization. Therefore, a videotext-based Multimedia Description Scheme has recently been adopted into the MPEG-7 ISO/IEC standard to facilitate media content description (Dimitrova, Agnihotri, Dorai, & Bolle, 2000). To this end, videotext extraction and recognition is an important task of an automated video content analysis system. Figure 1 shows three illustrative frames containing videotext taken from MPEG videos of different genres.

Figure 1. Illustrative frames with videotext

However, unlike scanned paper documents, videotext is superimposed on often changing backgrounds comprising moving objects with a rich variety of color and texture. In addition, videotext is often of low resolution and suffers from compression artifacts. Due to these difficulties, existing OCR algorithms result in low accuracy when applied to the problem of videotext extraction and recognition. On the other hand, we observed that temporal videotext often persists on the screen over a span of time (approximately 30 seconds) to ensure readability, resulting in many available samples of the same body of text over multiple frames of video. This redundancy can be exploited to improve recognition accuracy, although erroneous text extraction and/or incorrect character recognition by a classifier may make the strings dissimilar from frame to frame. Often no single instance of the text may lead to perfect recognition, underscoring the need for intelligent postprocessing.

In the existing literature, unfortunately, temporal contiguity analysis of videotext is implemented by using ad-hoc thresholds and heuristics for the following reasons. First of all, due to missed or merged characters, the same string may be perceived to be of different lengths on different frames. We thus have feature vectors of varying lengths. Secondly, the exact duration of the persistence of videotext is unknown a priori. Two consecutive frames can have completely different strings. Thirdly, videotext can be in scrolling motion. Because multiple moving text blocks can be present in the same video frame, it is nontrivial to recognize which videotext objects from consecutive frames are instances of the same text.

In light of these difficulties, we present brief illustrative examples of dimensionality reduction and visualization of feature vectors comprising strings of recognized videotext. Experiments with our Edit distance kernel investigated the use of KPCA for analyzing the temporal contiguity of videotext using these feature vectors.

• Videotext Clustering and Change Detection: Figure 2 shows the first two components obtained by applying KPCA with the Edit distance kernel to a set of strings recognized from 20 consecutive frames. These frames contain instances of two distinct strings. Without any assumed knowledge, KPCA's use of the Edit distance kernel clearly shows two distinct clusters corresponding to these two strings.
• Videotext Tracking and Outlier Detection: Figure 3 shows the first three principal components obtained by applying KPCA with our Edit distance kernel to a set of strings recognized from 20 consecutive frames. These frames contain instances of videotext scrolling across the screen. In this three-dimensional plot, we can see a visual representation of the changing content of videotext as a trajectory in the principal component space, and locating outliers from the trajectory indicates the appearance of other strings in the video frames.

Figure 2. Videotext clustering
Figure 3. Videotext tracking and visualization

These results show that the symbolic kernels can assist significantly in automated agglomeration and tracking of recognized text as well as effective data visualization. In addition, multimodal feature vectors constructed from recognized text strings and frame motion estimates can now be analyzed jointly by using our hybrid kernel for media content characterization.
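The clustering experiment can be illustrated with a short Kernel PCA sketch over recognized strings, reusing the k_edit helper from the previous sketch (Equation 4). The example strings are invented for illustration and are not the data used in the reported experiments.

# Kernel PCA over recognized videotext strings using the edit kernel (Eq. 4).
# Assumes k_edit(s, t) as defined in the earlier sketch.
import numpy as np

def kernel_pca(items, kernel, n_components=2):
    n = len(items)
    K = np.array([[kernel(a, b) for b in items] for a in items], dtype=float)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center the kernel matrix
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    # project each item onto the leading kernel principal components
    return Kc @ (vecs / np.sqrt(np.maximum(vals, 1e-12)))

# Hypothetical OCR output for six frames carrying two distinct captions.
strings = ["BREAKING NEWS", "BREAK1NG NEWS", "BREAKING NEVS",
           "SPORTS UPDATE", "SP0RTS UPDATE", "SPORTS UPDAT"]
coords = kernel_pca(strings, k_edit, n_components=2)
# Frames carrying the same underlying caption end up close together in coords.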
FUTURE TRENDS

One of the big hurdles facing media management systems is the semantic gap between the high-level meaning sought by user queries in search for media and the low-level features that we actually compute today for media indexing and description. Computational Media Aesthetics, a promising approach to bridging the gap and building
high-level semantic descriptions for media search and navigation services, is founded upon an understanding of media elements and their individual and joint roles in synthesizing meaning and manipulating perceptions, with a systematic study of media productions (Dorai & Venkatesh, 2001). The core trait of this approach is that in order to create effective tools for automatically understanding video, we need to be able to interpret the data with its maker's eye. In order to realize the potential of this approach, it becomes imperative that all sources of descriptive information, audio, video, text, and so forth need to be considered as a whole and analyzed together to derive inferences with a certain level of integrity. With the ability to treat multimodal features as an integrated feature set to describe media content during classification and visualization, new higher level semantic mappings from low-level features can be achieved to describe media content. The symbolic kernels are a promising initial step in that direction to facilitate rigorous joint feature analysis in various media domains.


CONCLUSION

Traditional integer representation of symbolic multimedia feature data for classification and other data-mining tasks is artificial, as the symbolic space may not reflect the continuity and neighborhood relations imposed by integer representations. In this paper, we use distance-based kernels in conjunction with kernel space methods such as KPCA to handle multimodal data, including symbolic features. These symbolic kernels, as shown in this paper, help apply traditionally numeric methods to symbolic spaces without any forced integer mapping for important tasks such as data visualization, principal component extraction, and clustering in multimedia and other domains.


REFERENCES

Aradhye, H., & Dorai, C. (2002). New kernels for analyzing multimodal data in multimedia using kernel machines. Proceedings of the IEEE International Conference on Multimedia and Expo, Switzerland, 2 (pp. 37-40).

Bradley, P. S., Fayyad, U. M., & Mangasarian, O. (1998). Data mining: Overview and optimization opportunities (Tech. Rep. No. 98-01). Madison: University of Wisconsin, Computer Sciences Department.

Collins, M., Dasgupta, S., & Schapire, R. (2001). A generalization of principal component analysis to the exponential family. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems 14 (pp. 617-624). Cambridge, MA: MIT Press.

Cortes, C., Haffner, P., & Mohri, M. (2002). Rational kernels. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems 15 (pp. 41-56). Cambridge, MA: MIT Press.

Cortes, C., Haffner, P., & Mohri, M. (2003). Positive definite rational kernels. Proceedings of the 16th Annual Conference on Computational Learning Theory (pp. 41-56), USA.

Dimitrova, N., Agnihotri, L., Dorai, C., & Bolle, R. (2000, October). MPEG-7 videotext descriptor for superimposed text in images and video. Signal Processing: Image Communication, 16, 137-155.

Dorai, C., & Venkatesh, S. (2001, October). Computational media aesthetics: Finding meaning beautiful. IEEE Multimedia, 8(4), 10-12.

Jaakkola, T. S., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems 11 (pp. 487-493). Cambridge, MA: MIT Press.

Scholkopf, B., Smola, A., & Muller, K. R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: SV learning (pp. 327-352). Cambridge, MA: MIT Press.

Tipping, M. E. (1999). Probabilistic visualisation of high-dimensional binary data. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems 11 (pp. 592-598). Cambridge, MA: MIT Press.


KEY TERMS

Dimensionality Reduction: The process of transformation of a large dimensional feature space into a space comprising a small number of (uncorrelated) components. Dimensionality reduction allows us to visualize, categorize, or simplify large datasets.

Kernel Function: A function that intrinsically defines the projection of two feature vectors (function arguments) onto a high-dimensional space and a dot product therein.

Mercer's Condition: A kernel function is said to obey Mercer's condition for kernel validity iff the kernel matrix comprising pairwise kernel evaluations over any given subset of the feature space is guaranteed to be positive semidefinite.

MPEG Compression: Video/audio compression standard established by the Motion Picture Experts Group. MPEG compression algorithms use psychoacoustic modeling of audio and motion analysis as well as DCT of video data for efficient multimedia compression.

Multimodality of Feature Data: Feature data is said to be multimodal if the features can be characterized as a mixture of real-valued, discrete, ordinal, or nominal values.

Principal Component Analysis (PCA): One of the oldest modeling and dimensionality reduction techniques. PCA models observed feature data as a linear combination of a few uncorrelated, Gaussian principal components and additive Gaussian noise.

Videotext: Text graphically superimposed on video imagery, such as caption text, headline news, speaker identity, location, and so on.


ENDNOTE

1 The term symbolic is loosely used in this paper to mean discrete, ordinal, and nominal features.

Multiple Hypothesis Testing for Data Mining

Sach Mukherjee
University of Oxford, UK


INTRODUCTION

A number of important problems in data mining can be usefully addressed within the framework of statistical hypothesis testing. However, while the conventional treatment of statistical significance deals with error probabilities at the level of a single variable, practical data mining tasks tend to involve thousands, if not millions, of variables. This Chapter looks at some of the issues that arise in the application of hypothesis tests to multi-variable data mining problems, and describes two computationally efficient procedures by which these issues can be addressed.


BACKGROUND

Many problems in commercial and scientific data mining involve selecting objects of interest from large datasets on the basis of numerical relevance scores (object selection). This Section looks briefly at the role played by hypothesis tests in problems of this kind. We start by examining the relationship between relevance scores, statistical errors and the testing of hypotheses in the context of two illustrative data mining tasks. Readers familiar with conventional hypothesis testing may wish to progress directly to the main part of the Chapter.

As a topical example, consider the differential analysis of gene microarray data (Piatetsky-Shapiro & Tamayo, 2004; Cui & Churchill, 2003). The data consist of expression levels (roughly speaking, levels of activity) for each of thousands of genes across two or more conditions (such as healthy and diseased). The data mining task is to find a set of genes which are differentially expressed between the conditions, and therefore likely to be relevant to the disease or biological process under investigation. A suitably defined mathematical function (the t-statistic is a canonical choice) is used to assign a relevance score to each gene, and a subset of genes is selected on the basis of the scores. Here, the objects being selected are genes.

As a second example, consider the mining of sales records. The aim might be, for instance, to focus marketing efforts on a subset of customers, based on some property of their buying behavior. A suitably defined function would be used to score each customer by relevance, on the basis of his or her records. A set of customers with high relevance scores would then be selected as targets for marketing activity. In this example, the objects are customers.

Clearly, both tasks are similar; each can be thought of as comprising the assignment of a suitably defined relevance score to each object and the subsequent selection of a set of objects on the basis of the scores. The selection of objects thus requires the imposition of a threshold or cut-off on the relevance score, such that objects scoring higher than the threshold are returned as relevant. Consider the microarray example described above. Suppose the function used to rank genes is simply the difference between mean expression levels in the two classes. Then the question of setting a threshold amounts to asking how large a difference is sufficient to consider a gene relevant. Suppose we decide that a difference in means exceeding x is large enough: we would then consider each gene in turn, and select it as relevant if its relevance score equals or exceeds x.

Now, an important point is that the data are random variables, so that if measurements were collected again from the same biological system, the actual values obtained for each gene might differ from those in the particular dataset being analyzed. As a consequence of this variability, there will be a real possibility of obtaining scores in excess of x from genes which are in fact not relevant.

In general terms, high scores which are simply due to chance (rather than the underlying relevance of the object) lead to the selection of irrelevant objects; errors of this kind are called false positives (or Type I errors). Conversely, a truly relevant object may have an unusually low score, leading to its omission from the final set of results. Errors of this kind are called false negatives (or Type II errors). Both types of error are associated with identifiable costs: false positives lead to wasted resources, and false negatives to missed opportunities. For example, in the market research context, false positives may lead to marketing material being targeted at the wrong customers; false negatives may lead to the omission of the right customers from the marketing campaign. Clearly, the rates of each kind of error are related to the threshold imposed on the relevance score: an excessively strict threshold will minimize false positives but produce many false negatives, while an overly lenient threshold will have the
function would be used to score each customer by rel- tives, while an overly lenient threshold will have the

opposite effect. Setting an appropriate threshold is therefore vital to controlling errors and associated costs.

Statistical hypothesis testing can be thought of as a framework within which the setting of thresholds can be addressed in a principled manner. The basic idea is to specify an acceptable false positive rate (i.e. an acceptable probability of Type I error) and then use probability theory to determine the precise threshold which corresponds to that specified error rate. A general discussion of hypothesis tests at an introductory level can be found in textbooks of statistics such as DeGroot and Schervish (2002), or Moore and McCabe (2002); the standard advanced reference on the topic is Lehmann (1997).

Now, let us assume for the moment that we have only one object to consider. The hypothesis that the object is irrelevant is called the null hypothesis (and denoted by H0), and the hypothesis that it is relevant is called the alternative hypothesis (H1). The aim of the hypothesis test is to make a decision regarding the relevance of the object, that is, a decision as to which hypothesis should be accepted. Suppose the relevance score for the object under consideration is t. A decision regarding the relevance of the object is then made as follows:

(1) Specify an acceptable level of Type I error p*.
(2) Use the sampling distribution of the relevance score under the null hypothesis to compute a threshold score corresponding to p*. Let this threshold score be denoted by c.
(3) If t ≥ c, reject the null hypothesis and regard the object as relevant. If t < c, regard the object as irrelevant.

The specified error level p* is called the significance level of the test and the corresponding threshold c the critical value.

Hypothesis testing can alternatively be thought of as a procedure by which relevance scores are converted into corresponding error probabilities. The null sampling distribution can be used to compute the probability p of making a Type I error if the threshold is set at exactly t, i.e. just low enough to select the given object. This then allows us to assert that the probability of obtaining a false positive if the given object is to be selected is at least p. This latter probability of Type I error is called a P-value. In contrast to relevance scores, P-values, being probabilities, have a clear interpretation. For instance, if we found that an object had a t-statistic value of 3 (say), it would be hard to tell whether the object should be regarded as relevant or not. However, if we found the corresponding P-value was 0.001, we would know that if the threshold were set just low enough to include the object, the false positive rate would be 1 in 1000, a fact that is far easier to interpret.
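As a purely illustrative sketch (the Chapter does not prescribe a particular null distribution), the score-to-P-value conversion and the critical value can be computed as follows, assuming a standard normal null distribution and using scipy.

# Converting a relevance score to a P-value under an assumed standard
# normal null distribution (illustrative choice only).
from scipy.stats import norm

t = 3.0               # observed relevance score for one object
p_value = norm.sf(t)  # P(T >= t) under the null; about 0.00135 here

p_star = 0.001        # chosen significance level
c = norm.isf(p_star)  # critical value: reject the null if t >= c
print(p_value, c, t >= c)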

Multiple Hypothesis Testing for Data Mining

Table 1. Summary table for multiple testing, following committing at least one Type I error, and then calculate
Benjamini and Hochberg (1995) a corresponding threshold in terms of nominal P-value.

Selected Not selected Total


Procedure
Irrelevant objects S0 m0-S0 m0
Relevant objects S1 m1-S1 m1
Total S m-S m
(1) Specify an acceptable probability p* of at least
one Type I error being committed.
(2) Recall that P(1) P(2) P(m) represent the
common to the methods discussed below. The data is ordered P-values. Find the largest i which satis-
processed as follows: fies the following inequality:

(1) Score each of the m objects under consideration


using a suitably chosen test statistic f. If the data p*
P( i )
corresponding to object j is denoted by D j, and the m
corresponding relevance score by Tj:
Let the largest i found be denoted k.
Tj = f (D j )
(3) Select the k objects corresponding to P-values
P(1), P(2) P(k) as relevant.
Let the scores so obtained be denoted by T1, T2,Tm.
If we assume that all m tests are statistically inde-
(2) Convert each score T j to a corresponding P-value Pj
pendent, it is easy to show that this simple procedure
by making use of the relevant null sampling distri-
does indeed guarantee that the probability of obtaining
bution:
at least one false positive is p*.

null sampling
distribution
The Bonferroni Method

This is perhaps the simplest method for correcting P-values. We first specify an acceptable probability of committing at least one Type I error, and then calculate a corresponding threshold in terms of nominal P-value.

Procedure

(1) Specify an acceptable probability p* of at least one Type I error being committed.

(2) Recall that P(1) ≤ P(2) ≤ ... ≤ P(m) represent the ordered P-values. Find the largest i which satisfies the following inequality:

P(i) ≤ p* / m

Let the largest i found be denoted k.

(3) Select the k objects corresponding to P-values P(1), P(2), ..., P(k) as relevant.

If we assume that all m tests are statistically independent, it is easy to show that this simple procedure does indeed guarantee that the probability of obtaining at least one false positive is p*.

Advantages

• The basic Bonferroni procedure is extremely simple to understand and use.

Disadvantages

• The Bonferroni procedure sets the threshold to meet a specified probability of making at least one Type I error. This notion of error is called Family Wise Error Rate, or FWER, in the statistical literature. FWER is a very strict control of error, and is far too conservative for most practical applications. Using FWER on datasets with large numbers of variables often results in the selection of a very small number of objects, or even none at all.

• The assumption of statistical independence between tests is a strong one, and almost never holds in practice.
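A minimal Python sketch of the Bonferroni selection rule described above is given below. The function name and its arguments are illustrative choices; the function simply selects every object whose nominal P-value is at most p*/m, which is equivalent to the largest-i formulation of the procedure.

def bonferroni_select(p_values, p_star):
    # Select objects whose nominal P-value does not exceed p*/m.
    m = len(p_values)
    return [j for j, p in enumerate(p_values) if p <= p_star / m]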
The FDR Method of Benjamini and Hochberg

An alternative way of thinking about errors in multiple testing is the false discovery rate (FDR) (Benjamini & Hochberg, 1995). Looking at Table 1 we can see that of S objects selected, S0 are false positives. FDR is simply
the average proportion of false positives among the objects selected:

FDR = E[S0 / S]

where E[·] denotes expectation. The Benjamini and Hochberg method allows us to compute a threshold corresponding to a specified FDR.

Procedure

(1) Specify an acceptable FDR q*.

(2) Find the largest i which satisfies the following inequality:

P(i) ≤ (i / m) q*

Let the largest i found be denoted k.

(3) Select the k objects corresponding to P-values P(1), P(2), ..., P(k) as relevant.

Under the assumption that the m tests are statistically independent, it can be shown that this procedure leads to a selection of objects such that the FDR is indeed q*.

Advantages

• The FDR concept is far less conservative than the Bonferroni method described above.

• Specifying an acceptable proportion q* of false positives is an intuitively appealing way of controlling the error rate in object selection.

• The FDR method tells us in advance that a certain proportion of results are likely to be false positives. As a consequence, it becomes possible to plan for the associated costs.

Disadvantages

• Again, the assumption of statistically independent tests rarely holds.
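The Benjamini and Hochberg step-up rule can be sketched in a few lines of Python. The function below is an illustration rather than a reference implementation: it sorts the nominal P-values, finds the largest i with P(i) ≤ (i/m) q*, and returns the indices of the k selected objects.

import numpy as np

def benjamini_hochberg_select(p_values, q_star):
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                       # object indices, smallest P first
    thresholds = (np.arange(1, m + 1) / m) * q_star
    below = p[order] <= thresholds              # P_(i) <= (i/m) * q*
    if not below.any():
        return []                               # no object is selected
    k = int(np.max(np.nonzero(below)[0])) + 1   # largest i satisfying the inequality
    return sorted(order[:k].tolist())

Under the independence assumption discussed above, the set returned by this rule has expected false discovery proportion at most q*.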
FUTURE TRENDS

A great deal of recent work in statistics, machine learning and data mining has focused on various aspects of multiple testing in the context of object selection. Some important areas of active research which were not discussed in detail are briefly described below.

Robust FDR Methods

The FDR procedure outlined above, while simple and computationally efficient, makes several strong assumptions, and while better than Bonferroni is still often too conservative for practical problems (Storey & Tibshirani, 2003). The recently introduced q-value (Storey, 2003) is a more sophisticated approach to FDR correction and provides a very robust methodology for multiple testing. The q-value method makes use of the fact that P-values are uniformly distributed under the null hypothesis to accurately estimate the FDR associated with a particular threshold. Estimated FDR is then used to set an appropriate threshold. The q-value approach is an excellent choice in many multi-variable settings.

Resampling Based Methods

In recent years a number of resampling based methods have been proposed for multiple testing (see e.g. Westfall & Young, 1993). These methods make use of computationally intensive procedures such as the bootstrap (Efron, 1982; Davison & Hinckley, 1997) to perform non-parametric P-value corrections. Resampling methods are extremely powerful and make fewer strong assumptions than methods based on classical statistics, but in many data mining applications the computational burden of resampling may be prohibitive.
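As a hedged illustration of the resampling idea, the sketch below computes single-step max-T adjusted P-values by permuting class labels, in the spirit of Westfall and Young (1993). The two-group design, the difference-of-means statistic and the function names are assumptions made for the example; real applications would choose statistics and resampling schemes appropriate to the data at hand.

import numpy as np

def max_t_adjusted_pvalues(X, labels, n_perm=1000, seed=0):
    # X: (m_objects, n_samples) data matrix; labels: binary array of length n_samples.
    # Per-object statistic: absolute difference of group means.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).astype(bool)

    def stat(lab):
        return np.abs(X[:, lab].mean(axis=1) - X[:, ~lab].mean(axis=1))

    observed = stat(labels)
    count = np.zeros_like(observed)
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        # compare the maximum permuted statistic to every observed statistic
        count += stat(perm).max() >= observed
    return (count + 1) / (n_perm + 1)   # adjusted P-values controlling FWER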
Machine Learning of Relevance Functions

The methods described in this chapter allowed us to determine threshold relevance scores, but very little attention was paid to the important issue of choosing an appropriate relevance scoring function. Recent research in bioinformatics (Broberg, 2003; Mukherjee, 2004a) has shown that the effectiveness of a scoring function can be very sensitive to the statistical model underlying the data, in ways which can be difficult to address by conventional means. When fully labeled data (i.e. datasets with objects flagged as relevant/irrelevant) are available, canonical supervised algorithms (see e.g. Hastie, Tibshirani, & Friedman, 2001) can be used to learn effective relevance functions. However, in many cases - microarray data being one example - fully labeled data is hard to obtain. Recent work in machine learning (Mukherjee, 2004b) has addressed the problem of learning relevance functions in an unsupervised setting by exploiting a probabilistic notion of stability; this approach turns out to be remarkably effective in settings where underlying statistical models are poorly understood and labeled data unavailable.
CONCLUSION

An increasing number of problems in industrial and scientific data analysis involve multiple testing, and there has consequently been an explosion of interest in the topic in recent years. This chapter has discussed a selection of important concepts and methods in multiple testing; for further reading we recommend Benjamini and Hochberg (1995) and Dudoit, Shaffer, and Boldrick (2003).

REFERENCES

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, 289-300.

Broberg, P. (2003). Statistical methods for ranking differentially expressed genes. Genome Biology, 4(6).

Cui, X., & Churchill, G. (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biology, 4(210).

Davison, A. C., & Hinckley, D. V. (1997). Bootstrap methods and their applications. Cambridge University Press.

DeGroot, M. H., & Schervish, M. J. (2002). Probability and statistics (3rd ed.). Addison-Wesley.

Dudoit, S., Shaffer, J. P., & Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, 18(1), 71-103.

Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer-Verlag.

Lehmann, E. L. (1997). Testing statistical hypotheses (2nd ed.). Springer-Verlag.

Moore, D. S., & McCabe, G. P. (2002). Introduction to the practice of statistics (4th ed.). W. H. Freeman.

Mukherjee, S. (2004a). A theoretical analysis of gene selection. Proceedings of the IEEE Computer Society Bioinformatics Conference 2004 (CSB 2004). IEEE Press.

Mukherjee, S. (2004b). Unsupervised learning of ranking functions for high-dimensional data (Tech. Rep. No. PARG-04-02). University of Oxford, Department of Engineering Science.

Piatetsky-Shapiro, G., & Tamayo, P. (2003). Microarray data mining: Facing the challenges. SIGKDD Explorations, 5(2).

Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035.

Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, 100 (pp. 9440-9445).

Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons.

KEY TERMS

Alternative Hypothesis: The hypothesis that an object is relevant. In general terms, the alternative hypothesis refers to the set of events that is the complement of the null hypothesis.

False Discovery Rate (FDR): The expected proportion of false positives among the objects selected as relevant.

False Negative: The error committed when a truly relevant object is not selected. More generally, a false negative occurs when the null hypothesis is erroneously accepted. Also called Type II error.

False Positive: The error committed when an object is selected as relevant when it is in fact irrelevant. More generally, a false positive occurs when the null hypothesis is erroneously rejected. Also called Type I error.

Gene Microarray Data: Measurements of mRNA abundances derived from biochemical devices called microarrays. These are essentially measures of gene activity.

Hypothesis Test: A formal statistical procedure by which an interesting hypothesis (the alternative hypothesis) is accepted or rejected on the basis of data.

Multiple Hypothesis Test: A formal statistical procedure used to account for the effects of multiplicity in a hypothesis test.

Null Hypothesis: The hypothesis that an object is irrelevant. In general terms, it is the hypothesis we wish to falsify on the basis of data.

P-Value: The P-value for an object is the probability of obtaining a false positive if the threshold is set just high enough to include the object among the set selected
as relevant. More generally, it is the false positive rate corresponding to an observed test statistic.

Random Variable: A variable characterized by random behavior in assuming its different possible values.

Sampling Distribution: The distribution of values obtained by applying a function to random data.

Test Statistic: A relevance scoring function used in a hypothesis test. Classical test statistics (such as the t-statistic) have null sampling distributions which are known a priori. However, under certain circumstances null sampling distributions for arbitrary functions can be obtained by computational means.
Music Information Retrieval


Alicja A. Wieczorkowska
Polish-Japanese Institute of Information Technology, Poland

INTRODUCTION

Music information retrieval is a multi-disciplinary research on retrieving information from music. This research involves scientists from traditional, music, and digital libraries; information science; computer science; law; business; engineering; musicology; cognitive psychology; and education (Downie, 2001).

BACKGROUND

A huge amount of audio resources, including music data, is becoming available in various forms, both analog and digital. Notes, CDs, and digital resources of the World Wide Web are growing constantly in amount, but the value of music information depends on how easy it can be found, retrieved, accessed, filtered, and managed (Fingerhut, 1997; International Organization for Standardization, 2003).

Music information retrieval consists of quick and efficient searching for various types of audio data of interest to the user, filtering them in order to receive only the data items that satisfy the user's preferences (International Organization for Standardization, 2003). This broad domain of research includes:

• Audio retrieval by content;
• Auditory scene analysis and recognition (Rosenthal & Okuno, 1998);
• Music transcription;
• Denoising of old analog recordings; and
• Other topics discussed further in this article (Wieczorkowska & Ras, 2003).

The topics are interrelated, since the same or similar techniques can be applied for various purposes. For instance, source separation, usually applied in auditory scene analysis, is used also for music transcription and even restoring (denoising) of old recordings. The research on music information retrieval has many applications. The most important include automatic production of music score on the basis of the presented input, retrieval of music pieces from huge audio databases, and restoration of old recordings. Generally, the research within the music information retrieval domain is focused on harmonic structure analysis, note extraction, melody and rhythm tracking, timbre and instrument recognition, classification of type of the signal (speech, music, pitched vs. non-pitched), and so forth.

The research basically uses digital audio recordings, where the sound waveform is digitally stored as a sequence of discrete samples representing the sound intensity at a given time instant, and MIDI files, storing information on parameters of electronically synthesized sounds (voice, note on, note off, pitch bend, etc.). Sound analysis and data mining tools are used to extract information from music files in order to provide the data that meet users' needs (Wieczorkowska & Ras, 2001).

MAIN THRUST

The music information retrieval domain covers a broad range of topics of interest, and various types of music data are investigated in this research. Basic techniques of digital sound analysis for these purposes come from speech processing focused on automatic speech recognition and speaker identification (Foote, 1999). Sound descriptors calculated this way can be added to the audio files in order to facilitate content-based searching of music databases.

The issue of representation of music and multimedia information in a form that allows interpretation of the information's meaning is addressed by the MPEG-7 standard, named Multimedia Content Description Interface. MPEG-7 provides a rich set of standardized tools to describe multimedia content through metadata (i.e., data about data), and music information description also has been taken into account in this standard (International Organization for Standardization, 2003).

The following topics are investigated within the music information retrieval domain:

• Auditory scene analysis, which focuses on various aspects of music like timbre description, sound harmonicity, spatial origin, source separation, and so forth (Bregman, 1990). Timbre is defined subjectively as this feature of sound that distinguishes two sounds of the same pitch, loudness, and duration. Therefore, subjective listening tests are often performed in this research, but also signal-
processing techniques are broadly applied here. One of the main topics of computational auditory scene analysis is automatic separation of individual sound sources from a mixture. It is difficult with mixtures of harmonic instrument sounds, where spectra overlap. However, assuming time-frequency smoothness of the signal, sound separation can be performed, and when sound changes in time are observed, onset, offset, amplitude, and frequency modulation have similar shapes for all frequencies in the spectrum; thus, a demixing matrix can be estimated for them (Virtanen, 2003; Viste & Evangelista, 2003). Audio source separation techniques also can be used for source localization for auditory scene analysis. These techniques, like independent component analysis, originate from speech recognition in the cocktail party environment, where many sound sources are present. Independent component analysis is used for finding underlying components from multidimensional statistical data, and it looks for components that are statistically independent (Vincent et al., 2003).

• Computational auditory scene recognition aims at classifying auditory scenes into predefined classes, using audio information only. Examples of auditory scenes are various outside and inside environments, like streets, restaurants, offices, homes, cars, and so forth. Statistical and nearest neighbor algorithms can be applied for this purpose. In the nearest neighbor algorithm, the class (type of auditory scene, in this case) is assigned on the basis of the distance of the investigated sample to the nearest sample, for which the class membership is known. Various acoustic features, based on Fourier spectral analysis (i.e., a mathematical transform, decomposing the signal into frequency components), can be applied to parameterize the auditory scene for classification purposes. Effectiveness of this research approaches 70% correctness for about 20 auditory scenes (Peltonen et al., 2002). A small illustrative sketch of this feature-and-nearest-neighbor approach is given after this topic list.

• Query-by-humming systems, which search melodic databases using sung queries (Adams et al., 2003). This topic represents audio retrieval by contents. Melody usually is quantized coarsely with respect to pitch and duration, assuming moderate singing abilities of users. The music retrieval system takes such an aural query (i.e., a motif or a theme) as input, and searches the database for the piece from which this query comes. Markov models, based on Markov chains, can be used for modeling musical performances. A Markov chain is a stochastic process for which the parameter is discrete time values. In a Markov sequence of events, the probability of future states depends on the present state; in this case, states represent pitch (or one set of pitches) and duration (Birmingham et al., 2001). Query-by-humming is one of the more popular topics within the music information retrieval domain.

• Audio retrieval-by-example for orchestral music aims at searching for acoustic similarity in an audio collection, based on analysis of the audio signal. Given an example audio document, other documents in a collection can be ranked by similarity on the basis of long-term structure; specifically, the variation of soft and louder passages, determined from the envelope of audio energy vs. time in one or more frequency bands (Foote, 2000). This research is a branch of audio retrieval by content. Audio query-by-example search also can be performed within a single document when searching for sounds similar to the selected sound event. Such a system for content-based audio retrieval can be based on a self-organizing feature map (i.e., a special kind of neural network designed by analogy with a simplified model of the neural connections in the brain and trained to find relationships in the data). Perceptual similarity can be assessed on the basis of spectral evolution in order to find sounds of similar timbre (Spevak & Polfreman, 2001). Neural networks also are used in other forms in audio information retrieval systems. For instance, time-delayed neural networks (i.e., neural nets with time delay inputs) are applied, since they perform well in speech recognition applications (Meier et al., 2000). One of the applications of audio retrieval-by-example is searching for a piece in a huge database of music pieces with the use of so-called audio fingerprinting technology that allows piece identification. Given a short passage transmitted, for instance, via car phone, the piece is extracted, and, most important, information also is extracted on the performer and title linked to this piece in the database. In this way, the user may identify the piece of music with very high accuracy (95%) only on the basis of a small recorded (possibly noisy) passage.

• Transcription of music, defined as writing down the musical notation for the sounds that constitute the investigated piece of music. Onset detection based on incoming energy in frequency bands and multi-pitch estimation based on spectral analysis may be used as the main elements of an automatic music transcription system. The errors in such a system may contain additional inserted notes, omissions, or erroneous transcriptions (Klapuri et al., 2001). Pitch tracking (i.e., estimation of pitch of note events in a melody or a piece of music) is often performed in many music information retrieval systems. For polyphonic music,
polyphonic pitch-tracking and timbre separation in digital audio is performed with such applications as score-following and denoising of old analog recordings, which is also a topic of interest within music information retrieval. Wavelet analysis can be applied for this purpose, since it decomposes the signal in time-frequency space, and then musical notes can be extracted from the result of this decomposition (Popovic et al., 1995). Simultaneous polyphonic pitch and tempo tracking, aiming at automatically inferring a musical notation that lists the pitch and the time limits of each note, is a basis of the automatic music transcription. A musical performance can be modeled for these purposes using dynamic Bayesian networks (i.e., directed graphical models of stochastic processes) (Cemgil et al., 2003). It is assumed that the observations may be generated by a hidden process that cannot be directly experimentally observed, and dynamic Bayesian networks represent the hidden and observed states in terms of state variables, which can have complex interdependencies. Dynamic Bayesian networks generalize hidden Markov models, which have one hidden node and one observed node per observation time.

• Automatic characterizing the rhythm and tempo of music and audio, revealing tempo and the relative strength of particular beats, is a branch of research on automatic music transcription. Since highly structured or repetitive music has strong beat spectrum peaks at the repetition times, it allows tempo estimation and distinguishing between different kinds of rhythms at the same tempo. The tempo can be estimated using the beat spectral peak criterion (the lag of the highest peak exceeding an assumed time threshold) accurately to within 1% in the analysis window (Foote & Uchihashi, 2001).

• Automatic classification of musical instrument sounds, aiming at accurate identification of musical instruments playing in a given recording, based on various sound analysis and data mining techniques (Herrera et al., 2000; Wieczorkowska, 2001). This research is focused mainly on monophonic sounds, and sound mixes are usually addressed in the research on separation of sound sources. In most cases, sounds of instruments of definite pitch have been investigated, but recently, research on percussion also has been undertaken. Various analysis methods are used to parameterize sounds for instrument classification purposes, including time-domain description, Fourier, and wavelet analysis. Classifiers range from statistic and probabilistic methods, through learning by example, to artificial intelligence methods. Effectiveness of this research ranges from about 70% accuracy for instrument identification to more than 90% for instrument family (i.e., strings, winds, etc.), approaching 100% for discriminating impulsive and sustained sounds, thus even exceeding human performance. Such instrument sound classification can be included in automatic music transcription systems.

• Sonification, in which utilities for intuitive auditory display (i.e., in audible form) are provided through a graphical user interface (Ben-Tal et al., 2002).

• Generating human-like expressive musical performances with appropriately adjusted dynamics (i.e., loudness), rubato (variation in time limits of notes), vibrato (changes of pitch) (Mantaras & Arcos, 2002), and identification of musical pieces representing different types of emotions (tenderness, sadness, joy, calmness, etc.), which music evokes. Emotions in music are gaining the interest of researchers recently, also including recognition of emotions in the recordings.
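As a small illustration of the feature-based approach referred to under computational auditory scene recognition above, the sketch below computes two elementary descriptors (spectral centroid and RMS energy) from a digital audio signal and assigns a class with a nearest-neighbor rule. The frame size, the choice of descriptors and the function names are assumptions made for the example; published systems use much richer feature sets and larger reference collections.

import numpy as np

def recording_features(signal, sr, frame=1024, hop=512):
    # Average spectral centroid and RMS energy over short windowed frames.
    feats = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame] * np.hanning(frame)
        mag = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
        centroid = float((freqs * mag).sum() / (mag.sum() + 1e-12))
        rms = float(np.sqrt(np.mean(x ** 2)))
        feats.append((centroid, rms))
    if not feats:
        raise ValueError("signal shorter than one analysis frame")
    return np.mean(feats, axis=0)   # one descriptor vector per recording

def nearest_neighbour_label(query, references):
    # references: list of (feature_vector, class_label) pairs.
    _, label = min(((np.linalg.norm(query - f), lab) for f, lab in references),
                   key=lambda pair: pair[0])
    return label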
The topics mentioned previously interrelate and sometimes partially overlap. For instance, both auditory scene analysis and recognition may take into account a very broad range of recordings containing numerous acoustic elements to identify and analyze. Query by humming requires automatic transcription of music, since the input audio samples first must be transformed into the form based on musical notation, describing basic melodic features of the query. Audio retrieval by example and automatic classification of musical instrument sounds are both branches of retrieval by content. Transcription of music requires not only pitch tracking, but also automatic characterizing of the rhythm and tempo, so these topics overlap. Pitch tracking is needed in many research topics, including music transcription, query by humming, and even automatic classification of musical instrument sounds, since pitch is one of the features characterizing instrumental sound. Sonification and generating human-like expressive musical performances both are related to sound synthesis, which is needed to create auditory display or emotional performance. All these topics are focused on a broad domain of music and its various aspects.

Results of this research are not always easily measurable, especially in the case of synthesis-based topics, since they usually are validated via subjective tests. Other topics, like transcription of music, may produce errors of various importance (i.e., wrong pitch, length, omission, etc.), and comparison of the obtained transcript with the original score can be measured in many ways, depending on the considered criteria. The easiest estimation and comparison of results can be performed in
case of query by example; very high recognition rate has been reached already (95% of correct piece identification via audio fingerprinting), reaching commercial level.

The research on music information retrieval is gaining an increasing interest from the scientific community, and investigation of further issues in this domain can be expected.

FUTURE TRENDS

Multimedia databases and library collections, expanding tremendously nowadays, need efficient tools for content-based search. Therefore, we can expect an intensification of research effort on music information retrieval, which may aid searching music data. Especially, tools for query-by-example and query-by-humming are needed, as well as tools for automatic music transcription, so these areas should be investigated broadly in the near future.

CONCLUSION

Music information retrieval is a broad range of research, focusing on various aspects of possible applications. The main domains include audio retrieval by content, automatic music transcription, denoising of old recordings, generating human-like performances, and so forth. The results of this research help users find the audio data they need, even if the users are not experienced musicians. Constantly growing audio resources evoke a demand for efficient tools to deal with this enormous amount of data; therefore, music information retrieval becomes a dynamically developing field of research.

REFERENCES

Adams, N.H., Bartsch, M.A., & Wakefield, G.H. (2003). Coding of sung queries for music information retrieval. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA03, New Paltz, New York.

Ben-Tal, O., Berger, J., Cook, B., Daniels, M., Scavone, G., & Cook, P. (2002). SONART: The sonification application research toolbox. Proceedings of the 2002 International Conference on Auditory Display, Kyoto, Japan.

Birmingham, W.P., Dannenberg, R.D., Wakefield, G.H., Bartsch, M.A., Bykowski, D., Mazzoni, D., Meek, C., Mellody, M., & Rand, B. (2001). MUSART: Music retrieval via aural queries. Proceedings of ISMIR 2001 2nd Annual International Symposium on Music Information Retrieval, Bloomington, Indiana.

Bregman, A.S. (1990). Auditory scene analysis, the perceptual organization of sound. Cambridge, MS: MIT Press.

Cemgil, A.T., Kappen, B., & Barber, D. (2003). Generative model based polyphonic music transcription. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA03, New Paltz, New York.

de Mantaras, R.L., & Arcos, J.L. (2002, Fall). AI and music: From composition to expressive performance. AI Magazine, 43-58.

Downie, J.S. (2001). Wither music information retrieval: Ten suggestions to strengthen the MIR research community. Proceedings of the Second Annual International Symposium on Music Information Retrieval: ISMIR 2001, Bloomington, Indiana.

Fingerhut, M. (1997). Le multimédia dans la bibliothèque. Culture et recherche, 61. Retrieved 2004 from http://catalogue.ircam.fr/articles/textes/Fingerhut97a/.

Foote, J. (1999). An overview of audio information retrieval. Multimedia Systems, 7(1), 2-11.

Foote, J. (2000). ARTHUR: Retrieving orchestral music by long-term structure. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.

Foote, J., & Uchihashi, S. (2001). The beat spectrum: A new approach to rhythm analysis. Proceedings of the International Conference on Multimedia and Expo ICME 2001, Tokyo, Japan.

Herrera, P., Amatriain, X., Batlle, E., & Serra, X. (2000). Towards instrument segmentation for music content description: A critical review of instrument classification techniques. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.

International Organization for Standardization ISO/IEC JTC1/SC29/WG11. (2003). MPEG-7 Overview. Retrieved 2004 from http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm

Klapuri, A., Virtanen, T., Eronen, A., & Seppänen, J. (2001). Automatic transcription of musical recordings. Proceedings of the Consistent & Reliable Acoustic Cues for Sound Analysis CRAC Workshop, Aalborg, Denmark.
Meier, U., Stiefelhagen, R., Yang, J., & Waibel, A. (2000). Towards unrestricted lip reading. International Journal of Pattern Recognition and Artificial Intelligence, 14(5), 571-586.

Peltonen, V., Tuomi, J., Klapuri, A., Huopaniemi, J., & Sorsa, T. (2002). Computational auditory scene recognition. Proceedings of the International Conference on Acoustics Speech and Signal Processing ICASSP, Orlando, Florida.

Popovic, I., Coifman, R., & Berger, J. (1995). Aspects of pitch-tracking and timbre separation: Feature detection in digital audio using adapted local trigonometric bases and wavelet packets. Center for Studies in Music Technology. Retrieved 2004 from http://www-ccrma.stanford.edu/~brg/research/pc/pitchtrack.html

Rosenthal, D., & Okuno, H.G. (Eds.). (1998). Computational auditory scene analysis. Proceedings of the IJCAI-95 Workshop, Mahwah, New Jersey.

Spevak, C., & Polfreman, R. (2001). Sound spotting - a frame-based approach. Proceedings of the Second Annual International Symposium on Music Information Retrieval: ISMIR 2001, Bloomington, Indiana.

Vincent, E., Rodet, X., Röbel, A., Févotte, C., & Carpentier, L. (2003). A tentative typology of audio source separation tasks. Proceedings of the 4th Symposium on Independent Component Analysis and Blind Source Separation, Nara, Japan.

Virtanen, T. (2003). Algorithm for the separation of harmonic sounds with time-frequency smoothness constraint. Proceedings of the 6th International Conference on Digital Audio Effects DAFX-03, London, UK.

Viste, H., & Evangelista, G. (2003). Separation of harmonic instruments with overlapping partials in multi-channel mixtures. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA03, New Paltz, New York.

Wieczorkowska, A. (2001). Musical sound classification based on wavelet analysis. Fundamenta Informaticae Journal, 47(1/2), 175-188.

Wieczorkowska, A., & Ras, Z. (2001). Audio content description in sound databases. In Proceedings of the First Asia-Pacific Conference on Web Intelligence, WI 2001, Maebashi City, Japan.

Wieczorkowska, A., & Ras, Z.W. (Eds.). (2003). Music information retrieval. Journal of Intelligent Information Systems, 21, 1.

KEY TERMS

Digital Audio: Digital representation of the sound waveform, recorded as a sequence of discrete samples, representing the intensity of the sound pressure wave at a given time instant. Sampling frequency describes the number of samples recorded in each second, and bit resolution describes the number of bits used to represent the quantized (i.e., integer) value of each sample.

Fourier Analysis: Mathematical procedure for spectral analysis, based on the Fourier transform that decomposes a signal into sine waves, representing frequencies present in the spectrum.

Information Retrieval: The actions, methods, and procedures for recovering stored data to provide information on a given subject.

Metadata: Data about data (i.e., information about the data).

MIDI: Musical Instrument Digital Interface. MIDI is a common set of hardware connectors and digital codes used to interface electronic musical instruments and other electronic devices. MIDI controls actions such as note events, pitch bends, and the like, while the sound is generated by the instrument itself.

Music Information Retrieval: Multi-disciplinary research on retrieving information from music.

Pitch Tracking: Estimation of pitch of note events in a melody or a piece of music.

Sound: A physical disturbance in the medium through which it is propagated. Fluctuation may change routinely, and such a periodic sound is perceived as having pitch. The audible frequency range is from about 20 Hz (hertz, or cycles per second) to about 20 kHz. A harmonic sound wave consists of frequencies being integer multiples of the first component (fundamental frequency) corresponding to the pitch. The distribution of frequency components is called the spectrum. The spectrum and its changes in time can be analyzed using mathematical transforms, such as the Fourier or wavelet transform.

Wavelet Analysis: Mathematical procedure for time-frequency analysis, based on the wavelet transform that decomposes a signal into shifted and scaled versions of the original function called a wavelet.
Negative Association Rules in Data Mining


Olena Daly
Monash University, Australia

David Taniar
Monash University, Australia

INTRODUCTION

Data mining is a process of discovering new, unexpected, valuable patterns from existing databases (Chen, Han & Yu, 1996; Fayyad et al., 1996; Frawley, Piatetsky-Shapiro & Matheus, 1991; Savasere, Omiecinski & Navathe, 1995). Though data mining is the evolution of a field with a long history, the term itself was introduced only relatively recently, in the 1990s. Data mining is best described as the union of historical and recent developments in statistics, artificial intelligence, and machine learning. These techniques then are used together to study data and find previously hidden trends or patterns within.

Data mining is finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends that they could not otherwise find. Different applications may require different data mining techniques. The kinds of knowledge that could be discovered from a database are categorized into association rules mining, sequential patterns mining, classification, and clustering (Chen, Han & Yu, 1996).

In this article, we concentrate on association rules mining and, particularly, on negative association rules.

BACKGROUND

Association rules discover database entries associated with each other in some way (Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994; Agrawal et al., 1996; Mannila, Toivonen & Verkamo, 1994; Piatetsky-Shapiro, 1991; Srikant & Agrawal, 1995). The example could be supermarket items purchased in the same transaction, weather conditions occurring on the same day, stock market movements within the same trade day, or words that tend to appear in the same sentence. In association rules mining, we search for sequences of associated entities, where each subsequence forms association rules as well. An association rule is an implication of the form A→B, where A and B are database itemsets. A and B belong to the same database transaction. The discovered sequences and, later on, association rules allow us to study customers' purchase patterns, stock market movements, and so forth.

There are two measures to evaluate association rules: support and confidence (Agrawal, Imielinski & Swami, 1993). The rule A→B has support s, if s% of all transactions contains both A and B. The rule A→B has confidence c, if c% of transactions that contains A also contains B. Association rules have to satisfy the user-specified minimum support (minsup) and minimum confidence (minconf). Such rules with high support and confidence are referred to as strong rules (Agrawal, Imielinski & Swami, 1993).

The generation of the strong association rules is decomposed into the following two steps (Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994): (i) discover the frequent itemsets; and (ii) use the frequent itemsets to generate the association rules.

An itemset is a set of database items. Itemsets that have support at least equal to minsup are called frequent itemsets. A 1-itemset is an itemset with one item (A), a 2-itemset is an itemset with two items (AB), and a k-itemset is an itemset with k items. The output of the first step is all itemsets in the database that have their support at least equal to the minimum support.

In the second step, all possible association rules will be generated from each frequent itemset, and the confidence of the possible rules will be calculated. If the confidence is at least equal to the minimum confidence, the discovered rule will be listed in the output of the algorithm.

Consider an example database in Figure 1. Let the minimum support be 40% and the minimum confidence 70%; the database contains 5 records and 4 different items A, B, C, D (see Figure 1). In a database with five records, support 40% means two records. For an item (itemset) to be frequent in the sample database, it has to occur in two or more records.

Figure 1. A sample database

1  A, D
2  B, C, D
3  A, B, C
4  A, B, C
5  A, B
Discover the Frequent Itemsets

• 1-itemsets: The support of each database item is calculated.

Support(A) = 4 (records)
Support(B) = 4 (records)
Support(C) = 3 (records)
Support(D) = 2 (records)

All items occur in two or more records, so all of them are frequent. Frequent 1-itemsets: A, B, C, and D.

• 2-itemsets: From the frequent 1-itemsets, candidate 2-itemsets are generated: AB, AC, AD, BC, BD, CD. To calculate the support of the candidate 2-itemsets, the database is scanned again.

Support(AB) = 3 (records)
Support(AC) = 2 (records)
Support(AD) = 1 (record)
Support(BC) = 3 (records)
Support(BD) = 1 (record)
Support(CD) = 1 (record)

Only itemsets occurring in two or more records are frequent. Frequent 2-itemsets: AB, AC, and BC.

• 3-itemsets: From the frequent 2-itemsets, candidate 3-itemsets are generated: ABC. In step 3, for candidate 3-itemsets, it is allowed to join frequent 2-itemsets that only differ in the last item. AB and AC from step 2 differ only in the last item, so they are joined and the candidate 3-itemset ABC is obtained. AB and BC differ in the first item, so a candidate itemset cannot be produced here; AC and BC do not differ in the last item, so no candidate itemset is produced. So there is only one candidate 3-itemset, ABC. To calculate the support of the candidate 3-itemsets, the database is scanned again.

Support(ABC) = 2 (records)
Frequent 3-itemsets: ABC.

• 4-itemsets: In step 3, there is only one frequent itemset, so it is impossible to produce any candidate 4-itemsets. The frequent itemsets generation stops here. If there was more than one frequent 3-itemset, then step 4 (and onwards) would be similar to step 3.

The list of frequent itemsets: A, B, C, D, AB, AC, BC, and ABC.

Use the Frequent Itemsets to Generate the Association Rules

To produce an association rule, at least a frequent 2-itemset is required. So the generation process starts with 2-itemsets and goes on with longer itemsets. Every possible rule is produced from each frequent itemset:

AB:  A→B, B→A
AC:  A→C, C→A
BC:  B→C, C→B
ABC: A→BC, B→AC, C→AB, BC→A, AC→B, AB→C

To distinguish the strong association rules among all possible rules, the confidence of each possible rule will be calculated. All support values have been obtained in the first step of the algorithm.

Confidence(A→B) = Support(AB) / Support(A)
Confidence(AB→C) = Support(ABC) / Support(AB)
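The whole worked example can be reproduced with a short Python sketch. The code below follows the two steps described above on the Figure 1 database: a level-wise (Apriori-style) pass computes the frequent itemsets, and every rule whose confidence reaches minconf is printed. Variable names and the printing format are choices made for the illustration only.

from itertools import combinations

transactions = [{'A', 'D'}, {'B', 'C', 'D'}, {'A', 'B', 'C'}, {'A', 'B', 'C'}, {'A', 'B'}]
minsup, minconf = 0.4, 0.7

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: discover the frequent itemsets level by level.
items = sorted({i for t in transactions for i in t})
frequent, level = [], [frozenset([i]) for i in items]
while level:
    level = [s for s in level if support(s) >= minsup]
    frequent.extend(level)
    # candidate (k+1)-itemsets built from pairs of frequent k-itemsets
    level = list({a | b for a in level for b in level if len(a | b) == len(a) + 1})

# Step 2: generate the strong rules from each frequent itemset.
for s in frequent:
    for r in range(1, len(s)):
        for lhs in map(frozenset, combinations(sorted(s), r)):
            conf = support(s) / support(lhs)
            if conf >= minconf:
                print(sorted(lhs), '->', sorted(s - lhs),
                      'support', support(s), 'confidence', round(conf, 2))

Run on the sample database, the sketch computes the frequent itemsets listed above and prints the strong rules, for example C -> B with confidence 1.0.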
Association rules have been one of the most developed areas in data mining. Most of the research has been done on positive implications, which is when occurrence of an item implies the occurrence of another item. Negative rules also consider negative implications, when occurrence of an item implies absence of another item.

MAIN THRUST

Negative itemsets are itemsets that contain both items and their negations (e.g., AB~C~DE~F). ~C means negation of the item C (absence of the item C in the database record).

Negative association rules are rules of a kind A→B, where A and B are frequent negative itemsets. Negative association rules consider both presence and absence of items in the database record and mine for negative implications between database items.

Examples of negative association rules could be Meat→~Fish, which implies that when customers purchase meat at the supermarket, they do not buy fish at the same time, or ~Sunny→Windy, which means no sunshine
implies wind, or ~OilStock→~PetrolStock, which says that if the price of the oil shares does not go up, petrol shares wouldn't go up either.

In negative association rules mining, the number of possible generated negative rules is vast and numerous in comparison to the positive association rules mining. It is a critical issue to distinguish only the most interesting ones among the candidates.

A simple and straightforward method to search negative association would be to add the negations of absent items to the database transaction and treat them as additional items. Then, the traditional positive association mining algorithms could be run on the extended database. This approach works in domains with a very limited number of database attributes. A good example is weather prediction. The database only contains 10-15 different factors like fog, rain, sunshine, and so forth. A database record would look like sunshine, ~fog, ~rain, windy. Thus, the number of database items will be doubled, and the number of itemsets will increase significantly. Besides, the support of negations of items can be calculated, based on the support of positive items. Then, specially created algorithms are required.

The approach works in tasks like weather prediction or limited stock market analyses but is hardly applicable to domains with thousands of different database items (sales/healthcare/marketing). There could be numerous and senseless negative rules generated then. That is why special approaches are required to distinguish interesting negative association rules.

There are two categories in mining negative association rules. In category I, a special measure for negative association rules is created. In category II, the data/first-generated rules are analyzed to further produce only the most interesting rules with no computational costs.

First, we explain category I. A measure to distinguish the most valuable negative rules will be employed.

The interest measure may be applied to negative rules (Wu, Zhang & Zhang, 2002). The interest measure was first offered for positive rules. The measure is defined by a formula: Interest(A→B) = support(AB) - support(A)*support(B). The negative rules are generated from infrequent itemsets. The mining process for negative rules starts within 1-frequent itemsets.
starts within 1-frequent itemsets. Figure 2. Hierarchy of items
A more complicated way to search negative associa-
tion rules is to employ statistical theories (Brin, Motwani Items
& Silverstein, 1997). Correlations among database items
are calculated. If the correlation is positive, a positive Stationary Food & Drinks
association rule will be obtained. If the correlation is Writing
negative, a negative association rule will be obtained. The Instruments Paper Ink Drinks Food
stronger the correlation, the stronger the generated rule.
In this approach, Chi-squared statistics is employed to Pen Pencil Juice Coke Fish Meat Pastry Dairy
calculate the correlations among database items. Each
Apple Juice Orange Juice Trout Beef Bread Milk Butter
different database transaction (market basket) denotes a

Though the hierarchy approach requires additional information about the database, the way items are organized into a hierarchy and an expert's opinion may be needed.

After the negative association rules have been generated, some reduction techniques may be applied to refine the rules. For instance, the reduction techniques could verify the true support of items (not ancestors) or generalize the sets of negative rules.

FUTURE TRENDS

There has been some research in the area of negative association mining. The negative association obviously has provoked more complicated issues than the positive association mining. In order to overcome the difficulties of negative association mining, the research community endeavors to invent new methods and algorithms to achieve the desirable efficiency. Following are the limitations and trends in the negative association mining.

Issues/Open Research Areas

• A vast number of candidate rules for negative association rules in comparison with the positive association rules.

In comparison with the positive association mining, in the negative association rules a greater number of candidate itemsets is obtained, which would make it technically impossible to evaluate all and distinguish the rules with high support/confidence/interest. Besides, the generated rules may not provide efficient information. Special approaches, measures, and algorithms should be developed to solve the issue.

• Hardware limitations.

Hardware limitations are crucial for negative association rules mining for the following reasons: a huge number of calculations and assessments required from the CPU; storage for vast frequent itemsets trees in the sufficient main memory; and rapid data interchange with databases required.

• Vast data sets.

With the development of the hardware components, vast databases could be handled in sufficient time for the companies.

• A need for advanced research in negative association mining.

Some research has been done in the negative association rules mining, but there are many unexplored subareas, and advanced research is essential for the area's development.

• A need for special interest measures and algorithms for negative association vs. positive association.

New interest measures should be invented, new approaches for negative rules discovered, the ones that would take advantage of the special properties of the negative rules.

• A need for parallel algorithms for negative association rules mining.

Parallel algorithms have taken data mining to the new extent of mining abilities. As the negative association is a relatively new research area, the parallel data mining algorithms have not been developed yet, and they are absolutely necessary.

• A need for new sophisticated reduction techniques.

After the set of negative rules has been generated, one may apply some reduction and refining techniques to the set to make sure the rules are not redundant.

The issues require additional research in the area to be conducted and more scientists to contribute their knowledge, time, and inspiration.

Current Trends

• Use of various approaches in negative association mining.

Researchers are currently trying to invent distinctive approaches to generate the negative rules and to make sure they provide valuable information. A straightforward approach is not acceptable for the negative rules; support/confidence measures are not enough to distinguish the negative rules. Some novel/adapted measures, particularly interest measures, will need to be utilized. Data structure analysis also forms an important part in the process of negative rule generation.

• Specialized mining algorithms development.

The algorithms for positive association rules mining are not suitable for negative rules, so new and specialized algorithms often are developed by the researchers currently working in negative rules mining.

TEAM LinG
Negative Association Rules in Data Mining

Rules reduction development. Chen, M., Han, J., & Yu, P. (1996). Data mining: An
overview from a database perspective. Institute of Elec- N
Rules reduction techniques have been developed to trical and Electronics Engineers, Transactions on Knowl-
refine the final set of the negative rules. Reduction tech- edge and Data Engineering, 8(6), 866-883.
niques verify the rules quality or generalize the rules.
Daly, O., & Taniar, D. (2003). Mining multiple-level nega-
The researchers are attempting to overcome the vast tive association rules. Proceedings of the International
search space in negative association mining; they look at the Conference on Intelligent Technologies (InTech03).
issue from different points of view to discover what makes Daly, O., & Taniar, D. (2004). Exception rules mining based
the negative rules more interesting and what measures could on negative association rules. Computational Science
distinguish the highly valuable rules from the rest. and Its Applications, 3046, 543-552.
Fayyad, U. et al. (1996). Advances in knowledge discov-
CONCLUSION ery and data mining. American Association for Artificial
Intelligence Press.
This article is a review of the research that has been done Fortes, I., Balczar, J., & Morales, R. (2001). Bounding
in the negative association rules. Negative association negative information in frequent sets algorithms. Pro-
rules have brought interesting challenges and research ceedings of the 4th International Conference of Discovery
opportunities. This is an open area for future research. Science.
This article describes the main issues and approaches
in negative association rules mining derived from the Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1991).
literature. One of the main issues is a vast number of Knowledge discovery in databases: An overview. Ameri-
candidate rules for negative association rules in compari- can Association for Artificial Intelligence Press.
son with the positive association rules. Mannila, H., Toivonen, H., & Verkamo, A. (1994). Efficient
There are two categories in mining negative associa- algorithms for discovering association rules. Proceed-
tion rules. In category I, a special measure for negative ings of the American Association for Artificial Intelli-
association rules is created. In category II, the data/first- gence Workshop on Knowledge Discovery in Databases.
generated rules are analyzed to further produce only most
interesting rules with no computational costs. Piatetsky-Shapiro, G. (1991). Discovery, analysis, and
presentation of strong rules. American Association for
Artificial Intelligence Press.
REFERENCES Savasere, A., Omiecinski, E., & Navathe, S. (1995). An
efficient algorithm for mining association rules in large
Agrawal, R. et al. (1996). Fast discovery of association databases. Proceedings of the 21st International Confer-
rules. In U. Fayyad et al. (Eds.), Advances in knowledge ence on Very Large Data Bases.
discovery and data mining. American Association for
Artificial Intelligence Press. Srikant, R., & Agrawal, R. (1995). Mining generalized
association rules. Proceedings of the 21st Very Large
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining Data Bases Conference.
association rules between sets of items in large data-
bases. Proceedings of the Association for Computing Wu, X., Zhang, C., & Zhang, S. (2002). Mining both
Machinery, Special Interest Group on Management of positive and negative association rules. Proceedings of
Data, International Conference Management of Data. the 19th International Conference on Machine Learning.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for


mining association rules in large databases. Proceedings
of the 20th International Conference on Very Large Data KEY TERMS
Bases.
Brin, S., Motwani, R., & Silverstein, C. (1997). Beyond Association Rules: An implication of the form AB,
market basket: Generalizing association rules to correla- where A and B are database itemsets. Association rules
tions. Proceedings of Association for Computing Ma- have to satisfy the pre-set minimum support (minsup) and
chinery, Special Interest Group on Management of Data, minimum confidence (minconf) constraints.
International Conference Management of Data.

Confidence: The rule A→B has confidence c, if c% of transactions that contain A also contain B.

Database Item: An item/entity occurring in the database.

Frequent Itemsets: Itemsets that have support at least equal to minsup.

Itemset: A set of database items.

Negative Association Rules: Rules of a kind A→B, where A and B are frequent negative itemsets.

Negative Itemsets: Itemsets that contain both items and their negations.

Support: The rule A→B has support s, if s% of all transactions contains both A and B.
Neural Networks for Prediction and Classification
Kate A. Smith
Monash University, Australia

INTRODUCTION

Neural networks are simple computational tools for examining data and developing models that help to identify interesting patterns or structures. The data used to develop these models is known as training data. Once a neural network has been exposed to the training data, and has learnt the patterns that exist in that data, it can be applied to new data thereby achieving a variety of outcomes. Neural networks can be used to:

• learn to predict future events based on the patterns that have been observed in the historical training data;
• learn to classify unseen data into pre-defined groups based on characteristics observed in the training data;
• learn to cluster the training data into natural groups based on the similarity of characteristics in the training data.

BACKGROUND

There are many different neural network models that have been developed over the last fifty years or so to achieve these tasks of prediction, classification, and clustering. Broadly speaking, these models can be grouped according to supervised learning algorithms (for prediction and classification), and unsupervised learning algorithms (for clustering). This paper focuses on the former paradigm. We refer the interested reader to Haykin (1994) for a detailed account of many neural network models.

According to a recent study (Wong, Jiang, & Lam, 2000), over fifty percent of reported neural network business application studies utilise multilayered feedforward neural networks (MFNNs) with the backpropagation learning rule (Werbos, 1974; Rumelhart & McClelland, 1986). This type of neural network is popular because of its broad applicability to many problem domains of relevance to business and industry: principally prediction, classification, and modeling. MFNNs are appropriate for solving problems that involve learning the relationships between a set of inputs and known outputs. They are a supervised learning technique in the sense that they require a set of training data in order to learn the relationships. With supervised learning models, the training data contains matched information about input variables and observable outcomes. Models can be developed that learn the relationship between these data characteristics (inputs) and outcomes (outputs). Figure 1 shows the architecture of a 3-layer MFNN. The weights are updated using the backpropagation learning algorithm, as summarized in Table 1, where f(.) is a non-linear activation function such as 1/(1 + exp(-λ(.))), c is the learning rate, and dk is the desired output of the kth neuron.

The successful application of MFNNs to a wide range of areas of business, industry and society has been reported widely. We refer the interested reader to the excellent survey articles covering application domains of medicine (Weinstein, Myers, Casciari, Buolamwini, & Raghavan, 1994), communications (Ibnkahla, 2000), business (Smith & Gupta, 2002), finance (Refenes, 1995), and a variety of industrial applications (Fogelman, Soulié, & Gallinari, 1998).

MAIN THRUST

A number of significant issues will be discussed, and some guidelines for successful training of neural networks will be presented in this section.

Figure 1. Architecture of MFNN (note: not all weights are shown)
[The figure shows N input units x1, x2, x3, ..., plus a bias input fixed at -1, connected by hidden-layer weights w_ji to J hidden neurons y1, ..., yJ, which are in turn connected (together with a bias of -1) by output-layer weights v_kj to K output neurons z1, ..., zK.]
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Neural Networks for Prediction and Classification

Table 1. Backpropagation learning algorithm for MFNNs

STEP 1: Randomly select an input pattern x to present to the MFNN through the
input layer

STEP 2: Calculate the net inputs and outputs of the hidden layer neurons
N +1
net hj = w ji xi y j = f (net hj )
i =1

STEP 3: Calculate the net inputs and outputs of the K output layer neurons
J +1
net ko = v kj y j z k = f ( net ko )
j =1

STEP 4: Update the weights in the output layer (for all k, j pairs)
vkj vkj + c (d k z k ) z k (1 z k ) y j

STEP 5: Update the weights in the hidden layer (for all i, j pairs)
K
w ji w ji + c 2 y j (1 y j ) x i ( ( d k z k ) z k (1 z k )v kj )
k =1

STEP 6: Update the error term


K
E E + (d k z k ) 2
k =1

and repeat from STEP 1 until all input patterns have been presented (one
epoch).

STEP 7: If E is below some predefined tolerance level (say 0.000001), then


STOP.
Otherwise, reset E = 0, and repeat from STEP 1 for another epoch.

Critical Issues for Neural Networks 1. Learning: The developed model must represent
an adequate fit of the training data. The data itself
Neural networks have not been without criticism, and must contain the relationships that the neural net-
there are a number of important limitations that need work is trying to learn, and the neural network
careful attention. At this stage it is still a trial and error model must be able to derive the appropriate
process to obtain the optimal model to represent the weights to represent these relationships.
relationships in a dataset. This requires appropriate selec- 2. Generalization: The developed model must also
tion of a number of parameters including learning rate, perform well when tested on new data to ensure that
momentum rate (if an additional momentum term is added it has not simply memorized the training data char-
to steps 4 and 5 of the algorithm), choice of activation acteristics. It is very easy for a neural network model
function, as well as the optimal neural architecture. If the to overfit the training data, especially for small
architecture is too large, with too many hidden neurons, data sets. The architecture needs to be kept small,
the MFNN will find it very easy to memorize the training and key validation procedures need to be adopted
data, but will not have generalized its knowledge to other to ensure that the learning can be generalized.
datasets (out-of-sample or test sets). This problem is
known as overtraining. Another limitation is the gradient Within each of these stages there are a number of
descent nature of the backpropagation learning algo- important guidelines that can be adopted to ensure effec-
rithm, which causes the network weights to become tive learning and successful generalization for prediction
trapped in local minima of the error function being and classification problems (Remus & OConnor, 2001).
minimized. Finally, neural networks are considered inap- To ensure successful learning:
propriate for certain problem domains where insight and
explanation of the model is required. Some work has been Prepare the Data Prior to Learning the Neural
done on extracting rules from trained neural networks Network Model: A number of pre-processing
(Andrews, Diederich, & Tickle, 1995), but other data steps may be necessary including cleansing the
mining techniques like rule induction may be better suited data; removing outliers; determining the correct
to such situations. level of summarisation; converting non-numeric
data (Pyle, 1999).
Guidelines for Successful Training Normalise, Scale, Deseasonalise and Detrend
the Data Prior to Learning: Time series data
Successful prediction and classification with MFNNs often needs to be deseasonalised and detrended to
requires careful attention to two main stages: enable the neural network to learn the true patterns
in the data (Zhang, Patuwo, & Hu, 1998).
866

TEAM LinG
Neural Networks for Prediction and Classification

Ensure that the MFNN Architecture is Appro- the average performance of all randomly extracted
priate to Learn the Data: If there are not enough test sets. The most popular method is known as N
hidden neurons, the MFNN will be unable to repre- ten-fold cross-validation, and involves repeating
sent the relationships in the data. Most commercial the approach for ten distinct subsets of the data.
software packages extend the standard 3-layer Avoid Unnecessarily Large and Complex Ar-
MFNN architecture shown in Figure 1 to consider chitectures: An architecture containing a large
additional layers of neurons, and sometimes in- number of hidden layers and hidden neurons re-
clude feedback loops to provide enhanced learning sults in more weights than a smaller architecture.
capabilities. The number of input dimensions can Since the weights correspond to the degrees of
also be reduced to improve the efficiency of the freedom or number of parameters the model has
architecture if inclusion of some variables does to fit the data, it is very easy for such large
not improve the learning. architectures to overfit the training data. For the
Experiment with the Learning Parameters to sake of future generalization of the model, the
Ensure that the Best Learning is Produced: architecture should therefore be only as large as
There are several parameters in the backpropagation is required to learn the data and achieve an accept-
learning equations that require selection and ex- able performance on all data sets (training, test,
perimentation. These include the learning rate c, and validation where available).
the equation of the function f() and its gradient l,
and the values of the initial weights. When these guidelines are observed, the chances of
Consider Alternative Learning Algorithms to developing a MFNN model that learns the training data
Backpropagation: Backpropagation is a gradient effectively and generalizes its learning on new data are
descent technique that guarantees convergence only greatly improved. Most commercially available neural
to a local minimum of the error function. Other local network software packages include features to facilitate
optimization techniques, such as the Levenberg- adherence to these guidelines.
Marquardt algorithm, have gained popularity due to
their increased speed and reduced memory require-
ments (Kinsella, 1992). Recently, researchers have FUTURE TRENDS
used more sophisticated search strategies such as
genetic algorithms and simulated annealing in an Despite the successful application of neural networks to
effort to find globally optimal weight values (Sexton a wide range of application areas, there is still much
& Dorsey, 2000). research that continues to improve their functionality.
Specifically, research continues in the development of
To ensure successful generalization: hardware models (chips or specialized analog devices)
that enable neural networks to be implemented rapidly in
Extract a Test Set from the Training Data: Com- industrial contexts. Other research attempts to connect
monly 20% of the training data is reserved as a test neural networks back to their roots in neurophysiology,
set. The neural network is only trained on 80% of and seeks to improve the biological plausibility of the
the data, and the degree to which it has learnt or models. On-line learning of neural network models that
memorized the data is gauged by the measured are more effective in situations when the data is dynami-
performance on the test set. When ample addi- cally changing, will also become increasingly important.
tional data is available, a third group of data known A useful discussion of the future trends of neural net-
as the validation set is used to evaluate the gener- works can be found at a virtual workshop discussion:
alization capabilities of the learnt model. For time http://www.ai.univie.ac.at/neuronet/workshop/.
series prediction problems, the validation set is
usually taken as the most recently available data,
and provides the best indication of how the devel- CONCLUSION
oped model will perform on future data. When
there is insufficient data to extract a test set and Over the last decade or so, we have witnessed neural
leave enough training data for learning, cross-vali- networks come of age. The idea of learning to solve
dation sets are used. This involves randomly ex- complex pattern recognition problems using an intelli-
tracting a test set, developing a neural network gent data-driven approach is no longer simply an inter-
models based on the remaining training data, and esting challenge for academic researchers. Neural net-
repeating the process with several random divi- works have proven themselves to be a valuable tool
sions of the data. The reported results are based on across a wide range of application areas. As a critical

867

TEAM LinG
Neural Networks for Prediction and Classification

component of most data mining systems, they are also Smith, K. A., & Gupta, J. N. D. (Eds.). (2002). Neural
changing the way organizations view the relationship networks in business: Techniques and applications.
between their data and their business strategy. Hershey, Pennsylvania: Idea Group Publishing.
The multilayered feedforward neural network (MFNN)
has been presented as the most common neural network Weinstein, J. N., Myers, T., Casciari, J. J., Buolamwini, J.,
employing supervised learning to model the relationships & Raghavan, K. (1994). Neural networks in the biomedical
between inputs and outputs. This dominant neural net- sciences: A survey of 386 publications since the begin-
work model finds application across a broad range of ning of 1991. Proceedings of the World Congress on
prediction and classification problems. A series of critical Neural Networks, 1 (pp. 121-126).
guidelines have been provided to facilitate the successful Werbos, P. J. (1974). Beyond regression: New tools for
application of these neural network models. prediction and analysis in the behavioral sciences. Cam-
bridge, MA: Harvard University, Ph.D. dissertation.

REFERENCES Wong, B. K., Jiang, L., & Lam, J. (2000). A bibliography of


neural network business application research: 1994 - 1998.
Andrews, R., Diederich, J., & Tickle, A. (1995). A survey Computers and Operations Research, 27(11), 1045-1076.
and critique of techniques for extracting rules from trained Zhang, Q., Patuwo, B. E., & Hu, M. Y. (1998). Forecasting
artificial neural networks. Knowledge Based Systems, 8, with artificial neural networks: The state of the art. Inter-
373-389. national Journal of Forecasting, 14, 35-62.
Fogelman, S. F., & Gallinari, P. (Eds.). (1998). Industrial
applications of neural networks. Singapore: World Sci- KEY TERMS
entific Press.
Haykin, S. (1994). Neural networks: A comprehensive Architecture: The configuration of the network of
foundation. Englewood Cliffs, NJ: Macmillan Publishing neurons into layers of neurons is referred to as the neural
Company. architecture. To specify the architecture involves declar-
ing the number of input variables, hidden layers, hidden
Ibnkahla, M. (2000). Applications of neural networks to neurons, and output neurons.
digital communications - a survey. Signal Processing,
80(7), 1185-1215. Backpropagation: The name of the most common
learning algorithm for MFNNs. It involves modifying the
Kinsella, J. A. (1992). Comparison and evaluation of weights of the MFNN in such a way that the error (differ-
variants of the conjugate gradient methods for efficient ence between the MFNN output and the training data
training in feed-forward neural networks with backward desired output) is minimized over time. The error at the
error propagation. Neural Networks, 3(27). hidden neurons is approximated by propagating the out-
Pyle, D. (1999). Data preparation for data mining. San put error backwards, hence the name backpropagation.
Francisco: Morgan Kaufmann Publishers. Epoch: A complete presentation of all training data to
Refenes, A. P. (Ed.). (1995). Neural networks in the the MFNN is called one epoch. Typically a MFNN will
capital markets. Chichester: John Wiley & Sons Ltd. need to be trained for many thousands of epochs before
the error is reduced to an acceptable level.
Remus, W., & OConnor, M. (2001). Neural networks for
time series forecasting. In J. S. Armstrong (Ed.), Prin- Generalization: The goal of neural network training is
ciples of forecasting: A handbook for researchers and to develop a model that generalizes its knowledge to
practitioners. Norwell, MA: Kluwer Academic Publishers. unseen data. Overtraining must be avoided for generali-
zation to occur.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Par-
allel distributed processing: Explorations in the micro- Hidden Neurons: The name given to the layer of
structure of cognition. Cambridge, MA: MIT Press. neurons between the input variables and the output
neuron. If too many hidden neurons are used, the neural
Sexton, R. S., & Dorsey, R. E. (2000). Reliable classification network may be easily overtrained. If too few hidden
using neural networks: A genetic algorithm and neurons are used, the neural network may be unable to
backpropagation comparison. Decision Support Systems, learn the training data.
30, 11-22.

868

TEAM LinG
Neural Networks for Prediction and Classification

Learning Algorithm: The method used to change the Neural Model: Specifying a neural network model
weights so that the error is minimized. Training data is involves declaring the architecture and activation func- N
repeatedly presented to the MFNN through the input tion types.
layer, the output of the MFNN is calculated and compared
to the desired output. Error information is used to deter- Overtraining: When the MFNN performs significantly
mine which weights need to be modified, and by how better on the training data than an out-of-sample test data,
much. There are several parameters involved in the learn- it is considered to have memorized the training data, and
ing algorithm including learning rate, momentum factor, be overtrained. This can be avoided by following the
initial weights, etc. guidelines presented above.

869

TEAM LinG
870

Off-Line Signature Recognition


Indrani Chakravarty
Indian Institute of Technology, India

Nilesh Mishra
Indian Institute of Technology, India

Mayank Vatsa
Indian Institute of Technology, India

Richa Singh
Indian Institute of Technology, India

P. Gupta
Indian Institute of Technology, India

INTRODUCTION BACKGROUND

The most commonly used protection mechanisms today Starting from banks, signature verification is used in many
are based on either what a person possesses (e.g. an ID other financial exchanges, where an organizations main
card) or what the person remembers (like passwords and concern is not only to give quality services to its custom-
PIN numbers). However, there is always a risk of pass- ers, but also to protect their accounts from being illegally
words being cracked by unauthenticated users and ID manipulated by forgers.
cards being stolen, in addition to shortcomings like for- Forgeries can be classified into four typesrandom,
gotten passwords and lost ID cards (Huang & Yan, 1997). simple, skilled and traced (Ammar, Fukumura & Yoshida,
To avoid such inconveniences, one may opt for the new 1988; Drouhard, Sabourin, & Godbout, 1996). Generally
methodology of Biometrics, which though expensive will online signature verification methods display a higher
be almost infallible as it uses some unique physiological accuracy rate (closer to 99%) than off-line methods (90-
and/or behavioral (Huang & Yan, 1997) characteristics 95%) in case of all the forgeries. This is because, in off-line
possessed by an individual for identity verification. Ex- verification methods, the forger has to copy only the
amples include signature, iris, face, and fingerprint recog- shape (Jain & Griess, 2000) of the signature. On the other
nition based systems. hand, in case of online verification methods, since the
The most widespread and legally accepted biometric hardware used captures the dynamic features of the
among the ones mentioned, especially in the monetary signature as well, the forger has to not only copy the
transactions related identity verification areas is carried shape of the signature, but also the temporal characteris-
out through handwritten signatures, which belong to tics (pen tilt, pressure applied, velocity of signing etc.) of
behavioral biometrics (Huang & Yan,1997). This tech- the person whose signature is to be forged. In addition,
nique, referred to as signature verification, can be classi- he has to simultaneously hide his own inherent style of
fied into two broad categories - online and off-line. While writing the signature, thus making it extremely difficult to
online deals with both static (for example: number of black deceive the device in case of online signature verification.
pixels, length and height of the signature) and dynamic Despite greater accuracy, online signature recogni-
features (such as acceleration and velocity of signing, tion is not encountered generally in many parts of the
pen tilt, pressure applied) for verification, the latter ex- world compared to off-line signature recognition, be-
tracts and utilizes only the static features (Ramesh and cause it cannot be used everywhere, especially where
Murty, 1999). Consequently, online is much more efficient signatures have to be written in ink, e.g. on cheques,
in terms of accuracy of detection as well as time than off- where only off-line methods will work. Moreover, it re-
line. But, since online methods are quite expensive to quires some extra and special hardware (e.g. pressure
implement, and also because many other applications still sensitive signature pads in online methods vs. optical
require the use of off-line verification methods, the latter, scanners in off-line methods), which are not only expen-
though less effective, is still used in many institutions. sive but also have a fixed and short life span.

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Off-Line Signature Recognition

MAIN THRUST Noise Removal: Signature images, like any other


image may contain noises like extra dots or pixels O
In general, all the current off-line signature verification (Ismail & Gad, 2000), which originally do not belong
systems can be divided into the following sub-modules: to the signature, but get included in the image
because of possible hardware problems or the pres-
Data Acquisition ence of background noises like dirt. To recognize
Preprocessing and Noise Removal the signature correctly, these noise elements have
Feature Extraction and Parameter Calculations to be removed from the background in order to get
Learning and Verification (or Identification) the accurate feature matrices in the feature extrac-
tion phase. A number of filters have been used as
Data Acquisition preprocessors (Ismail & Gad, 2000) by researchers
to obtain the noise free image. Examples include the
Off-line signatures do not consider the time related as- mean filter, median filter, filter based on merging
pects of the signature such as velocity, acceleration and overlapped run lengths in one rectangle (Ismail &
pressure. Therefore, they are often termed as static Gad, 2000) etc. Among all the filtering techniques
signatures, and are captured from the source (i.e. paper) mentioned above, average and median filtering are
using a camera or a high resolution scanner, in compari- considered to be standard noise reduction and iso-
son to online signatures (in which data is captured using lated peak noise removal techniques(Huang & Yan,
a digitizer or an instrumented pen generating signals) 1997). However, median filter is preferred more be-
(Tappert, Suen, & Wakahara, 1990; Wessels & Omlin, cause of its ability to remove noises without blur-
2000), which do consider the time related or dynamic ring the edges of the signature instance unlike the
aspects besides the static features. mean filter.
Space Standardization and Normalization: In
Space standardization, the distance between the
Preprocessing horizontal components of the same signature is
standardized, by removing blank columns, so that
The preprocessing techniques that are generally per- it does not interfere with the calculation of global
formed in off-line signature verification methods com- and local features of the signature image (Baltzakis
prise of noise removal, smoothening, space standardiza- & Papamarkos, 2001; Qi & Hunt, 1994). In nor-
tion and normalization, thinning or skeletonization, con- malization, the signature image is scaled to a stan-
verting a gray scale image to a binary image, extraction of dard size which is the average size of all training
the high pressure region images, etc. samples, keeping the width to height ratio constant
(Baltzakis & Papamarkos, 2001; Ismail & Gad,
Figure 1. Modular structure of an off-line verification 2000; Ramesh & Murty, 1999).
system Extracting the Binary Image from Grayscale
Image: Using the Otsus method a threshold is
Signature
Cropped
scanned Image
calculated to obtain a binary version of the
grayscale image (Ammar, Fukumura, & Yoshida,
1988; Ismail & Gad, 2000; Qi & Hunt, 1994). The
Preprocessing module
algorithm is as follows:
Data Base Analyzer / Controller module Output
1 R ( x, y ) > threshold ,
S ( x, y ) =
Learning module Feature Extraction Verification module 0 R ( x, y ) < threshold ,
module

Figure 2. Noise removal using median filter Figure 3. Converting grayscale image into binary image

(a) Gray scale (b) Noise free (a) Gray scale (b) Binary

871

TEAM LinG
Off-Line Signature Recognition

where S(x,y) is the binary image and R(x,y) is the are less sensitive to noise as well. Examples include
grayscale image. width and height of individual signature compo-
Smoothening: The noise-removed image may have nents, width to height ratio, total area of black
small connected components or small gaps, which pixels in the binary and high pressure region (HPR)
may need to be filled up. It is done using a binary images, horizontal and vertical projections of sig-
mask obtained by thresholding followed by morpho- nature images, baseline, baseline shift, relative
logical operations (which include both erosion and position of global baseline and centre of gravity
dilation) (Huang & Yan, 1997; Ismail & Gad, 2000; with respect to width of the signature, number of
Ramesh & Murty, 1999). cross and edge points, circularity, central line,
Extracting the High Pressure Region Image: corner curve and corner line features, slant, run
Ammar et al. (Ammar, M., Fukumura, T., & Yoshida lengths of each scan of the components of the
Y., 1986) have used the high pressure region as one signature, kurtosis (horizontal and vertical), skew-
of the prominent features for detecting skilled ness, relative kurtosis, relative skewness, relative
forgeries. It is the area of the image where the horizontal and vertical projection measures, enve-
writer gives special emphasis reflected in terms of lopes, individual stroke segments. (Ammar,
higher ink density (more specifically, higher gray Fukumura, & Yoshida, 1990; Bajaj & Chaudhury,
level intensities than the threshold chosen). The 1997; Baltzakis & Papamarkos, 2001, Fang, et al.,
threshold is obtained as follows: 2003, Huang & Yan, 1997; Ismail & Gad, 2000; Qi &
Hunt, 1994, Ramesh & Murty, 1999; Yacoubi,
Th HPR = I min + 0.75(I max I min ), Bortolozzi, Justino, & Sabourin, 2000; Xiao &
Leedham, 2002)
Local Features: Local features are confined to a
where Imin and Imax are the minimum and maximum
limited portion of the signature, which is obtained
grayscale intensity values.
by dividing the signature image into grids or treat-
Thinning or Skeletonization: This process is carried
ing each individual signature component as a sepa-
out to obtain a single pixel thick image from the
rate entity (Ismail & Gad, 2000). In contrast to
binary signature image. Different researchers have
global features, they are responsive to small distor-
used various algorithms for this purpose (Ammar,
tions like dirt, but are not influenced by other
Fukumura, & Yoshida, 1988; Baltzakis & Papamarkos,
regions of the signature. Hence, though extraction
2001; Huang & Yan, 1997; Ismail & Gad, 2000).
of local features requires more computations, they
are much more precise. Many of the global features
Feature Extraction have their local counterparts as well. Examples of
local features include width and height of indi-
Most of the features can be classified in two categories - vidual signature components, local gradients such
global and local features. as area of black pixels in high pressure region and
binary images, horizontal and vertical projections,
Global Features: Ismail and Gad (2000) have de- number of cross and edge points, slant, relative
scribed global features as characteristics, which position of baseline, pixel distribution, envelopes
identify or describe the signature as a whole. They etc. of individual grids or components. (Ammar,
are less responsive to small distortions and hence Fukumura, & Yoshida, 1990; Bajaj & Chaudhury,
1997; Baltzakis & Papamarkos, 2001; Huang & Yan,
1997; Ismail & Gad, 2000; Qi & Hunt, 1994; Ramesh
Figure 4. Extracting the high pressure region image & Murty, 1999; Yacoubi, Bortolozzi, Justino, &
Sabourin, 2000)

(a) Gray scale (b) High pressure region


Learning and Verification

The next stage after feature extraction involves learning.


Figure 5. Extraction of thinned image Though learning is not mentioned separately for verifi-
cation as a separate sub module, it is the most vital part
of the verification system in order to verify the authen-
ticity of the signature. Usually five to forty features are
passed in as a feature matrix/vector to train the system.
(a) Binary (b) Thinned In the training phase, 3-3000 signature instances are
image image
872

TEAM LinG
Off-Line Signature Recognition

Table 1. Summary of some prominent off-line papers

Mode of
O
Author Database Feature Extraction Results
Verification
Statistical Global baseline, upper and
10 genuine
Ammar, method-- lower extensions, slant
signatures per
Fukumara and Euclidian features, local features (e.g. 90% verification rate
person from 20
Yoshida (1988) distance and local slants) and pressure
people
threshold feature
Signature outline, core
Neural network feature, ink distribution,
Based-- A total of 3528 high pressure region,
99.5 % for random
Huang and Yan Multilayer signatures directional frontiers feature,
forgery and 90 % for
(1997) perception based (including genuine area of feature pixels in
targeted forgeries
neural networks and forged) core, outline, high pressure
trained and used region, directional frontiers,
coarse, and in fine ink
Global geometric features:
Aspect ratio, width without
blanks, slant angle, vertical
Genetic center of gravity (COG) etc.
Algorithms used Moment Based features: 90 % under genuine,
650 signatures, 20
for obtaining horizontal and vertical 98 % under random
Ramesh and genuine and 23
genetically projection images, kurtosis forgery and 70-80 %
Murty (1999) forged signatures
optimized measures( horizontal and under skilled forgery
from 15 people
weights for vertical) case
weighted feature Envelope Based:
vector extracting the lower and
upper envelopes
and
Wavelet Based features
Central line features, corner
220 genuine and line features, central circle 95% recognition rate
Ismail and Gad
Fuzzy concepts 110 forged features, corner curve and 98% verification
(2000)
samples features, critical points rate
features

Yacoubi, 4000 signatures


Average Error(%)
Bortolozzi, from 100 people Caliber, proportion,
Hidden Markov :0.48 on 1st database,
Justino and (40 people for 1st behavior guideline, base
Model and 0.88 on the 2nd
Sabourin and 60 for 2nd behavior, spacing
database
(2000) database

Nonliinear
dynamic time Horizontal and Vertical Best result of 18.1%
warping applied projections for non linear average error rate
Fang, Leung, 1320 genuine
to horizontal and dynamic time warping (average of FAR and
Tang, Tse, signatures from 55
vertical method. Skeletonized FRR) for first
Kwok, and authors and 1320
projections. image with approximation method. The
Wong (2003) forgeries from 12
Elastic bunch of the skeleton by short Average error rate
authors
graph matching lines for elastic bunch was 23.4% in case of
to individual graph matching algorithm the second method
stroke segments

used. These signature instances include either the genu- is to be tested, its feature matrix/vector is calculated, and
ine or both genuine and forged signatures depending on is passed into the verification sub module of the system,
the method. All the methods extract features in a manner which identifies the signature to be either authentic or
such that the signature cannot be constructed back from unauthentic. So, unless the system is trained properly,
them in the reverse order but have sufficient data to chances, though less, are that it may recognize an authen-
capture the features required for verification. These fea- tic signature to be unauthentic (false rejection), and in
tures are stored in various formats depending upon the certain other cases, recognize the unauthentic signature
system. It could be either in the form of feature values, to be authentic (false acceptance). So, one has to be
weights of the neural network, conditional probability extremely cautious at this stage of the system. There are
values of a belief (Bayesian) network or as a covariance various methods for off-line verification systems that are
matrix (Fang et al., 2003). Later, when a sample signature in use today. Some of them include:

873

TEAM LinG
Off-Line Signature Recognition

Statistical methods (Ammar, M., Fukumura, T., & cult to compare the values of these error rates, as there is
Yoshida Y., 1988; Ismail & Gad, 2000) no standard database either for off-line or for online
Neural network based approach (Bajaj & Chaudhury, signature verification methods.
1997; Baltzakis & Papamarkos, 2001; Drouhard, Although online verification methods are gaining
Sabourin, & Godbout, 1996; Huang & Yan, 1997) popularity day by day because of higher accuracy rates,
Genetic algorithms for calculating weights (Ramesh off-line signature verification methods are still consid-
& Murty, 1999) ered to be indispensable, since they are easy to use and
Hidden Markov Model (HMM) based methods have a wide range of applicability. Efforts must thus be
(Yacoubi, Bortolozzi, Justino, & Sabourin, 2000) made to improve its efficiency to as close as that of online
Bayesian networks (Xiao & Leedham, 2002) verification methods.
Nonliinear dynamic time warping (in spatial domain)
and Elastic bunch graph matching (Fang et al., 2003)
REFERENCES
The problem of signature verification is that of divid-
ing a space into two different sets of genuine and forged Ammar, M., Fukumura, T., & Yoshida Y. (1986). A new
signatures. Both online and off-line approaches use fea- effective approach for off-line verification of signature
tures to do this, but, the problem with this approach is that by using pressure features. Proceedings 8th Interna-
even two signatures by the same person may not be the tional Conference on Pattern Recognition, ICPR86 (pp.
same. The feature set must thus have sufficient interper- 566-569), Paris.
sonal variability so that we can classify the input signa-
ture as genuine or forgery. In addition, it must also have Ammar, M., Fukumura, T., & Yoshida, Y. (1988). Off-line
a low intrapersonal variability so that an authentic signa- preprocessing and verification of signatures. Interna-
ture is accepted. Solving this problem using fuzzy sets tional Journal of Pattern Recognition and Artificial
and neural networks has also been tried. Another problem Intelligence 2(4), 589-602.
is that an increase in the dimensionality i.e. using more Ammar, M., Fukumura, T., & Yoshida, Y. (1990). Structural
features does not necessarily minimize(s) the error rate. description and classification of signature images. Pat-
Thus, one has to be cautious while choosing the appro- tern Recognition 23(7), 697-710.
priate/optimal feature set.
Bajaj, R., & Chaudhury, S. (1997). Signature verification
using multiple neural classifiers. Pattern Recognition
FUTURE TRENDS 30(1), l-7.
Baltzakis, H., & Papamarkos, N. (2001). A new signature
Most of the presently used off-line verification methods verification technique based on a two-stage neural net-
claim a success rate of more than 95% for random forgeries work classifier. Engineering Applications of Artificial
and above 90% in case of skilled forgeries. Although, a Intelligence, 14, 95-103.
95% verification rate seems high enough, it can be noticed
that, even if the accuracy rate is as high as 99 %, when we Drouhard, J. P., Sabourin, R., & Godbout, M. (1996). A
scale it to the size of a million, even a 1% error rate turns neural network approach to off-line signature verifica-
out to be a significant number. It is therefore necessary to tion using directional pdf. Pattern Recognition 29(3),
increase this accuracy rate as much as possible. 415-424.
Fang, B., Leung, C. H., Tang, Y. Y., Tse, K. W., Kwok, P.
C. K., & Wong, Y. K. (2003). Off-line signature verification
CONCLUSION by tracking of feature and stroke positions. Pattern Rec-
ognition 36, 91-101.
Performance of signature verification systems is mea-
sured from their false rejection rate (FRR or type I error) Huang, K., & Yan, H. (1997). Off-line signature verification
(Huang & Yan, 1997) and false acceptance rate (FAR or based on geometric feature extraction and neural network
type II error) (Huang & Yan, 1997) curves. Average error classification. Pattern Recognition, 30(1), 9-17.
rate, which is the mean of the FRR and FAR values, is also
used at times. Instead of using the FAR and FRR values, Ismail, M. A., & Gad, S. (2000). Off-line Arabic signature
many researchers quote the 100 - average rate values recognition and verification. Pattern Recognition, 33,
as the performance result. Values for various approaches 1727-1740.
have been mentioned in the table above. It is very diffi-

874

TEAM LinG
Off-Line Signature Recognition

Jain, A. K. & Griess, F. D., (2000). Online Signature n n


= (x.Pv [x ])/ Pv [x ],
m m
Verification. Project Report, Department of Computer X = (y.Ph [y])/ Ph [y], and Y
y =1 y =1 x =1 x =1 O
Science and Engineering, Michigan State University, USA.
Qi, Y., & Hunt, B. R. (1994). Signature verification using
where X, Y are the x and y coordinates of the centre of
global and grid features. Pattern Recognition, 27(12),
gravity of the image, and Ph and Pv are the horizontal and
1621-1629.
vertical projections respectively.
Ramesh, V.E., & Murty, M. N. (1999). Off-line signature
Cross and Edge Points: A cross point is an image pixel
verification using genetically optimized weighted fea-
in the thinned image having at least three eight neighbors,
tures. Pattern Recognition, 32(7), 217-233.
while an edge point has just one eight neighbor in the
Tappert, C. C., Suen, C. Y., & Wakahara, T. (1990).The thinned image.
State of the Art in Onliine Handwriting Recognition. IEEE
Envelopes: Envelopes are calculated as upper and
Transactions on Pattern Analysis and Machine Intelli-
lower envelopes above and below the global baseline.
gence, 12(8).
These are the connected pixels which form the external
Wessels, T., & Omlin, C. W, (2000). A Hybrid approach for points of the signature obtained by sampling the signa-
Signature Verification. International Joint Conference ture at regular intervals.
on Neural Networks.
Global Baseline: It is the median value of pixel distri-
Yacoubi, A. El., Bortolozzi, F., Justino, E. J. R., & Sabourin, bution along the vertical projection.
R. (2000). An off-line signature verification system using
Horizontal and Vertical Projections: Horizontal pro-
HMM and graphometric features. 4th IAPR International
jection is the projection of the binary image along the
Workshop on Document Analysis Systems (pp. 211-222).
horizontal axis. Similarly, vertical projection is the projec-
Xiao, X., & Leedham, G. (2002). Signature verification tion of the binary image along the vertical axis. They can
using a modified Bayesian network. Pattern Recognition, be calculated as follows:
35, 983-995.
m
Ph [y] = black.pixel(x, y),Pv [x ] = black.pixel(x, y),
n

x =1 y =1

KEY TERMS
where m = width of the image and n = height of the
Area of Black Pixels: It is the total number of black image.
pixels in the binary image.
Biometric Authentication: The identification of indi- Moment Measures: Skewness and kurtosis are mo-
viduals using their physiological and behavioral charac- ment measures calculated using the horizontal and verti-
teristics. cal projections and co-ordinates of centre of gravity of the
signature.
Centre of Gravity: The centre of gravity of the image
is calculated as per the following equation: Slant: Either it is defined as the angle at which the
image has maximum horizontal projection value on rota-
tion or it is calculated using the total number of positive,
negative, horizontally or vertically slanted pixels.

875

TEAM LinG
876

Online Analytical Processing Systems


Rebecca Boon-Noi Tan
Monash University, Australia

INTRODUCTION OLAP Versus OLTP

Since its origin in the 1970s research and development Two reasons why traditional OLTP is not suitable for data
into databases systems has evolved from simple file warehousing are presented: (a) Given that operational
storage and processing systems to complex relational databases are finely tuned to support known OLTP
databases systems, which have provided a remarkable workloads, trying to execute complex OLAP queries against
contribution to the current trends or environments. Data- the operational databases would result in unacceptable
bases are now such an integral part of day-to-day life that performance. Furthermore, decision support requires data
often people are unaware of their use. For example, pur- that might be missing from the operational databases; for
chasing goods from the local supermarket is likely to instance, understanding trends or making predictions
involve access to a database. In order to retrieve the price requires historical data, whereas operational databases
of the item, the application program will access the prod- store only current data. (b) Decision support usually
uct database. A database is a collection of related data and requires consolidating data from many heterogeneous
the database management system (DBMS) is software sources: these might include external sources such as
that manages and controls access to the database (Elmasri stock market feeds, in addition to several operational
& Navathe, 2004). databases. The different sources might contain data of
varying quality, or use inconsistent representations,
codes and formats, which have to be reconciled.
BACKGROUND
Traditional Online Transaction
Data Warehouse Processing (OLTP)

A data warehouse is a specialized type of database. More Traditional relational databases have been used primarily
specifically, a data warehouse is a repository (or archive) to support OLTP systems. The transactions in an OLTP
of information gathered from multiple sources, stored system usually retrieve and update a small number of
under a unified schema, at a single site (Silberschatz, records accessed typically on their primary keys. Opera-
Korth, & Sudarshan, 2002, p. 843). Chaudhuri and Dayal tional databases tend to be hundreds of megabytes to
(1997) consider that a data warehouse should be sepa- gigabytes in size and store only current data (Ramakrishnan
rately maintained from the organizations operational & Gehrke, 2003).
database since the functional and performance require- Figure 1 shows a simple overview of the OLTP system.
ments of online analytical processing (OLAP) supported The operational database is managed by a conventional
by data warehouses are quite different from those of the relational DBMS. OLTP is designed for day-to-day opera-
online transaction processing (OLTP) traditionally sup- tions. It provides a real-time response. Examples include
ported by the operational database. Internet banking and online shopping.

Figure 1. OLTP system

Online
Transaction
Processing
(OLTP)
system

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Online Analytical Processing Systems

Online Analytical Processing (OLAP) special proprietary techniques to store data in matrix-like
n-dimensional arrays (Ramakrishnan & Gehrke, 2003). O
OLAP is a term that describes a technology that uses a The multi-dimensional data cube is implemented by
multi-dimensional view of aggregate data to provide quick the arrays with the dimensions forming the axes of the
access to strategic information for the purposes of ad- cube (Sarawagi, 1997). Therefore, only the data value
vanced analysis (Ramakrishnan & Gehrke, 2003). corresponding to a data cell is stored as direct mapping.
OLAP supports queries and data analysis on aggre- MOLAP servers have excellent indexing properties due to
gated databases built in data warehouses. It is a system the fact that looking for a cell is simple array lookups rather
for collecting, managing, processing and presenting mul- than associative lookups in tables. But unfortunately it
tidimensional data for analysis and management pur- provides poor storage utilization, especially when the
poses (Figure 2). There are two main implementation data set is sparse.
methods to support OLAP applications: relational OLAP In a multi-dimensional data model, the focal point is on
(ROLAP) and multidimensional OLAP (MOLAP). a collection of numeric measures. Each measure depends
on a set of dimensions. For instance, the measure attribute
ROLAP is amt as shown in Figure 4. Sales information is being
arranged in a three-dimensional array of amt. Figure 4
Relational online analytical processing (ROLAP) pro- shows that the array only shows the values for single L#
vides OLAP functionality by using relational databases value where L# = L001, which presented as a slice or-
and familiar relational query tools to store and analyse thogonal to the L# axis.
multidimensional data (Ramakrishnan & Gehrke, 2003).
Entity Relationship diagrams and normalization tech- Cube-By Operator
niques are popularly used for database design in OLTP
environments. However, the database designs recom- In decision support database systems, aggregation is a
mended by ER diagrams are inappropriate for decision commonly used operation. As previously mentioned,
support systems where efficiency in querying and in current SQL can be very inefficient. Thus, to effectively
loading data (including incremental loads) are crucial. A support decision support queries in OLAP environment,
special schema known as a star schema is used in an OLAP a new operator, Cube-by was proposed by (Gray,
environment for performance reasons (Martyn, 2004). Bosworth, Lyaman, & Pirahesh, 1996). It is an extension
This star schema usually consists of a single fact table of the relational operator Group-by. The Cube-by opera-
and a dimension table for each dimension (Figure 3). tor computes Group-by corresponding to all possible
combinations of attributes in the Cube-by clause.
MOLAP In order to see how a data cube is formed, an example
is provided in Figure 5. It shows an example of data cube
Multidimensional online analytical processing (MOLAP) formation through executing the cube statement at the top
extends OLAP functionality to multidimensional data- left of the figure. Figure 5 presents two way of presenting
base management systems (MDBMSs). A MDBMS uses the aggregated data: (a) a data cube, and (b) a 3D data cube
in table form.

Figure 2. OLAP system

Online
Analytical
Processing
(OLAP)
System

877

TEAM LinG
Online Analytical Processing Systems

Figure 3. Star schema showing that location, product, date, and sales are represented as relations

Figure 4. SALES presented as a multidimensional dataset

MAIN THRUST cessing. Back-end processing basically deals with the


raw data, which is stored in either tables (ROLAP) or
Processing Level in OLAP Systems arrays (MOLAP) and then process it into aggregated
data, which is presented to the user. Front-end process-
The processing level where the execution of the raw data ing is computing the raw data and managing the pre-
or aggregated data takes place before presenting the result computed aggregated data in either in 3D data cube or n-
to the user is shown in Figure 6. The user is only interested dimensional table.
in the data in the report in a fast response. However, the It is important to mention that front-end processing
background of the processing level is needed in order to is similar to back-end processing as it also deals with raw
provide the user with the result of the aggregated data in data. It needs to construct or execute the raw data into the
an effective and efficient way. data cube and then store the pre-computed aggregated
The processing level has been broken down into two data in either the memory or disk. It is convenient for the
distinct areas: back-end processing and front-end pro- user if the pre-computed aggregated data is ready and is

878

TEAM LinG
Online Analytical Processing Systems

Figure 5. An example of data cube formation through executing the cube statement at the top left of the figure
O

Figure 6. Two interesting areas of the processing level in OLAP environment

Online
Analytical
Processing
(OLAP)
System

879

TEAM LinG
Online Analytical Processing Systems

stored in the memory whenever the user requests for it. response time for providing the resulted data. Another
The difference between the two processing levels is consideration is that the analyst or manager using the
that the front-end has the pre-computed aggregated data data warehouse may have time constraints. There are two
in the memory which is ready for the user to use or analyze sub-stages or fundamental methods of handling the data:
it at any given time. On the other hand the back-end (a) indexing and (b) partitioning in order to provide the
processing computes the raw data directly whenever user with the resulted data in a reasonable time frame.
there is a request from the user. This is why the front-end Indexing has existed in databases for many decades.
processing is considered to present the aggregated data Its access structures have provided faster access to the
faster or more efficiently than back-end processing. How- base data. For retrieval efficiency, index structures would
ever, it is important to note that back-end processing and typically be defined especially in data warehouses or
front-end processing are basically one whole processing. ROLAP where the fact table is very large (Datta,
The reason behind this break down of the two processing VanderMeer & Ramamritham, 2002). ONeil and Quass
levels is because the problem can be clearly seen in back- (1997) have suggested a number of important indexing
end processing and front-end processing. In the next schemes for data warehousing including bitmap index,
section, the problems associated with each areas and the value-list index, projection index, data index. Data index is
related work in improving the problems will be considered. similar to projection index but it exploits a positional
indexing strategy (Datta, VanderMeer, & Ramamritham,
Back-End Processing 2002). Interestingly MOLAP servers have better indexing
properties than ROLAP servers since they look for a cell
Back-end processing involves basically dealing with using simple array lookups rather than associative look-
the raw data, which is stored in either tables (ROLAP) or ups in tables.
arrays (MOLAP) as shown in Figure 7. The user queries Partitioning of raw data is more complex and challeng-
the raw data for decision-making purpose. The raw data is ing in data warehousing as compared to that of relational
then processed and computed into aggregated data, which and object databases. This is due to the several choices
will be presented to the user for analyzing. Generally the of partitioning of a star schema (Datta, VanderMeer, &
basic stage: extracting, is followed by two sub-stages: Ramamritham, 2002). The data fragmentation concept in
indexing, and partitioning in back-end processing. the context of distributed databases aims to reduce query
Extracting is the process of querying the raw data execution time and facilitate the parallel execution of
either from tables (ROLAP) or arrays (MOLAP) and com- queries (Bellatreche, Karlapalem, Mohania, & Schneide,
puting it. The process of extracting is usually time-con- 2000).
suming. Firstly, the database (data warehouse) size is In a data warehouse or ROLAP, either the dimension
extremely large and secondly the computation time is tables or the fact table or even both can be fragmented.
equally high. However, the user is only interested in a fast Bellatreche et al. (2000) have proposed a methodology for
applying the fragmentation techniques in a data ware-
house star schema to reduce the total query execution
cost. The data fragmentation concept in the context of
Figure 7. Stages in back-end processing
distributed databases aims to reduce query execution time
and facilitates the parallel execution of queries.

Front-End Processing

Front-end processing is computing the raw data and


managing the pre-computed aggregated data in either in
3D data cube or n-dimensional table. The three basic
types of stages that are involved in Front-end processing
are shown in Figure 8. They are (i) constructing, (ii)
storing and (iii) querying. Constructing is basically the
computation process of a data cube by using cube-by
operator. Cube-by is an expensive approach, especially
when the number of Cube-by attributes and the database
size are large.
The difference between constructing and extracting
needs to be clearly defined. Constructing in the Front-end
processing is similar to the extracting in the Back-end

880

TEAM LinG
Online Analytical Processing Systems

Figure 8. Stages in front-end processing


O

processing as both of them are basically querying the raw the number of Cube-by attributes and the database size
data. However, in this case, extracting concentrates on are large. Second, storage has insufficient memory space
how the fundamental methods can help in handling the this is due to the loaded data and also in addition to the
raw data in order to provide efficient retrieval. Construct- incremental loads. Third, storage is not properly utilized
ing in the Front-end processing concentrates on the cube- the array may not fit into memory especially when the
by operator that involves the computation of raw data. data set is sparse. Fourth, certain queries might be diffi-
Storing is the process of putting the aggregated data cult to fulfill the need of decision makers. Fifth, the
into either the n-dimension table of rows or the n-dimen- execution querying time has to be reduced when there is
sional arrays or data cube. There are two parts within the n-dimensional query as the time factor is important to
storing process. One part is to store temporary raw data decision makers. However, it is important to consider that
in the memory for executing purpose and other part is to there is scope for other possible problems to be identified.
store pre-computed aggregated data. Hence, there are two
problems related to the storage: (a) insufficient memory
space due to the loaded raw data and also in addition to FUTURE TRENDS
the incremental loads; (b) poor storage utilization the
array may not fit into memory especially when the data set Problems have been identified in each of the three stages,
is sparse. which have generated considerable attention from re-
Querying is the process of extracting useful informa- searchers to find solutions. Several researchers have
tion from the pre-computed data cube or n-dimensional proposed a number of algorithms to solve these cube-by
table for decision makers. It is important to take note that problems. Examples include:
Querying is also part of Constructing process. Querying
also makes use of cube-by operator that involves the Constructing
computation of raw data. ROLAP is able to support ad hoc
requests and allows unlimited access to dimensions un- Cube-by operator is an expensive approach, especially
like MOLAP only allows limited access to predefined when the number of Cube-by attributes and the database
dimensions. Despite the fact that it is able to support ad size are large.
hoc requests and access to dimensions, certain queries
might be difficult to fulfill the need of decision makers and
also to reduce the querying execution time when there is
Fast Computation Algorithms
n-dimensional query as the time factor is important to
decision makers. There are algorithms aimed at fast computation of large
To conclude, the three stages and their associated sparse data cube (Ross & Srivastava, 1997; Beyer &
problems in the OLAP environment have been outlined. Ramakrishnon, 1999). Ross and Srivastava (1997) have
First, Cube-by is an expensive approach, especially when taken into consideration the fact that real data is fre-

881

TEAM LinG
Online Analytical Processing Systems

quently sparse. (Ross & Srivastava, 1997) partitioned Condensed Data Cube
large relations into small fragments so that there was
always enough memory to fit in the fragments of large Wang, Feng, Lu, and Yu (2002) have proposed a new
relation. Whereas, Beyer and Ramakrishnan (1999) pro- concept called a condensed data cube. This new ap-
posed the bottom-up method to help reduce the penalty proach reduces the size of data cube and hence its com-
associated with the sorting of many large views. putation time. They make use of single base tuple
compression to generate a condensed cube so it is
Parallel Processing System smaller in size.

The assumption of most of the fast computation algo- Querying


rithms is that their algorithms can be applied into the
parallel processing system. Dehne, Eavis, Hambrusch, It might be difficult for certain queries to fulfill both the
and Rau-Chaplin. (2002) presented a general methodol- need of the decision makers and also the need to reduce
ogy for the efficient parallelization of exiting data cube the querying execution time when there is n-dimensional
construction algorithms. Their paper described two dif- query as the time factor is important to decision makers.
ferent partitioning strategies, one for top-down and one
for bottom-up cube algorithms. They provide a good Types of Queries
summary and comparison with other parallel data cube
computations (Ng, Wagner, & Yin., 2001; Yu & Lu, 2001). Other work has focused on specialized data structures for
In Ng et al. (2001), the approach considers the fast processing of special types of queries. Lee, Ling and
parallelization of sort-based data cube construction, Li (2000) have focused on range-max queries and have
whereas Yu and Lu (2001) focuses on the overlap between proposed hierarchical compact cube to support the range-
multiple data cube computations in a sequential setting. max queries. The hierarchical structure stores not only the
maximum value of all the children sub-cubes, but also
Storing stores one of the locations of the maximum values among
the children sub-cubes. On the other hand, Tan, Taniar,
Two problems related to the storage: (a) insufficient and Lu (2004) focus on data cube queries in term of SQL
memory space; (b) poor storage utilization. and have presented taxonomy of data cube queries. They
also provide a comparison with different type queries.
Data Compression
Parallel Processing Techniques
Another technique is related to data compression.
Lakshmanan, Pei, and Zhao (2003) have developed a Taniar and Tan (2002) have proposed three parallel tech-
systematic approach to achieve efficacious data cube niques to improve the performance of the data cube
construction and exploration by semantic summarization queries. However, the challenges in this area continue to
and compression. grow with the cube-by problems.

Figure 9. Working areas in the OLAP system

882

TEAM LinG
Online Analytical Processing Systems

CONCLUSION Ng, R.T., Wagner, A., & Yin, Y. (2001, May). Iceberg-cube
computation with pc clusters. International ACM O
In this overview, the OLAP systems in general have been SIGMOD Conference (pp. 25-36), Santa Barbara, Califor-
considered, followed by processing level in the OLAP nia.
systems and lastly the related and future work in the ONeil, P., & Graefe, G. (1995). Multi-table joins through
OLAP systems. A number of problems and solutions in bit-mapped join indices. SIGMOD Record, 24(3), 8-11.
the OLAP environment have been presented. However,
consideration needs to be given to the possibility that Ramakrishnan, R., & Gehrke, J. (2003). Database manage-
other problems maybe identified which in turn will present ment systems. NY: McGraw-Hill.
new challenges for the researchers to address.
Ross, K.A., & Srivastava, D. (1997, August). Fast compu-
tation of sparse datacubes. International VLDB Confer-
ence (pp. 116-185), Athens, Greece.
REFERENCES
Sarawagi, S. (1997). Indexing OLAP data. IEEE Data
Bellatreche, L., Karlapalem, K., Mohania M., & Schneider Engineering Bulletin, 20(1), 36-43.
M. (2000, September). What can partitioning do for your
data warehouses and data marts? International IDEAS Silberschatz, A., Korth, H., & Sudarshan, S. (2002). Data-
conference (pp. 437-445), Yokohoma, Japan. base system concepts. NY: McGraw-Hill.

Beyer, K.S., & Ramakrishnan, R. (1999, June). Bottom-up Taniar, D., & Tan, R.B.N. (2002, May). Parallel processing
computation of sparse and iceberg cubes. International of multi-join expansion-aggregate data cube query in high
ACM SIGKDD conference (pp. 359-370), Philadelphia, performance database systems. International I-SPAN
PA. Conference (pp. 51-58), Manila, Philippines.

Chaudhuri, S., & Dayal, U. (1997). An overview of data Tan, R.B.N., Taniar, D., & Lu, G.J. (2004). A Taxonomy for
warehousing and OLAP technology. ACM SIGMOD Data Cube Queries, International Journal for Computers
Record, 26, 65-74. and Their Applications, 11(3), 171-185.

Datta, A., VanderMeer, D., & Ramamritham, K. (2002). Wang, W., Feng, J.L., Lu, H.J., & Yu, J.X. (2002, February).
Parallel Star Join + DataIndexes: Efficient Query Process- Condensed Cube: An effective approach to reducing data
ing in Data Warehouses and OLAP. IEEE Transactions cube size. International Data Engineering Conference
on Knowledge & Data Engineering, 14(6), 1299-1316. (pp. 155-165), San Jose, California.

Dehne, F., Eavis, T., Hambrusch, S., & Rau-Chaplin, A. Yu, J.X., & Lu, H.J. (2001, April). Multi-cube computation.
(2002). Parallelizing the data cube. International Journal International DASFAA Conference (pp. 126-133), Hong
of Distributed & Parallel Databases, 11, 181-201. Kong, China.

Elmasri, R., & Navathe, S.B. (2004). Fundamentals of


database systems. Boston, MA: Addison Wesley.
KEY TERMS
Gray, J., Bosworth, A., Lyaman, A. & Pirahesh, H. (1996,
June). Data cube: A relational aggregation operator gen- Back-End Processing: Is dealing with the raw data,
eralizing GROUP-BY, CROSS-TAB, and SUB-TOTALS. which is stored in either tables (ROLAP) or arrays
International ICDE Conference (pp. 152-159), New Or- (MOLAP).
leans, Louisiana.
Data Cube Operator: Computes Group-by correspond-
Lakshmanan, L.V.S., Pei, J., & Zhao, Y. (2003, September). ing to all possible combinations of attributes in the Cube-
Efficacious data cube exploration by semantic by clause.
semmarization and compression. International VLDB
Conference (pp. 1125-1128), Berlin, Germany. Data Warehouse: A type of database A subject-
oriented, integrated, time-varying, non-volatile collec-
Lee, S.Y, Ling, T.W., & Li, H.G. (2000, September). Hierar- tion of data that issued primarily in organizational deci-
chical compact cube for range-max queries. International sion making (Elmasri & Navathe, 2004, p.900).
VLDB Conference (pp. 232-241), Cairo, Egypt.
Front-End Processing: Is computing the raw data and
Martyn, T. (2004). Reconsidering multi-dimensional managing the pre-computed aggregated data in either in
schemas. ACM SIGMOD Record, 83-88. 3D data cube or n-dimensional table.

883

TEAM LinG
Online Analytical Processing Systems

Multidimensional OLAP (MOLAP): Extends OLAP Relational OLAP (ROLAP): Provides OLAP func-
functionality to multidimensional database management tionality by using relational databases and familiar rela-
systems (MDBMSs). tional query tools to store and analyse multidimensional
data.
Online Analytical Processing (OLAP): Is a term used
to describe the analysis of complex data from data ware-
house (Elmasri & Navathe, 2004, p.900).

884

TEAM LinG
885

Online Signature Recognition O


Indrani Chakravarty
Indian Institute of Technology, India

Nilesh Mishra
Indian Institute of Technology, India

Mayank Vatsa
Indian Institute of Technology, India

Richa Singh
Indian Institute of Technology, India

P. Gupta
Indian Institute of Technology, India

INTRODUCTION The hardware for online are expensive and have a


fixed and short life cycle.
Security is one of the major issues in todays world and
most of us have to deal with some sort of passwords in In spite of all these inconveniences, the use online
our daily lives; but, these passwords have some prob- methods is on the rise and in the near future, unless a
lems of their own. If one picks an easy-to-remember process requires particularly an off-line method to be
password, then it is most likely that somebody else may used, the former will tend to be more and more popular.
guess it. On the other hand, if one chooses too difficult
a password, then he or she may have to write it some-
where (to avoid inconveniences due to forgotten pass- BACKGROUND
words) which may again lead to security breaches. To
prevent passwords being hacked, users are usually ad- Online verification methods can have an accuracy rate
vised to keep changing their passwords frequently and of as high as 99%. The reason behind is its use of both
are also asked not to keep them too trivial at the same static and dynamic (or temporal) features, in compari-
time. All these inconveniences led to the birth of the son to the off-line, which uses only the static features
biometric field. The verification of handwritten signa- (Ramesh & Murty, 1999). The major differences be-
ture, which is a behavioral biometric, can be classified tween off-line and online verification methods do not
into off-line and online signature verification methods. lie with only the feature extraction phases and accuracy
Online signature verification, in general, gives a higher rates, but also in the modes of data acquisition, prepro-
verification rate than off-line verification methods, be- cessing and verification/recognition phases, though the
cause of its use of both static and dynamic features of basic sequence of tasks in an online verification (or
the problem space in contrast to off-line which uses recognition) procedure is exactly the same as that of the
only the static features. Despite greater accuracy, online off-line. The phases that are involved comprise of:
signature recognition is not that prevalent in compari-
son to other biometrics. The primary reasons are: Data Acquisition
Preprocessing and Noise Removal
It cannot be used everywhere, especially where Feature Extraction and
signatures have to be written in ink; e.g. on cheques, Verification (or Identification)
only off-line methods will work.
Unlike off-line verification methods, online meth- However, online signatures are much more difficult
ods require some extra and special hardware, e.g. to forge than off-line signatures (reflected in terms of
electronic tablets, pressure sensitive signature higher accuracy rate in case of online verification meth-
pads, etc. For off-line verification method, on the ods), since online methods involve the dynamics of the
other hand, we can do the data acquisition with signature such as the pressure applied while writing, pen
optical scanners. tilt, the velocity with which the signature is done etc. In

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Online Signature Recognition

case of off-line, the forger has to copy only the shape Data Acquisition
(Jain & Griess, 2000) of the signature. On the other
hand, in case of online, the hardware used captures the Data acquisition (of the dynamic features) in online
dynamic features of the signature as well. It is extremely verification methods is generally carried out using spe-
difficult to deceive the device in case of dynamic fea- cial devices called transducers or digitizers (Tappert,
tures, since the forger has to not only copy the charac- Suen, & Wakahara, 1990, Wessels & Omlin, 2000), in
teristics of the person whose signature is to be forged, contrast to the use of high resolution scanners in case of
but also at the same time, he has to hide his own inherent off-line. The commonly used instruments include the
style of writing the signature. There are four types of electronic tablets (which consist of a grid to capture the
forgeries: random, simple, skilled and traced forgeries x and y coordinates of the pen tip movements), pressure
(Ammar, Fukumura, & Yoshida, 1988; Drouhard, sensitive tablets, digitizers involving technologies such
Sabourin, & Godbout, 1996). In case of online signa- as acoustic sensing in air medium, surface acoustic
tures, the system shows almost 100% accuracy for the waves, triangularization of reflected laser beams, and
first two classes of forgeries and 99% in case of the optical sensing of a light pen to extract information
latter. But, again, a forger can also use a compromised about the number of strokes, velocity of signing, direc-
signature-capturing device to repeat a previously re- tion of writing, pen tilt, pressure with which the signa-
corded signature signal. In such extreme cases, even ture is written etc.
online verification methods may suffer from repetition
attacks when the signature-capturing device is not physi-
cally secure.
Preprocessing

Preprocessing in online is much more difficult than in


MAIN THRUST off-line, because it involves both noise removal (which
can be done using hardware or software) (Plamondon &
Although the basic sequence of tasks in online signature Lorette, 1989) and segmentation in most of the cases.
verification is almost the same as that of off-line meth- The other preprocessing steps that can be performed are
ods, the modes differ from each other especially in the signal amplifying, filtering, conditioning, digitizing,
ways the data acquisition, preprocessing and feature resampling, signal truncation, normalization, etc. How-
extraction are carried out. More specifically, the sub- ever, the most commonly used include:
modules of online are much more difficult with respect
to off-line (Jain & Griess, 2000). Figure 1 gives a External Segmentation: Tappert, Suen and
generic structure of an online signature verification Wakahara (1990) define external segmentation as
system. The online verification system can be classified the process by which the characters or words of a
into the following modules: signature are isolated before the recognition is
carried out.
Data Acquisition, Resampling: This process is basically done to
Preprocessing, ensure uniform smoothing to get rid of the redun-
Feature Extraction, dant information, as well as to preserve the re-
Learning and Verification. quired information for verification by comparing

Figure 1. Modular structure of a generic online verification system

Input device
/ interface

Preprocessing module

Data Base Analyzer / Controller module Output

Learning module Feature Extraction Verification module


module

886

TEAM LinG
Online Signature Recognition

the spatial data of two signatures. According to Jain Feature Extraction


and Griess (2000), here, the distance between two O
critical points is measured, and if the total distance Online signature extracts both the static and the dy-
exceeds a threshold called the resampling length( namic features. Some of the static and dynamic fea-
which is calculated by dividing the distance by the tures have been listed below.
number of sample points for that segment), then a
new point is created by using the gradient between Static Features: Although both static and dy-
the two points, namic information are available to the online
Noise Reduction: Noise is nothing but irrelevant verification system, in most of the cases, the
data, usually in the form of extra dots or pixels in static information is discarded, as dynamic fea-
images (in case of off-line verification methods) tures are quite rich and they alone give high
(Ismail & Gad, 2000), which do not belong to the accuracy rates. Some of the static features used
signature, but are included in the image (in case of by online method are:
off-line) or in the signal (in case of online), be-
cause of possible hardware problems (Tappert, Suen, Width and height of the signature parts (Ismail
& Wakahara, 1990) or presence of background & Gad, 2000)
noises like dirt or by faulty hand movements Width to height ratio (Ismail & Gad, 2000)
(Tappert, Suen, & Wakahara, 1990) while signing. Number of cross and edge points (Baltzakis &
Papamarkos, 2001)
The general techniques used are: Run lengths of each scan of the components of
the signature (Xiao & Leedham, 2002)
Smoothening: in which a point is averaged with Kurtosis (horizontal and vertical), skewness,
its neighbors. (Tappert, Sue, & Wakahara, 1990) relative kurtosis, relative skewness, relative
Filtering: Online thinning (Tappert, Suen, & horizontal and vertical projection measures
Wakahara, 1990) is not the same as off-line (Bajaj & Chaudhury, 1997; Ramesh & Murty,
thinning (Baltzakis & Papamarkos, 2001), al- 1999)
though, both decrease the number of irrelevant Envelopes: upper and lower envelope fea-
dots or points. According to Tappert, Suen and tures (Bajaj & Chaudhury, 1997; Ramesh &
Wakahara (1990), it can be carried out either by Murty, 1999)
forcing a minimum distance between adjacent Alphabet Specific Features: Like presence
points or by forcing a minimum change in the of ascenders, descenders, cusps, closures, dots
direction of the tangent to the drawing for con- etc. (Tappert, Suen, & Wakahara, 1990)
secutive points.
Wild Point Correction: This removes noises Dynamic Features: Though online methods uti-
(points) caused by hardware problems (Tappert, lize some of the static features, they give more
Suen, & Wakahara, 1990). emphasis to the dynamic features, since these
Dehooking: Dehooking on the other hand re- features are more difficult to imitate.
moves hook like features from the signature,
which usually occur at the start or end of the The most widely used dynamic features include:
signature (Tappert, Suen, & Wakahara, 1990).
Dot Reduction: The dot size is reduced to Position (ux(t) and uy(t) ): e.g. the pen tip
single points. (Tappert, Suen, & Wakahara, 1990) position when the tip is in the air, and when it
is in the vicinity of the writing surface etc (Jain
Normalization: Off-line normalization often lim- & Griess, 2000; Plamondon & Lorette, 1989).
its itself to scaling the image to a standard size and/ Pressure (u p (t)) (Jain & Griess, 2000;
or removing blank columns and rows. In case of Plamondon & Lorette, 1989)
online, normalization includes deskewing, baseline Forces (uf(t), ufx(t), ufy(t)): Forces are com-
drift correction (orienting the signature to hori- puted from the position and pressure coordi-
zontal), size normalization and stroke length nor- nates (Plamondon & Lorette, 1989).
malization (Ramesh & Murty, 1999; Tappert, Suen, Velocity (u v(t), uvx(t), uvy(t)): It can be de-
& Wakahara, 1990; Wessels & Omlin, 2000). rived from position coordinates (Jain & Griess,
2000; Plamondon & Lorette, 1989).

887

TEAM LinG
Online Signature Recognition

Absolute and relative speed between two criti- for error rates of different approaches have been in-
cal points (Jain, Griess, & Connell, 2002) cluded in the comparison table above. Equal error rate
Acceleration (ua(t), u ax(t), uay(t)): Accelera- (ERR) which is calculated using FRR and FAR is also
tion can be derived from velocity or position used for measuring the accuracy of the systems.
coordinates. It can also be computed using an
accelerometric pen (Jain & Griess, 2000; GENERAL PROBLEMS
Plamondon & Lorette, 1989).
Parameters: A number of parameters like num- The feature set in online has to be taken very carefully,
ber of peaks, starting direction of the signature, since it must have sufficient interpersonal variability so
number of pen lifts, means and standard devia- that the input signature can be classified as genuine or
tions, number of maxima and minima for each forgery. In addition, it must also have a low intra per-
segment, proportions, signature path length, sonal variability so that an authentic signature is ac-
path tangent angles etc. are also calculated apart cepted. Therefore, one has to be extremely cautious,
from the above mentioned functions to increase while choosing the feature set, as increase in dimen-
the dimensionality (Gupta, 1997; Plamondon & sionality does not necessarily mean an increase in effi-
Lorette, 1989; Wessels & Omlin, 2000). More- ciency of a system.
over, all these features can be of both global and
local in nature.
FUTURE TRENDS
Verification and Learning
Biometrics is gradually replacing the conventional pass-
For online, examples of comparison methods include word and ID based devices, since it is both convenient
use of: and safer than the earlier methods. Today, it is not
difficult at all to come across a fingerprint scanner or an
Corner Point and Point to Point Matching algo- online signature pad. Nevertheless, it still requires a lot
rithms (Zhang, Pratikakis, Cornelis, & Nyssen, of research to be done to make the system infallible,
2000) because even an accuracy rate of 99% can cause failure
Similarity measurement on logarithmic spectrum of the system when scaled to the size of a million.
(Lee, Wu, & Jou, 1998) So, it will not be strange if we have ATMs in near
Extreme Points Warping (EPW) (Feng & Wah, future granting access only after face recognition, fin-
2003) gerprint scan and verification of signature via embedded
String Matching and Common threshold (Jain, devices, since a multimodal system will have a lesser
Griess, & Connell, 2002) chance of failure than a system using a single biometric
Split and Merging, (Lee, Wu, & Jou, 1997) or a password/ID based device. Currently online and off-
Histogram classifier with global and local line signature verification systems are two disjoint
likeliness coefficients (Plamondon & Lorette, approaches. Efforts must be also made to integrate the
1989) two approaches enabling us to exploit the higher accu-
Clustering analysis ( Lorette, 1984) racy of online verification method and greater applica-
Dynamic programming based methods; matching bility of off-line method.
with Mahalanobis pseudo-distance (Sato &
Kogure,1982)
Hidden Markov Model based methods (McCabe, CONCLUSION
2000; Kosmala & Rigoll, 1998)
Most of the online methods claim near about 99%
Table 1 gives a summary of some prominent works in accuracy for signature verification. These systems are
the online signature verification field. gaining popularity day by day and a number of products
are currently available in the market. However, online
Performance Evaluation verification is facing stiff competition from fingerprint
verification system which is both more portable and has
Performance evaluation of the output (which is to ac- higher accuracy. In addition, the problem of maintaining
cept or reject the signature) is done using false rejec- a balance between FAR and FRR has to be maintained.
tion rate (FRR or type I error) and false acceptance rate Theoretically, both FAR and FRR are inversely related
(FAR or type II error) (Huang & Yan, 1997). The values to each other. That is, if we keep tighter thresholds to

888

TEAM LinG
Online Signature Recognition

Table 1. Summery of some prominent online papers Gupta, J., & McCabe, A. (1997). A review of dynamic
handwritten signature verification. Technical Article, O
Author Mode Of Verification Database Feature Extraction
Results
(Error rates) James Cook University, Australia.
coordinates to
86.5 % accuracy rate for
Wu, Lee and
Jou (1997)
Split and merging
200 genuine and
246 forged
represent the
signature and the
genuine and 97.2 % for
forged
Huang, K., & Yan, H. (1997). Off-line signature verification
27 people
velocity
based on geometric feature extraction and neural network
Features based on
Wu, Lee and
Similarity
measurement on
each 10 signs,
560 genuine and
coefficients of the FRR=1.4 % and classification. Pattern Recognition, 30(1), 9-17.
Jou (1998) logarithmic FAR=2.8 %
logarithmic spectrum 650 forged for
spectrum
testing
Ismail, M.A., & Gad, S. (2000). Off-line Arabic signature
0.1 % mismatch in
Zhang,
Pratikakis,
Corner point matching
corner points
segments in case of recognition and verification. Pattern Recognition, 33,
algorithm and Point to 188 signatures, corner point algorithm
Cornelis and
Nyssen
point matching from 19 people
extracted based on
velocity information
and 0.4 % mismatch in 1727-1740.
algorithm case of point to point
(2000)
matching algorithm
Jain, A. K., & Griess, F. D. (2000). Online signature
Number of strokes,
co-ordinate distance verification. Project Report, Department of Computer Sci-
between two points,
angle with respect to ence and Engineering, Michigan State University, USA.
Jain, Griess 1232 signatures x and y axis,
String matching and Type I: 2.8%
and Connell of 102 curvature, distance
common threshold Type II: 1.6%
(2002) individual from centre of
gravity, grey value
Jain, A. K., Griess, F. D., & Connell, S. D. (2002). Online
in 9X9
neighborhood and
signature verification. Pattern Recognition, 35(12), 2963-
velocity features 2972.
25 users EER (Euclidean)
Extreme Points
Feng and
Warping ( both
contributed
30 genuine
x and y trajectories
used, apart from
=
25.4 % Kosmala, A., & Rigoll, G. (1998). A systematic comparison
Euclidean distance and
Wah (2003)
Correlation
signatures
and 10 forged
torque and center of
mass.
EER
(correlation) = between online and off-line methods for signature verifi-
Coefficients used)
signatures 27.7 %
cation using hidden markov models. 14th International
Conference on Pattern Recognition (pp. 1755-1757).
decrease FAR, inevitably we increase the FRR by reject-
ing some genuine signatures. Further research is thus Lee, S. Y., Wu, Q. Z., & Jou, I. C. (1997). Online signature
still required to overcome this barrier. verification based on split and merge matching mecha-
nism. Pattern Recognition Letters, 18, 665-673
Lee, S. Y., Wu, Q. Z., & Jou, I. C. (1998). Online
REFERENCES signature verification based on logarithmic spectrum.
Pattern Recognition, 31(12), 1865-1871
Ammar, M., Fukumura, T., & Yoshida Y. (1988). Off-
line preprocessing and verification of signatures. Inter- Lorette, G. (1984). Online handwritten signature recogni-
national Journal of Pattern Recognition and Artifi- tion based on data analysis and clustering. Proceedings
cial Intelligence, 2(4), 589-602. of 7th International Conference on Pattern Recognition,
Vol. 2 (pp. 1284-1287).
Ammar, M., Fukumura, T., & Yoshida Y. (1990). Structural
description and classification of signature images. Pat- McCabe, A. (2000). Hidden markov modeling with simple
tern Recognition, 23(7), 697-710. directional features for effective and efficient handwrit-
ing verification. Proceedings of the Sixth Pacific Rim
Bajaj, R., & Chaudhury, S. (1997). Signature verification International Conference on Artificial Intelligence.
using multiple neural classifiers. Pattern Recognition,
30(1), l-7. Plamondon R., & Lorette, G. (1989). Automatic signa-
ture verification and writer identification the state of
Baltzakis, H., & Papamarkos, N. (2001). A new signature the art. Pattern Recognition, 22(2), 107-131.
verification technique based on a two-stage neural net-
work classifier. Engineering Applications of Artificial Ramesh, V.E., & Murty, M. N. (1999). Off-line signa-
Intelligence, 14, 95-103. ture verification using genetically optimized weighted
features. Pattern Recognition, 32(7), 217-233.
Drouhard, J. P., Sabourin, R., & Godbout, M. (1996). A
neural network approach to off-line signature verifica- Sato, Y., & Kogure, K. (1982). Online signature verifi-
tion using directional pdf. Pattern Recognition, 29(3), cation based on shape, motion and handwriting pressure.
415-424. Proceedings of 6th International Conference on Pat-
tern Recognition, Vol. 2 (pp. 823-826).
Feng, H., & Wah, C. C. (2003). Online signature verifi-
cation using a new extreme points warping technique. Tappert, C. C., Suen, C. Y., & Wakahara, T. (1990). The
Pattern Recognition Letters 24(16), 2943-2951. state of the art in on-line handwriting recognition. IEEE

889

TEAM LinG
Online Signature Recognition

Transactions on Pattern Analysis and Machine Intelli- Global Features: The features are extracted using the
gence, 12(8). complete signature image or signal as a single entity.
Wessels, T., & Omlin, C. W. (2000). A hybrid approach for Local Features: The geometric information of the
signature verification. International Joint Conference signature is extracted in terms of features after dividing
on Neural Networks. the signature image or signal into grids and sections.
Xiao, X., & Leedham, G. (2002). Signature verification Online Signature Recognition: The signature is
using a modified bayesian network. Pattern Recognition, captured through a digitizer or an instrumented pen and
35, 983-995. both geometric and temporal information are recorded
and later used in the recognition process.
Zhang, K., Pratikakis, I., Cornelis, J., & Nyssen, E. (2000).
Using landmarks in establishing a point to point corre- Random Forgery: Random forgery is one in which
spondence between signatures. Pattern Analysis and the forged signature has a totally different semantic
Applications, 3, 69-75. meaning and overall shape in comparison to the genuine
signature.
Simple Forgery: Simple forgery is one in which the
KEY TERMS semantics of the signature are the same as that of the
genuine signature, but the overall shape differs to a great
Equal Error Rate: The error rate when the propor- extent, since the forger has no idea about how the
tions of FAR and FRR are equal. The accuracy of the signature is done.
biometric system is inversely proportional to the value Skilled Forgery: In skilled forgery, the forger has a
of EER. prior knowledge about how the signature is written and
False Acceptance Rate: Rate of acceptance of a practices it well, before the final attempt of duplicating it.
forged signature as a genuine signature by a handwritten Traced Forgery: For traced forgery, a signature
signature verification system. instance or its photocopy is used as a reference and tried
False Rejection Rate: Rate of rejection of a genu- to be forged.
ine signature as a forged signature by a handwritten
signature verification system.

890

TEAM LinG
891

Organizational Data Mining O


Hamid R. Nemati
The University of North Carolina at Greensboro, USA

Christopher D. Barko
The University of North Carolina at Greensboro, USA

INTRODUCTION overcome information paralysisthere is so much data


available that it is difficult to determine what is relevant
Data mining is now largely recognized as a business and how to extract meaningful knowledge. Organiza-
imperative and considered essential for enabling the tions today routinely collect and manage terabytes of data
execution of successful organizational strategies. The in their databases, thereby making information paralysis
adoption rate of data mining by enterprises is growing a key challenge in enterprise decision making. The gen-
quickly, due to noteworthy industry results in applica- eration and management of business data lose much of
tions such as credit assessment, risk management, mar- their potential organizational value unless important con-
ket segmentation, and the ever-increasing volumes of clusions can be extracted from them quickly enough to
corporate data available for analysis. The quantity of influence decision making, while the business opportu-
data being captured is staggeringdata experts esti- nity is still present. Managers must understand rapidly
mate that in 2002, the world generated five exabytes of and thoroughly the factors driving their business in order
information. This amount of data is more than all the to sustain a competitive advantage. Organizational speed
words ever spoken by human beings (Hardy, 2004). The and agility, supported by fact-based decision making, are
rate of growth is just as astoundingthe amount of data critical to ensure an organization remains at least one
produced in 2002 was up 68% from just two years earlier. step ahead of its competitors.
The size of the typical business database has grown a
hundred-fold during the past five years as a result of
Internet commerce, ever-expanding computer systems, BACKGROUND
and mandated recordkeeping by government regulations
(Hardy, 2004). Following this trend, a recent survey of The manner in which organizations execute this intri-
corporations across 23 countries revealed that the larg- cate decision-making process is critical to their well-
est transactional database almost doubled in size to 18 TB being and industry competitiveness. Those organiza-
(terabytes), while the largest decision-support database tions making swift, fact-based decisions by optimally
grew to almost 30 TB (Reddy, 2004). leveraging their data resources will outperform those
However, in spite of this enormous growth in enter- organizations that do not. A robust technology that
prise databases, research from IBM reveals that organi- facilitates this process of optimal decision making is
zations use less than 1% of their data for analysis known as Organizational Data Mining (ODM). ODM is
(Brown, 2002). In a similar study, a leading business defined as leveraging data mining tools and technolo-
intelligence firm surveyed executives at 450 companies gies in order to enhance the decision-making process by
and discovered that 90% of these organizations rely on transforming data into valuable and actionable knowl-
gut instinct rather than hard facts for most of their edge to gain a competitive advantage (Nemati & Barko,
decisions, because they lack the necessary information 2001). ODM eliminates the guesswork that permeates
when they need it (Brown, 2002). In cases where suffi- so much of corporate decision making. By adopting
cient business information is available, those organiza- ODM, an organizations managers and employees are
tions only are able to utilize less than 7% of it (The able to act sooner rather than later, be proactive rather
Economist, 2001). This is the fundamental irony of the than reactive, and know rather than guess. ODM technol-
Information Age we live inorganizations possess enor- ogy has helped many organizations to optimize internal
mous amounts of business information yet have so little resource allocations while better understanding and
real business knowledge. responding to the needs of their customers.
In the past, companies have struggled to make deci- ODM spans a wide array of technologies, including,
sions because of lack of data. But in the current environ- but not limited to, e-business intelligence, On-Line
ment, more and more organizations are struggling to Analytical Processing (OLAP), optimization, Customer

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Organizational Data Mining

Relationship Management (CRM), electronic CRM (e- nical and business issues related to ODM projects, and
CRM), Executive Information Systems (EIS), digital elaborated on how organizations are benefiting through
dashboards, and enterprise information portals. ODM enhanced enterprise decision making (Nemati & Barko,
enables organizations to answer questions about the past 2001). The results of our research suggest that ODM
(what has happened?), the present (what is happening?), can improve the quality and accuracy of decisions for
and the future (what might happen?). Armed with this any organization that is willing to make the investment.
capability, organizations can generate valuable knowl- After exploring the status and utilization of ODM in
edge from their data, which, in turn, enhances enterprise organizations, we decided to focus subsequent research
decisions. This decision-enhancing technology offers on how organizations implement ODM projects and on
many advantages in operations (faster product develop- the factors critical to its success. To that end, we
ment, optimal supply chain management), marketing developed a new ODM Implementation Framework based
(higher profitability and increased customer loyalty on data, technology, organizations, and the Iron Triangle
through more effective marketing campaigns), finance (Nemati & Barko, 2003). Our research demonstrated
(optimal portfolio management, financial analytics), that selected organizational data mining project factors,
and strategy implementation (Business Performance when modeled under this new framework, have a signifi-
Management [BPM] and the Balanced Scorecard). cant influence on the successful implementation of
Over the last three decades, the organizational role ODM projects.
of information technology has evolved from efficiently Given the promise of strengthening customer rela-
processing large amounts of batch transactions to pro- tionships and enhancing profits, CRM technology and
viding information in support of tactical and strategic associated research are gaining greater acceptance within
decision-making activities. This evolution, from auto- organizations. However, findings from recent studies
mating expensive manual systems to providing strategic suggest that organizations generally fail to support their
organizational value, led to the birth of Decision Sup- CRM efforts with complete data (Brohman et al., 2003).
port Systems (DSS), such as data warehousing and data As further investigation, our latest research has focused
mining. The organizational need to combine data from on a specific ODM technology known as Electronic
multiple stand-alone systems (e.g., financial, manufac- Customer Relationship Management (e-CRM) and its
turing, and distribution) grew as corporations began to data integration role within organizations. Consequently,
acknowledge the power of combining these data sources we developed a new e-CRM Value Framework to better
for reporting. This spurred the growth of data warehous- examine the significance of integrating data from all
ing, where multiple data sources were stored in a format customer touch-points with the goal of improving cus-
that supported advanced data analysis. tomer relationships and creating additional value for the
The slowness in adoption of ODM techniques in the firm. Our research findings suggest that, despite the
1990s was partly due to an organizational and cultural cost and complexity, data integration for e-CRM projects
resistance. Business management always has been re- contributes to a better understanding of the customer
luctant to trust something it does not fully understand. and leads to higher return on investment (ROI), a greater
Until recently, most businesses were managed by in- number of benefits, improved user satisfaction, and a
stinct, intuition, and gut feeling. The transition over the higher probability of attaining a competitive advantage
past 20 years to a method of managing by the numbers is (Nemati, Barko & Moosa, 2003).
both the result of technology advances as well as a
generational shift in the business world, as younger
managers arrive with information technology training MAIN THRUST
and experience.
Data mining is the process of discovering and interpret-
ODM Research ing previously unknown patterns in databases. It is a
powerful technology that converts data into information
Given the scarcity of past research in ODM along with and potentially actionable knowledge. However, there
its growing acceptance and importance in organizations, are many obstacles to the broad inclusion of data mining
we conducted empirical research during the past several in organizations. Obtaining new knowledge in an organi-
years that explored the utilization of ODM in organiza- zational vacuum does not facilitate optimal decision
tions along with project implementation factors critical making in a business setting. Simply incorporating data
for success. We surveyed ODM professionals from mining into the enterprise mix without considering non-
multiple industries in both domestic and international technical issues is usually a recipe for failure. Busi-
organizations. Our initial research examined the ODM nesses must give careful thought when weaving data
industry status and best practices, identified both tech- mining into their organizations fabric. The unique orga-

892

TEAM LinG
Organizational Data Mining

nizational challenge of understanding and leveraging ond, organizations create, organize, and process data to
ODM to engineer actionable knowledge requires assimi- generate new knowledge through organizational learn- O
lating insights from a variety of organizational and tech- ing. This knowledge creation activity enables the orga-
nical fields and developing a comprehensive framework nization to develop new capabilities, design new prod-
that supports an organizations quest for a sustainable ucts and services, enhance existing offerings, and im-
competitive advantage. These multi-disciplinary fields prove organizational processes. Third, organizations
include data mining, business strategy, organizational search for and evaluate data in order to make decisions.
learning and behavior, organizational culture, organiza- This data is critical, since all organizational actions are
tional politics, business ethics and privacy, knowledge initiated by decisions and all decisions are commit-
management, information sciences, and decision sup- ments to actions, the consequences of which will, in
port systems. These fundamental elements of ODM can turn, lead to the creation of new data. Adopting an OT
be categorized into three main groups: Artificial Intelli- methodology enables an enterprise to enhance the
gence (AI), Information Technology (IT), and Organiza- knowledge engineering and management process.
tional Theory (OT). Our research and industry experi- In another OT study, researchers and academic
ence suggest that successfully leveraging ODM requires scholars have observed that there is no direct correla-
integrating insights from all three categories in an orga- tion between information technology (IT) investments
nizational setting typically characterized by complexity and organizational performance. Research has con-
and uncertainty. This is the essence and uniqueness of firmed that identical IT investments in two different
ODM. Obtaining maximum value from ODM involves a companies may give a competitive advantage to one
cross-department team effort that includes statisticians/ company but not the other. Therefore, a key factor for
data miners, software engineers, business analysts, line- the competitive advantage in an organization is not the
of-business managers, subject-matter experts, and upper IT investment but the effective utilization of informa-
management support. tion as it relates to organizational performance
(Brynjolfsson & Hitt, 1996). This finding emphasizes
Organizational Theory and ODM the necessity of integrating OT practices with robust
information technology and artificial intelligence tech-
Organizations are concerned primarily with studying niques in successfully leveraging ODM.
how operating efficiencies and profitability can be
achieved through the effective management of custom- ODM Practices at Leading Companies
ers, suppliers, partners, and employees. To achieve these
goals, research in Organizational Theory (OT) suggests A 2002 Strategic Decision-Making study conducted by
that organizations use data in three vital knowledge cre- Hackett Best Practices determined that world-class
ation activities. This organizational knowledge creation companies have adopted ODM technologies at more
and management is a learned ability that only can be than twice the rate of average companies (Hoblitzell,
achieved via an organized and deliberate methodology. 2002). ODM technologies provide these world-class
This methodology is a foundation for successfully lever- organizations greater opportunities to understand their
aging ODM within the organization. The three knowl- business and make informed decisions. ODM also en-
edge creation activities (Choo, 1997) are: ables world-class organizations to leverage their inter-
nal resources more efficiently and more effectively
Sense Making: The ability to interpret and under- than their average counterparts, who have not fully
stand information about the environment and events embraced ODM.
happening both inside and outside the organization. Many of todays leading organizations credit their
Knowledge Making: The ability to create new success to the development of an integrated, enter-
knowledge by combining the expertise of members prise-level ODM system. As part of an effective CRM
to learn and innovate. strategy, customer retention is now widely viewed by
Decision Making: The ability to process and ana- organizations as a significant marketing strategy in
lyze information and knowledge in order to select creating a competitive advantage. Research suggests
and implement the appropriate course of action. that as little as a 5% increase in retention can mean as
much as a 95% boost in profit, and repeat customers
First, organizations use data to make sense of changes generate over twice as much gross income as new
and developments in the external environmentsa pro- customers (Winer, 2001). In addition, many business
cess called sense making. This is a vital activity wherein executives today have replaced their cost reduction
managers discern the most significant changes, interpret strategies with a customer retention strategyit costs
their meaning, and develop appropriate responses. Sec- approximately five to 10 times more to acquire new

893

TEAM LinG
Organizational Data Mining

customers than to retain established customers (Pan & pliers, and shareholders. The never-ending challenge is
Lee, 2003). to successfully integrate data-mining technologies
An excellent example of a successful CRM strategy within organizations in order to enhance decision mak-
is Harrahs Entertainment, which has saved over $20 ing with the objective of optimally allocating scarce
million per year since implementing its Total Rewards enterprise resources. This is not an easy task, as many
CRM program. This ODM system has given Harrahs a information technology professionals, consultants, and
better understanding of its customers and has enabled managers can attest. The media can oversimplify the
the company to create targeted marketing campaigns effort, but successfully implementing ODM is not ac-
that almost doubled the profit per customer and deliv- complished without political battles, project manage-
ered same-store sales growth of 14% after only the first ment struggles, cultural shocks, business process
year. In another notable case, Travelocity.com, an reengineering, personnel changes, short-term financial
Internet-based travel agency, implemented an ODM sys- and budgetary shortages, and overall disarray.
tem and improved total bookings and earnings by 100% Recent ODM research has revealed a number of
in 2000. Gross profit margins improved 150%, and industry predictions that are expected to be key ODM
booker conversion rates rose 8.9%, the highest in the issues in the coming years. Nemati & Barko (2001)
online travel services industry. found that almost 80% of survey respondents expect
In another significant study, executives from 24 Web farming/mining and consumer privacy to be sig-
leading companies in customer-knowledge management, nificant issues. We also foresee the development of
including FedEx, Frito-Lay, Harley-Davidson, Procter widely accepted standards for ODM processes and tech-
& Gamble, and 3M, all realized that in order to succeed, niques to be an influential factor for knowledge seekers
they must go beyond simply collecting customer data in the 21st century. One attempt at ODM standardization
and must translate it into meaningful knowledge about is the creation of the Cross Industry Standard Process
existing and potential customers (Davenport, Harris & for Data Mining (CRISP-DM) project that developed an
Kohli, 2001). This study revealed that several objec- industry and tool-neutral, data-mining process model to
tives were common to all of the leading companies, and solve business problems. Another attempt at industry
these objectives can be facilitated by ODM. A few of standardization is the work of the Data Mining Group in
these objectives are segmenting the customer base, developing and advocating the Predictive Model Markup
prioritizing customers, understanding online customer Language (PMML), which is an XML-based language
behavior, engendering customer loyalty, and increasing that provides a quick and easy way for companies to
cross-selling opportunities. define predictive models and share models between
compliant vendors applications. Last, Microsofts OLE
DB for Data Mining is a further attempt at industry
FUTURE TRENDS standardization and integration. This specification of-
fers a common interface for data mining that will enable
The number of ODM projects is projected to grow more developers to embed data-mining capabilities into their
than 300% in the next decade (Linden, 1999). As the existing applications. One only has to consider
collection, organization, and storage of data rapidly Microsofts industry-wide dominance of the office pro-
increase, ODM will be the only means of extracting ductivity (Microsoft Office), software development
timely and relevant knowledge from large corporate (Visual Basic and .Net), and database (SQL Server)
databases. The growing mountains of business data, markets to envision the potential impact this could have
coupled with recent advances in Organizational Theory on the ODM market and its future direction.
and technological innovations, provide organizations
with a framework to effectively use their data to gain a
competitive advantage. An organizations future suc- CONCLUSION
cess will depend largely on whether or not they adopt
and leverage this ODM framework. ODM will continue Although many improvements have materialized over
to expand and mature as the corporate demand for one- the last decade, the knowledge gap in many organiza-
to-one marketing, CRM, e-CRM, Web personalization, tions is still prevalent. Industry professionals have sug-
and related interactive media increases. gested that many corporations could maintain current
We believe that organizations are slowly moving revenues at half the current costs, if they optimized
from the Information Age to the Knowledge Age, where their use of corporate data (Saarenvirta, 2001). Whether
decision makers will leverage ODM for optimal busi- this finding is true or not, it sheds light on an important
ness performance. Organizations are complex entities issue. Leading corporations in the next decade will
comprised of employees, managers, politics, culture, adopt and weave these ODM technologies into the fab-
hierarchies, teams, processes, customers, partners, sup- ric of their organizations at the strategic, tactical, and
894

TEAM LinG
Organizational Data Mining

operational levels. Those enterprises that see the strate- Pan, S.L., & Lee, J.-N. (2003). Using E-CRM for a
gic value of evolving into knowledge organizations by unified view of the customer. Communications of the O
leveraging ODM will benefit directly in the form of ACM, 46(4), 95-99.
improved profitability, increased efficiency, and sus-
tainable competitive advantage. Once the first organiza- Reddy, R. (2004). The reality of real time. Intelligent
tion within an industry realizes a competitive advantage Enterprise, 7(10), 40-41.
through ODM, it is only a matter of time before one of Saarenvirta, G. (2001). Operation data mining. DB2 Maga-
three events transpires: its industry competitors adopt zine. Retrieved from http://www.db3mag.com/db_area/
ODM, change industries, or vanish. By adopting ODM, archives/2001/q2/saarenvirta.html
an organizations managers and employees are able to
act sooner rather than later, anticipate rather than react, The slow progress of fast wires. (2001). The Econo-
know rather than guess, and, ultimately, succeed rather mist, 358(8209), 57-59.
than fail. Winer, R.S. (2001). A framework for customer rela-
tionship management. California Management Review,
43(4), 89-106.
REFERENCES

Brohman, M. et al. (2003). Data completeness: A key to


effective net-based customer service systems. Commu- TERMS
nications of the ACM, 46(6), 47-51.
CRISP-DM (Cross Industry Standard Process
Brown, E. (2002). Analyze this. Forbes, 169(8), 96-98. for Data Mining): An industry and tool-neutral data-
Brynjolfsson, E., & Hitt, L. (1996, September 9). The mining process model developed by members from
customer counts. InformationWeek. Retrieved from diverse industry sectors and research institutes.
www.informationweek.com/596/96mit.htm CRM (Customer Relationship Management): The
Choo, C.W. (1997). The knowing organization: How methodologies, software, and Internet capabilities that
organizations use information to construct meaning, help a company manage customer relationships in an
create knowledge, and make decisions. Retrieved from efficient and organized manner.
www.choo.fis.utoronto.ca/fis/ko/default.html Information Paralysis: A condition where too much
Davenport, T.H., Harris, J.G., & Kohli, A.K. (2001). data causes difficulty in determining relevancy and ex-
How do they know their customers so well? Sloan tracting meaningful information and knowledge.
Management Review, 42(2), 63-73. ODM (Organizational Data Mining): The process
Hardy, Q. (2004). Data of reckoning. Forbes, 173(10), of leveraging data-mining tools and technologies in an
151-154. organizational setting in order to enhance the decision-
making process by transforming data into valuable and
Hoblitzell, T. (2002). Disconnects in todays BI sys- actionable knowledge in order to gain a competitive
tems. DM Review, 12(6), 56-59. advantage.
Linden, A. (1999). CIO update: Data mining applica- OLAP (Online Analytical Processing): A data
tions of the next decade. Inside Gartner Group. Gartner mining technology that uses software tools to interac-
Inc. tively and simultaneously analyze different dimensions
Nemati, H.R., & Barko, C.D. (2001). Issues in organiza- of multi-dimensional data.
tional data mining: A survey of current practices. Jour- OT (Organizational Theory): A body of research
nal of Data Warehousing, 6(1), 25-36. that focuses on studying how organizations create and
Nemati, H.R., & Barko, C.D. (2003). Key factors for use information in three strategic arenassense mak-
achieving organizational data mining success. Indus- ing, knowledge making, and decision making.
trial Management & Data Systems, 103(4), 282-292. PMML (Predictive Model Markup Language):
Nemati, H.R., Barko, C.D., & Moosa, A. (2003). E- An XML-based language that provides a method for
CRM analytics: The role of data integration. Journal of companies to define predictive models and share mod-
Electronic Commerce in Organizations, 1(3), 73-89. els between compliant vendors applications.

895

TEAM LinG
896

Path Mining in Web Processes Using Profiles


Jorge Cardoso
University of Madeira, Portugal

INTRODUCTION

Business process management systems (BPMSs) (Smith & Fingar, 2003) provide a fundamental infrastructure to define and manage business processes, Web processes, and workflows. When Web processes and workflows are installed and executed, the management system generates data describing the activities being carried out, and this data is stored in a log. This log of data can be used to discover and extract knowledge about the execution of processes. One piece of important and useful information that can be discovered is related to the prediction of the path that will be followed during the execution of a process. I call this type of discovery path mining. Path mining is vital to algorithms that estimate the quality of service of a process, because they require the prediction of paths. In this work, I present and describe how process path mining can be achieved by using data-mining techniques.

BACKGROUND

BPMSs, such as workflow management systems (WfMS) (Cardoso, Bostrom, & Sheth, 2004), are systems capable of both generating and collecting considerable amounts of data describing the execution of business processes, such as Web processes. This data is stored in process log systems, which are vast data archives that are seldom visited. Yet, the data generated from the execution of processes are rich with concealed information that can be used for making intelligent business decisions.
One important and useful piece of knowledge to discover and extract from process logs is the implicit rules that govern path mining.
In Web processes for e-commerce, suppliers and customers define a contract between the two parties, specifying quality of service (QoS) items, such as products or services to be delivered, deadlines, quality of products, and cost of services. The management of QoS metrics directly impacts the success of organizations participating in e-commerce. A Web process, which typically can have a graphlike representation, includes a number of linearly independent control paths. Depending on the path followed during the execution of a Web process, the QoS may be substantially different. If you can predict with a certain degree of confidence the path that will be followed at run time, you can significantly increase the precision of QoS estimation algorithms for Web processes.
Because the large amounts of data stored in process logs exceed human understanding, I describe the use of data-mining techniques to carry out path mining from the data stored in log systems. This approach uses classification algorithms to extract patterns representing knowledge related to paths. My work is novel because no previous work has targeted the path mining of Web processes and workflows. The literature includes only work on process and workflow mining (Agrawal, Gunopulos, & Leymann, 1998; Herbst & Karagiannis, 1998; Weijters & van der Aalst, 2001). Process mining allows the discovery of workflow models from a workflow log containing information about the workflow processes executed. Luo, Sheth, Kochut, and Arpinar (2003) present an architecture and the implementation of a sophisticated exception-handling mechanism supported by a case-based reasoning (CBR) engine.

MAIN THRUST

The material presented in this section emphasizes the use of data-mining techniques for uncovering interesting process patterns hidden in large process logs. The method described in the next section is more suitable for administrative and production processes than for ad hoc and collaborative processes, because the former are more repetitive and predictable.

Web Process Scenario

A major bank has realized that to be competitive and efficient it must adopt a new and modern information system infrastructure. Therefore, a first step was taken in that direction with the adoption of a workflow management system to support its business processes. All the services available to customers are stored and executed under the supervision of the workflow system. One of the services supplied by the bank is the loan process depicted in Figure 1.

Figure 1. The loan process

A Web process is composed of Web services and transitions. Web services are represented by circles, and transitions are represented by arrows. Transitions express dependencies between Web services. A Web service with more than one outgoing transition can be classified as an and-split or xor-split. And-split Web services enable all their outgoing transitions after completing their execution. Xor-split Web services enable only one outgoing transition after completing their execution. A Web service with more than one incoming transition can be classified as an and-join or xor-join. And-join Web services start their execution when all their incoming transitions are enabled. Xor-join Web services are executed as soon as one of their incoming transitions is enabled. And-split, xor-split, and-join, and xor-join Web services are each drawn with a distinct symbol in the process graph.
The Web process of this scenario is composed of 14 Web services. The Fill Loan Request Web service allows clients to request a loan from the bank. In this step, the client is asked to fill out an electronic form with personal information and data describing the conditions of the loan being requested.
The second Web service, Check Loan Type, determines the type of loan a client has requested and, based on the type, forwards the request to one of three Web services: Check Home Loan, Check Educational Loan, or Check Car Loan.
Educational loans are not handled and managed automatically. After an educational loan application is submitted and checked, a notification is immediately sent informing the client that he or she has to contact the bank personally.
A loan request can be either accepted (Approve Home Loan and Approve Car Loan) or rejected (Reject Home Loan and Reject Car Loan). In the case of a home loan, however, the loan can also be approved conditionally. The Web service Approve Home Loan Conditionally, as the name suggests, approves a home loan under a set of conditions.
The following formula is used to determine whether a loan is approved or rejected:

MP = (L*R*(1+R/12)^(12*NY)) / (-12 + 12*(1+R/12)^(12*NY))     (1)

where MP = monthly payment, L = loan amount, R = interest rate, and NY = number of years.
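By way of illustration, Equation 1 can be computed directly; the Python sketch below follows the formula's variable names, and the approval threshold at the end is purely hypothetical (the article does not state the bank's actual decision rule).

```python
def monthly_payment(loan_amount: float, annual_rate: float, years: int) -> float:
    """Equation 1: monthly payment for a loan with monthly compounding."""
    n = 12 * years                 # number of monthly installments
    r = annual_rate / 12           # monthly interest rate
    growth = (1 + r) ** n
    return loan_amount * annual_rate * growth / (-12 + 12 * growth)

# Example values loosely patterned after Table 3; the 6% rate is assumed.
mp = monthly_payment(129_982.0, 0.06, 33)

# Hypothetical decision rule: approve if the installment fits the income.
income = 1361.0
approved = mp <= 0.6 * income
print(f"monthly payment = {mp:.2f}, approved = {approved}")
```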
A Web service with more than one incoming transition
can be classified as an and-join or xor-join. And-join When the result of a loan application is known, it is
Web services start their execution when all their incom- e-mailed to the client. Three Web services are respon-
ing transitions are enabled. Xor-join Web services are sible for notifying the client: Notify Home Loan Client,
executed as soon as one of the incoming transitions is Notify Education Loan Client, and Notify Car Loan
enabled. As with and-split and xor-split Web services, Client. Finally, the Archive Application Web service
and-join and xor-join Web services are represented with creates a report and stores the loan application data in a
the symbols and , respectively. database record.
The Web process of this scenario is composed of 14
Web services. The Fill Loan Request Web service al- Web Process Log
lows clients to request a loan from the bank. In this step,
the client is asked to fill out an electronic form with During the execution of Web processes (such as the one
personal information and data describing the condition presented in Figure 1), events and messages generated
of the loan being requested. by the enactment system are stored in a Web process
The second Web service, Check Loan Type, deter- log. These data stores provide an adequate format on
mines the type of loan a client has requested and, based which path mining can be performed. The data includes
on the type, forwards the request to one of three Web real-time information describing the execution and be-
services: Check Home Loan, Check Educational Loan, havior of Web processes, Web services, instances, tran-
or Check Car Loan. sitions, and other elements such as runtime QoS metrics.
Educational loans are not handled and managed auto- Table 1 illustrates an example of a modern Web process
matically. After an educational loan application is sub- log.
mitted and checked, a notification is immediately sent To perform path mining, current Web process logs
informing the client that he or she has to contact the need to be extended to store information indicating the
bank personally. values and the type of the input parameters passed to
A loan request can be either accepted (Approve Web services and the output parameters received from
Home Loan and Approve Car Loan) or rejected (Reject Web services. Table 2 shows an extended Web process
Home Loan and Reject Car Loan). In the case of a home log that accommodates input/output values of Web ser-
loan, however, the loan can also be approved condition- vices parameters generated at run time. Each Parameter/
value entry has a type, parameter name, and value (e.g.,
Figure 1. The loan process string loan-type=car-loan).
Additionally, the Web process log needs to include
path information describing the Web services that have
been executed during the enactment of a Web process.
This information can be easily stored in the log. For
example, an extra field can be added to the log system to
contain the information indicating the path followed.
The path needs only to be associated to the entry corre-
sponding to the last service of a process to be executed.
For example, in the Web process log illustrated in Table
2, the service NotifyUser is the last service of a Web
process. The log has been extended in such a way that the
NotifyUser record contains information about the path
that was followed during the Web process execution.

Table 1. Web process log

Date          | Web process     | Process instance | Web service       | Service instance | Cost | Duration
6:45 03-03-04 | LoanApplication | LA04             | RejectCarLoan     | RCL03            | $1.2 | 13 min
6:51 03-03-04 | TravelRequest   | TR08             | FillRequestTravel | FRT03            | $1.1 | 14 min
6:59 03-03-04 | TravelRequest   | TR09             | NotifyUser        | NU07             | $1.4 | 24 hrs
7:01 03-03-04 | InsuranceClaim  | IC02             | SubmitClaim       | SC06             | $1.2 | 05 min

Table 2. Extended Web process log

Process instance | Web service       | Service instance | Parameter/value                                 | Path
LA04             | RejectCarLoan     | RCL03            | int LoanNum=14357; string loan-type=car-loan    |
LA04             | NotifyCLoanClient | NLC07            | string e-mail=jf@uma.pt                         |
LA05             | CheckLoanRequest  | CLR05            | double income=12000; string Name=Eibe Frank     |
TR09             | NotifyUser        | NU07             | String e-mail=jf@uma.pt; String tel=35129170023 | FillForm->CheckForm->Approve->Sign->Report

Web Process Profile

When beginning work on path mining, it is necessary to elaborate a profile for each Web process. A profile provides the input to machine learning and is characterized by its values on a fixed, predefined set of attributes. The attributes correspond to the Web service input/output parameters that have been stored previously in the Web process log. Path mining will be performed on these attributes.
A profile contains two types of attributes, numeric and nominal. Numeric attributes measure numbers, either real or integer-valued. For example, Web service input or output parameters that are of type byte, decimal, int, short, or double will be placed in the profile and classified as numeric. In Table 2, the parameters LoanNum, income, BudgetCode, and tel will be classified as numeric in the profile.
Nominal attributes take on values within a finite set of possibilities. Nominal quantities have values that are distinct symbols. For example, the parameter loan-type from the loan application, present in Table 2, is nominal because it can take the finite set of values home-loan, education-loan, and car-loan. In my approach, string and Boolean data types manipulated by Web services are considered to be nominal attributes.

Profile Classification

The attributes present in a profile trigger the execution of a specific set of Web services. Therefore, with each profile previously constructed, I associate an additional attribute, the path attribute, indicating the path followed when the attributes of the profile have been assigned specific values. The path attribute is a target class. Classification algorithms classify samples or instances into target classes.
After the profiles and a path attribute value for each profile have been determined, I can use data-mining methods to establish a relationship between the profiles and the paths followed at run time. One method appropriate to deal with my problem is classification. In classification, a learning scheme takes a set of classified profiles, from which it is expected to learn a way of classifying unseen profiles. Because the path of each training profile is provided, my methodology uses supervised learning.
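A minimal sketch of how a profile could be assembled from the extended log is shown below; the record layout and helper names are assumptions for illustration, not the author's implementation. Each profile is a fixed attribute vector (numeric or nominal values) labeled with the path recorded for the last service of the instance.

```python
# Hypothetical log records: (process instance, service, {parameter: value}, path or None).
log = [
    ("LA04", "RejectCarLoan", {"LoanNum": 14357, "loan-type": "car-loan"}, None),
    ("LA04", "NotifyCLoanClient", {"e-mail": "jf@uma.pt"}, None),
    ("TR09", "NotifyUser", {"e-mail": "jf@uma.pt", "tel": "35129170023"},
     "FillForm->CheckForm->Approve->Sign->Report"),
]

def build_profiles(records, attributes):
    """Group log entries by process instance and emit (profile, path) pairs."""
    profiles, paths = {}, {}
    for instance, _service, params, path in records:
        profile = profiles.setdefault(instance, {a: None for a in attributes})
        profile.update({k: v for k, v in params.items() if k in attributes})
        if path is not None:        # the path is logged with the last service
            paths[instance] = path
    return [(profiles[i], paths[i]) for i in paths]

attributes = ["LoanNum", "loan-type", "e-mail", "tel"]
for profile, path in build_profiles(log, attributes):
    print(profile, "->", path)
```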
EXPERIMENTS

In this section, I present the results of applying my algorithm to a synthetic loan dataset. To generate a synthetic dataset, I start with the process presented in the introductory scenario and, using it as a process model graph, log a set of process instance executions. The data are lists of event records stored in a Web process log consisting of process names, instance identification, Web service names, variable names, and so forth. Table 3 shows the additional data that have been stored in the Web process log. The information includes the Web service variable values that are logged by the system and the path that has been followed during the execution of instances. Each entry corresponds to an instance execution.

Table 3. Additional data stored in the Web process log

Income  | Loan_Type      | Loan_amount | Loan_years | Name         | SSN      | Path
1361.0  | Home-Loan      | 129982.0    | 33         | Bernard-Boar | 10015415 | FR>CLT>CHL>AHL>NHC>CA
Unknown | Education-Loan | Unknown     | Unknown    | John-Miller  | 15572979 | FR>CLT>CEL>CA
1475.0  | Car-Loan       | 15002.0     | 9          | Eibe-Frank   | 10169316 | FR>CLT>CCL>ACL>NCC>CA

Web process profiles provide the input to machine learning and are characterized by a set of six attributes: income, loan_type, loan_amount, loan_years, name, and SSN. The profiles for the loan process contain two types of attributes: numeric and nominal. The attributes income, loan_amount, loan_years, and SSN are numeric, whereas the attributes loan_type and name are nominal. As an example of a nominal attribute, loan_type can take the finite set of values home-loan, education-loan, and car-loan. These attributes correspond to the Web service input/output parameters that have been stored previously in the Web process log presented in Table 3.
Each profile is associated with a class indicating the path that has been followed during the execution of a process when the attributes of the profile have been assigned specific values. The last column of Table 3 shows the class named path. The profiles and path attributes will be used to establish a relationship between the profiles and the paths followed at run time. The profiles and the class path have been extracted from the Web process log.
After profiles are constructed and associated with paths, these data are combined and formatted to be analyzed using Weka (2004), a suite of software for machine learning and data mining. The data are automatically formatted using the ARFF format. I have used the J48 algorithm, Weka's implementation of the C4.5 (Hand, Mannila, & Smyth, 2001) decision tree learner, to classify profiles. The C4.5 decision tree learner is one of the best-known decision tree algorithms in the data-mining community, and the Weka system and its data format (ARFF) are among the best-known data-mining tools in academia.
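The article trains Weka's J48 (C4.5) learner on ARFF-formatted profiles. As a hedged illustration only, the sketch below does the analogous thing with scikit-learn's CART-style decision tree (not C4.5, and not the author's setup); the toy rows mirror Table 3 and the unseen profile's values are made up.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder

# Toy profiles patterned after Table 3: [income, loan_type, loan_amount, loan_years].
X_raw = [
    [1361.0, "Home-Loan",      129982.0, 33],
    [0.0,    "Education-Loan", 0.0,      0],
    [1475.0, "Car-Loan",       15002.0,  9],
]
y = ["FR>CLT>CHL>AHL>NHC>CA", "FR>CLT>CEL>CA", "FR>CLT>CCL>ACL>NCC>CA"]

# Encode the nominal loan_type column; numeric columns pass through unchanged.
encoder = OrdinalEncoder()
loan_types = encoder.fit_transform([[row[1]] for row in X_raw])
X = [[row[0], loan_types[i][0], row[2], row[3]] for i, row in enumerate(X_raw)]

clf = DecisionTreeClassifier().fit(X, y)

# Predict the path for an unseen profile.
new_profile = [[1400.0, encoder.transform([["Car-Loan"]])[0][0], 12000.0, 5]]
print(clf.predict(new_profile))
```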
Figure 2. Experimental results (percentage of correctly predicted paths versus the number of attributes used)

Each experiment has involved data from 1,000 Web process executions and a variable number of attributes (ranging from two to six). I have conducted 34 experiments, analyzing a total of 34,000 records containing data from Web process instance executions. Figure 2 shows the results that I have obtained.
The path-mining technique developed has achieved encouraging results. When three or more attributes are involved in the prediction, the system is able to predict correctly the path followed for more than 75% of the process instances. This accuracy improves when four attributes are involved in the prediction; in this case, more than 82% of the paths are correctly predicted. When five attributes are involved, I obtain a level of prediction that reaches a high of 93.4%. Involving all six attributes in the prediction gives excellent results: 88.9% of the paths are correctly predicted. When a small number of attributes is involved in the prediction, the results are not as good. For example, when only two attributes are selected, I obtain predictions that range from 25.9% to 86.7%.

FUTURE TRENDS

Currently, organizations use BPMSs, such as WfMS, to define, enact, and manage a wide range of distinct applications (Q-Link Technologies, 2002), such as insurance claims, bank loans, bioinformatic experiments (Hall, Miller, Arnold, Kochut, Sheth, & Weise, 2003), health-care procedures (Anyanwu, Sheth, Cardoso, Miller, & Kochut, 2003), and telecommunication services (Luo, Sheth, Kochut, & Arpinar, 2003).
In the future, I expect to see a wider spectrum of applications managing processes in organizations. According to the Aberdeen Group's estimates, spending in the business process management software sector (which includes workflow systems) reached $2.26 billion in 2001 (Cowley, 2002).
The concept of path mining can be used effectively in many business applications, for example, to estimate the QoS of Web processes and workflows (Cardoso, Miller, Sheth, Arnold, & Kochut, 2004), because the estimation requires the prediction of paths. Organizations operating in modern markets, such as e-commerce

activities and distributed Web services interactions, require QoS management. Appropriate quality control leads to the creation of quality products and services; these, in turn, fulfill customer expectations and achieve customer satisfaction (Cardoso, Sheth, & Miller, 2002).

CONCLUSION

BPMSs, Web processes, workflows, and workflow systems represent fundamental technological infrastructures that efficiently define, manage, and support business processes. The data generated from the execution and management of Web processes can be used to discover and extract knowledge about process executions and structure.
I have shown that one important area of Web processes to analyze is path mining. I have demonstrated how path mining can be achieved by using data-mining techniques, namely classification, to extract path knowledge from Web process logs. From my experiments, I can conclude that classification methods are a good solution to perform path mining on administrative and production Web processes.

REFERENCES

Agrawal, R., Gunopulos, D., & Leymann, F. (1998). Mining process models from workflow logs. Proceedings of the Sixth International Conference on Extending Database Technology, Spain.

Anyanwu, K., Sheth, A., Cardoso, J., Miller, J. A., & Kochut, K. J. (2003). Healthcare enterprise process development and integration. Journal of Research and Practice in Information Technology, 35(2), 83-98.

Cardoso, J., Bostrom, R. P., & Sheth, A. (2004). Workflow management systems and ERP systems: Differences, commonalities, and applications. Information Technology and Management Journal, 5(3-4), 319-338.

Cardoso, J., Miller, J., Sheth, A., Arnold, J., & Kochut, K. (2004). Quality of service for workflows and Web service processes. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 1(3), 281-308.

Cardoso, J., Sheth, A., & Miller, J. (2002). Workflow quality of service. Proceedings of the International Conference on Enterprise Integration and Modeling Technology and International Enterprise Modeling Conference, Spain.

Cowley, S. (2002, September 23). Study: BPM market primed for growth. Available from the InfoWorld Web site, http://www.infoworld.com

Hall, R. D., Miller, J. A., Arnold, J., Kochut, K. J., Sheth, A. P., & Weise, M. J. (2003). Using workflow to build an information management system for a geographically distributed genome sequence initiative. In R. A. Prade & H. J. Bohnert (Eds.), Genomics of plants and fungi (pp. 359-371). New York: Marcel Dekker.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Herbst, J., & Karagiannis, D. (1998). Integrating machine learning and workflow management to support acquisition and adaptation of workflow models. Proceedings of the Ninth International Workshop on Database and Expert Systems Applications.

Luo, Z., Sheth, A., Kochut, K., & Arpinar, B. (2003). Exception handling for conflict resolution in cross-organizational workflows. Distributed and Parallel Databases, 12(3), 271-306.

Q-Link Technologies. (2002). BPM2002: Market milestone report. Retrieved from http://www.qlinktech.com

Smith, H., & Fingar, P. (2003). Business process management (BPM): The third wave. Meghan-Kiffer Press.

Weijters, T., & van der Aalst, W. M. P. (2001). Process mining: Discovering workflow models from event-based data. Proceedings of the 13th Belgium-Netherlands Conference on Artificial Intelligence.

Weka. (2004). Weka [Computer software]. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/

KEY TERMS

Business Process: A set of one or more linked activities that collectively realize a business objective or goal, normally within the context of an organizational structure.

Business Process Management System (BPMS): Provides an organization with the ability to collectively define and model its business processes, deploy these processes as applications that are integrated with its existing software systems, and then provide managers with the visibility to monitor, analyze, control, and improve the execution of those processes.

Process Definition: The representation of a business process in a form that supports automated manipulation or enactment by a workflow management system.

Web Process: A set of Web services that carry out a specific goal.

Web Process Data Log: Records and stores events and messages generated by the enactment system during the execution of Web processes.

Web Service: Describes a standardized way of integrating Web-based applications by using open standards over an Internet protocol.

Workflow: The automation of a business process, in whole or part, during which documents, information, or tasks are passed from one participant to another for action, according to a set of procedural rules.

Workflow Management System: A system that defines, creates, and manages the execution of workflows through the use of software, which is able to interpret the process definition, interact with participants, and, where required, invoke the use of tools and applications.

Pattern Synthesis for Large-Scale Pattern Recognition
P. Viswanath
Indian Institute of Science, India

M. Narasimha Murty
Indian Institute of Science, India

Shalabh Bhatnagar
Indian Institute of Science, India

INTRODUCTION

Two major problems in applying any pattern recognition technique to large and high-dimensional data are (a) high computational requirements and (b) the curse of dimensionality (Duda, Hart, & Stork, 2000). Algorithmic improvements and approximate methods can solve the first problem, whereas feature selection (Guyon & Elisseeff, 2003), feature extraction (Terabe, Washio, Motoda, Katai, & Sawaragi, 2002), and bootstrapping techniques (Efron, 1979; Hamamoto, Uchimura, & Tomita, 1997) can tackle the second problem. We propose a novel and unified solution for these problems by deriving a compact and generalized abstraction of the data. By this term, we mean a compact representation of the given patterns from which one can retrieve not only the original patterns but also some artificial patterns. The compactness of the abstraction reduces the computational requirements, and its generalization reduces the curse of dimensionality effect. Pattern synthesis techniques accompanied by compact representations attempt to derive compact and generalized abstractions of the data. These techniques are applied with the nearest neighbor classifier (NNC), which is a popular nonparametric classifier used in many fields, including data mining, since its conception in the early 1950s (Dasarathy, 2002).

BACKGROUND

Pattern synthesis techniques, compact representations, and their application with NNC are based on more established fields:

Pattern Recognition: Statistical techniques, parametric and nonparametric methods, classifier design, nearest neighbor classification, curse of dimensionality, similarity measures, feature selection, feature extraction, prototype selection, and clustering techniques.

Data Structures and Algorithms: Computational requirements, compact storage structures, efficient nearest neighbor search techniques, approximate search methods, algorithmic paradigms, and divide-and-conquer approaches.

Database Management: Relational operators, projection, cartesian product, data structures, data management, queries, and indexing techniques.

MAIN THRUST

Pattern synthesis and compact representations, followed by their application with NNC, are described in this section.

Pattern Synthesis

Generation of artificial new patterns by using the given set of patterns is called pattern synthesis. There are two broad ways of doing pattern synthesis: model-based pattern synthesis and instance-based pattern synthesis.
In model-based pattern synthesis, a model (such as a hidden Markov model) or description (such as a probability distribution) of the data is derived first and is then used to generate new patterns. This method can be used to generate as many patterns as needed, but it has two drawbacks. First, any model depends on the underlying assumptions; hence, the synthetic patterns generated can be erroneous. Second, deriving the model might be computationally expensive. Another argument against this method is that if pattern classification is the purpose, then the model itself can be used without generating any patterns at all!
Instance-based pattern synthesis, on the other hand, uses the given training patterns and some of the properties of the data. It can generate only a finite number of

new patterns. Computationally, this method can be less expensive than deriving a model. It is especially useful for nonparametric methods, such as NNC- and Parzen-window-based density estimation (Duda et al., 2000), which directly use the training instances. Further, this method can also result in a reduction of the computational requirements.
This article presents two instance-based pattern synthesis techniques, called overlap-based pattern synthesis and partition-based pattern synthesis, and their corresponding compact representations.

Overlap-based Pattern Synthesis

Let F be the set of features (or attributes). There may exist a three-block partition of F, say {A, B, C}, with the following properties. For a given class, there is a (probabilistic) dependency among the features in A U B. Similarly, the features in B U C have a dependency. However, features in A (or C) can affect those in C (or A) only through features in B. That is, to state it more formally, A and C are statistically independent, given B. Suppose that this is the case and you are given two patterns, X = (a1, b, c1) and Y = (a2, b, c2), such that a1 is a feature-vector that can be assigned to the features in A, b to the features in B, and c1 to the features in C. Similarly, a2, b, and c2 are feature-vectors that can be assigned to features in A, B, and C, respectively. Our argument, then, is that the two patterns (a1, b, c2) and (a2, b, c1) are also valid patterns in the same class or category as X and Y. If these two new patterns are not already in the class of patterns, it is only because of the finite nature of the set. We call this generation of additional patterns overlap-based pattern synthesis, because this kind of synthesis is possible only if the two given patterns have the same feature-values for the features in B. In the given example, feature-vector b is common between X and Y and therefore is called the overlap. This method is suitable only for discrete-valued features (which can also be of symbolic or categorical types). If more than one such partition exists, then the synthesis technique is applied sequentially with respect to the partitions in some order.
One simple example to illustrate this concept is as follows. Consider a supermarket sales database where two records, (bread, milk, sugar) and (coffee, milk, biscuits), are given. Assume that a known dependency exists between (a) bread and milk, (b) milk and sugar, (c) coffee and milk, and (d) milk and biscuits. The two new records that can then be synthesized are (bread, milk, biscuits) and (coffee, milk, sugar). Here, milk is the overlap. A compact representation in this case is shown in Figure 1, where a path from left to right denotes a data item or pattern. So you get four patterns in total (two original and two synthetic patterns) from the graph shown in Figure 1. Association rules derived from association rule mining (Han & Kamber, 2000) can be used to find these kinds of dependencies. A generalization of this concept and its compact representation for large datasets are described in the paragraphs that follow.

Figure 1. A compact representation

If the set of features F can be arranged in an order such that F = {f1, f2, ..., fd} is an ordered set, with fk being the kth feature, and all possible three-block partitions can be represented as Pi = {Ai, Bi, Ci} such that Ai = (f1, ..., fa), Bi = (fa+1, ..., fb), and Ci = (fb+1, ..., fd), then the compact representation, called the overlap pattern graph, can be described with the help of an example.

Overlap Pattern Graph (OLP-graph)

Let F = (f1, f2, f3, f4, f5). Let two partitions satisfying the conditional independence requirement be P1 = {{f1}, {f2, f3}, {f4, f5}} and P2 = {{f1, f2}, {f3, f4}, {f5}}. Let three given patterns be (a,b,c,d,e), (p,b,c,q,r), and (u,v,c,q,w). Because (b,c) is common between the 1st and 2nd patterns under P1, two synthetic patterns that can be generated are (a,b,c,q,r) and (p,b,c,d,e). Likewise, three other synthetic patterns that can be generated are (u,v,c,q,r), (p,b,c,q,w), and (a,b,c,q,w). (Note that the last synthetic pattern is derived from two earlier synthetic patterns.) A compact representation, called the overlap pattern graph (OLP-graph), for the entire set (including both given and synthetic patterns) is shown in Figure 2, where a path from left to right represents a pattern. The graph is constructed by inserting the given patterns, whereas the patterns that can be extracted out of the graph form the entire synthetic set consisting of both original and synthetic patterns. Thus, from the graph in Figure 2, a total of eight patterns can be extracted, five of which are new synthetic patterns.

Figure 2. OLP-graph
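The following Python sketch illustrates the overlap-based synthesis step for one three-block partition; it is an illustration of the idea only (the authors' OLP-graph construction in Viswanath, Murty, & Bhatnagar, 2003, is more involved). Patterns whose middle blocks match exchange their third blocks.

```python
from itertools import combinations

def overlap_synthesis(patterns, partition):
    """Overlap-based synthesis for one three-block partition.

    partition is a triple of index tuples (A, B, C); two patterns with equal
    values on B contribute the patterns obtained by swapping their C blocks.
    """
    A, B, C = partition
    project = lambda p, block: tuple(p[i] for i in block)
    synthetic = set(map(tuple, patterns))
    for x, y in combinations(patterns, 2):
        if project(x, B) == project(y, B):      # same overlap on B
            swapped1, swapped2 = list(x), list(y)
            for i in C:
                swapped1[i], swapped2[i] = y[i], x[i]
            synthetic.add(tuple(swapped1))
            synthetic.add(tuple(swapped2))
    return synthetic

# The OLP-graph example: F = (f1..f5), partitions P1 and P2 applied in sequence.
given = [("a", "b", "c", "d", "e"), ("p", "b", "c", "q", "r"), ("u", "v", "c", "q", "w")]
P1 = ((0,), (1, 2), (3, 4))
P2 = ((0, 1), (2, 3), (4,))
result = overlap_synthesis(list(overlap_synthesis(given, P1)), P2)
print(len(result))   # 8 patterns: 3 original + 5 synthetic, as in the text
```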
An OLP-graph can be constructed by scanning the given dataset only once, and the construction is independent of the order in which the given patterns are considered. An approximate method for finding partitions, a method for
construction of the OLP-graph, and its application to NNC are described in Viswanath, Murty, and Bhatnagar (2003). For large datasets, this representation drastically reduces the space requirement.

Partition-based Pattern Synthesis

Let P = {A1, A2, ..., Ak} be a k-block partition of F such that A1, A2, ..., and Ak are statistically independent subsets. Let Y be the given dataset and ProjA1(Y) be the projection of Y onto the features in the subset A1. Then the cartesian product ProjA1(Y) x ProjA2(Y) x ... x ProjAk(Y) is the synthetic set generated using this method.
The partition P satisfying the above requirement can be obtained from domain knowledge or from association rule mining. Approximate partitioning methods using mutual information or pair-wise correlation between the features can also be used to get suitable partitions and are experimentally demonstrated to work well with NNC in Viswanath, Murty, and Bhatnagar (2004).
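A minimal sketch of partition-based synthesis, assuming patterns are plain tuples and a partition is a list of index blocks: the synthetic set is simply the cartesian product of the distinct per-block projections, reassembled into full patterns.

```python
from itertools import product

def partition_synthesis(patterns, blocks):
    """Synthetic set = cartesian product of the distinct projections on each block."""
    # Distinct projection of the dataset onto each block of features.
    projections = [{tuple(p[i] for i in block) for p in patterns} for block in blocks]
    dim = sum(len(block) for block in blocks)
    synthetic = set()
    for parts in product(*projections):
        pattern = [None] * dim
        for block, values in zip(blocks, parts):
            for i, v in zip(block, values):
                pattern[i] = v
        synthetic.add(tuple(pattern))
    return synthetic

# Hypothetical two-block partition of a six-feature space.
data = [("a", "b", "c", "d", "e", "f"), ("a", "u", "c", "d", "v", "w")]
blocks = [(0, 1, 2), (3, 4, 5)]
print(sorted(partition_synthesis(data, blocks)))   # 2 x 2 = 4 patterns
```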
Partitioned Pattern Count Tree (PPC-tree)

This compact representation, suitable for storing the partition-based synthetic patterns, is based on the Pattern Count Tree (PC-tree) (Ananthanarayana, Murty, & Subramanian, 2001), which is a prefix-tree-like structure. The PPC-tree is a generalization of the PC-tree such that a PC-tree is built for each of the projected pattern sets, that is, ProjAi(Y) for i = 1 to k. The PPC-tree is a more compact structure than the PC-tree; it can be built by scanning the dataset only once, and its construction is independent of the order in which the given patterns are considered. The construction method and properties of the PPC-tree can be found in Viswanath et al. (2004). As an example, consider four given patterns, (a,b,c,d,e,f), (a,u,c,d,v,w), (a,b,p,d,e,f), and (a,b,c,d,v,w), and a two-block partition, where the first block consists of the first three features and the second block consists of the remaining three features. The corresponding PPC-tree is shown in Figure 3, where each node contains a feature-value and an integer called count, which gives the number of patterns sharing that node. The figure shows two PC-trees, one corresponding to each block. A path in the PC-tree for block 1 (from root to leaf) concatenated with a path in the PC-tree for block 2 gives a pattern that can be extracted. So a total of six patterns can be extracted from the structures shown in Figure 3.

Figure 3. PPC-tree
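A rough sketch of the idea behind the PPC-tree, assuming a simple nested-dictionary prefix tree per block (the actual structure in Viswanath et al., 2004, is more elaborate; the count bookkeeping here is only illustrative). Extractable patterns are the concatenations of root-to-leaf paths of the per-block trees.

```python
from itertools import product

def build_prefix_tree(projections):
    """Insert each projected pattern into a nested dict; values hold prefix counts."""
    root = {}
    for proj in projections:
        node = root
        for value in proj:
            node = node.setdefault(value, {"_count": 0, "_children": {}})
            node["_count"] += 1            # number of patterns sharing this node
            node = node["_children"]
    return root

def paths(tree, prefix=()):
    """Enumerate root-to-leaf paths, i.e., the distinct projected patterns."""
    if not tree:
        yield prefix
        return
    for value, node in tree.items():
        yield from paths(node["_children"], prefix + (value,))

patterns = [("a","b","c","d","e","f"), ("a","u","c","d","v","w"),
            ("a","b","p","d","e","f"), ("a","b","c","d","v","w")]
blocks = [(0, 1, 2), (3, 4, 5)]
trees = [build_prefix_tree([tuple(p[i] for i in b) for p in patterns]) for b in blocks]
extracted = [left + right for left, right in product(*(list(paths(t)) for t in trees))]
print(len(extracted))   # 6 patterns, as in the Figure 3 example
```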
An Application of Synthetic Patterns with Nearest Neighbor Classifier

Classification and prediction methods are among the various elements of data mining and knowledge discovery, including association rule mining, clustering, link analysis, rule induction, and so forth (Wang, 2003). The nearest neighbor classifier (NNC) is a very popular nonparametric classifier (Dasarathy, 2002). It is widely used in various fields because of its simplicity and good performance. Its conceptual equivalents in other fields (such as artificial intelligence) are instance-based learning, lazy learning, memory-based reasoning, case-based reasoning, and so forth. Theoretically, with an infinite number of samples, its error rate is upper-bounded by twice the error of the Bayes classifier (Duda et al., 2000). NNC is a lazy classifier in the sense that no general model (like a decision tree) is built until a new sample needs to be classified. So NNC can easily adapt to situations where the dataset changes frequently, but computational requirements and the curse of dimensionality are the two major issues that need to be addressed in order to make the classifier suitable for data mining applications (Dasarathy, 2002). Prototype selection (Susheela Devi & Murty, 2002), feature selection (Guyon & Elisseeff, 2003), feature extraction (Terabe et al., 2002), compact representations of the datasets (Khan, Ding, & Perrizo, 2002), and bootstrapping (Efron, 1979; Hamamoto et al., 1997) are some of the remedies for these problems. But all these techniques need to be applied one after the other; that is, they cannot be combined. The pattern synthesis techniques and compact representations described in this article provide a unified solution.
Efficient implementations of NNC that can work directly with the OLP-graph are given in Viswanath et al. (2003). These implementations use dynamic programming techniques to reuse partial distance computations and thus reduce the classification time requirement. Partition-based pattern synthesis is presented in Viswanath et al. (2004), where an efficient NNC implementation with constant classification time is presented. These methods are, in general, based on the divide-and-conquer approach and work directly and efficiently with the compact representation. Thus, the computational requirements are reduced. Because the total number of patterns considered (i.e., the size of the synthetic set) is much larger than the number in the given training set, the effect of the curse of dimensionality is reduced.

FUTURE TRENDS

Integration of various data mining tasks, such as association rule mining, feature selection, feature extraction, classification, and so forth, by using compact and generalized abstractions is a very promising direction. Apart from NNC, other applications using synthetic patterns and the corresponding compact representations, including decision tree induction, rule deduction, and clustering, are also interesting areas that can give valuable results.

CONCLUSION

Pattern synthesis is a novel technique that can enhance the generalization ability of lazy learning methods such as NNC by reducing the curse of dimensionality effect; also, the corresponding compact representations can reduce the computational requirements. In essence, it derives a compact and generalized abstraction of the given dataset. Overlap-based pattern synthesis and partition-based pattern synthesis are two instance-based pattern synthesis methods for which the OLP-graph and the PPC-tree are the respective compact storage structures.

REFERENCES

Ananthanarayana, V. S., Murty, M. N., & Subramanian, D. K. (2001). An incremental data mining algorithm for compact realization of prototypes. Pattern Recognition, 34, 2249-2251.

Dasarathy, B. V. (2002). Data mining tasks and methods: Classification: Nearest-neighbor approaches. In Handbook of data mining and knowledge discovery (pp. 288-298). New York: Oxford University Press.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). Wiley.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7, 1-26.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.

Hamamoto, Y., Uchimura, S., & Tomita, S. (1997). A bootstrap technique for nearest neighbor classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 73-79.

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann.

Khan, M., Ding, Q., & Perrizo, W. (2002). K-nearest neighbor classification on spatial data streams using P-trees. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 517-528), Taiwan.

Susheela Devi, V., & Murty, M. N. (2002). An incremental prototype set building technique. Pattern Recognition, 35, 505-513.

Terabe, M., Washio, T., Motoda, H., Katai, O., & Sawaragi, T. (2002). Attribute generation based on association rules. Knowledge and Information Systems, 4(3), 329-349.

Viswanath, P., Murty, M. N., & Bhatnagar, S. (2003). Overlap pattern synthesis with an efficient nearest neighbor classifier (Indian Institute of Science Tech. Rep. 1/2003). Retrieved from http://keshava.csa.iisc.ernet.in/techreports/2003/overlap.pdf

Viswanath, P., Murty, M. N., & Bhatnagar, S. (2004). Fusion of multiple approximate nearest neighbor classifiers for fast and efficient classification [Electronic version]. Information Fusion, 5(4), 239-250.

Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

KEY TERMS

Bootstrapping: Generating artificial patterns from the given original patterns. (This does not mean that the artificial set is larger in size than the original set; also, artificial patterns need not be distinct from the original patterns.)

Curse of Dimensionality Effect: When the number of samples needed to estimate a function (or model) grows exponentially with the dimensionality of the data.

Compact and Generalized Abstraction of the Training Set: A compact representation built by using the training set from which not only the original patterns but also some new synthetic patterns can be derived.

Prototype Selection: The process of selecting a few representative samples from the given training set suitable for the given task.

Training Set: The set of patterns whose class labels are known and that are used by the classifier in classifying a given pattern.

Physical Data Warehousing Design


Ladjel Bellatreche
LISI/ENSMA, France

Mukesh Mohania
IBM India Research Lab, India

INTRODUCTION

Recently, organizations have increasingly emphasized applications in which current and historical data are analyzed and explored comprehensively, identifying useful trends and creating summaries of the data in order to support high-level decision making. Every organization keeps accumulating data from different functional units, so that they can be analyzed (after integration), and important decisions can be made from the analytical results. Conceptually, a data warehouse is extremely simple. As popularized by Inmon (1992), it is a subject-oriented, integrated, time-invariant, non-updatable collection of data used to support management decision-making processes and business intelligence. A data warehouse is a repository into which are placed all data relevant to the management of an organization and from which emerge the information and knowledge needed to effectively manage the organization. This management can be done using data-mining techniques, comparisons of historical data, and trend analysis. For such analysis, it is vital that (1) data should be accurate, complete, consistent, well defined, and time-stamped for informational purposes; and (2) data should follow business rules and satisfy integrity constraints. Designing a data warehouse is a lengthy, time-consuming, and iterative process. Due to the interactive nature of a data warehouse application, having fast query response time is a critical performance goal. Therefore, the physical design of a warehouse gets the lion's share of the research done in the data warehousing area. Several techniques have been developed to meet the performance requirements of such applications, including materialized views, indexing techniques, partitioning and parallel processing, and so forth. Next, we briefly outline the architecture of a data warehousing system.

BACKGROUND

The conceptual architecture of a data warehousing system is shown in Figure 1. Data of a warehouse is extracted from operational databases (relational, object-oriented, or object-relational) and external sources (legacy data, other file formats) that may be distributed, autonomous, and heterogeneous. Before integrating this data into a warehouse, it should be cleaned to minimize errors and to fill in missing information, when possible, and transformed to reconcile semantic conflicts that can be found in the various sources. The cleaned and transformed data are finally integrated into the warehouse. Since the sources are updated periodically, it is necessary to refresh the warehouse. This component also is responsible for managing the warehouse data, creating indices on data tables, partitioning data, and updating meta-data. The warehouse data contain the detail data, summary data, consolidated data, and/or multi-dimensional data. The data typically are accessed and analyzed using tools, including OLAP query engines, data mining algorithms, information visualization tools, statistical packages, and report generators.
The meta-data generally is held in a separate repository. The meta-data contain the informational data about the creation, management, and usage of tools (e.g., analytical tools, report writers, spreadsheets, and data-mining tools) for analysis and informational purposes. Basically, the OLAP server interprets client queries (the client interacts with the front-end tools and passes these queries to the OLAP server) and converts them into the complex SQL queries required to access the warehouse data. It also might access the data warehouse. It serves as a bridge between the users of the warehouse and the data contained in it. The warehouse data also are accessed by the OLAP server to present the data in a multi-dimensional way to the front-end tools. Finally, the OLAP server passes the multi-dimensional views of data to the front-end tools, which format the data according to the clients' requirements.
The warehouse data are typically modeled multi-dimensionally. The multi-dimensional data model has been proved to be the most suitable for OLAP applications. OLAP tools provide an environment for decision making and business modeling activities by supporting ad hoc queries. There are two ways to implement a multi-dimensional data model: (1) by using the underlying relational architecture (star schemas, snowflake schemas)

Figure 1. A data warehousing architecture (information sources; warehouse creation and management component with meta-data; ROLAP/MOLAP OLAP server; and front-end tools for analysis, data mining, and report generation)

to project a pseudo-multi-dimensional model (an example is the Informix Red Brick Warehouse); and (2) by using true multi-dimensional data structures, such as arrays (an example is the Hyperion Essbase OLAP Server (Hyperion, 2000)). The advantage of the MOLAP architecture is that it provides a direct multi-dimensional view of the data, whereas the ROLAP architecture is just a multi-dimensional interface to relational data. On the other hand, the ROLAP architecture has two major advantages: (i) it can be used and easily integrated into other existing relational database systems; and (ii) relational data can be stored more efficiently than multi-dimensional data.
Data warehousing query operations include standard SQL operations, such as selection, projection, and join. In addition, a warehouse supports various extensions to aggregate functions, such as percentile functions (e.g., top 20th percentile of all products), rank functions (e.g., top 10 products), mean, mode, and median. One of the important extensions to the existing query language is support for multiple group-bys, by defining roll-up, drill-down, and cube operators. Roll-up corresponds to doing a further group-by on the same data object. Note that the roll-up operator is order sensitive; that is, when it is defined in the extended SQL, the order of columns (attributes) matters. The function of a drill-down operation is the opposite of roll-up.

OLTP vs. OLAP

Relational database systems (RDBMSs) are designed to record, retrieve, and manage large amounts of real-time transaction data and to keep the organization running by supporting daily business transactions (e.g., update transactions). These systems generally are tuned for a large community of users, and the user typically knows what is needed and generally accesses a small number of rows in a single transaction. Indeed, relational database systems are suited for robust and efficient Online Transaction Processing (OLTP) on operational data. Such OLTP applications have driven the growth of the DBMS industry in the past three decades and will doubtless continue to be important. One of the main objectives of relational systems is to maximize transaction throughput and minimize concurrency conflicts. However, these systems generally have limited decision support functions and do not extract all the necessary information required for faster, better, and intelligent decision making for the growth of an organization. For example, it is hard for an RDBMS to answer the following query: What are the supply patterns for product ABC in New Delhi in 2003, and how were they different from the year 2002? Therefore, it has become important to support analytical processing capabilities in organizations for (1) the efficient management of organizations, (2) effective marketing strategies, and (3) efficient and intelligent decision making. OLAP tools are well suited for complex data analysis, such as multi-dimensional data analysis, and assist in decision support activities that access data from a separate repository called a data warehouse, which selects data from many operational legacy, and possibly heterogeneous, data sources. Table 1 summarizes the differences between OLTP and OLAP.

MAIN THRUST

Decision-support systems demand speedy access to data, no matter how complex the query. To satisfy this objective, many optimization techniques exist in the literature. Most of these techniques are inherited from traditional relational database systems. Among them are materialized views (Bellatreche et al., 2000; Jixue et al., 2003; Mohania et al., 2000; Sanjay et al., 2000, 2001), indexing methods (Chaudhuri et al., 1999; Jürgens et al., 2001; Stockinger, 2000), data partitioning (Bellatreche et al., 2000, 2002, 2004; Gopalkrishnan et al., 2000; Kalnis et al., 2002; Oracle, 2004; Sanjay et al., 2004), and parallel processing (Datta et al., 2000).

Table 1. Comparison between OLTP and OLAP applications

Criteria         | OLTP                      | OLAP
User             | Clerk, IT professional    | Decision maker
Function         | Day-to-day operations     | Decision support
DB design        | Application-oriented      | Subject-oriented
Data             | Current                   | Historical, consolidated
View             | Detailed, flat relation   | Summarized, multi-dimensional
Usage            | Structured, repetitive    | Ad hoc
Unit of work     | Short, simple transaction | Complex query
Access           | Read/write                | Append mostly
Records accessed | Tens                      | Millions
Users            | Thousands                 | Tens
Size             | MB-GB                     | GB-TB
Metric           | Transaction throughput    | Query throughput

Materialized Views

Materialized views are used to precompute and store aggregated data, such as sums of sales. They also can be used to precompute joins, with or without aggregations. So, materialized views are used to reduce the overhead associated with expensive joins or aggregations for a large or important class of queries. Two major problems related to materializing views are (1) the view-maintenance problem and (2) the view-selection problem. Data in the warehouse can be seen as materialized views generated from the underlying multiple data sources. Materialized views are used to speed up query processing on large amounts of data. These views need to be maintained in response to updates in the source data. This often is done using incremental techniques that access data from the underlying sources. In a data-warehousing scenario, accessing base relations can be difficult; sometimes data sources may be unavailable, since these relations are distributed across different sources. For these reasons, self-maintainability of the views is an important issue in data warehousing. The warehouse views can be made self-maintainable by materializing some additional information, called auxiliary relations, derived from the intermediate results of the view computation. Several algorithms, such as the counting algorithm and the exact-change algorithm, have been proposed in the literature for maintaining materialized views.
To answer queries efficiently, a set of views that are closely related to the queries is materialized at the data warehouse. Note that not all possible views are materialized, as we are constrained by some resource such as disk space, computation time, or maintenance cost. Hence, we need to select an appropriate set of views to materialize under some resource constraint. The view selection problem (VSP) consists in selecting a set of materialized views that satisfies the query response time under some resource constraints. All studies have shown that this problem is NP-hard. Most of the proposed algorithms for the VSP are static. This is because each algorithm starts with a set of frequently asked queries (known a priori) and then selects a set of materialized views that minimize the query response time under some constraint. The selected materialized views will benefit only a query belonging to the set of a priori known queries. The disadvantage of this kind of algorithm is that it contradicts the dynamic nature of decision support analysis. Especially for ad hoc queries, where the expert user is looking for interesting trends in the data repository, the query pattern is difficult to predict.
from the underlying multiple data sources. Materialized
views are used to speed up query processing on large Indexing Techniques
amounts of data. These views need to be maintained in
response to updates in the source data. This often is done Indexing has been the foundation of performance tuning
using incremental techniques that access data from under- for databases for many years. It creates access struc-
lying sources. In a data-warehousing scenario, accessing tures that provide faster access to the base data relevant
base relations can be difficult; sometimes data sources to the restriction criteria of queries. The size of the index
may be unavailable, since these relations are distributed structure should be manageable, so that benefits can be
across different sources. For these reasons, the issue of accrued by traversing such a structure. The traditional
self-maintainability of the view is an important issue in indexing strategies used in database systems do not
data warehousing. The warehouse views can be made self- work well in data warehousing environments. Most OLTP
maintainable by materializing some additional information, transactions typically access a small number of rows;
called auxiliary relations, derived from the intermediate most OLTP queries are point queries. B-trees, which are
results of the view computation. There are several algo- used in the most common relational database systems,
rithms, such as counting algorithm and exact-change algo- are geared toward such point queries. They are well
rithm, proposed in the literature for maintaining material- suited for accessing a small number of rows. An OLAP
ized views. query typically accesses a large number of records for
To answer the queries efficiently, a set of views that are summarizing information. For example, an OLTP transac-
closely related to the queries is materialized at the data tion would typically query for a customer who booked a
warehouse. Note that not all possible views are material- flight on TWA 1234 on April 25, for instance; on the
ized, as we are constrained by some resource like disk other hand, an OLAP query would be more like give me
space, computation time, or maintenance cost. Hence, we the number of customers who booked a flight on TWA
need to select an appropriate set of views to materialize 1234 in one month, for example. The second query would
under some resource constraint. The view selection prob- access more records that are of a type of range queries.
lem (VSP) consists in selecting a set of materialized views B-tree indexing scheme is not suited to answer OLAP
that satisfies the query response time under some resource queries efficiently. An index can be a single-column or
constraints. All studies showed that this problem is an NP- multi-column table (or view). An index either can be
hard. Most of the proposed algorithms for the VSP are clustered or non-clustered. An index can be defined on

An index can be defined on one table (or view) or on many tables by using a join index. In the data warehouse context, when we talk about indexes, we refer to two different things: (1) indexing techniques and (2) the index selection problem. A number of indexing strategies have been suggested for data warehouses: the value-list index, projection index, bitmap index, bit-sliced index, data index, join index, and star join index.
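Bitmap indexes are one of the strategies just listed (and are defined again under Key Terms). The sketch below is a minimal, assumption-laden illustration of the idea: one bit vector per distinct value of the indexed column, with restriction predicates answered by bitwise operations; the columns and values are hypothetical.

```python
def build_bitmap_index(column):
    """One bitmap per distinct value; bit i is 1 if row i holds that value."""
    index = {}
    for i, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << i          # set bit i in the bitmap for this value
    return index

# Hypothetical low-cardinality columns of a small fact table.
loan_type = ["car", "home", "car", "education", "home", "car"]
region    = ["north", "north", "south", "south", "north", "south"]

bm_loan, bm_region = build_bitmap_index(loan_type), build_bitmap_index(region)

# "car loans in the south": AND the two bitmaps, then list matching row ids.
hits = bm_loan["car"] & bm_region["south"]
rows = [i for i in range(len(loan_type)) if hits >> i & 1]
print(rows)   # [2, 5]
```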
Data Partitioning and Parallel Processing

The data partitioning process decomposes large tables (fact tables, materialized views, indexes) into multiple (relatively) small tables by applying selection operators. Consequently, partitioning offers significant improvements in availability, administration, and table scan performance (Oracle9i).
Two types of partitioning are possible to decompose a table: vertical and horizontal. In vertical fragmentation, each partition consists of a set of columns of the original table. In horizontal fragmentation, each partition consists of a set of rows of the original table. Two versions of horizontal fragmentation are available: primary horizontal fragmentation and derived horizontal fragmentation. The primary horizontal partitioning (HP) of a relation is performed using predicates that are defined on that table. On the other hand, the derived partitioning of a table results from predicates defined on another relation. In the context of ROLAP, data partitioning is applied as follows (Bellatreche et al., 2002): it starts by fragmenting the dimension tables and then, by using derived horizontal partitioning, it decomposes the fact table into several fact fragments. Moreover, by partitioning the data of a ROLAP schema (star schema or snowflake schema) among a set of processors, OLAP queries can be executed in parallel, potentially achieving a linear speedup and thus significantly improving query response time (Datta et al., 1998). Therefore, data partitioning and parallel processing are two complementary techniques for achieving the reduction of query processing cost in data warehousing environments.
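A small sketch of the primary/derived horizontal partitioning idea described above, using in-memory rows rather than a real DBMS; the table layouts, key names, and partitioning predicate are all hypothetical.

```python
# Dimension table rows (primary horizontal partitioning uses a predicate on this table).
customer = [
    {"cust_id": 1, "region": "north"},
    {"cust_id": 2, "region": "south"},
    {"cust_id": 3, "region": "north"},
]
# Fact table rows referencing the dimension through a foreign key.
sales = [
    {"cust_id": 1, "amount": 10.0},
    {"cust_id": 2, "amount": 25.0},
    {"cust_id": 3, "amount": 7.5},
    {"cust_id": 2, "amount": 12.0},
]

# Primary horizontal partitioning of the dimension table by region.
dim_fragments = {}
for row in customer:
    dim_fragments.setdefault(row["region"], []).append(row)

# Derived horizontal partitioning of the fact table: a fact row follows the
# fragment of the dimension row it references.
fact_fragments = {
    region: [f for f in sales if f["cust_id"] in {r["cust_id"] for r in rows}]
    for region, rows in dim_fragments.items()
}
print({region: len(rows) for region, rows in fact_fragments.items()})   # {'north': 2, 'south': 2}
```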
mance of queries even more. Since indices and material-
ized views compete for the same resource (storage), we
FUTURE TRENDS found that it is possible to apply heuristics to distribute
the storage space among materialized views and indices
It has been seen that many enterprises are moving toward so as to efficiently execute queries and maintain material-
building the Operational Data Store (ODS) solutions for ized views and indexes.
real-time business analysis. The ODS gets data from one It has been seen that enterprises are moving toward
or more Enterprise Resource Planning (ERP) systems and building the data warehousing and operational data store
keeps the most recent version of information for analysis solutions.
rather than the history of data. Since the Client Relation-

REFERENCES

Bellatreche, L. et al. (2000). What can partitioning do for your data warehouses and data marts? Proceedings of the International Database Engineering and Applications Symposium (IDEAS), Yokohama, Japan.

Bellatreche, L. et al. (2002). PartJoin: An efficient storage and query execution for data warehouses. Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Aix-en-Provence, France.

Bellatreche, L. et al. (2004). Bringing together partitioning, materialized views and indexes to optimize performance of relational data warehouses. Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Zaragoza, Spain.

Bellatreche, L., Karlapalem, K., & Schneider, M. (2000). On efficient storage space distribution among materialized views and indices in data warehousing environments. Proceedings of the International Conference on Information and Knowledge Management (ACM CIKM), McLean, VA, USA.

Chaudhuri, S., & Narasayya, V. (1999). Index merging. Proceedings of the International Conference on Data Engineering (ICDE), Sydney, Australia.

Datta, A., Moon, B., & Thomas, H. M. (2000). A case for parallelism in data warehousing and OLAP. Proceedings of the DEXA Workshop, London, UK.

Gopalkrishnan, V., Li, Q., & Karlapalem, K. (2000). Efficient query processing with associated horizontal class partitioning in an object relational data warehousing environment. Proceedings of the International Workshop on Design and Management of Data Warehouses, Stockholm, Sweden.

Hyperion. (2000). Hyperion Essbase OLAP Server. http://www.hyperion.com/

Inmon, W. H. (1992). Building the data warehouse. John Wiley.

Jixue, L., Millist, W., Vincent, M., & Mohania, K. (2003). Maintaining views in object-relational databases. Knowledge and Information Systems, 5(1), 50-82.

Jürgens, M., & Lenz, H. (2001). Tree based indexes versus bitmap indexes: A performance study. International Journal of Cooperative Information Systems (IJCIS), 10(3), 355-379.

Kalnis, P. et al. (2002). An adaptive peer-to-peer network for distributed caching of OLAP results. Proceedings of the International Conference on Management of Data (ACM SIGMOD), Wisconsin, USA.

Mohania, M., & Kambayashi, Y. (2000). Making aggregate views self-maintainable. Data & Knowledge Engineering, 32(1), 87-109.

Oracle Corp. (2004). Oracle9i enterprise edition partitioning option. Retrieved from http://otn.oracle.com/products/oracle9i/datasheets/partitioning.html

Sanjay, A., Chaudhuri, S., & Narasayya, V. R. (2000). Automated selection of materialized views and indexes in Microsoft SQL Server. Proceedings of the 26th International Conference on Very Large Data Bases (VLDB 2000), Cairo, Egypt.

Sanjay, A., Chaudhuri, S., & Narasayya, V. (2001). Materialized view and index selection tool for Microsoft SQL Server 2000. Proceedings of the International Conference on Management of Data (ACM SIGMOD), CA.

Sanjay, A., Narasayya, V., & Yang, B. (2004). Integrating vertical and horizontal partitioning into automated physical database design. Proceedings of the International Conference on Management of Data (ACM SIGMOD), Paris, France.

Stockinger, K. (2000). Bitmap indices for speeding up high-dimensional data analysis. Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), London, UK.

KEY TERMS

Bitmap Index: Consists of a collection of bitmap vectors, each of which is created to represent each distinct value of the indexed column. Bit i in the bitmap vector representing value x is set to 1 if record i in the indexed table contains x.

Cube: A multi-dimensional representation of data that can be viewed from different perspectives.

Data Warehouse: An integrated decision support database whose content is derived from the various operational databases. A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes.

Dimension: A business perspective useful for analyzing data. A dimension usually contains one or more hierarchies that can be used to drill up or down to different levels of detail.
Kalnis, P. et al. (2002). An adaptive peer-to-peer network
for distributed caching of OLAP results. Proceedings of

910

TEAM LinG
Physical Data Warehousing Design

Dimension Table: A table containing the data for one fact table. The index is implemented using one of two
dimension within a star schema. The primary key is used representations: row id or bitmap, depending on the P
to link to the fact table, and each level in the dimension has cardinality of the indexed column.
a corresponding field in the dimension table.
Legacy Data: Data that you already have and use.
Fact Table: The central table in a star schema, contain- Most often, this takes the form of records in an existing
ing the basic facts or measures of interest. Dimension database on a system in current use.
fields are also included (as foreign keys) to link to each
dimension table. Measure: A numeric value stored in a fact table or
cube. Typical examples include sales value, sales volume,
Horizontal Partitioning: Distributing the rows of a price, stock, and headcount.
table into several separate tables.
Star Schema: A simple database design in which
Join Index: Built by translating restrictions on the dimensional data are separated from fact or event data. A
column value of a dimension table to restrictions on a large dimensional model is another name for star schema.

911

TEAM LinG
912

Predicting Resource Usage for Capital Efficient


Marketing
D.R. Mani
Massachusetts Institute of Technology, USA, and Harvard University, USA

Andrew L. Betz
Progressive Insurance, USA

James H. Drew
Verizon Laboratories, USA

INTRODUCTION Figure 1. Marketing initiatives to boost average demand


can indirectly increase peak demand to beyond capacity.
A structural conflict exists in businesses that sell services
whose production costs are discontinuous and whose Capital Expense vs. Resource Capacity
consumption is continuous but variable. A classic ex-
ample is in businesses where capital-intensive infrastruc- 100
ture is necessary for provisioning service, but the capac-
80
ity resulting from capital outlay is not always fully and
Resource Capacity

Capacity
efficiently utilized. Marketing departments focus on ini- 60
Peak Demand
tiatives that increase infrastructure usage to improve 40
both customer retention and ongoing revenue. Engineer-
ing and operations departments focus on the cost of 20 Avg Demand

service provision to improve the capital efficiency of rev- 0


enue dollars received. Consequently, a marketing initiative 0 1 2 3 4 5 6

to increase infrastructure usage may be resisted by engi- Capital Outlay

neering, if its introduction would require great capital


expense to accommodate that increased usage. This con- provide conclusive positions on specific marketing initia-
flict is exacerbated when a usage-enhancing initiative tends tives, and their engineering consequences, the usage
to increase usage variability so that capital expenditures revenues, and infrastructure performance predicted by
are triggered with only small increases in total usage. these models provide systematic, sound, and quantita-
A data warehouse whose contents encompass both tive input for making balanced and cost-effective busi-
these organizational functions has the potential to medi- ness decisions.
ate this conflict, and data mining can be the tool for this
mediation. Marketing databases typically have customer
data on rate plans, usage, and past responses to market-
ing promotions. Engineering databases generally record
BACKGROUND
infrastructure locations, usages, and capacities. Other
information often is available from both general domains In this business context, applying data mining
to allow for the aggregation, or clustering, of customer (Abramowicz & Zurada, 2000; Berry & Linoff, 2004; Han
types, rate plans, and marketing promotions, so that & Kamber, 2000) to capital efficient marketing is illus-
marketing proposals and their consequences can be evalu- trated here by a study from wireless telephony 1 (Green,
ated systematically to aid in decision making. These 2000), where marketing plans2 introduced to utilize excess
databases generally contain such voluminous or compli- off-peak network capacity3 (see Figure 1) potentially could
cated data that classical data analysis tools are inad- result in requiring fresh capital outlays by indirectly
equate. In this article, we look at a case study where data driving peak demand to levels beyond current capacity.
mining is applied to predicting capital-intensive resource We specifically consider marketing initiatives (e.g.,
or infrastructure usage, with the goal of guiding marketing rate plans with free nights and weekends) that are aimed
decisions to enable capital-efficient marketing. Although at stimulating off-peak usage. Given a rate plan with a fixed
the data mining models developed in this article do not peak minute allowance, availability of extra off-peak min-

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Predicting Resource Usage for Capital Efficient Marketing

Figure 2. Flowchart describing data sources and data mining operations used in predicting busy-hour impact of
marketing initiatives P

utes could potentially increase peak usage. The quantifi- lifetime value5, access charge, subscribed rate plan, and
cation of this effect is complicated by the corporate reality peak and off-peak minutes used. Rate plan data provide
of myriad rate plans and geographically extensive and details for a given rate plan, including monthly charges,
complicated peak usage patterns. In this study, we use allowed peak, off-peak, weekend minutes of use, per-
data mining methods to analyze customer, call detail, rate minute charges for excess use, long distance, roaming
plan, and cell-site location data to predict the effect of charges, and so forth. Call detail data, for every call placed
marketing initiatives on busy-hour4 network utilization. in a given time period, provide the originating and termi-
This will enable forecasting network cost of service for nating phone numbers (and, hence, originating and termi-
marketing initiatives, thereby leading to optimization of nating customers), cell sites used in handling the call, call
capital outlay. duration, and other call details. Cell site location data
indicate the geographic location of cell sites, capacity of
each cell site, and details about the physical and electro-
MAIN THRUST magnetic configuration of the radio towers.

Ideally, the capital cost of a marketing initiative is ob- Data Mining Process
tained by determining the existing capacity, the increased
capacity required under the new initiative, and then fac- Figure 2 provides an overview of the analysis and data-
toring the cost of the additional capital; data for a study mining process. The numbered processes are described in
like this would come from a corporate data warehouse more detail to illustrate how the various components are
(Berson & Smith, 1997) that integrates data from relevant integrated into an exploratory tool that allows marketers
sources. Unfortunately, such detailed cost data are not and network engineers to evaluate the effect of proposed
available in most corporations and businesses. In fact, in initiatives.
many situations, the connection between promotional
marketing initiatives and capital cost is not even recog- 1. Cell-Site Clustering: Clustering cell sites using
nized. In this case study, we therefore need to assemble the geographic location (latitude, longitude) results
relevant data from different and disparate sources in order in cell site clusters that capture the underlying
to predict the busy-hour impact of marketing initiatives. population density, with cluster area generally in-
versely proportional to population. This is a natural
Data consequence of the fact that heavily populated
urban areas tend to have more cell towers to cover
The parallelograms in the flowchart in Figure 2 indicate the large call volumes and provide good signal
essential data sources for linking marketing initiatives to coverage. The flowchart for cell site clustering is
busy-hour usage. Customer data characterize the cus- included in Figure 3 with results of k-means cluster-
tomer by indicating a customers mobile phone number(s), ing (Hastie, Tibshirani & Friedman, 2001) for the San

913

TEAM LinG
Predicting Resource Usage for Capital Efficient Marketing

Figure 3. Steps 1and 2 in the data mining process outlined Francisco area shown in Figure 4. The cell sites in
in Figure 2 the area are grouped into four clusters, each cluster
approximately circumscribed by an oval.
2. Predictive Modeling: The predictive modeling
stage, shown in Figure 3, merges customer and rate
plan data to provide the data attributes (features)
used for modeling. The target for modeling is ob-
tained by combining cell site clusters with network
usage information. Note that the sheer volume of
call detail data makes its summarization and merg-
ing with other customer and operational data a
daunting one. See Berry and Linoff (2000) for a
discussion. The resulting dataset has related cus-
tomer characteristics and rate plan details for every
customer, matched up with that customers actual
network usage and the cell sites providing service
for that customer. This data can then be used to
build a predictive model. Feature selection (Liu &
Motoda, 1998) is performed, based on correlation
to target, with grouped interactions taken into
account. The actual predictive model is based on
linear regression (Hand, Mannila & Smyth, 2001;
Hastie, Tibshirani & Friedman, 2001), which, for
this application, performs similar to neural network
models (Haykin, 1994). Note that the sheer number
of customer and rate plan characteristics requires
the variable reduction capabilities of a data-mining
solution.

For each of the four cell site clusters shown in Figure


4, Figure 5 shows the actual vs. the predicted busy-hour
usage for calls initiated during a specific month in the San

Figure 4. Clustering cell sites in the San Francisco area, based on geographic position. Each spot represents a cell
site, with the ovals showing approximate location of cell site clusters.

914

TEAM LinG
Predicting Resource Usage for Capital Efficient Marketing

Figure 5. Regression modeling results showing predicted plans into a small number of groups. Without such
vs. actual busy-hour usage for the cell site clusters shown clustering, the complex data combinations and analy- P
in Figure 4. ses needed to predict busy-hour usage will result in
very little predictive power. Clusters ameliorate the
situation by providing a small set of representative
Ac tu a l vs . P red ic ted B H U sa g e
Cell Cluster 1
Ac tu al vs . P re dic te d B H U sa g e
Cell Cluster 2
cases to use in the prediction.
10 10

Similar to cell site clustering, based on empirical explo-


Predicted BH Usage (min)

8 8

Predicted BH Usage (min)


y = 0.6615x + 0.3532

6
R2 = 0.672
6
y = 0.3562x + 0.5269
R2 = 0.3776
ration, we decide on using four clusters for both custom-
4 4
ers and rate plans. Results of the clustering for customers
2 2
and applicable rate plans in the San Francisco area are
shown in Figures 6 and 7.
0 0
0 2 4 6 8 10 0 2 4 6 8 10
Actual BH Usage (min) Actual BH Usage (min)

Ac tu a l v s. P red icted B H U s ag e A ctu a l vs . P red icte d B H U s ag e


Cell Cluster 3 Cell Cluster 4
5. Integration and Application of the Predictive Model:
10 10
Stringing all the pieces together, we can start with
Predicted BH Usage (min)

a proposed rate plan and an inwards projection6 of


Predicted BH Usage (min)

8 8

6 6

4
y = 0.3066x + 0.6334
R2 = 0.3148 4
y = 0.3457x + 0.4318 the expected number of customers who would sub-
R2 = 0.3591

2 2
scribe to the plan and predict the busy-hour usage
0
0 2 4 6 8 10
0 on targeted cell sites. The process is outlined in
Figure 8. Inputs labeled A, B, and C come from the
0 2 4 6 8 10
Actual BH Usage (min) Actual BH Usage (min)

respective flowcharts in Figures 2, 6, and 7.

Francisco region. With statistically significant R2 correla- Validation


tion values ranging from about 0.31 to 0.67, the models,
while not perfect, are quite good at predicting cluster- Directly evaluating and validating the analytic model
level usage. developed would require data summarizing the capital
cost of a marketing initiative. This requires determining
3. and 4. Customer and Rate Plan Clustering: Using k- the existing capacity, the increased capacity required
means clustering to cluster customers based on under the new initiative, and then factoring the cost of the
their attribute, and rate plans, based on their salient additional capital. Unfortunately, as mentioned earlier,
features, results in categorizing customers and rate such detailed cost data are generally unavailable. There-

Figure 6. Customer clustering flowchart and results of Figure 7. Rate plan clustering flowchart and clustering
customer clustering. The key attributes identifying the results. The driving factor that distinguishes rate plan
four clusters are access charge (ACC_CHRG), lifetime clusters include per-minute charges (intersystem, peak,
value (LTV_PCT), peak minutes of use (PEAK_MOU), and off-peak) and minutes of use.
and total calls (TOT_CALL).

915

TEAM LinG
Predicting Resource Usage for Capital Efficient Marketing

Figure 8. Predicting busy-hour usage at a cell site for a given marketing initiative

fore, we validate the analytic model by estimating its ity could be useful, for example, when price elasticity data
impact and comparing the estimate with known param- suggest inward projection differences by cluster.
eters. Starting at the cluster level, we applied the predictive
We began this estimation by identifying the 10 most model to estimate the average busy-hour usage for each
frequent rate plans in the data warehouse. From this list, customer on each cell tower. These cluster-level predic-
we selected one rate plan for validation. Based on market- tions were disaggregated to the cellular-tower level by
ing area and geographic factors, we assume an equal assigning busy-hour minutes proportionate to total min-
inwards projection for each customer cluster. Of course, utes for each cell tower within the respective cellular-
our approach allows for, but does not require, different tower cluster. Following this, we then calculated the
inward projections for the customer clusters. This flexibil- actual busy-hour usage per customer of that rate plan
across the millions of call records. In Figure 9, a scatter
plot of actual busy-hour usage against predicted busy-
Figure 9. Scatter plot showing actual vs. predicted hour usage, with an individual cell tower now the unit of
busy-hour usage for each cell site, for a specific rate plan analysis, reveals an R2 correlation measure of 0.13.
The estimated model accuracy dropped from R2s in the
mid 0.30s for cluster-level data (Figure 5) to about 0.13
Actual vs. Predicted BH Usage for Each Cell Site when the data were disaggregated to the cellular tower
Rate Plan RDCDO
level (Figure 9). In spite of the relatively low R2 value, the
1 correlation is statistically significant, indicating that this
0.8
approach can make contributions to the capital estimates
Predicted BH Usage

y = 1.31x - 0.1175
R2 = 0.1257
of marketing initiatives. However, the model accuracy on
0.6 disaggregated data was certainly lower than the effects
0.4
observed at the cluster level. The reasons for this loss in
accuracy probably could be attributed to the fine-grained
0.2
disaggregation, the large variability among the cell sites
0 in terms of busy-hour usage, the proportionality assump-
0 0.1 0.2 0.3 0.4 0.5 0.6 tion made in disaggregating, and model sensitivity to
Actual BH Usage inward projections. These data point out both an oppor-

916

TEAM LinG
Predicting Resource Usage for Capital Efficient Marketing

Figure 10. Lifetime value (LTV) distributions for customers with below and above average usage and BH usage
indicates that high-LTV customers also impact the network with high BH usage P

LTV distribution for LTV distribution for


customers with less than customers with more than
average usage and BH average usage and BH
usage usage

tunity (that rate plans can be intelligently targeted to the customer lifetime value density plots as a function of
specific locations with excess capacity) and a threat (that strain that they place on the network. The left panel in
high busy-hour volatility would lead engineering to be Figure 10 shows LTV density for customers with below-
cautious in allowing usage-increasing plans). average total usage and below-average busy-hour usage.
The panel on the right shows LTV density for customers
Business Insights with above-average total usage and above-average busy-
hour usage. Although the predominant thinking in mar-
In the context of the analytic model, the data also can keting circles is that higher LTV is always better, the data
reveal some interesting business insights. First, observe suggest this reasoning should be tempered by whether

Figure 11. Exploring customer clusters and rate plan clusters in the context of BH usage shows that customers in a
cluster who subscribe to a class of rate plans impact BH usage differently from customers in a different cluster

BH Impact by Customer & Rate Plan Cluster

B H Impact by C ustome r C luste r


0.25 0.25
Proportion of Heavy BH Users

0.2
0.2 0.2
Proportion of Heavy BH Users

0.15 0.15 0.15

0.1 0.1 0.1

0.05 0.05 0.05

0 0
0
1 2 3 4 1 4
Customer Cluster Customer Cluster
A
Rate Plan Cluster 2 Rate Plan Cluster 3 B

917

TEAM LinG
Predicting Resource Usage for Capital Efficient Marketing

the added value in revenue offsets the disproportionate target specific customer segments with rate plan promo-
strain on network resources. This is the basis for a tions in order to increase overall usage (and, hence,
fundamental tension between marketing and engineering retention), while more efficiently using, but not exceeding
functions in large businesses. network capacity. A further practical advantage of this
Looking at busy-hour impact by customer cluster and data-mining approach is that all of the customer groups,
rate plan cluster is also informative, as shown in Figure 11. cell site locations, and rate plan promotions are simplified
For example, if we define heavy BH users as customers through clustering in order to simplify their characteristic
who are above average in total minutes as well as busy- representation facilitating productive discussion among
hour minutes, we can see main effect differences across upper-management strategists.
customer clusters (Figure 11a). We have sketched several important practical busi-
This is not entirely surprising, as we have already seen ness insights that come from this methodology. A basic
that LTV was a major determinant of customer cluster and concept to be demonstrated to upper management is that
heavy BH customers also skewed towards having higher busy-hour usage, the main driver of equipment capacity
LTV. However, there was an unexpected crossover inter- needs, varies greatly by customer segment and by general
action of rate plan cluster by customer cluster when heavy cell site grouping (see Figure 9) and that usage can be
BH users were the target (Figure 11b). The implication is differently volatile over site groupings (see Figure 5).
that controlling for revenue, certain rate plan types are These differences point out the potential need for target-
more network-friendly, depending on the customer clus- ing specific customer segments with specific rate plan
ter under study. Capital-conscious marketers in theory promotions in specific locations. One interesting example
could tailor their rate plans to minimize network impact by is illustrated in Figure 11, where two customer segments
tuning rate plans more precisely to customer segments. respond differently to each of two candidate rate plans,
one group decreasing its BH usage under one plan and the
other decreasing it under the other plan. This leads to the
FUTURE TRENDS possibility that certain rate plans should be offered to
specific customer groups in locations where there is little
The proof of concept case study sketched here describes excess equipment capacity, but others can be offered
an analytical effort to optimize capital costs of marketing where there is more slack in capacity.
programs, while promoting customer satisfaction and Of course, there are many organizational, implementa-
retention by utilizing the business insights gained to tion, and deployment issues associated with this inte-
tailor marketing programs to better match the needs and grated approach. All involved business functions must
requirements of customers (Lenskold, 2003). As database accept the validity of the modeling and its results, and this
systems incorporate data-mining engines (e.g., Oracle concurrence requires the active support of upper manage-
data mining [Oracle, 2004]), future software and customer ment overseeing each function. Second, these models
relationship management applications will automatically should be repeatedly run to assure their relevance in a
incorporate such an analysis to extract optimal recommen- dynamic business environment, as wireless telephony is
dations and business insights for maximizing business in our illustration. Third, provision should be made for
return on investment and customer satisfaction, leading capturing the sales and engineering consequences of the
to effective and profitable one-on-one customer relation- decisions made from this approach, to be built into future
ship management (Brown, 2000; Gilmore & Pine, 2000; models. Despite these organizational challenges, busi-
Greenberg, 2001). ness decision makers informed by these models hopefully
may resolve a fundamental business rivalry.

CONCLUSION
REFERENCES
We have made the general argument that a comprehensive
company data warehouse and broad sets of models ana- Abramowicz, W., & Zurada, J. (2000). Knowledge discov-
lyzed with modern data mining techniques can resolve ery for business information systems. Kluwer.
tactical differences between different organizations within Berry, M.A., & Linoff, G.S. (2000). Mastering data min-
the company and provide a systematic and sound basis ing: The art and science of customer relationship man-
for business decision making. The role of data mining in agement. Wiley.
so doing has been illustrated here in mediating between
the marketing and engineering functions in a wireless Berry, M.A., & Linoff, G.S. (2004). Data mining tech-
telephony company, where statistical models are used to niques: For marketing, sales, and customer relationship
management. Wiley.

918

TEAM LinG
Predicting Resource Usage for Capital Efficient Marketing

Berson, A., & Smith, S. (1997). Data warehousing, data Capital-Efficient Marketing: Marketing initiatives
mining and OLAP. McGraw-Hill. that explicitly take into account and optimize, if possible, P
the capital cost of provisioning service for the introduced
Brown, S.A. (2000). Customer relationship management: initiative or promotion.
Linking people, process, and technology. John Wiley.
Feature Selection: The process of identifying those
Drew, J., Mani, D.R., Betz, A., & Datta, P. (2001). Targeting input attributes that contribute significantly to building
customers with statistical and data-mining techniques. a predictive model for a specified output or target.
Journal of Services Research, 3(3), 205-219.
Inwards Projection: Estimating the expected number
Gilmore, J., & Pine, J. (2000). Markets of one. Harvard of customers that would sign on to a marketing promotion,
Business School Press. based on customer characteristics and profiles.
Green, J.H. (2000). The Irwin handbook of telecommuni- k-Means Clustering: A clustering method that groups
cations. McGraw-Hill. items that are close together, based on a distance metric
Greenberg, P. (2001). CRM at the speed of light: Captur- like Euclidean distance, to form clusters. The members in
ing and keeping customers in Internet real time. Mc- each of the clusters can be described succinctly using the
Graw Hill. mean (or centroid) of the respective cluster.

Han, J., & Kamber, M. (2000). Data mining: Concepts and Lifetime Value: A measure of the profit generating
techniques. Morgan Kaufmann. potential, or value, of a customer; a composite of expected
tenure (how long the customer stays with the business
Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of provider) and expected revenue (how much a customer
data mining. Bradford Books. spends with the business provider).
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Predictive Modeling: Use of statistical or data-mining
elements of statistical learning: Data mining, inference, methods to relate attributes (input features) to targets
and prediction. Springer. (outputs) using previously existing data for training in
Haykin, S. (1994). Neural networks: A comprehensive such a manner that the target can be predicted for new data
foundation. Prentice-Hall. based on the input attributes alone.

Lenskold, J.D. (2003). Marketing ROI: The path to cam-


paign, customer, and corporate profitability. Mc-Graw
Hill. ENDNOTES
Liu, H., & Motoda, H. (1998). Feature selection for knowl-
edge discovery and data mining. Kluwer. 1
Wireless telephony here refers to both the engi-
neering and commercial aspects of mobile telecom-
Mani, D.R., Drew, J., Betz, A., & Datta, P. (1999). Statistics munications by radio signal, also known as cellular
and data mining techniques for lifetime value modeling. telephony.
Proceedings of the Fifth ACM SIGKDD Conference on 2
In this article, the terms marketing plan and mar-
Knowledge Discovery and Data Mining. keting initiative refer to promotions and incen-
Oracle. (2004). Oracle Corporation. Retrieved from http:/ tives, generally mediated through a rate plana
/otn.oracle.com/products/bi/odm/index.html mobile calling plan that specifies minute usage al-
lowances at peak and off-peak times, monthly
charge, cost of extra minutes, roaming and long-
distance charges, and so forth.
KEY TERMS 3
The capacity of a cellular telephone network is the
maximum number of simultaneous calls that can be
Busy Hour: The hour at which a mobile telephone handled by the system at any given time. Times
network handles the maximum call traffic in a 24-hour during the day when call volumes are higher than
period. It is that hour during the day or night when the average are referred to as peak hours; off-peak
product of the average number of incoming calls and hours entail minimal call trafficlike nights and
average call duration is at its maximum. weekends.

919

TEAM LinG
Predicting Resource Usage for Capital Efficient Marketing

4
Busy hour (BH) is that hour during which the net- ized for each individual customer. (Drew, et al., 2001;
work utilization is at its peak. It is that hour during Mani et al., 1999).
the day or night when the product of the average 6
Inwards projection is a term used to denote an
number of incoming calls and average call duration estimate of the number of customers that would
is at its maximum. subscribe to a new (or enhanced) rate plan, with
5
The well-established and widely used concept of details reasonably summarizing characteristics of
customer lifetime value (LTV) captures expected customers expected to sign on to such a plan.
revenue and expected tenure, preferably personal-

920

TEAM LinG
921

Privacy and Confidentiality Issues in Data Mining P


Ycel Saygin
Sabanci University, Turkey

INTRODUCTION that needs to be secured. More importantly, the privacy


of the data owners needs to be taken into consideration,
Data regarding people and their activities have been depending on how the data will be used. Privacy risks
collected over the years, which has become more perva- increase even more when we consider powerful data-
sive with widespread usage of the Internet. Collected data mining tools that can be used to extract confidential
usually are stored in data warehouses, and powerful data information. We will try to address the privacy and con-
mining tools are used to turn it into competitive advan- fidentiality issues that relate to data-mining techniques
tage. Besides businesses, government agencies are among throughout this article.
the most ambitious data collectors, especially in regard to
the increase of safety threats coming from global terrorist
organizations. For example, CAPPS (Computer Assisted BACKGROUND
Passenger Prescreening System) collects flight reserva-
tion information as well as commercial information about Historically, researchers have been more interested in the
passengers. This data, in turn, can be utilized by govern- security aspect of data storage and transfer. Data security
ment security agencies. Although CAPPS represents US is still an active area of research in relation to new data
national data collection efforts, it also has an effect on storage and transfer models, such as XML. Also, new
other countries. The following sign at the KLM ticket desk media of data transfer, such as wireless, introduce new
in Amsterdam International Airport illustrates the inter- challenges.
national level of data collection efforts: Please note that The most important work on data security that relates
KLM Royal Dutch Airlines and other airlines are required to the privacy and confidentiality issues in data analysis
by new security laws in the US and several other countries is statistical disclosure control and data security in statis-
to give security customs and immigration authorities tical databases (Adam & Wortmann, 1989). The basic idea
access to passenger data. Accordingly, any information of a statistical database is to let users obtain statistical or
we hold about you and your travel arrangements may be aggregate information from a database, such as the aver-
disclosed to the concerning authorities of these countries age income and maximum income in a given department.
in your itinerary. This is a very striking example of how This is done while limiting the disclosure of confidential
the confidential data belonging to citizens of one country information, such as the salary of a particular person. The
could be handed over to authorities of some other country concept of k-anonymity, which ensures that the obtained
via newly enforced security laws. In fact, some of the values refer to at least k individuals (Sweeney, 2002), was
largest airline companies in the US, including American, proposed as a measure for the degree of confidentiality in
United, and Northwest, turned over millions of passenger aggregate data. In this way, identification of individuals
records to the FBI, according to the New York Times can be blocked up to a certain degree. However, succes-
(Schwartz & Maynard, 2004). sive query results over the database may pose a risk to
Aggressive data collection efforts by government security, since a combination of intelligently issued que-
agencies and enterprises have raised a lot of concerns ries may infer confidential data. The general problem of
among people about their privacy. In fact, the Total adversaries being able to obtain sensitive data from non-
Information Awareness (TIA) project, which aims to build sensitive data is called the inference problem (Farkas &
a centralized database that will store the credit card Jajodia, 2003).
transactions, e-mails, Web site visits, and flight details of
Americans was not funded by Congress due to privacy
concerns. Although the privacy of individuals is pro- MAIN THRUST
tected by regulations, such regulations may not be enough
to ensure privacy against new technologies such as data Data mining is considered to be a tool for analyzing large
mining. Therefore, important issues need to be consid- collections of data. As a result, government and industry
ered by the data collectors and the data owners. First of seek to exploit its potential by collecting extensive infor-
all, the data itself may contain confidential information mation about people and their activities. Coupled with its

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Privacy and Confidentiality Issues in Data Mining

power to extract previously unknown knowledge, data horizontally. If the schema of the local databases are
mining also has become one of the main targets of privacy complementary to each other, then this means that the
advocates. In reference to data mining, there are two main data are vertically distributed. The Purdue research group
concerns regarding privacy and confidentiality: considered both horizontal and vertical data distribution
for privacy-preserving association rule mining. In both
(1) Protecting the confidentiality of data and privacy of cases, the individual association rules together with their
individuals against data mining methods. statistical properties were assumed to be confidential.
(2) Enabling the data mining algorithms to be used over Also, secure multi-part computation techniques were
a database without revealing private and confiden- employed, based on specialized encryption practices, to
tial information. make sure that the confidential association rules were
circulated among the participating companies in encrypted
We should note that, although both of the tracks seem form. The resulting global association rules can be ob-
to be similar to each other, their purposes are different. In tained in a private manner without each company knowing
the first case, the data may not be confidential, but data- which rule belongs to which local database.
mining tools could be used to infer confidential informa- Both of the seminal works on privacy-preserving data
tion. Therefore, some sanitization techniques are needed mining (Agrawal & Srikant, 2000; Kantarcioglu & Clifton,
to modify the database, so that privacy concerns are 2002) have shaped the research in this area. They show
alleviated. In the second track, the data are considered how data mining can be done on private data in a central-
confidential and are perturbed before they are given to a ized and distributed environment. Although a specific
third party. Both of the approaches are necessary in order data-mining model was the target in both of these papers,
to achieve full privacy of individuals and will be discussed these authors initiated ideas that could be applied to other
in detail in the following subsections. data-mining models.

Privacy Preserving Data Mining Privacy and Confidentiality Against


Data Mining
In the year 2000, Agrawal and Srikant, in their seminal paper
published in ACM SIGMOD proceedings, coined the term Data mining and data warehousing technology have a
privacy preserving data mining. Based on this work, great impact on data analysis and business decision
privacy-preserving data mining can be defined as a tech- making, which motivated the collection of data on practi-
nology that enables data-mining algorithms to work on cally everything, ranging from customer purchases to
encrypted or perturbed data (Agrawal & Srikant, 2000). We navigation patterns of Web site visitors. As a result of
think of it as mining without seeing the actual data. The data collection efforts, privacy advocates in the US are
authors have pointed out that the collected data may be now raising their voices against the misuse of data, even
outsourced to third parties to perform the actual data if it was collected for security purposes. The following
mining. However, before the data could be handed over to quote from the New York Times article demonstrates the
third parties, the confidential values in the database, such point of privacy against data mining (Markoff, 2002):
as the salary of employees, needs to be perturbed. The
research results are for specific data-mining techniques; Pentagon has released a study that recommends the
namely, decision tree construction for classification. The government to pursue specific technologies as potential
main idea is to perturb the confidential data values in a way safeguards against the misuse of data-mining systems
that the original data distribution could be reconstructed similar to those now being considered by the government
but not the original data values. In this way, the classifica- to track civilian activities electronically in the United
tion techniques that rely on the distribution of the data to States and abroad.
construct a model can still work within a certain error margin
that the authors approve. This shows us that even the most aggressive data
Another interesting work on privacy-preserving data collectors in the US are aware of the fact that the data-
mining was conducted in the US at Purdue University mining tools could be misused, and we need a mechanism
(Kantarcioglu & Clifton, 2002). This work was in the to protect the confidentiality and privacy of people.
context of association rules in a distributed environment. The initial work by Chang and Moskovitz (2000) points
The authors considered a case in which there were mul- out the possible threats imposed by data mining tools. In
tiple companies that had their own confidential local this work, the authors argue that even if some of the data
databases that they did not want to share with others. values are deleted (or hidden) in a database, there is still
When the datasets of the individual companies have the a threat of the hidden data being recovered. They show
same schema, then this means that the data are distributed that a classification model could be constructed using

922

TEAM LinG
Privacy and Confidentiality Issues in Data Mining

part of the released data as a training set, which can be used The New York Times article cited previously high-
further to predict the hidden data values. Chang and lights another important aspect of privacy issues in data P
Moskovitz (2000) propose a feedback mechanism that will mining by saying, Perhaps the strongest protection
try to construct a prediction model and go back to the against abuse of information systems is Strong Audit
database to update it with the aim of blocking the predic- mechanisms we need to watch the watchers (Markoff,
tion of confidential data values. 2002). This confirms that data-mining tools also should
Another aspect of privacy and confidentiality against be monitored, and users access to data via data-mining
data mining is that data-mining results, such as the pat- tools should be controlled. Although there is some initial
terns in a database, could be confidential. For example, a work on this issue (Oliveira, Zaiane & Saygin, 2004), how
database could be released with the thought that the this could be achieved is still an open problem waiting to
confidential values are hidden and cannot be queried. be addressed, since there is no available access control
However, there may be patterns in the database that are mechanism specifically developed for data-mining tools
confidential. This issue was first pointed out by OLeary similar to the one employed in database systems.
(1991). Clifton and Marks (1996) further elaborate on the Privacy-preserving data-mining techniques have been
issue of patterns being confidential by specific examples proposed, but they may not fully preserve the privacy, as
of data-mining models, such as association rules. The pointed out in Agrawal and Aggarwal (2001) and
association rules, which are very popular in retail market- Evfimievski, Gehrke, and Srikant (2003). Therefore, pri-
ing, may contain confidential information that should not vacy metrics and benchmarks are needed to assess the
be disclosed. In Verykios, et al. (2004), a way of identifying privacy threats and the effectiveness of the proposed
confidential association rules and sanitizing the database privacy-preserving data-mining techniques or the pri-
to limit the disclosure of the association rules are dis- vacy breaches introduced by data-mining techniques.
cussed. Furthermore, the work illustrates ways in which
heuristics for updating the database can reduce the signifi-
cance of the association rules in terms of their support and CONCLUSION
confidence. The main idea is to change the transactions in
the database that contribute to the support and confi- Data mining has found a lot of applications in the indus-
dence of the sensitive (or private) association rules in a try and government, due to its success in combining
way that the support and/or confidence of the rules de- machine learning, statistics, and database fields with the
crease with a limited effect on the non-sensitive rules. aim of turning heaps of data into valuable knowledge.
Widespread usage of the Internet, especially for e-com-
merce and other services, has led to the collection and
FUTURE TRENDS storage of more data at a lower cost. Also, large data
repositories about individuals and their activities, coupled
Privacy and confidentiality issues in data mining will be with powerful data mining tools, have increased fears
more and more crucial as data collection efforts increase about privacy. Therefore, data mining researchers have
and the type of data collected becomes more diverse. A felt the urge to address the privacy and confidentiality
typical example of this is the usage of RFID tags and mobile issues. Although privacy-preserving data mining is still
phones, which reveal our sensitive location information. in its infancy, some promising results have been achieved
As more data become available and more tools are imple- as the outcomes of initial studies. For example, data
mented to search this data, the privacy risks will increase perturbation techniques enable data-mining models to
even more. Consider the World Wide Web, which is a huge be built on private data, and encryption techniques allow
data repository with very powerful search engines work- multiple parties to mine their databases as if their data
ing on top of it. An adversary can find the phone number were stored in a central database. The threat of powerful
of a person from some source and use the search engine data-mining tools revealing confidential information also
Google to obtain the address of that person. Then, by has been addressed. The initial results in this area have
applying the address to a tool such as Mapquest, door-to- shown that confidential patterns in a database can be
door driving directions to the corresponding persons concealed by specific hiding techniques. For the ad-
home could be found easily. This is a very simple example vance of the data-mining technology, privacy issues
of how the privacy of a person could be in danger by need to be investigated more, and the current problems,
integrating data from a couple of sources, together with a such as privacy leaks in privacy-preserving data-mining
search engine over a data repository. Data integration over algorithms, and the scalability issues need to be re-
multiple data repositories will be one of the main chal- solved. In summary, data-mining technology is needed
lenges in privacy and confidentiality of data against data to make our lives easier and to increase our safety
mining. standards, but, at the same time, privacy standards should

923

TEAM LinG
Privacy and Confidentiality Issues in Data Mining

be established and enforced on data collectors in order to Conference on Knowledge Discovery and Data Mining,
protect the privacy of data owners against the misuse of Sydney, Australia.
the collected data.
Saygin, Y., Verykios, V.S., & Clifton, C. (2001). Using
unknowns to prevent the discovery of association rules.
SIGMOD Record, 30(4), 45-54.
REFERENCES
Schwartz, J., & Micheline, M. (2004, May 1). Airlines gave
Adam, N.R., & Wortmann, J.C. (1989). Security-control F.B.I. millions of records on travelers after 9/11. New York
methods for statistical databases: A comparative study. Times.
ACM Computing Surveys, 21(4), 515-556.
Sweeney, L. (2003). k-anonymity: A model for protecting
Agrawal, D., & Aggarwal, C. (2001). On the design and privacy. International Journal on Uncertainty, Fuzziness
quantification of privacy preserving data mining algo- and Knowledge-Based Systems, 10(5), 557-570.
rithms. Proceedings of ACM Symposium on Principles of
Database Systems, Santa Barbara, California. Verykios, V.S., Elmagarmid, A., Bertino, E., Saygin, Y., &
Dasseni, E. (2004). Association rule hiding. IEEE Trans-
Agrawal, R., & Srikant, R. (2000). Privacy preserving data actions on Knowledge and Data Engineering, 16(4),
mining. Proceedings of SIGMOD Conference, Dallas, 434-447.
Texas.
Clifton, C., & Marks, D. (1996). Security and privacy
implications of data mining. Proceedings of ACM Work- KEY TERMS
shop on Data Mining and Knowledge Discovery, Montreal
Canada. Data Perturbation: Modifying the data so that origi-
Evfimievski, A.V., Gehrke, J., & Srikant, R. (2003). Limiting nal confidential data values cannot be recovered.
privacy breaches in privacy preserving data mining. Pro- Distributed Data Mining: Performing the data-mining
ceedings of ACM Symposium on Principles of Database task on data sources distributed in different sites.
Systems, San Diego, California.
K-Anonymity: A privacy metric, which ensures an
Farkas, C., & Jajodia, S. (2003). The inference problem: A individuals information cannot be distinguished from at
survey. ACM SIGKDD Explorations, 4(2), 6-12. least k-1 other people when a data source is disclosed.
Kantarcioglu, M., & Clifton, C. (2002). Privacy-preserving Privacy Against Data Mining: Preserving the privacy
distributed mining of association rules on horizontally of individuals against data-mining tools when disclosed
partitioned data. Proceedings of The ACM SIGMOD data contain private information that could be extracted
Workshop on Research Issues on Data Mining and Knowl- by data-mining tools.
edge Discovery, Madison, Wisconsin.
Privacy Preserving Data Mining: Performing the data-
Markoff, J. (2002, December 19). Study seeks technology mining task on private data sources (centralized or distributed).
safeguards for privacy. New York Times (p. 18).
Secure Multiparty Computation: Computing the re-
OLeary, D.E. (1991). Knowledge discovery as a threat to sult of an operation (i.e., sum, min, max) on private data
database security. In G. Piatetsky-Shapiro, & W.J. Frawley (e.g., finding the richest person among a group of people
(Eds.), Knowledge discovery in databases (pp. 507-516). without revealing the wealth of the individuals).
AAI/MIT Press.
Statistical Database: Type of database system that is
Oliveira, S., Zaiane, O., & Saygin, Y. (2004). Secure asso- designed to support statistical operations while prevent-
ciation rule sharing. Proceedings of the 8th Pacific-Asia ing operations that could lead to the association of
individuals with confidential data.

924

TEAM LinG
925

Privacy Protection in Association Rule Mining P


Neha Jha
Indian Institute of Technology, Kharagpur, India

Shamik Sural
Indian Institute of Technology, Kharagpur, India

INTRODUCTION allel and distributed computing environments has be-


come crucial for successful business operations. The
Data mining technology has emerged as a means for problem is not simply that the data is distributed, but that
identifying patterns and trends from large sets of data. it must remain distributed. For instance, transmitting large
Mining encompasses various algorithms, such as discov- quantities of data to a central site may not be feasible or
ery of association rules, clustering, classification and it may be easier to combine results than to combine the
prediction. While classification and prediction techniques sources. This has led to the development of a number of
are used to extract models describing important data distributed data mining algorithms (Ashrafi et al., 2002).
classes or to predict future data trends, clustering is the However, most of the distributed algorithms were initially
process of grouping a set of physical or abstract objects developed from the point of view of efficiency, not secu-
into classes of similar objects. rity. Privacy concerns arise when organizations are will-
Alternatively, association rule mining searches for ing to share data mining results but not the data. The data
interesting relationships among items in a given data set. could be distributed among several custodians, none of
The discovery of association rules from a huge volume of who are allowed to transfer their data to another site. The
data is useful in selective marketing, decision analysis transaction information may also be specific to individual
and business management. A popular area of application users who do not appreciate the disclosure of individual
is the market basket analysis, which studies the buying record values. Consider another situation. A multina-
habits of customers by searching for sets of items that are tional company wants to mine its own data spread over
frequently purchased together (Han & Kamber, 2003). many countries for globally valid results while national
Let I = {i1, i 2, , in} be a set of items and D be a set of laws prevent trans-border data sharing.
transactions, where each transaction T belonging to D is Thus, a need for developing privacy preserving dis-
an itemset such that T is a subset of I. A transaction T tributed data mining algorithms has been felt in recent
contains an itemset A if A is a subset of T. An itemset A years. This keeps the information about each site secure,
with k items is called a k-itemset. An association rule is an and at the same time determines globally valid results for
implication of the form A => B where A and B are subsets taking effective business decisions. There are many vari-
of I and A B is null. The rule A => B is said to have a ants of these algorithms depending on how the data is
support s in the database D if s% of the transactions in D distributed, what type of data mining we wish to do, and
contain AUB. The rule is said to have a confidence c if c% what restrictions are placed on sharing of information.
of the transactions in D that contain A also contain B. The This paper gives an overview of the different approaches
problem of mining association rules is to find all rules to privacy protection and information security in distrib-
whose support and confidence are higher than a specified uted association rule mining. The following two sections
minimum support and confidence. An example associa- describe the traditional privacy preserving methods along
tion rule could be: Driving at night and speed > 90 mph with the pioneering work done in recent years. Finally, we
results in a road accident, where at least 70% of drivers discuss the open research issues and future directions of
meet the three criteria and at least 80% of those meeting work.
the driving time and speed criteria actually have an acci-
dent.
While data mining has its roots in the traditional fields BACKGROUND
of machine learning and statistics, the sheer volume of
data today poses a serious problem. Traditional methods A direct approach to data mining over multiple sources
typically make the assumption that the data is memory would be to run existing data mining tools at each site
resident. This assumption is no longer tenable and imple- independently and then combine the final results. While
mentation of data mining ideas in high-performance par- it is simple to implement, this approach often fails to give

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Privacy Protection in Association Rule Mining

globally valid results since a rule that is valid in one or of privacy concerns in data mining has led to a wide range
more of the individual locations need not be valid over the of proposals in the past few years. The solutions can be
entire data set. broadly categorized as those belonging to the classes of
Efforts have been made to develop methods that data obfuscation, data summarization and data separa-
perform local operations at each site to produce interme- tion. The goal of data obfuscation is to hide the data to be
diate results, which can then be used to obtain the final protected. This is achieved by perturbing the data before
result in a secure manner. For example, it can be easily delivering it to the data miner by either randomly modify-
shown that if a rule has support > m% globally, it must ing the data, swapping the values between the records or
have support > m% on at least one of the individual sites. performing controlled modification of data to hide the
This result can be applied to the distributed case with secrets. Cryptographic techniques are often employed to
horizontally partitioned data (all sites have the same encrypt the source data, perform intermediate operations
schema but each site has information on different enti- on the encrypted data and then decrypt the values to get
ties). A distributed algorithm for this would work by back the final result with each site not knowing anything
requesting each site to send all rules with support at least but the global rule. Summarization, on the other hand,
m. For each rule returned, the sites are then asked to send attempts to make available innocuous summaries of the
the count of their items that support the rule, and the total data and therefore only the needed facts are exposed.
count of all items at the site. Using these values, the global Data separation ensures that only the trusted parties can
support of each rule can be computed with the assurance see the data by making all operations and analysis to be
that all rules with support at least m have been found. performed either by the owner/creator of the data or by
This method provides a certain level of information security, since the basic data is not shared. However, the problem becomes more difficult if we want to protect not only the individual items at each site, but also how much each site supports a given rule. The above method reveals this information, which may be considered a breach of security depending on the sensitivity of any given application. Theoretical studies in the field of secure computation started in the late 1980s. In recent years, the focus has shifted more to the application field (Maurer, 2003). The challenge is to apply the available theoretical results in solving intricate real-world problems. Du and Atallah (2001) review and suggest a number of open secure computation problems, including applications in the fields of computational geometry, statistical analysis and data mining. They also suggest a method for solving secure multi-party computational geometry problems and secure computation of dot products in separate pieces of work (Atallah & Du, 2001; Ioannidis et al., 2002).

In all of the above-mentioned work, the secure computation problem has been treated with an approach to providing absolute zero knowledge, whereas corporations may not always be willing to bear the cost of zero information leakage as long as they can keep the information shared within known bounds. In the next section, we discuss some of the important approaches to privacy preserving data mining, with an emphasis on the algorithms developed for association rule mining.

MAIN THRUST

In this section, we describe the various levels of privacy protection possible while mining data and the corresponding algorithms to achieve them. Identification of privacy concerns in data mining has led to a wide range of proposals in the past few years. The solutions can be broadly categorized as belonging to the classes of data obfuscation, data summarization and data separation. The goal of data obfuscation is to hide the data to be protected. This is achieved by perturbing the data before delivering it to the data miner: by randomly modifying the data, swapping the values between records, or performing controlled modification of the data to hide the secrets. Cryptographic techniques are often employed to encrypt the source data, perform intermediate operations on the encrypted data and then decrypt the values to get back the final result, with each site knowing nothing but the global rule. Summarization, on the other hand, attempts to make available only innocuous summaries of the data, so that only the needed facts are exposed. Data separation ensures that only trusted parties can see the data, by having all operations and analysis performed either by the owner/creator of the data or by trusted third parties.

One application of the data perturbation technique is decision-tree-based classification that protects individual privacy by adding random values drawn from a normal (Gaussian) distribution with mean 0 to the actual data values (Agrawal & Srikant, 2000). Bayes' rule for density functions is then used to reconstruct the distribution. The approach is quite elegant, since it provides a method for approximating the original data distribution, not the original data values, by using the distorted data and information on the random data distribution. Similar data perturbation techniques can also be applied to the mining of Boolean association rules (Rizvi & Haritsa, 2002). It is assumed that the tuples in the database are fixed-length sequences of 0s and 1s. A typical example is the market basket application, where the columns represent the items sold by a supermarket and each row describes, through a sequence of 1s and 0s, the purchases made by a particular customer (1 indicates a purchase and 0 indicates no purchase). One interesting feature of this work is a flexible definition of privacy; for example, the ability to correctly guess a value of 1 from the perturbed data can be considered a greater threat to privacy than correctly learning a 0.
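To make the randomization idea concrete, the following sketch perturbs numeric values with zero-mean Gaussian noise in the spirit of the approach above; the data, the noise parameter and the crude histogram comparison are illustrative assumptions, not the reconstruction procedure of the cited paper.

```python
# Illustrative value perturbation with zero-mean Gaussian noise.  The miner
# sees only the perturbed values plus the noise distribution; here we merely
# compare histograms to hint at why the *distribution* remains recoverable
# even though individual values do not.  All numbers are made-up examples.
import random

random.seed(0)
true_ages = [random.randint(18, 70) for _ in range(10000)]   # private values
noise_sigma = 15.0
perturbed = [a + random.gauss(0.0, noise_sigma) for a in true_ages]

def histogram(values, lo=0, hi=100, width=10):
    """Fraction of values falling in each bucket [lo, lo+width), ..."""
    counts = [0] * ((hi - lo) // width)
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) // width)] += 1
    return [c / len(values) for c in counts]

print("original :", [round(x, 2) for x in histogram(true_ages)])
print("perturbed:", [round(x, 2) for x in histogram(perturbed)])
# The perturbed histogram is a smeared version of the original; the
# Bayes-rule reconstruction described above estimates the original
# distribution from the perturbed values and the known noise distribution.
```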
For many applications such as market basket, it is reasonable to expect that the customers would want more privacy for the items they buy compared to the items they do not. There are primarily two ways of handling this requirement. In one method, the data is changed or perturbed to a certain extent to hide the exact information that can be extracted from the original data. In the other approach, data is encrypted before the data mining algorithms are run on it.


While data perturbation techniques usually result in a transformation that leads to loss of information, so that the exact result cannot be determined, cryptographic protocols try to achieve zero-knowledge transfer using a lossless transformation. Cryptographic techniques have been developed for the ID3 classification algorithm with two parties having horizontally partitioned data (Lindell & Pinkas, 2000). While the approach is interesting, it is not very efficient for mining rules from very large databases. Also, completely secure multi-party computation may not actually be required in practical applications.

Although there may be many different data mining techniques, they often perform similar computations at various stages. For example, counting the number of items in a subset of the data shows up in both association rule mining and learning decision trees. There exist four popular methods for privacy-preserving computations that can be used to support different forms of data mining: secure sum, set union, set intersection size and scalar product. Though not all of these methods are truly secure (in some, information other than the results is revealed), they do have provable bounds on the information released. In addition, they are efficient, as the communication and computation cost is not significantly increased by the addition of the privacy-preserving component.
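The secure sum primitive listed above can be sketched in a few lines. The following toy simulation (the values, the modulus and the site-to-site message passing are assumptions for the example) shows the basic idea of masking the running total with a random offset; it is not the exact protocol of any cited paper.

```python
# Toy simulation of the secure sum primitive: site 1 masks its value with a
# random offset, every other site adds its own value to the running total,
# and site 1 finally removes the offset.  Values and modulus are made up.
import random

def secure_sum(local_values, modulus=10**9):
    """Return the sum of local_values without any site seeing another's value.

    Each site only ever sees the masked running total, which looks random
    to it as long as the initial offset stays secret.
    """
    offset = random.randrange(modulus)          # known only to site 1
    running = (offset + local_values[0]) % modulus
    for v in local_values[1:]:                  # passed from site to site
        running = (running + v) % modulus
    return (running - offset) % modulus         # site 1 unmasks the total

local_support_counts = [120, 75, 431]           # one count per site
print(secure_sum(local_support_counts))         # 626
```

In the association rule setting, the values being summed are the local support counts of a candidate itemset, and, as described below, the final threshold test can be replaced by a secure comparison so that even the exact total is not revealed.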
At this stage, it is important to understand that not all association rules have an equal need for protection. One has to analyze the sensitivity of the various association rules mined from a database (Saygin et al., 2002). From the large amount of data made available to the public through various sources, a malicious user may be able to extract association rules that were meant to be protected. Reducing the support or the confidence can hide the sensitive rules. Evfimievski et al. (2002) investigate the problem of applying randomization techniques in association rule mining. One of their most important contributions is the application of privacy preserving techniques to rule mining from categorical data. They also provide a formal definition of privacy breaches. It is quite challenging to extend this work by combining randomization techniques with cryptographic techniques to make the scheme more robust.

Kantarcioglu and Clifton (2002) propose a cryptographic technique for mining association rules in a horizontally partitioned database. It assumes that the transactions are distributed among n sites. The global support count of an itemset is the sum of all the local support counts. An itemset A is globally supported if the global support count of A is greater than s% of the total transaction database size. The global confidence of a rule A => B can be given as support(A ∪ B)/support(A), and a k-itemset is called a globally large k-itemset if it is globally supported. Quasi-commutative hash functions are used for secure computation of the set unions that determine globally frequent candidate itemsets from locally frequent candidate itemsets. The globally frequent k-itemsets are then determined from the candidate itemsets with a secure protocol. Since the goal is to determine whether the support exceeds a threshold rather than to learn the exact support of a rule, the secure sum computation is slightly altered: instead of sending the computed values to each other, the sites perform a secure comparison among themselves. If the goal were to have a totally secure method, the union step would have to be eliminated. However, using the secure union method gives higher efficiency with provably controlled disclosure of some minor information (e.g., the number of duplicate items and the candidate sets). The validity of even this disclosed information can be reduced by noise addition, as each site can add some fake large itemsets to its actual locally large itemsets. In the pruning phase, the fake items can be eliminated.

In contrast to the above method, computing association rules over vertically partitioned data is even more challenging (Vaidya & Clifton, 2002). Here the items are partitioned and each itemset is split between sites. Most steps of the traditional Apriori algorithm can be done locally at each of the sites (Agrawal & Srikant, 1994). The crucial step involves finding the support count of an itemset. If the support count of an itemset is securely computed, it can be checked whether the support is greater than the threshold to determine whether the itemset is frequent. Consider the entire transaction database to be a Boolean matrix, where 1 represents the presence of an item (column) in a transaction (row) and 0 correspondingly represents an absence. Then the support count of an itemset is the scalar product of the vectors representing the sub-itemsets held by the two parties. An algorithm to compute the scalar product securely is therefore sufficient for secure computation of the support count.
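The scalar product view of the support count is easy to see in code. The sketch below computes the (completely non-secure) scalar product between Boolean column vectors held by two parties; the toy data is invented, and a real protocol would compute the same quantity without either side revealing its vector. The one-hot probe at the end previews the attack discussed next.

```python
# Support count of an itemset split across two parties, viewed as a scalar
# product of Boolean column vectors (toy data; nothing here is secure).

def sub_itemset_vector(columns, rows):
    """1 for each transaction (row) that contains all of a party's items."""
    return [int(all(row[c] for c in columns)) for row in rows]

# Party A holds columns {bread, butter}; party B holds column {jam}.
party_a_rows = [  # bread, butter
    {"bread": 1, "butter": 1},
    {"bread": 1, "butter": 0},
    {"bread": 1, "butter": 1},
    {"bread": 0, "butter": 1},
]
party_b_rows = [{"jam": 1}, {"jam": 0}, {"jam": 1}, {"jam": 1}]

a = sub_itemset_vector(["bread", "butter"], party_a_rows)    # [1, 0, 1, 0]
b = sub_itemset_vector(["jam"], party_b_rows)                # [1, 0, 1, 1]
support_count = sum(x * y for x, y in zip(a, b))             # = 2
print(support_count)

# Probing attack preview: a party that may choose its vector freely can send
# a one-hot vector and learn one specific entry of the other side's column.
probe = [0, 0, 1, 0]
print(sum(x * y for x, y in zip(probe, b)))  # reveals b[2]
```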
Most of the above protocols assume a semi-honest model, where the parties involved will honestly follow the protocol but may later try to infer additional information from whatever data they receive through the protocol. One consequence is that parties are not allowed to give spurious input to the protocol. If a party is allowed to give spurious input, it can probe to determine the value of a specific item at other parties. For example, if a party gives the input (0, …, 0, 1, 0, …, 0), the result of the scalar product (1 or 0) tells the malicious party whether the other party has the transaction corresponding to the 1. Attacks of this type can be termed probing attacks. All of the protocols currently suggested in the literature are susceptible to such probing attacks. Better techniques, which work even in the malicious model, are needed to guard against this.

FUTURE DIRECTIONS

The ongoing work in the field has suggested several interesting directions for future research.


Existing approaches should be extended to cover new classes of privacy breaches and to ascertain the theoretical limits on discoverability for a given level of privacy (Agrawal & Aggarwal, 2001). Potential research needs to concentrate on combining the randomization and cryptographic protocols to get the strengths of both without the weaknesses of either. Privacy estimation formulas used in data perturbation techniques can be refined so as to include the effects of using the mining output to re-interrogate the distorted database. Extending association rule mining of vertically partitioned data to multiple parties is an important research topic in itself, especially considering collusion between the parties.

Much of the current work on vertically partitioned data is limited to Boolean association rule mining. Categorical attributes and quantitative association rule mining are significantly more complex problems. Most importantly, a wide range of data mining algorithms exists, such as classification, clustering and sequence detection, and the effect of privacy constraints on these algorithms is also an interesting area of future research (Verykios et al., 2004). The final goal of researchers is to develop methods enabling any data mining operation that can be done at a single site to be done across various sources, while respecting the privacy policies of the sources.
CONCLUSION

Several algorithms have been proposed to address the conflicting goals of supporting privacy and accuracy while mining association rules on large databases. Once this field of research matures, a toolkit of privacy-preserving distributed computation techniques needs to be built that can be assembled to solve specific real-world problems (Clifton et al., 2003). Current techniques address the problem of performing one secure computation; using that result to perform the next computation reveals intermediate information that may not be part of the final results. Controlled disclosure is guaranteed by evaluating whether the real results, together with the extra information, violate privacy constraints. This approach, however, becomes more difficult with iterative techniques, as intermediate results from several iterations may reveal a lot of information. Proving that this does not violate privacy is a difficult problem.
REFERENCES

Agrawal, D., & Aggarwal, C.C. (2001). On the design and quantification of privacy preserving data mining algorithms. In Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 247-255). Santa Barbara, California.

Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In ACM SIGMOD Conference on Management of Data (pp. 439-450). Dallas, TX, USA.

Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2002). A data mining architecture for distributed environments. Innovative Internet Computing Systems, 27-38.

Atallah, M.J., & Du, W. (2001). Secure multi-party computational geometry. In Seventh International Workshop on Algorithms and Data Structures (pp. 165-179). Providence, Rhode Island, USA.

Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., & Zhu, M. (2003). Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations, 4(2), 28-34.

Du, W., & Atallah, M.J. (2001). Secure multi-party computation problems and their applications: A review and open problems. In New Security Paradigms Workshop (pp. 11-20). Cloudcroft, New Mexico, USA.

Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 217-228). Edmonton, Canada.

Han, J., & Kamber, M. (2003). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann Publishers.

Ioannidis, I., Grama, A., & Atallah, M. (2002). A secure protocol for computing dot-products in clustered and distributed environments. In International Conference on Parallel Processing (pp. 279-285).

Kantarcioglu, M., & Clifton, C. (2002). Privacy-preserving distributed mining of association rules on horizontally partitioned data. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery.

Lindell, Y., & Pinkas, B. (2000). Privacy preserving data mining. Advances in Cryptology (pp. 36-54).

Maurer, U. (2003). Secure multi-party computation made simple. In Third Conference on Security in Communication Networks (SCN'02) (pp. 14-28). Lecture Notes in Computer Science (Vol. 2576). Berlin: Springer-Verlag.

Rizvi, S.J., & Haritsa, J.R. (2002). Maintaining data privacy in association rule mining. In Twenty-Eighth International Conference on Very Large Data Bases (pp. 682-689).


Saygin, Y., Verykios, V.S., & Elmagarmid, A.K. (2002). Privacy preserving association rule mining. In Twelfth International Workshop on Research Issues in Data Engineering (pp. 151-158).

Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 639-644). Edmonton, Alberta, Canada.

Verykios, V.S., Bertino, E., Fovino, I.N., Provenza, L.P., Saygin, Y., & Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, 33(1), 50-57.
KEY TERMS

Association Rule: A relation between the occurrences of a set of items and another set of items in a large data set.

Cryptographic Data Mining Techniques: Methods that encrypt individual data before running data mining algorithms, so that the final result is also available in an encrypted form.

Distributed Data Mining: Mining information from a very large set of data spread across multiple locations without transferring the data to a central location.

Horizontally Partitioned Data: A distributed architecture in which all the sites share the same database schema but have information about different entities. The union of all the rows across all the sites forms the complete database.

Itemset: A set of one or more items that are purchased together in a single transaction.

Privacy Preserving Data Mining: Algorithms and methods that are used to mine rules from a distributed set of data while the sites reveal as little detailed information as possible.

Secure Multi-Party Computation: Computation of an overall result based on the data from a number of users, in which only the final result, and nothing else, is known to the individual user at the end of the computation.

Vertically Partitioned Data: A distributed architecture in which different sites store different attributes of the data. The union of all these attributes or columns together forms the complete database.


Profit Mining
Senqiang Zhou
Simon Fraser University, Canada

Ke Wang
Simon Fraser University, Canada

INTRODUCTION

A major obstacle in data mining applications is the gap between statistic-based pattern extraction and value-based decision-making. Profit mining aims to reduce this gap. In profit mining, given a set of past transactions and pre-determined target items, we would like to build a model for recommending target items and promotion strategies to new customers, with the goal of maximizing profit. Though this problem is studied in the context of a retailing environment, the concept and techniques are applicable to other applications under a general notion of utility. In this short article, we review existing techniques and briefly describe the profit mining approach recently proposed by the authors. The reader is referred to (Wang, Zhou & Han, 2002) for the details.

BACKGROUND

Whether a customer buys a recommended item is a very complicated issue. Considerations include items stocked, prices or promotions, competitors' offers, recommendations by friends or other customers, psychological issues, convenience, etc. For on-line retailing, it also depends on security considerations. It is unrealistic to model all such factors in a single system. In this article, we focus on one type of information available in most retailing applications, namely past transactions. The belief is that shopping behaviors in the past may shed some light on what customers like. We try to use patterns of such behaviors to recommend items and prices.

Consider an on-line store that is promoting a set of target items. At the cashier counter, the store would like to recommend one target item and a promotion strategy (such as a price) to the customer, based on the non-target items purchased. The challenge is determining an item interesting to the customer, at a price affordable to the customer and profitable to the store. We call this problem profit mining (Wang, Zhou & Han, 2002).

Most statistics-based rule mining, such as association rule mining (Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994), considers a rule as interesting if it passes certain statistical tests such as support/confidence. To an enterprise, however, it remains unclear how such rules can be used to maximize a given business objective. For example, knowing Perfume → Lipstick and Perfume → Diamond, a store manager still cannot tell which of Lipstick and Diamond, and at what price, should be recommended to a customer who buys Perfume. Simply recommending the most profitable item, say Diamond, or the most likely item, say Lipstick, does not maximize the profit, because there is often an inverse correlation between the likelihood to buy and the dollar amount to spend. This inverse correlation reflects the general trend that the larger the dollar amount involved, the more cautious the buyer is when making a purchase decision.
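The tension between likelihood and margin can be made concrete with a small expected-profit calculation. The probabilities, margins and the extra Handbag item below are invented purely for illustration.

```python
# Toy expected-profit comparison for the Perfume example above.
# Probabilities and unit profits are made-up numbers, not data from the text.
candidates = {
    #            P(buy | Perfume)  unit profit ($)
    "Lipstick":  (0.30,            6.0),
    "Handbag":   (0.05,            50.0),
    "Diamond":   (0.002,           800.0),
}

for item, (p_buy, margin) in candidates.items():
    print(f"{item:8s} expected profit per recommendation = {p_buy * margin:.2f}")
# Lipstick expected profit per recommendation = 1.80
# Handbag  expected profit per recommendation = 2.50
# Diamond  expected profit per recommendation = 1.60
```

In this toy setting the best recommendation is neither the most likely item nor the most profitable one; it is the product of likelihood and profit that matters, which is essentially what the recommendation-profit ranking introduced later captures.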
MAIN THRUST OF THE CHAPTER

Related Work

Profit maximization is different from the hit maximization of classic classification, because each hit may generate a different profit. Several approaches exist to make classification cost-sensitive. Domingos (1999) proposed a general method that can serve as a wrapper to make a traditional classifier cost-sensitive. Zadrozny and Elkan (2001) extended the error metric by allowing the cost to be example dependent. Margineantu and Dietterich (2000) gave two bootstrap methods to estimate the average cost of a classifier. Pednault, Abe and Zadrozny (2002) introduced a method for making sequential cost-sensitive decisions, where the goal is to maximize the total benefit over a period of time. These approaches assume a given error metric for each type of misclassification, which is not available in profit mining.
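The difference from plain hit maximization can be seen in a small expected-cost calculation; the cost matrix and class probabilities below are invented for illustration and are not taken from any of the cited methods.

```python
# Toy cost-sensitive decision: choose the prediction with the lowest expected
# cost under a cost matrix cost[true_class][predicted_class].  Numbers are
# made up for the example.
cost = {
    "churn": {"churn": 0.0, "stay": 50.0},   # missing a churner is costly
    "stay":  {"churn": 5.0, "stay": 0.0},    # a wasted retention offer
}
posterior = {"churn": 0.2, "stay": 0.8}      # classifier's probabilities

def expected_cost(predicted):
    return sum(posterior[t] * cost[t][predicted] for t in posterior)

decision = min(("churn", "stay"), key=expected_cost)
print({p: expected_cost(p) for p in ("churn", "stay")}, "->", decision)
# {'churn': 4.0, 'stay': 10.0} -> churn
```

A plain accuracy-maximizing classifier would predict the more probable class ("stay"); the cost-sensitive decision differs exactly because the two error types carry different costs, which is the property profit mining generalizes from costs to item-specific profits.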


Profit mining is related in motivation to the actionability (or utility) of patterns: a pattern is interesting in the sense that the user can act upon it to her advantage (Silberschatz & Tuzhilin, 1996). Kleinberg, Papadimitriou and Raghavan (1998) gave a framework for evaluating data mining operations in terms of utility in decision-making. These works, however, did not propose concrete solutions to the actionability problem.

Recently, there have been several works applying association rules to address business-related problems. Brijs, Swinnen, Vanhoof and Wets (1999), Wong, Fu and Wang (2003) and Wang and Su (2002) studied the problem of selecting a given number of items for stocking. The goal is to maximize the profit generated by selected items or customers. These works present one important step beyond association rule mining, i.e., addressing the issue of converting a set of individual rules into a single actionable model for recommending actions in a given scenario.

There have also been several attempts to generalize association rules to capture more semantics, e.g., (Lin, Yao & Louie, 2002; Yao, Hamilton & Butz, 2004; Chan, Yang & Shen, 2003). Instead of a uniform weight associated with each occurrence of an item, these works associate a general weight with an item and mine all itemsets that pass some threshold on the aggregated weight of the items in an itemset. Like association rule mining, these works did not address the issue of converting a set of rules or itemsets into a model for recommending actions.

Collaborative filtering (Resnick & Varian, 1997) makes recommendations by aggregating the opinions (such as ratings of movies) of several advisors who share tastes with the customer. Built on this technology, many large commerce web sites help their customers find products. For example, Amazon.com uses Book Matcher to recommend books to customers, and Moviefinder.com recommends movies to customers using the We Predict recommender system. For more examples, please refer to (Schafer, Konstan & Riedl, 1999). The goal is to maximize the hit rate of recommendation. For items of varied profit, maximizing profit is quite different from maximizing hit rate. Also, collaborative filtering relies on carefully selected item endorsements for similarity computation, and on a good set of advisors to offer opinions. Such data are not easy to obtain. The ability to recommend prices, in addition to items, is another major difference between profit mining and other recommender systems.

Another application where data mining is heavily used for business targets is direct marketing. See (Ling & Li, 1998; Masand & Shapiro, 1996; Wang, Zhou, Yeung & Yang, 2003), for example. The problem is to identify buyers using data collected from previous campaigns, where the product to be promoted is usually fixed and the best guess is about who is likely to buy. Profit mining, on the other hand, is to guess the best item and price for a given customer. Interestingly, these two problems are closely related to each other. We can model the direct marketing problem as a profit mining problem by including customer demographic data as part of her transactions and including a special target item NULL representing no recommendation. Now, each recommendation of a non-NULL item (and price) corresponds to identifying a buyer of the item. This modeling is more general than traditional direct marketing in that it can identify buyers for more than one type of item and promotion strategy.

Profit Mining

We solve the profit mining problem by extracting patterns from a set of past transactions. A transaction consists of a collection of sales of the form (item, price). A simple price can be substituted by a promotion strategy, such as "buy one get one free" or "X quantity for Y dollars," that provides sufficient information to derive the price. The transactions were collected over some period of time, and there could be several prices even for the same item if sales occurred at different times. Given a collection of transactions, we find recommendation rules of the form {s1, …, sk} → <I, P>, where I is a target item, P is a price of I, and each si is a pair of non-target item and price. An example is (Perfume, price=$20) → (Lipstick, price=$10). This recommendation rule can be used to recommend Lipstick at the price of $10 to a customer who bought Perfume at the price of $20. If the recommendation leads to a sale of Lipstick of quantity Q, it generates (10 - C) * Q profit, where C is the cost of Lipstick.

Several practical considerations would make recommendation rules more useful. First, items on the left-hand side in si can be item categories instead, to capture category-related patterns. Second, a customer may have paid a higher price if a lower price was not available at the shopping time. We can incorporate the domain knowledge that paying a higher price implies the willingness to pay a lower price (for exactly the same item) to search for stronger rules at lower prices. This can be done through multi-level association mining (Srikant & Agrawal, 1995; Han & Fu, 1995), by modeling a lower price as a more general category than a higher price. For example, the sale {<chicken, $3.8>} in a transaction would match any of the following more general sales in a rule: <chicken, $3.8>, <chicken, $3.5>, <chicken, $3.0>, chicken, meat, food. Note that the last three sales are generalized by climbing up the category hierarchy and dropping the price.
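As a sketch of how such rules might be represented and matched, the following illustrative code checks whether a transaction supports the left-hand side of a recommendation rule, treating a purchase at a higher price as evidence of willingness to buy at the rule's lower price, and letting items generalize to categories as described above. The data structures and the toy category hierarchy are assumptions, not the authors' implementation.

```python
# Illustrative matching of a recommendation rule's left-hand side against a
# transaction.  A sale matches a rule item if it is the same item at a price
# no lower than the rule's price (paying more implies willingness to pay
# less), or an item whose category generalizes to the rule item.
parent_category = {"chicken": "meat", "meat": "food"}   # toy hierarchy

def generalizes(item, target):
    """True if `target` is `item` or one of its ancestors in the hierarchy."""
    while item is not None:
        if item == target:
            return True
        item = parent_category.get(item)
    return False

def sale_matches(sale, rule_item):
    item, price = sale
    r_item, r_price = rule_item
    if r_price is None:                      # rule item dropped the price
        return generalizes(item, r_item)
    return item == r_item and price >= r_price

def rule_matches(transaction, lhs):
    """Every rule item on the left-hand side must be matched by some sale."""
    return all(any(sale_matches(s, r) for s in transaction) for r in lhs)

transaction = [("chicken", 3.8), ("perfume", 20.0)]
lhs = [("chicken", 3.0), ("perfume", 20.0)]          # lower chicken price
print(rule_matches(transaction, lhs))                # True
print(rule_matches(transaction, [("meat", None)]))   # True via the hierarchy
```

A full recommender would, of course, also need the rule's right-hand side (the target item and price) and the profit bookkeeping discussed next.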
A key issue is how to make a set of individual rules work as a single recommender. Our approach is to rank rules by their recommendation profit. The recommendation profit of a rule r is defined as the average profit of the target item in r among all transactions that match r.


Note that ranking by average profit implicitly takes into account both confidence and profit, because a high average profit implies that both confidence and profit are high. Given a new customer, we pick the highest ranked matching rule to make a recommendation.

Before making recommendations, however, over-fitting rules that work only for observed transactions, but not for new customers, should be pruned, because our goal is to maximize profit on new customers. The idea is as follows. Instead of ranking rules by observed profit, we rank rules by projected profit, which is based on the estimated error of a rule, adapted from the estimate used for pruning classifiers (Quinlan, 1993). Intuitively, the estimated error will increase for a rule that matches a small number of transactions. Therefore, over-fitting rules tend to have a larger estimated error, which translates into a lower projected profit and a lower rank.
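The ranking idea can be sketched as follows: compute each rule's average observed profit over the transactions that match it, and shrink the average for rules matched by few transactions. The penalty term below is a crude stand-in for the pessimistic error estimate of Quinlan (1993), and the rule format and data are invented for illustration.

```python
# Illustrative ranking of recommendation rules by penalized average profit.
# A rule is (lhs_items, target_item, price, unit_cost); a transaction is a
# dict mapping item -> quantity bought.  All numbers are made-up examples.
import math

def projected_profit(rule, transactions, penalty=1.0):
    """Average profit of the target among matching transactions, shrunk for
    rules supported by few transactions (a crude small-sample penalty)."""
    lhs, target, price, cost = rule
    profits = [(price - cost) * txn.get(target, 0)
               for txn in transactions
               if lhs.issubset(txn)]           # naive left-hand-side matching
    if not profits:
        return float("-inf")
    avg = sum(profits) / len(profits)
    return avg - penalty * math.sqrt(1.0 / len(profits))

transactions = [
    {"perfume": 1, "lipstick": 2},
    {"perfume": 1, "lipstick": 1},
    {"perfume": 1, "diamond": 1},
]
rules = [
    ({"perfume"}, "lipstick", 10.0, 4.0),
    ({"perfume"}, "diamond", 800.0, 785.0),
]
best = max(rules, key=lambda r: projected_profit(r, transactions))
print(best[1])   # item to recommend to a new perfume buyer -> lipstick
```

Ranking by raw averages would behave similarly here; the penalty only matters for rules matched by very few transactions, which is exactly the over-fitting case the projected profit is meant to demote.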
For a detailed exposition and experiments on real-life and synthetic data sets, the reader is referred to (Wang, Zhou & Han, 2002).
FUTURE TRENDS

The profit mining proposed is only the first, but important, step in addressing the ultimate goal of data mining. To make profit mining more practical, several issues need further study. First, it is quite likely that the recommended item tends to be an item that the customer would buy independently of the recommendation. Obviously, such items need not be recommended, and recommendation should focus on those items that the customer may buy if informed, but may not buy otherwise. Recommending such items likely brings in additional profit. Second, the current model maximizes only the profit of the one-shot selling effect; therefore, a sale in a large quantity is favored. In reality, a customer may regularly shop at the same store over a period of time, in which case a sale in a large quantity will affect the shopping frequency of a customer and, thus, profit. In this case, the goal is maximizing the profit from recurring customers over a period of time. Another interesting direction is to incorporate feedback on whether a certain recommendation is rejected or accepted to improve future recommendations.
and accepted to improve future recommendations. tiple-level association rules from large databases. Inter-
This current work has focused on the information national Conference on Very Large Data Bases (VLDB)
captured in past transactions. As pointed out in Introduc- (pp. 420-431), Zurich, Switzerland.
tion, other things such as competitors offers, recom-
mendation by friends or customers, consumer fashion, Kleinberg, J., Papadimitriou, C. & Raghavan, P. (1998,
psychological issues, conveniences, etc. can affect the December). A microeconomic view of data mining.
customers decision. Addressing these issues requires Data Mining and Knowledge Discovery Journal, 2(4),
additional knowledge, such as competitors offers, and 311-324.
computers may not be the most suitable tool. One solu- Lin, T. Y., Yao, Y.Y., & Louie, E. (2002, May). Value
tion could be suggesting several best recommendations added association rules. Advances in Knowledge Dis-
to the domain expert, the store manager or sales person covery and Data Mining, 6th Pacific-Asia Conference
in this case, who makes the final recommendation to the PAKDD (pp. 328-333), Taipei, Taiwan.
customer after factoring the other considerations.


Ling, C., & Li, C. (1998, August). Data mining for direct marketing: Problems and solutions. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 73-79), New York, USA.

Margineantu, D.D., & Dietterich, T.G. (2000, June-July). Bootstrap methods for the cost-sensitive evaluation of classifiers. International Conference on Machine Learning (ICML) (pp. 583-590), San Francisco, USA.

Masand, B., & Shapiro, G.P. (1996, August). A comparison of approaches for maximizing business payoff of prediction models. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 195-201), Portland, USA.

Pednault, E., Abe, N., & Zadrozny, B. (2002, July). Sequential cost-sensitive decision making with reinforcement learning. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 259-268), Edmonton, Canada.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Resnick, P., & Varian, H.R. (1997). CACM special issue on recommender systems. Communications of the ACM, 40(3), 56-58.

Schafer, J.B., Konstan, J.A., & Riedl, J. (1999, November). Recommender systems in e-commerce. ACM Conference on Electronic Commerce (pp. 158-166), Denver, USA.

Silberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6), 970-974.

Srikant, R., & Agrawal, R. (1995, September). Mining generalized association rules. International Conference on Very Large Data Bases (VLDB) (pp. 407-419), Zurich, Switzerland.

Wang, K., & Su, M.Y. (2002, July). Item selection by hub-authority profit ranking. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 652-657), Edmonton, Canada.

Wang, K., Zhou, S., & Han, J. (2002, March). Profit mining: From patterns to actions. International Conference on Extending Database Technology (EDBT) (pp. 70-87), Prague, Czech Republic.

Wang, K., Zhou, S., Yeung, J.M.S., & Yang, Q. (2003, March). Mining customer value: From association rules to direct marketing. International Conference on Data Engineering (ICDE) (pp. 738-740), Bangalore, India.
Wong, R.C.W., Fu, A.W.C., & Wang, K. (2003, November). MPIS: Maximal-profit item selection with cross-selling considerations. IEEE International Conference on Data Mining (ICDM) (pp. 371-378), Melbourne, USA.

Yao, H., Hamilton, H.J., & Butz, C.J. (2004, April). A foundational approach for mining itemset utilities from databases. SIAM International Conference on Data Mining (SIAMDM) (pp. 482-486), Florida, USA.

Zadrozny, B., & Elkan, C. (2001, August). Learning and making decisions when costs and probabilities are both unknown. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 204-213), San Francisco, USA.

KEY TERMS

Association Rule: An association rule has the form I1 → I2, where I1 and I2 are two itemsets. The support of an association rule is the support of the itemset I1 ∪ I2, and the confidence of a rule is the ratio of the support of I1 ∪ I2 to the support of I1.

Classification: Given a set of training examples in which each example is labeled by a class, build a model, called a classifier, to predict the class label of new examples that follow the same class distribution as the training examples. A classifier is accurate if the predicted class label is the same as the actual class label.

Cost Sensitive Classification: The error of a misclassification depends on the type of the misclassification. For example, the error of misclassifying Class 1 as Class 2 may not be the same as the error of misclassifying Class 1 as Class 3.

Frequent Itemset: The support of an itemset refers to the percentage of transactions that contain all the items in the itemset. A frequent itemset is an itemset with support above a pre-specified threshold.

Over-fitting Rule: A rule that has high performance (e.g., high classification accuracy) on observed transactions but performs poorly on future transactions. Hence, such rules should be excluded from decision-making systems (e.g., recommenders). In many cases over-fitting rules are generated due to noise in the data set.

Profit Mining: In a general sense, profit mining refers to data mining aimed at maximizing a given objective function over decision making for a targeted population (Wang, Zhou & Han, 2002).


Finding a set of rules that pass a given threshold on some interestingness measure (such as association rule mining or its variations) is not profit mining, because of the lack of a specific objective function to be maximized. Classification is a special case of profit mining where the objective function is the accuracy and the targeted population consists of future cases. This paper examines a specific problem of profit mining, i.e., building a model for recommending target products and prices with the objective of maximizing net profit.

Transaction: A transaction is some set of items chosen from a fixed alphabet.


Pseudo Independent Models


Yang Xiang
University of Guelph, Canada

INTRODUCTION

Graphical models such as Bayesian networks (BNs) (Pearl, 1988) and decomposable Markov networks (DMNs) (Xiang, Wong & Cercone, 1997) have been applied widely to probabilistic reasoning in intelligent systems. Figure 1 illustrates a BN and a DMN on a trivial uncertain domain: A virus can damage computer files, and so can a power glitch. A power glitch also causes a VCR to reset. The BN in (a) has four nodes, corresponding to four binary variables taking values from {true, false}. The graph structure encodes a set of dependence and independence assumptions (e.g., that f is directly dependent on v and p, but is independent of r once the value of p is known). Each node is associated with a conditional probability distribution conditioned on its parent nodes (e.g., P(f | v, p)). The joint probability distribution is the product P(v, p, f, r) = P(f | v, p) P(r | p) P(v) P(p). The DMN in (b) has two groups of nodes that are maximally pair-wise connected, called cliques. Each clique is associated with a probability distribution (e.g., clique {v, p, f} is assigned P(v, p, f)). The joint probability distribution is P(v, p, f, r) = P(v, p, f) P(r, p) / P(p), where P(p) can be derived from one of the clique distributions. The networks, for instance, can be used to reason about whether there are viruses in the computer system, after observations on f and r are made.

Construction of such networks by elicitation from domain experts can be very time-consuming. Automatic discovery (Neapolitan, 2004) by exhaustively testing all possible network structures is intractable. Hence, heuristic search must be used. This article examines a class of graphical models that cannot be discovered using the common heuristics.

Figure 1. (a) A trivial example BN; (b) a corresponding DMN

BACKGROUND

Let V be a set of n discrete variables x1, …, xn (in what follows, we will focus on finite, discrete variables). Each variable xi has a finite space Si = {xi,1, xi,2, …, xi,Di} of cardinality Di. When there is no confusion, we write xi,j as xij for simplicity. The space of a set V of variables is defined by the Cartesian product of the spaces of all variables in V, that is, SV = S1 × … × Sn. Thus, SV contains the tuples made of all possible combinations of values of the variables in V. Each tuple is called a configuration of V, denoted by v = (x1, …, xn).

Let P(xi) denote the probability function over xi and P(xij) denote the probability value P(xi = xij). A probabilistic domain model (PDM) over V defines the probability values of every configuration for every subset A ⊆ V. Let P(V) or P(x1, …, xn) denote the joint probability distribution (JPD) function over x1, …, xn and P(x1j1, …, xnjn) denote the probability value of a configuration (x1j1, …, xnjn). We refer to the function P(A) over A ⊆ V as the marginal distribution over A and P(xi) as the marginal distribution of xi. We refer to P(x1j1, …, xnjn) as a joint parameter and P(xij) as a marginal parameter of the corresponding PDM over V.

For any three disjoint subsets of variables W, U and Z in V, the subsets W and U are called conditionally independent given Z if

P(W | U, Z) = P(W | Z)

for all possible values of W, U and Z such that P(U, Z) > 0. Conditional independence signifies the dependence mediated by Z. This allows the dependence among W ∪ U ∪ Z to be modeled over the subsets W ∪ Z and U ∪ Z separately. Conditional independence is the key property explored through graphical models.

Subsets W and U are said to be marginally independent (sometimes referred to as unconditionally independent) if

P(W | U) = P(W)

for all possible values of W and U such that P(U) > 0. When two subsets of variables are marginally independent, there is no dependence between them. Hence, each subset