Encyclopedia of
Data Warehousing
and Mining
John Wang
Montclair State University, USA
TEAM LinG
Acquisitions Editor: Rene Davies
Development Editor: Kristin Roth
Senior Managing Editor: Amanda Appicello
Managing Editor: Jennifer Neidig
Copy Editors: Eva Brennan, Alana Bubnis, Rene Davies and Sue VanderHook
Typesetters: Diane Huskinson, Sara Reed and Larissa Zearfoss
Support Staff: Michelle Potter
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.
Copyright © 2006 by Idea Group Inc. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI of the trademark or registered trademark.
All work contributed to this encyclopedia set is new, previously unpublished material. The views expressed in this encyclopedia set
are those of the authors and not necessarily those of the publisher.
Editorial Advisory Board
List of Contributors
Carneiro, Sofia / University of Minho, Portugal
Cerchiello, Paola / University of Pavia, Italy
Chakravarty, Indrani / Indian Institute of Technology, India
Chalasani, Suresh / University of Wisconsin-Parkside, USA
Chang, Chia-Hui / National Central University, Taiwan
Chen, Qiyang / Montclair State University, USA
Chen, Shaokang / The University of Queensland, Australia
Chen, Yao / University of Massachusetts Lowell, USA
Chen, Zhong / Shanghai JiaoTong University, PR China
Chien, Chen-Fu / National Tsing Hua University, Taiwan
Cho, Vincent / The Hong Kong Polytechnic University, Hong Kong
Chu, Feng / Nanyang Technological University, Singapore
Chu, Wesley / University of California - Los Angeles, USA
Chung, Seokkyung / University of Southern California, USA
Chung, Soon M. / Wright State University, USA
Conversano, Claudio / University of Cassino, Italy
Cook, Diane J. / University of Texas at Arlington, USA
Cook, Jack / Rochester Institute of Technology, USA
Cunningham, Colleen / Drexel University, USA
Dai, Honghua / Deakin University, Australia
Daly, Olena / Monash University, Australia
Dardzinska, Agnieszka / Bialystok Technical University, Poland
Das, Gautam / The University of Texas at Arlington, USA
de Campos, Luis M. / Universidad de Granada, Spain
de Luigi, Fabio / University of Ferrara, Italy
De Meo, Pasquale / Università Mediterranea di Reggio Calabria, Italy
DeLorenzo, Gary J. / Robert Morris University, USA
Delve, Janet / University of Portsmouth, UK
Denoyer, Ludovic / University of Paris VI, France
Denton, Anne / North Dakota State University, USA
Dhaenens, Clarisse / LIFL, University of Lille 1, France
Diday, Edwin / University of Dauphine, France
Dillon, Tharam / University of Technology Sydney, Australia
Ding, Qiang / Concordia College, USA
Ding, Qin / Pennsylvania State University, USA
Domeniconi, Carlotta / George Mason University, USA
Dorado de la Calle, Julián / University of A Coruña, Spain
Dorai, Chitra / IBM T. J. Watson Research Center, USA
Drew, James H. / Verizon Laboratories, USA
Dumitriu, Luminita / Dunarea de Jos University, Romania
Ester, Martin / Simon Fraser University, Canada
Fan, Weiguo / Virginia Polytechnic Institute and State University, USA
Felici, Giovanni / Istituto di Analisi dei Sistemi ed Informatica (IASI-CNR), Italy
Feng, Ling / University of Twente, The Netherlands
Fernández, Víctor Fresno / Universidad Rey Juan Carlos, Spain
Fernández-Luna, Juan M. / Universidad de Granada, Spain
Fischer, Ingrid / Friedrich-Alexander University Erlangen-Nürnberg, Germany
Fu, Ada Wai-Chee / The Chinese University of Hong Kong, Hong Kong
Fu, Li M. / University of Florida, USA
Fu, Yongjian / Cleveland State University, USA
Fung, Benjamin C. M. / Simon Fraser University, Canada
Fung, Benny Yiu-ming / The Hong Kong Polytechnic University, Hong Kong
Gallinari, Patrick / University of Paris VI, France
Galvão, Roberto Kawakami Harrop / Instituto Tecnológico de Aeronáutica, Brazil
Ganguly, Auroop R. / Oak Ridge National Laboratory, USA
Garatti, Simone / Politecnico di Milano, Italy
Garrity, Edward J. / Canisius College, USA
Ge, Nanxiang / Aventis, USA
Gehrke, Johannes / Cornell University, USA
Georgieva, Olga / Institute of Control and System Research, Bulgaria
Giudici, Paolo / University of Pavia, Italy
Goodman, Kenneth W. / University of Miami, USA
Greenidge, Charles / University of the West Indies, Barbados
Grzymala-Busse, Jerzy W. / University of Kansas, USA
Gunopulos, Dimitrios / University of California, USA
Guo, Hong / Southern Illinois University, USA
Gupta, Amar / University of Arizona, USA
Gupta, P. / Indian Institute of Technology, India
Haastrup, Palle / European Commission, Italy
Hamdi, Mohamed Salah / UAE University, UAE
Hamel, Lutz / University of Rhode Island, USA
Hamers, Ronald / Erasmus Medical Thorax Center, The Netherlands
Hammer, Peter L. / RUTCOR, Rutgers University, USA
Han, Hyoil / Drexel University, USA
Harms, Sherri K. / University of Nebraska at Kearney, USA
Hogo, Mofreh / Czech Technical University, Czech Republic
Holder, Lawrence B. / University of Texas at Arlington, USA
Hong, Yu / BearingPoint Inc, USA
Horiguchi, Susumu / Tohoku University, Japan
Hou, Wen-Chi / Southern Illinois University, USA
Hsu, Chun-Nan / Institute of Information Science, Academia Sinica, Taiwan
Hsu, William H. / Kansas State University, USA
Hu, Wen-Chen / University of North Dakota, USA
Hu, Xiaohua / Drexel University, USA
Huang, Xiangji / York University, Canada
Huete, Juan F. / Universidad de Granada, Spain
Hwang, Sae / University of Texas at Arlington, USA
Ibaraki, Toshihide / Kwansei Gakuin University, Japan
Ito, Takao / Ube National College of Technology, Japan
Jahangiri, Mehrdad / University of Southern California, USA
Järvelin, Kalervo / University of Tampere, Finland
Jha, Neha / Indian Institute of Technology, Kharagpur, India
Jin, Haihao / University of Kentucky, USA
Jourdan, Laetitia / LIFL, University of Lille 1, France
Jun, Jongeun / University of Southern California, USA
Kanapady, Ramdev / University of Minnesota, USA
Kao, Odej / University of Paderborn, Germany
Karakaya, Murat / Bilkent University, Turkey
Katsaros, Dimitrios / Aristotle University, Greece
Kern-Isberner, Gabriele / University of Dortmund, Germany
Khan, Latifur / University of Texas at Dallas, USA
Khan, M. Riaz / University of Massachusetts Lowell, USA
Khan, Shiraj / University of South Florida, USA
Kickhöfel, Rodrigo Branco / Catholic University of Pelotas, Brazil
Kim, Han-Joon / The University of Seoul, Korea
Klawonn, Frank / University of Applied Sciences Braunschweig/Wolfenbuettel, Germany
Koeller, Andreas / Montclair State University, USA
Kokol, Peter / University of Maribor, FERI, Slovenia
Kontio, Juha / Turku Polytechnic, Finland
Koppelaar, Henk / Delft University of Technology, The Netherlands
Kroeze, Jan H. / University of Pretoria, South Africa
Kros, John F. / East Carolina University, USA
Kryszkiewicz, Marzena / Warsaw University of Technology, Poland
Kusiak, Andrew / The University of Iowa, USA
la Tendresse, Ingo / Technical University of Clausthal, Germany
Lax, Gianluca / University Mediterranea of Reggio Calabria, Italy
Layos, Luis Magdalena / Universidad Politécnica de Madrid, Spain
Lazarevic, Aleksandar / University of Minnesota, USA
Lee, Chung-Hong / National Kaohsiung University of Applied Sciences, Taiwan
Lee, Chung-wei / Auburn University, USA
Lee, JeongKyu / University of Texas at Arlington, USA
Lee, Tzai-Zang / National Cheng Kung University, Taiwan, ROC
Lee, Zu-Hsu / Montclair State University, USA
Lee-Post, Anita / University of Kentucky, USA
Lenič, Mitja / University of Maribor, FERI, Slovenia
Levary, Reuven R. / Saint Louis University, USA
Li, Tao / Florida International University, USA
Li, Wenyuan / Nanyang Technological University, Singapore
Liberati, Diego / Consiglio Nazionale delle Ricerche, Italy
Licthnow, Daniel / Catholic University of Pelotas, Brazil
Lim, Ee-Peng / Nanyang Technological University, Singapore
Lin, Beixin (Betsy) / Montclair State University, USA
Lin, Tsau Young / San Jose State University, USA
Lindell, Yehuda / Bar-Ilan University, Israel
Lingras, Pawan / Saint Mary's University, Canada
Liu, Chang / Northern Illinois University, USA
Liu, Huan / Arizona State University, USA
Liu, Li / Aventis, USA
Liu, Xiaohui / Brunel University, UK
Liu, Xiaoqiang / Delft University of Technology, The Netherlands, and Donghua University, China
Lo, Victor S.Y. / Fidelity Personal Investments, USA
Lodhi, Huma / Imperial College London, UK
Loh, Stanley / Catholic University of Pelotas, Brazil, and Lutheran University of Brasil, Brazil
Long, Lori K. / Kent State University, USA
Lorenzi, Fabiana / Universidade Luterana do Brasil, Brazil
Lovell, Brian C. / The University of Queensland, Australia
Lu, June / University of Houston-Victoria, USA
Lu, Xinjian / California State University, Hayward, USA
Lutu, Patricia E.N. / University of Pretoria, South Africa
Ma, Sheng / IBM T.J. Watson Research Center, USA
Maj, Jean-Baptiste / LORIA/INRIA, France
Maloof, Marcus A. / Georgetown University, USA
Mangamuri, Murali / Wright State University, USA
Mani, D. R. / Massachusetts Institute of Technology, USA, and Harvard University, USA
Maniezzo, Vittorio / University of Bologna, Italy
Manolopoulos, Yannis / Aristotle University, Greece
Marchetti, Carlo / Università di Roma La Sapienza, Italy
Martí, Rafael / Universitat de València, Spain
Masseglia, Florent / INRIA Sophia Antipolis, France
Mathieu, Richard / Saint Louis University, USA
McLeod, Dennis / University of Southern California, USA
Mecella, Massimo / Università di Roma La Sapienza, Italy
Meinl, Thorsten / Friedrich-Alexander University Erlangen-Nürnberg, Germany
Meo, Rosa / Università degli Studi di Torino, Italy
Mishra, Nilesh / Indian Institute of Technology, India
Mladenić, Dunja / Jozef Stefan Institute, Slovenia
Mobasher, Bamshad / DePaul University, USA
Mohania, Mukesh / IBM India Research Lab, India
Morantz, Brad / Georgia State University, USA
Moreira, Adriano / University of Minho, Portugal
Motiwalla, Luvai / University of Massachusetts Lowell, USA
Muhlenbach, Fabrice / EURISE, Université Jean Monnet - Saint-Etienne, France
Mukherjee, Sach / University of Oxford, UK
Murty, M. Narasimha / Indian Institute of Science, India
Muruzábal, Jorge / University Rey Juan Carlos, Spain
Muselli, Marco / Italian National Research Council, Italy
Musicant, David R. / Carleton College, USA
Muslea, Ion / SRI International, USA
Nanopoulos, Alexandros / Aristotle University, Greece
Nasraoui, Olfa / University of Louisville, USA
Nayak, Richi / Queensland University of Technology, Australia
Nemati, Hamid R. / The University of North Carolina at Greensboro, USA
Ng, Vincent To-yee / The Hong Kong Polytechnic University, Hong Kong
Ng, Wee-Keong / Nanyang Technological University, Singapore
Nicholson, Scott / Syracuse University School of Information Studies, USA
O'Donnell, Joseph B. / Canisius College, USA
Oh, JungHwan / University of Texas at Arlington, USA
Oppenheim, Alan / Montclair State University, USA
Owens, Jan / University of Wisconsin-Parkside, USA
Oza, Nikunj C. / NASA Ames Research Center, USA
Pang, Les / National Defense University, USA
Paquet, Eric / National Research Council of Canada, Canada
Pasquier, Nicolas / Université de Nice-Sophia Antipolis, France
Pathak, Praveen / University of Florida, USA
Perlich, Claudia / IBM Research, USA
Perrizo, William / North Dakota State University, USA
Peter, Hadrian / University of the West Indies, Barbados
Peterson, Richard L. / Montclair State University, USA
Pharo, Nils / Oslo University College, Norway
Piltcher, Gustavo / Catholic University of Pelotas, Brazil
Poncelet, Pascal / École des Mines d'Alès, France
Portougal, Victor / The University of Auckland, New Zealand
Povalej, Petra / University of Maribor, FERI, Slovenia
Primo, Tiago / Catholic University of Pelotas, Brazil
Provost, Foster / New York University, USA
Psaila, Giuseppe / Università degli Studi di Bergamo, Italy
Quattrone, Giovanni / Università Mediterranea di Reggio Calabria, Italy
Rabuñal Dopico, Juan R. / University of A Coruña, Spain
Rahman, Hakikur / SDNP, Bangladesh
Rakotomalala, Ricco / ERIC, Université Lumière Lyon 2, France
Ramoni, Marco F. / Harvard Medical School, USA
Ras, Zbigniew W. / University of North Carolina, Charlotte, USA
Rea, Alan / Western Michigan University, USA
Rehm, Frank / German Aerospace Center, Germany
Ricci, Francesco / eCommerce and Tourism Research Laboratory, ITC-irst, Italy
Rivero Cebrián, Daniel / University of A Coruña, Spain
Sacharidis, Dimitris / University of Southern California, USA
Saldaña, Ramiro / Catholic University of Pelotas, Brazil
Sanders, G. Lawrence / State University of New York at Buffalo, USA
Santos, Maribel Yasmina / University of Minho, Portugal
Saquer, Jamil M. / Southwest Missouri State University, USA
Sayal, Mehmet / Hewlett-Packard Labs, USA
Saygin, Yücel / Sabanci University, Turkey
Scannapieco, Monica / Università di Roma La Sapienza, Italy
Schafer, J. Ben / University of Northern Iowa, USA
Scheffer, Tobias / Humboldt-Universität zu Berlin, Germany
Schneider, Michel / Blaise Pascal University, France
Scime, Anthony / State University of New York College Brockport, USA
Sebastiani, Paola / Boston University School of Public Health, USA
Segall, Richard S. / Arkansas State University, USA
Shah, Shital C. / The University of Iowa, USA
Shahabi, Cyrus / University of Southern California, USA
Shen, Hong / Japan Advanced Institute of Science and Technology, Japan
Sheng, Yihua Philip / Southern Illinois University, USA
Siciliano, Roberta / University of Naples Federico II, Italy
Simitsis, Alkis / National Technical University of Athens, Greece
Simões, Gabriel / Catholic University of Pelotas, Brazil
Sindoni, Giuseppe / ISTAT - National Institute of Statistics, Italy
Singh, Richa / Indian Institute of Technology, India
Smets, Philippe / Université Libre de Bruxelles, Belgium
Smith, Kate A. / Monash University, Australia
Song, Il-Yeol / Drexel University, USA
Song, Min / Drexel University, USA
Sounderpandian, Jayavel / University of Wisconsin-Parkside, USA
Souto, Nieves Pedreira / University of A Coruña, Spain
Stanton, Jeffrey / Syracuse University School of Information Studies, USA
Sundaram, David / The University of Auckland, New Zealand
Sural, Shamik / Indian Institute of Technology, Kharagpur, India
Talbi, El-Ghazali / LIFL, University of Lille 1, France
Tan, Hee Beng Kuan / Nanyang Technological University, Singapore
Tan, Rebecca Boon-Noi / Monash University, Australia
Taniar, David / Monash University, Australia
Teisseire, Maguelonne / University of Montpellier II, France
Terracina, Giorgio / Università della Calabria, Italy
Thelwall, Mike / University of Wolverhampton, UK
Theodoratos, Dimitri / New Jersey Institute of Technology, USA
Thomasian, Alexander / New Jersey Institute of Technology, USA
Thuraisingham, Bhavani / The MITRE Corporation, USA
Tininini, Leonardo / CNR - Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti", Italy
Troutt, Marvin D. / Kent State University, USA
Truemper, Klaus / University of Texas at Dallas, USA
Tsay, Li-Shiang / University of North Carolina, Charlotte, USA
Tzacheva, Angelina / University of North Carolina, Charlotte, USA
Ulusoy, Özgür / Bilkent University, Turkey
Ursino, Domenico / Università Mediterranea di Reggio Calabria, Italy
Vardaki, Maria / University of Athens, Greece
Vargas, Juan E. / University of South Carolina, USA
Vatsa, Mayank / Indian Institute of Technology, India
Viertl, Reinhard / Vienna University of Technology, Austria
Viktor, Herna L. / University of Ottawa, Canada
Virgillito, Antonino / Università di Roma La Sapienza, Italy
Viswanath, P. / Indian Institute of Science, India
Walter, Jörg Andreas / University of Bielefeld, Germany
Wang, Dajin / Montclair State University, USA
Wang, Hai / Saint Mary's University, Canada
Wang, Ke / Simon Fraser University, Canada
Wang, Lipo / Nanyang Technological University, Singapore
Wang, Shouhong / University of Massachusetts Dartmouth, USA
Wang, Xiong / California State University at Fullerton, USA
Webb, Geoffrey I. / Monash University, Australia
Wen, Ji-Rong / Microsoft Research Asia, China
West, Chad / IBM Canada Limited, Canada
Wickramasinghe, Nilmini / Cleveland State University, USA
Wieczorkowska, Alicja A. / Polish-Japanese Institute of Information Technology, Poland
Winkler, William E. / U.S. Bureau of the Census, USA
Wong, Raymond Chi-Wing / The Chinese University of Hong Kong, Hong Kong
Woon, Yew-Kwong / Nanyang Technological University, Singapore
Wu, Chien-Hsing / National University of Kaohsiung, Taiwan, ROC
Xiang, Yang / University of Guelph, Canada
Xing, Ruben / Montclair State University, USA
Yan, Feng / Williams Power, USA
Yan, Rui / Saint Mary's University, Canada
Yang, Hsin-Chang / Chang Jung University, Taiwan
Yang, Hung-Jen / National Kaohsiung Normal University, Taiwan
Yang, Ying / Monash University, Australia
Yao, James E. / Montclair State University, USA
Yao, Yiyu / University of Regina, Canada
Yavas, Gökhan / Bilkent University, Turkey
Yeh, Jyh-haw / Boise State University, USA
Yoo, Illhoi / Drexel University, USA
Yu, Lei / Arizona State University, USA
Zendulka, Jaroslav / Brno University of Technology, Czech Republic
Zhang, Bin / Hewlett-Packard Research Laboratories, USA
Zhang, Chengqi / University of Technology Sydney, Australia
Zhang, Shichao / University of Technology Sydney, Australia
Zhang, Yu-Jin / Tsinghua University, Beijing, China
Zhao, Qiankun / Nanyang Technological University, Singapore
Zhao, Yan / University of Regina, Canada
Zhao, Yuan / Nanyang Technological University, Singapore
Zhou, Senqiang / Simon Fraser University, Canada
Zhou, Zhi-Hua / Nanjing University, China
Zhu, Dan / Iowa State University, USA
Zhu, Qiang / University of Michigan, USA
Ziadé, Tarek / NUXEO, France
Ziarko, Wojciech / University of Regina, Canada
Zorman, Milan / University of Maribor, FERI, Slovenia
Zou, Qinghua / University of California - Los Angeles, USA
Contents
by Volume
VOLUME I
Action Rules / Zbigniew W. Ras, Angelina Tzacheva, and Li-Shiang Tsay ........................................................... 1
Administering and Managing a Data Warehouse / James E. Yao, Chang Liu, Qiyang Chen, and June Lu .......... 17
Agent-Based Mining of User Profiles for E-Services / Pasquale De Meo, Giovanni Quattrone,
Giorgio Terracina, and Domenico Ursino ......................................................................................................... 23
Aggregation for Predictive Modeling with Relational Data / Claudia Perlich and Foster Provost ....................... 33
Approximate Range Queries by Histograms in OLAP / Francesco Buccafurri and Gianluca Lax ........................ 49
Association Rule Mining / Yew-Kwong Woon, Wee-Keong Ng, and Ee-Peng Lim ................................................ 59
Association Rule Mining and Application to MPIS / Raymond Chi-Wing Wong and Ada Wai-Chee Fu ............. 65
Association Rule Mining of Relational Data / Anne Denton and Christopher Besemann ..................................... 70
Association Rules and Statistics / Martine Cadot, Jean-Baptiste Maj, and Tarek Ziadé ..................................... 74
Bayesian Networks / Ahmad Bashir, Latifur Khan, and Mamoun Awad ............................................................... 89
Best Practices in Data Warehousing from the Federal Perspective / Les Pang ....................................................... 94
Bibliomining for Library Decision-Making / Scott Nicholson and Jeffrey Stanton ................................................. 100
Biomedical Data Mining Using RBF Neural Networks / Feng Chu and Lipo Wang ................................................ 106
Building Empirical-Based Knowledge for Design Recovery / Hee Beng Kuan Tan and Yuan Zhao ...................... 112
Case-Based Recommender Systems / Fabiana Lorenzi and Francesco Ricci ....................................................... 124
Categorization Process and Data Mining / Maria Suzana Marc Amoretti ............................................................. 129
Clustering in the Identification of Space Models / Maribel Yasmina Santos, Adriano Moreira,
and Sofia Carneiro .............................................................................................................................................. 165
Clustering Techniques for Outlier Detection / Frank Klawonn and Frank Rehm ................................................. 180
Combining Induction Methods with the Multimethod Approach / Mitja Lenič, Peter Kokol, Petra Povalej,
and Milan Zorman ............................................................................................................................................... 184
Content-Based Image Retrieval / Timo R. Bretschneider and Odej Kao ................................................................. 212
Data Driven vs. Metric Driven Data Warehouse Design / John M. Artz ................................................................. 223
Data Mining and Decision Support for Business and Science / Auroop R. Ganguly, Amar Gupta,
and Shiraj Khan .................................................................................................................................................. 233
Data Mining and Warehousing in Pharma Industry / Andrew Kusiak and Shital C. Shah .................................... 239
Data Mining in Diabetes Diagnosis and Detection / Indranil Bose ........................................................................ 257
Data Mining in Human Resources / Marvin D. Troutt and Lori K. Long ............................................................... 262
Data Mining in the Soft Computing Paradigm / Pradip Kumar Bala, Shamik Sural,
and Rabindra Nath Banerjee .............................................................................................................................. 272
Data Mining Medical Digital Libraries / Colleen Cunningham and Xiaohua Hu ................................................... 278
Data Mining Methods for Microarray Data Analysis / Lei Yu and Huan Liu ......................................................... 283
Data Mining with Incomplete Data / Hai Wang and Shouhong Wang .................................................................... 293
Data Quality in Cooperative Information Systems / Carlo Marchetti, Massimo Mecella, Monica Scannapieco,
and Antonino Virgillito ...................................................................................................................................... 297
Data Reduction and Compression in Database Systems / Alexander Thomasian .................................................. 307
Data Warehouse Back-End Tools / Alkis Simitsis and Dimitri Theodoratos ......................................................... 312
Data Warehouse Performance / Beixin (Betsy) Lin, Yu Hong, and Zu-Hsu Lee ..................................................... 318
Data Warehousing and Mining in Supply Chains / Richard Mathieu and Reuven R. Levary ............................... 323
Data Warehousing Search Engine / Hadrian Peter and Charles Greenidge ......................................................... 328
Data Warehousing Solutions for Reporting Problems / Juha Kontio ..................................................................... 334
Database Queries, Data Mining, and OLAP / Lutz Hamel ....................................................................................... 339
Database Sampling for Data Mining / Patricia E.N. Lutu ....................................................................................... 344
DEA Evaluation of Performance of E-Business Initiatives / Yao Chen, Luvai Motiwalla, and M. Riaz Khan ...... 349
Decision Tree Induction / Roberta Siciliano and Claudio Conversano ............................................................... 353
Discovering an Effective Measure in Data Mining / Takao Ito ............................................................................... 364
Discovering Ranking Functions for Information Retrieval / Weiguo Fan and Praveen Pathak ............................ 377
Discretization for Data Mining / Ying Yang and Geoffrey I. Webb .......................................................................... 392
Discretization of Continuous Attributes / Fabrice Muhlenbach and Ricco Rakotomalala .................................. 397
Distributed Association Rule Mining / Mafruz Zaman Ashrafi, David Taniar, and Kate A. Smith ........................ 403
Distributed Data Management of Daily Car Pooling Problems / Roberto Wolfler Calvo, Fabio de Luigi,
Palle Haastrup, and Vittorio Maniezzo ............................................................................................................. 408
Drawing Representative Samples from Large Databases / Wen-Chi Hou, Hong Guo, Feng Yan,
and Qiang Zhu ..................................................................................................................................................... 413
Efficient Computation of Data Cubes and Aggregate Views / Leonardo Tininini .................................................. 421
Employing Neural Networks in Data Mining / Mohamed Salah Hamdi .................................................................. 433
Enhancing Web Search through Query Log Mining / Ji-Rong Wen ....................................................................... 438
Enhancing Web Search through Web Structure Mining / Ji-Rong Wen ................................................................. 443
Ethnography to Define Requirements and Data Model / Gary J. DeLorenzo ......................................................... 459
Evolution of Data Cube Computational Approaches / Rebecca Boon-Noi Tan ...................................................... 469
Evolutionary Data Mining For Genomics / Laetitia Jourdan, Clarisse Dhaenens, and El-Ghazali Talbi ............ 482
Explanation-Oriented Data Mining / Yiyu Yao and Yan Zhao ................................................................................. 492
Factor Analysis in Data Mining / Zu-Hsu Lee, Richard L. Peterson, Chen-Fu Chien, and Ruben Xing ............... 498
Financial Ratio Selection for Distress Classification / Roberto Kawakami Harrop Galvão, Victor M. Becerra,
and Magda Abou-Seada ..................................................................................................................................... 503
Flexible Mining of Association Rules / Hong Shen ................................................................................................. 509
Graph-Based Data Mining / Lawrence B. Holder and Diane J. Cook .................................................................... 540
Group Pattern Discovery Systems for Multiple Data Sources / Shichao Zhang and Chengqi Zhang ................... 546
Heterogeneous Gene Data for Classifying Tumors / Benny Yiu-ming Fung and Vincent To-yee Ng .................... 550
Hierarchical Document Clustering / Benjamin C. M. Fung, Ke Wang, and Martin Ester ....................................... 555
High Frequency Patterns in Data Mining / Tsau Young Lin .................................................................................... 560
Homeland Security Data Mining and Link Analysis / Bhavani Thuraisingham ..................................................... 566
Hyperbolic Space for Interactive Visualization / Jörg Andreas Walter ................................................................... 575
VOLUME II
Identifying Single Clusters in Large Data Sets / Frank Klawonn and Olga Georgieva ......................................... 582
Immersive Image Mining in Cardiology / Xiaoqiang Liu, Henk Koppelaar, Ronald Hamers,
and Nico Bruining ............................................................................................................................................... 586
Imprecise Data and the Data Mining Process / Marvin L. Brown and John F. Kros .............................................. 593
Incorporating the People Perspective into Data Mining / Nilmini Wickramasinghe .............................................. 599
Incremental Mining from News Streams / Seokkyung Chung, Jongeun Jun, and Dennis McLeod ........................ 606
Inexact Field Learning Approach for Data Mining / Honghua Dai ......................................................................... 611
Information Extraction in Biomedical Literature / Min Song, Il-Yeol Song, Xiaohua Hu, and Hyoil Han .............. 615
Integration of Data Sources through Data Mining / Andreas Koeller .................................................................... 625
Intelligent Query Answering / Zbigniew W. Ras and Agnieszka Dardzinska ........................................................ 639
Interactive Visual Data Mining / Shouhong Wang and Hai Wang .......................................................................... 644
Inter-Transactional Association Analysis for Prediction / Ling Feng and Tharam Dillon .................................... 653
Interval Set Representations of Clusters / Pawan Lingras, Rui Yan, Mofreh Hogo, and Chad West .................... 659
Knowledge Discovery with Artificial Neural Networks / Juan R. Rabuñal Dopico, Daniel Rivero Cebrián,
Julián Dorado de la Calle, and Nieves Pedreira Souto .................................................................................... 669
Learning Bayesian Networks / Marco F. Ramoni and Paola Sebastiani ............................................................... 674
Learning Information Extraction Rules for Web Data Mining / Chia-Hui Chang and Chun-Nan Hsu .................. 678
Locally Adaptive Techniques for Pattern Classification / Carlotta Domeniconi and Dimitrios Gunopulos ........ 684
Logical Analysis of Data / Endre Boros, Peter L. Hammer, and Toshihide Ibaraki .............................................. 689
Lsquare System for Mining Logic Data, The / Giovanni Felici and Klaus Truemper ............................................ 693
Material Acquisitions Using Discovery Informatics Approach / Chien-Hsing Wu and Tzai-Zang Lee ................. 705
Materialized View Selection for Data Warehouse Design / Dimitri Theodoratos and Alkis Simitsis ..................... 717
Methods for Choosing Clusters in Phylogenetic Trees / Tom Burr ........................................................................ 722
Mining Association Rules on a NCR Teradata System / Soon M. Chung and Murali Mangamuri ....................... 746
Mining Association Rules Using Frequent Closed Itemsets / Nicolas Pasquier ................................................... 752
Mining Chat Discussions / Stanley Loh, Daniel Licthnow, Thyago Borges, Tiago Primo,
Rodrigo Branco Kickhöfel, Gabriel Simões, Gustavo Piltcher, and Ramiro Saldaña ..................................... 758
Mining Data with Group Theoretical Means / Gabriele Kern-Isberner .................................................................. 763
Mining E-Mail Data / Steffen Bickel and Tobias Scheffer ....................................................................................... 768
Mining for Image Classification Based on Feature Elements / Yu-Jin Zhang ......................................................... 773
Mining for Profitable Patterns in the Stock Market / Yihua Philip Sheng, Wen-Chi Hou, and Zhong Chen ......... 779
Mining Frequent Patterns via Pattern Decomposition / Qinghua Zou and Wesley Chu ......................................... 790
Mining Group Differences / Shane M. Butler and Geoffrey I. Webb ....................................................................... 795
Mining Historical XML / Qiankun Zhao and Sourav Saha Bhowmick .................................................................. 800
Mining Quantitative and Fuzzy Association Rules / Hong Shen and Susumu Horiguchi ..................................... 815
Modeling Web-Based Data in a Data Warehouse / Hadrian Peter and Charles Greenidge ................................. 826
Mosaic-Based Relevance Feedback for Image Retrieval / Odej Kao and Ingo la Tendresse ................................. 837
Multimodal Analysis in Multimedia Using Symbolic Kernels / Hrishikesh B. Aradhye and Chitra Dorai ............ 842
Multiple Hypothesis Testing for Data Mining / Sach Mukherjee ........................................................................... 848
Negative Association Rules in Data Mining / Olena Daly and David Taniar ....................................................... 859
Neural Networks for Prediction and Classification / Kate A. Smith ......................................................................... 865
Off-Line Signature Recognition / Indrani Chakravarty, Nilesh Mishra, Mayank Vatsa, Richa Singh,
and P. Gupta ........................................................................................................................................................ 870
Online Signature Recognition / Indrani Chakravarty, Nilesh Mishra, Mayank Vatsa, Richa Singh,
and P. Gupta ........................................................................................................................................................ 885
Organizational Data Mining / Hamid R. Nemati and Christopher D. Barko .......................................................... 891
Path Mining in Web Processes Using Profiles / Jorge Cardoso ............................................................................. 896
Physical Data Warehousing Design / Ladjel Bellatreche and Mukesh Mohania .................................................. 906
Predicting Resource Usage for Capital Efficient Marketing / D. R. Mani, Andrew L. Betz, and James H. Drew .... 912
Privacy and Confidentiality Issues in Data Mining / Yücel Saygin ......................................................................... 921
Privacy Protection in Association Rule Mining / Neha Jha and Shamik Sural ..................................................... 925
Reasoning about Frequent Patterns with Negation / Marzena Kryszkiewicz ......................................................... 941
Recovery of Data Dependencies / Hee Beng Kuan Tan and Yuan Zhao ................................................................ 947
Resource Allocation in Wireless Networks / Dimitrios Katsaros, Gökhan Yavas, Alexandros Nanopoulos,
Murat Karakaya, Özgür Ulusoy, and Yannis Manolopoulos ............................................................................ 955
Retrieving Medical Records Using Bayesian Networks / Luis M. de Campos, Juan M. Fernández-Luna,
and Juan F. Huete ............................................................................................................................................... 960
Robust Face Recognition for Data Mining / Brian C. Lovell and Shaokang Chen ............................................... 965
Rough Sets and Data Mining / Jerzy W. Grzymala-Busse and Wojciech Ziarko .................................................... 973
Rule Generation Methods Based on Logic Synthesis / Marco Muselli .................................................................. 978
Rule Qualities and Knowledge Combination for Decision-Making / Ivan Bruha .................................................... 984
Sampling Methods in Approximate Query Answering Systems / Gautam Das ...................................................... 990
Search Situations and Transitions / Nils Pharo and Kalervo Järvelin .................................................................. 1000
Secure Multiparty Computation for Privacy Preserving Data Mining / Yehuda Lindell .......................................... 1005
Semantic Data Mining / Protima Banerjee, Xiaohua Hu, and Illhoi Yoo ............................................................... 1010
Semi-Structured Document Classification / Ludovic Denoyer and Patrick Gallinari ............................................ 1015
Sequential Pattern Mining / Florent Masseglia, Maguelonne Teisseire, and Pascal Poncelet ............................ 1028
Statistical Data Editing / Claudio Conversano and Roberta Siciliano .................................................................. 1043
Statistical Metadata in Data Processing and Interchange / Maria Vardaki ........................................................... 1048
Subgraph Mining / Ingrid Fischer and Thorsten Meinl .......................................................................................... 1059
Support Vector Machines / Mamoun Awad and Latifur Khan ............................................................................... 1064
Survival Analysis and Data Mining / Qiyang Chen, Alan Oppenheim, and Dajin Wang ...................................... 1077
Symbiotic Data Mining / Kuriakose Athappilly and Alan Rea .............................................................................. 1083
Symbolic Data Clustering / Edwin Diday and M. Narasimha Murthy .................................................................... 1087
Synthesis with Data Warehouse Applications and Utilities / Hakikur Rahman .................................................... 1092
Temporal Association Rule Mining in Event Sequences / Sherri K. Harms ........................................................... 1098
Text Content Approaches in Web Content Mining / Víctor Fresno Fernández and Luis Magdalena Layos ....... 1103
Text Mining Methods for Hierarchical Document Indexing / Han-Joon Kim ......................................................... 1113
Time Series Analysis and Mining Techniques / Mehmet Sayal .............................................................................. 1120
Topic Maps Generation by Text Mining / Hsin-Chang Yang and Chung-Hong Lee ............................................. 1130
Tree and Graph Mining / Dimitrios Katsaros and Yannis Manolopoulos ............................................................. 1140
Trends in Web Content and Structure Mining / Anita Lee-Post and Haihao Jin .................................................. 1146
Trends in Web Usage Mining / Anita Lee-Post and Haihao Jin ............................................................................ 1151
Use of RFID in Supply Chain Data Processing / Jan Owens, Suresh Chalasani,
and Jayavel Sounderpandian ............................................................................................................................. 1160
Using Standard APIs for Data Mining in Prediction / Jaroslav Zendulka .............................................................. 1171
Utilizing Fuzzy Decision Trees in Decision Making / Malcolm J. Beynon .............................................................. 1175
Vertical Data Mining / William Perrizo, Qiang Ding, Qin Ding, and Taufik Abidin .............................................. 1181
Video Data Mining / JungHwan Oh, JeongKyu Lee, and Sae Hwang .................................................................... 1185
Visualization Techniques for Data Mining / Herna L. Viktor and Eric Paquet ...................................................... 1190
Wavelets for Querying Multidimensional Datasets / Cyrus Shahabi, Dimitris Sacharidis,
and Mehrdad Jahangiri ...................................................................................................................................... 1196
Web Mining in Thematic Search Engines / Massimiliano Caramia and Giovanni Felici .................................... 1201
Web Usage Mining through Associative Models / Paolo Giudici and Paola Cerchiello .................................... 1231
World Wide Web Usage Mining / Wen-Chen Hu, Hung-Jen Yang, Chung-wei Lee, and Jyh-haw Yeh ............... 1242
Foreword
There has been much interest in the data mining field, in both academia and industry, over the past
10-15 years. The number of researchers and practitioners working in the field and the number of scientific papers
published in various data mining outlets have increased drastically over this period. Major commercial vendors have
incorporated various data mining tools into their products, and numerous applications in many areas, including life
sciences, finance, CRM, and Web-based applications, have been developed and successfully deployed.
Moreover, this interest is no longer limited to researchers working in the traditional fields of statistics, machine
learning, and databases, but has recently expanded to other fields, including operations research/management science
(OR/MS) and mathematics, as evidenced by various data mining tracks organized at different INFORMS meetings,
special issues of OR/MS journals, and the recent conference on Mathematical Foundations of Learning Theory
organized by mathematicians.
As the Encyclopedia of Data Warehousing and Mining amply demonstrates, all these diverse interests from
different groups of researchers and practitioners helped to shape data mining as a broad and multi-faceted discipline
spanning a large class of problems in such diverse areas as life sciences, marketing (including CRM and e-commerce),
finance, telecommunications, astronomy, and many other fields (the so-called "data mining and X" phenomenon, where
X constitutes a broad range of fields in which data mining is used for analyzing the data). This also resulted in a process
of cross-fertilization of ideas generated by these diverse groups of researchers interacting across the traditional
boundaries of their disciplines.
Despite all this progress, data mining still faces several challenges that make the field ripe with future research
opportunities. First, despite the cross-fertilization of ideas spanning various disciplines, the convergence among
different disciplines proceeds gradually, and more work is required to arrive at a unified view of data mining widely
accepted by different groups of researchers. Second, despite considerable progress, more work is still required on
the theoretical foundations of data mining, as was recently stated by the participants of the Dagstuhl workshop "Data
Mining: The Next Generation" organized by R. Agrawal, J.-C. Freytag, and R. Ramakrishnan, and also expressed by
various other data mining researchers. Third, the data mining community must address privacy and security
problems for data mining to be accepted by privacy advocates and Congress. Fourth, as the field advances, so
does the scope of data mining applications. The challenge to the field is to develop more advanced data mining methods
that work in these increasingly demanding applications. Fifth, despite considerable progress in developing more
user-friendly data mining tools, more work is required in this area with the goal of making these tools accessible to a
large audience of naïve data mining users. In particular, one of the challenges is to devise methods that would
smoothly embed data mining tools into corresponding applications on the front-end and would integrate these tools
with databases on the back-end. Achieving such capabilities is very important, since this would allow data mining to
"cross the chasm" (using Geoffrey Moore's terminology) and become a mainstream technology utilized by millions of
users. Finally, more work is required on actionability and on the development of better methods for discovering
actionable patterns in the data. Currently, discovering actionable patterns in data constitutes a laborious and
challenging process. It is important to streamline and simplify this process and make it more efficient.
Given significant and rapid advancements in data mining and data warehousing, it is important to take periodic
snapshots of the field every few years. The data mining community addressed this issue by producing publications
covering the state of the art of the field every few years starting with the first volume Advances in Knowledge
Discovery and Data Mining (edited by U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy) published by
AAAI/MIT Press in 1996. This encyclopedia provides the latest snapshot of the field and surveys a broad array of
topics ranging from the basic theories to the recent advancements in the field and covers a diverse range of problems
from the analysis of microarray data to the analysis of multimedia and Web data. It also identifies future directions and
trends in data mining and data warehousing. Therefore, this volume should become an excellent guide to researchers
and practitioners.
Alexander Tuzhilin
New York University, USA
Preface
How can a data-flooded manager get out of the mire? How can a confused decision maker pass through a maze?
How can an over-burdened problem solver clean up a mess? How can an exhausted scientist decipher a myth?
The answer is an interdisciplinary subject and a powerful tool known as data mining (DM). DM can turn data into
dollars; transform information into intelligence; change pattern into profit; and convert relationship into resources.
As the third branch of operations research and management science (OR/MS) and the third milestone of data
management, DM can help attack the third category of decision making by elevating our raw data into the third stage
of knowledge creation.
The term "third" has been mentioned four times above. Let's go backward and look at the three stages of knowledge
creation. Managers are often drowning in data (the first stage) but starving for knowledge. A collection of data is not
information (the second stage); and a collection of information is not knowledge. Data begets information which begets
knowledge. The whole subject of DM has a synergy of its own and represents more than the sum of its parts.
There are three categories of decision making: structured, semi-structured, and unstructured. "Decision making
processes fall along a continuum that ranges from highly structured decisions (sometimes called programmed) to highly
unstructured (non-programmed) decisions" (Turban et al., 2005, p. 12).
At one end of the spectrum, structured processes are routine and typically repetitive problems for which standard
solutions exist. Unfortunately, rather than being static, deterministic and simple, the majority of real world problems
are dynamic, probabilistic, and complex. Many professional and personal problems are classified as unstructured, or
marginally as semi-structured, or even in between, since the boundaries between them may not be crystal-clear.
In addition to developing normative models (such as linear programming and economic order quantity) for solving
structured (or programmed) problems, operations researchers and management scientists have created many descriptive
models, such as simulation and goal programming, to deal with semi-structured alternatives. Unstructured problems,
however, fall into a gray area for which there are no cut-and-dried solution methods. The current two branches of OR/MS
hit a dead end with unstructured problems.
To gain knowledge, one must understand the patterns that emerge from information. Patterns are not just simple
relationships among data; they exist separately from information, as archetypes or standards to which emerging
information can be compared so that one may draw inferences and take action. Over the last 40 years, the tools and
techniques used to process data and information have continued to evolve from databases (DBs) to data warehousing
(DW) and further to DM. DW applications, the middle of these three stages, have become business-critical. However,
DM can help deliver even more value from these huge repositories of information.
Certainly, there are many statistical models that have emerged over time. Machine learning has marked a milestone
in the evolution of computer science (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996). Although DM is still
in its infancy, it is now being used in a wide range of industries and for a range of tasks in a variety of contexts (Wang,
2003). DM is synonymous with knowledge discovery in databases, knowledge extraction, data/pattern analysis, data
archeology, data dredging, data snooping, data fishing, information harvesting, and business intelligence (Giudici,
2003; Hand et al., 2001; Han & Kamber, 2000). There are unprecedented opportunities in the future to utilize DM.
Data warehousing and mining (DWM) is the science of managing and analyzing large datasets and discovering novel
patterns. In recent years, DWM has emerged as a particularly exciting and industrially relevant area of research.
Prodigious amounts of data are now being generated in domains as diverse and elusive as market research, functional
genomics and pharmaceuticals. Intelligently analyzing data to discover knowledge with the aim of answering crucial
questions and helping make informed decisions is the challenge that lies ahead.
The Encyclopedia of Data Warehousing and Mining provides theories, methodologies, functionalities, and
applications to decision makers, problem solvers, and data miners in business, academia, and government. DWM lies
at the junction of database systems, artificial intelligence, machine learning and applied statistics, which makes it a
valuable area for researchers and practitioners. With a comprehensive overview, The Encyclopedia of Data Warehousing
and Mining offers a thorough exposure to the issues of importance in this rapidly changing field. The encyclopedia
also includes a rich mix of introductory and advanced topics while providing a comprehensive source of technical,
functional and legal references to DWM.
After spending more than a year preparing this book, with a strictly peer-reviewed process, I am delighted to see
it published. The standard for selection was very high. Each article went through at least three peer reviews; additional
third-party reviews were sought in cases of controversy. There have been innumerable instances where this feedback
has helped to improve the quality of the content, and even influenced authors in how they approach their topics.
The primary objective of this encyclopedia is to explore the myriad of issues regarding DWM. A broad spectrum
of practitioners, managers, scientists, educators, and graduate students who teach, perform research, and/or implement
these discoveries, are the envisioned readers of this encyclopedia.
The encyclopedia contains a collection of 234 articles, written by an international team of 361 experts representing
leading scientists and talented young scholars from 34 countries. They have contributed great effort to create a source
of solid, practical information, informed by sound underlying theory that should become a resource for all people
involved in this dynamic new field. Let's take a peek at a few articles:
The evaluation of DM methods requires a great deal of attention. A valid model evaluation and comparison can
considerably improve the efficiency of a DM process. Paolo Giudici has presented several ways to perform model
comparison, each of which has its advantages and disadvantages.
According to Zbigniew W. Ras, the main object of action rules is to generate special types of rules for a database
that point the direction for re-classifying objects with respect to some distinguishing attributes (called decision
attributes). This creates flexible attributes that form a basis for action rules construction.
With the constraints imposed by computer memory and mining algorithms, we can experience selection pressures
more than ever. The main point of instance selection is approximation: the task is to achieve mining results that are
as good as possible by approximating the whole dataset with the selected instances, and possibly even better, since
noisy and irrelevant data can be removed in the process. Huan Liu and Lei Yu have presented an
initial attempt to review and categorize the methods of instance selection in terms of sampling, classification, and
clustering.
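As a concrete illustration of the sampling category of instance selection, the sketch below approximates a large dataset by a uniform random sample. This is a minimal toy, not code from Liu and Yu's article; the function name and the choice of uniform sampling are illustrative assumptions.

```python
import random

def select_instances(dataset, fraction=0.1, seed=42):
    """Approximate a dataset by a uniform random sample of its instances.

    Mining on the sample trades a little accuracy for a much smaller
    working set, in the spirit of sampling-based instance selection.
    """
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)

# Toy dataset of 1,000 instances reduced to 100 selected instances.
data = [i % 7 for i in range(1000)]
sample = select_instances(data, fraction=0.1)
print(len(sample))  # 100
```

Classification- and clustering-based selection methods replace the uniform draw with a model-guided choice of representative instances, but the interface stays the same: dataset in, smaller dataset out.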
Shichao Zhang and Chengqi Zhang introduce a group of pattern discovery systems for dealing with the multiple
data source (MDS) problem, mainly including a logical system for enhancing data quality; a logical system for resolving
conflicts; a data cleaning system; a database clustering system; a pattern discovery system and a post-mining system.
Based on his extensive experience, Gautam Das surveys recent state-of-the-art solutions to the problem of
approximate query answering in databases, in which "ballpark" answers (i.e., approximate answers) to queries can be
provided within acceptable time limits. These techniques sacrifice accuracy to improve running time, typically through
some sort of lossy data compression. Also, Han-Joon Kim (the holder of two patents on text mining applications)
discusses a comprehensive text-mining solution to document indexing problems on topic hierarchies (taxonomy).
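The sampling idea behind such approximate query answering can be sketched as follows. This is an illustrative toy, not one of the techniques Das surveys; the function name and the dict-based table representation are assumptions.

```python
import random
import statistics

def approximate_avg(table, column, sample_size=100, seed=1):
    """Ballpark answer to SELECT AVG(column) from a uniform row sample.

    Returns the sample mean plus a rough standard-error bound, so the
    caller knows how approximate the answer is.
    """
    rng = random.Random(seed)
    rows = rng.sample(table, min(sample_size, len(table)))
    vals = [r[column] for r in rows]
    mean = statistics.mean(vals)
    # Standard error of the mean: a rough +/- bound on the estimate.
    err = statistics.stdev(vals) / (len(vals) ** 0.5)
    return mean, err
```

Scanning 100 sampled rows instead of millions is what buys the "acceptable time limits"; the price is the error bound returned alongside the estimate.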
Condensed representations have been proposed as a useful concept for the optimization of typical DM tasks. It
appears as a key concept within the emerging inductive DB framework, where inductive query evaluation calls for
effective constraint-based DM techniques. Jean-François Boulicaut introduces this research domain, its achievements
in the context of frequent itemset mining from transactional data and its future trends.
Zhi-Hua Zhou discusses complexity issues in DM. Although we still have a long way to go in order to produce
patterns that can be understood by most people involved with DM tasks, endeavors to improve the comprehensibility
of complicated algorithms have proceeded at a promising pace.
Pattern classification poses a difficult challenge in finite settings and high-dimensional spaces because of the curse
of dimensionality. Carlotta Domeniconi and Dimitrios Gunopulos discuss classification techniques, including the
authors own work, to mitigate the problem of dimensionality and reduce bias, by estimating local feature relevance and
selecting features accordingly. This issue has both theoretical and practical relevance, since learning tasks abound in
which data are represented as a collection of a very large number of features. Thus, many applications can benefit from
improvements in prediction error.
Qinghua Zou proposes using pattern decomposition algorithms to find frequent patterns in large datasets. Pattern
decomposition is a DM technology that uses known frequent or infrequent patterns to decompose long itemsets into
many short ones. It identifies frequent patterns in a dataset using a bottom-up methodology and reduces the size of
the dataset in each step. The algorithm avoids the process of candidate set generation and decreases the time for
counting supports due to the reduced dataset.
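The dataset-reduction idea can be illustrated with a simplified sketch: items already known to be infrequent are dropped from every transaction, and pair supports are then counted directly on the shrunken dataset, with no candidate-generation step. This is an illustrative simplification, not Zou's actual pattern decomposition algorithm; all names are assumptions.

```python
from collections import Counter
from itertools import combinations

def reduce_dataset(transactions, min_support):
    """Drop items known to be infrequent, shrinking every transaction."""
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    return [t & frequent for t in transactions]

def frequent_pairs(transactions, min_support):
    """Count pair supports on the reduced dataset -- no candidate sets."""
    reduced = reduce_dataset(transactions, min_support)
    pair_counts = Counter()
    for t in reduced:
        for pair in combinations(sorted(t), 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

txns = [{"a", "b", "d"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(frequent_pairs(txns, 2))  # {('a', 'b'): 2}
```

Because "d" occurs only once, it is removed before any pair is counted, which is the sense in which known infrequent patterns reduce both the dataset and the support-counting work.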
Perrizo, Ding, et al. review a category of DM approaches using vertical data structures. They demonstrate their
applications in various DM areas, such as association rule mining and multi-relational DM. Vertical DM strategy aims
at addressing scalability issues by organizing data in vertical layouts and conducting logical operations on vertically
partitioned data instead of scanning the entire DB horizontally.
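The vertical strategy can be illustrated with per-item tid-lists (sets of transaction ids), where support counting becomes a set intersection instead of a horizontal scan. This is a generic tid-list sketch, not the specific vertical structures (such as P-trees) of Perrizo et al.; the names are assumptions.

```python
def vertical_layout(transactions):
    """Convert horizontal transactions into item -> set-of-tids lists."""
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def support(tidlists, itemset):
    """Support of an itemset = size of the intersection of its tid-lists."""
    ids = set.intersection(*(tidlists[i] for i in itemset))
    return len(ids)

txns = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
t = vertical_layout(txns)
print(support(t, {"a", "b"}))  # 2
```

The scalability argument is visible even in this toy: computing the support of {a, b} touches only the two relevant tid-lists, never the transactions that contain neither item.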
Integration of data sources refers to the task of developing a common schema, as well as data transformation
solutions, for a number of data sources with related content. The large number and size of modern data sources makes
manual approaches to integration increasingly impractical. Andreas Koeller provides a comprehensive overview of
DM techniques that can help to partially or fully automate the data integration process.
DM applications often involve testing hypotheses regarding thousands or millions of objects at once. The statistical
concept of multiple hypothesis testing is of great practical importance in such situations, and an appreciation of the
issues involved can vastly reduce errors and associated costs. Sach Mukherjee provides an introductory look at multiple
hypothesis testing in the context of DM.
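Two standard corrections in this setting are the Bonferroni adjustment, which controls the family-wise error rate, and the Benjamini-Hochberg step-up procedure, which controls the false discovery rate. The sketch below is a minimal illustration of both, not code from Mukherjee's article.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.05):
    """BH step-up: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= alpha * k / m (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

ps = [0.001, 0.008, 0.028, 0.039, 0.6]
print(bonferroni(ps))          # [True, True, False, False, False]
print(benjamini_hochberg(ps))  # [True, True, True, True, False]
```

The example shows why the choice matters in DM, where thousands of hypotheses are tested at once: Bonferroni is far more conservative, while BH admits more discoveries at the cost of a controlled fraction of false ones.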
Maria Vardaki illustrates the benefits of using statistical metadata in information systems, also depicting how such
standardization can improve the quality of statistical results. She proposes a common, semantically rich, and object-
oriented data/metadata model for metadata management that integrates the main steps of data processing and covers
all aspects of DW that are essential for DM requirements. Finally, she demonstrates how a metadata model can be
integrated in a web-enabled statistical information system to ensure quality of statistical results.
A major obstacle in DM applications is the gap between statistic-based pattern extraction and value-based decision-
making. Profit mining aims at reducing this gap. The concept and techniques proposed by Ke Wang and Senqiang Zhou
are applicable to applications under a general notion of utility.
Although a tremendous amount of progress has been made in DM over the last decade or so, many important
challenges still remain. For instance, there are still no solid standards of practice; it is still too easy to misuse DM
software; secondary data analysis without appropriate experimental design is still common; and it is still hard to choose
the right kind of analysis method for the problem at hand. Xiao Hui Liu points out that intelligent data analysis (IDA)
is an interdisciplinary study concerning the effective analysis of data, which may help advance the state of the art in the
field.
In recent years, the need to extract complex tree-like or graph-like patterns from massive data collections (e.g., in
bioinformatics, semistructured or Web DBs) has become pressing. This has led to the emergence of the research field
of graph and tree mining. This field provides many promising topics for both theoretical and engineering achievements,
and many expect this to be one of the key fields in DM research in the years ahead. Katsaros and Manolopoulos review
the most important strategic application-domains where frequent structure mining (FSM) provides significant results.
A survey is presented of the most important algorithms that have been proposed for mining graph-like and tree-like
substructures in massive data collections.
Lawrence B. Holder and Diane J. Cook are among the pioneers in the field of graph-based DM and have developed
the widely-disseminated Subdue graph-based DM system (http://ailab.uta.edu/subdue). They have directed multi-
million dollar government-funded projects in the research, development and application of graph-based DM in real-
world tasks ranging from bioinformatics to homeland security.
Graphical models such as Bayesian networks (BNs) and decomposable Markov networks (DMNs) have been widely
applied to probabilistic reasoning in intelligent systems. Automatic discovery of such models from data is desirable,
but is NP-hard in general. Common learning algorithms use single-link look-ahead searches for efficiency. However,
pseudo-independent (PI) probabilistic domains are not learnable by such algorithms. Yang Xiang introduces fundamentals
of PI domains and explains why common algorithms fail to discover them. He further offers key ideas as to how
they can efficiently be discovered, and predicts advances in the near future.
Semantic DM is a novel research area that uses graph-based DM techniques and ontologies to identify complex
patterns in large, heterogeneous data sets. Tony Hu's research group at Drexel University is involved in the
development and application of semantic DM techniques to the bioinformatics and homeland security domains.
Yu-Jin Zhang presents a novel method for image classification based on feature elements through association rule
mining. The feature elements can capture well the visual meanings of images according to the subjective perception
of human beings, and are suitable for working with rule-based classification models. Techniques are adapted for mining
association rules that find associations between the feature elements and class attributes of the image, and
the mined rules are applied to image classification.
Results of image DB queries are usually presented as a thumbnail list. Subsequently, each of these images can be
used for refinement of the initial query. This approach is not suitable for queries by sketch. In order to receive the desired
images, the user has to recognize misleading areas of the sketch and modify these images appropriately. This is a non-
trivial problem, as the retrieval is often based on complex, non-intuitive features. Therefore, Odej Kao presents a mosaic-
based technique for sketch feedback, which combines the best sections contained in an image DB into a single query
image.
Andrew Kusiak and Shital C. Shah emphasize the need for an individual-based paradigm, which may ensure the well-
being of patients and the success of the pharmaceutical industry. The new methodologies are illustrated with various
medical informatics research projects on topics such as predictions for dialysis patients, significant gene/SNP
identifications, hypoplastic left heart syndrome for infants, and epidemiological and clinical toxicology. DWM and data
modeling will ultimately lead to targeted drug discovery and individualized treatments with minimum adverse effects.
The use of microarray DBs has revolutionized the way in which biomedical research and clinical investigation can
be conducted in that high-density arrays of specified DNA sequences can be fabricated onto a single glass slide or
chip. However, the analysis and interpretation of the vast amount of complex data produced by this technology poses
an unprecedented challenge. LinMin Fu and Richard Segall present a state-of-the-art review of microarray DM problems
and solutions.
Knowledge discovery from genomic data has become an important research area for biologists. An important
characteristic of genomic applications is the very large amount of data to be analyzed, and most of the time, it is not
possible to apply only classical statistical methods. Therefore, Jourdan, Dhaenens and Talbi propose to model
knowledge discovery tasks associated with such problems as combinatorial optimization tasks, in order to apply
efficient optimization algorithms to extract knowledge from those large datasets.
According to Indrani Chakravarty et al.'s research, the handwritten signature is a behavioral biometric. There
are two methods used for the recognition of handwritten signatures: offline and online. While offline methods extract static
features of signature instances by treating them as images, online methods extract and use temporal or dynamic features
of signatures for recognition purposes. Temporal features are difficult to imitate, and hence online recognition methods
offer higher accuracy rates than offline methods.
Neurons are small processing units that are able to store some information. When several neurons are connected,
the result is a neural network, a model inspired by biological neural networks like the brain. Kate Smith provides useful
guidelines to ensure successful learning and generalization of the neural network model. Also, a special version in the
form of probabilistic neural networks (PNNs) is explained by Ingrid Fischer with the help of graphic transformations.
The sheer volume of multimedia data available on the Internet has exploded in the past decade in the form of webcasts,
broadcast programs and streaming audio and video. Automated content analysis tools for multimedia depend on face
detectors and recognizers; videotext extractors; speech and speaker identifiers; people/vehicle trackers; and event
locators, resulting in large sets of multimodal features that can be real-valued, discrete, ordinal, or nominal. Multimedia
metadata based on such a multimodal collection of features poses significant difficulties for subsequent tasks such as
classification, clustering, visualization, and dimensionality reduction, which traditionally deal only with continuous-
valued data. Aradhye and Dorai discuss mechanisms that extend tasks traditionally limited to continuous-valued feature
spaces to multimodal multimedia domains with symbolic and continuous-valued features, including (a) dimensionality
reduction, (b) de-noising, (c) visualization, and (d) clustering.
Brian C. Lovell and Shaokang Chen review the recent advances in the application of face recognition for multimedia
DM. While the technology for mining text documents in large DBs could be said to be relatively mature, the same cannot
be said for mining other important data types such as speech, music, images and video. Yet these forms of multimedia
data are becoming increasingly common on the Internet and intranets.
The goal of Web usage mining is to capture, model, and analyze the behavioral patterns and profiles of users
interacting with a Web site. Bamshad Mobasher and Yongjian Fu provide an overview of the three primary phases of
the Web mining process: data preprocessing, pattern discovery, and pattern analysis. The primary focus of their articles
is on the types of DM and analysis tasks most commonly used in Web usage mining, as well as some of their typical
applications in areas such as Web personalization and Web analytics. Ji-Rong Wen explores the ways of enhancing
Web search using query log mining and Web structure mining.
In line with Mike Thelwall's opinion, scientific Web intelligence (SWI) is a research field that combines techniques
from DM, Web intelligence and scientometrics to extract useful information from the links and text of academic-related
Web pages, using various clustering, visualization and counting techniques. SWI is a type of Web mining that combines
Web structure mining and text mining. Its main uses are in addressing research questions concerning the Web, or Web-
related phenomena, rather than in producing commercially useful knowledge.
Web-enabled electronic business is generating massive amounts of data on customer purchases, browsing patterns,
usage times and preferences at an increasing rate. DM techniques can be applied to all the data being collected. Richi
Nayak presents issues associated with DM for Web-enabled electronic-business.
Tobias Scheffer gives an overview of common email mining tasks including email filing, spam filtering and mining
communication networks. The main section of his work focuses on recent developments in mining email data for support
of the message creation process. Approaches to mining question-answer pairs and sentences are also reviewed.
Stanley Loh describes a computer-supported approach to mine discussions that occurred in chat rooms. Dennis
McLeod explores incremental mining from news streams. JungHwan Oh summarizes the current status of video DM. J.
Ben Schafer addresses the technology used to generate recommendations.
In the abstract, a DW can be seen as a set of materialized views defined over source relations. During the initial design
of a DW, the designer faces the problem of deciding which views to materialize in the DW. This problem has been
addressed in the literature for different classes of queries and views, and with different design goals. Theodoratos and
Simitsis identify the different design goals used to formulate alternative versions of the problem and highlight the
techniques used to solve it.
Michel Schneider addresses the problem of designing a DW schema. He suggests a general model for this purpose
that integrates a majority of existing models: the notion of a well-formed structure is proposed to help design the process;
a graphic representation is suggested for drawing well-formed structures; and the classical star-snowflake structure
is represented.
Anthony Scime presents a methodology for adding external information from the World Wide Web to a DW, in
addition to the DW's domain information. The methodology assures decision makers that the added Web-based data
are relevant to the purpose and current data of the DW.
Privacy and confidentiality of individuals are important issues in the information technology age. Advances in DM
technology have increased privacy concerns even more. Jack Cook and Yücel Saygın highlight the privacy and
confidentiality issues in DM, and survey state of the art solutions and approaches for achieving privacy preserving
DM.
Ken Goodman provides one of the first overviews of ethical issues that arise in DM. He shows that while privacy
and confidentiality often are paramount in discussions of DM, other issues, including the characterization of
appropriate uses and users and data miners' intentions and goals, must be considered. Machine learning in genomics
and in security surveillance are set aside as special issues requiring attention.
Increased concern about privacy and information security has led to the development of privacy preserving DM
techniques. Yehuda Lindell focuses on the paradigms for defining security in this setting, and the need for a rigorous
approach. Shamik Sural et al. present some of the important approaches to privacy protection in association rule mining.
Human-computer interaction is crucial in the knowledge discovery process in order to accomplish a variety of novel
goals of DM. In Shou Hong Wang's opinion, interactive visual DM is human-centered DM, implemented through
knowledge discovery loops coupled with human-computer interaction and visual representations.
Symbiotic DM is an evolutionary approach that shows how organizations analyze, interpret, and create new
knowledge from large pools of data. Symbiotic data miners are trained business and technical professionals skilled in
applying complex DM techniques and business intelligence tools to challenges in a dynamic business environment.
Athappilly and Rea open the discussion on how businesses and academia can work to help professionals learn, and
fuse the skills of business, IT, statistics, and logic to create the next generation of data miners.
Yiyu Yao and Yan Zhao first make an immediate comparison between scientific research and DM and add an
explanation construction and evaluation task to the existing DM framework. Explanation-oriented DM offers a new
perspective, which has a significant impact on the understanding of the complete process of DM and effective
applications of DM results.
Traditional DM views the output from any DM initiative as a homogeneous knowledge product. Knowledge
however, always is a multifaceted construct, exhibiting many manifestations and forms. It is the thesis of Nilmini
Wickramasinghe's discussion that a more complete and macro perspective, and a more balanced approach to knowledge
creation, can best be provided by taking a broader perspective of the knowledge product resulting from the KDD
process: namely, by incorporating a people-based perspective into the traditional KDD process, and viewing knowledge
as the multifaceted construct it is. This in turn will serve to enhance the knowledge base of an organization, and facilitate
the realization of effective knowledge.
Fabrice Muhlenbach and Ricco Rakotomalala are the authors of an original supervised multivariate discretization
method called HyperCluster Finder. Their major contributions to the research community are implemented in the DM
software called TANAGRA, which is freely available on the Internet.
Recently there have been many efforts to apply DM techniques to security problems, including homeland security
and cyber security. Bhavani Thuraisingham (the inventor of three patents for MITRE) examines some of these
developments in DM in general and link analysis in particular, and shows how DM and link analysis techniques may
be applied for homeland security applications. Some emerging trends are also discussed.
In order to reduce financial statement errors and fraud, Garrity, O'Donnell and Sanders propose an architecture
that provides auditors with a framework for an effective continuous auditing environment that utilizes DM.
The applications of DWM are everywhere: from Kernel Methods in Chemoinformatics to Data Mining for Damage
Detection in Engineering Structures; from Predicting Resource Usage for Capital Efficient Marketing to Mining for
Profitable Patterns in the Stock Market; from Financial Ratio Selection for Distress Classification to Material
Acquisitions Using Discovery Informatics Approach; from Resource Allocation in Wireless Networks to Reinforcing
CRM with Data Mining; from Data Mining Medical Digital Libraries to Immersive Image Mining in Cardiology; from
Data Mining in Diabetes Diagnosis and Detection to Distributed Data Management of Daily Car Pooling Problems;
and from Mining Images for Structure to Automatic Musical Instrument Sound Classification. The list of DWM
applications is endless and the future of DWM is promising.
Knowledge explosion pushes DWM, a multidisciplinary subject, to ever-expanding regions. Inclusion, omission,
emphasis, evolution and even revolution are part of our professional life. In spite of our efforts to be careful, should
you find any ambiguities or perceived inaccuracies, please contact me at wangj@mail.montclair.edu.
Acknowledgments
The editor would like to thank all of the authors for their insights and excellent contributions to this book. I also want
to thank the group of anonymous reviewers who assisted me in the peer-reviewing process and provided comprehensive,
critical, and constructive reviews. Each Editorial Advisory Board member has made a big contribution in guidance
and assistance.
The editor wishes to acknowledge the help of all involved in the development process of this book, without whose
support the project could not have been satisfactorily completed. Linxi Liao and MinSun Ku, two graduate assistants,
are hereby graciously acknowledged for their diligent work. I owe my thanks to Karen Dennis for lending a hand in the
tedious process of proof-reading. A further special note of thanks goes to the staff at Idea Group Inc., whose
contributions have been invaluable throughout the entire process, from inception to final publication. Particular thanks
go to Sara Reed, Jan Travers, and Rene Davies, who continuously prodded via e-mail to keep the project on schedule,
and to Mehdi Khosrow-Pour, whose enthusiasm motivated me to accept his invitation to join this project.
My appreciation is also due to the Global Education Center at MSU for awarding me a Global Education Fund. I would
also like to extend my thanks to my brothers Zhengxian, Shubert (an artist, http://www.portraitartist.com/wang/bio.htm),
and sister Jixian, who stood solidly behind me and contributed in their own sweet little ways. Thanks go to all Americans,
since it would not have been possible for the four of us to come to the U.S. without their support of different scholarships.
Finally, I want to thank my family: my parents, Houde Wang and Junyan Bai for their encouragement; my wife Hongyu
for her unfailing support, and my son Leigh for being without a dad during this project. By the way, our second boy
Leon was born on August 4, 2004. Like a baby, DWM has a bright and promising future.
John Wang, Ph.D., is a full professor in the Department of Information and Decision Sciences at Montclair State
University (MSU), USA. Professor Wang has published 89 refereed papers and three books. He is on the editorial board
of the International Journal of Cases on Electronic Commerce and has been a guest editor and referee for Operations
Research, IEEE Transactions on Control Systems Technology, and many other highly prestigious journals. His long-term
research goal is the synergy of operations research, data mining and cybernetics.
Action Rules
Zbigniew W. Ras
University of North Carolina, Charlotte, USA
Angelina Tzacheva
University of North Carolina, Charlotte, USA
Li-Shiang Tsay
University of North Carolina, Charlotte, USA
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
reclassify objects with respect to some distinguished attribute (called a decision attribute). Clearly, each relational schema gives a list of attributes used to represent objects stored in a database. Values of some of these attributes, for a given object, can be changed, and this change can be influenced and controlled by the user. However, some of these changes (for instance, profit) cannot be done directly to a decision attribute. In such a case, definitions of this decision attribute in terms of other attributes (called classification attributes) have to be learned. These new definitions are used to construct action rules showing what changes in values of some attributes, for a given class of objects, are needed to reclassify objects the way users want. But users may still be either unable or unwilling to proceed with actions leading to such changes. In all such cases, we may search for definitions of values of any classification attribute listed in an action rule. By replacing a value of such an attribute by its definition, extracted either locally or at remote sites (if the system is distributed), we construct new action rules, which might be of more interest to business users than the initial rule.

MAIN THRUST

The technology dimension will be explored to clarify the meaning of actionable rules, including action rules and extended action rules.

Action Rules Discovery in a Stand-alone Information System

An information system is used for representing knowledge. Its definition, given here, is due to Pawlak (1991).

By an information system we mean a pair S = (U, A), where:

1. U is a nonempty, finite set of objects (object identifiers),
2. A is a nonempty, finite set of attributes, that is, a: U → Va for a ∈ A, where Va is called the domain of a.

Information systems can be seen as decision tables. In any decision table, together with the set of attributes, a partition of that set into conditions and decisions is given. Additionally, we assume that the set of conditions is partitioned into stable and flexible conditions (Ras & Wieczorkowska, 2000).

Attribute a ∈ A is called stable for the set U if its values assigned to objects from U cannot change in time. Otherwise, it is called flexible. Date of Birth is an example of a stable attribute. Interest rate on any customer account is an example of a flexible attribute. For simplicity reasons, we will consider decision tables with only one decision. We adopt the following definition of a decision table:

By a decision table we mean an information system S = (U, A1 ∪ A2 ∪ {d}), where d ∉ A1 ∪ A2 is a distinguished attribute called the decision. The elements of A1 are called stable conditions, whereas the elements of A2 ∪ {d} are called flexible conditions. Our goal is to change values of attributes in A2 for some objects from U so the values of the attribute d for these objects may change as well. Certain relationships between attributes from A1 and the attribute d will have to be discovered first.

By Dom(r) we mean all attributes listed in the IF part of a rule r extracted from S. For example, if r = [(a1,3)*(a2,4) → (d,3)] is a rule, then Dom(r) = {a1, a2}. By d(r) we denote the decision value of rule r. In our example, d(r) = 3.

If r1, r2 are rules and B ⊆ A1 ∪ A2 is a set of attributes, then r1/B = r2/B means that the conditional parts of rules r1, r2 restricted to attributes B are the same. For example, if r1 = [(a1,3) → (d,3)], then r1/{a1} = r/{a1}.

Assume also that (a, v → w) denotes the fact that the value of attribute a has been changed from v to w. Similarly, the term (a, v → w)(x) means that a(x) = v has been changed to a(x) = w. In other words, the property (a,v) of an object x has been changed to the property (a,w). Assume now that rules r1, r2 have been extracted from S and r1/A1 = r2/A1, d(r1) = k1, d(r2) = k2 and k1 < k2. Also, assume that (b1, b2, …, bp) is a list of all attributes in Dom(r1) ∩ Dom(r2) ∩ A2 on which r1, r2 differ, and r1(b1) = v1, r1(b2) = v2, …, r1(bp) = vp, r2(b1) = w1, r2(b2) = w2, …, r2(bp) = wp.

By a (r1,r2)-action rule on x ∈ U we mean a statement:

[(b1, v1 → w1) ∧ (b2, v2 → w2) ∧ … ∧ (bp, vp → wp)](x) ⇒ [(d, k1 → k2)](x).

If the value of the rule on x is true, then the rule is valid. Otherwise, it is false.

Let us denote by U<r1> the set of all customers in U supporting the rule r1. If the (r1,r2)-action rule is valid on x ∈ U<r1>, then we say that the action rule supports the new profit ranking k2 for x.
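This construction can be sketched in code. The following is my own minimal illustration, not code from the article; the dictionary encoding of rules and the example attribute names are assumptions:

```python
# Sketch of (r1,r2)-action rule construction: rules are (conditions, decision)
# pairs, with conditions given as a dict mapping attribute -> value.

def action_rule(r1, r2, stable, flexible):
    """Build a (r1,r2)-action rule if r1, r2 agree on the stable attributes."""
    cond1, d1 = r1
    cond2, d2 = r2
    # requirement r1/A1 = r2/A1: stable parts of both rules must match
    if any(cond1.get(a) != cond2.get(a) for a in stable):
        return None
    # flexible attributes listed in both rules on which the rules differ
    changes = {b: (cond1[b], cond2[b])
               for b in flexible
               if b in cond1 and b in cond2 and cond1[b] != cond2[b]}
    return changes, (d1, d2)

r1 = ({"a1": 3, "b1": 1}, "L")   # [(a1,3)*(b1,1) -> (d,L)]
r2 = ({"a1": 3, "b1": 2}, "H")   # [(a1,3)*(b1,2) -> (d,H)]
rule = action_rule(r1, r2, stable={"a1"}, flexible={"b1"})
print(rule)  # ({'b1': (1, 2)}, ('L', 'H')): change b1 from 1 to 2 => d: L -> H
```

The returned pair reads as: applying the listed flexible-attribute changes to an object supporting r1 is expected to move its decision value from the first to the second decision.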
To define an extended action rule (Ras & Tsay, 2003), let us assume that two rules are considered. We present

Table 1.

    A (St)  B (Fl)  C (St)  E (Fl)  G (St)  H (Fl)  D (Decision)
    a1      b1      c1      e1      -       -       d1
    a1      b2      -       -       g2      h2      d2
them in Table 1 to better clarify the process of constructing extended action rules. Here, St means a stable classification attribute and Fl means a flexible one.

In a classical representation, these two rules will have the form:

r1 = [a1 * b1 * c1 * e1 → d1],
r2 = [a1 * b2 * g2 * h2 → d2].

Assume now that object x supports rule r1, which means that it is classified as d1. In order to reclassify x to class d2, we need to change its value of B from b1 to b2, but we also have to require that G(x) = g2 and that the value of H for object x has to be changed to h2. This is the meaning of the extended (r1,r2)-action rule given below:

[(B, b1 → b2) ∧ (G = g2) ∧ (H, → h2)](x) ⇒ (D, d1 → d2)(x).

Assume now that by Sup(t) we mean the number of tuples having property t.

By the support of the extended (r1,r2)-action rule (given above) we mean:

Sup[(A=a1)*(B=b1)*(G=g2)]

By the confidence of the extended (r1,r2)-action rule (given above) we mean:

[Sup[(A=a1)*(B=b1)*(G=g2)*(D=d1)] / Sup[(A=a1)*(B=b1)*(G=g2)]] × [Sup[(A=a1)*(B=b2)*(C=c1)*(D=d2)] / Sup[(A=a1)*(B=b2)*(C=c1)]].
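These Sup-based computations can be sketched as follows. This is my own illustration, not code from the article; the row encoding and the toy table are assumptions, and the confidence is taken as the product of the two conditional factors, as in the text:

```python
# Sup(t) counts tuples satisfying a conjunction of attribute=value conditions;
# confidence multiplies the two conditional factors from the text.

def sup(table, cond):
    """Number of tuples matching every condition in cond."""
    return sum(all(row.get(a) == v for a, v in cond.items()) for row in table)

def confidence(table, left, right, decision_attr, d1, d2):
    """Product of the two conditional factors (assumes nonzero supports)."""
    c1 = sup(table, {**left, decision_attr: d1}) / sup(table, left)
    c2 = sup(table, {**right, decision_attr: d2}) / sup(table, right)
    return c1 * c2

table = [
    {"c": 2, "a": 1, "b": 1, "d": "L"},
    {"c": 1, "a": 2, "b": 2, "d": "L"},
    {"c": 2, "a": 2, "b": 1, "d": "H"},
    {"c": 1, "a": 1, "b": 1, "d": "L"},
]
print(sup(table, {"a": 1, "b": 1}))                                   # 2
print(confidence(table, {"a": 1, "b": 1}, {"c": 2, "a": 2}, "d", "L", "H"))  # 1.0
```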
To give another example of an extended action rule, assume that S = (U, A1 ∪ A2 ∪ {d}) is a decision table represented by Table 2. Assume that A1 = {c, b}, A2 = {a}. For instance, the rules r1 = [(a,1)*(b,1) → (d,L)], r2 = [(c,2)*(a,2) → (d,H)] can be extracted from S, where U<r1> = {x1, x4}. The extended (r1,r2)-action rule

[(a, 1 → 2) ∧ (c = 2)](x) ⇒ [(d, L → H)](x)

is only supported by object x1. The corresponding (r1,r2)-action rule

[(a, 1 → 2)](x) ⇒ [(d, L → H)](x)

is supported by x1 and x4.

Table 2.

         c   a   b   d
    x1   2   1   1   L
    x2   1   2   2   L
    x3   2   2   1   H
    x4   1   1   1   L

The confidence of an extended action rule is higher than the confidence of the corresponding action rule because all objects making the confidence of that action rule lower have been removed from its set of support.

Action Rules Discovery in Distributed Autonomous Information Systems

In Ras & Dardzinska (2002), the notion of a Distributed Autonomous Knowledge System (DAKS) framework was introduced. DAKS is seen as a collection of knowledge systems where each knowledge system is initially defined as an information system coupled with a set of rules (called a knowledge base) extracted from that system. These rules are transferred between sites due to the requests of a query answering system associated with the client site. Each rule transferred from one site of DAKS to another remains at both sites.

Assume now that information system S represents one of the DAKS sites. If rules extracted from S = (U, A1 ∪ A2 ∪ {d}), describing values of attribute d in terms of attributes from A1 ∪ A2, do not lead to any useful action rules (the user is not willing to undertake any actions suggested by rules), we may:

1) search for definitions of flexible attributes listed in the classification parts of these rules in terms of other local flexible attributes (local mining for rules),
2) search for definitions of flexible attributes listed in the classification parts of these rules in terms of flexible attributes from another site (mining for rules at remote sites),
3) search for definitions of decision attributes of these rules in terms of flexible attributes from another site (mining for rules at remote sites).

Another problem, which has to be taken into consideration, is the semantics of attributes that are common for a client site and some of the remote sites. This semantics may easily differ from site to site. Sometimes, such a difference in semantics can be repaired quite easily. For instance, if Temperature in Celsius is used at one site and Temperature in Fahrenheit at the other, a simple mapping will fix the problem. If information systems are complete and two attributes have the same name and differ only in their granularity level, a new hierarchical attribute can be formed to fix the problem. If databases are incomplete, the problem is more
complex because of the number of options available to interpret incomplete values (including null values). The problem is especially difficult in a distributed framework when chase techniques based on rules extracted at the client and at remote sites are used by a client site to impute current values by values which are less incomplete. These problems are presented and partial solutions given in Ras & Dardzinska (2002).

Now, let us assume that the action rule

To give a more formal definition of similarity, we assume that:

ρ(x,y) = [Σ{ρ(bi(x), bi(y)) : bi ∈ (A ∩ Am)}] / card(A ∩ Am), where:

ρ(bi(x), bi(y)) = 0, if bi(x) ≠ bi(y),
ρ(bi(x), bi(y)) = 1, if bi(x) = bi(y),
ρ(bi(x), bi(y)) = 1/2, if either bi(x) or bi(y) is undefined.
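A sketch of this similarity measure follows. This is my own illustration, not code from the article; the dictionary encoding and the use of None for an undefined value are assumptions:

```python
# Similarity of two objects over the attributes common to two sites:
# average per-attribute agreement, with None standing for an undefined value.

def rho(x, y, common_attrs):
    """1 per attribute if values are equal, 0 if different,
    1/2 if either value is undefined; averaged over common_attrs."""
    total = 0.0
    for b in common_attrs:
        bx, by = x.get(b), y.get(b)
        if bx is None or by is None:
            total += 0.5
        elif bx == by:
            total += 1.0
        # else: values differ, contribute 0
    return total / len(common_attrs)

x = {"age": "young", "income": "high", "city": None}
y = {"age": "young", "income": "low", "city": "NYC"}
print(rho(x, y, ["age", "income", "city"]))  # (1 + 0 + 0.5) / 3 = 0.5
```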
can be seen as actionable rules and the same used to construct action-rules.

REFERENCES

Adomavicius, G., & Tuzhilin, A. (1997). Discovery of actionable patterns in databases: The action hierarchy approach. In Proceedings of KDD'97 Conference, Newport Beach, CA. Menlo Park, CA: AAAI Press.

Geffner, H., & Wainer, J. (1998). Modeling action, knowledge and control. In H. Prade (Ed.), ECAI 98, 13th European Conference on AI (pp. 532-536). New York: John Wiley & Sons.

Liu, B., Hsu, W., & Chen, S. (1997). Using general impressions to analyze discovered classification rules. In Proceedings of KDD'97 Conference, Newport Beach, CA. Menlo Park, CA: AAAI Press.

Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer.

Ras, Z., & Dardzinska, A. (2002). Handling semantic inconsistencies in query answering based on distributed knowledge mining. In Foundations of Intelligent Systems, Proceedings of ISMIS'02 Symposium (pp. 66-74). LNAI (No. 2366). Berlin: Springer-Verlag.

Ras, Z., & Gupta, S. (2002). Global action rules in distributed knowledge systems. Fundamenta Informaticae Journal, 51(1-2), 175-184.

Ras, Z., & Wieczorkowska, A. (2000). Action rules: How to increase profit of a company. In D.A. Zighed, J. Komorowski, & J. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery, Proceedings of PKDD'00 (pp. 587-592), LNAI (No. 1910), Lyon, France. Berlin: Springer-Verlag.

Silberschatz, A., & Tuzhilin, A. (1995). On subjective measures of interestingness in knowledge discovery. In Proceedings of KDD'95 Conference. Menlo Park, CA: AAAI Press.

Silberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 5(6).

Tsay, L.-S., & Ras, Z.W. (2005). Action rules discovery system DEAR, method and experiments. Journal of Experimental and Theoretical Artificial Intelligence, Special Issue on Knowledge Discovery, 17(1-2), 119-128.

Tzacheva, A., & Ras, Z.W. (2003). Discovering non-standard semantics of semi-stable attributes. In I. Russell & S. Haller (Eds.), Proceedings of FLAIRS-2003 (pp. 330-334), St. Augustine, Florida. Menlo Park, CA: AAAI Press.

Tzacheva, A., & Ras, Z.W. (2005). Action rules mining. International Journal of Intelligent Systems, Special Issue on Knowledge Discovery (In press).

KEY TERMS

Actionable Rule: A rule is actionable if a user can take an action to his/her advantage based on this rule.

Autonomous Information System: An information system existing as an independent entity.

Domain of Rule: The attributes listed in the IF part of a rule.

Flexible Attribute: An attribute is called flexible if its value can be changed in time.

Knowledge Base: A collection of rules defined as expressions written in predicate calculus. These rules have the form of associations between conjuncts of values of attributes.

Ontology: An explicit formal specification of how to represent objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships holding among them. Systems that share the same ontology are able to communicate about a domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with the definitions in the ontology.

Semantics: The meaning of expressions written in some language, as opposed to their syntax, which describes how symbols may be combined independently of their meaning.

Stable Attribute: An attribute is called stable for the set U if its values assigned to objects from U cannot change in time.
Active Disks for Data Mining
the processing of SQL (structured query language) queries, as compared to low-level data manipulation languages (DMLs) for hierarchical and network databases, led to numerous proposals for database machines (Stanley & Su, 1983). The following categorization is given: (a) intelligent secondary storage devices, proposed to speed up text retrieval but later modified to handle relational algebra operators; (b) database filters to accelerate table scan, such as the content-addressable file store (CAFS) from International Computers Limited (ICL); (c) associative memory systems, which retrieve data by content; and (d) database computers, which are mainly multicomputers.

Intelligent secondary storage devices can be further classified (Riedel, 1999): (a) processor per track (PPT), (b) processor per head (PPH), and (c) processor per disk (PPD). Given that modern disks have thousands of tracks, the first solution is out of the question. The second solution may require the R/W heads to be aligned simultaneously to access all the tracks on a cylinder, which is not feasible. NCR's Teradata DBC/1012 database machine (1985) is a multicomputer PPD system.

To summarize, according to the active disk paradigm, the host computer offloads the processing of data-warehousing and data-mining operators onto the embedded microprocessor controller in the disk drive. There is usually a cache associated with each disk drive, which is used to hold prefetched data but can also be used as a small memory, as mentioned previously.
used as a small memory, as mentioned previously. basket or shopping-cart data, that is, the items purchased
on a particular visit to the supermarket. ARM first
determines the frequent sets, which have to meet a
MAIN THRUST certain support level. For example, s% support for two
items A and B, such as bread and butter, implies that they
Data mining, which requires high data access band- appear together in s percent of transactions. Another
widths and is computationally intensive, is used to illus- measure is the confidence level, which is the ratio of the
trate active disk applications. support for the set intersection of A and B divided by the
support for A by itself. If bread and butter appear to-
Data-Mining Applications gether in most market-basket transactions, then there is
high confidence that customers who buy bread also buy
The three main areas of data mining are (a) classifica- butter. On the other hand, this is meaningful only if a
tion, (b) clustering, and (c) association rule mining significant fraction of customers bought bread, that is,
(Dunham, 2003). A brief review is given of the methods the support level is high. Multiple passes over the data
discussed in this article. are required to find all association rules (with a lower
Classification assigns items to appropriate classes bound for the support) when the number of objects is
by using the attributes of each item. When regression is large.
used for this purpose, the input values are the item Algorithms to reduce the cost of ARM include sam-
attributes, and the output is its class. The k-nearest- pling, partitioning (the argument for why this works is
neighbor (k-NN) method uses a training set, and a new that a frequent set of items must be frequent in at least
item is placed in the set, whose entries appear most one partition), and parallel processing (Zaki, 1999).
among the k-NNs of the target item.
K-NN queries are also used in similarity search, for Hardware Technology Trends
example, content based image retrieval (CBIR). Ob-
jects (images) are represented by feature vectors in the A computer system, which may be a server, a worksta-
areas of object color, texture, and so forth. Similarity of tion, or a PC, has three components most affecting its
performance: one or more microprocessors, a main memory, and magnetic disk storage. My discussion of technology trends is based on Patterson and Keeton (1998).

The main memory consists of dynamic random access memory (DRAM) chips. Memory chip capacity is quadrupled every three years, so that the memory capacity versus cost ratio is increasing 25% per year. The memory access latency is about 150 nanoseconds and, in the case of RAMBUS DRAM, the bandwidth is 800 to 1600 MB per second. The access latency is dropping 7% per year, and the bandwidth is increasing 20% per year. According to Moore's law, processor speed increases 60% per year versus 7% for main memories, so that the processor-memory performance gap grows 50% per year. Multilevel cache memories are used to bridge this gap and reduce the number of processor cycles per instruction (Hennessy & Patterson, 2003).

According to Greg Papadopoulos (at Sun Microsystems), the demand for database applications exceeds the increase in central processing unit (CPU) speed according to Moore's law (Patterson & Keeton, 1998). Both are a factor of two, but the former is in 9 to 12 months, and the latter is in 18 months. Consequently, the so-called database gap is increasing with time.

Disk capacity is increasing at the rate of 60% per year. This is due to dramatic increases in magnetic recording density. Disk access time consists of queueing time, controller time, seek time, rotational latency, and

Table scan is a costly operation in relational databases, which is applied when the search argument is not sargable, that is, not indexed. The data filtering for table scans can be carried out in parallel if the data is partitioned across several disks. A GROUP BY SQL statement computes a certain value, such as the minimum, the maximum, or the mean, based on some classification of the rows in the table under consideration, for example, undergraduate standing. To compute the overall average GPA, each disk sends its mean GPA and the number of participating records to the host computer.
versus 7% for main memories, so that the processor- A table scan outperforms indexing in processing k-
memory performance gap grows 50% per year. Multilevel NN queries for high dimensions. There is the additional
cache memories are used to bridge this gap and reduce the cost of building and maintaining the index. A synthetic
number of processor cycles per instruction (Hennessy dataset associated with the IBM Almadens Quest
& Patterson, 2003). project for loan applications is used to experiment with
According to Greg Papadopulos (at Sun k-NN queries. The relational table contains the follow-
Microsystems), the demand for database applications ing attributes: age, education, salary, commission, zip
exceeds the increase in central processing unit (CPU) code, make of car, cost of house, loan amount, years
speed according to Moores law (Patterson & Keeton, owned. In the case of categorical attributes, an exact
1998). Both are a factor of two, but the former is in 9 to match is required.
12 months, and the latter is in 18 months. Consequently, The Apriori algorithm for ARM is also considered
the so-called database gap is increasing with time. in determining whether customers who purchase a par-
Disk capacity is increasing at the rate of 60% per ticular set of items also purchase an additional item,
year. This is due to dramatic increases in magnetic re- but this is meaningful at a certain level of support.
cording density. Disk access time consists of queueing More rules are generated for lower values of support,
time, controller time, seek time, rotational latency, and so the limited size of the disk cache may become a
transfer time. The increased disk RPMs (rotations per bottleneck.
minute) and especially increased linear recording densi- Analytical models and measurement results from a
ties have resulted in very high transfer rates, which are prototype show that active disks scale beyond the point
increasing 60% annually. The sum of seek time and where the server saturates.
rotational latency, referred to as positioning time, is of
the order of 10 milliseconds and is decreasing very Freeblock Scheduling
slowly (8% per year). Utilizing disk access bandwidth is
a very important consideration and is the motivation This CMU project emphasizes disk performance from
behind freeblock scheduling (Lumb, Schindler, Ganger, the viewpoint of maximizing disk arm utilization (Lumb
Nagle, & Riedel, 2000; Riedel, Faloutsos, Ganger, & et al., 2000). Freeblock scheduling utilizes opportu-
Nagle, 2000). nistic reading of low-priority blocks of data from disk,
while the arm having completed the processing of a
Active Disk Projects high priority request is moving to process another such
request. Opportunities for freeblock scheduling di-
There have been several concurrent activities in the area minish as more and more blocks are being read. This is
of active disks at Carnegie Mellon University, the Uni- because blocks located centrally will be accessed right
versity of California at Santa Barbara, University of away, although other disk blocks located at extreme
Maryland, University of California at Berkeley, and in disk cylinders may require an explicit access.
the SmartSTOR project at IBMs Almaden Research In one scenario a disk processes requests by an OLTP
Center. (online transaction processing) application and a back-
ground ARM application. OLTP requests have a higher
Active1 Disk Projects at CMU priority because transaction response time should be as
low as possible, but ARM requests are processed as
Most of this effort is summarized in (Riedel, Gibson, & freeblock requests. OLTP requests access specific
Faloutsos, 1998). records, but ARM requires multiple passes over the
TEAM LinG
Active Disks for Data Mining
dataset in any order. This is a common feature of algo- The host, which runs application programs, can offload
rithms suitable for freeblock scheduling. processing to SmartSTORs, which then deliver results A
In the experimental study, OLTP requests are gener- back to the host. Experimental results with the TPC-D
ated by transactions running at a given multiprogramming benchmark are presented (see http://www.tpc.org).
level (MPL) with a certain think time, that is, the time
before the transaction generates its next request. Re-
quests are to 8 KB blocks, and the Read:Write ratio is 2:1, FUTURE TRENDS
as in the TPC-C benchmark (see http://www.tpc.org),
while the background process accesses 4 KB blocks. The The offloading of activities from the host to peripheral
bandwidth due to freeblock requests increases with in- devices has been carried out successfully in the past.
creasing MPL. Initiating low-priority requests when the SmartSTOR is an intermediate step, and active disk is a
disk has a low utilization is not considered in this study, more distant possibility, which would require standard-
because such accesses would result in an increase in the ization activities such as object-based storage devices
response time of disk accesses on behalf of the OLTP (OSD) (http://www.snia.org).
application. Additional seeks are required to access the
remaining blocks so that the last 5% of requests takes 30%
of the time of a full scan. CONCLUSION
Active Disks at UCSB/Maryland The largest benefit stemming from active disks comes
from the parallel processing capability provided by a
The host computer acts as a coordinator, scheduler, and large number of disks, that the aggregate processing
combiner of results, while the bulk of processing is power of disk controllers may exceed the computing
carried out at the disks (Acharya, Uysal, & Saltz, 1998). power of servers.
The computation at the host initiates disklets at the The filtering effect is another benefit of active disks.
disks. Disklets are disallowed to initiate disk accesses Disk transfer rates are increasing rapidly, so that by
and to allocate and free memory in the disk cache. All eliminating unnecessary I/O transfers, more disks can
these functions are carried out by the host computers be placed on I/O buses or storage area networks
operating system (OS). Disklets can only access memory (SANs).
locations (in a disks cache) within certain bounds speci-
fied by the host computers OS.
Implemented are SELECT, GROUP BY, and ACKNOWLEDGMENT
DATACUBE operator, which computes GROUP BYs
for all possible combinations of a list of attributes Supported by NSF through Grant 0105485 in Computer
(Dunham, 2003), external sort, image convolution, and Systems Architecture.
generating composite satellite images.
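The distributed GROUP BY described under Active Disk Projects at CMU, where each disk ships only its local mean GPA and record count to the host, reduces to a weighted average at the host. The sketch below is illustrative only; the per-disk figures are hypothetical, not measurements from the cited work.

```python
def combine_means(partials):
    """Combine per-disk (mean, count) pairs into the overall mean.

    Each disk scans its own partition and ships back just two numbers,
    so the host never has to read the raw rows."""
    total = sum(count for _, count in partials)
    return sum(mean * count for mean, count in partials) / total

# hypothetical per-disk results: (mean GPA, number of undergraduate records)
per_disk = [(3.2, 100), (3.5, 50), (2.9, 150)]
overall = combine_means(per_disk)
```

Only the aggregates cross the I/O bus, which is exactly the filtering effect the conclusion attributes to active disks.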
Hennessy, J. L., & Patterson, D. A. (2003). Computer architecture: A quantitative approach (3rd ed.). Morgan Kaufmann.

Hsu, W. W., Smith, A. J., & Young, H. (2000). Projecting the performance of decision support workloads with smart storage (SmartSTOR). Proceedings of the Seventh International Conference on Parallel and Distributed Systems (pp. 417-425), Japan.

Keeton, K., Patterson, D. A., & Hellerstein, J. M. (1998). A case for intelligent disks (IDISKs). ACM SIGMOD Record, 27(3), 42-52.

Lumb, C., Schindler, J., Ganger, G. R., Nagle, D. F., & Riedel, E. (2000). Towards higher disk head utilization: Extracting free bandwidth from busy disk drives. Proceedings of the Fourth Symposium on Operating Systems Design and Implementation (pp. 87-102), USA.

Patterson, D. A., & Keeton, K. (1998). Hardware technology trends and database opportunities. Keynote address of the ACM SIGMOD Conference on Management of Data. Retrieved from http://www.cs.berkeley.edu/~pattrsn/talks.html

Ramakrishnan, R., & Gehrke, J. (2003). Database management systems (3rd ed.). McGraw-Hill.

Riedel, E. (1999). Active disks: Remote execution for network-attached storage (Tech. Rep. No. CMU-CS-99-177). CMU, Department of Computer Science.

Riedel, E., Faloutsos, C., Ganger, G. R., & Nagle, D. F. (2000). Data mining in an OLTP system (nearly) for free. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 13-21), USA.

Riedel, E., Gibson, G. A., & Faloutsos, C. (1998). Active storage for large scale data mining and multimedia applications. Proceedings of the 24th International Very Large Data Base Conference (pp. 62-73), USA.

Su, W.-Y.S. (1983). Advanced database machine architecture. Prentice-Hall.

Zaki, M. J. (1999). Parallel and distributed association rule mining: A survey. IEEE Concurrency, 7(4), 14-25.

KEY TERMS

Database Computer or Machine: A specialized computer for database applications, which usually works in conjunction with a host computer.

Database Gap: The processing demand for database applications doubles in nine to 12 months, but it takes 18 months for processor speed to double according to Moore's law.

Disk Access Time: Sum of seek time (ST), rotational latency (RL), and transfer time (TT). ST is the time to move the read/write heads (attached to the disk arm) to the appropriate concentric track on the disk. There is also a head selection time to select the head on the appropriate track. RL for small block transfers is half of the disk rotation time. TT is the ratio of the block size and the average disk transfer rate.

Freeblock Scheduling: A disk arm scheduling method that uses opportunistic accesses to disk blocks required by a low-priority activity.

Processor per Track/Head/Disk: The last organization corresponds to active disks.

Shared Everything/Nothing/Disks System: The main memory and disks are shared by the (multiple) processors in the first case, nothing is shared in the second case (i.e., standalone computers connected via an interconnection network), and disks are shared by processor-memory combinations in the third case.

SmartSTOR: A scheme where the disk array controller for multiple disks assists the host in processing database applications.

Table Scan: The sequential reading of all the blocks of a relational table to select a subset of its attributes based on a selection argument that is either not indexed (a non-sargable argument) or whose index is not clustered.

Transaction Processing Performance Council (TPC): Has published numerous benchmarks for transaction processing (TPC-C), decision support (TPC-H and TPC-R), and a transactional Web benchmark that also supports browsing (TPC-W).
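The Disk Access Time definition above can be turned into a small calculation. The function and drive parameters below are hypothetical illustrations of that formula, not figures from the article.

```python
def disk_access_time_ms(seek_ms, rpm, block_kb, transfer_mb_s):
    """ST + average RL + TT, in milliseconds, per the Disk Access Time entry."""
    rotation_ms = 60_000.0 / rpm              # time for one full rotation
    rotational_latency_ms = rotation_ms / 2   # average RL for small blocks
    transfer_ms = (block_kb / 1024.0) / transfer_mb_s * 1000.0
    return seek_ms + rotational_latency_ms + transfer_ms

# hypothetical drive: 5 ms seek, 10,000 RPM, 8 KB block, 40 MB/s sustained rate
t = disk_access_time_ms(5.0, 10_000, 8, 40)
```

The result is dominated by positioning time (seek plus rotational latency), which matches the article's point that positioning time of roughly 10 ms is the slowly improving component.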
Barbara (UCSB) and the University of Maryland; (c) intelligent disks by Keeton, Patterson, and Hellerstein (1998) at the University of California (UC) at Berkeley; and (d) the SmartSTOR project at UC Berkeley and the IBM Almaden Research Center by Hsu, Smith, and Young (2000).
2. The SQL SELECT and GROUP BY statements are easy to implement. The datacube operator computes GROUP BYs for all possible combinations of a list of attributes (Dunham, 2003). The PipeHash algorithm represents the datacube as a lattice of related GROUP BYs. A directed edge connects a GROUP BY i to a GROUP BY j if j can be generated by i and has one less attribute (Agarwal, Agrawal, Deshpande, Gupta, Naughton, Ramakrishnan, et al., 1996). Three other applications, dealing with an external sort, image convolution, and generating composite satellite images, are beyond the scope of this discussion.
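The datacube structure described in this endnote, every GROUP BY over a subset of the attribute list plus lattice edges to GROUP BYs that drop one attribute, can be enumerated directly. This is an illustrative sketch of the lattice, not the PipeHash implementation.

```python
from itertools import combinations

def datacube_group_bys(attributes):
    """All GROUP BYs the datacube operator computes: every subset of the
    attribute list, from the full set down to the empty (grand-total) group."""
    cubes = []
    for k in range(len(attributes), -1, -1):
        cubes.extend(combinations(attributes, k))
    return cubes

def lattice_edges(cubes):
    """Directed edge i -> j when j drops exactly one attribute of i, so j
    can be generated from i's result instead of from the raw data."""
    return [(i, j) for i in cubes for j in cubes
            if len(j) == len(i) - 1 and set(j) <= set(i)]

cubes = datacube_group_bys(("age", "education", "salary"))
edges = lattice_edges(cubes)
```

For n attributes there are 2^n GROUP BYs; PipeHash's contribution is choosing which lattice edges to follow so that each GROUP BY is computed from a small parent.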
Active Learning with Multiple Views
sites so that they can be accessed and combined via database-like queries. For example, consider the agent in Figure 1, which answers queries such as the following:

Show me the locations of all Thai restaurants in L.A. that are A-rated by the L.A. County Health Department.

To answer this query, the agent must combine data from several Web sources:

Figure 1. An information agent that combines data from the Zagat's restaurant guide, the L.A. County Health Department, the ETAK Geocoder, and the Tiger Map service
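The database-like combination the agent performs can be sketched as a join over extracts from two of these sources. All records below are invented for illustration; they stand in for data the agent would actually fetch from the Web.

```python
# hypothetical extracts from two of the agent's Web sources
zagat = [
    {"name": "Thai House",    "cuisine": "Thai",    "address": "123 Main St"},
    {"name": "Gino's",        "cuisine": "Italian", "address": "45 Oak Ave"},
    {"name": "Bangkok Cafe",  "cuisine": "Thai",    "address": "9 Elm St"},
]
health = {"Thai House": "A", "Gino's": "B", "Bangkok Cafe": "A"}

# database-like join: addresses of Thai restaurants that are A-rated
answers = [r["address"] for r in zagat
           if r["cuisine"] == "Thai" and health.get(r["name"]) == "A"]
```

The remaining sources in Figure 1 (the geocoder and the map service) would then turn each address into a location on a map.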
To illustrate Co-Testing for wrapper induction, consider the task of extracting restaurant phone numbers from documents similar to the one shown in Figure 2. To extract this information, the wrapper must detect both the beginning and the end of the phone number. For instance, to find where the phone number begins, one can use the following rule:

R1 = SkipTo( Phone:<i> )

This rule is applied forward, from the beginning of the page, and it ignores everything until it finds the string Phone:<i>. Note that this is not the only way to detect where the phone number begins. An alternative way to perform this task is to use the following rule:

R2 = BackTo( Cuisine ) BackTo( ( Number ) )

which is applied backward, from the end of the document. R2 ignores everything until it finds Cuisine and then, again, skips to the first number between parentheses.

Figure 2. The forward rule R1 and the backward rule R2 detect the beginning of the phone number. Forward and backward rules have the same semantics and differ only in where they are applied from (the start or the end of the document) and in the direction in which they scan. R1: SkipTo( Phone : <i> ); R2: BackTo(Cuisine) BackTo( (Number) ). Sample document: Name: <i>Ginos </i> <p>Phone :<i> (800)111-1717 </i> <p> Cuisine :

Note that R1 and R2 represent descriptions of the same concept (i.e., beginning of phone number) that are learned in two different views (see Muslea et al. [2001] for details on learning forward and backward rules). That is, views V1 and V2 consist of the sequences of characters that precede and follow the beginning of the item, respectively. View V1 is called the forward view, while V2 is the backward view. Based on V1 and V2, Co-Testing can be applied in a straightforward manner to wrapper induction. As shown in Muslea (2002), Co-Testing clearly outperforms existing state-of-the-art algorithms, both on wrapper induction and on a variety of other real-world domains.

Co-EMT: Interleaving Active and Semi-Supervised Learning

To further reduce the need for labeled data, Co-EMT (Muslea et al., 2002a) combines active and semi-supervised learning by interleaving Co-Testing with Co-EM (Nigam & Ghani, 2000). Co-EM, which is a semi-supervised, multi-view learner, can be seen as the following iterative, two-step process: first, it uses the hypotheses learned in each view to probabilistically label all the unlabeled examples; then it learns a new hypothesis in each view by training on the probabilistically labeled examples provided by the other view.

By interleaving active and semi-supervised learning, Co-EMT creates a powerful synergy. On one hand, Co-Testing boosts Co-EM's performance by providing it with highly informative labeled examples (instead of random ones). On the other hand, Co-EM provides Co-Testing with more accurate classifiers (learned from both labeled and unlabeled data), thus allowing Co-Testing to make more informative queries.

Co-EMT has not yet been applied to wrapper induction, because the existing algorithms are not probabilistic learners; however, an algorithm similar to Co-EMT was applied to information extraction from free text (Jones et al., 2003). To illustrate how Co-EMT works, we now describe the generic algorithm Co-EMTWI, which combines Co-Testing with the semi-supervised wrapper induction algorithm described next.

In order to perform semi-supervised wrapper induction, one can exploit a third view, which is used to evaluate the confidence of each extraction. This new content-based view (Muslea et al., 2003) describes the actual item to be extracted. For example, in the phone number extraction task, one can use the labeled examples to learn a simple grammar that describes the field content: ( Number ) Number Number. Similarly, when extracting URLs, one can learn that a typical URL starts with the string http://www., ends with the string .html, and contains no HTML tags.

Based on the forward, backward, and content-based views, one can implement the following semi-supervised wrapper induction algorithm. First, the small set of labeled examples is used to learn a hypothesis in each view. Then, the forward and backward views feed each other with unlabeled examples on which they make high-confidence extractions (i.e., strings that are extracted by either the forward or the backward rule and are also compliant with the grammar learned in the third, content-based view).

Given the previous Co-Testing and the semi-supervised learner, Co-EMTWI combines them as follows. First, the sets of labeled and unlabeled examples are used for semi-supervised learning. Second, the extraction rules that are learned in the previous step are used for Co-Testing. After making a query, the newly labeled example is added to the training set, and the whole process is repeated for a number of iterations. The empirical study in Muslea et al. (2002a) shows that, for a large variety of text classification tasks, Co-EMT outperforms both Co-Testing and the three state-of-the-art semi-supervised learners considered in that comparison.
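A minimal re-implementation of what the forward and backward rules above compute (this is a sketch, not the authors' code; the sample document is the one from Figure 2, with the spacing that appears there). The closing comparison captures the Co-Testing intuition: an unlabeled document on which the two views disagree is a highly informative query.

```python
import re

def forward_rule(doc):
    """R1 = SkipTo( Phone :<i> ): scan forward; the item starts right
    after the landmark string."""
    pos = doc.find("Phone :<i>")
    return pos + len("Phone :<i>") if pos >= 0 else None

def backward_rule(doc):
    """R2 = BackTo( Cuisine ) BackTo( ( Number ) ): scan backward from the
    end to "Cuisine", then keep going backward to a '(' that opens a number."""
    cuisine = doc.rfind("Cuisine")
    if cuisine < 0:
        return None
    opens = [m.start() for m in re.finditer(r"\(\d", doc[:cuisine])]
    return opens[-1] if opens else None

doc = "Name: <i>Ginos </i> <p>Phone :<i> (800)111-1717 </i> <p> Cuisine :"
start_fwd = forward_rule(doc)   # position just after the forward landmark
start_bwd = backward_rule(doc)  # position of the '(' found scanning backward

# Co-Testing intuition: the two views should point at the same item start;
# a document where they disagree is a candidate query for the user.
first_nonspace = start_fwd + len(doc[start_fwd:]) - len(doc[start_fwd:].lstrip())
views_agree = first_nonspace == start_bwd
```

On this document both views locate "(800)111-1717", so no query is needed; real Co-Testing applies learned rules of this shape to many unlabeled documents and asks the user about the contention points.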
Natural Language Processing & Very Large Corpora (pp. 100-110).

Jones, R., Ghani, R., Mitchell, T., & Riloff, E. (2003). Active learning for information extraction with multiple view feature sets. Proceedings of the ECML-2003 Workshop on Adaptive Text Extraction and Mining.

Knoblock, C., et al. (2001). The Ariadne approach to Web-based information integration. International Journal of Cooperative Information Systems, 10, 145-169.

Muslea, I. (2002). Active learning with multiple views [doctoral thesis]. Los Angeles: Department of Computer Science, University of Southern California.

Muslea, I., Minton, S., & Knoblock, C. (2000). Selective sampling with redundant views. Proceedings of the National Conference on Artificial Intelligence (AAAI-2000).

Muslea, I., Minton, S., & Knoblock, C. (2001). Hierarchical wrapper induction for semi-structured sources. Journal of Autonomous Agents & Multi-Agent Systems, 4, 93-114.

Muslea, I., Minton, S., & Knoblock, C. (2002a). Active + semi-supervised learning = robust multi-view learning. Proceedings of the International Conference on Machine Learning (ICML-2002).

Muslea, I., Minton, S., & Knoblock, C. (2002b). Adaptive view validation: A first step towards automatic view detection. Proceedings of the International Conference on Machine Learning (ICML-2002).

Muslea, I., Minton, S., & Knoblock, C. (2003). Active learning with strong and weak views: A case study on wrapper induction. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-2003).

Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability of co-training. Proceedings of the Conference on Information and Knowledge Management (CIKM-2000).

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3), 103-134.

Pierce, D., & Cardie, C. (2001). Limitations of co-training for natural language learning from large datasets. Empirical Methods in Natural Language Processing, 1-10.

Raskutti, B., Ferra, H., & Kowalczyk, A. (2002). Using unlabeled data for text classification through addition of cluster parameters. Proceedings of the International Conference on Machine Learning (ICML-2002).

Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 45-66.

KEY TERMS

Active Learning: Detecting and asking the user to label only the most informative examples in the domain (rather than randomly chosen examples).

Inductive Learning: Acquiring concept descriptions from labeled examples.

Meta-Learning: Learning to predict the most appropriate algorithm for a particular task.

Multi-View Learning: Explicitly exploiting several disjoint sets of features, each of which is sufficient to learn the target concept.

Semi-Supervised Learning: Learning from both labeled and unlabeled data.

View Validation: Deciding whether a set of views is appropriate for multi-view learning.

Wrapper Induction: Learning (highly accurate) rules that extract data from a collection of documents that share a similar underlying structure.
Administering and Managing a Data Warehouse

Chang Liu, Northern Illinois University, USA
Qiyang Chen, Montclair State University, USA
June Lu, University of Houston-Victoria, USA
bases into data warehouse databases. Thus, the team requires a technical role called ETL Specialist. On the other hand, a data warehouse is intended to support the business decision-making process, so someone like a business analyst is also needed to ensure that business information requirements are carried into the data warehouse development. Data in the data warehouse can be very sensitive and cross functional areas, such as personal medical records and salary information. Therefore, a higher level of security on the data is needed; encrypting the sensitive data in the data warehouse is a potential solution. Such issues in data warehouse administration and management need to be defined and discussed.

MAIN THRUST

Data warehouse administration and management covers a wide range of fields. This article focuses only on data warehouse and business strategy, the data warehouse development life cycle, the data warehouse team, process management, and security management to present the current concerns and issues in data warehouse administration and management.

Data Warehouse and Business Strategy

Data is the blood of an organization. Without data, the corporation has no idea where it stands and where it will go (Ferdinandi, 1999, p. xi). With data warehousing, today's corporations can collect and house large volumes of data. Does the size of the data volume alone guarantee success in business? Does it mean that the more data you have, the more strategic advantages you have over your competitors? Not necessarily. There is no predetermined formula that can turn your information into competitive advantages (Inmon, Terdeman, & Imhoff, 2000). Thus, top management and the data administration team are confronted with the question of how to convert corporate information into competitive advantages.

A well-managed data warehouse can assist a corporation in its strategy to gain competitive advantages. This can be achieved by using an exploration warehouse, which is a direct product of the data warehouse, to identify environmental factors, formulate strategic plans, and determine specific business objectives:

•	Identifying Environmental Factors: Quantified analysis can be used for identifying a corporation's products and services, the market share of specific products and services, and financial management.
•	Formulating Strategic Plans: Environmental factors can be matched up against the strategic plan by identifying current market positioning, financial goals, and opportunities.
•	Determining Specific Objectives: An exploration warehouse can be used to find patterns; if found, these patterns are then compared with patterns discovered previously to optimize corporate objectives (Inmon, Terdeman, & Imhoff, 2000).

While managing a data warehouse for business strategy, what needs to be taken into consideration is the difference between companies. No one formula fits every organization. Avoid using so-called templates from other companies: the data warehouse serves your own company's competitive advantages, so you need to follow your company's user information requirements for strategic advantages.

Data Warehouse Development Cycle

Data warehouse system development phases are similar to the phases in the systems development life cycle (SDLC) (Adelman & Rehm, 2003). However, Barker (1998) thinks that there are some differences between the two due to the unique functional and operational features of a data warehouse. As business and information requirements change, new corporate information models evolve and are synthesized into the data warehouse in the Synthesis of Model phase. These models are then used to exploit the data warehouse in the Exploit phase. The data warehouse is updated with new data using appropriate updating strategies and is linked to various data sources.

Inmon (2002) sees system development for the data warehouse environment as almost exactly the opposite of the traditional SDLC. He thinks that the traditional SDLC is concerned with and supports primarily the operational environment. The data warehouse operates under a very different life cycle called the CLDS (the reverse of the SDLC). The CLDS is a classic data-driven development life cycle, whereas the SDLC is a classic requirements-driven development life cycle.

The Data Warehouse Team

Building a data warehouse is a large system development process. Participants in data warehouse development can range from a data warehouse administrator (DWA) (Hoffer, Prescott, & McFadden, 2005) to a business analyst (Ferdinandi, 1999). The data warehouse team is supposed to lead the organization into assuming their roles and thereby bringing about a partnership with the business (McKnight, 2000). A data warehouse team may have the following roles (Barker, 1998; Ferdinandi, 1999; Inmon, 2000, 2003; McKnight, 2000):

•	Data Warehouse Administrator (DWA): Responsible for integrating and coordinating metadata and data across many different data sources, as well as data source management, physical database design, operation, backup and recovery, security, and performance and tuning.
•	Manager/Director: Responsible for the overall management of the entire team to ensure that the team follows the guiding principles, business requirements, and corporate strategic plans.
•	Project Manager: Responsible for data warehouse project development, including matching each team member's skills and aspirations to tasks on the project plan.
•	Executive Sponsor: Responsible for garnering and retaining adequate resources for the construction and maintenance of the data warehouse.
•	Business Analyst: Responsible for determining what information is required from a data warehouse to manage the business competitively.
•	System Architect: Responsible for developing and implementing the overall technical architecture of the data warehouse, from the back-end hardware and software to the client desktop configurations.
•	ETL Specialist: Responsible for routine work on data extraction, transformation, and loading for the warehouse databases.
•	Front-End Developer: Responsible for developing the front end, whether it is client-server or over the Web.
•	OLAP Specialist: Responsible for the development of data cubes, a multidimensional view of data in OLAP.
•	Data Modeler: Responsible for modeling the existing data in an organization into a schema that is appropriate for OLAP analysis.
•	Trainer: Responsible for training the end users to use the system so that they can benefit from the data warehouse system.
•	End User: Responsible for providing feedback to the data warehouse team.

In terms of the size of the data warehouse administrator team, Inmon (2003) has several recommendations:

•	A large warehouse requires more analysts.
•	Every 100 GB of data in a data warehouse requires another data warehouse administrator.
•	A new data warehouse administrator is required for each year a data warehouse is up and running and is being used successfully.
•	If the ETL code is being written manually, many data warehouse administrators are needed; an ETL automation tool requires much less staffing.
•	An automated data warehouse database management system (DBMS) requires fewer data warehouse administrators; otherwise, more administrators are needed.
•	Less supporting staff is required the more closely the corporate information factory (CIF) architecture is followed; conversely, more staff is needed.

McKnight (2000) suggests that all the technical roles be performed full-time by dedicated personnel and that each responsible person receive specific data warehouse training.

Data warehousing is growing rapidly. As the scope and data storage size of the data warehouse change, the roles and size of a data warehouse team should be adjusted accordingly. In general, the extremes should be avoided. Without sufficient professionals, the job may not be done satisfactorily; on the other hand, too many people will certainly get the team overstaffed.

Process Management

Developing a data warehouse has become a popular but exceedingly demanding and costly activity in information systems development and management. Data warehouse vendors are competing intensively for their customers because so much of their money and prestige is at stake. Consulting vendors have redirected their attention toward this rapidly expanding market segment. User companies face a serious question of which product they should buy. Sen and Jacob's (1998) advice is to first understand the process of data warehouse development before selecting the tools for its implementation. A data warehouse development process refers to the activities required to build a data warehouse (Barquin, 1997). Sen and Jacob (1998) and Ma, Chou, and Yen (2000) have identified some of these activities, which need to be managed during the data warehouse development cycle: initializing the project, establishing the technical environment, tool integration, determining scalability, developing an enterprise information architecture, designing the data warehouse database, data extraction/transformation, managing metadata, developing the end-user interface, managing the production environment, managing decision support tools and applications, and developing warehouse roll-out.
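Inmon's sizing heuristics listed earlier in this section can be read as a rough headcount formula. The sketch below is only one possible quantitative reading: the penalty values for manual ETL and a non-automated DBMS are assumptions for illustration, since Inmon gives no exact figures for them.

```python
def dwa_staff_estimate(data_gb, years_in_production, etl_automated, dbms_automated):
    """Toy reading of Inmon's (2003) heuristics: one DWA per 100 GB of
    warehouse data, plus one per year of successful production use."""
    staff = data_gb / 100 + years_in_production
    if not etl_automated:
        staff += 2   # assumed penalty for hand-written ETL (not Inmon's figure)
    if not dbms_automated:
        staff += 1   # assumed penalty for a non-automated DBMS (not Inmon's figure)
    return staff

# hypothetical shop: 500 GB warehouse, two years in production, tools automated
estimate = dwa_staff_estimate(500, 2, etl_automated=True, dbms_automated=True)
```

Whatever the exact constants, the shape of the formula makes the article's point: administration staffing grows with warehouse size and age, and shrinks with automation.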
As mentioned before, data warehouse development is a large system development process. Process management is not required in every step of the development processes. Devlin (1997) states that process management is required in the following areas: process schedule, which consists of a network of tasks and decision points; process map definition, which defines and maintains the network of tasks and decision points that make up a process; task initiation, which supports initiating tasks on all of the hardware/software platforms in the entire data warehouse environment; and status information enquiry, which enquires about the status of components that are running on all platforms.

Security Management

In recent years, information technology (IT) security has become one of the hottest and most important topics facing both users and providers (Senn, 2005). The goal of database security is the protection of data from accidental or intentional threats to its integrity and access (Hoffer, Prescott, & McFadden, 2005). The same is true for a data warehouse. However, higher security methods, in addition to common practices such as view-based control, integrity control, processing rights, and DBMS security, need to be used for the data warehouse due to the differences between a database and a data warehouse. One of the differences that demands a higher level of security for a data warehouse is the scope and detail level of data in the data warehouse, such as financial transactions, personal medical records, and salary information. A method that can be used to protect data requiring a high level of security in a data warehouse is encryption and decryption.

Confidential and sensitive data can be stored in a separate set of tables to which only authorized users have access. These data can be encrypted while they are being written into the data warehouse. In this way, the data captured and stored in the data warehouse are secure and can only be accessed on an authorized basis. Three levels of security can be offered by using encryption and decryption. The first level is that only authorized users can have access to the data in the data warehouse. Each group of users, internal or external, ranging from executives to information consumers, should be granted different rights for security reasons. Unauthorized users are totally prevented from seeing the data in the data warehouse. The second level is protection from unauthorized dumping and interpretation of data. Without the right key, an unauthorized user will not be allowed to write anything into the tables; on the other hand, the existing data in the tables cannot be decrypted. The third level is protection from unauthorized access during the transmission process. Even if unauthorized access occurs during transmission, there is no harm to the encrypted data unless the user has the decryption code (Ma, Chou, & Yen, 2000).

FUTURE TRENDS

Data warehousing administration and management is facing several challenges, as data warehousing becomes a mature part of the infrastructure of organizations. More legislative work is necessary to protect individual privacy from abuse by government or commercial entities that have large volumes of data concerning those individuals. The protection also calls for tightened security through technology, as well as user efforts toward workable rules and regulations, while at the same time still granting a data warehouse the ability to process large datasets for meaningful analyses (Marakas, 2003).

Today's data warehouse is limited to the storage of structured data in the form of records, fields, and databases. Unstructured data, such as multimedia, maps, graphs, pictures, sound, and video files, are demanded increasingly in organizations. How to manage the storage and retrieval of unstructured data and how to search for specific data items set a real challenge for data warehouse administration and management. Alternative storage, especially near-line storage, which is one of the two forms of alternative storage, is considered to be one of the best future solutions for managing the storage and retrieval of unstructured data in data warehouses (Marakas, 2003).

The past decade has seen a fast rise of the Internet and World Wide Web. Today, Web-enabled versions of all leading vendors' warehouse tools are becoming available (Moeller, 2001). This recent growth in Web use and advances in e-business applications have pushed the data warehouse from the back office, where it is accessed by only a few business analysts, to the front lines of the organization, where all employees and every customer can use it.

To accommodate this move to the front line of the organization, the data warehouse demands massive scalability for data volume as well as for performance. As the number and types of users increase rapidly, enterprise data volume is doubling in size every 9 to 12 months. Around-the-clock access to the data warehouse is becoming the norm. The data warehouse will require fast implementation, continuous scalability, and ease of management (Marakas, 2003).

Additionally, building distributed warehouses, which are normally called data marts, will be on the rise. Other technical advances in data warehousing will include an increasing ability to exploit parallel processing, automated information delivery, greater support of object extensions, very large database support, and user-friendly Web-enabled analysis applications. These capabilities should make data warehouses of the future more powerful and easier to use, which will further increase the importance of data warehouse technology for business strategic decision making and competitive advantages (Ma, Chou, & Yen, 2000; Marakas, 2003; Pace University, 2004).
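Returning to the three-level encryption scheme described under Security Management, the write path for a sensitive column can be sketched as follows. This is a toy illustration only: the keystream cipher below exists purely to keep the example self-contained, a production warehouse would use a vetted cipher such as AES, and the table, column, and key names are invented:

```python
# Toy sketch of encrypting a sensitive column before it is written to a
# separate, restricted warehouse table. The keystream construction below is
# for illustration only; production systems should use a vetted cipher
# (e.g., AES) rather than this hash-derived XOR stream.

import hashlib
from itertools import count

def _keystream(key: bytes):
    # Derive an endless byte stream from the key by hashing a counter.
    for block in count():
        yield from hashlib.sha256(key + block.to_bytes(8, "big")).digest()

def encrypt(plaintext: str, key: bytes) -> bytes:
    data = plaintext.encode("utf-8")
    return bytes(b ^ k for b, k in zip(data, _keystream(key)))

def decrypt(ciphertext: bytes, key: bytes) -> str:
    return bytes(b ^ k for b, k in zip(ciphertext, _keystream(key))).decode("utf-8")

key = b"warehouse-master-key"  # held only by authorized users
row = {"employee_id": 1042, "salary": encrypt("85000", key)}

# Without the key, the stored value is unreadable; with it, the original
# value is recovered -- matching the second and third protection levels above.
assert row["salary"] != b"85000"
assert decrypt(row["salary"], key) == "85000"
```

Intercepting `row["salary"]` in transit or dumping the table yields only ciphertext, which is the property the article attributes to the second and third security levels.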
Giovanni Quattrone
Università Mediterranea di Reggio Calabria, Italy

Giorgio Terracina
Università della Calabria, Italy

Domenico Ursino
Università Mediterranea di Reggio Calabria, Italy

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Agent-Based Mining of User Profiles for E-Services
based on the maintenance and the exploitation of user profiles.

• In Lau, Hofstede, and Bruza (2000), WEBS, an agent-based approach for supporting e-commerce activities, is proposed. It exploits probabilistic logic rules for allowing the customer preferences for other products to be deduced.
• Ardissono et al. (2001) describe SETA, a multi-agent system conceived for developing adaptive Web stores. SETA uses knowledge representation techniques to construct, maintain, and exploit user profiles.
• In Bradley and Smyth (2003), the system CASPER, for handling recruitment services, is proposed. Given a user, CASPER first ranks job advertisements according to the applicant's desires and then recommends job proposals to the applicant on the basis of the applicant's past behavior.
• In Razek, Frasson, and Kaltenbach (2002), a multi-agent prototype for e-learning called CITS (Confidence Intelligent Tutoring Agent) is proposed. The approach of CITS aims at being adaptive and dynamic.
• In Shang, Shi, and Chen (2001), IDEAL (Intelligent Distributed Environment for Active Learning), a multi-agent system for active distance learning, is proposed. In IDEAL, course materials are decomposed into small components called lecturelets. These are XML documents containing Java code; they are dynamically assembled to cover course topics according to learner progress.
• In Zaiane (2002), an approach for exploiting Web-mining techniques to build a software agent supporting e-learning activities is presented.

All these systems construct, maintain, and exploit a user profile; therefore, we can consider them adaptive w.r.t. the user. However, to the best of our knowledge, none of them is adaptive w.r.t. the device.

On the other side, in various areas of computer science research, a large variety of approaches adapting their behavior to the device the user is exploiting have been proposed. As an example:

• In Anderson, Domingos, and Weld (2001), a framework called MINPATH, capable of simplifying the browsing activity of a mobile user and taking into account the device the user is exploiting, is presented.
• In Macskassy, Dayanik, and Hirsh (2000), a framework named i-Valets is proposed for allowing a user to visit an information source by using different devices.
• Samaras and Panayiotou (2002) present a flexible agent-based system for providing wireless users with personalized access to Internet services.
• In Araniti, De Meo, Iera, and Ursino (2003), a novel XML-based multi-agent system for QoS management in wireless networks is presented.

These approaches are particularly general and interesting; however, to the best of our knowledge, none of them has been conceived for handling e-services.

MAIN THRUST

Challenges to Face

In order to overcome the problems outlined previously, some challenges must be tackled.

First, a user can access many e-services, operating in the same or in different application contexts; a faithful and complete profile of the user can be constructed only by taking into account the user's behavior while accessing all the sites. In other words, it should be possible to construct a unique structure on the user side, storing the user's profile and, therefore, representing the user's behavior while accessing all the sites.

Second, for a given user and e-service provider, it should be possible to compare the profile of the user with the offers of the provider for extracting those proposals that will probably interest the user. Existing techniques for satisfying such a requirement are based mainly on the exploitation of either log files or cookies. Techniques based on log files can register only some information about the actions carried out by the user upon accessing an e-service; however, they cannot match user preferences and e-service proposals. Vice versa, techniques based on cookies are able to carry out a certain, even if primitive, match; however, they need to know and exploit some personal information that a user might consider private.

Third, it should be necessary to overcome the typical one-size-fits-all philosophy of present e-service providers by developing systems capable of adapting their behavior to both the profile of the user and the characteristics of the device the user is exploiting for accessing them (Communications of the ACM, 2002).

System Description

The system we present in this article (called e-service adaptive manager [ESA-Manager]) aims at solving all
three problems mentioned previously. It is an XML-based multi-agent system for handling user accesses to e-services, capable of adapting its behavior to both user and device profiles.

In ESA-Manager, a service provider agent is present for each e-service provider, handling the proposals stored therein as well as the interaction with the user. In addition, an agent is associated with each user, adapting its behavior to the profiles of both the user and the device the user is exploiting for visiting the sites. Actually, since a user can access e-service providers by means of different devices, the user's profile cannot be stored in only one of them; as a matter of fact, it is necessary to have a unique copy of the user profile that registers the user's behavior in visiting the e-service providers during the various sessions, possibly carried out by means of different devices. For this reason, the profile of a user must be handled and stored in a support different from the devices generally exploited by the user for accessing e-service providers. As a consequence, on the user side, the exploitation of a profile agent, storing the profiles of both involved users and devices, and of a user-device agent, associated with a specific user operating by means of a specific device and supporting the user in his or her activities, appears compulsory.

As previously pointed out, for each user, a unique profile is mined and maintained, storing information about the user's behavior in accessing all e-service providers (the techniques for mining, maintaining, and exploiting user profiles are quite complex and differ slightly across application domains; the interested reader can find examples of them, along with the corresponding validation issues, in De Meo, Rosaci, Sarnè, Terracina, and Ursino (2003) for e-commerce and in De Meo, Garro, Terracina, and Ursino (2003) for e-learning). In this way, ESA-Manager solves the first problem mentioned previously.

Whenever a user accesses an e-service by means of a certain device, the corresponding service provider agent sends information about its proposals to the user-device agent associated with the user and the device he or she is exploiting. The user-device agent determines similarities between the proposals presented by the provider and the interests of the user. For each of these similarities, the service provider agent and the user-device agent cooperate for presenting to the user a group of Web pages, adapted to the exploited device, illustrating the proposal.

We argue that this behavior provides ESA-Manager with the capability of supporting the user in the search for proposals of the user's interest offered by the provider. In addition, the algorithms underlying ESA-Manager allow it to identify not only the proposals probably interesting for the user in the present, but also other ones possibly interesting for the user in the future that the user disregarded in the past (see De Meo, Rosaci, Sarnè, Terracina & Ursino [2003] for a specialization of these algorithms to e-commerce). In our opinion, this is a particularly interesting feature for a novel approach devoted to dealing with e-services.

Last, but not least, it is worth observing that since the user profile management is carried out at the user side, no information about the user profile is sent to the e-service providers. In this way, ESA-Manager solves privacy problems left open by cookies.

All the reasonings presented show that ESA-Manager is capable of solving also the second problem mentioned previously.

In ESA-Manager, the device profile plays a central role. Indeed, the proposals of a provider shown to a user, as well as their presentation formats, depend on the characteristics of the device the user is presently exploiting. However, the ESA-Manager capability of adapting its behavior to the device the user is exploiting is not restricted to the presentation format of the proposals; indeed, the exploited device can also influence the computation of the interest degree shown by a user for the proposals presented by each provider.

More specifically, one of the parameters that the interest degree associated with a proposal is based on is the time the user spends visiting the corresponding Web pages. This time is not to be considered as an absolute measure, but it must be normalized w.r.t. both the characteristics of the exploited device and the navigation costs (Chan, 2000). The following example allows this intuition to be clarified. Assume that a user visits a Web page two times and that each visit takes n seconds. Suppose, also, that during the first access, the user exploits a mobile phone having a low processor clock and supporting a connection characterized by a low bandwidth and a high cost. During the second visit, the user uses a personal computer having a high processor clock and supporting a connection characterized by a high bandwidth and a low cost. It is possible to argue that the interest the user exhibited for the page in the former access is greater than the interest exhibited in the latter one. Also, other device parameters influence the behavior of ESA-Manager (see De Meo, Rosaci, Sarnè, Terracina & Ursino [2003] for a detailed specification of the role of these parameters). This reasoning allows us to argue that ESA-Manager solves also the third problem mentioned previously.

As already pointed out, many agents are simultaneously active in ESA-Manager; they strongly interact with each other and continuously exchange information. In this scenario, an efficient management of information exchange appears crucial. One of the most promising solutions to this problem has been the adoption of XML. XML capabilities make it particularly suited to be exploited in agent research. In ESA-Manager, the role of XML is central; indeed, (1) the agent ontologies are stored as XML documents; (2) the agent communication language is ACML; (3) the extraction of information from the various data structures is carried out by means of XQuery; and (4) the manipulation of agent ontologies is performed by means of the Document Object Model (DOM).

FUTURE TRENDS

The spectacular growth of the Internet during the last decade has strongly conditioned the e-service landscape. Such growth is particularly striking in some application domains, such as financial services or e-government.

As an example, Internet technology has enabled the expansion of financial services by integrating the already existing, quite variegated financial data and services and by providing new channels for information delivery. For instance, in 2004, the number of households in the U.S. using online banking was expected to reach approximately 24 million, nearly double the number of households at the end of 2000.

Moreover, e-services are not a leading paradigm only in business contexts; they are an emerging standard in several application domains. As an example, they are applied vigorously by governmental units at national, regional, and local levels around the world. In addition, e-service technology is currently successfully exploited in some metropolitan networks for providing mediation tools in a democratic system, in order to make citizen participation in rule- and decision-making processes more feasible and direct. These are only two examples of the role e-services can play in the e-government context. Handling and managing this technology in all these environments is one of the most challenging issues for present and future researchers.

CONCLUSION

In this article, we have proposed ESA-Manager, an XML-based and adaptive multi-agent system for supporting a user accessing an e-service provider in the search for proposals present therein that appear appealing according to the user's past interests and behavior.

We have shown that ESA-Manager is adaptive w.r.t. the profile of both the user and the device the user is exploiting for accessing the e-service provider. Finally, we have seen that it is XML-based, since XML is exploited both for storing the agent ontologies and for handling the agent communication.

As for future work, we argue that various improvements could be performed on ESA-Manager for bettering its effectiveness and completeness. As an example, it might be interesting to categorize involved users on the basis of their profiles, as well as involved providers on the basis of their proposals. As a further example of profitable features with which our system could be enriched, we consider extremely promising the derivation of association rules representing and predicting the user behavior in accessing one or more providers. Finally, ESA-Manager could be made even more adaptive by considering the possibility to adapt its behavior on the basis not only of the device a user is exploiting during a certain access, but also of the context (e.g., job, holidays) in which the user is currently operating.

REFERENCES

Adaptive Web. (2002). Communications of the ACM, 45(5).

Anderson, C.R., Domingos, P., & Weld, D.S. (2001). Adaptive Web navigation for wireless devices. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, Washington.

Araniti, G., De Meo, P., Iera, A., & Ursino, D. (2003). Adaptively controlling the QoS of multimedia wireless applications through user-profiling techniques. Journal of Selected Areas in Communications, 21(10), 1546-1556.

Ardissono, L., et al. (2001). Agent technologies for the development of adaptive Web stores. In Agent Mediated Electronic Commerce, The European AgentLink Perspective (pp. 194-213). Lecture Notes in Computer Science, Springer.

Bradley, K., & Smyth, B. (2003). Personalized information ordering: A case study in online recruitment. Knowledge-Based Systems, 16(5-6), 269-275.

Chan, P.K. (2000). Constructing Web user profiles: A non-invasive learning approach. In Web Usage Analysis and User Profiling (pp. 39-55). Springer.

De Meo, P., Garro, A., Terracina, G., & Ursino, D. (2003). X-Learn: An XML-based, multi-agent system for supporting user-device adaptive e-learning. Proceedings of the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE 2003), Taormina, Italy.
De Meo, P., Rosaci, D., Sarnè, G.M.L., Terracina, G., & Ursino, D. (2003). An XML-based adaptive multi-agent system for handling e-commerce activities. Proceedings of the International Conference on Web Services Europe (ICWS-Europe '03), Erfurt, Germany.

Garcia, F.J., Paternò, F., & Gil, A.B. (2002). An adaptive e-commerce system definition. Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH 2002), Malaga, Spain.

Hull, R., Benedikt, M., Christophides, V., & Su, J. (2003). E-services: A look behind the curtain. Proceedings of the Symposium on Principles of Database Systems (PODS 2003), San Diego, California.

Lau, R., Hofstede, A., & Bruza, P. (2000). Adaptive profiling agents for electronic commerce. Proceedings of the CollECTeR Conference on Electronic Commerce (CollECTeR 2000), Breckenridge, Colorado.

Macskassy, S.A., Dayanik, A.A., & Hirsh, H. (2000). Information valets for intelligent information access. Proceedings of the AAAI Spring Symposia Series on Adaptive User Interfaces (AUI-2000), Stanford, California.

Razek, M.A., Frasson, C., & Kaltenbach, M. (2002). Toward more effective intelligent distance learning environments. Proceedings of the International Conference on Machine Learning and Applications (ICMLA '02), Las Vegas, Nevada.

Samaras, G., & Panayiotou, C. (2002). Personalized portals for the wireless user based on mobile agents. Proceedings of the International Workshop on Mobile Commerce, Atlanta, Georgia.

Shang, Y., Shi, H., & Chen, S. (2001). An intelligent distributed environment for active learning. Proceedings of the ACM International Conference on World Wide Web (WWW 2001), Hong Kong.

Terziyan, V., & Vitko, O. (2002). Intelligent information management in mobile electronic commerce. Artificial Intelligence News, Journal of Russian Association of Artificial Intelligence, 5.

KEY TERMS

…Communication Language defined by the Foundation for Intelligent Physical Agents (FIPA).

Adaptive System: A system adapting its behavior on the basis of the environment it is operating in.

Agent: A computational entity capable of both perceiving dynamic changes in the environment it is operating in and autonomously performing user-delegated tasks, possibly by communicating and cooperating with other similar entities.

Agent Ontology: A description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.

Device Profile: A model of a device storing information about both its costs and capabilities.

E-Service: A collection of network-resident software programs that collaborate for supporting users in both accessing and selecting data and services of their interest handled by a provider site. Examples of e-services are e-commerce, e-learning, and e-government applications.

eXtensible Markup Language (XML): The novel language, standardized by the World Wide Web Consortium, for representing, handling, and exchanging information on the Web.

Multi-Agent System (MAS): A loosely coupled network of software agents that interact to solve problems that are beyond the individual capacities or knowledge of each of them. An MAS distributes computational resources and capabilities across a network of interconnected agents. The agent cooperation is handled by means of an Agent Communication Language.

User Modeling: The process of gathering information specific to each user, either explicitly or implicitly. This information is exploited in order to customize the content and the structure of a service to the user's specific and individual needs.

User Profile: A model of a user representing both the user's preferences and behavior.
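As an illustration of the User Profile and Device Profile notions defined above, and of the article's remark that agent ontologies are stored as XML documents, a profile might be serialized and parsed as follows. All element and attribute names are invented for the sketch, not taken from ESA-Manager:

```python
# Illustrative sketch: a user profile serialized as an XML document, in the
# spirit of ESA-Manager's XML-stored agent ontologies. Every element and
# attribute name below is an invented example, not the system's actual schema.

import xml.etree.ElementTree as ET

profile = ET.Element("userProfile", id="u42")
interest = ET.SubElement(profile, "interest", category="e-learning")
# Visit time normalized w.r.t. device characteristics and navigation costs.
interest.set("normalizedVisitTime", "0.73")
device = ET.SubElement(profile, "device", kind="mobile-phone")
device.set("bandwidth", "low")

xml_text = ET.tostring(profile, encoding="unicode")
print(xml_text)

# A profile agent could later parse the document back and query the interests.
parsed = ET.fromstring(xml_text)
categories = [e.get("category") for e in parsed.findall("interest")]
assert categories == ["e-learning"]
```

Representing the profile this way keeps it device-independent on the user side, which is the property the article relies on for its unique-copy profile agent.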
Aggregate Query Rewriting in Multidimensional Databases
…contrasting features, although they share the same requirement of fast online response times. In particular, one of the key differences between OLTP and OLAP queries is the number of records required to calculate the answer. OLTP queries typically involve a rather limited number of records, accessed through primary keys or other specific indexes, which need to be processed for short, isolated transactions or to be issued on a user interface. In contrast, multidimensional queries usually require the classification and aggregation of a huge amount of data (Gupta, Harinarayan, & Quass, 1995), and fast response times are made possible by the extensive use of pre-computed queries, called materialized views (whose answers are stored permanently in the database), and by sophisticated techniques enabling the query engine to exploit these pre-computed results.

In general, query answering techniques are preferable in contexts where exact answers are unlikely to be obtained (e.g., integration of heterogeneous data sources, like Web sites) and response time requirements are not very stringent. However, as noted in Grahne & Mendelzon (1999), query answering methods can be extremely inefficient, as it is difficult or even impossible to process only the useful views and apply optimization techniques such as pushing selections and joins. As a consequence, the rewriting approach is more appropriate in contexts such as OLAP systems, where there is a very large amount of data and fast response times are required (Goldstein & Larson, 2001), and for query optimization, where different query plans need to be maintained in the main memory and efficiently compared (Afrati, Li, & Ullman, 2001).

Rewriting and Answering: An Example
…students, employed and retired, called V_ST, V_EMP and V_RET, respectively. For example, V_RET may be defined by:

CREATE VIEW V_RET AS
SELECT *
FROM Cens
WHERE Empl_status = 'retired'

It is evident that no rewriting can be obtained by using only the specified views, both because some individuals are not present in any of the views (e.g., young children, unemployed, housewives, etc.) and because some may be present in two views (a student may also be employed). However, a query answering technique tries to collect each useful accessible record and build the best possible answer, possibly by introducing approximations. By using the information on the census tract and a matching algorithm, most overlapping records may be determined and an estimate (lower bound) of the result obtained by summing the non-replicated contributions from the views. Obviously, this would require considerable computation time, but it might be able to produce an approximate answer in a situation where rewriting techniques would produce no answer at all.

Rewriting Aggregate Queries

A typical elementary multidimensional query is described by the join of the fact table with two or more dimension tables, to which an aggregate group-by query is applied (see the example query Q1 below). As a consequence, the rewriting of this form of query and view has been studied by many researchers.

SELECT D1.dim1, D2.dim2, AGG(F.measure)
FROM fact_table F, dim_table1 D1, dim_table2 D2
WHERE F.dimKey1 = D1.dimKey1
AND F.dimKey2 = D2.dimKey2
GROUP BY D1.dim1, D2.dim2 (Q1)

In Gupta, Harinarayan, & Quass (1995), an algorithm is proposed to rewrite conjunctive queries with aggregations using views of the same form. The technique is based on the concept of generalized projection (GP) and some transformation rules usable by an optimizer, which enable the query and views to be put in a particular normal form based on GPSJ (Generalized Projection/Selection/Join) expressions. The query and views are analyzed in terms of their query trees, that is, the trees representing how to calculate them by applying selections, joins and generalized projections on the base relations. By using the transformation rules, the algorithm tries to produce a match between one or more view trees and subtrees (and consequently to replace the calculations with access to the corresponding materialized views). The results are extended to NGPSJ (Nested GPSJ) expressions in Golfarelli & Rizzi (2000).

In Srivastava, Dar, Jagadish, & Levy (1996) an algorithm is proposed to rewrite a single-block (conjunctive) SQL query with GROUP BY and aggregations using various views of the same form. The aggregate functions considered are MIN, MAX, COUNT and SUM. The algorithm is based on the detection of homomorphisms from view to query, as in the non-aggregate context (Levy, Mendelzon, Sagiv, & Srivastava, 1995). However, it is shown that more restrictive conditions must be considered when dealing with aggregates, as the view has to produce not only the right tuples, but also their correct multiplicities.

In Cohen, Nutt, & Serebrenik (1999, 2000) a somewhat different approach is proposed: the original query, usable views and rewritten query are all expressed by an extension of Datalog with aggregate functions (again COUNT, SUM, MIN and MAX) as the query language. Queries and views are assumed to be conjunctive. Several candidates for rewritings of particular forms are considered, and for each candidate, the views in its body are unfolded (i.e., replaced by their body in the view definition). Finally, the unfolded candidate is compared with the original query to verify equivalence by using known equivalence criteria for aggregate queries, particularly those proposed in Nutt, Sagiv, & Shurin (1998) for COUNT, SUM, MIN and MAX queries. The technique can be extended by using the equivalence criteria for AVG queries presented in Grumbach, Rafanelli, & Tininini (1999), based on the syntactic notion of isomorphism modulo a product.

In query rewriting it is important to identify the views that may be actually useful in the rewriting process: this is often referred to as the view usability problem. In the non-aggregate context, it is shown (Levy, Mendelzon, Sagiv, & Srivastava, 1995) that a conjunctive view can be used to produce a conjunctive rewritten query if a homomorphism exists from the body of the view to that of the query. Grumbach, Rafanelli, & Tininini (1999) demonstrate that more restrictive (necessary and sufficient) conditions are needed for the usability of conjunctive count views for the rewriting of conjunctive count queries, based on the concept of sound homomorphisms. It is also shown that in the presence of aggregations, it is not sufficient to consider only rewritten queries of conjunctive form: more complex forms may be required, particularly those based on the concept of isomorphism modulo a product.

All rewriting algorithms proposed in the literature are based on trying to obtain a rewritten query with a particular form by using (possibly only) the available views.
interesting question is: Can I rewrite more by considering rewritten queries of more complex form? And the even more ambitious one: Given a collection of views, is the information they provide sufficient to rewrite a query? In Grumbach & Tininini (2003) the problem is investigated in a general framework based on the concept of query subsumption. Basically, the information content of a query is characterized by its distinguishing power, that is, by its ability to determine that two database instances are different. Hence a collection of views subsumes a query if it is able to distinguish any pair of instances also distinguishable by the query, and it is shown that a query rewriting using various views exists if the views subsume the query. In the particular case of count and sum queries defined over the same fact table, an algorithm is proposed which is demonstrated to be complete. In other words, even if the algorithm (as with any algorithm of practical use) considers rewritten queries of particular forms, it is shown that no improvement could be obtained by considering rewritten queries of more complex forms.

Finally, in Grumbach & Tininini (2000) a completely different approach to the problem of aggregate rewriting is proposed. The technique is based on the idea of formally expressing the relationships (metadata) between raw and aggregate data, and also among aggregate data of different types and/or levels of detail. Data is stored in standard relations, while the metadata are represented by numerical dependencies, namely Horn clauses formally expressing the semantics of the aggregate attributes. The mechanism is tested by transforming the numerical dependencies into Prolog rules and then exploiting the Prolog inference engine to produce the rewriting.

FUTURE TRENDS

Although query rewriting techniques are currently considered preferable to query answering in OLAP systems, the ever-increasing processing capabilities of modern computers may change the relevance of query answering techniques in the near future. Meanwhile, the limited applicability of several rewriting algorithms shows that a substantial effort is still needed, and important contributions may stem from results in other research areas like logic programming and automated reasoning. In particular, aggregate query rewriting is strictly related to the problem of query equivalence for aggregate queries, and current equivalence criteria only apply to rather simple forms of query; they don't consider, for example, the combination of conjunctive formulas with nested aggregations.

Also, the results on view usability and query subsumption can be considered only preliminary, and it would be interesting to study the property of completeness of known rewriting algorithms and to provide necessary and sufficient conditions for the usability of a view to rewrite a query, even when both the query and the view are aggregate and of non-trivial form (e.g., allowing disjunction and some limited form of negation).

CONCLUSION

This paper has discussed a fundamental issue related to multidimensional query evaluation, that is, how a multidimensional query expressed in a given language can be translated, using some available materialized views, into an (efficient) evaluation plan which retrieves the necessary information and calculates the required results. We have analyzed the difference between the query answering and query rewriting approaches and discussed the main techniques proposed in the literature to rewrite aggregate multidimensional queries using materialized views.

REFERENCES

Abiteboul, S., & Duschka, O.M. (1998). Complexity of answering queries using materialized views. In ACM Symposium on Principles of Database Systems (PODS98) (pp. 254-263).

Afrati, F.N., Li, C., & Ullman, J.D. (2001). Generating efficient plans for queries using views. In ACM International Conference on Management of Data (SIGMOD01) (pp. 319-330).

Agrawal, R., Gupta, A., & Sarawagi, S. (1997). Modeling multidimensional databases. In International Conference on Data Engineering (ICDE97) (pp. 232-243).

Cabibbo, L., & Torlone, R. (1997). Querying multidimensional databases. In International Workshop on Database Programming Languages (DBPL97) (pp. 319-335).

Calvanese, D., De Giacomo, G., Lenzerini, M., & Vardi, M.Y. (2000). What is view-based query rewriting? In International Workshop on Knowledge Representation meets Databases (KRDB00) (pp. 17-27).

Cohen, S., Nutt, W., & Serebrenik, A. (1999). Rewriting aggregate queries using views. In ACM Symposium on Principles of Database Systems (PODS99) (pp. 155-166).

Cohen, S., Nutt, W., & Serebrenik, A. (2000). Algorithms for rewriting aggregate queries using views. In ABDIS-DASFAA Conference 2000 (pp. 65-78).

Goldstein, J., & Larson, P. (2001). Optimizing queries using materialized views: A practical, scalable solution. In
ACM International Conference on Management of Data (SIGMOD01) (pp. 331-342).

Golfarelli, M., & Rizzi, S. (2000). Comparing nested GPSJ queries in multidimensional databases. In Workshop on Data Warehousing and OLAP (DOLAP 2000) (pp. 65-71).

Grahne, G., & Mendelzon, A.O. (1999). Tableau techniques for querying information sources through global schemas. In International Conference on Database Theory (ICDT99) (pp. 332-347).

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In International Conference on Data Engineering (ICDE96) (pp. 152-159).

Grumbach, S., Rafanelli, M., & Tininini, L. (1999). Querying aggregate data. In ACM Symposium on Principles of Database Systems (PODS99) (pp. 174-184).

Grumbach, S., & Tininini, L. (2000). Automatic aggregation using explicit metadata. In International Conference on Scientific and Statistical Database Management (SSDBM00) (pp. 85-94).

Grumbach, S., & Tininini, L. (2003). On the content of materialized aggregate views. Journal of Computer and System Sciences, 66(1), 133-168.

Gupta, A., Harinarayan, V., & Quass, D. (1995). Aggregate-query processing in data warehousing environments. In International Conference on Very Large Data Bases (VLDB95) (pp. 358-369).

Halevy, A.Y. (2001). Answering queries using views. VLDB Journal, 10(4), 270-294.

Jagadish, H.V., Lakshmanan, L.V.S., & Srivastava, D. (1999). What can hierarchies do for data warehouses? In International Conference on Very Large Data Bases (VLDB99) (pp. 530-541).

Lenzerini, M. (2002). Data integration: A theoretical perspective. In ACM Symposium on Principles of Database Systems (PODS02) (pp. 233-246).

Levy, A.Y., Mendelzon, A.O., Sagiv, Y., & Srivastava, D. (1995). Answering queries using views. In ACM Symposium on Principles of Database Systems (PODS95) (pp. 95-104).

Nutt, W., Sagiv, Y., & Shurin, S. (1998). Deciding equivalences among aggregate queries. In ACM Symposium on Principles of Database Systems (PODS98) (pp. 214-223).

Srivastava, D., Dar, S., Jagadish, H.V., & Levy, A.Y. (1996). Answering queries with aggregation using views. In International Conference on Very Large Data Bases (VLDB96) (pp. 318-329).

KEY TERMS

Data Cube: A collection of aggregate values classified according to several properties of interest (dimensions). Combinations of dimension values are used to identify the single aggregate values in the cube.

Dimension: A property of the data used to classify it and navigate the corresponding data cube. In multidimensional databases, dimensions are often organized into several hierarchical levels; for example, a time dimension may be organized into days, months and years.

Drill-Down (Roll-Up): A typical OLAP operation by which aggregate data are visualized at a finer (coarser) level of detail along one or more analysis dimensions.

Fact: A single elementary datum in an OLAP system, the properties of which correspond to dimensions and measures.

Fact Table: A table of (integrated) elementary data grouped and aggregated in the multidimensional querying process.

Materialized View: A particular form of query whose answer is stored in the database to accelerate the evaluation of further queries.

Measure: A numeric value obtained by applying an aggregate function (such as count, sum, min, max or average) to groups of data in a fact table.

Query Answering: The process by which the (possibly approximate) answer to a given query is obtained by exploiting the stored answers and definitions of a collection of materialized views.

Query Rewriting: The process by which a source query is transformed into an equivalent one referring (almost exclusively) to a collection of materialized views. In multidimensional databases, query rewriting is fundamental to achieving acceptable (online) response times.
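To make the count/sum discussion and the Query Rewriting term concrete, here is a minimal Python sketch (all table and view contents are invented for illustration): a yearly average query is rewritten against a materialized view that stores only monthly sum and count pre-aggregates. Since avg itself does not roll up from a coarser level, it is derived as sum/count, both of which do.

```python
from collections import defaultdict

# Hypothetical fact table: (day, month, year, amount) rows.
facts = [
    ("2004-01-03", "2004-01", 2004, 10.0),
    ("2004-01-17", "2004-01", 2004, 30.0),
    ("2004-02-05", "2004-02", 2004, 20.0),
    ("2005-03-09", "2005-03", 2005, 50.0),
]

# Materialized view: per-month (sum, count) pre-aggregates.
monthly = defaultdict(lambda: [0.0, 0])
for _, month, _, amount in facts:
    monthly[month][0] += amount
    monthly[month][1] += 1

def yearly_avg_from_view(year):
    """Rewritten query: the yearly average reads only the monthly view.
    avg is reconstructed as (rolled-up sum) / (rolled-up count)."""
    s = sum(v[0] for m, v in monthly.items() if m.startswith(str(year)))
    c = sum(v[1] for m, v in monthly.items() if m.startswith(str(year)))
    return s / c

print(yearly_avg_from_view(2004))  # -> 20.0, without touching the fact table
```

The design point mirrors the discussion above: a view holding both count and sum lets many derived aggregates (avg here) be rewritten, whereas a view storing only the averages would not.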
Foster Provost
New York University, USA
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Aggregation for Predictive Modeling with Relational Data
objects (an exception being count). It is therefore necessary to assume class-conditional independence and aggregate the attributes independently, which limits the expressive power of the model. Perlich & Provost (2003) discuss in detail the implications of various assumptions and aggregation choices on the expressive power of resulting classification models. For example, customers who buy mostly expensive books cannot be identified if price and type are aggregated separately. In contrast, ILP methods do not assume independence and can express an expensive book (TYPE=BOOK and PRICE>20); however, aggregation through existential unification can only capture whether a customer bought at least one expensive book, not whether he has bought primarily expensive books. Only two systems, POLKA (Knobbe et al., 2001) and RELAGGS (Krogel & Wrobel, 2001), combine Boolean conditions and numeric aggregates to increase the expressive power of the model.

Another challenge is posed by categorical attributes with many possible values, such as the ISBN numbers of books. Categorical attributes are commonly aggregated using the mode (the most common value) or, if the number of different values is small, the counts for all values. These approaches would be ineffective for ISBN: it has many possible values, and the mode is not meaningful since customers usually buy only one copy of each book. Many relational domains include categorical attributes of this type. One common class of such domains involves networked data, where most of the information is captured by the relationships between objects, that is, networked data with identifier attributes. As Knobbe et al. (1999) point out, traditional aggregation operators like min, max, and count are based on histograms. A histogram itself is a crude approximation of the underlying distribution. Rather than estimating one distribution for every bag of attributes, as done by traditional aggregation operators, this new aggregation approach estimates in a first step only one distribution for each class, by combining all bags of objects for the same class. The combination of bags of related objects results in much better estimates of the distribution, since it uses many more observations. The number of parameters differs across distributions: for a normal distribution only two parameters are required, mean and variance, whereas distributions of categorical attributes have as many parameters as possible attribute values. In a second step, the bags of attributes of related objects are aggregated through vector distances (e.g., Euclidean, Cosine, Likelihood) between a normalized vector representation of the bag and the two class-conditional distributions.

Imagine the following example of a document classification domain with two tables (Document and Author), shown in Figure 1. The first aggregation step estimates the class-conditional distributions DClass n of authors from the Author table. Under the alphabetical ordering of position:value pairs, 1:A, 2:B, and 3:C, the value for DClass n at position k is defined as:

DClass n[k] = (number of occurrences of author k in the set of authors related to documents of class n) / (number of authors related to documents of class n)
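This first aggregation step can be sketched in Python. The toy Author table below is a hypothetical illustration (it is not the data of Figure 1); it merely exercises the formula above with authors A, B, and C:

```python
from collections import Counter

# Hypothetical Author table: (document, author) pairs, plus class labels.
authors = [("P1", "A"), ("P1", "C"), ("P4", "A"), ("P4", "C"),
           ("P2", "A"), ("P2", "B"), ("P3", "B"), ("P3", "C")]
doc_class = {"P1": 0, "P4": 0, "P2": 1, "P3": 1}
alphabet = ["A", "B", "C"]  # position:value ordering 1:A, 2:B, 3:C

def class_distribution(n):
    """DClass n: for each author k, the number of occurrences of k among
    the authors related to class-n documents, divided by the total number
    of authors related to class-n documents (the formula above)."""
    bag = [a for doc, a in authors if doc_class[doc] == n]
    counts = Counter(bag)
    return [counts[a] / len(bag) for a in alphabet]

d_class0 = class_distribution(0)  # -> [0.5, 0.0, 0.5]
d_class1 = class_distribution(1)  # -> [0.25, 0.5, 0.25]
```

Note how each class distribution is estimated from the combined bags of all documents of that class, so it rests on more observations than any single document's bag would provide.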
Figure 2. Extended Document table with new cosine features added

Document Table
Paper ID | Class | Cosine(Pn, DClass 1) | Cosine(Pn, DClass 0)
P1 | 0 | 0.667 | 0.707
P2 | 1 | 0.707 | 0.5
P3 | 1 | 0.962 | 0.816
P4 | 0 | 0.333 | 0.707

By taking advantage of DClass 1 and DClass 0, another new aggregation approach becomes possible. Rather than constructing counts for all distinct values (impossible for high-dimensional categorical attributes), one can select a small subset of values where the absolute difference between the entries in DClass 0 and DClass 1 is maximal. This method would identify author B as the most discriminative.

These new features, constructed from class-conditional distributions, show superior classification performance on a variety of relational domains (Perlich & Provost, 2003, 2004). Table 1 summarizes the relative out-of-sample performances (averaged over 10 experiments, with standard deviations in parentheses) as presented in Perlich (2003) on the CORA document classification task (McCallum et al., 2000) for 400 training examples. The data set includes information about the authorship, the citations, and the full text. This example also demonstrates the opportunities arising from the ability of relational models to take advantage of additional background information, such as citations and authorship, over simple text classification. The comparison includes, in addition to two distribution-based feature construction approaches (1 and 2) using logistic regression for model induction: 3) a Naïve Bayes classifier using the full text, learned by the Rainbow (McCallum, 1996) system; 4) a Probabilistic Relational Model (Koller & Pfeffer, 1998) using traditional aggregates on both text and citation/authorship, with the results reported by Taskar et al. (2001); and 5) a Simple Relational Classifier (Macskassy & Provost, 2003) that uses only the known class labels of related (e.g., cited) documents. It is important to observe that traditional aggregation operators, such as the mode, are not applicable for high-dimensional categorical fields (author names and document identifiers).

The generalization performance of the new aggregation approach is related to a number of properties that are of particular relevance and advantage for predictive modeling:

Dimensionality Reduction: The use of distances compresses the high-dimensional space of possible categorical values into a small set of dimensions, one for each class and distance metric. In particular, this allows the aggregation of object identifiers.

Preservation of Discriminative Information: Changing the class labels of the target objects will change the values of the aggregates. The loss of discriminative information is lower since the class-conditional distributions capture significant differences.

Domain Independence: The density estimation does not require any prior knowledge about the application domain and therefore is suitable for a variety of domains.

Applicability to Numeric Attributes: The approach is not limited to categorical values but can also be applied to numeric attributes after discretization. Note that traditional aggregation through mean and variance implicitly assumes a normal distribution, whereas this aggregation makes no prior distributional assumptions and can capture arbitrary numeric distributions.

Monotonic Relationship: The use of distances to class-conditional densities constructs numerical features that are monotonic in the probability of class membership. This makes logistic regression a natural choice for the model induction step.

Aggregation of Identifiers: By using object identifiers such as names, the approach can overcome some of the limitations of the independence assumptions and even allows learning from unobserved object properties (Perlich & Provost, 2004). The identifier represents the full information of the object and in
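The second aggregation step (the cosine features of Figure 2) and the discriminative-value selection just described can be sketched as follows. The two class distributions and the document's bag vector are invented toy values in the 1:A, 2:B, 3:C ordering, not the numbers of the figure:

```python
import math

# Toy class-conditional author distributions (positions A, B, C).
d_class0 = [0.5, 0.0, 0.5]
d_class1 = [0.25, 0.5, 0.25]

def cosine(u, v):
    """Cosine similarity between a document's bag-of-authors vector and
    a class-conditional distribution (one new feature per class)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

doc_vec = [1.0, 1.0, 0.0]  # a document written by authors A and B
features = (cosine(doc_vec, d_class1), cosine(doc_vec, d_class0))

# Discriminative-value selection: keep the position where the two class
# distributions differ most; with these toy values that is position 1,
# i.e., author B.
diffs = [abs(a - b) for a, b in zip(d_class0, d_class1)]
most_discriminative = diffs.index(max(diffs))
```

The pair of cosine features then extends the Document table exactly as in Figure 2, and the most discriminative positions can additionally be used as individual count features.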
particular the joint distribution of all other attributes, and even further unknown properties.

Task-Specific Feature Construction: The advantages outlined above are possible through the use of the target value during feature construction. This practice requires splitting the training set into two separate portions for 1) the class-conditional density estimation and feature construction, and 2) the estimation of the classification model.

To summarize, most relational modeling has limited itself to a small set of existing aggregation operators. The recognition of their limited expressive power motivated the combination of Boolean conditioning and aggregation, and the development of new aggregation methodologies that are specifically designed for predictive relational modeling.

FUTURE TRENDS

Computer-based analysis of relational data is becoming increasingly necessary as the size and complexity of databases grow. Many important tasks, including counter-terrorism (Tang et al., 2003), social and economic network analysis (Jensen & Neville, 2002), document classification (Perlich, 2003), customer relationship management, personalization, fraud detection (Fawcett & Provost, 1997), and genetics [e.g., see the overview by Džeroski (2001)], used to be approached with special-purpose algorithms, but now are recognized as inherently relational. These application domains both profit from and contribute to research in relational modeling in general and aggregation for feature construction in particular.

In order to accommodate such a variety of domains, new aggregators must be developed. In particular, it is necessary to account for domain-specific dependencies between attributes and entities that currently are ignored. One common type of such dependency is the temporal order of events, which is important for the discovery of causal relationships.

Aggregation as a research topic poses the opportunity for significant theoretical contributions. There is little theoretical work on relational model estimation outside of first-order logic. In contrast to a large body of work in mathematics on the estimation of functional dependencies that map well-defined input spaces to output spaces, aggregation operators have not been investigated nearly as thoroughly. Model estimation tasks are usually framed as search over a structured (either in terms of parameters or of increasing complexity) space of possible solutions, but the structuring of a search space of aggregation operators remains an open question.

The potential complexity of relational models and the resulting computational complexity of relational modeling remain an obstacle to real-time applications. This limitation has spawned work on efficiency improvements (Yin et al., 2003; Tang et al., 2003) and will remain an important task.

CONCLUSION

Relational modeling is a burgeoning topic within machine learning research, and is commonly applicable in real-world domains. Many domains collect large amounts of transaction and interaction data, but so far lack a reliable and automated mechanism for model estimation to support decision-making. Relational modeling with appropriate aggregation methods has the potential to fill this gap and allow the seamless integration of model estimation on top of existing relational databases, relieving the analyst of the manual, time-consuming, and omission-prone task of feature construction.

REFERENCES

Džeroski, S. (2001). Relational data mining applications: An overview. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 339-364). Berlin: Springer Verlag.

Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, (1).

Jensen, D., & Neville, J. (2002). Data mining in social networks. In R. Breiger, K. Carley, & P. Pattison (Eds.), Dynamic social networks modeling and analysis (pp. 287-302). The National Academies Press.

Kirsten, M., Wrobel, S., & Horváth, T. (2001). Distance based approaches to relational learning and clustering. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 213-234). Berlin: Springer Verlag.

Knobbe, A.J., de Haas, M., & Siebes, A. (2001). Propositionalisation and aggregates. In L. De Raedt & A. Siebes (Eds.), Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery (LNAI 2168) (pp. 277-288). Berlin: Springer Verlag.

Koller, D., & Pfeffer, A. (1998). Probabilistic frame-based systems. In Proceedings of the Fifteenth/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence (pp. 580-587). American Association for Artificial Intelligence.
Kramer, S., Lavrač, N., & Flach, P. (2001). Propositionalization approaches to relational data mining. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 262-291). Berlin: Springer Verlag.

Krogel, M.A., Rawles, S., Zelezny, F., Flach, P.A., Lavrač, N., & Wrobel, S. (2003). Comparative evaluation of approaches to propositionalization. In T. Horváth & A. Yamamoto (Eds.), Proceedings of the 13th International Conference on Inductive Logic Programming (LNAI 2835) (pp. 197-214). Berlin: Springer Verlag.

Krogel, M.A., & Wrobel, S. (2001). Transformation-based learning using multirelational aggregation. In C. Rouveirol & M. Sebag (Eds.), Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP) (LNAI 2157) (pp. 142-155). Berlin: Springer Verlag.

Krogel, M.A., & Wrobel, S. (2003). Facets of aggregation approaches to propositionalization. In T. Horváth & A. Yamamoto (Eds.), Proceedings of the Work-in-Progress Track at the 13th International Conference on Inductive Logic Programming (pp. 30-39).

Macskassy, S.A., & Provost, F. (2003). A simple relational classifier. In Proceedings of the Workshop on Multi-Relational Data Mining at SIGKDD-2003.

McCallum, A.K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Retrieved from http://www.cs.cmu.edu/~mccallum/bow

McCallum, A.K., Nigam, K., Rennie, J., & Seymore, K. (2000). Automating the construction of Internet portals with machine learning. Information Retrieval, 3(2), 127-163.

Muggleton, S. (Ed.). (1992). Inductive logic programming. London: Academic Press.

Neville, J., Jensen, D., & Gallagher, B. (2003). Simple estimators for relational Bayesian classifiers. In Proceedings of the Third IEEE International Conference on Data Mining (pp. 609-612).

Perlich, C. (2003). Citation-based document classification. In Proceedings of the Workshop on Information Technology and Systems (WITS).

Perlich, C., & Provost, F. (2003). Aggregation-based feature invention and relational concept classes. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Perlich, C., & Provost, F. (2004). ACORA: Distribution-based aggregation for relational learning from identifier attributes. Working Paper CeDER-04-04, Stern School of Business.

Popescul, L., Ungar, H., Lawrence, S., & Pennock, D.M. (2002). Structural logistic regression: Combining relational and statistical learning. In Proceedings of the Workshop on Multi-Relational Data Mining.

Tang, L.R., Mooney, R.J., & Melville, P. (2003). Scaling up ILP to large examples: Results on link discovery for counter-terrorism. In Proceedings of the Workshop on Multi-Relational Data Mining (pp. 107-121).

Taskar, B., Segal, E., & Koller, D. (2001). Probabilistic classification and clustering in relational data. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (pp. 870-878).

Yin, X., Han, J., & Yang, J. (2003). Efficient multi-relational classification by tuple ID propagation. In Proceedings of the Workshop on Multi-Relational Data Mining.

KEY TERMS

Aggregation: Also commonly called a summary, an aggregation is the calculation of a value from a bag or (multi)set of entities. Typical aggregations are sum, count, and average.

Discretization: Conversion of a numeric variable into a categorical variable, usually through binning. The entire range of the numeric values is split into a number of bins, and the numeric value of the attribute is replaced by the identifier of the bin into which it falls.

Class-Conditional Independence: Property of a multivariate distribution with a categorical class variable c and a set of other variables (e.g., x and y). The probability of observing a combination of variable values given the class label is equal to the product of the probabilities of each variable value given the class: P(x,y|c) = P(x|c)*P(y|c).

Inductive Logic Programming: A field of research at the intersection of logic programming and inductive machine learning, drawing ideas and methods from both disciplines. The objective of ILP methods is the inductive construction of first-order Horn clauses from a set of examples and background knowledge in relational form.

Propositionalization: The process of transforming a multi-relational dataset, containing structured examples, into a propositional data set (one table) with derived attribute-value features describing the structural properties of the example.
Relational Data: Data where the original information cannot be represented in a single table but requires two or more tables in a relational database. Every table can either capture the characteristics of entities of a particular type (e.g., person or product) or relationships between entities (e.g., person bought product).

Relational Learning: Learning in relational domains that include information from multiple tables, not based on manual feature construction.

Target Objects: Objects in a particular target table for which a prediction is to be made. Other objects reside in additional background tables, but are not the focus of the prediction task.
API Standardization Efforts for Data Mining
2. Test the quality of a mining model by applying testing data.
3. Apply a data mining model to new data.
4. Browse a data mining model for reporting and visualization applications.

The APIs support several commonly accepted and widely used techniques, both for predictive and descriptive data mining (see Table 1). Not all techniques need all of the tasks listed above. For example, association rule mining does not require testing and application to new data, whereas classification does.

The goals of the APIs are very similar, but the approach of each of them is different. OLE DB for DM is a language-based interface, SQL/MM DM is based on user-defined data types in SQL:1999, and JDM contains packages of data-mining-oriented Java interfaces and classes. In the next section, each of the APIs is briefly characterized. An example showing their application in prediction is presented in another article in this encyclopedia.

OLE DB is an object-oriented specification for a set of data access interfaces designed for record-oriented data stores. It employs SQL commands as arguments of interface operations. The approach in defining OLE DB for DM was not to extend the OLE DB interfaces but to expose data mining interfaces in a language-based API.

OLE DB for DM treats a data mining model as if it were a special type of table: (a) Input data in the form of a set of cases is associated with a data mining model and additional meta-information while defining the data mining model. (b) When input data is inserted into the data mining model (it is "populated"), a mining algorithm builds an abstraction of the data and stores it into this special table. For example, if the data model represents a decision tree, the table contains a row for each leaf node of the tree (Netz et al., 2001). Once the data mining model is populated, it can be used for prediction, or it can be browsed for visualization.

OLE DB for DM extends the syntax of several SQL statements for defining, populating, and using a data mining model (see Figure 1).

[Figure 1: a CREATE statement specifying the mining algorithm and its algorithm settings.]
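The statement shapes that Figure 1 alludes to look roughly as follows. This is a hedged sketch only: the model name, columns, algorithm name, and the source-data clauses are invented placeholders, and just the overall CREATE/INSERT/PREDICTION JOIN pattern follows the description above.

```sql
-- Define a mining model as a special type of table (hypothetical schema).
CREATE MINING MODEL CreditRisk (
    CustomerId  LONG KEY,
    Age         LONG CONTINUOUS,
    Risk        TEXT DISCRETE PREDICT   -- the attribute to be predicted
) USING Microsoft_Decision_Trees       -- mining algorithm and its settings

-- Populate the model: inserting training cases triggers the algorithm,
-- which stores an abstraction of the data (e.g., one row per leaf node).
INSERT INTO CreditRisk (CustomerId, Age, Risk)
    <source data query>

-- Apply the populated model to new cases.
SELECT t.CustomerId, CreditRisk.Risk
FROM CreditRisk PREDICTION JOIN <new cases> AS t
    ON CreditRisk.Age = t.Age
```

The defining design choice, as the text notes, is that the model behaves like a table, so existing SQL tooling for defining, inserting into, and selecting from tables carries over to mining models.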
[Figure: SQL/MM DM user-defined types. Training data (DM_MiningData) together with a DM_LogicalDataSpec and DM_XXSettings feed a DM_XXBldTask that builds a DM_XXModel; applying the model to testing data (DM_MiningData) yields a DM_XXTestResult, and applying it to application data (DM_ApplicationData) yields a DM_XXResult.]
mining, standards are not only intended to unify existing products with well-known functionality, but also to (partially) design the functionality such that future products match real-world requirements. A simple example of using the APIs in prediction is presented in another article of this book (Zendulka, 2005).

REFERENCES

Common Warehouse Metamodel Specification: Data Mining. Version 1.0. (2001). Retrieved from http://www.omg.org/docs/ad/01-02-01.pdf

Cross Industry Standard Process for Data Mining (CRISP-DM). Version 1.0. (2000). Retrieved from http://www.crisp-dm.org/

Grossman, R.L., Hornick, M.F., & Meyer, G. (2002). Data mining standards initiatives. Communications of the ACM, 45(8), 59-61.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.

Hornick, M. et al. (2004). Java Specification Request 73: Java Data Mining (JDM). Version 0.96. Retrieved from http://jcp.org/aboutJava/communityprocess/first/jsr073/

Melton, J., & Eisenberg, A. (2001). SQL Multimedia and Application Packages (SQL/MM). SIGMOD Record, 30(4), 97-102.

Melton, J., & Simon, A. (2001). SQL:1999. Understanding relational language components. Morgan Kaufmann Publishers.

Microsoft Corporation. (2000). OLE DB for Data Mining Specification Version 1.0.

Netz, A. et al. (2001, April). Integrating data mining with SQL databases: OLE DB for data mining. In Proceedings of the 17th International Conference on Data Engineering (ICDE 01) (pp. 379-387). Heidelberg, Germany.

Oracle9i Data Mining. Concepts. Release 9.2.0.2. (2002). Viewable CD Release 2 (9.2.0.2.0).

PMML Version 2.1. (2003). Retrieved from http://www.dmg.org/pmml-v2-1.html

Saarenvirta, G. (2001, Summer). Operation data mining. DB2 Magazine, 6(2). International Business Machines Corporation. Retrieved from http://www.db2mag.com/db_area/archives/2001/q2/saarenvirta.shtml

SAS Enterprise Miner to support PMML. (September 17, 2002). Retrieved from http://www.sas.com/news/preleases/091702/news1.html

Schwenkreis, F. (2001). Data mining: Technology driven by standards? Retrieved from http://www.research.microsoft.com/~jamesrh/hpts2001/submissions/FriedemannSchwenkreis.htm

SQL Multimedia and Application Packages. Part 6: Data Mining. ISO/IEC 13249-6. (2002).

Zendulka, J. (2005). Using standard APIs for data mining in prediction. In J. Wang (Ed.), Encyclopedia of data warehousing and mining. Hershey, PA: Idea Group Reference.

KEY TERMS

API: An application programming interface (API) is a description of the way one piece of software asks another program to perform a service. A standard API for data mining enables different data mining algorithms from various vendors to be easily plugged into application programs.

Data Mining Model: A high-level global description of a given set of data, which is the result of applying a data mining technique over the set of data. It can be descriptive or predictive.

DMG: The Data Mining Group (DMG) is a consortium of data mining vendors for developing data mining standards. It has developed the Predictive Model Markup Language (PMML).

JDM: Java Data Mining (JDM) is an emerging standard API for the programming language Java. It is an object-oriented interface that specifies a set of Java classes and interfaces supporting data mining operations for building, testing, and applying a data mining model.

OLE DB for DM: OLE DB for Data Mining (OLE DB for DM) is Microsoft's language-based standard API that introduces several SQL-like statements supporting data mining operations for building, testing, and applying a data mining model.

PMML: The Predictive Model Markup Language (PMML) is an XML-based language which provides a quick and easy way for applications to produce data mining models in a vendor-independent format and to share them between compliant applications.
SQL:1999: Structured Query Language (SQL):1999. The version of the standard database language SQL adopted in 1999, which introduced object-oriented features.

SQL/MM DM: SQL Multimedia and Application Packages, Part 6: Data Mining (SQL/MM DM) is an international standard the purpose of which is to define data mining user-defined types and associated routines for building, testing, and applying data mining models. It is based on the structured user-defined types of SQL:1999.
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
The Application of Data Mining to Recommender Systems
Both nearest-neighbor and correlation-based recommenders provide a high level of personalization in their recommendations, and most early systems using these techniques showed promising accuracy rates. As such, CF-based systems have continued to be popular in recommender applications and have provided the benchmarks against which more recent applications have been compared.

DATA MINING IN RECOMMENDER APPLICATIONS

The term data mining refers to a broad spectrum of mathematical modeling techniques and software tools that are used to find patterns in data and use these to build models. In the context of recommender applications, the term data mining is used to describe the collection of analysis techniques used to infer recommendation rules or build recommendation models from large data sets. Recommender systems that incorporate data mining techniques make their recommendations using knowledge learned from the actions and attributes of users. These systems are often based on the development of user profiles that can be persistent (based on demographic or item consumption history data), ephemeral (based on the actions during the current session), or both. These algorithms include clustering, classification techniques, the generation of association rules, and the production of similarity graphs through techniques such as Horting.

Clustering techniques work by identifying groups of consumers who appear to have similar preferences. Once the clusters are created, predictions for an individual can be made by averaging the opinions of the other consumers in her cluster. Some clustering techniques represent each user with partial participation in several clusters. The prediction is then an average across the clusters, weighted by degree of participation. Clustering techniques usually produce less-personal recommendations than other methods, and in some cases, the clusters have worse accuracy than CF-based algorithms (Breese, Heckerman, & Kadie, 1998). Once the clustering is complete, however, performance can be very good, since the size of the group that must be analyzed is much smaller. Clustering techniques can also be applied as a first step for shrinking the candidate set in a CF-based algorithm or for distributing neighbor computations across several recommender engines. While dividing the population into clusters may hurt the accuracy of recommendations to users near the fringes of their assigned cluster, pre-clustering may be a worthwhile trade-off between accuracy and throughput.

Classifiers are general computational models for assigning a category to an input. The inputs may be vectors of features for the items being classified or data about relationships among the items. The category is a domain-specific classification such as malignant/benign for tumor classification, approve/reject for credit requests, or intruder/authorized for security checks. One way to build a recommender system using a classifier is to use information about a product and a customer as the input, and to have the output category represent how strongly to recommend the product to the customer. Classifiers may be implemented using many different machine-learning strategies including rule induction, neural networks, and Bayesian networks. In each case, the classifier is trained using a training set in which ground-truth classifications are available. It can then be applied to classify new items for which the ground truths are not available. If subsequent ground truths become available, the classifier may be retrained over time.

For example, Bayesian networks create a model based on a training set with a decision tree at each node and edges representing user information. The model can be built off-line over a matter of hours or days. The resulting model is very small, very fast, and essentially as accurate as CF methods (Breese, Heckerman, & Kadie, 1998). Bayesian networks may prove practical for environments in which knowledge of consumer preferences changes slowly with respect to the time needed to build the model, but are not suitable for environments in which consumer preference models must be updated rapidly or frequently.

Classifiers have been quite successful in a variety of domains ranging from the identification of fraud and credit risks in financial transactions to medical diagnosis to intrusion detection. Good et al. (1999) implemented induction-learned feature-vector classification of movies and compared the classification with CF recommendations; this study found that the classifiers did not perform as well as CF, but that combining the two added value over CF alone.

One of the best-known examples of data mining in recommender systems is the discovery of association rules, or item-to-item correlations (Sarwar et al., 2001). These techniques identify items frequently found in association with items in which a user has expressed interest. Association may be based on co-purchase data, preference by common users, or other measures. In its simplest implementation, item-to-item correlation can be used to identify matching items for a single item, such as other clothing items that are commonly purchased with a pair of pants. More powerful systems match an entire set of items, such as those in a customer's shopping cart, to identify appropriate items to recommend. These rules can also help a merchandiser arrange products so that, for example, a consumer purchasing a child's handheld video game sees batteries nearby. More sophisticated temporal data mining may suggest that a consumer who buys the
video game today is likely to buy a pair of earplugs in the next month.

Item-to-item correlation recommender applications usually use current interest rather than long-term customer history, which makes them particularly well suited for ephemeral needs such as recommending gifts or locating documents on a topic of short-lived interest. A user merely needs to identify one or more starter items to elicit recommendations tailored to the present rather than the past.

Association rules have been used for many years in merchandising, both to analyze patterns of preference across products and to recommend products to consumers based on other products they have selected. An association rule expresses the relationship that one product is often purchased along with other products. The number of possible association rules grows exponentially with the number of products in a rule, but constraints on confidence and support, combined with algorithms that build association rules with itemsets of n items from rules with n-1 item itemsets, reduce the effective search space. Association rules can form a very compact representation of preference data that may improve the efficiency of storage as well as performance. They are more commonly used for larger populations than for individual consumers, and they, like other learning methods that first build and then apply models, are less suitable for applications where knowledge of preferences changes rapidly. Association rules have been particularly successful in broad applications such as shelf layout in retail stores. By contrast, recommender systems based on CF techniques are easier to implement for personal recommendation in a domain where consumer opinions are frequently added, such as online retail.

In addition to their use in commerce, association rules have become powerful tools in recommendation applications in the domain of knowledge management. Such systems attempt to predict which Web page or document will be most useful to a user. As Géry (2003) writes, "The problem of finding Web pages visited together is similar to finding associations among itemsets in transaction databases. Once transactions have been identified, each of them could represent a basket, and each web resource an item." Systems built on this approach have been demonstrated to produce both high accuracy and precision in the coverage of documents recommended (Geyer-Schulz et al., 2002).

Horting is a graph-based technique in which nodes are users, and edges between nodes indicate the degree of similarity between two users (Wolf et al., 1999). Predictions are produced by walking the graph to nearby nodes and combining the opinions of the nearby users. Horting differs from collaborative filtering in that the graph may be walked through other consumers who have not rated the product in question, thus exploring transitive relationships that traditional CF algorithms do not consider. In one study using synthetic data, Horting produced better predictions than a CF-based algorithm (Wolf et al., 1999).

FUTURE TRENDS

As data mining algorithms have been tested and validated in their application to recommender systems, a variety of promising applications have evolved. In this section we will consider three of these applications: meta-recommenders, social data mining systems, and temporal systems that recommend when rather than what.

Meta-recommenders are systems that allow users to personalize the merging of recommendations from a variety of recommendation sources employing any number of recommendation techniques. In doing so, these systems let users take advantage of the strengths of each different recommendation method. The SmartPad supermarket product recommender system (Lawrence et al., 2001) suggests new or previously unpurchased products to shoppers creating shopping lists on a personal digital assistant (PDA). The SmartPad system considers a consumer's purchases across a store's product taxonomy. Recommendations of product subclasses are based upon a combination of class and subclass associations drawn from information filtering and co-purchase rules drawn from data mining. Product rankings within a product subclass are based upon the product's sales rankings within the user's consumer cluster, a less personalized variation of collaborative filtering. MetaLens (Schafer et al., 2002) allows users to blend content requirements with personality profiles in order to determine which movie they should see. It does so by merging more persistent and personalized recommendations with ephemeral content needs such as the lack of offensive content or the need to be home by a certain time. More importantly, it allows the user to customize the process by weighting the importance of each individual recommendation.

While a traditional CF-based recommender typically requires users to provide explicit feedback, a social data mining system attempts to mine the social activity records of a community of users to implicitly extract the importance of individuals and documents. Such activity may include Usenet messages, system usage history, citations, or hyperlinks. TopicShop (Amento et al., 2003) is an information workspace which allows groups of common Web sites to be explored, organized into user-defined collections, manipulated to extract and order common features, and annotated by one or more users. These actions on their own may not be of large interest, but the collection of these actions can be mined by TopicShop and redistributed to other users to suggest sites of
general and personal interest. Agrawal et al. (2003) explored the threads of newsgroups to identify the relationships between community members. Interestingly, they concluded that, due to the nature of newsgroup postings (users are more likely to respond to those with whom they disagree), links between users are more likely to suggest that users should be placed in differing partitions rather than the same partition. Although this technique has not been directly applied to the construction of recommendations, such an application seems a logical field of future study.

Although traditional recommenders suggest what item a user should consume, they have tended to ignore changes over time. Temporal recommenders apply data mining techniques to suggest when a recommendation should be made or when a user should consume an item. Adomavicius and Tuzhilin (2001) suggest the construction of a recommendation warehouse, which stores ratings in a hypercube. This multidimensional structure can store data not only on the traditional user and item axes, but also for additional profile dimensions such as time. Through this approach, queries can be expanded from the traditional "what items should we suggest to user X" to "at what times would user X be most receptive to recommendations for product Y." Hamlet (Etzioni et al., 2003) is designed to minimize the purchase price of airplane tickets. Hamlet combines the results from time series analysis, Q-learning, and the Ripper algorithm to create a multi-strategy data-mining algorithm. By watching for trends in airline pricing and suggesting when a ticket should be purchased, Hamlet was able to save the average user 23.8% when savings were possible.

CONCLUSION

Recommender systems have emerged as powerful tools for helping users find and evaluate items of interest. These systems use a variety of techniques to help users identify the items that best fit their tastes or needs. While popular CF-based algorithms continue to produce meaningful, personalized results in a variety of domains, data mining techniques are increasingly being used both in hybrid systems, to improve recommendations in previously successful applications, and in stand-alone recommenders, to produce accurate recommendations in previously challenging domains. The use of data mining algorithms has also changed the types of recommendations as applications move from recommending what to consume to also recommending when to consume. While recommender systems may have started as largely a passing novelty, it is clear that they have become a real and powerful tool in a variety of applications, and that data mining algorithms can be and will continue to be an important part of the recommendation process.

REFERENCES

Adomavicius, G., & Tuzhilin, A. (2001). Extending recommender systems: A multidimensional approach. IJCAI-01 Workshop on Intelligent Techniques for Web Personalization (ITWP2001), Seattle, Washington.

Agrawal, R., Rajagopalan, S., Srikant, R., & Xu, Y. (2003). Mining newsgroups using networks arising from social behavior. In Proceedings of the Twelfth World Wide Web Conference (WWW12) (pp. 529-535), Budapest, Hungary.

Amento, B., Terveen, L., Hill, W., Hix, D., & Schulman, R. (2003). Experiments in social data mining: The TopicShop system. ACM Transactions on Computer-Human Interaction, 10(1), 54-85.

Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98) (pp. 43-52), Madison, Wisconsin.

Etzioni, O., Knoblock, C.A., Tuchinda, R., & Yates, A. (2003). To buy or not to buy: Mining airfare data to minimize ticket purchase price. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 119-128), Washington, D.C.

Géry, M., & Haddad, H. (2003). Evaluation of Web usage mining approaches for users' next request prediction. In Fifth International Workshop on Web Information and Data Management (pp. 74-81), Madison, Wisconsin.

Geyer-Schulz, A., & Hahsler, M. (2002). Evaluation of recommender algorithms for an Internet information broker based on simple association rules and on the repeat-buying theory. In Fourth WEBKDD Workshop: Web Mining for Usage Patterns & User Profiles (pp. 100-114), Edmonton, Alberta, Canada.

Good, N., et al. (1999). Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99) (pp. 439-446), Orlando, Florida.

Herlocker, J., Konstan, J.A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proceedings of the 1999 Conference on Research and Development in Information Retrieval (pp. 230-237), Berkeley, California.
Resnick, P., & Varian, H.R. (1997). Communications of the Association of Computing Machinery, special issue on recommender systems, 40(3), 56-89.

Sarwar, B., Karypis, G., Konstan, J.A., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. In Proceedings of the Tenth International Conference on World Wide Web (pp. 285-295), Hong Kong.

Schafer, J.B., Konstan, J.A., & Riedl, J. (2001). E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1/2), 115-153.

Schafer, J.B., Konstan, J.A., & Riedl, J. (2002). Meta-recommendation systems: User-controlled integration of diverse recommendations. In Proceedings of the Eleventh Conference on Information and Knowledge Management (CIKM-02) (pp. 196-203), McLean, Virginia.

Shoemaker, C., & Ruiz, C. (2003). Association rule mining algorithms for set-valued data. Lecture Notes in Computer Science, 2690, 669-676.

Wolf, J., Aggarwal, C., Wu, K-L., & Yu, P. (1999). Horting hatches an egg: A new graph-theoretic approach to collaborative filtering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 201-212), San Diego, CA.

KEY TERMS

Collaborative Filtering: Selecting content based on the preferences of people with similar interests.

Meta-Recommenders: Systems that provide users with personalized control over the generation of a single recommendation list formed from the combination of rich recommendation data from multiple information sources and recommendation techniques.

Nearest-Neighbor Algorithm: A recommendation algorithm that calculates the distance between users based on the degree of correlation between scores in the users' preference histories. Predictions of how much a user will like an item are computed by taking the weighted average of the opinions of a set of nearest neighbors for that item.

Recommender Systems: Any system that provides a recommendation, prediction, opinion, or user-configured list of items that assists the user in evaluating items.

Social Data Mining: Analysis and redistribution of information from records of social activity such as newsgroup postings, hyperlinks, or system usage history.

Temporal Recommenders: Recommenders that incorporate time into the recommendation process. Time can be either an input to the recommendation function or the output of the function.
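The weighted-average prediction described in the Nearest-Neighbor Algorithm entry can be sketched in a few lines. The data layout and function name below are illustrative only, not taken from any system cited in this article; similarities here are assumed to have been computed already (e.g., by correlation over shared ratings).

```python
def nn_predict(neighbors, item):
    """Predict a user's rating of an item as the weighted average of
    the opinions of a set of nearest neighbors, where each neighbor is
    a (similarity, ratings) pair and the similarity acts as the weight."""
    numerator = 0.0
    total_weight = 0.0
    for similarity, ratings in neighbors:
        if item in ratings:                 # only neighbors who rated the item count
            numerator += similarity * ratings[item]
            total_weight += abs(similarity)
    return numerator / total_weight if total_weight else None

# Two hypothetical neighbors with similarities 0.9 and 0.4 to the target user:
neighbors = [(0.9, {"item1": 5.0}), (0.4, {"item1": 3.0, "item2": 4.0})]
print(nn_predict(neighbors, "item1"))   # (0.9*5.0 + 0.4*3.0) / 1.3, about 4.38
print(nn_predict(neighbors, "item3"))   # None: no neighbor rated it
```

Returning None when no neighbor has rated the item mirrors the practical limitation noted for CF methods: without overlapping opinions, a nearest-neighbor recommender has nothing to average.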
Gianluca Lax
University Mediterranea of Reggio Calabria, Italy
Approximate Range Queries by Histograms in OLAP
Classical histograms lack the last point, since they are flat structures. Many proposals have been presented in order to guarantee the three properties previously described, and we report some of them in the following.

Requirement (3) was introduced by Koudas, Muthukrishnan, and Srivastava (2000), where the authors have shown the insufficient accuracy of classical histograms in evaluating hierarchical range queries. Therein, a polynomial-time algorithm for constructing optimal histograms with respect to hierarchical queries is proposed.

The selectivity estimation problem for non-hierarchical range queries was studied by Gilbert, Kotidis, Muthukrishnan, and Strauss (2001), and, according to property (2), optimal and approximate polynomial (in the database size) algorithms with a provable approximation guarantee for constructing histograms are also presented.

Guha, Koudas, and Srivastava (2002) have proposed efficient algorithms for the problem of approximating the distribution of measure attributes organized into hierarchies. Such algorithms are based on dynamic programming and on a notion of sparse intervals.

Algorithms returning both optimal and suboptimal solutions for approximating range queries by histograms, and for their dynamic maintenance under additive changes, are provided by Muthukrishnan and Strauss (2003). The best algorithm, with respect to construction time, returning an optimal solution takes polynomial time.

Buccafurri and Lax (2003) have presented a histogram based on a hierarchical decomposition of the data distribution kept in a full binary tree. Such a tree, containing a set of precomputed hierarchical queries, is encoded by using bit saving in order to obtain a smaller structure and, thus, to support hierarchical range queries efficiently.

Besides bucket-based histograms, there are other kinds of histograms whose construction is not driven by the search for a suitable partition of the attribute domain and whose structure is more complex than simply a set of buckets. This class of histograms is called non-bucket-based histograms. Wavelets are an example of this kind of histogram.

In the next section, we deal with the second problem introduced earlier, concerning the estimation of range queries partially involving buckets.

Estimation Inside a Bucket

While finding the optimal bucket partition has been widely investigated in past years, the problem of estimating queries partially involving a bucket has received little attention.

Histograms are well suited to range query evaluation, since buckets basically correspond to a set of precomputed range queries. A range query that entirely involves one or more buckets can be computed exactly, while if it partially overlaps a bucket, then the result can only be estimated.

The simplest adopted estimation technique is the Continuous Value Assumption (CVA). Given a bucket of size s and sum c, a range query overlapping the bucket in i points is estimated as (i/s) · c. This corresponds to estimating the partial contribution of the bucket to the range query result by linear interpolation.

Another possibility is to use the Uniform Spread Assumption (USA). It assumes that values are distributed at equal distance from each other and that the overall frequency sum is equally distributed among them. In this case, it is necessary to know the number of non-null frequencies belonging to the bucket. Denoting by t such a value, the range query is estimated by

    ((s - 1) + (i - 1) · (t - 1)) / ((s - 1) · t) · c.

An interesting problem is understanding whether, by exploiting the information typically contained in histogram buckets, and possibly by adding some concise summary information, the frequency estimation inside buckets, and thus the histogram accuracy, can be improved. To this aim, starting from a theoretical analysis of the limits of CVA and USA, Buccafurri, Pontieri, Rosaci, and Saccà (2002) have proposed to use an additional storage space of 32 bits, called the 4LT, in each bucket in order to store an approximate representation of the data distribution inside the bucket. In particular, the 4LT is used to save approximate cumulative frequencies at seven equidistant intervals internal to the bucket.

Clearly, approaches similar to that followed in Buccafurri, Pontieri, Rosaci, and Saccà (2002) have to deal with the trade-off between the extra storage space required for each bucket and the total number of buckets that the allowed total storage space permits.

FUTURE TRENDS

Data streams are an emergent issue that in the last two years has captured the interest of many scientific communities. The crucial problem arising in several application contexts, like network monitoring, sensor networks, financial applications, security, telecommunication data management, Web applications, and so on, is dealing with continuous data flows (i.e., data streams) having the following characteristics: (1) they are time dependent; (2) their size is very large, so that they cannot be stored totally due to the actual memory
limitation; and (3) data arrival is very fast and unpredictable, so that each data management operation should be very efficient.

Since a data stream consists of a large amount of data, it is usually managed on the basis of a sliding window, including only the most recent data (Babcock, Babu, Datar, Motwani & Widom, 2002). Thus, any technique capable of compressing sliding windows while maintaining a good approximate representation of the data distribution is certainly relevant in this field. Typical queries performed on sliding windows are similarity queries and other analyses, like change mining queries (Dong, Han, Lakshmanan, Pei, Wang & Yu, 2003), useful for trend analysis and, in general, for understanding the dynamics of data. In this field too, histograms may become an important analysis tool. The challenge is finding new histograms that (1) are fast to construct and to maintain, that is, the required updating operations (performed at each data arrival) are very efficient; (2) maintain a good accuracy in approximating the data distribution; and (3) support continuous querying on the data.

An example of the above emerging approaches is reported in Buccafurri and Lax (2004), where a tree-like histogram with cyclic updating is proposed. By using such a compact structure, many mining techniques, whose computational cost would be very high if they were used on raw data streams, can be implemented effectively.

CONCLUSION

Data reduction represents an important task both in data mining and in OLAP, since it allows us to represent very large amounts of data in a compact structure on which mining techniques or OLAP queries can be performed efficiently. The time and memory cost advantages arising from data compression, provided that a sufficient degree of accuracy is guaranteed, may considerably improve the capabilities of mining and OLAP tools.

This opportunity, added to the necessity (coming from emergent research fields such as data streams) of producing more and more compact representations of data, explains the attention that the research community is giving to techniques like histograms and wavelets, which provide a concrete answer to the previous requirements.

REFERENCES

Buccafurri, F., & Lax, G. (2003). Pre-computing approximate hierarchical range queries in a tree-like histogram. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery.

Buccafurri, F., & Lax, G. (2004). Reducing data stream sliding windows by cyclic tree-like histograms. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases.

Buccafurri, F., Pontieri, L., Rosaci, D., & Saccà, D. (2002). Improving range query estimation on histograms. In Proceedings of the International Conference on Data Engineering.

Chakrabarti, K., Garofalakis, M., Rastogi, R., & Shim, K. (2001). Approximate query processing using wavelets. The VLDB Journal, 10(2-3), 199-223.

Chaudhuri, S., Das, G., & Narasayya, V. (2001). A robust, optimization-based approach for approximate answering of aggregate queries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data.

Dong, G., et al. (2003). Online mining of changes from data streams: Research problems and preliminary results. In Proceedings of the ACM SIGMOD Workshop on Management and Processing of Data Streams.

Ganti, V., Lee, M.L., & Ramakrishnan, R. (2000). ICICLES: Self-tuning samples for approximate query answering. In Proceedings of the 26th International Conference on Very Large Data Bases.

Garofalakis, M., & Gibbons, P.B. (2002). Wavelet synopses with error guarantees. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Garofalakis, M., & Kumar, A. (2004). Deterministic wavelet thresholding for maximum error metrics. In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.

Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., & Strauss, M.J. (2001). Optimal and approximate computation of summary statistics for range aggregates. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
Kacha, A., Grenez, F., De Doncker, P., & Benmahammed, K. (2003). A wavelet-based approach for frequency estimation of interference signals in printed circuit boards. In Proceedings of the 1st International Symposium on Information and Communication Technologies.

Khalifa, O. (2003). Image data compression in wavelet transform domain using modified LBG algorithm. In Proceedings of the 1st International Symposium on Information and Communication Technologies.

Koudas, N., Muthukrishnan, S., & Srivastava, D. (2000). Optimal histograms for hierarchical range queries (extended abstract). In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.

Li, T., Li, Q., Zhu, S., & Ogihara, M. (2002). A survey on wavelet applications in data mining. ACM SIGKDD Explorations, 4(2), 49-68.

Muthukrishnan, S., & Strauss, M. (2003). Rangesum histograms. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms.

Wu, Y., Agrawal, D., & Abbadi, A.E. (2002). Query estimation by adaptive sampling. In Proceedings of the International Conference on Data Engineering.

KEY TERMS

Bucket-Based Histogram: A type of histogram whose construction is driven by the search for a suitable partition of the attribute domain into buckets.

Continuous Value Assumption (CVA): A technique allowing us to estimate values inside a bucket by linear interpolation.

Data Preprocessing: The application of several methods preceding the mining phase, done to improve the overall data mining results. Usually, it consists of (1) data cleaning, a method for fixing missing values, outliers, and possibly inconsistent data; (2) data integration, the union of (possibly heterogeneous) data coming from different sources into a unique data store; and (3) data reduction, the application of any technique working on data representation capable of saving storage space without compromising the possibility of querying the data.

Histogram: A set of buckets implementing a partition of the overall domain of a relation attribute.

Range Query: A query returning aggregate information (e.g., sum, average) about data belonging to a given interval of the domain.

Uniform Spread Assumption (USA): A technique for estimating values inside a bucket by assuming that values are distributed at an equal distance from each other and that the overall frequency sum is distributed equally among them.
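The CVA and USA estimators defined above can be written out directly in the notation of the Estimation Inside a Bucket section (bucket size s, bucket sum c, t non-null values, query overlap of i points). The sketch below uses invented example numbers; note that the USA estimate reduces to c/t for a one-point overlap and to the whole sum c for a full overlap, as expected.

```python
def cva_estimate(i, s, c):
    """Continuous Value Assumption: estimate the contribution of a
    bucket of size s and sum c to a query overlapping it in i points
    by linear interpolation."""
    return (i / s) * c

def usa_estimate(i, s, c, t):
    """Uniform Spread Assumption: the t non-null values are assumed
    equally spaced across the bucket, each carrying frequency c / t."""
    return ((s - 1) + (i - 1) * (t - 1)) / ((s - 1) * t) * c

# A bucket spanning s = 11 points, with sum c = 100 and t = 5 non-null values:
print(cva_estimate(5, 11, 100))      # 5/11 of the sum, about 45.45
print(usa_estimate(1, 11, 100, 5))   # 20.0: a one-point overlap captures one value, c/t
print(usa_estimate(11, 11, 100, 5))  # 100.0: full overlap returns the whole bucket sum
```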
Artificial Neural Networks for Prediction

INTRODUCTION

The design and implementation of intelligent systems with human capabilities is the starting point to design Artificial Neural Networks (ANNs). The original idea takes after neuroscience theory on how neurons in the human brain cooperate to learn from a set of input signals to produce an answer. Because the power of the brain comes from the number of neurons and the multiple connections between them, the basic idea is that connecting a large number of simple elements in a specific way can form an intelligent system.

Generally speaking, an ANN is a network of many simple processors called units, linked to certain neighbors with varying coefficients of connectivity (called weights) that represent the strength of these connections. The basic unit of ANNs, called an artificial neuron, simulates the basic functions of natural neurons: it receives inputs, processes them by simple combination and threshold operations, and outputs a final result. ANNs often employ supervised learning, in which training data (including both the input and the desired output) is provided. Learning basically refers to the process of adjusting the weights to optimize the network performance. ANNs belong to the machine-learning algorithms because changing a network's connection weights causes it to gain knowledge in order to solve the problem at hand.

Neural networks have been widely used for both classification and prediction. In this article, I focus on the prediction or estimation problem (although, with a few changes, my comments and descriptions also apply to classification). Estimating and forecasting future conditions are involved in different business activities; some examples include cost estimation, prediction of product demand, and financial planning. Moreover, the field of prediction also covers other activities, such as medical diagnosis or industrial process modeling.

In this short article I focus on multilayer neural networks because they are the most common. I describe their architecture and some of the most popular training methods. Then I finish with some associated conclusions and the appropriate list of references to provide some pointers for further study.

BACKGROUND

From a technical point of view, ANNs offer a general framework for representing nonlinear mappings from several input variables to several output variables. They are built by tuning a set of parameters known as weights and can be considered an extension of the many conventional mapping techniques. In classification or recognition problems, the net's outputs are categories, while in prediction or approximation problems, they are continuous variables. Although this article focuses on the prediction problem, most of the key issues in the net functionality are common to both.

In the process of training the net (supervised learning), the problem is to find the values of the weights w that minimize the error across a set of input/output pairs (patterns) called the training set E. For a single output and input vector x, the error measure is typically the root mean squared difference between the predicted output p(x,w) and the actual output value f(x) for all the elements x in E (RMSE); therefore, the training is an unconstrained nonlinear optimization problem, where the decision variables are the weights, and the objective is to reduce the training error. Ideally, the set E is a representative sample of points in the domain of the function f that you are approximating; however, in practice it is usually a set of points for which you know the f-value.

Min_w error(E, w) = sqrt( Σ_{x∈E} ( f(x) − p(x,w) )² / |E| )    (1)

The main goal in the design of an ANN is to obtain a model that makes good predictions for new inputs (i.e., to provide good generalization). Therefore, the net must represent the systematic aspects of the training data rather than their specific details. The standard way to measure the generalization provided by the net consists of introducing a second set of points in the domain of f, called the testing set, T. Assume that no point in T belongs to E and that f(x) is known for all x in T. After the optimization has been performed and the weights have been set to minimize the error in E (w=w*), the error across the testing set T is computed (error(T,w*)). The
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
net must exhibit a good fit between the target f-values and the output (prediction) in the training set and also in the testing set. If the RMSE in T is significantly higher than the one in E, you say that the net has memorized the data instead of learning them (i.e., the net has overfitted the training data).

The optimization of the function given in (1) is a hard problem by itself. Moreover, keep in mind that the final objective is to obtain a set of weights that provides low values of error(T,w*) for any set T. In the following sections I summarize some of the most popular, and other not so popular but more efficient, methods to train the net (i.e., to compute appropriate weight values).

MAIN THRUST

Several models inspired by biological neural networks have been proposed throughout the years, beginning with the perceptron introduced by Rosenblatt (1962). He studied a simple architecture where the output of the net is a transformation of a linear combination of the input variables and the weights. Minsky and Papert (1969) showed that the perceptron can only solve linearly separable classification problems and is therefore of limited interest. A natural extension to overcome its limitations is given by the so-called multilayer perceptron or, simply, multilayer neural networks. I have considered this architecture with a single hidden layer. A schematic representation of the network appears in Figure 1.

Neural Network Architecture

Let NN=(N, A) be an ANN, where N is the set of nodes and A is the set of arcs. N is partitioned into three subsets: NI, the input nodes; NH, the hidden nodes; and NO, the output nodes. I assume that n variables exist in the function that I want to predict or approximate; therefore, |NI| = n. The neural network has m hidden neurons (|NH| = m) with a bias term in each hidden neuron and a single output neuron (we restrict our attention to real functions f: R^n → R). Figure 1 shows a net where NI = {1, 2, ..., n}, NH = {n+1, n+2, ..., n+m}, and NO = {s}.

Given an input pattern x = (x1, ..., xn), the neural network provides the user with an associated output NN(x,w), which is a function of the weights w. Each node i in the input layer receives a signal of amount xi that it sends through all its incident arcs to the nodes in the hidden layer. Each node n+j in the hidden layer receives a signal input(n+j) according to the expression

input(n+j) = w_{n+j} + Σ_{i=1..n} x_i w_{i,n+j}

where w_{n+j} is the bias value for node n+j, and w_{i,n+j} is the weight value on the arc from node i in the input layer to node n+j in the hidden layer. Each hidden node transforms its input by means of a nonlinear activation function: output(j) = sig(input(j)). The most popular choice for the activation function is the sigmoid function sig(x) = 1/(1+e^(−x)). Laguna and Martí (2002) test two activation functions for the hidden neurons and conclude that the sigmoid presents superior performance. Each hidden node n+j sends the amount of signal output(n+j) through the arc (n+j, s). The node s in the output layer receives the weighted sum of the values coming from the hidden nodes. This sum, NN(x,w), is the net's output according to the expression:

NN(x,w) = w_s + Σ_{j=1..m} output(n+j) w_{n+j,s}

In the process of training the net (supervised learning), the problem is to find the values of the weights (including the bias factors) that minimize the error (RMSE) across the training set E. After the optimization has been performed and the weights have been set (w=w*), the net is ready to produce the output for any input value. The testing error Error(T,w*) computes the Root Mean Squared Error across the elements in the testing set T = {y1, y2, ..., ys}, where no element belongs to the training set E:

Error(T, w*) = ( Σ_{i=1..s} error(y_i, w*) ) / s

Figure 1. Schematic representation of the network: input nodes 1, ..., n receive x1, ..., xn and feed hidden nodes n+1, ..., n+m (through weights such as w_{1,n+1}), which feed the single output node s (through weights such as w_{n+1,s})
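A minimal sketch of the forward computation NN(x, w) and the RMSE error measure described above (illustrative Python; the dictionary layout used for the weights is an assumption of this example, not a standard format):

```python
import math
import random

def sig(x):
    """Sigmoid activation used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-x))

def nn_output(x, w):
    """Forward pass NN(x, w) for a single-hidden-layer net.

    Assumed weight layout:
    w["hidden"][j] = (bias of node n+j, [w_{1,n+j}, ..., w_{n,n+j}])
    w["out"]       = (bias of node s,   [w_{n+1,s}, ..., w_{n+m,s}])
    """
    hidden_out = [sig(bias + sum(xi * wij for xi, wij in zip(x, ws)))
                  for bias, ws in w["hidden"]]
    bias_s, w_out = w["out"]
    return bias_s + sum(h * v for h, v in zip(hidden_out, w_out))

def rmse(pairs, w):
    """Root mean squared error across (x, f(x)) pairs, as in equation (1)."""
    return math.sqrt(sum((fx - nn_output(x, w)) ** 2 for x, fx in pairs)
                     / len(pairs))

# Example: 2 inputs, 3 hidden neurons, random weights
random.seed(1)
w = {"hidden": [(random.uniform(-1, 1),
                 [random.uniform(-1, 1) for _ in range(2)]) for _ in range(3)],
     "out": (random.uniform(-1, 1), [random.uniform(-1, 1) for _ in range(3)])}
E = [((0.0, 0.0), 0.0), ((1.0, 1.0), 2.0)]
print(rmse(E, w))  # the training error a training method would minimize
```

Training amounts to searching the weight space for a `w` that drives this RMSE down, which is exactly the unconstrained nonlinear optimization problem the methods below address.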
Training Methods
method. For a deeper understanding of them, see the excellent book by Bishop (1995).

Backpropagation (BP) was the first method for neural network training and is still the most widely used algorithm in practical applications. It is a gradient descent method that searches for the global optimum of the network weights. Each iteration consists of two steps. First, the partial derivatives ∂Error/∂w are computed for each weight in the net. Then the weights are modified to reduce the RMSE according to the direction given by the gradient. There have been different modifications to this basic procedure; the most significant is the addition of a momentum term to prevent zigzagging in the search.

Because the neural network training problem can be expressed as a nonlinear unconstrained optimization problem, one might use nonlinear methods more elaborate than gradient descent to solve it. A selection of the best-established algorithms in unconstrained nonlinear optimization has also been used in this context: specifically, the nonlinear simplex method, the direction set method, the conjugate gradient method, the Levenberg-Marquardt algorithm (Moré, 1978), and the GRG2 algorithm (Smith & Lasdon, 1992).

Recently, metaheuristic methods have also been adapted to this problem. Specifically, on one hand you can find those methods based on local search procedures, and on the other, those methods based on a population of solutions, known as evolutionary methods. In the first category, two methods have been applied, simulated annealing and tabu search, while in the second you can find the so-called genetic algorithms, the scatter search, and, more recently, a path relinking implementation. Several studies (Sexton, 1998) have shown that tabu search outperforms the simulated annealing implementation; therefore, I first focus on the different tabu search implementations for ANN training.

Tabu Search

Tabu search (TS) is based on the premise that, in order to qualify as intelligent, problem solving must incorporate adaptive memory and responsive exploration. The adaptive memory feature of TS allows the implementation of procedures that are capable of searching the solution space economically and effectively. Because local choices are guided by information collected during the search, TS contrasts with memoryless designs that rely heavily on semirandom processes that implement a form of sampling. The emphasis on responsive exploration in tabu search, whether in a deterministic or probabilistic implementation, derives from the supposition that a bad strategic choice can yield more information than a good random choice. In a system that uses memory, a bad choice based on strategy can provide useful clues about how the strategy may profitably be changed.

As far as I know, the first tabu search approach for neural network training is due to Sexton et al. (1998). A short description follows. An initial solution x0 is randomly drawn from a uniform distribution in the range [-10, 10]. Solutions are randomly generated in this range for a given number of iterations. When generating a new point xnew, the aspiration level and the tabu conditions are checked. If f(xnew) < f(xbest), then the point is automatically accepted, and both xbest and f(xbest) are updated; otherwise, the tabu conditions are tested. If there is a solution xi in the tabu list (TL) such that f(xnew) ∈ [f(xi) − 0.01·f(xi), f(xi) + 0.01·f(xi)], then the complete test is applied to xnew and xi; otherwise, the point is accepted. The test checks whether all the weights in xnew are within 0.01 of those in xi; in that case, the point is rejected; otherwise, the point is accepted, and xnew and f(xnew) are entered into TL. This process continues for 1,000 iterations of accepted solutions. Then another cycle of 1,000 iterations of random sampling begins. These cycles repeat continuously while f(xbest) improves.

Martí and El-Fallahi (2004) propose an improved tabu search method that consists of three phases: MultiRSimplex, TSProb, and TSFreq. After the initialization with the MultiRSimplex phase, the procedure performs iterations in a loop that alternates the two phases, TSProb and TSFreq, to intensify and diversify the search, respectively. In this work, a computational study of 12 methods for neural network training is presented, including nonlinear and local-search-based optimizers. Overall, experiments with 45 functions from the literature were performed to compare the procedures. The experiments show that some functions cannot be approximated with a reasonable accuracy level when training the net for a limited number of iterations. The experimentation also shows that the proposed TS provides, on average, the best solutions (best approximations).

Evolutionary Methods

The idea of applying the biological principle of natural evolution to artificial systems, introduced more than three decades ago, has seen impressive growth in the past few years. Evolutionary algorithms have been successfully applied to numerous problems from different domains, including optimization, automatic programming, machine learning, economics, ecology, population genetics, studies of evolution and learning, and social systems.

A genetic algorithm is an iterative procedure that consists of a constant-size population of individuals,
each represented by a finite string of symbols, known as the genome, encoding a possible solution in a given problem space. This space, referred to as the search space, comprises all possible solutions to the problem at hand. Solutions to a problem were originally encoded as binary strings due to certain computational advantages associated with such an encoding. Also, the theory about the behavior of algorithms was based on binary strings. Because in many instances it is impractical to represent solutions by using binary strings, the solution representation has been extended in recent years to include character-based encoding, real-valued encoding, and tree representations.

The standard genetic algorithm proceeds as follows. An initial population of individuals is generated at random or heuristically. In every evolutionary step, known as a generation, the individuals in the current population are decoded and evaluated according to some predefined quality criterion, referred to as the fitness, or fitness function. To form a new population (the next generation), individuals are selected according to their fitness. Many selection procedures are currently in use, one of the simplest being Holland's original fitness-proportionate selection, where individuals are selected with a probability proportional to their relative fitness. This ensures that the expected number of times an individual is chosen is approximately proportional to its relative performance in the population. Thus, high-fitness (good) individuals stand a better chance of reproducing, while low-fitness ones are more likely to disappear.

In terms of ANN training, a solution (or individual) consists of an array with the net's weights, and its associated fitness is usually the RMSE obtained with this solution in the training set. You can find a lot of research on GA implementations for ANNs. Consider, for instance, the recent work by Alba and Chicano (2004), in which a hybrid GA is proposed. Here, the hybridization refers to the inclusion of problem-dependent knowledge in a general search template. The hybrid algorithms used in this work are combinations of two algorithms (weak hybridization), where one of them acts as an operator in the other. This kind of combination has produced the most successful training methods in the last few years. The authors propose the combination of GA with the BP algorithm, as well as GA with the Levenberg-Marquardt method, for training ANNs.

Scatter search (SS) was first introduced in Glover (1977) as a heuristic for integer programming. The following template is a standard for implementing scatter search and consists of five methods: a diversification generation method, to generate a collection of diverse trial solutions; an improvement method, to transform a trial solution into one or more enhanced trial solutions; a reference set update method, to build and maintain a reference set consisting of the b best solutions found; a subset generation method, to operate on the reference set in order to produce a subset of its solutions as a basis for creating combined solutions; and a solution combination method, to transform a given subset of solutions produced by the subset generation method into one or more combined solution vectors. An exhaustive description of these methods and how they operate can be found in Laguna and Martí (2003).

Laguna and Martí (2002) proposed a three-step Scatter Search algorithm for ANNs. El-Fallahi, Martí, and Lasdon (in press) propose a new training method based on the path relinking methodology. Path relinking starts from a given set of elite solutions obtained during a previous search process. Path relinking and its cousin, Scatter Search, are mainly based on two elements: combinations and local search. Path relinking generalizes the concept of combination beyond its usual application to consider paths between solutions. Local search, performed now with the GRG2 optimizer, intensifies the search by seeking local optima. The paper shows an empirical comparison of the proposed method with the best previous evolutionary approaches, and the associated experiments show the superiority of the new method in terms of solution quality (prediction accuracy). On the other hand, these experiments confirm again that a few functions cannot be approximated with any of the current training methods.

FUTURE TRENDS

An open problem in the context of prediction is to compare ANNs with some modern approximation techniques developed in statistics. Specifically, the nonparametric additive models and local regression can also offer good solutions to the general approximation or prediction problem. The development of hybrid systems from both technologies could give the starting point for a new generation of prediction systems.

CONCLUSION

In this work I review the most representative methods for neural network training. Several computational studies with some of these methods reveal that the best results are achieved with a combination of a metaheuristic procedure with a nonlinear optimizer. These experiments also show that, from a practical point of view, some functions cannot be approximated.
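The acceptance test of the tabu search of Sexton et al. (1998), described in the Training Methods section above, can be sketched as follows (an illustrative reconstruction, not the authors' code; all names are assumptions):

```python
# Illustrative sketch of the tabu acceptance test: a candidate weight
# vector is compared against the tabu list before being accepted.

def is_tabu(x_new, f_new, tabu_list, rel_tol=0.01, w_tol=0.01):
    """Return True if the candidate x_new should be rejected.

    A candidate is suspect when its objective value lies within 1% of a
    solution stored in the tabu list; it is rejected only if, in addition,
    every weight lies within w_tol of that stored solution's weights.
    """
    for x_i, f_i in tabu_list:
        if abs(f_new - f_i) <= rel_tol * abs(f_i):   # value within 1% of f(x_i)
            if all(abs(a - b) <= w_tol for a, b in zip(x_new, x_i)):
                return True                           # too close: reject
    return False

tabu_list = [([0.50, -0.25], 10.0)]
print(is_tabu([0.505, -0.245], 10.05, tabu_list))  # True: value and weights close
print(is_tabu([3.0, 2.0], 10.05, tabu_list))       # False: weights differ
```

A candidate that passes the test would be accepted and entered into the tabu list together with its objective value, as in the procedure described above.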
Martí, R., & El-Fallahi, A. (2004). Multilayer neural networks: An experimental evaluation of on-line training methods. Computers and Operations Research, 31, 1491-1513.

Martí, R., Laguna, M., & Glover, F. (in press). Principles of scatter search. European Journal of Operational Research.

Minsky, M. L., & Papert, S. A. (1969). Perceptrons (Expanded ed.). Cambridge, MA: MIT Press.

Moré, J. J. (1978). The Levenberg-Marquardt algorithm: Implementation and theory. In G. Watson (Ed.), Lecture Notes in Mathematics: Vol. 630.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.

KEY TERMS

Metaheuristic: A master strategy that guides and modifies other heuristics to produce solutions beyond those that are normally generated in a quest for local optimality.

Network Training: The process of finding the values of the network weights that minimize the error across a set of input/output pairs (patterns) called the training set.

Optimization: The quantitative study of optima and the methods for finding them.

Prediction: Consists of approximating unknown functions. The net's input is the values of the function variables, and the output is the estimation of the function image.

Scatter Search: A metaheuristic that belongs to the evolutionary methods.
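The five-method scatter search template enumerated in the article can be sketched as a skeleton loop (illustrative Python; the function names and the toy objective standing in for the training error are assumptions of this example):

```python
import random

# Illustrative skeleton of the five-method scatter search template.
# Solutions are (vector, objective_value) pairs; lower values are better.

def scatter_search(diversify, improve, combine, b=10, iterations=50):
    # Diversification generation + improvement: build an initial pool
    pool = [improve(diversify()) for _ in range(10 * b)]
    # Reference set update: keep the b best solutions found
    refset = sorted(pool, key=lambda s: s[1])[:b]
    for _ in range(iterations):
        # Subset generation: here, all pairs of reference solutions
        pairs = [(s1, s2) for i, s1 in enumerate(refset)
                 for s2 in refset[i + 1:]]
        # Solution combination + improvement of the combined solutions
        new = [improve(combine(s1, s2)) for s1, s2 in pairs]
        # Reference set update with the newly combined solutions
        refset = sorted(refset + new, key=lambda s: s[1])[:b]
    return refset[0]

# Toy objective standing in for the RMSE: minimize a sum of squares
def f(x):
    return sum(v * v for v in x)

def diversify():
    xs = [random.uniform(-10, 10) for _ in range(3)]
    return (xs, f(xs))

def improve(s):  # one crude improvement step: try halving the vector
    h = [v * 0.5 for v in s[0]]
    return (h, f(h)) if f(h) < s[1] else s

def combine(s1, s2):  # combine two solutions by averaging
    m = [(a + c) / 2 for a, c in zip(s1[0], s2[0])]
    return (m, f(m))

random.seed(0)
best = scatter_search(diversify, improve, combine)
print(best[1])  # close to the optimum value 0
```

In an ANN training setting, the solution vector would hold the net's weights and the objective would be the training-set RMSE, with the improvement method played by a nonlinear optimizer such as those discussed in the article.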
Association Rule Mining

Wee-Keong Ng
Nanyang Technological University, Singapore

Ee-Peng Lim
Nanyang Technological University, Singapore
INTRODUCTION

Association Rule Mining (ARM) is concerned with how items in a transactional database are grouped together. It is commonly known as market basket analysis, because it can be likened to the analysis of items that are frequently put together in a basket by shoppers in a market. From a statistical point of view, it is a semiautomatic technique to discover correlations among a set of variables.

ARM is widely used in myriad applications, including recommender systems (Lawrence, Almasi, Kotlyar, Viveros, & Duri, 2001), promotional bundling (Wang, Zhou, & Han, 2002), Customer Relationship Management (CRM) (Elliott, Scionti, & Page, 2003), and cross-selling (Brijs, Swinnen, Vanhoof, & Wets, 1999). In addition, its concepts have also been integrated into other mining tasks, such as Web usage mining (Woon, Ng, & Lim, 2002), clustering (Yiu & Mamoulis, 2003), outlier detection (Woon, Li, Ng, & Lu, 2003), and classification (Dong & Li, 1999), for improved efficiency and effectiveness.

CRM benefits greatly from ARM, as it helps in the understanding of customer behavior (Elliott et al., 2003). Marketing managers can use association rules of products to develop joint marketing campaigns to acquire new customers. The application of ARM to the cross-selling of supermarket products has been successfully attempted in many cases (Brijs et al., 1999). In one particular study involving the personalization of supermarket product recommendations, ARM has been applied with much success (Lawrence et al., 2001); together with customer segmentation, ARM helped to increase revenue by 1.8%.

In the biology domain, ARM is used to extract novel knowledge on protein-protein interactions (Oyama, Kitano, Satou, & Ito, 2002). It has also been successfully applied in gene expression analysis to discover biologically relevant associations between different genes or between different environmental conditions (Creighton & Hanash, 2003).

BACKGROUND

Recently, a new class of problems has emerged to challenge ARM researchers: incoming data is streaming in too fast and changing too rapidly, in an unordered and unbounded manner. This new phenomenon is termed the data stream (Babcock, Babu, Datar, Motwani, & Widom, 2002).

One major area where the data stream phenomenon is prevalent is the World Wide Web (Web). A good example is an online bookstore, where customers can purchase books from all over the world at any time. As a result, its transactional database grows at a fast rate and presents a scalability problem for ARM. Traditional ARM algorithms, such as Apriori, were not designed to handle large databases that change frequently (Agrawal & Srikant, 1994). Each time a new transaction arrives, Apriori needs to be restarted from scratch to perform ARM. Hence, it is clear that, in order to conduct ARM on the latest state of the database in a timely manner, an incremental mechanism that takes the latest transactions into consideration must be in place.

In fact, a host of incremental algorithms have already been introduced to mine association rules incrementally (Sarda & Srinivas, 1998). However, they are only incremental to a certain extent: the moment the universal itemset (the set of unique items in a database) (Woon, Ng, & Das, 2001) is changed, they have to be restarted from scratch. The universal itemset of any online store would certainly be changed frequently, because the store needs to introduce new products and retire old ones for competitiveness. Moreover, such incremental ARM algorithms are efficient only when the database has not changed much since the last mining.

The use of data structures in ARM, particularly the trie, is one viable way to address the data stream phenomenon. Data structures first appeared when programming became increasingly complex during the 1960s. In his classic book, The Art of Computer Programming, Knuth (1968) reviewed and analyzed algorithms and data structures that are necessary for program efficiency.
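The support counting at the heart of algorithms such as Apriori can be illustrated with a brute-force sketch (illustrative Python only; real algorithms generate and prune candidates level by level instead of enumerating every subset):

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support):
    """Brute-force support counting: keep itemsets meeting min_support.

    Illustrative only; Apriori prunes the candidate space level by level
    rather than enumerating every subset of every transaction.
    """
    counts = Counter()
    for t in transactions:
        for k in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

# A toy four-transaction database
db = [{"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}]
freq = frequent_itemsets(db, min_support=2)
print(freq[("A", "C")])  # 3: {A, C} occurs in three of the four transactions
```

Every new transaction forces such counts to be recomputed or updated, which is exactly the cost that the data structures discussed in this article are designed to reduce.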
Since then, the traditional data structures have been extended, and new algorithms have been introduced for them. Though computing power has increased tremendously over the years, efficient algorithms with customized data structures are still necessary to obtain timely and accurate results. This fact is especially true for ARM, which is a computationally intensive process.

The trie is a multiway tree structure that allows fast searches over string data. In addition, as strings with common prefixes share the same nodes, storage space is better utilized. This makes the trie very useful for storing large dictionaries of English words. Figure 1 shows a trie storing four English words (ape, apple, base, and ball). Several novel trie-like data structures have been introduced to improve the efficiency of ARM, and we discuss them in this section.

Figure 1. A trie storing the words ape, apple, base, and ball

Amir, Feldman, & Kashi (1999) presented a new way of mining association rules by using a trie to preprocess the database. In this approach, all transactions are mapped onto a trie structure. This mapping involves the extraction of the powerset of the transaction items and the updating of the trie structure. Once built, there is no longer a need to scan the database to obtain support counts of itemsets, because the trie structure contains all their support counts. To find frequent itemsets, the structure is traversed by using depth-first search, and itemsets with support counts satisfying the minimum support threshold are added to the set of frequent itemsets.

Drawing upon that work, Yang, Johar, Grama, & Szpankowski (2000) introduced a binary Patricia trie to reduce the heavy memory requirements of the preprocessing trie. To support faster support queries, the authors added a set of horizontal pointers to index nodes. They also advocated the use of some form of primary threshold to further prune the structure. However, the compression achieved by the compact Patricia trie comes at a hefty price: it greatly complicates the horizontal pointer index, which is a severe overhead. In addition, after compression, it will be difficult for the Patricia trie to be updated whenever the database is altered.

Table 1. A sample transactional database

TID    Items
100    A, C
200    B, C
300    A, B, C
400    A, B, C, D

The Frequent Pattern-growth (FP-growth) algorithm is a recent association rule mining algorithm that achieves impressive results (Han, Pei, Yin, & Mao, 2004). It uses a compact tree structure called a Frequent Pattern-tree (FP-tree) to store information about frequent 1-itemsets. This compact structure removes the need for multiple database scans and is constructed with only two scans. In the first database scan, frequent 1-itemsets are obtained and sorted in support-descending order. In the second scan, items in the transactions are first sorted according to the order of the frequent 1-itemsets. These sorted items are used to construct the FP-tree. Figure 2 shows an FP-tree constructed from the database in Table 1.

Figure 2. An FP-tree constructed from the database in Table 1

FP-growth then proceeds to recursively mine FP-trees of decreasing size to generate frequent itemsets without candidate generation and database scans. It does so by examining all the conditional pattern bases of the FP-tree, which consist of the sets of frequent itemsets occurring with each suffix pattern. Conditional FP-trees are constructed from these conditional pattern bases, and mining is carried out recursively with such trees to discover frequent itemsets of various sizes. However, because both the construction and the use of the FP-trees are complex, the performance of FP-growth is reduced to be on par with Apriori at support thresholds of 3% and above. It only achieves significant speed-ups at support thresholds of 1.5% and below. Moreover, it is only incremental to a certain extent, depending on the FP-tree watermark (validity support threshold). As new transactions arrive, the support counts of items increase, but their relative support frequency may decrease, too. Suppose, however, that the new transactions cause too many previously infrequent itemsets to become
frequent; that is, the watermark is raised too high, according to a user-defined level, in order to make such itemsets infrequent. In that case, the FP-tree must be reconstructed.

The use of lattice theory in ARM was pioneered by Zaki (2000). Lattice theory allows the vast search space to be decomposed into smaller segments that can be tackled independently in memory, or even on other machines, thus promoting parallelism. However, lattices require additional storage space as well as different traversal and construction techniques. To complement the use of lattices, Zaki uses a vertical database format, where each itemset is associated with a list of transactions known as a tid-list (transaction identifier list). This format is useful for fast frequency counting of itemsets but generates additional overhead, because most databases have a horizontal format and would need to be converted first.

The Continuous Association Rule Mining Algorithm (CARMA), together with the support lattice, allows the user to change the support threshold and continuously displays the resulting association rules, with support and confidence bounds, during its first scan/phase (Hidber, 1999). During the second phase, it determines the precise support of each itemset and extracts all the frequent itemsets. CARMA can readily compute frequent itemsets for varying support thresholds. However, experiments reveal that CARMA only performs faster than Apriori at support thresholds of 0.25% and below, because of the tremendous overheads involved in constructing the support lattice.

The adjacency lattice, introduced by Aggarwal & Yu (2001), is similar to Zaki's Boolean powerset lattice, except that the authors introduced the notion of adjacency among itemsets, and it does not rely on a vertical database format. Two itemsets are said to be adjacent to each other if one of them can be transformed into the other with the addition of a single item. To address the problem of heavy memory requirements, a primary threshold is defined. This term signifies the minimum support threshold possible to fit all the qualified itemsets into the adjacency lattice in main memory. However, this approach disallows the mining of frequent itemsets at support thresholds lower than the primary threshold.

MAIN THRUST

A data structure suited to the data stream phenomenon should satisfy several requirements:

- It is highly scalable with respect to the size of both the database and the universal itemset.
- It is incrementally updated as transactions are added or deleted.
- It is constructed independent of the support threshold and thus can be used for various support thresholds.
- It helps to speed up ARM algorithms to a certain extent that allows results to be obtained in real time.

We shall now discuss our novel trie data structure, which not only satisfies the above requirements but also outperforms the existing structures discussed, in terms of efficiency, effectiveness, and practicality. Our structure is termed the Support-Ordered Trie Itemset (SOTrieIT, pronounced "so-try-it"). It is a dual-level support-ordered trie data structure used to store pertinent itemset information to speed up the discovery of frequent itemsets.

As its construction is carried out before actual mining, it can be viewed as a preprocessing step. For every transaction that arrives, 1-itemsets and 2-itemsets are first extracted from it. For each itemset, the SOTrieIT is traversed in order to locate the node that stores its support count. Support counts of 1-itemsets and 2-itemsets are stored in first-level and second-level nodes, respectively. The traversal of the SOTrieIT thus requires at most two redirections, which makes it very fast. At any point in time, the SOTrieIT contains the support counts of all 1-itemsets and 2-itemsets that appear in all the transactions. It is then sorted level-wise, from left to right, according to the support counts of the nodes in descending order.

Figure 3 shows a SOTrieIT constructed from the database in Table 1. The bracketed number beside an item is its support count; hence, the support count of itemset {AB} is 2. Notice that the nodes are ordered by support counts in a level-wise descending order.

Figure 3. A SOTrieIT structure

In algorithms such as FP-growth that use a similar data structure to store itemset information, the structure must be rebuilt to accommodate updates to the universal
61
TEAM LinG
Association Rule Mining
itemset. The SOTrieIT can be easily updated to accommo- more varied to cater to a broad customer base; transac-
date the new changes. If a node for a new item in the tion databases will grow in both size and complexity.
universal itemset does not exist, it will be created and Hence, association rule mining research will certainly
inserted into the SOTrieIT accordingly. If an item is continue to receive much attention in the quest for
removed from the universal itemset, all nodes contain- faster, more scalable and more configurable algorithms.
ing that item need only be removed, and the rest of the
nodes would still be valid.
Unlike the trie structure of Amir et al. (1999), the CONCLUSION
SOTrieIT is ordered by support count (which speeds up
mining) and does not require the powersets of transac- Association rule mining is an important data mining task
tions (which reduces construction time). The main weak- with several applications. However, to cope with the
ness of the SOTrieIT is that it can only discover frequent current explosion of raw data, data structures must be
1-itemsets and 2-itemsets; its main strength is its speed utilized to enhance its efficiency. We have analyzed
in discovering them. They can be found promptly be- several existing trie data structures used in association
cause there is no need to scan the database. In addition, rule mining and presented our novel trie structure, which
the search (depth first) can be stopped at a particular has been proven to be most useful and practical. What
level the moment a node representing a nonfrequent lies ahead is the parallelization of our structure to
itemset is found, because the nodes are all support further accommodate the ever-increasing demands of
ordered. todays need for speed and scalability to obtain associa-
Another advantage of the SOTrieIT, compared with tion rules in a timely manner. Another challenge is to
all previously discussed structures, is that it can be design new data structures that facilitate the discovery
constructed online, meaning that each time a new trans- of trends as association rules evolve over time. Differ-
action arrives, the SOTrieIT can be incrementally up- ent association rules may be mined at different time
dated. This feature is possible because the SOTrieIT is points and, by understanding the patterns of changing rules,
constructed without the need to know the support thresh- additional interesting knowledge may be discovered.
old; it is support independent. All 1-itemsets and 2-
itemsets in the database are used to update the SOTrieIT
regardless of their support counts. To conserve storage REFERENCES
space, existing trie structures such as the FP-tree have
to use thresholds to keep their sizes manageable; thus, Aggarwal, C. C., & Yu, P. S. (2001). A new approach to
when new transactions arrive, they have to be recon- online generation of association rules. IEEE Transac-
structed, because the support counts of itemsets will tions on Knowledge and Data Engineering, 13(4), 527-
have changed. 540.
Finally, the SOTrieIT requires far less storage space
than a trie or Patricia trie because it is only two levels Agrawal, R., & Srikant, R. (1994). Fast algorithms for
deep and can be easily stored in both memory and files. mining association rules. Proceedings of the 20th In-
Although this causes some input/output (I/O) overheads, ternational Conference on Very Large Databases (pp.
it is insignificant as shown in our extensive experiments. 487-499), Chile.
We have designed several algorithms to work synergis-
tically with the SOTrieIT and, through experiments with Amir, A., Feldman, R., & Kashi, R. (1999). A new and
existing prominent algorithms and a variety of databases, versatile method for association generation. Informa-
we have proven the practicality and superiority of our tion Systems, 22(6), 333-347.
approach (Das, Ng, & Woon, 2001; Woon et al., 2001). In Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom,
fact, our latest algorithm, FOLD-growth, is shown to J. (2002). Models and issues in data stream systems.
outperform FP-growth by more than 100 times (Woon, Proceedings of the ACM SIGMOD/PODS Conference
Ng, & Lim, 2004). (pp. 1-16), USA.
Brijs, T., Swinnen, G., Vanhoof, K., & Wets, G. (1999).
FUTURE TRENDS Using association rules for product assortment deci-
sions: A case study. Proceedings of the Fifth ACM
The data stream phenomenon will eventually become SIGKDD Conference (pp. 254-260), USA.
ubiquitous as Internet access and bandwidth become Creighton, C., & Hanash, S. (2003). Mining gene expres-
increasingly affordable. With keen competition, prod- sion databases for association rules. Bioinformatics,
ucts will become more complex with customization and 19(1), 79-86.
62
TEAM LinG
Das, A., Ng, W. K., & Woon, Y. K. (2001). Rapid association rule mining. Proceedings of the 10th International Conference on Information and Knowledge Management (pp. 474-481), USA.

Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (pp. 43-52), USA.

Elliott, K., Scionti, R., & Page, M. (2003). The confluence of data mining and market research for smarter CRM. Retrieved from http://www.spss.com/home_page/wp133.htm

Han, J., Pei, J., Yin, Y., & Mao, R. (2004). Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1), 53-97.

Hidber, C. (1999). Online association rule mining. Proceedings of the ACM SIGMOD Conference (pp. 145-154), USA.

Knuth, D. E. (1968). The art of computer programming, Vol. 1: Fundamental algorithms. Addison-Wesley Publishing Company.

Lawrence, R. D., Almasi, G. S., Kotlyar, V., Viveros, M. S., & Duri, S. (2001). Personalization of supermarket product recommendations. Data Mining and Knowledge Discovery, 5(1/2), 11-32.

Oyama, T., Kitano, K., Satou, K., & Ito, T. (2002). Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics, 18(5), 705-714.

Sarda, N. L., & Srinivas, N. V. (1998). An adaptive algorithm for incremental mining of association rules. Proceedings of the Ninth International Conference on Database and Expert Systems (pp. 240-245), Austria.

Wang, K., Zhou, S., & Han, J. (2002). Profit mining: From patterns to actions. Proceedings of the Eighth International Conference on Extending Database Technology (pp. 70-87), Prague.

Woon, Y. K., Li, X., Ng, W. K., & Lu, W. F. (2003). Parameterless data compression and noise filtering using association rule mining. Proceedings of the Fifth International Conference on Data Warehousing and Knowledge Discovery (pp. 278-287), Prague.

Woon, Y. K., Ng, W. K., & Das, A. (2001). Fast online dynamic association rule mining. Proceedings of the Second International Conference on Web Information Systems Engineering (pp. 278-287), Japan.

Woon, Y. K., Ng, W. K., & Lim, E. P. (2002). Online and incremental mining of separately grouped web access logs. Proceedings of the Third International Conference on Web Information Systems Engineering (pp. 53-62), Singapore.

Woon, Y. K., Ng, W. K., & Lim, E. P. (2004). A support-ordered trie for fast frequent itemset discovery. IEEE Transactions on Knowledge and Data Engineering, 16(5).

Yang, D. Y., Johar, A., Grama, A., & Szpankowski, W. (2000). Summary structures for frequency queries on large transaction sets. Proceedings of the Data Compression Conference (pp. 420-429).

Yiu, M. L., & Mamoulis, N. (2003). Frequent-pattern based iterative projected clustering. Proceedings of the Third International Conference on Data Mining, USA.

Zaki, M. J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3), 372-390.

KEY TERMS

Apriori: A classic algorithm that popularized association rule mining. It pioneered a method to generate candidate itemsets by using only frequent itemsets in the previous pass. The idea rests on the fact that any subset of a frequent itemset must be frequent as well. This idea is also known as the downward closure property.

Itemset: An unordered set of unique items, which may be products or features. For computational efficiency, the items are often represented by integers. A frequent itemset is one with a support count that exceeds the support threshold, and a candidate itemset is a potential frequent itemset. A k-itemset is an itemset with exactly k items.

Key: A unique sequence of values that defines the location of a node in a tree data structure.

Patricia Trie: A compressed binary trie. The Patricia (Practical Algorithm to Retrieve Information Coded in Alphanumeric) trie is compressed by avoiding one-way branches. This is accomplished by including in each node the number of bits to skip over before making the next branching decision.

SOTrieIT: A dual-level trie whose nodes represent itemsets. The position of a node is ordered by the support count of the itemset it represents; the most frequent itemsets are found on the leftmost branches of the SOTrieIT.

Support Count of an Itemset: The number of transactions that contain a particular itemset.

Support Threshold: A threshold value that is used to decide if an itemset is interesting/frequent. It is defined by the user, and generally, an association rule mining algorithm has to be executed many times before this value can be well adjusted to yield the desired results.

Trie: An n-ary tree whose organization is based on key space decomposition. In key space decomposition, the key range is equally subdivided, and the splitting position within the key range for each node is predefined.
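The two-level, support-ordered counting scheme described in this article can be sketched compactly. The following Python sketch is illustrative only (the class and method names are invented, and it is not the authors' implementation): it keeps exact support counts for all 1-itemsets and 2-itemsets, can be updated online as transactions arrive, and answers frequent-itemset queries for any threshold without rescanning the database.

```python
from itertools import combinations

class SupportOrderedCounts:
    """Illustrative sketch of a SOTrieIT-style structure: exact support
    counts for all 1-itemsets and 2-itemsets, updated per transaction."""

    def __init__(self):
        self.c1 = {}  # item -> support count (first-level nodes)
        self.c2 = {}  # frozenset of two items -> count (second-level nodes)

    def insert(self, transaction):
        """Online update: extract the 1- and 2-itemsets of one transaction
        and increment their counts; no support threshold is needed."""
        items = sorted(set(transaction))
        for item in items:
            self.c1[item] = self.c1.get(item, 0) + 1
        for pair in combinations(items, 2):
            key = frozenset(pair)
            self.c2[key] = self.c2.get(key, 0) + 1

    def frequent(self, minsupport):
        """Frequent 1- and 2-itemsets for any threshold, found without a
        database scan and sorted by descending support count."""
        f1 = sorted(((i, s) for i, s in self.c1.items() if s >= minsupport),
                    key=lambda entry: -entry[1])
        f2 = sorted(((tuple(sorted(k)), s) for k, s in self.c2.items()
                     if s >= minsupport),
                    key=lambda entry: -entry[1])
        return f1, f2
```

For the transactions {A, B}, {A, B, C} and {B, C}, a threshold of 2 yields the frequent 1-itemsets B (3), A (2), C (2) and the frequent 2-itemsets {A, B} (2) and {B, C} (2).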
Ada Wai-Chee Fu
The Chinese University of Hong Kong, Hong Kong
1. Frequent Itemset Generation: Generate all sets of items that have support greater than or equal to a certain threshold, called minsupport.
2. Association Rule Generation: From the frequent itemsets, generate all association rules that have confidence greater than or equal to a certain threshold, called minconfidence.

Step 1 is much more difficult compared with Step 2. Thus, researchers have focused on the studies of frequent itemset generation. The Apriori Algorithm is a well-known approach, which was proposed by Agrawal & Srikant (1994), to find frequent itemsets.

1. Association Rules Based on the Type of Values of Attribute: Based on the type of values of attributes, there are two kinds: the Boolean association rule, which is presented above, and the quantitative association rule. A quantitative association rule describes the relationships among some quantitative attributes (e.g., income and age). An example is income(40K..50K) → age(40..45). One proposed method is grid-based, dividing each attribute into a fixed number of partitions [Association Rule Clustering System (ARCS) in Lent, Swami & Widom (1997)]. Srikant & Agrawal (1996) proposed to partition quantitative attributes dynamically and to
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Association Rule Mining and Application to MPIS
merge the partitions based on a measure of partial completeness. Another non-grid-based approach is found in Zhang, Padmanabhan, & Tuzhilin (2004).
2. Association Rules Based on the Dimensionality of Data: Association rules can be divided into single-dimensional association rules and multi-dimensional association rules. One example of a single-dimensional rule is buys({Orange, Knife}) → buys(Plate), which contains only the dimension buys. A multi-dimensional association rule is one containing attributes for more than one dimension, for example, income(40K..50K) → buys(Plate). One mining approach is to borrow the concept of the data cube from the field of data warehousing. Figure 1 shows a lattice for the data cube for the dimensions age, income and buys. Researchers (Kamber, Han, & Chiang, 1997) applied the data cube model and used aggregate techniques for mining.

Figure 1. A lattice showing the data cube for the dimensions age, income, and buys:
( )
(age)  (income)  (buys)
(age, income)  (age, buys)  (income, buys)
(age, income, buys)

3. Association Rules Based on the Level of Abstraction of Attribute: The rules discussed in previous sections can be viewed as single-level association rules. A rule that references different levels of abstraction of attributes is called a multilevel association rule. Suppose there are two rules, income(10K..20K) → buys(fruit) and income(10K..20K) → buys(orange). There are two different levels of abstraction in these two rules because fruit is a higher-level abstraction of orange. Han & Fu (1995) apply a top-down strategy to the concept hierarchy in the mining of frequent itemsets.

Figure 2. A concept hierarchy of the fruit:
fruit
apple  orange  banana

Other Extensions to Association Rule Mining

There are other extensions to association rule mining. Some of them (Bayardo, 1998) find maxpatterns (i.e., maximal frequent patterns) while others (Zaki & Hsiao, 2002) find frequent closed itemsets. A maxpattern is a frequent itemset that does not have a frequent superset. A frequent itemset X is a frequent closed itemset if there exists no itemset X′ such that (1) X ⊂ X′ and (2) for all transactions t, X is in t implies X′ is in t. These considerations can reduce the resulting number of frequent itemsets significantly.

Another variation of the frequent itemset problem is mining top-K frequent itemsets (Cheung & Fu, 2004). The problem is to find the K frequent itemsets with the greatest supports. It is often more reasonable to assume the parameter K, instead of the data-distribution-dependent parameter minsupport, because the user typically would not have knowledge of the data distribution before data mining.

The other variations of the problem are the incremental update of mining association rules (Hidber, 1999), constraint-based rule mining (Grahne & Lakshmanan, 2000), distributed and parallel association rule mining (Gilburd, Schuster, & Wolff, 2004), association rule mining with multiple minimum supports/without minimum support (Chiu, Wu, & Chen, 2004), association rule mining with weighted items and weighted support (Tao, Murtagh, & Farid, 2003), and fuzzy association rule mining (Kuok, Fu, & Wong, 1998).

Association rule mining has also been integrated with other data mining problems. There have been the integration of classification and association rule mining (Wang, Zhou, & He, 2000) and the integration of association rule mining with relational database systems (Sarawagi, Thomas, & Agrawal, 1998).

Application of the Concept of Association Rules to MPIS

Other than market basket analysis (Blischok, 1995), association rules can also help in applications such as intrusion detection (Lee, Stolfo, & Mok, 1999), heterogeneous genome data (Satou et al., 1997), mining remotely sensed images/data (Dong, Perrizo, Ding, & Zhou, 2000) and product assortment decisions (Wong, Fu, & Wang, 2003; Wong & Fu, 2004). Here we focus on the application to product assortment decisions, as it is one of very few examples where the association rules are not the end mining results.
Transaction databases in some applications can be very large. For example, Hedberg (1995) quoted that Wal-Mart kept about 20 million sales transactions per day. Such data requires sophisticated analysis. As pointed out by Blischok (1995), a major task of talented merchants is to pick the profit-generating items and discard the losing items. It may be simple enough to sort items by their profit and do the selection. However, this ignores a very important aspect in market analysis: the cross-selling effect. There can be items that do not generate much profit by themselves but are the catalysts for the sales of other profitable items. Recently, some researchers (Kleinberg, Papadimitriou, & Raghavan, 1998) suggested that concepts of association rules can be used in the item selection problem with the consideration of relationships among items.

One example of the product assortment decisions is Maximal-Profit Item Selection (MPIS) with cross-selling considerations (Wong, Fu, & Wang, 2003). Consider the major task of merchants to pick profit-generating items and discard the losing items. Assume we have a history record of the sales (transactions) of all items. The problem is to select a subset from the given set of items so that the estimated profit of the resulting selection is maximal among all choices.

Suppose a shop carries office equipment composed of monitors, keyboards and telephones, with profits of $1000K, $100K and $300K, respectively. If now the shop decides to remove one of the three items from its stock, the question is which two we should choose to keep. If we simply examine the profits, we may choose to keep monitors and telephones, so that the total profit is $1300K. However, we know that there is a strong cross-selling effect between monitor and keyboard (see Table 1). If the shop stops carrying keyboards, the customers of monitors may choose to shop elsewhere to get both items. The profit from monitors may drop greatly, and we may be left with the profit of $300K from telephones only. If we choose to keep both monitors and keyboards, then the profit can be expected to be $1100K, which is higher.

Table 1.

Monitor  Keyboard  Telephone
   1        1         0
   1        1         0
   0        0         1
   0        0         1
   0        0         1
   1        1         1

MPIS will give us the desired solution. MPIS utilizes the concept of the relationship between selected items and unselected items. Such a relationship is modeled by the cross-selling factor. Suppose d is the set of unselected items and I is a selected item. A loss rule is proposed in the form I → ◊d, where ◊d means the purchase of any item in d. The rule indicates that, from the history, whenever a customer buys the item I, he/she also buys at least one of the items in d. Interpreting this as a pattern of customer behavior, and assuming that the pattern will not change even when some items are removed from the stock, if none of the items in d are available then the customer also will not purchase I. This is because if the customer still purchases I, without purchasing any items in d, then the pattern would be changed. Therefore, the higher the confidence of I → ◊d, the more likely the profit of I should not be counted. This is the reasoning behind the above definition. In the above example, suppose we choose monitor and telephone. Then, d = {keyboard}. All profits of monitor will be lost if, in the history, we find conf(I → ◊d) = 1, where I = monitor. This example illustrates the importance of the consideration of the cross-selling factor in the profit estimation.

Wong, Fu, & Wang (2003) propose two algorithms to deal with this problem. In the first algorithm, they approximate the total profit of the item selection in quadratic form and solve a quadratic optimization problem. The second one is a greedy approach called MPIS_Alg, which prunes items iteratively according to an estimation function based on the formula of the total profit of the item selection until J items remain.

Another product assortment decision problem is studied by Wong & Fu (2004), which addresses the problem of selecting a set of marketing items in order to boost the sales of the store.

FUTURE TRENDS

A new area for investigation of the problem of mining frequent itemsets is mining data streams for frequent itemsets (Manku & Motwani, 2002; Yu, Chong, Lu, & Zhou, 2004). In such a problem, the data is so massive that it cannot all be stored in the memory of a computer and cannot be processed by traditional algorithms. The objective of the proposed algorithms is to store as little data as possible and to minimize the error generated by the estimation in the model.

Privacy preservation in association rule mining has also been rigorously studied in recent years (Vaidya & Clifton, 2002; Agrawal, Evfimievski, & Srikant, 2003). The problem is to mine from two or more different sources without exposing individual transaction data to each other.

CONCLUSION

Association rule mining plays an important role in the literature of data mining. It poses many challenging
issues for the development of efficient and effective methods. After taking a closer look, we find that the application of association rules requires much more investigation in order to aid more specific targets. We may see a trend towards the study of applications of association rules.

REFERENCES

Agrawal, R., Evfimievski, A., & Srikant, R. (2003). Information sharing across private databases. SIGMOD, 86-97.

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. SIGMOD, 129-140.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference (pp. 487-499).

Bayardo, R. J. (1998). Efficiently mining long patterns from databases. SIGMOD, 85-93.

Blischok, T. (1995). Every transaction tells a story. Chain Store Age Executive with Shopping Center Age, 71(3), 50-57.

Cheung, Y. L., & Fu, A. W.-C. (2004). Mining association rules without support threshold: With and without item constraints. TKDE, 16(9), 1052-1069.

Chiu, D.-Y., Wu, Y.-H., & Chen, A. L. P. (2004). An efficient algorithm for mining frequent sequences by a new strategy without support counting. ICDE, 375-386.

Dong, J., Perrizo, W., Ding, Q., & Zhou, J. (2000). The application of association rule mining to remotely sensed data. In Proceedings of the 2000 ACM Symposium on Applied Computing (pp. 340-345).

Gilburd, B., Schuster, A., & Wolff, R. (2004). A new privacy model and association-rule mining algorithm for large-scale distributed environments. SIGKDD.

Grahne, G., Lakshmanan, L., & Wang, X. (2000). Efficient mining of constrained correlated sets. ICDE, 512-521.

Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. In Proceedings of the 1995 International Conference on VLDB (pp. 420-431).

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. San Mateo, CA: Morgan Kaufmann Publishers.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. SIGMOD, 1-12.

Hedberg, S. (1995, October). The data gold rush. BYTE, 83-99.

Hidber, C. (1999). Online association rule mining. SIGMOD, 145-156.

Kamber, M., Han, J., & Chiang, J. Y. (1997). Metarule-guided mining of multi-dimensional association rules using data cubes. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (pp. 207-210).

Kleinberg, J., Papadimitriou, C., & Raghavan, P. (1998). A microeconomic view of data mining. Knowledge Discovery Journal, 2(4), 311-324.

Kuok, C. M., Fu, A. W. C., & Wong, M. H. (1998). Mining fuzzy association rules in databases. ACM SIGMOD Record, 27(1), 41-46.

Lee, W., Stolfo, S. J., & Mok, K. W. (1999). A data mining framework for building intrusion detection models. In IEEE Symposium on Security and Privacy (pp. 120-132).

Lent, B., Swami, A. N., & Widom, J. (1997). Clustering association rules. In ICDE (pp. 220-231).

Manku, G. S., & Motwani, R. (2002). Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on VLDB (pp. 346-357).

Sarawagi, S., Thomas, S., & Agrawal, R. (1998). Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD, 343-354.

Satou, K., Shibayama, G., Ono, T., Yamamura, Y., Furuichi, E., Kuhara, S., & Takagi, T. (1997). Finding association rules on heterogeneous genome data. In Pacific Symposium on Biocomputing (PSB) (pp. 397-408).

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. SIGMOD, 1-12.

Tao, F., Murtagh, F., & Farid, M. (2003). Weighted association rule mining using weighted support and significance framework. In The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 661-666).

Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 639-644).

Wang, K., Zhou, S., & He, Y. (2000). Growing decision trees on support-less association rules. In Sixth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 265-269).
Wong, R. C.-W., & Fu, A. W.-C. (2004). ISM: Item selection for marketing with cross-selling considerations. In Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference (PAKDD) (pp. 431-440), Lecture Notes in Computer Science 3056. Berlin: Springer.

Wong, R. C.-W., Fu, A. W.-C., & Wang, K. (2003). MPIS: Maximal-profit item selection with cross-selling considerations. In IEEE International Conference on Data Mining (ICDM) (pp. 371-378).

Yu, J. X., Chong, Z., Lu, H., & Zhou, A. (2004). False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases.

Zaki, M. J., & Hsiao, C. J. (2002). CHARM: An efficient algorithm for closed itemset mining. In SIAM International Conference on Data Mining (SDM).

Zhang, H., Padmanabhan, B., & Tuzhilin, A. (2004). On the discovery of significant statistical quantitative rules. In Proceedings of the 10th ACM SIGKDD Knowledge Discovery and Data Mining Conference.

KEY TERMS

Association Rule: A kind of rule in the form X → Ij, where X is a set of some items and Ij is a single item not in X.

Confidence: The confidence of a rule X → Ij, where X is a set of items and Ij is a single item not in X, is the fraction of the transactions containing all items in set X that also contain item Ij.

Frequent Itemset/Pattern: An itemset with support greater than or equal to a certain threshold, called minsupport.

Infrequent Itemset: An itemset with support smaller than a certain threshold, called minsupport.

Itemset: A set of items.

K-Itemset: An itemset with k items.

Maximal-Profit Item Selection (MPIS): The problem of item selection, which selects a set of items in order to maximize the total profit with the consideration of the cross-selling effect.

Support (Itemset) or Frequency: The support of an itemset X is the fraction of transactions containing all items in X.

Support (Rule): The support of a rule X → Ij, where X is a set of items and Ij is a single item not in X, is the fraction of all transactions that contain all items in set X as well as item Ij.

Transaction: A record containing the items bought by a customer.
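The monitor/keyboard/telephone example can be replayed in code. The Python sketch below is illustrative only and does not reproduce the exact MPIS profit formula of Wong, Fu, & Wang (2003): it computes the loss-rule confidence conf(I → ◊d) from the six transactions of Table 1 and uses a simplified estimate that discounts each selected item's profit by the confidence of its loss rule (all function names are invented).

```python
from itertools import combinations

# The six transactions of Table 1 (1 = item bought)
ITEMS = ["monitor", "keyboard", "telephone"]
ROWS = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1), (0, 0, 1), (1, 1, 1)]
TRANSACTIONS = [{i for i, v in zip(ITEMS, row) if v} for row in ROWS]
PROFIT = {"monitor": 1000, "keyboard": 100, "telephone": 300}  # in $K

def loss_rule_confidence(item, d, transactions):
    """conf(I -> <>d): fraction of transactions containing I that also
    contain at least one of the unselected items in d."""
    with_item = [t for t in transactions if item in t]
    if not with_item:
        return 0.0
    return sum(1 for t in with_item if t & d) / len(with_item)

def estimated_profit(selection, transactions):
    """Simplified estimate: each selected item's profit is discounted by
    the confidence of its loss rule (not the exact MPIS formula)."""
    d = set(ITEMS) - set(selection)
    return sum(PROFIT[i] * (1 - loss_rule_confidence(i, d, transactions))
               for i in selection)

# Choose which two of the three items to keep
best = max(combinations(ITEMS, 2),
           key=lambda s: estimated_profit(s, TRANSACTIONS))
```

Because every transaction containing a monitor also contains a keyboard, conf(monitor → ◊{keyboard}) = 1, so dropping keyboards forfeits the monitor profit; the estimate therefore ranks {monitor, keyboard} above {monitor, telephone}, matching the discussion above.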
Christopher Besemann
North Dakota State University, USA
BACKGROUND

Several areas of databases and data mining contribute to advances in association rule mining of relational data:

Relational Data Model: underlies most commercial database technology and also provides a strong mathematical framework for the manipulation of complex data. Relational algebra provides a natural starting point for generalizations of data mining techniques to complex data types.

Inductive Logic Programming, ILP (Džeroski & Lavrač, 2001): a form of logic programming, in which individual instances are generalized to make hypotheses about unseen data. Background knowledge is incorporated directly.

Association Rule Mining, ARM (Agrawal, Imielinski, & Swami, 1993): identifies associations and correlations in large databases. Association rules are defined based on items, such as objects in a shopping cart. Efficient algorithms are designed by limiting output to sets of items that occur more frequently than a given threshold.

Graph Theory: addresses networks that consist of nodes, which are connected by edges. Traditional graph theoretic problems typically assume no more

General Concept

Two main challenges have to be addressed when applying association rule mining to relational data. Combined mining of multiple tables leads to a search space that is typically large even for moderately sized tables. Performance is, thereby, commonly an important issue in relational data mining algorithms. A less obvious problem lies in the skewing of results (Jensen & Neville, 2002). The relational join operation combines each record from one table with each occurrence of the corresponding record in a second table. That means that the information in one record is represented multiple times in the joined table. Data mining algorithms that operate either explicitly or implicitly on joined tables, thereby, use the same information multiple times. Note that this problem also applies to algorithms in which tables are joined on-the-fly by identifying corresponding records as they are needed. Further specific issues may have to be addressed when reflexive relationships are present. These issues will be discussed in the section on relations that represent a graph.

A variety of techniques have been developed for data mining of relational data (Džeroski & Lavrač, 2001). A typical approach is called inductive logic programming, ILP. In this approach relational structure is represented in the form of Prolog queries, leaving maximum flexibility to the user. While the notation of ILP differs from the
Association Rule Mining of Relational Data
relational notation, it can be noted that all relational operators can also be represented in ILP. The approach does thereby not limit the types of problems that can be addressed. It should, however, also be noted that while relational database management systems are developed with performance in mind, there may be a trade-off between the generality of Prolog-based environments and their limitations in speed.

Application of ARM within the ILP setting corresponds to a search for frequent Prolog queries as a generalization of traditional association rules (Dehaspe & De Raedt, 1997). Examples of association rule mining of relational data using ILP (Dehaspe & Toivonen, 2001) could be shopping behavior of customers where relationships between customers are included in the reasoning. While ILP does not use a relational joining step as such, it does also associate individual objects with multiple occurrences of corresponding objects. Problems with skewing are, thereby, also encountered in this approach.

An alternative to the ILP approach is to apply the standard definition of association rule mining to relations that are joined using the relational join operation. While such an approach is less general, it is often more efficient since the join operation is highly optimized in standard database systems. It is important to note that a join operation typically changes the support of an item set, and any support calculation should therefore be based on the relation that uses the smallest number of join operations (Cristofor & Simovici, 2001). Equivalent changes in item set weighting occur in ILP.

Interestingness of rules is an important issue in any type of association rule mining. In traditional association rule mining the problem of rule interest has been addressed in a variety of work on redundant rules, including closed set generation (Zaki, 2000). Additional rule metrics such as lift and conviction have been defined (Brin, Motwani, Ullman, & Tsur, 1997). In relational association rule mining the problem has been approached by the definition of a deviation measure (Dehaspe & Toivonen, 2001). In general it can be noted that relational data mining

A typical example of an association rule mining problem is mining of annotation data of proteins in the presence of a protein-protein interaction graph (Oyama, Kitano, Satou, & Ito, 2002). Associations are extracted that relate functions and localizations of one protein with those of interacting proteins. Oyama et al. use association rule mining, as applied to joined relations, for this work. Another example could be association rule mining of attributes associated with scientific publications on the graph of their mutual citations.

A problem of the straightforward approach of mining joined tables directly becomes obvious upon further study of the rules: in most cases the output is dominated by rules that involve the same item as it occurs in different entity instances that participate in a relationship. In the example of protein annotations within the protein interaction graph, a protein in the nucleus is found to frequently interact with another protein that is also located in the nucleus. Similarities among relational neighbors have been observed more generally for relational databases (Macskassy & Provost, 2003). It can be shown that filtering of output is not a consistent solution to this problem, and items that are repeated for multiple nodes should be eliminated in a preprocessing step (Besemann & Denton, 2004). This is an example of a problem that does not occur in association rule mining of a single table and requires special attention when moving to multiple relations. The example also highlights the need to discuss differences between sets of items of related objects (Besemann, Denton, Yekkirala, Hutchison, & Anderson, 2004).

Related Research Areas

A related research area is graph-based ARM (Inokuchi, Washio, & Motoda, 2000; Yan & Han, 2002). Graph-based ARM does not typically consider more than one label on each node or edge. The goal of graph-based ARM is to find frequent substructures based on that one label, focusing on algorithms that scale to large subgraphs. In relational ARM multiple items are associated with each
poses many additional problems related to skewing of node and the main problem is to achieve scaling with
data compared with traditional mining on a single table respect to the number of items per node. Scaling to large
(Jensen & Neville, 2002). subgraphs is usually irrelevant due to the small world
property of many types of graphs. For most networks of
Relations that Represent a Graph practical interest any node can be reached from almost
any other by means of no more than some small number
One type of relational data set has traditionally received of edges (Barabasi & Bonabeau, 2003). Association rules
particular attention, albeit under a different name. A that involve longer distances are therefore unlikely to
relation representing a relationship between entity in- produce meaningful results.
stances of the same type, also called a reflexive relation- There are other areas of research on ARM in which
ship, can be viewed as the definition of a graph. Graphs related transactions are mined in some combined fashion.
have been used to represent social networks, biological Sequential pattern or episode mining (Agrawal & Srikant,
networks, communication networks, and citation graphs, 1995; Yan, Han, & Afshar, 2003) and inter-transaction
just to name a few. mining (Tung, Lu, Han, & Feng, 1999) are two main
71
TEAM LinG
Association Rule Mining of Relational Data
categories. Generally the interest in association rule min- Besemann, C., & Denton, A. (2004, June). UNIC: UNique
ing is moving beyond the single-table setting to incorpo- item counts for association rule mining in relational
rate the complex requirements of real-world data. data. Technical Report, North Dakota State University,
Fargo, North Dakota.
Besemann, C., Denton, A., Yekkirala, A., Hutchison, R.,
FUTURE TRENDS & Anderson, M. (2004, Aug.). Differential association
rule mining for the study of protein-protein interaction
The consensus in the data mining community of the impor- networks. In Proceedings ACM SIGKDD Workshop on
tance of relational data mining was recently paraphrased Data Mining in Bioinformatics, Seattle, WA.
by Dietterich (2003) as I.i.d. learning is dead. Long live
relational learning. The statistics, machine learning, and Brin, S., Motwani, R., Ullman, J.D., & Tsur, S. (1997).
ultimately data mining communities have invested de- Dynamic itemset counting and implication rules for mar-
cades into sound theories based on a single table. It is now ket basket data. In Proceedings of the ACM SIGMOD
time to afford as much rigor to relational data. When taking International Conference on Management of Data,
this step it is important to not only specify generalizations Tucson, AZ.
of existing algorithms but to also identify novel questions
that may be asked that are specific to the relational setting. Cristofor, L., & Simovici, D. (2001). Mining association
It is, furthermore, important to identify challenges that rules in entity-relationship modeled databases. Tech-
only occur in the relational setting, including skewing due nical Report, University of Massachusetts Boston.
to the application of the relational join operator, and Dehaspe, L., & De Raedt, L. (1997, Dec.). Mining associa-
correlations that are frequent in relational neighbors. tion rules in multiple relations. In Proceedings of the 7th
International Workshop on Inductive Logic Program-
ming (pp. 125-132), Prague, Czech Republic.
CONCLUSION
Dehaspe, L., & Toivonen, H. (2001). Discovery of rela-
Association rule mining of relational data is a powerful tional association rules. In S. D eroski, & N. Lavra
frequent pattern mining technique that is useful for several (Eds.), Relational data mining. Berlin: Springer.
data structures including graphs. Two main approaches Dietterich, T. (2003, Nov.). Sequential supervised learn-
are distinguished. Inductive logic programming provides ing: Methods for sequence labeling and segmentation.
a high degree of flexibility, while mining of joined relations Invited Talk, 3rd IEEE International Conference on
is a fast technique that allows the study problems related Data Mining, Melbourne, FL, USA.
to skewed or uninteresting results. The potential compu-
tational complexity of relational algorithms and specific D eroski, S., & Lavra , N. (2001). Relational data min-
properties of relational data make its mining an important ing. Berlin: Springer.
current research topic. Association rule mining takes a Inokuchi, A., Washio, T., & Motoda, H. (2000) An apriori-
special role in this process, being one of the most impor- based algorithm for mining frequent substructures from
tant frequent pattern algorithms. graph data. In Proceedings of the 4th European Confer-
ence on Principles of Data Mining and Knowledge
Discovery (pp. 13-23), Lyon, France.
REFERENCES
Jensen, D., & Neville, J. (2002). Linkage and
Agrawal, R., Imielinski, T., & Swami, A.N. (1993, May). autocorrelation cause feature selection bias in relational
Mining association rules between sets of items in large learning. In Proceedings of the 19th International Con-
databases. In Proceedings of the ACM International ference on Machine Learning (pp. 259-266), Sydney,
Conference on Management of Data (pp. 207-216), Wash- Australia.
ington, D.C. Macskassy, S., & Provost, F. (2003). A simple relational
Agrawal, R., & Srikant, R. (1995). Mining sequential pat- classifier. In Proceedings of the 2nd Workshop on Multi-
terns. In Proceedings of the 11 th International Conference Relational Data Mining at KDD03, Washington, D.C.
on Data Engineering (pp. 3-14), IEEE Computer Society Oyama, T., Kitano, K., Satou, K., & Ito, T. (2002). Extrac-
Press, Taipei, Taiwan. tion of knowledge on protein-protein interaction by as-
Barabasi, A.L., & Bonabeau, E. (2003). Scale-free net- sociation rule discovery. Bioinformatics, 18(8), 705-714.
works. Scientific American, 288(5), 60-69.
72
TEAM LinG
Association Rule Mining of Relational Data
Tung, A.K.H., Lu, H., Han, J., & Feng, L. (1999). Breaking Confidence: The confidence of a rule is the support
the barrier of transactions: Mining inter-transaction asso- of the item set consisting of all items in the rule (A ) A
ciation rules. In Proceedings of the International Confer- divided by the support of the antecedent.
ence on Knowledge Discovery and Data Mining, San
Diego, CA. Entity-Relationship Model (E-R-Model): A model to
represent real-world requirements through entities, their
Yan, X., & Han, J. (2002). gSpan: Graph-based substruc- attributes, and a variety of relationships between them. E-
ture pattern mining. In Proceedings of the International R-Models can be mapped automatically to the relational
Conference on Data Mining, Maebashi City, Japan. model.
Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining Inductive Logic Programming (ILP): Research area
closed sequential patterns in large datasets. In Proceed- at the interface of machine learning and logic program-
ings of the 2003 SIAM International Conference on Data ming. Predicate descriptions are derived from examples
Mining, San Francisco, CA. and background knowledge. All examples, background
knowledge and final descriptions are represented as logic
Zaki, M.J. (2000). Generating non-redundant association programs.
rules. In Proceedings of the International Conference on
Knowledge Discovery and Data Mining (pp. 34-43), Boston, Redundant Association Rule: An association rule is
MA. redundant if it can be explained based entirely on one or
more other rules.
73
TEAM LinG
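The confidence definition above, together with the earlier observation that a relational join typically changes the support of an item set, can be illustrated with a minimal sketch. The item names and row contents are hypothetical:

```python
# Hypothetical single-table transactions: each row is a set of items.
rows = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in the itemset."""
    return sum(1 for r in rows if itemset <= r) / len(rows)

def confidence(antecedent, consequent, rows):
    """Support of all items in the rule divided by support of the antecedent."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)

print(support({"bread", "butter"}, rows))       # 0.5
print(confidence({"bread"}, {"butter"}, rows))  # 2/3, about 0.67

# A join can replicate rows (one copy per matching partner row),
# which changes the support of the very same item set:
joined = rows + [rows[0], rows[1]]
print(support({"bread", "butter"}, joined))     # 4/6, no longer 0.5
```

This is the effect noted above: support calculations should be based on the relation built with the smallest number of join operations.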
Jean-Baptiste Maj
LORIA/INRIA, France
Tarek Ziadé
NUXEO, France
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Association Rules and Statistics
Figure 1. Coding and analysis methods (panels a, b, and c; variable types: quantitative, ordinal, qualitative, and yes/no)

variable has a particular effect on the link between Y and S, called interaction (Winer, Brown & Michels, 1991). The association rules for this model are:
variables. In statistics, the decision is easy to make from test results, unlike association rules, where a difficult choice among several indices and thresholds has to be made. At the level of knowledge, the statistical results need more interpretation than the taxonomy and the association rules do. Finally, graphs of the regression equations (Hayduk, 1987), taxonomy (Foucart, 1997), and association rules (Gras & Bailleul, 2001) are depicted in Figure 2.

FUTURE TRENDS

With association rules, some researchers try to find the right indices and thresholds with stochastic methods; more development needs to be done in this area. Another sensitive problem is that a set of association rules is not made for deductive reasoning. One of the most common solutions is pruning, to suppress redundancies, contradictions, and losses of transitivity. Pruning is a new method and needs to be developed further.

Govaert, G. (2003). Analyse de données. Lavoisier, France: Hermès-Science.

Gras, R., & Bailleul, M. (2001). La fouille dans les données par la méthode d'analyse statistique implicative. Colloque de Caen. École polytechnique de l'Université de Nantes, Nantes, France.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection: Special issue on variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Hayduk, L.A. (1987). Structural equation modelling with LISREL. Maryland: Johns Hopkins Press.

Jensen, D. (1992). Induction with randomization testing: Decision-oriented analysis of large data sets [doctoral thesis]. Washington University, Saint Louis, MO.
Zhu, H. (1998). On-line analytical mining of association rules [doctoral thesis]. Simon Fraser University, Burnaby, Canada.

KEY TERMS

Attribute-Oriented Induction: Association rules, classification rules, and characterization rules are written with attributes (i.e., variables). These rules are obtained from data by induction and not from theory by deduction.

Badly Structured Data: Data, like the texts of a corpus or log sessions, that often do not contain explicit variables. To extract association rules, it is necessary to create variables (e.g., keywords) after defining their values (frequency of appearance in the corpus texts, or simply appearance/non-appearance).

Interaction: Two variables, A and B, are in interaction if their actions are not separate.

Linear Model: A variable is fitted by a linear combination of other variables and interactions between them.

Pruning: The algorithms for extracting association rules are optimized for computational cost but not for other constraints. This is why a suppression has to be performed on results that do not satisfy special constraints.

Structural Equations: A system of several regression equations with numerous possibilities. For instance, the same variable can appear in different equations, and a latent variable (not defined in the data) can be accepted.

Taxonomy: Belongs to clustering methods and is usually represented by a tree. Often used in life categorization.

Tests of Regression Model: Regression models and analysis-of-variance models have numerous hypotheses, e.g., normal distribution of errors. These constraints allow one to determine whether a coefficient of a regression equation can be considered null at a fixed level of significance.
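The Linear Model and Interaction key terms above can be made concrete with a small least-squares sketch: y is fitted by a linear combination of A, B, and their interaction term A·B. The data and true coefficients below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Hypothetical data generated with a genuine interaction between A and B.
y = 1.0 + 2.0 * a - 0.5 * b + 1.5 * a * b + rng.normal(scale=0.1, size=200)

# Design matrix: intercept, main effects, and the interaction column A*B.
X = np.column_stack([np.ones_like(a), a, b, a * b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(2))  # close to [1.0, 2.0, -0.5, 1.5]
```

Dropping the A*B column from the design matrix would force the fit to ignore the interaction, which is exactly the situation the Interaction definition warns about.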
Automated Anomaly Detection
possible, the program should flag values that are to be verified. This may not always be possible, or it may be too expensive. Not all situations repeat within a reasonable time, if at all (e.g., the observation of Halley's comet).

There are two schools of thought, the first being to substitute the mean value for the missing or wrong value. The problem with this is that it might not be a reasonable value, and it can create a new rule, one that could be false (i.e., the shoe size for a giant is not average). It might introduce sample bias as well (Berry & Linoff, 2000).

Deleting the observation is the other common solution. Quite often, in large datasets, a duplicate exists, so deleting causes no loss. The cost of improper commission is greater than that of omission. Sometimes an outlier tells a story. So, one has to be careful about deletions.

THE AUTOMATED ANOMALY DETECTION PROCESS

Methodology

To illustrate the process, a public dataset is used. This particular one is available from the University of California at Irvine Machine Learning Repository (University of California, 2003). Known as the Abalone dataset, it consists of 4,400 observations of abalones that were captured in the wild, with several measurements of each one. Natural variation exists, as well as human error, both in making the measurements and in the recording. Also listed on the Web site were some studies that used the data and their results. Accuracy in the form of hit rate varied between 0-35%.

While it may seem overly simple and obvious, plotting the data is the first step. These graphical views can provide much insight into the data (Webb, 2002). The data for each variable can be plotted vs. frequency of occurrence to visually determine the distribution. Combining this with knowledge of the research will help to determine the correct distribution to use for each included variable. A sum of independent terms would tend to support a Gauss normal distribution, while the product of a number of independent terms might suggest using log normal. This plotting also might suggest necessary transformations.

It is necessary to understand the acceptable range for each field. Some values obtained might not be reasonable. If there is a zero in a field, is it indicative of a missing value, or is it an acceptable value? No value is not the same as zero. Some values, while within bounds, might not be possible. It is also necessary to check for obvious mistakes, inconsistencies, or out-of-bounds values.

Knowledge about the subject of study is necessary. From this, rules can be made. In the example of the abalone, the animal in the shell must weigh more than when it is shucked (removed from the shell), for obvious reasons. Other such rules from domain knowledge can be created (abalone.net, 2004; University of Capetown, 2004; World Aquaculture, 2004). Sometimes, they may seem too obvious, but they are effective. The rules can be programmed into a subroutine specific to the dataset.

Regression can be used to check for variables that are not statistically significant. Step-wise regression is a handy tool for identifying significant variables. Other ratio variables can be created and then checked for significance using regression. Again, domain knowledge can help create these variables, as well as insight and some luck. Insignificant variables can be deleted from the dataset, and new ones can be added.

If the dataset is real valued, it is possible that records exist that are within tolerance or measurement error of each other. There are two ways to reduce the number of unique observations. (1) Attenuate the accuracy by rounding to reduce the number of significant digits. Rounding each variable to one less significant digit reduces the number of possible patterns by an order of magnitude. (2) Calculate a mean and standard deviation for the cleaned dataset. Using an appropriate distribution, sort the values by standard deviations from the mean. Testing to see whether the chosen distribution is correct is accomplished by using a chi-square test, a Kolmogorov-Smirnov test, or the empirical test. The number of standard deviations replaces the real-valued data, and a simple categorical dataset will exist. This allows for simple comparisons between observations. Otherwise, records with values differing by as little as .0001% would be considered unique and different. While some of the precision of the original data is lost, this process is exploratory and finds the general patterns that are in the data. This allows one to gain insight into the database using a combination of statistics and artificial intelligence (Pazzani, 2000), using human knowledge and skill as the catalyst to improve the results.

The final step before mining the data is to remove duplicates, as they add no additional information. As the collection of observations gets increasingly larger, it gets harder to introduce new experiences. This process can be incorporated into the computer program by a simple procedure that is similar to bubblesort. Instead of comparing to see which row is greater, it just looks for differences. If none are found, then the row is deleted.
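The domain-knowledge rules just described can be programmed as a small flagging subroutine. For the abalone example, the whole animal must weigh more than the shucked meat; the record layout and values below are hypothetical:

```python
# Hypothetical record layout: (whole_weight, shucked_weight, rings).
records = [
    (0.514, 0.2245, 9),
    (0.226, 0.0995, 7),
    (0.120, 0.1510, 5),  # shucked weight exceeds whole weight: impossible
]

def domain_flags(record):
    """Return the list of domain-rule violations for one record."""
    whole, shucked, rings = record
    flags = []
    if shucked >= whole:
        flags.append("shucked weight not less than whole weight")
    if rings <= 0:
        flags.append("non-positive ring count")
    return flags

# Flagged records can then be verified, corrected, or deleted.
suspect = [r for r in records if domain_flags(r)]
print(suspect)  # [(0.12, 0.151, 5)]
```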
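The second reduction route described in the methodology, replacing each value by its integer number of standard deviations from the mean, and the subsequent duplicate scan can be sketched as follows; the measurement values are hypothetical:

```python
from statistics import mean, stdev

# Hypothetical real-valued records with two measurement columns.
records = [(0.45, 1.02), (0.47, 1.00), (0.95, 1.55), (0.45, 1.01), (0.10, 0.52)]

def zcat(column):
    """Map each value to its signed integer number of std devs from the mean."""
    m, s = mean(column), stdev(column)
    return [round((v - m) / s) for v in column]

cols = list(zip(*records))                        # column-wise view
cat_records = list(zip(*(zcat(c) for c in cols))) # categorical dataset
print(cat_records)

# Duplicate removal: near-identical records collapse to one pattern,
# and only the first occurrence of each pattern is kept.
seen, unique = set(), []
for r in cat_records:
    if r not in seen:
        seen.add(r)
        unique.append(r)
print(unique)
```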
Example Results

A few variables were plotted, producing some very unusual graphs. These were definitely not the graphs that were expected. This was the first indication that the dataset was noisy. Abalones are born in very large numbers, but with an extremely high infant mortality rate (over 99%) (Bamfield Marine Science Centre, 2004). This graph did not reflect that.

An initial scan of the data showed some inconsistent points, like a five-year-old infant, a shucked animal weighing more than a complete one, and other similar abnormalities. Another problem with most analyses of these datasets is that gender is not ratio or ordinal data and, therefore, had to be converted to a dummy variable.

Step-wise regression removed all but five variables. The remaining variables were: diameter, height, whole weight, shucked weight, and viscera weight. Two new variables were created: shell ratio (whole weight divided by shell weight) and weight-to-diameter ratio. Since the diameter is directly proportional to volume, this variable is proportional to density. The proof of its significance was a t value of 39 and an F value of 1561. These are both statistically significant. A plot of shell ratio vs. frequency yielded a fairly Gauss normal looking curve.

As these are real-valued data with four digits given, it is possible to have observations that vary by as little as 0.01%. This value is even less than the accuracy of the measuring instruments. In other words, there are really a relatively small number of possibilities, described by a large number of almost identical examples, some within measurement tolerance of each other.

The mean and standard deviation were calculated for each of the remaining and new variables of the dataset. The empirical test was done to verify approximate meeting of the Gauss normal distribution. Each value then was replaced by the integer number of standard deviations it is from the mean, creating a categorical dataset. Simple visual inspection showed two things: (1) there was, indeed, correlation among the observations; and (2) it became increasingly more difficult to introduce a new pattern.

The duplicate removal process was the next step. As expected, the first 50 observations only had 22% duplicates, but by the time the entire dataset was processed, 65% of the records were removed, because they presented no new information.

To better understand the quality of the data, least squares regression was performed. The model produced an ANOVA F value of 22.4, showing good confidence in it. But the Pearsonian correlation coefficient R² of only 0.25 indicated that there was some problem. Visual observation of the dataset and its plots led to some suspicion of the group with one ring (age = 2.5 years). OLS regression was performed on this group, yielding an F of 27, but an R² of only 0.03. This tells us that this portion of the data is only muddying the water and attenuating the performance of our model.

Upon removal of this group of observations, OLS regression was performed on the remaining data, giving an improved F of 639 (showing that, indeed, it is a good model) and an R² of 0.53, an acceptable level and one that can adequately describe the variation in the criterion.

The results listed at the Web site where the dataset was obtained are as follows: Sam Waugh in the Computer Science Department at the University of Tasmania used this dataset in 1995 for his doctoral dissertation (University of California, 2003). His results, while the first recorded attempt, did not have good accuracy at predicting the age. The problem was encoded as a classification task.

24.86% Cascade Correlation (no hidden nodes)
26.25% Cascade Correlation (five hidden nodes)
21.5% C4.5
0.0% Linear Discriminant Analysis
3.57% k=5 Nearest Neighbor

Clark, et al. (1996) did further work on this dataset. They split the ring classification into three groups: 1 to 8, 9 to 10, and 11 and up. This reduced the number of targets and made each one bigger, in effect making each easier to hit. Their results were much better, as shown in the following:

64% Back propagation
55% Dystal

The results obtained from the answer tree using the new cleaned dataset are shown in Table 1. All of the one-ring observations were filtered out in a previous step, and the extraction was 100% accurate in not predicting any as being one-ring. The hit rates are as follows:
Data mining is an exploratory process to see what is in the data and what patterns can be found. Noise and errors in the dataset are reflected in the results from the mining process. Cleaning the data and identifying anomalies should be performed. Marked observations should be verified and corrected, if possible. If this cannot be done, they should be deleted. In real-valued datasets, the values can be categorized with accepted statistical techniques. Anomaly detection, after some manual viewing and analysis, can be automated. Part of the process is specific to the knowledge domain of the dataset, and part could be standardized. In our example problem, this cleaning process improved results, and the mining produced a more accurate rule set.

REFERENCES

Abalone.net. (2003). All about abalone: An online guide. Retrieved from http://www.abalone.net

Bamfield Marine Sciences Centre Public Education Programme. (2004). Oceanlink. Retrieved from http://oceanlink.island.net/oinfo/Abalone/abalone.html

Berry, M.J.A., & Linoff, G.S. (2000). Mastering data mining: The art and science of customer relationship management. New York, NY: Wiley & Sons.

Bloom, D. (1998). Technology, experimentation, and the quality of survey data. Science, 280(5365), 847-848.

Clark, D., Schreter, Z., & Adams, A. (1996). A quantitative comparison of dystal and backpropagation. Proceedings

University of Capetown Zoology Department. (2004). http://web.uct.ac.za/depts/zoology/abnet

Webb, A. (2002). Statistical pattern recognition. West Sussex, England: Wiley & Sons.

Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York: Wiley & Sons.

World Aquaculture. (2004). http://www7.taosnet.com/platinum/data/light/species/abalone.html

KEY TERMS

ANOVA or Analysis of Variance: A powerful statistical method for studying the relationship between a response or criterion variable and a set of one or more predictor or independent variables.

Correlation: Amount of relationship between two variables; how they change relative to each other. Range: -1 to +1.

F Value: Fisher value, a statistical distribution, used here to indicate the probability that an ANOVA model is good. In the ANOVA calculations, it is the ratio of squared variances. A large number translates to confidence in the model.
Ordinal Data: Data that are in order but have no relationship between the values or to an external value.

Pearsonian Correlation Coefficient: Defines how much of the variation in the criterion variable(s) is explained by the model. Range: 0 to 1.

Ratio Data: Data that are in order and have fixed spacing; a relationship between the points that is relative to a fixed external point.

Step-Wise Regression: An automated procedure in statistical programs that adds one predictor variable at a time and, if it is not statistically significant, removes it from the model. Some work in both directions, either adding to or removing from the model, one variable at a time.

T Value, also called Student's t: A statistical distribution for smaller sample sizes. In regression routines in statistical programs, it indicates whether a predictor variable is statistically significant, that is, whether it truly is contributing to the model. A value of more than about 3 is required for this indication.
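The F value, Pearsonian correlation coefficient, and t value defined above can be computed directly for a toy one-predictor regression; the data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=1.0, size=100)  # hypothetical strong linear link

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# R^2: share of the criterion variation explained by the model.
ss_res = float(resid @ resid)
ss_tot = float(((y - y.mean()) ** 2).sum())
r2 = 1.0 - ss_res / ss_tot

# ANOVA F for the regression (1 model df, n-2 residual df).
n = len(y)
f_value = (ss_tot - ss_res) / (ss_res / (n - 2))

# t value for the slope: estimate divided by its standard error.
s2 = ss_res / (n - 2)
se_slope = (s2 / ((x - x.mean()) ** 2).sum()) ** 0.5
t_slope = beta[1] / se_slope

print(round(r2, 2), round(f_value), round(t_slope))  # R^2 near 0.9; F and t large
```

For a single predictor, the F value is exactly the square of the slope's t value, which is why both indicate significance here.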
Automatic Musical Instrument Sound Classification
uncorrelated variables, which keep as much of the variability in the data as possible.

• MPEG-7 audio descriptors, including log-attack time (i.e., the logarithm of onset duration), fundamental frequency (pitch), spectral envelope and spread, etc. (ISO, 2003; Peeters, et al., 2000).

Feature vectors obtained via parameterization of musical instrument sounds are used as inputs for classifiers, both for training and recognition purposes.

Classification Techniques

Automatic classification is the process by which a classificatory system processes information in order to automatically classify data accurately, or the result of such a process. A class may represent an instrument, articulation, instrument family, and so forth. Classifiers applied to this task range from probabilistic and statistical algorithms, through methods based on learning by example, where classification is based on the distance between the observed sample and the nearest known neighbor, to methods originating from artificial intelligence, like neural networks, which mimic the neural connections in the brain. Each classifier yields a new sound description (representation). Some classifiers produce an explicit set of classification rules (e.g., decision trees or rough-set-based algorithms), giving insight into relationships between specific sound timbres and the calculated features. Since human-performed recognition of musical instruments is based on subjective criteria and is difficult to formalize, learning algorithms that allow extraction of precise rules of sound classification are broadening our knowledge and giving a formal representation of subjective sound features.

The following algorithms can be applied to musical instrument sound classification:

• Bayes decision rule (i.e., a probabilistic classification method of assignment of unknown samples to the classes). In Brown (1999), training data were grouped into clusters obtained through the k-means algorithm, and Gaussian probability density functions were formed from the mean and variance of each cluster.

• K-Nearest Neighbor (k-NN) algorithm, where the class (instrument) for a tested sound sample is assigned on the basis of the distances between the vector of parameters for this sample and the majority of the k nearest vectors representing known samples (Kaminskyj, 2002; Martin & Kim, 1998). To improve performance, genetic algorithms are additionally applied to find the optimal set of weights for the parameters (Fujinaga & McMillan, 2000).

• A statistical pattern-recognition technique: a maximum a posteriori classifier based on Gaussian models (introducing prior probabilities), obtained via Fisher multiple discriminant analysis, which projects the high-dimensional feature space into a space of one dimension fewer than the number of classes, in which the classes are separated maximally (Martin & Kim, 1998).

• Neural networks, designed by analogy with a simplified model of the neural connections in the brain and trained to find relationships in the data; multi-layer nets and self-organizing feature maps have been used (Cosi et al., 1994; Kostek & Czyzewski, 2001).

• Decision trees, where nodes are labeled with sound parameters, edges are labeled with parameter values, and leaves represent classes (Wieczorkowska, 1999b).

• Rough-set-based algorithms; rough sets are defined by a lower approximation, containing elements that belong to the set for sure, and an upper approximation, containing elements that may belong to the set (Wieczorkowska, 1999a).

• Support vector machines, which aim at finding the hyperplane that best separates observations belonging to different classes (Agostini et al., 2003).

• Hidden Markov Models (HMM), used for representing sequences of states; in this case, they can be used for representing the long sequences of feature vectors that define an instrument sound (Herrera et al., 2000).
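The k-NN rule in the list above can be sketched in a few lines; the feature vectors and labels below are hypothetical stand-ins for parameterized instrument sounds:

```python
from collections import Counter
import math

# Hypothetical training set: (feature vector, instrument label).
train = [
    ((0.1, 0.9), "flute"),
    ((0.2, 0.8), "flute"),
    ((0.9, 0.2), "cello"),
    ((0.8, 0.1), "cello"),
    ((0.7, 0.3), "cello"),
]

def knn(sample, train, k=3):
    """Majority vote among the k training vectors nearest to the sample."""
    nearest = sorted(train, key=lambda t: math.dist(sample, t[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn((0.75, 0.25), train))  # cello
```

The parameter weighting mentioned above (e.g., via genetic algorithms) would amount to scaling each feature dimension before the distance computation.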
each cluster. this procedure is usually repeated a number of times,
K-Nearest Neighbor (k-NN) algorithm, where the and the final result is the average of all runs. Other
class (instrument) for a tested sound sample is popular divisions are in proportions 80/20 or 90/10.
assigned on the basis of the distances between the Also, leave-one-out procedure is used, where only one
vector of parameters for this sample and the major- sample is used for testing. Generally, the higher per-
ity of k nearest vectors representing known samples centage of the training data is in proportion to the test
(Kaminskyj, 2002; Martin & Kim, 1998). To im- data and the smaller the number of classes, the higher
prove performance, genetic algorithms are addi- the accuracy that is obtained. Some instruments are
tionally applied to find the optimal set of weights easily identified with high accuracy, whereas, others
for the parameters (Fujinaga & McMillan, 2000). frequently are misclassified, especially with those from
Automatic Musical Instrument Sound Classification
the same family. Classification of instruments is sometimes performed hierarchically: articulation or family is recognized first, and then the instrument is identified.
Following is an overview of results obtained so far in the research on musical instrument sound classification:

• Brown (1999) reported an average 84.1% recognition accuracy for two classes (oboe and saxophone), using cepstral coefficients as features and Bayes decision rules for clusters obtained via the k-means algorithm.
• Brown, Houix, and McAdams (2001), in experiments with four classes, obtained 79-84% accuracy for bin-to-bin differences of constant-Q coefficients, and cepstral and autocorrelation coefficients, using the Bayesian method.
• K-NN classification applied to mel-frequency and linear prediction cepstral coefficients (Eronen, 2001), with training on 29 orchestral instruments and testing on 16 instruments from various recordings, yielded 35% accuracy for instruments and 77% for families. K-NN, combined with genetic algorithms (Fujinaga & McMillan, 2000), yielded 50% correctness in leave-one-out tests on spectral features representing 23 orchestral instruments played with various articulations.
• Kaminskyj (2002) applied k-NN to constant-Q and cepstral coefficients, MSA trajectories, amplitude envelope, and spectral centroid. He obtained 89-92% accuracy for instruments, 96% for families, and 100% in identifying impulsive vs. sustained sounds in leave-one-out tests for MUMS data. Tests on other recordings initially yielded 33-61% accuracy, and 87-90% after improvements.
• Multilayer neural networks applied to wavelet and Fourier-based parameterization yielded 72-99% accuracy for various groups of four instruments (Kostek & Czyzewski, 2001).
• The statistical pattern-recognition technique and k-NN algorithm, applied to sounds representing 14 orchestral instruments played with various articulation, yielded 71.6% accuracy for instruments, 86.9% for families, and 98.8% in discriminating continuant sounds vs. pizzicato (Martin & Kim, 1998) in 70/30 tests. The features included pitch, spectral centroid, ratio of odd-to-even harmonic energy, onset asynchrony, and the strength of vibrato and tremolo (quick changes of sound amplitude or note repetitions).
• Discriminant analysis and support vector machines yielded about 70% accuracy in leave-one-out tests with spectral features for 27 instruments (Agostini et al., 2003).
• Rough-set-based classifiers and decision trees applied to data representing 18 classes (11 orchestral instruments, various articulation), parameterized using Fourier and wavelet-based attributes, yielded 68-77% accuracy in 90/10 tests and 64-68% in 70/30 tests (Wieczorkowska, 1999b).
• K-NN and rough-set-based classifiers, applied to spectral and temporal sound parameterization, yielded 68% accuracy in 80/20 tests for 18 classes, representing 11 orchestral instruments and various articulation (Wieczorkowska et al., 2003).

Generally, instrument families or sustained/impulsive sounds are identified with accuracy exceeding 90%, whereas instruments, if there are more than 10, are identified with accuracy reaching about 70%. These results compare favorably with human performance and exceed results obtained for inexperienced listeners.

FUTURE TRENDS

Automatic indexing and searching of audio files is gaining increasing interest. The MPEG-7 standard addresses the issue of content description in multimedia data, and the audio descriptors provided in this standard form a basis for further research. Constant growth of audio resources available on the Internet causes an increasing need for content-based search of audio data. Therefore, we can expect intensification of research in this domain and progress of studies on automatic classification of musical instrument sounds.

CONCLUSION

Results obtained so far in automatic musical instrument sound classification vary, depending on the size of the data, sound parameterization, classifier, and testing method. Also, some instruments are identified easily with high accuracy, whereas others are misclassified frequently, in the case of both human and machine performance.
Increasing interest in content-based searching through audiovisual data and the growth of the amount of multimedia data available via the Internet raise the need and perspective for further progress in automatic classification of audio data.

REFERENCES

Agostini, G., Longari, M., & Pollastri, E. (2003). Musical instrument timbres classification with spectral features.
EURASIP Journal on Applied Signal Processing, 1, 1-11.

Ando, S., & Yamaguchi, K. (1993). Statistical study of spectral parameters in musical instrument tones. Journal of the Acoustical Society of America, 94(1), 37-45.

Brown, J.C. (1999). Computer identification of musical instruments using pattern recognition with cepstral coefficients as features. Journal of the Acoustical Society of America, 105, 1933-1941.

Brown, J.C., Houix, O., & McAdams, S. (2001). Feature dependence in the automatic identification of musical woodwind instruments. Journal of the Acoustical Society of America, 109, 1064-1072.

Cosi, P., De Poli, G., & Lauzzana, G. (1994). Auditory modelling and self-organizing neural networks for timbre classification. Journal of New Music Research, 23, 71-98.

Eronen, A. (2001). Comparison of features for musical instrument recognition. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA 2001, New York, NY, USA.

Fritts, L. (1997). The University of Iowa musical instrument samples. Retrieved 2004 from http://theremin.music.uiowa.edu/MIS.html

Fujinaga, I., & McMillan, K. (2000). Realtime recognition of orchestral instruments. Proceedings of the International Computer Music Conference, Berlin, Germany.

Herrera, P., Amatriain, X., Batlle, E., & Serra, X. (2000). Towards instrument segmentation for music content description: A critical review of instrument classification techniques. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.

Hornbostel, E.M.V., & Sachs, C. (1914). Systematik der Musikinstrumente. Ein Versuch. Zeitschrift für Ethnologie, 46(4-5), 553-90.

IRCAM, Institut de Recherche et Coordination Acoustique/Musique. (2003). Studio on line. Retrieved 2004 from http://forumnet.ircam.fr/rubrique.php3?id_rubrique=107

ISO, International Organisation for Standardisation. (2003). MPEG-7 overview. Retrieved 2004 from http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm

Kaminskyj, I. (2002). Multi-feature musical instrument sound classifier w/user determined generalisation performance. Proceedings of the Australasian Computer Music Association Conference ACMC 2002, Melbourne, Australia.

Kostek, B., & Czyzewski, A. (2001). Representing musical instrument sounds for their automatic classification. Journal of the Audio Engineering Society, 49(9), 768-785.

Martin, K.D., & Kim, Y.E. (1998). Musical instrument identification: A pattern-recognition approach. Proceedings of the 136th Meeting of the Acoustical Society of America, Norfolk, Virginia.

Opolko, F., & Wapnick, J. (1987). MUMS: McGill University master samples [CD-ROM]. McGill University, Montreal, Quebec, Canada.

Peeters, G., McAdams, S., & Herrera, P. (2000). Instrument sound description in the context of MPEG-7. Proceedings of the International Computer Music Conference ICMC2000, Berlin, Germany.

Pollard, H.F., & Jansson, E.V. (1982). A tristimulus method for the specification of musical timbre. Acustica, 51, 162-171.

SIL. (1999). LinguaLinks library. Retrieved 2004 from http://www.sil.org/LinguaLinks/Anthropology/ExpnddEthnmsclgyCtgrCltrlMtrls/MusicalInstrumentsSubcategorie.htm

Smith, R. (2000). Rod's encyclopedic dictionary of traditional music. Retrieved 2004 from http://www.sussexfolk.freeserve.co.uk/ency/a.htm

Viste, H., & Evangelista, G. (2003). Separation of harmonic instruments with overlapping partials in multichannel mixtures. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA-03, New Paltz, New York.

Whitaker, J.C., & Benson, K.B. (Eds.). (2002). Standard handbook of audio and radio engineering. New York: McGraw-Hill.

Wieczorkowska, A. (1999a). Rough sets as a tool for audio signal classification. Foundations of Intelligent Systems, LNCS/LNAI 1609, 11th Symposium on Methodologies for Intelligent Systems, Proceedings/ISMIS'99, Warsaw, Poland.

Wieczorkowska, A. (1999b). Skuteczność rozpoznawania dźwięków instrumentów muzycznych w zależności od sposobu parametryzacji i rodzaju klasyfikatora (Efficiency of musical instrument sounds recognition depending on parameterization and classifier) [doctoral thesis] [in Polish]. Gdansk: Technical University of Gdansk.

Wieczorkowska, A., Wróblewski, J., Synak, P., & Ślęzak, D. (2003). Application of temporal descriptors to musical instrument sound recognition. Journal of Intelligent Information Systems, 21(1), 71-93.
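To make the k-NN rule and the repeated train/test validation procedure described above concrete, the following minimal sketch classifies synthetic two-dimensional "parameter vectors" for two hypothetical instrument classes and averages the accuracy over repeated 70/30 splits. All names and data here are illustrative; a real system would use spectral or cepstral features extracted from recordings.

```python
import random
from collections import Counter

def knn_predict(train, query, k=5):
    # Majority vote among the k training vectors nearest to the
    # query in Euclidean distance (the k-NN rule described above).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), label)
        for vec, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def average_accuracy(samples, runs=10, train_frac=0.7, k=5, seed=0):
    # Repeat a random 70/30 train/test split and average the accuracy
    # over all runs, mirroring the validation procedure in the text.
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        data = samples[:]
        rng.shuffle(data)
        cut = int(len(data) * train_frac)
        train, test = data[:cut], data[cut:]
        correct = sum(knn_predict(train, vec, k) == label for vec, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Toy, well-separated "sound parameter" vectors for two classes.
samples = (
    [((0.1 * i, 1.0 + 0.05 * i), "oboe") for i in range(30)]
    + [((0.1 * i, 5.0 - 0.05 * i), "saxophone") for i in range(30)]
)
print(average_accuracy(samples))
```

A leave-one-out test would replace the random split with one pass per sample, holding out a single vector each time; the weighted variants cited above would additionally scale each feature dimension before computing distances.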
Bayesian Networks
Ahmad Bashir
University of Texas at Dallas, USA
Latifur Khan
University of Texas at Dallas, USA
Mamoun Awad
University of Texas at Dallas, USA
A graphical model visually illustrates conditional independencies among variables in a given problem. Two variables that are conditionally independent have no direct impact on each other's values. Furthermore, the graphical model shows any intermediary variables that separate two conditionally independent variables. Through these intermediary variables, two conditionally independent variables affect one another.
A graph is composed of a set of nodes, which represent variables, and a set of edges. Each edge connects two nodes, and an edge can have an optional direction assigned to it. For X1 and X2, if a causal relationship between the variables exists, the edge will be directional, leading from the cause variable to the effect variable; if just a correlation between the variables exists, the edge will be undirected.
We use an example with three variables to illustrate these concepts. In this example, two conditionally independent variables, A and C, are directly related to another variable, B. To represent this situation, an edge must exist between the nodes of the variables that are directly related, that is, between A and B and between B and C. Furthermore, the relationships between A and B and B and C are correlations as opposed to causal relations; hence, the respective edges will be undirected. Figure 1 illustrates this example. Due to conditional independence, nodes A and C still have an indirect influence on one another; however, variable B encodes the information from A that impacts C, and vice versa.
A Bayesian network is a specific type of graphical model, with directed edges and no cycles (Stephenson, 2000). The edges in Bayesian networks are viewed as causal connections, where each parent node causes an effect on its children.
In addition, nodes in a Bayesian network contain a conditional probability table, or CPT, which stores all probabilities that may be used to reason or make inferences within the system.

Figure 1. Graphical model of two independent variables A and C that are directly related to a third variable B (nodes A and C each joined to B by an undirected edge)

Probability calculus does not require that the probabilities be based on theoretical results or frequencies of repeated experiments, commonly known as relative frequencies. Probabilities may also be completely subjective estimates of the certainty of an event.
Consider an example of a basketball game. If one were to bet on an upcoming game between Team A and Team B, it is important to know the probability of Team A winning the game. This probability is definitely not a ratio, a relative frequency, or even an estimate of a relative frequency; the game cannot be repeated many times under exactly the same conditions. Rather, the probability represents only one's belief concerning Team A's chances of winning. Such a probability is termed a Bayesian or subjective probability and makes use of Bayes' theorem to calculate unknown probabilities.
A Bayesian probability may also be referred to as a personal probability. The Bayesian probability of an event x is a person's degree of belief in that event. A Bayesian probability is a property of the person who assigns the probability, whereas a classical probability is a physical property of the world, meaning it is the physical probability of an event.
An important difference between physical probability and Bayesian probability is that repeated trials are not necessary to measure the Bayesian probability. The Bayesian method can assign a probability for events that may be difficult to determine experimentally. An oft-voiced criticism of the Bayesian approach is that probabilities seem arbitrary, but this is a probability assessment issue that does not take away from the many possibilities that Bayesian probabilities provide.

Causal Influence

Bayesian networks require an operational method for identifying causal relationships in order to model a domain accurately. Hence, causal influence is defined in the following manner: If the action of making variable X take some value sometimes changes the value taken by variable Y, then X is assumed to be responsible for sometimes changing Y's value, and one may conclude that X is a cause of Y. More formally, X is manipulated when we force X to take some value, and we say X causes Y if some manipulation of X leads to a change in the probability distribution of Y.
Furthermore, if manipulating X leads to a change in the probability distribution of Y, then X obtaining a value by any means whatsoever also leads to a change in the probability distribution of Y. Hence, one can make the natural conclusion that causes and their effects are statis-
sion. Moreover, graphical representations uncover several opportunities for efficient computation and serve as understandable logic diagrams.
Bayesian networks can simulate humanlike reasoning; this fact is not, however, due to any structural similarities with the human brain. Rather, it is because of the resemblance between the ways Bayesian networks and humans reason. The resemblance is more psychological than biological but nevertheless a true benefit.

Bayesian Inference

Inference is the task of computing the probability of each value of a node in a Bayesian network when other variables' values are known (Jensen, 1999). This concept is what makes Bayesian networks so powerful, as it allows the user to apply knowledge toward forward or backward reasoning. Suppose that a specific value for one or more of the variables in the network has been observed. If one variable has a definite value, or evidence, the probabilities, or belief values, for the other variables need to be revised, as this variable now has a defined value. This calculation of the updated probabilities for system variables based on new evidence is precisely the definition of inference.

FUTURE TRENDS

The future of Bayesian networks lies in determining new ways to tackle the following issues of Bayesian inferencing and in building a Bayesian structure that accurately represents a particular system. As we discuss in this paper, conditional dependencies can be mapped into a graph in several ways, each with subtle semantic and statistical differences. Future research will give way to Bayesian networks that can understand system semantics and adapt accordingly, not only with respect to the conditional probabilities within each node but also with respect to the graph itself.

CONCLUSION

A Bayesian network consists of the following elements:

• A set of variables and a set of directed edges between variables
• A finite set of mutually exclusive states for each variable
• A directed acyclic graph (DAG), constructed from the variables coupled with the directed edges
• A conditional probability table (CPT) P(A | B1, B2, ..., Bn) that is associated with each variable A with parents B1, B2, ..., Bn

Bayesian networks continue to play a vital role in prediction and classification within data mining (Niedermayer, 1998). They are a marriage between probability theory and graph theory, providing a natural tool for dealing with two problems that occur throughout applied mathematics and engineering: uncertainty and complexity. Also, Bayesian networks play an increasingly important role in the design and analysis of machine learning algorithms, serving as a promising way to approach present and future problems related to artificial intelligence and data mining (Choudhury, Rehg, Pavlovic, & Pentland, 2002; Doshi, Greenwald, & Clarke, 2002; Fenton, Cates, Forey, Marsh, Neil, & Tailor, 2003).

REFERENCES

Choudhury, T., Rehg, J. M., Pavlovic, V., & Pentland, A. (2002). Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection. Proceedings of the International Conference on Pattern Recognition (ICPR), Canada, III (pp. 789-794).

Doshi, P., Greenwald, L., & Clarke, J. (2002). Towards effective structure learning for large Bayesian networks. Proceedings of the AAAI Workshop on Probabilistic Approaches in Search, Canada (pp. 16-22).

Fenton, N., Cates, P., Forey, S., Marsh, W., Neil, M., & Tailor, M. (2003). Modelling risk in complex software projects using Bayesian networks (Tech. Rep.). London: Queen Mary University.

Helsper, E.M., & van der Gaag, L.C. (2002). Building Bayesian networks through ontologies. Proceedings of the 15th European Conference on Artificial Intelligence, Lyon, France (pp. 680-684).

Huang, K., King, I., & Lyu, M.R. (2002). Learning maximum likelihood semi-naive Bayesian network classifier. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Hammamet, Tunisia.

Jensen, F. (1999). Gradient descent training of Bayesian networks. Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (pp. 190-200).

Neapolitan, R. E. (2004). Learning Bayesian networks. Upper Saddle River, NJ: Prentice-Hall.

Niedermayer, D. (1998). An introduction to Bayesian networks and their contemporary applications. Retrieved October 2004, from http://www.niedermayer.ca/papers/bayesian/

Stephenson, T. (2000). An introduction to Bayesian network theory and usage. Retrieved October 2004, from http://www.idiap.ch/publications/todd00a.bib.abs.html

KEY TERMS

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships such as classification, prediction, estimation, or affinity grouping.

Independent: Two random variables are independent when knowing something about the value of one of them does not yield any information about the value of the other.

Joint Probability: The probability of two events occurring in conjunction.

Supervised Learning: A machine learning technique for creating a function from training data; the task of the supervised learner is to predict the value of the function for any valid input object after having seen only a small number of training data.
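The elements of a Bayesian network listed above (a DAG plus a CPT per variable) and the inference task defined in the article can be illustrated with a minimal sketch. The three-variable chain and all probability values below are invented for illustration only:

```python
from itertools import product

# Hypothetical CPTs for a toy chain A -> B -> C (all numbers invented).
p_a = 0.3                      # P(A = true)
p_b = {True: 0.9, False: 0.2}  # P(B = true | A)
p_c = {True: 0.8, False: 0.1}  # P(C = true | B)

def joint(a, b, c):
    # The DAG factorizes the joint as P(A) * P(B|A) * P(C|B).
    pa = p_a if a else 1 - p_a
    pb = p_b[a] if b else 1 - p_b[a]
    pc = p_c[b] if c else 1 - p_c[b]
    return pa * pb * pc

def posterior_a(c_obs):
    # Inference: revise the belief in A after observing evidence C = c_obs,
    # by summing the joint over the unobserved variable B (enumeration).
    num = sum(joint(True, b, c_obs) for b in (True, False))
    den = sum(joint(a, b, c_obs) for a, b in product((True, False), repeat=2))
    return num / den

print(posterior_a(True))  # observing C = true raises belief in A above the 0.3 prior
```

Enumeration like this is exponential in the number of variables; practical systems use the network structure (as in the junction-tree or variable-elimination families of algorithms) to keep inference tractable.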
Best Practices in Data Warehousing from the Federal Perspective
EPA Envirofacts Warehouse - The Envirofacts data warehouse comprises information from 12 different environmental databases for facility information, including toxic chemical releases, water discharge permit compliance, hazardous waste handling processes, Superfund status, and air emission estimates. Each program office provides its own data and is responsible for maintaining this data. Initially, the Envirofacts warehouse architects noted some data integrity problems, namely, issues with accurate data, understandable data, properly linked data and standardized data. The architects had to work hard to address these key data issues so that the public can trust the quality of the data in the warehouse (Garvey, 2003).

U.S. Navy Type Commander Readiness Management System - The Navy uses a data warehouse to support the decisions of its commanding officers. Data at the lower unit levels is aggregated to the higher levels and then interfaced with other military systems for a joint military assessment of readiness as required by the Joint Chiefs of Staff. The Navy found that it was spending too much time to determine its readiness and some of its reports contained incorrect data. The Navy developed a user-friendly, Web-based system that provides quick and accurate assessment of readiness data at all levels within the Navy. The system collects, stores, reports and analyzes mission readiness data from air, sub and surface forces for the Atlantic and Pacific Fleets. Although this effort was successful, the Navy learned that data originating from the lower levels still needs to be accurate. The reason is that a number of legacy systems, which serve as the source data for the warehouse, lacked validation functions (Microsoft, 2000).

Standardize the Organization's Data Definitions

A key attribute of a data warehouse is that it serves as a "single version of the truth." This is a significant improvement over the different and often conflicting versions of the truth that come from an environment of disparate silos of data. To achieve this singular version of the truth, there need to be consistent definitions of data elements to afford the consolidation of common information across different data sources. These consistent data definitions are captured in a data warehouse's metadata repository.

DoD Computerized Executive Information System (CEIS) is a 4-terabyte data warehouse that holds the medical records of the 8.5 million active members of the U.S. military health care system who are treated at 115 hospitals and 461 clinics around the world. The Defense Department wanted to convert its fixed-cost health care system to a managed-care model to lower costs and increase patient care for the active military, retirees and their dependents. Over 12,000 doctors, nurses and administrators use it. Frank Gillett, an analyst at Forrester Research, Inc., stated that, "What kills these huge data warehouse projects is that the human beings don't agree on the definition of data. Without that . . . all that $450 million [cost of the warehouse project] could be thrown out the window" (Hamblen, 1998).

Be Selective on What Data Elements to Include in the Warehouse

Users are unsure of what they want, so they place an excessive number of data elements in the warehouse. This results in an immense, unwieldy warehouse in which query performance is impaired.

Federal Credit Union - The data warehouse architect for this organization suggests that users know which data they use most, although they will not always admit to what they use least (Deitch, 2000).

Select the Extraction-Transformation-Loading (ETL) Strategy Carefully

Having an effective ETL strategy that extracts data from the various transactional systems, transforms the data to a common format, and loads the data into a relational or multidimensional database is the key to a successful data warehouse project. If the ETL strategy is not effective, it will mean delays in refreshing the data warehouse, contaminating the data warehouse with dirty data, and increasing the costs of maintaining the warehouse.

IRS Compliance Warehouse supports research and decision support, allowing the IRS to analyze, develop, and implement business strategies for increasing voluntary compliance, improving productivity and managing the organization. It also provides projections, forecasts, quantitative analysis, and modeling. Users are able to query this data for decision support. A major hurdle was to transform the large and diverse legacy online transactional data sets for effective use in an analytical architecture. They needed a way to process custom hierarchical data files and convert them to ASCII for local processing and mapping to relational databases. They ended up developing a script program that does all of this. ETL is a major challenge and may be a showstopper for a warehouse implementation (Kmonk, 1999).

Leverage the Data Warehouse to Provide Auditing Capability

An overlooked benefit of data warehouses is their capability of serving as an archive of historic knowledge that can be used as an audit trail for later investigations.
U.S. Army Operational Testing and Evaluation Command (OPTEC) is charged with developing test criteria and evaluating the performance of extremely complex weapons equipment in every conceivable environment and condition. Moreover, as national defense policy undergoes a transformation, so do the weapon systems, and thus the testing requirements. The objective of their warehouse was to consolidate a myriad of test data sets to provide analysts and auditors with access to the specific information needed to make proper decisions.
OPTEC was having fits when audit agencies, such as the General Accounting Office (GAO), would show up to investigate a weapon system. For instance, if problems with a weapon show up five years after it is introduced into the field, people are going to want to know what tests were performed and the results of those tests. A warehouse with its metadata capability made data retrieval much more efficient (Microsoft, 2000).

Leverage the Web and Web Portals for Warehouse Data to Reach Dispersed Users

In many organizations, users are geographically distributed, and the World Wide Web has been very effective as a gateway for these dispersed users to access the key resources of their organization, which include data warehouses and data marts.

U.S. Army OPTEC developed a Web-based front end for its warehouse so that information can be entered and accessed regardless of the hardware available to users. It supports the geographically dispersed nature of OPTEC's mission. Users performing tests in the field can be anywhere from Albany, New York to Fort Hood, Texas. That is why the browser client the Army developed is so important to the success of the warehouse (Microsoft, 2000).

DoD Defense Dental Standard System supports more than 10,000 users at 600 military installations worldwide. The solution consists of three main modules: Dental Charting, Dental Laboratory Management, and Workload and Dental Readiness Reporting. The charting module helps dentists graphically record patient information. The lab module automates the workflow between dentists and lab technicians. The reporting module allows users to see key information through Web-based online reports, which is a key to the success of the defense dental operations.

IRS Compliance Data Warehouse includes a Web-based query and reporting solution that provides high-value, easy-to-use data access and analysis capabilities, can be quickly and easily installed and managed, and scales to support hundreds of thousands of users. With this portal, the IRS found that portals provide an effective way to access diverse data sources via a single screen (Kmonk, 1999).

Make Warehouse Data Available to All Knowledge Workers (Not Only to Managers)

The early data warehouses were designed to support upper management decision-making. However, over time, organizations have realized the importance of knowledge sharing and collaboration and its relevance to the success of the organizational mission. As a result, upper management has become aware of the need to disseminate the functionality of the data warehouse throughout the organization.

IRS Compliance Data Warehouse supports a diversity of user types (economists, research analysts, and statisticians), all of whom are searching for ways to improve customer service, increase compliance with federal tax laws and increase productivity. It is not just for upper management decision making anymore (Kmonk, 1999).

Supply Data in a Format Readable by Spreadsheets

Although online analytical tools such as those supported by Cognos and Business Objects are useful for data analysis, the spreadsheet is still the basic tool used by most analysts.

U.S. Army OPTEC wanted users to transfer data and work with information on applications that they are familiar with. In OPTEC, they transfer the data into a format readable by spreadsheets so that analysts can really crunch the data. Specifically, pivot tables found in spreadsheets allow the analysts to manipulate the information to put meaning behind the data (Microsoft, 2000).

Restrict or Encrypt Classified/Sensitive Data

Depending on requirements, a data warehouse can contain confidential information that should not be revealed to unauthorized users. If privacy is breached, the organization may become legally liable for damages and suffer a negative reputation with the ensuing loss of customers' trust and confidence. Financial consequences can result.

DoD Computerized Executive Information System uses an online analytical processing tool from a popular vendor that can be used to restrict access to certain data, such as HIV test results, so that any confidential data would not be disclosed (Hamblen, 1998).
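The column-level restriction just described can be sketched as a redaction step applied before query results are released. The field names and roles below are invented for illustration and are not taken from the CEIS system or any real vendor tool:

```python
# Hypothetical access rule: sensitive columns are stripped from query
# results unless the requesting role is explicitly authorized.
SENSITIVE_COLUMNS = {"hiv_test_result", "diagnosis_code"}
AUTHORIZED_ROLES = {"medical_officer"}

def redact(record, role):
    # Return a copy of the record with sensitive columns removed
    # for roles that are not authorized to see them.
    if role in AUTHORIZED_ROLES:
        return dict(record)
    return {col: val for col, val in record.items()
            if col not in SENSITIVE_COLUMNS}

record = {"patient_id": 1042, "clinic": "Norfolk", "hiv_test_result": "negative"}
print(redact(record, "analyst"))          # sensitive column stripped
print(redact(record, "medical_officer"))  # full record returned
```

In practice this policy would live in the OLAP tool or database (views, column grants, or encryption at rest) rather than in application code, so that it cannot be bypassed by ad hoc queries.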
workforce should recognize the value of the knowledge that can be gained from data warehousing and how to apply it to achieve organizational success.
A data warehouse should be part of an enterprise architecture, which is a framework for visualizing the information technology assets of an enterprise and how these assets interrelate. It should reflect the vision and business processes of an organization. It should also include standards for the assets and interoperability requirements among these assets.

Parker, G. (1999). Data warehousing at the federal government: A CIO perspective. In Proceedings from Data Warehouse Conference '99.

PriceWaterhouseCoopers. (2001). Technology forecast.

SAS. (2000). The U.S. Bureau of the Census counts on a better system.

Schwartz, A. (2000). Making the Web safe. Federal Computer Week.
Legacy System: Typically, a database management system in which an organization has invested considerable time and money and which resides on a mainframe or minicomputer.

Outsourcing: Acquiring services or products from an outside supplier or manufacturer in order to cut costs and/or procure outside expertise.

Performance Metrics: Key measurements of system attributes that are used to determine the success of the process.

Pivot Tables: An interactive table found in most spreadsheet programs that quickly combines and compares typically large amounts of data. One can rotate its rows and columns to see different arrangements of the source data, and also display the details for areas of interest.

Terabyte: A unit of memory or data storage capacity equal to roughly 1,000 gigabytes.

Total Cost of Ownership: Developed by Gartner Group, an accounting method used by organizations seeking to identify both their direct and indirect systems costs.

NOTE

The views expressed in this article are those of the author and do not reflect the official policy or position of the National Defense University, the Department of Defense or the U.S. Government.
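The pivot-table behavior defined above (combining and comparing rows, with rotatable row and column groupings) can be sketched in a few lines. The field names and rows below are invented for illustration:

```python
from collections import defaultdict

def pivot(rows, index, columns, values):
    # Group rows by the index field, spread the columns field across
    # the output columns, and sum the values field in each cell.
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[index]][row[columns]] += row[values]
    return {key: dict(cols) for key, cols in table.items()}

# Toy test-result records; swapping the index and columns arguments
# "rotates" the table, as described in the definition above.
rows = [
    {"system": "radar", "year": 2003, "tests": 4},
    {"system": "radar", "year": 2004, "tests": 6},
    {"system": "sonar", "year": 2003, "tests": 2},
    {"system": "radar", "year": 2003, "tests": 1},
]
print(pivot(rows, "system", "year", "tests"))
# {'radar': {2003: 5.0, 2004: 6.0}, 'sonar': {2003: 2.0}}
```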
Jeffrey Stanton
Syracuse University School of Information Studies, USA
INTRODUCTION

Most people think of a library as the little brick building in the heart of their community or the big brick building in the center of a college campus. However, these notions greatly oversimplify the world of libraries. Most large commercial organizations have dedicated in-house library operations, as do schools; nongovernmental organizations; and local, state, and federal governments. With the increasing use of the World Wide Web, digital libraries have burgeoned, serving a huge variety of different user audiences. With this expanded view of libraries, two key insights arise. First, libraries are typically embedded within larger institutions. Corporate libraries serve their corporations, academic libraries serve their universities, and public libraries serve taxpaying communities who elect overseeing representatives. Second, libraries play a pivotal role within their institutions as repositories and providers of information resources. In the provider role, libraries represent in microcosm the intellectual and learning activities of the people who comprise the institution. This fact provides the basis for the strategic importance of library data mining: By ascertaining what users are seeking, bibliomining can reveal insights that have meaning in the context of the library's host institution.

Use of data mining to examine library data might be aptly termed bibliomining. With widespread adoption of computerized catalogs and search facilities over the past quarter century, library and information scientists have often used bibliometric methods (e.g., the discovery of patterns in authorship and citation within a field) to explore patterns in bibliographic information. During the same period, various researchers have developed and tested data-mining techniques, which are advanced statistical and visualization methods used to locate nontrivial patterns in large datasets. Bibliomining refers to the use of these bibliometric and data-mining techniques to explore the enormous quantities of data generated by the typical automated library.

BACKGROUND

Forward-thinking authors in the field of library science began to explore sophisticated uses of library data some years before the concept of data mining became popularized. Nutter (1987) explored library data sources to support decision making but lamented that "the ability to collect, organize, and manipulate data far outstrips the ability to interpret and to apply them" (p. 143). Johnston and Weckert (1990) developed a data-driven expert system to help select library materials, and Vizine-Goetz, Weibel, and Oskins (1990) developed a system for automated cataloging based on book titles (see also Morris, 1992, and Aluri & Riggs, 1990). A special section of Library Administration and Management, "Mining your automated system," included articles on extracting data to support system management decisions (Mancini, 1996), extracting frequencies to assist in collection decision making (Atkins, 1996), and examining transaction logs to support collection management (Peters, 1996).

More recently, Banerjee (1998) focused on describing how data mining works and how to use it to provide better access to the collection. Guenther (2000) discussed data sources and bibliomining applications but focused on the problems with heterogeneous data formats. Doszkocs (2000) discussed the potential for applying neural networks to library data to uncover possible associations between documents, indexing terms, classification codes, and queries. Liddy (2000) combined natural language processing with text mining to discover information in digital library collections. Lawrence, Giles, and Bollacker (1999) created a system to retrieve and index citations from works in digital libraries. Gutwin, Paynter, Witten, Nevill-Manning, and Frank (1999) used text mining to support resource discovery.

These projects all shared a common focus on improving and automating two of the core functions of a library: acquisitions and collection management. A few authors have recently begun to address the need to support management by focusing on understanding library users:
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Schulman (1998) discussed using data mining to examine changing trends in library user behavior; Sallis, Hill, Jancee, Lovette, and Masi (1999) created a neural network that clusters digital library users; and Chau (2000) discussed the application of Web mining to personalize services in electronic reference.

The December 2003 issue of Information Technology and Libraries was a special issue dedicated to the bibliomining process. Nicholson presented an overview of the process, including the importance of creating a data warehouse that protects the privacy of users. Zucca discussed the implementation of a data warehouse in an academic library. Wormell; Suárez-Balseiro, Iribarren-Maestro, and Casado; and Geyer-Schultz, Neumann, and Thede used bibliomining in different ways to understand the use of academic library sources and to create appropriate library services.

We extend these efforts by taking a more global view of the data generated in libraries and the variety of decisions that those data can inform. Thus, the focus of this work is on describing ways in which library and information managers can use data mining to understand patterns of behavior among library users and staff and patterns of information resource use throughout the institution.

MAIN THRUST

Integrated Library Systems and Data Warehouses

Most managers who wish to explore bibliomining will need to work with the technical staff of their Integrated Library System (ILS) vendors to gain access to the databases that underlie the system and create a data warehouse. The cleaning, preprocessing, and anonymizing of the data can absorb a significant amount of time and effort. Only by combining and linking different data sources, however, can managers uncover the hidden patterns that can help them understand library operations and users.

Exploration of Data Sources

Available library data sources are divided into three groups for this discussion: data from the creation of the library, data from the use of the collection, and data from external sources not normally included in the ILS.

ILS Data Sources from the Creation of the Library System

Bibliographic Information

One source of data is the collection of bibliographic records and searching interfaces that represents materials in the library, commonly known as the Online Public Access Catalog (OPAC). In a digital library environment, the same type of information collected in a bibliographic library record can be collected as metadata. The concepts parallel those in a traditional library: Take an agreed-upon standard for describing an object, apply it to every object, and make the resulting data searchable. Therefore, digital libraries use conceptually similar bibliographic data sources to traditional libraries.

Acquisitions Information

Another source of data for bibliomining comes from acquisitions, where items are ordered from suppliers and tracked until they are received and processed. Because digital libraries do not order physical goods, somewhat different acquisition methods and vendor relationships exist. Nonetheless, in both traditional and digital library environments, acquisition data have untapped potential for understanding, controlling, and forecasting information resource costs.

ILS Data Sources from Usage of the Library System

User Information

In order to verify the identity of users who wish to use library services, libraries maintain user databases. In libraries associated with institutions, the user database is closely aligned with the organizational database. Sophisticated public libraries link user records through zip codes with demographic information in order to learn more about their user population. Digital libraries may or may not have any information about their users, based upon the login procedure required. No matter what data are captured about the patron, it is important to ensure that the identification information about the patron is separated from the demographic information before this information is stored in a data warehouse; doing so protects the privacy of the individual.
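One way to realize this separation — keeping demographic fields linkable across records while stripping patron identity before the warehouse load — is to replace identifying fields with a one-way token. The field names, salt handling, and hashing scheme below are illustrative assumptions for a sketch, not a procedure prescribed by this article:

```python
import hashlib

# Hypothetical patron record as it might come from an ILS user database.
patron = {
    "patron_id": "P-004211",   # identifying
    "name": "Jane Reader",     # identifying
    "zip_code": "13244",       # demographic
    "age_group": "25-34",      # demographic
}

# Assumed: a secret salt kept outside the warehouse, so tokens cannot be
# recomputed from warehouse data alone.
SALT = "local-secret-salt"

def pseudonymize(record, identifying=("patron_id", "name")):
    """Drop identifying fields and add a deterministic one-way token,
    so demographic rows can still be linked without revealing who
    the patron is."""
    token = hashlib.sha256((SALT + record["patron_id"]).encode()).hexdigest()[:16]
    cleaned = {k: v for k, v in record.items() if k not in identifying}
    cleaned["patron_token"] = token
    return cleaned

warehouse_row = pseudonymize(patron)
# warehouse_row keeps zip_code and age_group but no name or raw patron ID.
```

Because the token is deterministic, repeated loads of the same patron link together; because the salt stays outside the warehouse, the mapping back to the person does not.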
Circulation and Usage Information

The richest sources of information about library user behavior are circulation and usage records. Legal and ethical issues limit the use of circulation data, however. A data warehouse can be useful in this situation, because basic demographic information and details about the circulation could be recorded without infringing upon the privacy of the individual.

Digital library services have a greater difficulty in defining circulation, as viewing a page does not carry the same meaning as checking a book out of the library, although requests to print or save a full-text information resource might be similar in meaning. Some electronic full-text services already implement the server-side capture of such requests from their user interfaces.

Searching and Navigation Information

The OPAC serves as the primary means of searching for works owned by the library. Additionally, because most OPACs use a Web browser interface, users may also access bibliographic databases, the World Wide Web, and other online resources during the same session; all this information can be useful in library decision making. Digital libraries typically capture logs from users who are searching their databases and can track, through clickstream analysis, the elements of Web-based services visited by users. In addition, the combination of a login procedure and cookies allows the connection of user demographics to the services and searches they used in a session.

External Data Sources

Reference Desk Interactions

In the typical face-to-face or telephone interaction with a library user, the reference librarian records very little information about the interaction. Digital reference transactions, however, occur through an electronic format, and the transaction text can be captured for later analysis, which provides a much richer record than is available in traditional reference work. The utility of these data can be increased if identifying information about the user can be captured as well, but again, anonymization of these transactions is a significant challenge.

Item Use Information

Fussler and Simon (as cited in Nutter, 1987) estimated that 75 to 80% of the use of materials in academic libraries is in house. Some types of materials never circulate, and therefore, tracking in-house use is also vital in discovering patterns of use. This task becomes much easier in a digital library, as Web logs can be analyzed to discover what sources the users examined.

Interlibrary Loan and Other Outsourcing Services

Many libraries use interlibrary loan and/or other outsourcing methods to get items on a need-by-need basis for users. The data produced by this class of transactions will vary by service but can provide a window to areas of need in a library collection.

Applications of Bibliomining through a Data Warehouse

Bibliomining can provide an understanding of the individual sources listed previously in this article; however, much more information can be discovered when sources are combined through common fields in a data warehouse.

Bibliomining to Improve Library Services

Most libraries exist to serve the information needs of users, and therefore, understanding the needs of individuals or groups is crucial to a library's success. For many decades, librarians have suggested works; market basket analysis can provide the same function through usage data in order to aid users in locating useful works. Bibliomining can also be used to determine areas of deficiency and to predict future user needs. Common areas of item requests and unsuccessful searches may point to areas of collection weakness. By looking for patterns in high-use items, librarians can better predict the demand for new items.

Virtual reference desk services can build a database of questions and expert-created answers, which can be used in a number of ways. Data mining could be used to discover patterns for tools that will automatically assign questions to experts based upon past assignments. In addition, by mining the question/answer pairs for patterns, an expert system could be created that can provide users an immediate answer and a pointer to an expert for more information.

Bibliomining for Organizational Decision Making Within the Library

Just as the user behavior is captured within the ILS, the behavior of library staff can also be discovered by con-
Libraries have gathered data about their collections and users for years but have not always used those data for better decision making. By taking a more active approach based on applications of data mining, data visualization, and statistics, these information organizations can get a clearer picture of their information delivery and management needs. At the same time, libraries must continue to protect their users and employees from the misuse of personally identifiable data records. Information discovered through the application of bibliomining techniques gives the library the potential to save money, provide more appropriate programs, meet more of the users' information needs, become aware of the gaps and strengths of their collection, and serve as a more effective information source for its users. Bibliomining can provide the data-based justifications for the difficult decisions and funding requests library managers must make.

Guenther, K. (2000). Applying data mining principles to library data collection. Computers in Libraries, 20(4), 60-63.

Gutwin, C., Paynter, G., Witten, I., Nevill-Manning, C., & Frank, E. (1999). Improving browsing in digital libraries with keyphrase indexes. Decision Support Systems, 27, 81-104.

Hwang, S., & Chuang, S. (in press). Combining article content and Web usage for literature recommendation in digital libraries. Online Information Review.

Johnston, M., & Weckert, J. (1990). Selection advisor: An expert system for collection development. Information Technology and Libraries, 9(3), 219-225.

Lawrence, S., Giles, C. L., & Bollacker, K. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32(6), 67-71.
Suárez-Balseiro, C. A., Iribarren-Maestro, I., & Casado, E. S. (2003). A study of the use of the Carlos III University of Madrid Library's online database service in Scientific Endeavor. Information Technology and Libraries, 22(4), 179-182.

Vizine-Goetz, D., Weibel, S., & Oskins, M. (1990). Automating descriptive cataloging. In R. Aluri & D. Riggs (Eds.), Expert systems in libraries (pp. 123-127). Norwood, NJ: Ablex Publishing Corporation.

Wormell, I. (2003). Matching subject portals with the research environment. Information Technology and Libraries, 22(4), 158-166.

Zucca, J. (2003). Traces in the clickstream: Early work on a management information repository at the University of Pennsylvania. Information Technology and Libraries, 22(4), 175-178.

Online Public Access Catalog (OPAC): The module of the Integrated Library System designed for use by the public to allow discovery of the library's holdings through the searching of bibliographic surrogates. As libraries acquire more digital materials, they are linking those materials to the OPAC entries.

NOTE

This work is based on Nicholson, S., & Stanton, J. (2003). Gaining strategic advantage through bibliomining: Data mining for management decisions in corporate, special, digital, and traditional libraries. In H. Nemati & C. Barko (Eds.), Organizational data mining: Leveraging enterprise data resources for optimal performance (pp. 247-262). Hershey, PA: Idea Group.
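The market basket analysis mentioned under "Bibliomining to Improve Library Services" can be sketched as pairwise co-occurrence counting over anonymized checkout "baskets." The titles and the support threshold below are invented for illustration; a production system would more likely use an association-rule miner such as Apriori over far larger transaction sets:

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout baskets: the works one patron borrowed together.
baskets = [
    {"Data Mining", "Statistics 101", "SQL Primer"},
    {"Data Mining", "SQL Primer"},
    {"Statistics 101", "Poetry Anthology"},
    {"Data Mining", "SQL Primer", "Poetry Anthology"},
]

# Count how often each pair of works appears in the same transaction.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs seen at least twice (an arbitrary minimum-support threshold here)
# could drive "borrowers of this work also took..." suggestions.
suggestions = {pair for pair, n in pair_counts.items() if n >= 2}
```

Because the counting needs only item identifiers per transaction, it works on pseudonymized circulation records and never touches patron identity.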
Biomedical Data Mining Using RBF Neural Networks
Lipo Wang
Nanyang Technological University, Singapore
classified three different subtypes of lymphoma with 100% accuracy by using 48 genes (Tibshirani, Hastie, Narashiman, & Chu, 2003).

However, there is still much that can be done to improve present algorithms. In this work, we use and compare two gene selection schemes, i.e., principal components analysis (PCA) (Simon, 1999) and a t-test-based method (Tusher, Tibshirani, & Chu, 2001). After that, we introduce an RBF neural network (Fu & Wang, 2003) as the classification algorithm.

MAIN THRUST

After a comparative study of gene selection methods, a detailed description of the RBF neural network and some experimental results are presented in this section.

Microarray Data Sets

We analyze three well-known gene expression data sets, i.e., the SRBCT data set (Khan et al., 2001), the lymphoma data set (Alizadeh et al., 2000), and the leukemia data set (Golub et al., 1999).

The lymphoma data set (http://llmpp.nih.gov/lymphoma) (Alizadeh et al., 2000) contains 4026 well-measured clones belonging to 62 samples. These samples belong to the following types of lymphoid malignancies: diffuse large B-cell lymphoma (DLBCL, 42 samples), follicular lymphoma (FL, nine samples), and chronic lymphocytic leukemia (CLL, 11 samples). In this data set, a small part of the data is missing. A k-nearest neighbor algorithm was used to fill in those missing values (Troyanskaya et al., 2001).

The SRBCT data set (http://research.nhgri.nih.gov/microarray/Supplement/) (Khan et al., 2001) contains the expression data of 2308 genes. There are 63 training samples and 25 testing samples in total. Five of the testing samples are not SRBCTs. The 63 training samples contain 23 Ewing family of tumors (EWS), 20 rhabdomyosarcoma (RMS), 12 neuroblastoma (NB), and eight Burkitt lymphomas (BL). The 20 testing samples contain six EWS, five RMS, six NB, and three BL.

The leukemia data set (http://www-genome.wi.mit.edu/cgi-bin/cancer/publications) (Golub et al., 1999) has two types of leukemia, i.e., acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Among these samples, 38 are for training; the other 34 blind samples are for testing. The entire leukemia data set contains the expression data of 7,129 genes. Unlike the cDNA microarray data, the leukemia data are oligonucleotide microarray data. Because such expression data are raw data, we need to normalize them to reduce the systemic bias induced during experiments. We follow the normalization procedure used by Dudoit, Fridlyand, and Speed (2002). Three preprocessing steps were applied: (a) thresholding, with a floor of 100 and a ceiling of 16,000; (b) filtering, i.e., exclusion of genes with max/min < 5 or (max − min) < 500, where max and min refer to the maximum and the minimum of the gene expression values, respectively; and (c) base-10 logarithmic transformation. 3,571 genes survived these three steps. After that, the data were standardized across experiments, i.e., the mean of each experiment was subtracted and the result was divided by the standard deviation of that experiment.

Methods for Gene Selection

As mentioned above, gene expression data are very high-dimensional. The dimension of the input patterns is determined by the number of genes used, and in a typical microarray experiment several thousand genes take part; therefore, the dimension of the patterns is several thousand. However, only a small number of the genes contribute to correct classification; some others even act as noise. Gene selection can eliminate the influence of such noise. Furthermore, the fewer the genes used, the lower the computational burden on the classifier. Finally, once a smaller subset of genes is identified as relevant to a particular cancer, it helps biomedical researchers focus on the genes that contribute to the development of that cancer. The process of gene selection is to rank the genes' discriminative ability first and then retain the genes with high ranks.

As a critical step for classification, gene selection has been studied intensively in recent years. There are two main approaches: one is principal component analysis (PCA) (Simon, 1999), perhaps the most widely used method; the other is a t-test-based approach, which has become more and more widely accepted. In the important papers (Alizadeh et al., 2000; Khan et al., 2001), PCA was used. The basic idea of PCA is to find the most informative genes that contain most of the information in the data set. The other approach is based on the t-test, which is able to measure the difference between two groups. Thomas, Olsen, Tapscott, and Zhao (2001) recommended this method, and Tusher et al. (2001) and Pan (2002) also proposed methods based on the t-test. Besides these two main methods, there are also some others. For example, a method called Markov blanket was proposed by Xing, Jordan, and Karp (2001). Li, Weinberg, Darden, and Pedersen (2001) applied another method, which combined a genetic algorithm and K-nearest neighbor.

PCA (Simon, 1999) aims at reducing the input dimension by transforming the input space into a new space described by principal components (PCs). All the PCs are orthogonal, and they are ordered according to the absolute value of their eigenvalues. The k-th PC is the vector with the k-th largest eigenvalue. By leaving out the vectors with small eigenvalues, the input space's dimension is reduced. In fact, the PCs indicate the directions with the largest variations of the input vectors: because PCA chooses the vectors with the largest eigenvalues, it covers the directions with the largest variations of the vectors, while in the directions determined by the vectors with small eigenvalues, the variations are very small. In a word, PCA intends to capture the most informative directions (Simon, 1999).

We tested PCA on the lymphoma data set (Alizadeh et al., 2000). We obtained 62 PCs from the 4026 genes in the data set by using PCA. Then, we ranked those PCs according to their eigenvalues (absolute values). Finally, we used our RBF neural network, which will be introduced later, to classify the lymphoma data set.

At first, we randomly divided the 62 samples into two parts: 31 samples for training and the other 31 samples for testing. We then input the 62 PCs one by one to the RBF network according to their eigenvalue ranks, starting with the PC ranked one. That is, we first used only the single PC ranked 1 as the input to the RBF network. We trained the network with the training data and subsequently tested the network with the testing data. We repeated this process with the top two PCs, then the top three PCs, and so on. Figure 1 shows the testing error. From this result, we found that the RBF network cannot reach 100% accuracy. The best testing accuracy was 93.55%, which happened when 36 or 61 PCs were input to the classifier. The classification result using the t-test-based gene selection method, presented below, is much better than the PCA approach.

The t-test-based gene selection measures the difference of the genes' distributions using a t-test-based scoring scheme, i.e., a t-score (TS). After that, only the genes with the highest TSs are put into our classifier. The TS of gene i is defined as follows (Tusher et al., 2001):

TS_i = max{ (x̄_ik − x̄_i) / (s_i d_k), k = 1, 2, ..., K }

where:

x̄_ik = Σ_{j ∈ C_k} x_ij / n_k

x̄_i = Σ_{j=1}^{n} x_ij / n

s_i² = (1 / (n − K)) Σ_k Σ_{j ∈ C_k} (x_ij − x̄_ik)²

d_k = √(1/n_k + 1/n)

There are K classes. max{y_k, k = 1, 2, ..., K} is the maximum of all y_k, k = 1, 2, ..., K. C_k refers to class k, which includes n_k samples. x_ij is the expression value of gene i in sample j. x̄_ik is the mean expression value in class k for gene i. n is the total number of samples. x̄_i is the general mean expression value for gene i. s_i is the pooled within-class standard deviation for gene i. Actually, the t-score used here is a t-statistic between a specific class and the overall centroid of all the classes.

To compare the t-test-based method with PCA, we also applied it to the lymphoma data set with the same procedure as we used for PCA. This method obtained 100% accuracy with only the top six genes. The results are shown in Figure 1. This comparison indicated that the t-test-based method was much better than PCA for this problem.

Figure 1. Classification results of using PCA and the t-test-based method as gene selection methods (testing error versus number of genes)

An RBF Neural Network

An RBF neural network (Haykin, 1999) has three layers. The first layer is an input layer; the second layer is a hidden layer that includes some radial basis functions, also known as hidden kernels; the third layer is an output layer. An RBF neural network can be regarded as a mapping of the input domain X onto the output domain Y. Mathematically, an RBF neural network can be described as follows:

y_m(x) = Σ_{i=1}^{N} w_mi G(‖x − t_i‖) + b_m,  m = 1, 2, ..., M

Here ‖·‖ stands for the Euclidean norm. M is the number of outputs. N is the number of hidden kernels. y_m(x) is output m corresponding to the input x. t_i is the position of kernel i. w_mi is the weight between kernel i and output m. b_m is the bias on output m. G(‖x − t_i‖) is the kernel function. Usually, an RBF neural network uses Gaussian kernel functions as follows:

G(‖x − t_i‖) = exp(−‖x − t_i‖² / (2σ_i²))

where σ_i is the radius of kernel i.

The main steps to construct an RBF neural network include: (a) determining the positions of all the kernels (t_i); (b) determining the radius of each kernel (σ_i); and (c) calculating the weights between each kernel and each output node. In this paper, we use a novel RBF neural network proposed by Fu and Wang (2003), which allows for large overlaps of hidden kernels belonging to the same class.

Results

In the SRBCT data set, we first ranked the entire 2308 genes according to their TSs (Tusher et al., 2001). Then we picked out the 96 genes with the highest TSs. We applied our RBF neural network to classify the SRBCT data set, which contains 63 samples for training and 20 blind samples for testing. We input the selected 96 genes one by one to the RBF network according to their TS ranks, starting with the gene ranked one. We repeated this process with the top two genes, then the top three genes, and so on. Figure 2 shows the testing errors with respect to the number of genes. The testing error decreased to 0 when the top seven genes were input into the RBF network.

In the leukemia data set, we chose the 56 genes with the highest TSs (Tusher et al., 2001). We followed the same procedure as in the SRBCT data set: we did the classification with one gene, then two genes, then three genes, and so on. Our RBF neural network reached an accuracy of 97.06%, i.e., one error in all 34 samples, when 12, 20, 22, or 32 genes were input, respectively. The results are shown in Figure 3.

Figure 2. The testing result in the SRBCT data set (testing error versus number of genes)

Figure 3. The testing result in the leukemia data set (testing error versus number of genes)

FUTURE TRENDS

Until now, the focus of this work has been investigating the information with statistical importance in microarray data sets. In the near future, we will try to incorporate more biological knowledge into our algorithm, especially the correlations of genes.

In addition, with more and more microarray data sets produced in laboratories around the world, we will try to mine multiple data sets with our RBF neural network, i.e., we will try to process the combined data sets. Such an attempt will hopefully bring us a much broader and deeper insight into those data sets.

CONCLUSION

Through our experiments, we conclude that the t-test-based gene selection method is an appropriate feature selection/dimension reduction approach, which can find more important genes than PCA can.

The results in the SRBCT data set and the leukemia data set proved the effectiveness of our RBF neural network. In the SRBCT data set, it obtained 100% accuracy with only seven genes. In the leukemia data set, it made only one error with 12, 20, 22, and 32 genes, respectively. In view of this, we also conclude that our RBF neural network outperforms almost all the previously published methods in terms of accuracy and the number of genes required.

REFERENCES

Alizadeh, A.A. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.
Dudoit, S., Fridlyand, J., & Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77-87.

Fu, X., & Wang, L. (2003). Data dimensionality reduction with application to simplifying RBF neural network structure and improving classification performance. IEEE Trans. Syst., Man, Cybernetics, Part B: Cybernetics, 33, 399-409.

Golub, T.R. et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389-422.

Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). New Jersey, USA: Prentice-Hall, Inc.

Khan, J.M. et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673-679.

Li, L., Weinberg, C.R., Darden, T.A., & Pedersen, L.G. (2001). Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17, 1131-1142.

Olshen, A.B., & Jain, A.N. (2002). Deriving quantitative conclusions from microarray expression data. Bioinformatics, 18, 961-970.

Pan, W. (2002). A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics, 18, 546-554.

Schena, M., Shalon, D., Davis, R.W., & Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467-470.

Thomas, J.G., Olsen, J.M., Tapscott, S.J., & Zhao, L.P. (2001). An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Research, 11, 1227-1236.

Tibshirani, R., Hastie, T., Narashiman, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA, 99, 6567-6572.

Tibshirani, R., Hastie, T., Narashiman, B., & Chu, G. (2003). Class prediction by nearest shrunken centroids with applications to DNA microarrays. Statistical Science, 18, 104-117.

Troyanskaya, O., Cantor, M., & Sherlock, G., et al. (2001). Missing value estimation methods for DNA microarrays.

Tusher, V.G., Tibshirani, R., & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98, 5116-5121.

Xing, E.P., Jordan, M.I., & Karp, R.M. (2001). Feature selection for high-dimensional genomic microarray data. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 601-608). Morgan Kaufmann Publishers, Inc.

KEY TERMS

Feature Extraction: Feature extraction is the process of obtaining a group of features with the characteristics we need from the original data set. It usually uses a transform (e.g., principal component analysis) to obtain a group of features in one computation.

Feature Selection: Feature selection is the process of selecting the features we need from all the original features. It usually measures a characteristic (e.g., a t-test score) of each feature first and then chooses the features we need.

Gene Expression Profile: Through microarray chips, an image that describes to what extent genes are expressed can be obtained. It usually uses red to indicate a high expression level and green to indicate a low expression level. This image is also called a gene expression profile.

Microarray: A microarray is also called a gene chip or a DNA chip. It is a recently developed biotechnology that allows biomedical researchers to monitor thousands of genes simultaneously.

Principal Components Analysis: Principal components analysis transforms one vector space into a new space described by principal components (PCs). All the PCs are orthogonal to each other, and they are ordered according to the absolute value of their eigenvalues. By leaving out the vectors with small eigenvalues, the dimension of the original vector space is reduced.

Radial Basis Function (RBF) Neural Network: An RBF neural network is a kind of artificial neural network. It usually has three layers, i.e., an input layer, a hidden layer, and an output layer. The hidden layer of an RBF neural network contains some radial basis functions, such as Gaussian functions or polynomial functions, to transform the input vector space into a new nonlinear space. An RBF neural network has the universal approximation ability, i.e., it can approximate any function to any accuracy, as long as there are enough hidden neurons.
Bioinformatics, 17, 520-525.
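The principal components analysis entry above can be made concrete with a short sketch. The data matrix below is purely illustrative (random values standing in for an expression matrix); this is one minimal way to implement the definition, not code from the chapter.

```python
import numpy as np

def pca_reduce(data, n_components):
    """Project the rows of `data` onto the top principal components.

    Follows the key-term definition: center the data, compute the
    orthogonal principal components (eigenvectors of the covariance
    matrix), order them by eigenvalue magnitude, and drop the rest.
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: ascending order
    order = np.argsort(np.abs(eigvals))[::-1]    # largest |eigenvalue| first
    components = eigvecs[:, order[:n_components]]
    return centered @ components                 # dimension-reduced data

# Toy "expression matrix": 6 samples x 4 genes (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Z = pca_reduce(X, 2)
print(Z.shape)  # (6, 2)
```

Here each of the six samples is reduced from four features to two, as in the feature-extraction entry: the transform produces the whole group of new features in one computation.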
Biomedical Data Mining Using RBF Neural Networks
T-Test: The t-test is a statistical method that measures how large the difference between two groups of samples is.

Testing a Neural Network: To determine whether a trained neural network is the mapping or regression we need, we test the network with data that have not been used in the training process. This procedure is called testing a neural network.

Training a Neural Network: Training a neural network means using some known data to build the structure and tune the parameters of the network. The goal of training is to make the network represent a mapping or regression we need.
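The t-test entry can be illustrated with a few lines. The two sample groups below are hypothetical expression values, and the statistic is computed in Welch's form; this is a sketch of the idea, not the chapter's implementation.

```python
import numpy as np

def t_statistic(a, b):
    """Two-sample t-statistic (Welch's form): the difference of the
    group means scaled by the pooled standard error. A large |t|
    suggests the two groups of samples differ in their means."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

# Hypothetical expression values of one gene in two sample groups.
healthy = [1.0, 1.2, 0.9, 1.1, 1.0]
tumour = [2.1, 2.3, 1.9, 2.2, 2.0]
print(round(t_statistic(tumour, healthy), 2))  # a large positive t
```

Ranking genes by |t| in this way is exactly the feature-selection recipe described earlier: score each feature first, then keep the best-scoring ones.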
Yuan Zhao
Nanyang Technological University, Singapore
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Building Empirical-Based Knowledge for Design Recovery
Figure 1. An overview of the proposed design recovery (box labels: Identify Designs or Properties; Identify Program Characteristics and Formulate Hypotheses; Develop Software Tools to Aid Experiments for Hypotheses Testing; Develop Theory for the Inference of Designs; Develop Software Tool to Implement the Algorithms; Conduct Experiment to Validate the Effectiveness)

Secondly, we design experiments to validate the hypotheses. Some software tools should be developed to automate or semi-automate the characterization of the properties stated in the hypotheses. We may merge multiple hypotheses together into a single hypothesis for the convenience of hypothesis testing. An experiment is designed to conduct a binomial test (Gravetter & Wallnau, 2000) for each resulting hypothesis. If altogether we have k hypotheses denoted by H1, ..., Hk, and we would like the probability of validity of the proposed design recovery to be more than or equal to q, then we must ...

For the use of the normal approximation for the binomial test, both np_j and n(1-p_j) must be greater than or equal to 10. As such, the sample size n must be greater than or equal to max(10/p_j, 10/(1-p_j)). The experiment ...
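The sample-size rule for the normal approximation (both np_j and n(1-p_j) at least 10) can be checked with a few lines; the value of p_j below is purely illustrative.

```python
import math

def min_sample_size(p, threshold=10):
    """Smallest n with n*p >= threshold and n*(1-p) >= threshold,
    i.e. n >= max(threshold/p, threshold/(1-p))."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.ceil(max(threshold / p, threshold / (1 - p)))

print(min_sample_size(0.4))  # max(10/0.4, 10/0.6) = max(25, 16.7) -> 25
```

For p_j = 0.4 the binding constraint is np_j >= 10, giving n = 25; for p_j = 0.5 both constraints coincide at n = 20.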
functional dependencies are not discovered in the initial system development. They are only identified during the system maintenance stage.

Although keys can be used to implement functional dependencies in old-generation DBMSs, due to the effort of restructuring databases during the system maintenance stage, most functional dependencies identified during this stage are not defined explicitly as keys in the databases. They are enforced in transactions. Furthermore, most of the conventional files and relational databases allow only the definition of one key. As such, most of the candidate keys are enforced in transactions. As a result, many functional dependencies in a legacy database are enforced in the transactions that update the database. Our previous work (Tan, 2004) has proven that if all the transactions that update a database satisfy a set of properties with reference to a functional dependency, X → Y of R, then the functional dependency holds in the database. Before proceeding further, we shall discuss these properties.

For generality, we shall express a program path in the form (a1, ..., an), where each ai is either a node or of the form {n_i1, .., n_ik}, in which n_i1, .., n_ik are nodes. If ai is of the form {n_i1, .., n_ik}, then in (a1, ..., an), after the predecessor node of ai, the subsequence of nodes n_i1, .., n_ik may occur any number of times (possibly zero) before proceeding to the successor of ai.

Before proceeding further, we shall introduce some notations to represent certain types of nodes in control flow graphs of transactions that will be used throughout this paper:

- rd(R, W == b): A node that reads or selects an R record in which the W-value is b, if it exists.
- mdf(R, W == b, Z1 := c1, .., Zn := cn): A node that modifies the Z1-value, .., Zn-value of an R record, in which the original W-value is b, to c1, .., cn, respectively.
- ins(R, Z1 := c1, .., Zn := cn): A node that inserts an R record in which the Z1-value, .., Zn-value are set to c1, .., cn, respectively.

Here, R is a record type, and W, Z1, ..., Zn are sequences of R attributes. The values of those attributes that are not mentioned in the mdf and ins nodes can be modified and set to any value.

We shall also use xclu_fd(X → Y of R) to denote a node in the control flow graph of a transaction that does not perform any updating that may lead to the violation of the functional dependency X → Y of R.

A program path of the form (rd(R, X == x), {xclu_fd(X → Y of R)}, {{xclu_fd(X → Y of R)}, ins(R, X := x, Y := y), {xclu_fd(X → Y of R)}}), such that, if the rd node successfully reads the specified R record, y is identical to the Y-value of the record read, is called an insertion pattern for enforcing the FD, X → Y of R.

A program path of the form ({{xclu_fd(X → Y of R)}, mdf(R, X == x0, Y := y), {xclu_fd(X → Y of R)}}), such that all the R records in which the X-values are equal to x0 are modified by the mdf node, is called a Y-modification pattern for enforcing the FD, X → Y of R.

A program path of the form (rd(R, X == x), {xclu_fd(X → Y of R)}, {{xclu_fd(X → Y of R)}, mdf(R, X == x0, X := x, Y := unchange), {xclu_fd(X → Y of R)}}), such that the mdf node is only executed if the rd node does not successfully read the specified R record, is called an X-modification pattern for enforcing the FD, X → Y of R.

We have proven the following rules (Tan, 2004):

- Nonviolation of FD: In a transaction, if all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency, X → Y of R, then the transaction does not violate the functional dependency.
- FD in Database: Each transaction that updates any record type involved in a functional dependency does not violate the functional dependency if and only if it is a functional dependency designed in the database.

Theoretically, the property stated in the first rule is not a necessary property in order for the functional dependency to hold. As such, we may not be able to recover all functional dependencies enforced by recognizing these properties. Fortunately, other than very exceptional cases, most enforcement of functional dependencies does result in the previously mentioned property. As such, empirically, the property is usually also necessary for the functional dependency, X → Y of R, to hold in the database. Thus, we take the following hypothesis.

Hypothesis 1: If a transaction does not violate the functional dependency, X → Y of R, then all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency.

With the hypothesis, the result discussed by Tan and Thein (in press) can be extended to the following theorem.
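The functional dependency X → Y of R itself can be checked directly on a set of records. The record layout and attribute names below are illustrative, not from the chapter; this sketch only tests whether the dependency holds in stored data, not whether transactions enforce it.

```python
def fd_holds(records, x_attrs, y_attrs):
    """Return True if the functional dependency X -> Y holds in
    `records`: any two records with identical X-values must also
    have identical Y-values."""
    seen = {}
    for rec in records:
        x_val = tuple(rec[a] for a in x_attrs)
        y_val = tuple(rec[a] for a in y_attrs)
        # setdefault remembers the first Y-value seen for this X-value.
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True

# Hypothetical R records: emp -> dept should hold, task -> dept should not.
R = [
    {"emp": "e1", "dept": "sales", "task": "t1"},
    {"emp": "e1", "dept": "sales", "task": "t2"},
    {"emp": "e2", "dept": "audit", "task": "t1"},
]
print(fd_holds(R, ["emp"], ["dept"]))   # True
print(fd_holds(R, ["task"], ["dept"]))  # False: t1 maps to two depts
```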
FUTURE TRENDS

In general, many designs are very difficult (if not impossible) to prove formally from program characteristics (Chandra, Godefroid, & Palm, 2002; Clarke, Grumberg, & Peled, 1999; Deng & Kothari, 2002). Therefore, the use of empirical-based knowledge is important for the automated recognition of designs from program source codes (Embury & Shao, 2001; Tan, Ling, & Goh, 2002; Wong, 2001). We believe that the integration of empirical-based properties into existing program analysis and model-checking techniques will be a fruitful direction in the future.

Table 1. The statistics of an experiment

Transaction | Enforcement of the Functional Dependency
            | Correct | Wrong
1           | 55      | 16
2           | 65      | 6
3           | 36      | 35

CONCLUSION

Empirical-based knowledge has been used in the recognition of designs from source codes through automated program analysis. It is a promising research direction. This chapter introduces the approach for building empirical-based knowledge, a vital part of such research exploration. We have also applied it in the recognition of functional dependencies enforced in database transactions. We believe that our approach will encourage more exploration of the discovery and use of empirical-based knowledge in this area. Recently, we have completed our work on the recovery of posttransaction user-input error (PTUIE) handling for database transactions. This approach appeared in IEEE Transactions on Knowledge and Data Engineering (Tan & Thein, 2004).

REFERENCES

Basili, V. R. (1996). The role of experimentation in software engineering: Past, current, and future. The 18th International Conference on Software Engineering (pp. 442-449), Germany.

Beizer, B. (1990). Software testing techniques. New York: Van Nostrand Reinhold.

Chandra, S., Godefroid, P., & Palm, C. (2002). Software model checking in practice: An industrial case study. Proceedings of the International Conference on Software Engineering (pp. 431-441), USA.

Clarke, E. M., Grumberg, O., & Peled, D. A. (1999). Model checking. MIT Press.

Deng, Y. B., & Kothari, S. (2002). Recovering conceptual roles of data in a program. Proceedings of the International Conference on Software Maintenance (pp. 342-350), Canada.

Embury, S. M., & Shao, J. (2001). Assisting the comprehension of legacy transactions. Proceedings of the Working Conference on Reverse Engineering (pp. 345-354), Germany.

Ferrante, J., Ottenstein, K. J., & Warren, J. O. (1987). The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3), 319-349.

Gravetter, F. J., & Wallnau, L. B. (2000). Statistics for the behavioral sciences. Belmont, CA: Wadsworth.

Kitchenham, B. A., Pfleeger, S. L., Pickard, L. M., Jones, P. W., Hoaglin, D. C., Emam, K. E., & Rosenberg, J. (2002). Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8), 721-734.

Kozaczynski, W., Ning, J., & Engberts, A. (1992). Program concept recognition and transformation. IEEE Transactions on Software Engineering, 18(12), 1065-1075.

Tan, H. B. K., Ling, T. W., & Goh, C. H. (2002). Exploring into programs for the recovery of data dependencies designed. IEEE Transactions on Knowledge and Data Engineering, 14(4), 825-835.

Tan, H. B. K., & Thein, N. L. (in press). Recovery of PTUIE handling from source codes through recognizing its probable properties. IEEE Transactions on Knowledge and Data Engineering, 16(10), 1217-1231.

Ullman, J. D. (1982). Principles of database systems (2nd ed.). Rockville, MD: Computer Science Press.

Wong, K. (2001). Research challenges in reverse engineering community. Proceedings of the International Workshop on Program Comprehension (pp. 323-332), Canada.

KEY TERMS

Control Flow Graph: An abstract data structure used in compilers. It is an abstract representation of a procedure or program, maintained internally by a compiler. Each node in the graph represents a basic block. Directed edges represent jumps in the control flow.

Design Recovery: Recreates design abstractions from a combination of code, existing design documentation (if available), personal experience, and general knowledge about problem and application domains.

Functional Dependency: For any record r in a record type, its sequence of values of the attributes in X is referred to as the X-value of r. Let R be a record type, and X and Y be sequences of attributes of R. We say that the functional dependency, X → Y of R, holds at time t if, at time t, for any two R records r and s whose X-values are identical, the Y-values of r and s are also identical.

Hypothesis Testing: Hypothesis testing refers to the process of using statistical analysis to determine if the observed differences between two or more samples are
due to random chance (as stated in the null hypothesis) or to true differences in the samples (as stated in the alternate hypothesis).

Model Checking: A method for formally verifying finite-state concurrent systems. Specifications about the system are expressed as temporal logic formulas, and efficient symbolic algorithms are used to traverse the model defined by the system and check whether the specification holds.

...: Refers to the set of values or behaviors arising dynamically at runtime when executing a program on a computer.

PTUIE: Posttransaction user-input error. An error made by users in an input to a transaction execution and discovered only after completion of the execution.

Transaction: An atomic set of processing steps in a database application such that all the steps are performed either fully or not at all.
Business Processes
David Sundaram
The University of Auckland, New Zealand
Victor Portougal
The University of Auckland, New Zealand
It is worth exploring each of the phrases within this definition.

achieves a particular result

- The result might be goods and/or services.
- It should be possible to identify and count the result, e.g., fulfilment of orders, resolution of complaints, raising of purchase orders, etc.

for the customer of the process

- Every process has a customer. The customer may be internal (an employee) or external (an organisation).
- A key requirement is that the customer should be able to give feedback on the process.

Business Process Classification

Over the years many classifications of processes have been suggested. The American Productivity & Quality Center (Process Classification Framework, 1996) distinguishes two types of processes: 1) operating processes and 2) management and support processes. Operating processes include processes such as:

- Understanding Markets and Customers
- Development of Vision and Strategy
- Design of Products and Services
- Marketing and Selling of Products and Services
- Production and Delivery of Products and Services
- Invoicing and Servicing of Customers
In contrast, management and support processes include processes such as:

... standable patterns in data. Some of the key steps of the data mining business process involve:
Figure 1. ARIS views/house of business engineering (Scheer, 2000)

... view. This view, however, is significant for the subject-related view of business processes only when it gives an opportunity for describing in full the other components that are more directly linked toward the business.
wrong business practices. Thus a designer could avoid wrong design solutions.

A repository of such mal-processes would enable organisations to avoid typical mistakes in enterprise system design. Such a repository can be a valuable asset in the education of ERP designers, and it can be useful in general management education as well. Another application of this repository might be troubleshooting. For example, sources of some nasty errors in the sales and distribution system might be found in the mal-processes of data entry in finished goods handling.

FUTURE TRENDS

There are three key trends that characterise business processes: digitisation (automation), integration (intra- and inter-organisational), and lifecycle management (Kalakota & Robinson, 2003). Digitisation involves the attempts by many organisations to completely automate as many of their processes as possible. Another equally important initiative is the seamless integration and coordination of processes within and without the organisation: backward to the suppliers, forward to the customers, and vertically across operational, tactical, and strategic business processes. The management of both these initiatives/trends depends to a large extent on the proper management of processes throughout their lifecycle: from process identification, process modelling, process analysis, process improvement, process implementation, and process execution to process monitoring/controlling (Rosemann, 2001). Implementing such a lifecycle orientation enables organizations to move in benign cycles of improvement and to sense, respond, and adapt to the changing environment (internal and external). All these trends require not only the use of Enterprise Systems as a foundation but also data warehousing and data mining solutions. These trends will continue to be major drivers of the enterprise of the future.

CONCLUSION

In this modern landscape, business processes and techniques and tools for data warehousing and data mining are intricately linked together. The impacts are not just one way. Concepts from business processes could be and are used to make the data mining and data warehousing processes an integral part of organizational processes. Data warehousing and data mining processes are a regular part of organizational business processes, enabling the conversion of operational information into tactical and strategic level information. An example of this is the Cross Industry Standard Process (CRISP) for Data Mining (CRISP-DM, 2004), which provides a lifecycle process-oriented approach to the mining of data in organizations. Apart from this, data warehousing and data mining techniques are a key element in various steps of the process lifecycle alluded to in the trends. They help us not only in process identification and analysis (identifying and analyzing candidates for improvement) but also in the execution, monitoring, and control of processes. Data warehousing technologies enable the collecting, aggregating, slicing, and dicing of process information, while data mining technologies enable the search for patterns in process information, allowing us to monitor and control organizational processes more efficiently. Thus there is a symbiotic, mutually enhancing relationship between business processes and data warehousing and data mining, at the conceptual level as well as at the technology level.

REFERENCES

APQC (1996). Process classification framework (pp. 1-6). APQC's International Benchmark Clearinghouse & Arthur Andersen & Co.

Bider, I., Johannesson, P., & Perjons, E. (2002). Goal-oriented patterns for business processes. Workshop on Goal-Oriented Business Process Modelling, London.

CRISP-DM (2004). Retrieved from http://www.crisp-dm.org

Curran, T., Keller, G., & Ladd, A. (1998). SAP R/3 business blueprint: Understanding the business process reference model. Upper Saddle River, NJ: Prentice Hall.

Davenport, T. H. (1993). Process innovation. Boston, MA: Harvard Business School Press.

Davis, R. (2001). Business process modelling with ARIS: A practical guide. UK: Springer-Verlag.

Genovese, Y., Bond, B., Zrimsek, B., & Frey, N. (2001). The transition to ERP II: Meeting the challenges. Gartner Group.

Hammer, M., & Champy, J. (1993). Re-engineering the corporation: A manifesto for business revolution. New York: Harper Business.

Jacobson, I. (1995). The object advantage. Addison-Wesley.

Kalakota, R., & Robinson, M. (2003). Services blueprint: Roadmap for execution. Boston: Addison-Wesley.
Lindsay, D., Downs, K., & Lunn (2003). Business processes - attempts to find a definition. Information and Software Technology, 45, 1015-1019.

Ould, A. M. (1995). Business processes: Modelling and analysis for reengineering. Wiley.

Rosemann, M. (2001, March). Business process lifecycle management. Queensland University of Technology.

Scheer, A.-W. (2000). ARIS methods. IDS.

Scheer, A.-W., & Habermann, F. (2000). Making ERP a success. Communications of the ACM, 43(4), 57-61.

Sharp, A., & McDermott, P. (2001). Just what are processes, anyway? Workflow modeling: Tools for process improvement and application development (pp. 53-69).

KEY TERMS

ARIS: Architecture of Integrated Information Systems, a modeling and design tool for business processes.

as-is Business Process: Current business process.

to-be Business Process: Re-engineered business process.

Business Process: A business process is a collection of interrelated work tasks, initiated in response to an event, that achieves a specific result for the customer of the process.

Digitisation: Measures that automate processes.

ERP: Enterprise resource planning system, a software system for enterprise management. It is also referred to as Enterprise Systems (ES).

Functional Areas: Companies that make products to sell have several functional areas of operations. Each functional area comprises a variety of business functions or business activities.

Integration of Processes: The coordination and integration of processes seamlessly within and without the organization.

Mal-Processes: A sequence of actions that a system can perform, interacting with a legal user of the system, resulting in harm for the organization or a stakeholder.

Process Lifecycle Management: Activities undertaken for the proper management of processes, such as identification, analysis, improvement, implementation, execution, and monitoring.
Francesco Ricci
eCommerce and Tourism Research Laboratory, ITC-irst, Italy
Case-Based Recommender Systems
Figure 1. CBR-RS framework (Lorenzi & Ricci, 2004)

This means that a case c = (x, u, s, e) ∈ CB generally consists of four (optional) sub-elements x, u, s, e, which are elements of the spaces X, U, S, E, respectively. Each CBR-RS adopts a particular model for the spaces X, U, S, E. These spaces could be empty, vectors, sets of documents (textual), labeled graphs, etc.
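As a concrete illustration of the retrieve step over such a case base, here is a minimal similarity-based retrieval sketch. The vector case model, the inverse-distance similarity, and all identifiers are assumptions for illustration, not the framework's actual implementation.

```python
import math

def retrieve(case_base, query, k=2):
    """Rank cases by similarity of their content component x to the
    query (an inverse Euclidean-distance similarity) and return the
    k most similar cases."""
    def sim(x):
        return 1.0 / (1.0 + math.dist(x, query))
    return sorted(case_base, key=lambda c: sim(c["x"]), reverse=True)[:k]

# Toy case base: here each case only carries a content component x
# (the other sub-elements u, s, e are left out for brevity).
cases = [
    {"id": "c1", "x": (1.0, 0.0)},
    {"id": "c2", "x": (0.9, 0.1)},
    {"id": "c3", "x": (5.0, 5.0)},
]
top = retrieve(cases, query=(0.95, 0.1))
print([c["id"] for c in top])  # ['c2', 'c1']
```

Real CBR-RSs replace this simple metric with domain-specific similarities over the richer spaces X, U, S, E described above.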
In the next stage the reused case component is adapted to better fit the new problem. Mostly, the adaptation in CBR-RSs is implemented by allowing the user to customize the retrieved set of products. This can also be implemented as a query refinement task. For example, in Comparison-based Retrieval (McGinty & Smyth, 2003) the system asks the user for feedback (positive or negative) about the retrieved product and with this information it updates the user query.

The last step of the CBR recommendation cycle is the retain phase (or learning), where the new case is retained in the case base. In DieToRecs (Fesenmaier, Ricci, Schaumlechner, Wober, & Zanella, 2003), for instance, all the user/system recommendation sessions are stored as cases in the case base.

The next subsections describe very briefly some representative CBR-RSs, focusing on their peculiar characteristics (see Lorenzi & Ricci, 2004, for the complete report).

Interest Confidence Value (ICV)

Montaner, Lopez, and la Rosa (2002) assume that the user's interest in a new product is similar to the user's interest in similar past products. This means that when a new product comes up, the recommender system predicts the user's interest in it based on interest attributes of similar experiences. A case is modeled by objective attributes describing the product (content model) and subjective attributes describing implicit or explicit interests of the user in this product (evaluation model), i.e., c ∈ X × E. In the evaluation model, the authors introduced the drift attribute, which models a decaying importance of the case as time goes by and the case is not used.

The system can recommend in two different ways: prompted or proactive. In prompted mode, the user provides some preferences (weights in the similarity metric) and the system retrieves similar cases. In the proactive recommendation, the system does not have the user preferences, so it estimates the weights using past interactions.

In the reuse phase the system extracts the interest values of retrieved cases, and in the revise phase it calculates the interest confidence value of a restaurant to decide whether this should be recommended to the user or not. The adaptation is done by asking the user for the correct evaluation of the product, and after that a new case (the product and the evaluation) is retained in the case base. It is worth noting that in this approach the recommended product is not retrieved from the case base; rather, the retrieved cases are used to estimate the user's interest in this new product. This approach is similar to that used in DieToRecs in the single item recommendation function.

DieToRecs (DTR)

DieToRecs helps the user to plan a leisure travel (Fesenmaier, Ricci, Schaumlechner, Wober, & Zanella, 2003). We present here two different approaches (decision styles) implemented in DieToRecs: the single item recommendation (SIR) and the travel completion (TC). A case represents a user interaction with the system, and it is built incrementally during the recommendation session. A case comprises all the quoted models: content, user profile, session, and evaluation model.

SIR starts with the user providing some preferences. The system searches the catalog for products that (logically) match these preferences and returns a result set. This is not to be confused with the retrieval set, which contains a set of similar past recommendation sessions. The products in the result set are then ranked with a double similarity function (Ricci et al., 2003) in the revise stage, after a set of relevant recommendation sessions has been retrieved.

In the TC function the cycle starts with the user's preferences too, but the system retrieves from the case base the cases matching the user's preferences. Before recommending the retrieved cases to the user, the system in the revise stage updates, or replaces, the travel products contained in the case, exploiting up-to-date information taken from the catalogues. In the review phase the system allows the user to reconfigure the recommended travel plan: the user can replace, add, or remove items in the recommended travel. When the user accepts the outcome (the final version of the recommendation shown to the user), the system retains this new case in the case base.

Compromise-Driven Retrieval (CDR)

CDR models a case only by the content component (McSherry, 2003a). In CDR, if a given case c1 is more similar to the target query than another case c2, and differs from the target query in a subset of the attributes in which c2 differs from the target query, then c1 is more acceptable than c2.

In the CDR retrieval algorithm the system sorts all the cases in the case base according to their similarity to a given query. In a second stage, it groups together the cases making the same compromise (i.e., not matching a user-preferred attribute value) and builds a reference set with just one case for each compromise group. The cases in the reference set are recommended to the user. The user can also refine (review) the original query, accepting one compromise and adding some preference on a different attribute (not one already specified). The system will further decompose the set of cases corresponding to the selected compromise. The revise and retain phases do not appear in this approach.
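The two-stage CDR procedure, sort by similarity, then keep one representative per compromise group, can be sketched as follows. The attribute-count similarity and the travel-product attributes are simplifying assumptions for illustration, not McSherry's actual formulation.

```python
def cdr_reference_set(cases, query):
    """Compromise-driven retrieval sketch: a case's 'compromise' is
    the set of query attributes it fails to match; cases are sorted
    by how few attributes they compromise on (a crude similarity),
    and the best case of each distinct compromise group enters the
    reference set."""
    def compromise(case):
        return frozenset(a for a, v in query.items() if case.get(a) != v)

    ranked = sorted(cases, key=lambda c: len(compromise(c)))
    reference, seen = [], set()
    for case in ranked:
        comp = compromise(case)
        if comp not in seen:  # keep one case per compromise group
            seen.add(comp)
            reference.append(case)
    return reference

# Hypothetical travel products.
query = {"region": "alps", "sport": "ski"}
cases = [
    {"id": 1, "region": "alps", "sport": "ski"},
    {"id": 2, "region": "alps", "sport": "hike"},
    {"id": 3, "region": "coast", "sport": "ski"},
    {"id": 4, "region": "alps", "sport": "bike"},
]
ids = [c["id"] for c in cdr_reference_set(cases, query)]
print(ids)  # [1, 2, 3]: case 4 shares case 2's compromise on sport
```

Case 4 is withheld because it makes the same compromise (on sport) as the more similar case 2, exactly the grouping behaviour described above.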
Table 1. Comparison of the CBR-RSs

Approach | Retrieval             | Reuse     | Revise         | Review     | Retain
ICV      | Similarity            | IC value  | IC computation | Feedback   | Default
SIR      | Similarity            | Selective | Rank           | User edit  | Default
TC       | Similarity            | Default   | Logical query  | User edit  | Default
OBR      | Similarity + Ordering | Default   | None           | Tweak      | None
CDR      | Similarity + Grouping | Default   | None           | Tweak      | None
EC       | Similarity            | Default   | None           | Feedback   | None

ExpertClerk (EC)

ExpertClerk is a tool for developing a virtual salesclerk system as a front-end of e-commerce Web sites (Shimazu, 2002). The system implements a question selection method (a decision tree with information gain). Using navigation-by-asking, the system starts the recommendation session by asking questions to the user. The questions are nodes in a decision tree. A question node subdivides the set of answer nodes, and each one of these represents a different answer to the question posed by the question node. The system concatenates all the answer nodes chosen by the user and from them constitutes the SQL retrieval condition expression.

This query is applied to the case base to retrieve the set of cases that best match the user query. Then, the system shows three sample products to the user and explains their characteristics (positive and negative). In the review phase, the system switches to the navigation-by-proposing conversation mode and allows the user to refine the query. After refinement, the system applies the new query to the case base and retrieves new cases. These cases are ranked and shown to the user. The cycle continues until the user finds a preferred product. In this approach the revise and the retain phases are not implemented.

FUTURE TRENDS

This paper presented a review of the literature on CBR recommender systems. We have found that it is often unclear how and why a proposed recommendation methodology can be defined as case-based, and therefore we have introduced a general framework that can illustrate similarities and differences of the various approaches. Moreover, we have found that the classical CBR problem-solving loop is implemented only partially, and sometimes it is not clear whether a CBR stage (retrieve, reuse, revise, review, retain) is implemented or not. For this reason, the proposed unifying framework makes possible a coherent description of different CBR-RSs. In addition, an extensive usage of this framework can help describe in which sense a recommender system exploits the classical CBR cycle, and can point out new interesting issues to be investigated in this area, for instance, the possible ways to adapt retrieved cases to improve the recommendation and how to learn these adapted cases.

We believe that with such a common view it will be easier to understand what the research projects in the area have already delivered, how the existing CBR-RSs behave, and which are the topics that could be better exploited in future systems.

CONCLUSION

In the previous sections we have briefly analyzed eight different CBR recommenders. Table 1 shows the main features of these approaches.

Some observations are in order. The majority of the CBR-RSs stress the importance of the retrieval phase. Some systems perform retrieval in two steps: first, cases are retrieved by similarity, and then the cases are grouped or filtered. The use of pure similarity does not seem to be enough to retrieve a set of cases that satisfy the user. This seems to be true especially in those application domains that require a complex case structure (e.g., travel plans) and therefore require the development of hybrid solutions for case retrieval.

The default reuse phase is used in the majority of the CBR-RSs, i.e., all the retrieved cases are recommended to the user. ICV and SIR have implemented the reuse phase in different ways. In SIR, for instance, the system can retrieve just part of the case. The same systems that implemented non-trivial reuse approaches have also implemented both the revise phase, where the cases are adapted, and the retain phase, where the new (adapted) case is stored.

All the CBR-RSs analyzed implement the review phase, allowing the user to refine the query. Normally the system expects some feedback from the user (positive or negative), new requirements, or a product selection.

REFERENCES

Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 39-59.

Bergmann, R., Richter, M., Schmitt, S., Stahl, A., & Vollrath, I. (2001). Utility-oriented matching: A new research direction for case-based reasoning. 9th German Workshop on Case-Based Reasoning, GWCBR'01 (pp. 14-16), Baden-Baden, Germany.
Case-Based Recommender Systems
Burke, R. (2000). Knowledge-based recommender systems. Encyclopedia of Library and Information Science, Vol. 69.

Fesenmaier, D., Ricci, F., Schaumlechner, E., Wober, K., & Zanella, C. (2003). DIETORECS: Travel advisory for multiple decision styles. Information and Communication Technologies in Tourism, 232-241.

Lorenzi, F., & Ricci, F. (2004). A unifying framework for case-based reasoning recommender systems. Technical Report, IRST.

McGinty, L., & Smyth, B. (2002). Comparison-based recommendation. Advances in Case-Based Reasoning, 6th European Conference on Case-Based Reasoning, ECCBR 2002 (pp. 575-589), Aberdeen, Scotland.

McGinty, L., & Smyth, B. (2003). The power of suggestion. 18th International Joint Conference on Artificial Intelligence, IJCAI-03 (pp. 276-290), Acapulco, Mexico.

McSherry, D. (2003a). Increasing dialogue efficiency in case-based reasoning without loss of solution quality. 18th International Joint Conference on Artificial Intelligence, IJCAI-03 (pp. 121-126), Acapulco, Mexico.

McSherry, D. (2003b). Similarity and compromise. 5th International Conference on Case-Based Reasoning, ICCBR 2003 (pp. 291-305), Trondheim, Norway.

Montaner, M., Lopez, B., & de la Rosa, J.L. (2002). Improving case representation and case base maintenance in recommender systems. Advances in Case-Based Reasoning, 6th European Conference on Case-Based Reasoning, ECCBR 2002 (pp. 234-248), Aberdeen, Scotland.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of Netnews. ACM Conference on Computer-Supported Cooperative Work (pp. 175-186).

Ricci, F., Venturini, A., Cavada, D., Mirzadeh, N., Blaas, D., & Nones, M. (2003). Product recommendation with interactive query management and twofold similarity. 5th International Conference on Case-Based Reasoning, ICCBR 2003 (pp. 479-493), Trondheim, Norway.

Schafer, J.B., Konstan, J.A., & Riedl, J. (2001). E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1/2), 115-153.

Shimazu, H. (2002). ExpertClerk: A conversational case-based reasoning tool for developing salesclerk agents in e-commerce webshops. Artificial Intelligence Review, 18, 223-244.

Witten, I.H., & Frank, E. (2000). Data mining. Morgan Kaufmann.

KEY TERMS

Case-Based Reasoning: An Artificial Intelligence approach that solves new problems by using the solutions of past cases.

Collaborative Filtering: An approach that collects user ratings on currently proposed products to infer the similarity between users.

Content-Based Filtering: An approach in which the user expresses needs and preferences on a set of attributes and the system retrieves the items that match the description.

Conversational Systems: Systems that can communicate with users through a conversational paradigm.

Machine Learning: The study of computer algorithms that improve automatically through experience.

Recommender Systems: Systems that help the user choose products, taking into account his or her preferences.

Web Site Personalization: Web sites that are personalized for each user, based on the user's interests and needs.
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Categorization Process and Data Mining
Vector Machine (SVM) classifier, and the feature selection approach uses distributional clustering of words via the recently introduced information bottleneck method, which generates a more efficient representation of the documents); (6) the taxonomy method, based on hierarchical text categorization, in which documents are assigned to leaf-level categories of a category tree (taxonomy is a recently emerged subfield of semantic networks and conceptual maps). While previous work in hierarchical classification focused on document categories, the tree method classifies internal categories with a top-down, level-based classification that can classify concepts in the document.

The networks of semantic values thus created and stabilized constitute the cultural-metaphorical worlds which are discursively real for the speakers of particular languages. The elements of these networks, though ultimately rooted in the physical-biological realm, can and do operate independently of the latter, and form the stuff of our everyday discourses (Manjali, 1997, p. 1).

The prototype has given way to a true revolution (the Roschian revolution) regarding classic lexical semantics. If we observe the conceptual map for chair, for instance, we will realize that the choice of the most representative chair types, that is, our prototype of chair, supposes a double adequacy: referential, because the sign (concept of chair) must integrate the features retained from the real or imaginary world; and structural, because the sign must be pertinent (ideological criterion) and distinctive concerning the other neighbor concepts of chair. When I say that this object is a chair, it is supposed that I have an idea of the chair sign, formed by a lexical or visual image competence coming from my referential experience, and that my prototypical concept of chair is more adequate than its neighbors bench or couch, because I perceive that there is a back part and there are no arms. Then, it is useless to try to explain the creation of a prototype inside a language alone, because it is formed from context interactions. The double origin of a prototype is bound, then, to the shared knowledge relation between the subjects and their communities (Amoretti, 2003).

MAIN THRUST

Hypertext poses new challenges for a data-mining process, especially for text categorization research, because metadata extracted from Web sites provide rich information for classifying hypertext documents, and it is a new kind of problem to solve: how to appropriately represent that information and automatically learn statistical patterns for hypertext categorization. The use of technologies in the categorization process through the making of conceptual maps, especially the possibility of creating a collaborative map made by different users, points out the cultural aspects of concept representation in terms of existing coincidences as to the choice of the prototypical element by the same cultural group. Thus, the technologies of information, focused on the study of individual maps, demand revisited discussions on the popular perceptions concerning concepts used daily (folk psychology). The aim is to identify ideological similarity and cognitive deviation, both based on the prototypes and on the levels of categorization developed in the maps, with an emphasis on the cultural and semiotic aspects of the investigated groups.

This work attempted to show how the semiotic and linguistic analysis of the categorization process can help in the identification of ideological similarity and cognitive deviations, favoring the involvement of subjects in map production, and exploring and valuing the relation between the categorization process and the cultural experience of the subject in the world, both parts of the cognitive process of conceptual map construction.

The concept maps, or semantic nets, are space graphic representations of the concepts and their relationships. The concept maps represent, simultaneously, the organization process of the knowledge, by the relationships (links), and the final product, through the concepts (nodes). In this way, besides the relationship between linguistic and visual factors, there is the interaction among their objects and their codes (Amoretti, 2001, p. 49).

The building of a map involves collaboration, when the subjects/students/users share information without modifying the data, and cooperation, when users not only share their knowledge but also may interfere with and modify the information received from the other users, acting in an asynchronous way to build a collective map. Both cooperation and collaboration attest to the autonomy of the ongoing cognitive process, the direction given by the users themselves when trying to adapt their knowledge.

When people do a conceptual map, they usually privilege the level where the prototype is. The basic concept map starts with a general concept at the top of the map and then works its way down through a hierarchical structure to more specific concepts. The empirical concepts (Kant) of cat and chair have been studied by users with map software. They make an initial map at the beginning of the semester and another about the same subject at the end of the semester. I first discussed how cats and chairs appear, what could be called the structure of cat and chair appearance. Second, I discussed how cat and chair are perceived and which attributes make a cat a cat and a chair
a chair. Finally, I will consider cat and chair as an experiential category, so the point of departure is our experience in the world of cat and chair. The acquisition of the concepts cat and chair is mediated by concrete experiences. Thus, the learner must possess relevant prior knowledge and a mental scheme to acquire a prototypical concept.

Expertise changes the competences of conceptual level organization. In the first maps, the novice chair map privileged the basic level, the most important exemplar of a class: the chair prototype. This level has high coherence and distinctiveness. After thinking about this concept, students (now chair experts) repeated the experiment and carried out the expert chair map with much more detail in the superordinate level, showing eight different kinds of chairs: dining room chair, kitchen chair, garden chair, and so forth. This level has high coherence and low distinctiveness (Rosch, 2000). So, users learn the categorization process by doing.

The language system arbitrarily cuts up the concepts into discrete categories (Hjelmslev, 1968), and all categories have equal status. Human language is both natural and cultural. According to prototype theory, the role played by non-linguistic factors like perception and environment is demonstrated throughout the concept as the prototype from the subjects of each community.

A concept is a sort of scheme. An effective way of representing a concept is to retain only its most important properties. This group of the most important properties of a concept is called the prototype. The idea of the prototype makes it possible for the subject to have a mental construction, identifying the typical features of several categories, and, when the subject finds a new object, he or she may compare it to the prototype in his or her memory. Thus, the prototype of chair, for instance, allows new objects to be identified and labeled as chairs. In individual conceptual map creation, one may confirm the presence of variables for the same concept.

The notion of prototype originated in the 1970s, largely due to Eleanor Rosch's (2000) psychological research on the organization of conceptual categories. Its revolutionary character marked a new era for the discussions on categorization and brought existing theories, such as the classical view, into question. Whereas the so-called Aristotelian (or classical) view holds that the members of a category share all the same properties, on the basis of Rosch's results it is argued that categories are structured in an entirely different way: the members that constitute them are assigned in terms of gradual participation, and categorical attribution is made by human beings according to the greater or lesser centrality/marginality of collocation within the categorical structure. Elements recognized as central members of the category represent the prototype. For instance, a chair is a very good example of the category furniture, while a television is a less typical example of the same category. A chair is a more central member than a television, which, in turn, is a rather marginal member. Rosch (2000) claims that prototypes can only constrain, but do not determine, models of representation.

The main thrust of my argument is that it is very important for data mining to know, besides the cognitive categorization process, what the prototypical concept in a dataset is. This basic level of concept organization reflects the social representation better than the other levels (i.e., the superordinate and subordinate levels). Prototype knowledge affords a variety of cultural representations. The conceptual mapping system contains prototype data on the hierarchical arrangement of concept levels. Using different software (three measuring levels: superordinate, basic, and subordinate), I suggest the construction of different maps for each concept to analyze the cognitive categorization process with maps and to show how the categorization performance of individuals and of collective or organizational teams over time is important in data-mining work.

Categorization is a part of Jakobson's (2000) communication model (also appropriated from information theory) with cultural aspects (context). This principle allows one to locate on a definite gradient the objects and relations that are observed, based on similarity and contiguity associations (frame/script semantics) (Schank, 1999) and on hierarchical relations (prototype semantics) (Kleiber, 1990; Rosch, 2000), in terms of perceived family resemblance among category members.

FUTURE TRENDS

Data-mining language technology systems typically have focused on the factual aspect of content analysis. However, there are other categorization aspects, including pragmatics, point of view, and style, which must receive more attention, like types and models of subjective classification information and categorization characteristics such as centrality, polarity, intensity, and different levels of granularity (i.e., expression, clause, sentence, discourse segment, document, hypertext).

It is also important to define property inheritance among different category levels, viewed through hierarchical relations, as something that allows one to virtually add certain attribute-value pairs from one unit to another. We should also think of managing concepts that, in a given category, are considered exceptions. It would be necessary to allow the blocking of inheritance of certain attributes. I will be opening new perspectives for data-mining research on categorization and prototype study, which shows the ideological similarity perception mediated by collaborative conceptual maps.
Much is still unknowable about the future of data mining in higher education and in the business intelligence process. The categorization process is a factor that will affect this future and can be identified with the crucial role played by prototypes. Linguistics has not yet paid this principle due attention. However, some consequences should already necessarily follow from its recognition of the prototype. The extremely powerful explanation offered by prototype categorization constitutes its most salient feature for data mining. So, a very important application in the data-mining methodology is the result of prototype categorization research as a form of retrieval of unexpected information.

REFERENCES

Andler, D. (1987). Introduction aux sciences cognitives. Paris: Gallimard.

Cordier, F. (1989). Les notions de typicalité et niveau d'abstraction: Analyse des propriétés des représentations [thèse de doctorat d'état]. Paris: Sud University.

Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1992). Knowledge discovery in databases: An overview. AI Magazine, 13(2), 57-70.

Greimas, A.J. (1966). Sémantique structurale. Recherche de méthode. Paris: PUF.
Concept: A sort of scheme produced by repeated experiences. Concepts are essentially each little idea that we have in our heads about anything. This includes not only everything, but every attribute of everything.

Conceptual Maps: Semiotic representations (linguistic and visual) of concepts (nodes) and their relationships (links); they represent the organization process of knowledge. When people do a conceptual map, they usually privilege the level where the prototype is. They prefer to categorize at an intermediate level; this basic level is the first level learned, the most common level named, and the most general level where visual shape and attributes are maintained.

Prototype: An effective way of representing a concept is to retain only its most important properties or the most typical element of a category, which serves as a cognitive reference point with respect to a cultural community. This group of the most important properties or most typical elements of a concept is called the prototype. The idea of the prototype makes it possible for the subject to have a mental construction, identifying the typical features of several categories. The prototype is defined as the object that is a category's best model.
INTRODUCTION

Center-based clustering algorithms are generalized to more complex model-based, especially regression-model-based, clustering algorithms. This article briefly reviews three center-based clustering algorithms (K-Means, EM, and K-Harmonic Means) and their generalizations to regression clustering algorithms. More details can be found in the referenced publications.

BACKGROUND

Center-based clustering is a family of techniques with applications in data mining, statistical data analysis (Kaufman et al., 1990), data compression (vector quantization) (Gersho & Gray, 1992), and many others. K-Means (KM) (MacQueen, 1967; Selim & Ismail, 1984) and Expectation Maximization (EM) (Dempster et al., 1977; McLachlan & Krishnan, 1997; Rendner & Walker, 1984), with linear mixing of Gaussian density functions, are two of the most popular clustering algorithms.

K-Means is the simplest among the three. It starts with initializing a set of centers M = {m_k | k = 1, ..., K} and iteratively refines the location of these centers to find the clusters in a dataset. Here are the steps:

K-Means Algorithm

Step 1: Initialize all centers (randomly or based on any heuristic).
Step 2: Associate each data point with the nearest center. This step partitions the data set into K disjoint subsets (Voronoi partition).
Step 3: Calculate the best center locations (i.e., the centroids of the partitions) to minimize the performance function (2), which is the total squared distance from each data point to the nearest center.
Step 4: Repeat Steps 2 and 3 until there are no more changes in the membership of the data points (proven to converge).

With guarantee of convergence to only a local optimum, the quality of the converged results, measured by the performance function of the algorithm, could be far from its global optimum. Several researchers explored alternative initializations to achieve convergence to a better local optimum (Bradley & Fayyad, 1998; Meila & Heckerman, 1998; Pena et al., 1999).

K-Harmonic Means (KHM) (Zhang, 2001; Zhang et al., 2000) is a recent addition to the family of center-based clustering algorithms. KHM takes a very different approach from improving the initializations. It tries to address directly the source of the problem: a single cluster is capable of trapping far more centers than its fair share. This is the main reason for the existence of a very large number of local optima under K-Means and EM when K > 10. With the introduction of a dynamic weighting function of the data, KHM is much less sensitive to initialization, as demonstrated through a large number of experiments in Zhang (2003). The dynamic weighting function reduces the ability of a single data cluster to trap many centers.

By replacing the point centers with more complex data-model centers, especially regression models, in the second part of this article, a family of model-based clustering algorithms is created. Regression clustering has been studied under a number of different names: Clusterwise Linear Regression by Spath (1979, 1981, 1983, 1985), DeSarbo and Cron (1988), Hennig (1999, 2000), and others; Trajectory Clustering Using Mixtures of Regression Models by Gaffney and Smith (1999); Fitting Regression Models to Finite Mixtures by Williams (2000); Clustering Using Regression by Gawrysiak et al. (2000); and Clustered Partial Linear Regression by Torgo et al. (2000). Regression clustering is a better name for the family, because it is not limited to linear or piecewise regressions.

Spath (1979, 1981, 1982) used linear regression and partition of the dataset, similar to K-Means, in his algorithm that locally minimizes the total mean square error over all K regressions. He also developed an incremental version of his algorithm. He visualized his piecewise linear regression concept in his book (Spath,
Center-Based Clustering and Regression Clustering
1985), exactly as he named his algorithm. DeSarbo (1988) used a maximum likelihood method for performing clusterwise linear regression. Hennig (1999) studied clustered linear regression, as he named it, using the same linear mixing of Gaussian density functions.

For K-Means, EM, and K-Harmonic Means, both their performance functions and their iterative algorithms are treated uniformly in this section for comparison. This uniform treatment is carried over to the three regression clustering algorithms, RC-KM, RC-EM, and RC-KHM, in the second part.

where S_k ⊂ X is the subset of x that are closer to m_k than to all other centers (the Voronoi partition).

(b) EM: d(x, M) = -log( Σ_{k=1}^{K} p_k · EXP(-||x - m_k||²) / (2π)^{D/2} ), where a linear mixture of K identical spherical (Gaussian density) functions, which is still a probability density function, is used.

(c) K-Harmonic Means: d(x, M) = HA_{k=1..K}(||x - m_k||^p), the harmonic average of the K distances, HA_{k=1..K}(d_k) = K / Σ_{k=1}^{K} (1/d_k).
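A quick way to see how the three point-to-centers distances behave is to compute them directly. The sketch below is our own illustration, not code from the article; the function names are hypothetical, and the EM variant drops the constant Gaussian normalization factor, which only shifts d(x, M) by an additive constant:

```python
import math

def sq_dist(x, m):
    # Squared Euclidean distance ||x - m||^2.
    return sum((a - b) ** 2 for a, b in zip(x, m))

def d_km(x, M):
    # K-Means: squared distance to the nearest center (Voronoi assignment).
    return min(sq_dist(x, m) for m in M)

def d_em(x, M, p_mix):
    # EM: -log of a linear mixture of identical spherical Gaussian kernels;
    # the constant normalization factor is omitted (it only shifts the value).
    return -math.log(sum(p * math.exp(-sq_dist(x, m)) for p, m in zip(p_mix, M)))

def d_khm(x, M, p=2):
    # K-Harmonic Means: harmonic average of the K distances ||x - m_k||^p.
    return len(M) / sum(1.0 / sq_dist(x, m) ** (p / 2) for m in M)
```

For example, with centers M = [[0, 0], [10, 0]] and x = [1, 0], d_km gives 1.0, while d_khm gives about 1.98: the harmonic average stays close to the minimum but still feels the influence of every center, which is the source of KHM's dynamic weighting of the data.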
Step 4: Repeat Steps 2 and 3 until a chosen convergence condition is satisfied.
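The four steps of the K-Means algorithm given in the Background section can be sketched as a minimal pure-Python loop. This is our own illustration, not the authors' code; the random-sample initialization is just one of the heuristics allowed in Step 1:

```python
import random

def k_means(points, K, max_iter=100, seed=0):
    # Step 1: initialize centers (here: a random sample of the data points).
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, K)]
    members = None
    for _ in range(max_iter):
        # Step 2: associate each point with its nearest center
        # (this yields the Voronoi partition of the data set).
        new_members = [
            min(range(K),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Step 4: stop when no membership changes (proven to converge).
        if new_members == members:
            break
        members = new_members
        # Step 3: move each center to the centroid of its partition, which
        # minimizes the total squared distance to the nearest center.
        for k in range(K):
            cluster = [p for p, m in zip(points, members) if m == k]
            if cluster:  # keep the old center if a partition is empty
                n, dim = len(cluster), len(cluster[0])
                centers[k] = [sum(p[j] for p in cluster) / n for j in range(dim)]
    return centers, members
```

On two well-separated blobs of points, the loop converges in a few iterations to the blob centroids, illustrating why the membership-based stopping rule of Step 4 terminates.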
3. Reproducible results become even more important when we apply these algorithms to different datasets that are sampled from the same hidden distribution. The results from KHM better represent the properties of the distribution and are less dependent on a particular sample set. EM's results are more dependent on the sample set.

The details of the setup of the experiments, quantitative comparisons of the results, and the Matlab source code of K-Harmonic Means can be found in the paper.

Generalization to Complex Model-Based Clustering: Regression Clustering

Clustering applies to datasets without response information (unsupervised); regression applies to datasets with response variables chosen. Given a dataset with responses, Z = (X, Y) = {(x_i, y_i) | i = 1, ..., N}, a family of functions Φ = {f} (a function class making the optimization problem well defined, such as polynomials of up to a certain degree), and a loss function e(·) ≥ 0, regression solves the following minimization problem (Montgomery et al., 2001):

f_opt = arg min_{f ∈ Φ} Σ_{i=1}^{N} e(f(x_i), y_i)    (9)

Commonly, Φ = { Σ_{l=1}^{m} β_l h(x, a_l) | β_l ∈ R, a_l ∈ R^n }, a linear expansion of simple parametric functions, such as polynomials of degree up to m, Fourier series of bounded frequency, or neural networks. Usually, e(f(x), y) = ||f(x) - y||^p, with p = 1, 2 most widely used (Friedman, 1999).

Regression in (9) is not effective when the dataset contains a mixture of very different response characteristics, as shown in Figure 1a; it is much better to find the partitions in the data and to learn a separate function on each partition, as shown in Figure 1b. This is the idea of Regression Clustering (RC). Regression provides a model for the clusters; clustering partitions the data to best fit the models. The linkage between the two algorithms is a common objective function shared by the regressions and the clustering.

RC algorithms can be viewed as replacing the K geometric-point centers in center-based clustering algorithms with a set of model-based centers, particularly a set of regression functions M = {f_1, ..., f_K}. The performance function is the same as defined in (1), but the distance from a data point to the set of centers is replaced by the following (with e(f(x), y) = ||f(x) - y||²):

a) d((x, y), M) = MIN_{f ∈ M}( e(f(x), y) ) for RC with K-Means (RC-KM);

b) d((x, y), M) = -log( Σ_{k=1}^{K} p_k · EXP(-e(f_k(x), y)) / (2π)^{D/2} ) for RC-EM; and

c) d((x, y), M) = HA_{f ∈ M}( e(f(x), y) ) for RC with K-Harmonic Means (RC-KHM).

The three iterative algorithms (RC-KM, RC-EM, and RC-KHM), minimizing their corresponding performance functions, take the following common form (10). Regression with weighting takes the place of the weighted averaging in (4). The regression-function centers in the u-th iteration are the solution of the minimization

f_k^(u) = arg min_{f ∈ Φ} Σ_{i=1}^{N} a_p(z_i) p(Z_k | z_i) ||f(x_i) - y_i||²    (10)
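To make the RC-KM case concrete, here is a minimal sketch of our own (not the authors' implementation): it alternates the hard assignment from distance (a) with a per-cluster ordinary-least-squares refit, i.e., the minimization (10) with unit weights, for one-dimensional x and linear models. The names `fit_line` and `rc_km` are illustrative:

```python
def fit_line(pts):
    # Ordinary least squares for y = a*x + b on a list of (x, y) pairs.
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    denom = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / denom if denom else 0.0
    b = (sy - a * sx) / n
    return a, b

def rc_km(data, models, max_iter=50):
    # data: list of (x, y) pairs; models: initial list of (a, b) coefficients.
    members = None
    for _ in range(max_iter):
        # Hard assignment: each point joins the model with the smallest
        # squared residual e(f(x), y) = (f(x) - y)^2 (distance (a), RC-KM).
        new_members = [
            min(range(len(models)),
                key=lambda k: (models[k][0] * x + models[k][1] - y) ** 2)
            for x, y in data
        ]
        if new_members == members:
            break
        members = new_members
        # Refit each regression-function "center" on its own partition
        # (the minimization (10) with unit weights).
        for k in range(len(models)):
            cluster = [d for d, m in zip(data, members) if m == k]
            if len(cluster) >= 2:
                models[k] = fit_line(cluster)
    return models, members
```

On data drawn from two crossing lines, starting from rough initial slopes, the loop recovers each line's coefficients in a few iterations, mirroring how clustering partitions the data while the regressions model the partitions.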
(b) For RC-EM,

p(Z_k | z_i) = p_k^(u-1) EXP(-e(f_k^(u-1)(x_i), y_i)) / Σ_{l=1}^{K} p_l^(u-1) EXP(-e(f_l^(u-1)(x_i), y_i)),

and p_k^(u) = (1/N) Σ_{i=1}^{N} p(Z_k^(u-1) | z_i).

The same parallel structure can be observed between the center-based EM clustering algorithm and the RC-EM algorithm.

(c) For RC-K-Harmonic Means, with e(f(x), y) = ||f(x_i) - y_i||^{p'},

a_p(z_i) = ( Σ_{l=1}^{K} 1/d_{i,l}^{p'+2} ) / ( Σ_{l=1}^{K} 1/d_{i,l}^{p'} )²  and  p(Z_k | z_i) = (1/d_{i,k}^{p'+2}) / ( Σ_{l=1}^{K} 1/d_{i,l}^{p'+2} ),

where d_{i,l} = ||f_l^(u-1)(x_i) - y_i||. (p' > 2 is used.)

The same parallel structure can be observed between the center-based KHM clustering algorithm and the RC-KHM algorithm.

Sensitivity to initialization in center-based clustering carries over to regression clustering. In addition, a new form of local optimum is illustrated in Figure 2. It happens to all three RC algorithms: RC-KM, RC-KHM, and RC-EM.

Figure 2. A new kind of local optimum occurs in regression clustering.

Regression clustering will find many applications in analyzing real-world data. Single-function regression has been used very widely for data analysis and forecasting. Data collected in an uncontrolled environment, as in stocks, marketing, economics, government censuses, and many other real-world situations, are very likely to contain a mixture of different response characteristics. Regression clustering is a natural extension of classical single-function regression.

CONCLUSION

Replacing the simple geometric-point centers in center-based clustering algorithms with more complex data models provides a general scheme for deriving other model-based clustering algorithms. Regression models are used in this presentation to demonstrate the process. The key step in the generalization is defining the distance function from a data point to the set of models (the regression functions in this special case).

Among the three algorithms, EM has a strong foundation in probability theory. It is the convergence to only a local optimum, and the existence of a very large number of optima when the number of clusters is more than a few (>5, for example), that keep practitioners from the benefits of its theory. K-Means is the simplest, and its objective function the most intuitive, but it has a problem similar to EM's sensitivity to the initialization of the centers. K-Harmonic Means was developed with close attention to the dynamics of its convergence; it is much more robust than the other two on low dimensional data. Improving the convergence of center-based clustering algorithms on higher dimensional data (dim > 10) still needs more research.
REFERENCES

Bradley, P., & Fayyad, U.M. (1998). Refining initial points for KM clustering. Microsoft Technical Report MSR-TR-98-36.

Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1-38.

DeSarbo, W.S., & Cron, W.L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5, 249-282.

Duda, R., & Hart, P. (1972). Pattern classification and scene analysis. John Wiley & Sons.

Friedman, J., Hastie, T., & Tibshirani, R. (1998). Additive logistic regression: A statistical view of boosting [technical report]. Department of Statistics, Stanford University.

Gersho, A., & Gray, R.M. (1992). Vector quantization and signal compression. Kluwer Academic Publishers.

Hamerly, G., & Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. Proceedings of the ACM Conference on Information and Knowledge Management (CIKM).

Hamerly, G., & Elkan, C. (2003). Learning the k in k-means. Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems.

Hennig, C. (1997). Datenanalyse mit Modellen für Cluster-lineare Regression [Dissertation]. Hamburg, Germany: Institut für Mathematische Stochastik, Universität Hamburg.

Kaufman, L., & Rousseeuw, P.J. (1990). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, California.

McLachlan, G.J., & Krishnan, T. (1997). The EM algorithm and extensions. John Wiley & Sons.

Meila, M., & Heckerman, D. (1998). An experimental comparison of several clustering and initialization methods. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (pp. 386-395). Morgan Kaufmann.

Montgomery, D.C., Peck, E.A., & Vining, G.G. (2001). Introduction to linear regression analysis. John Wiley & Sons.

Nock, R., & Nielsen, F. (2004). An abstract weighting framework for clustering algorithms. Proceedings of the Fourth International SIAM Conference on Data Mining, Orlando, Florida.

Pena, J., Lozano, J., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recognition Letters, 20, 1027-1040.

Rendner, R.A., & Walker, H.F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2).

Schapire, R.E. (1999). Theoretical views of boosting and applications. Proceedings of the Tenth International Conference on Algorithmic Learning Theory.

Selim, S.Z., & Ismail, M.A. (1984). K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(1).

Silverman, B.W. (1998). Density estimation for statistics and data analysis. Chapman & Hall/CRC.

Spath, H. (1981). Correction to Algorithm 39: Clusterwise linear regression. Computing, 26, 275.

Spath, H. (1982). Algorithm 48: A fast algorithm for clusterwise linear regression. Computing, 29, 175-181.

Spath, H. (1985). Cluster dissection and analysis. New York: Wiley.

Tibshirani, R., Walther, G., & Hastie, T. (2000). Estimating the number of clusters in a dataset via the gap statistic. Retrieved from http://www-stat.stanford.edu/~tibs/research.html

Zhang, B. (2001). Generalized K-harmonic means: Dynamic weighting of data in unsupervised learning. Proceedings of the First SIAM International Conference on Data Mining (SDM2001), Chicago, Illinois.

Zhang, B. (2003). Comparison of the performance of center-based clustering algorithms. Proceedings of PAKDD-03, Seoul, South Korea.

Zhang, B. (2003a). Regression clustering. Proceedings of the IEEE International Conference on Data Mining, Melbourne, Florida.
Center-Based Clustering and Regression Clustering
KEY TERMS

Boosting: Assigning and updating weights on data points according to a particular formula in the process of refining classification models.

Center-Based Clustering: Similarity among the data points is defined through a set of centers. The distance from each data point to a center determines the data point's association with that center. The clusters are represented by the centers.

Clustering: Grouping data according to similarity among the items. Each clustering algorithm has its own definition of similarity. Such grouping can be hierarchical.

Dynamic Weighting: Reassigning weights on the data points in each iteration of an iterative algorithm.

Model-Based Clustering: A mixture of simpler distributions is used to fit the data, and this mixture defines the clusters of the data. EM with linear mixing of Gaussian density functions is the best example, but K-Means and K-Harmonic Means are of the same type. Regression clustering algorithms are also model-based clustering algorithms, with a mixture of more complex distributions as their model.

Regression: A statistical method of learning the relationship between two sets of variables from data. One set is the independent variables, or predictors, and the other set is the response variables.

Regression Clustering: Combining regression methods with center-based clustering methods. The simple geometric-point centers in the center-based clustering algorithms are replaced by regression models.

Sensitivity to Initialization: Center-based clustering algorithms are iterative algorithms that minimize the value of a performance function. Such algorithms converge to only a local optimum of the performance function. The converged positions of the centers depend on the initial positions of the centers that the algorithm starts with.
Figure 1. A binary decision tree: the root tests Age < 40; one branch leads to the leaf No, the other to a node testing Gender = M with leaves No and Yes

Let us start by introducing decision trees. For ease of explanation, we are going to focus on binary decision trees. In binary decision trees, each internal node has two children nodes. Each internal node is associated with a predicate, called the splitting predicate, which involves only the predictor attributes. Each leaf node is associated with a unique value for the dependent attribute. A decision tree encodes a data-mining model as follows. For an …
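As an illustrative sketch (the encoding below is a hypothetical representation, not from the original text), the tree in Figure 1 can be represented and used to classify a record:

```python
# Minimal sketch of a binary decision tree like the one in Figure 1.
# The attribute names (Age, Gender) match the figure; the record
# dictionaries are illustrative assumptions.

class Node:
    def __init__(self, predicate=None, yes=None, no=None, label=None):
        self.predicate = predicate   # splitting predicate (internal nodes)
        self.yes, self.no = yes, no  # children for predicate true / false
        self.label = label           # unique dependent-attribute value (leaves)

    def classify(self, record):
        if self.label is not None:       # leaf: return its value
            return self.label
        if self.predicate(record):       # internal node: follow matching edge
            return self.yes.classify(record)
        return self.no.classify(record)

# Root splits on Age < 40; its true branch splits on Gender = M.
tree = Node(predicate=lambda r: r["Age"] < 40,
            yes=Node(predicate=lambda r: r["Gender"] == "M",
                     yes=Node(label="Yes"), no=Node(label="No")),
            no=Node(label="No"))

print(tree.classify({"Age": 35, "Gender": "M"}))
```

Classifying a record thus follows one root-to-leaf path, evaluating one splitting predicate per internal node.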
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Classification and Regression Trees
There are two problems that make decision tree construction a hard problem. First, construction of the optimal tree for several measures of optimality is an NP-hard problem. Thus, all decision tree construction algorithms grow the tree top-down according to the following greedy heuristic: At the root node, the training database is examined, and a splitting predicate is selected. Then the training database is partitioned according to the splitting predicate, and the same method is applied recursively at each child node. The second problem is that the training database is only a sample from a much larger population of records. The decision tree has to perform well on records drawn from the population, not on the training database. (For the records in the training database, we already know the value of the dependent attribute.)

Three different algorithmic issues need to be addressed during the tree construction phase. The first issue is to devise a split selection algorithm, such that the resulting tree models the underlying dependency relationship between the predictor attributes and the dependent attribute well. During split selection, we have to make two decisions. First, we need to decide which attribute we will select as the splitting attribute. Second, given the splitting attribute, we have to decide on the actual splitting predicate. For a numerical attribute X, splitting predicates are usually of the form X ≤ c, where c is a constant. For example, in the tree shown in Figure 1, the splitting predicate of the root node is of this form. For a categorical attribute X, splits are usually of the form X in C, where C is a set of values in the domain of X. For example, in the tree shown in Figure 1, the splitting predicate of the right child node of the root is of this form. There exist decision trees that have a larger class of possible splitting predicates; for example, there exist decision trees with linear combinations of numerical attribute values as splitting predicates (Σi ai Xi + c0 ≤ 0, where i ranges over all attributes) (Loh & Shih, 1997). Such splits, also called oblique splits, result in shorter trees; however, the resulting trees are no longer easy to interpret.

The second issue is to devise a pruning algorithm that selects the tree of the right size. If the tree is too large, then the tree models the training database too closely instead of modeling the underlying population. One possible choice of pruning a tree is to hold out part of the training set as a test set and to use the test set to estimate the misprediction error of trees of different size. We then simply select the tree that minimizes the misprediction error.

The third issue is to devise an algorithm for intelligent management of the training database in case the training database is very large (Ramakrishnan & Gehrke, 2002). This issue has only received attention in the last decade, but there exist now many algorithms that can construct decision trees over extremely large, disk-resident training databases (Gehrke, Ramakrishnan & Ganti, 2000; Shafer, Agrawal & Mehta, 1996).

In most classification and regression scenarios, we also have costs associated with misclassifying a record, or with being far off in our prediction of a numerical dependent value. Existing decision tree algorithms can take costs into account, and they will bias the model toward minimizing the expected misprediction cost instead of the expected misclassification rate, or the expected difference between the predicted and true value of the dependent attribute.

FUTURE TRENDS

Recent developments have expanded the types of models that a decision tree can have in its leaf nodes. So far, we assumed that each leaf node just predicts a constant value for the dependent attribute. Recent work, however, has shown how to construct decision trees with linear models in the leaf nodes (Dobra & Gehrke, 2002). Another recent development in the general area of data mining is the use of ensembles of models, and decision trees are a popular model for use as a base model in ensemble learning (Caruana, Niculescu-Mizil, Crew & Ksikes, 2004). Another recent trend is the construction of data-mining models over high-speed data streams, and there have been adaptations of decision tree construction algorithms to such environments (Domingos & Hulten, 2002). A last recent trend is to take adversarial behavior into account (e.g., in classifying spam). In this case, an adversary who produces the records to be classified actively changes his or her behavior over time to outsmart a static classifier (Dalvi, Domingos, Mausam, Sanghai & Verma, 2004).

CONCLUSION

Decision trees are one of the most popular data-mining models. Decision trees are important, since they can result in powerful predictive models, while, at the same time, they allow users to get insight into the phenomenon that is being modeled.

REFERENCES

Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Kluwer Academic Publishers.

Caruana, R., Niculescu-Mizil, A., Crew, R., & Ksikes, A. (2004). Ensemble selection from libraries of models. Proceedings of the Twenty-First International Conference, Banff, Alberta, Canada.

Dalvi, N., Domingos, P., Mausam, Sanghai, S., & Verma, D. (2004). Adversarial classification. Proceedings of the Tenth International Conference on Knowledge Discovery and Data Mining, Seattle, Washington.

Dobra, A., & Gehrke, J. (2002). SECRET: A scalable linear regression tree algorithm. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada.

Domingos, P., & Hulten, G. (2002). Learning from infinite data in finite time. Advances in Neural Information Processing Systems, 14, 673-680.

Gehrke, J., Ramakrishnan, R., & Ganti, V. (2000). RainForest: A framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery, 4(2/3), 127-162.

Goebel, M., & Gruenwald, L. (1999). A survey of data mining software tools. SIGKDD Explorations, 1(1), 20-33.

Hand, D. (1997). Construction and assessment of classification rules. Chichester, England: John Wiley & Sons.

Lim, T.-S., Loh, W.-Y., & Shih, Y.-S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 48, 203-228.

Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815-840.

Murthy, S.K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4), 345-389.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufman.

Ramakrishnan, R., & Gehrke, J. (2002). Database management systems (3rd ed.). McGraw-Hill.

Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Proceedings of the 22nd International Conference on Very Large Databases, Bombay, India.

KEY TERMS

Attribute: Column of a dataset.

Categorical Attribute: Attribute that takes values from a discrete domain.

Classification Tree: A decision tree where the dependent attribute is categorical.

Decision Tree: Tree-structured data-mining model used for prediction, where internal nodes are labeled with predicates (decisions), and leaf nodes are labeled with data-mining models.

Numerical Attribute: Attribute that takes values from a continuous domain.

Regression Tree: A decision tree where the dependent attribute is numerical.

Splitting Predicate: Predicate at an internal node of the tree; it decides which branch a record traverses on its way from the root to a leaf node.
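As a hedged illustration of the split-selection step described in this article, the following sketch picks the constant c in a splitting predicate X ≤ c by minimizing the weighted Gini impurity of the resulting partition; the data rows are made up:

```python
# Sketch: choosing the splitting predicate "X <= c" for a numerical
# attribute by minimizing weighted Gini impurity (toy data).

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_numeric_split(rows):
    """rows: list of (x, label) pairs. Returns (c, impurity) minimizing
    the weighted Gini impurity of the partition X <= c vs. X > c."""
    xs = sorted({x for x, _ in rows})
    best = (None, float("inf"))
    for c in xs[:-1]:                       # candidate thresholds
        left = [lab for x, lab in rows if x <= c]
        right = [lab for x, lab in rows if x > c]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if w < best[1]:
            best = (c, w)
    return best

data = [(25, "yes"), (30, "yes"), (38, "yes"), (45, "no"), (52, "no")]
c, impurity = best_numeric_split(data)
print(c, impurity)  # -> 38 0.0 (a perfect split of the toy data)
```

A real split-selection algorithm would evaluate this for every candidate attribute and pick the attribute/threshold pair with the lowest impurity (or highest gain).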
Classification Methods
Aijun An
York University, Canada
Figure 1. A decision tree with tests on attributes X and Y (the root tests X < 1 vs. X ≥ 1; X ≥ 1 leads to Class 2, while X < 1 leads to a test on Y, with Y = A giving Class 1, Y = B giving Class 2, and Y = C giving Class 1)

class distribution. A simple decision tree is shown in Figure 1. With a decision tree, an object is classified by following a path from the root to a leaf, taking the edges corresponding to the values of the attributes in the object.

A typical decision tree learning algorithm adopts a top-down recursive divide-and-conquer strategy to construct a decision tree. Starting from a root node representing the whole training data, the data is split into two or more subsets based on the values of an attribute chosen according to a splitting criterion. For each subset a child node is created and the subset is associated with the child. The process is then separately repeated on the data in each of the child nodes, and so on, until a termination criterion is satisfied. Many decision tree learning algorithms exist. They differ mainly in attribute-selection criteria, such as information gain, gain ratio (Quinlan, 1993), gini index (Breiman, Friedman, Olshen, & Stone, 1984), etc., termination criteria and post-pruning strategies. Post-pruning is a technique that removes some branches of the tree after the tree is constructed to prevent the tree from over-fitting the training data. Representative decision tree algorithms include CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993). There are also studies on fast and scalable construction of decision trees. Representative algorithms of such kind include RainForest (Gehrke, Ramakrishnan, & Ganti, 1998) and SPRINT (Shafer, Agrawal, & Mehta, 1996).

Decision Rule Learning

Decision rules are a set of if-then rules. They are the most expressive and human-readable representation of classification models (Mitchell, 1997). An example of a decision rule is "if X < 1 and Y = B, then the example belongs to Class 2." This type of rule is referred to as a propositional rule. Rules can be generated by translating a decision tree into a set of rules, one rule for each leaf node in the tree. A second way to generate rules is to learn rules directly from the training data. There is a variety of rule induction algorithms. The algorithms induce rules by searching in a hypothesis space for a hypothesis that best matches the training data. The algorithms differ in the search method (e.g., general-to-specific, specific-to-general, or two-way search), the search heuristics that control the search, and the pruning method used. The most widespread approach to rule induction is sequential covering, in which a greedy general-to-specific search is conducted to learn a disjunctive set of conjunctive rules. It is called sequential covering because it sequentially learns a set of rules that together cover the set of positive examples for a class. Algorithms belonging to this category include CN2 (Clark & Boswell, 1991), RIPPER (Cohen, 1995) and ELEM2 (An & Cercone, 1998).

Naive Bayesian Classifier

The naive Bayesian classifier is based on Bayes' theorem. Suppose that there are m classes, C1, C2, …, Cm. The classifier predicts an unseen example X as belonging to the class having the highest posterior probability conditioned on X. In other words, X is assigned to class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.

By Bayes' theorem, we have

P(Ci|X) = P(X|Ci) P(Ci) / P(X).

As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized. Given a set of training data, P(Ci) can be estimated by counting how often each class occurs in the training data. To reduce the computational expense of estimating P(X|Ci) for all possible X's, the classifier makes a naïve assumption that the attributes used in describing X are conditionally independent of each other given the class of X. Thus, given the attribute values (x1, x2, …, xn) that describe X, we have

P(X|Ci) = ∏j=1..n P(xj|Ci).

The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated from the training data.

The naïve Bayesian classifier is simple to use and efficient to learn. It requires only one scan of the training data. Despite the fact that the independence assumption is often violated in practice, naïve Bayes often competes well with more sophisticated classifiers. Recent theoretical analysis has shown why the naive
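The posterior computation above (maximize P(Ci) · ∏j P(xj|Ci)) can be sketched as follows; the weather-style toy data is an illustrative assumption, and no smoothing is applied:

```python
# Sketch of a naive Bayesian classifier: pick the class maximizing
# P(Ci) * prod_j P(xj|Ci). Toy data; no Laplace smoothing.

from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, class_label).
    Returns priors P(Ci) and a conditional-probability function."""
    class_counts = Counter(label for _, label in examples)
    cond_counts = defaultdict(Counter)   # (class, j) -> Counter of values
    for attrs, label in examples:
        for j, v in enumerate(attrs):
            cond_counts[(label, j)][v] += 1
    n = len(examples)
    priors = {c: k / n for c, k in class_counts.items()}
    def conditional(c, j, v):
        return cond_counts[(c, j)][v] / class_counts[c]
    return priors, conditional

def classify_nb(priors, conditional, attrs):
    """argmax_c P(c) * prod_j P(xj|c); P(X) is omitted (constant)."""
    def score(c):
        s = priors[c]
        for j, v in enumerate(attrs):
            s *= conditional(c, j, v)
        return s
    return max(priors, key=score)

examples = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
            (("rain", "mild"), "yes"), (("rain", "hot"), "yes"),
            (("rain", "cool"), "yes")]
priors, cond = train_nb(examples)
print(classify_nb(priors, cond, ("rain", "mild")))
```

Note that the single pass over the examples in `train_nb` mirrors the article's point that naive Bayes requires only one scan of the training data.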
Bayesian classifier is so robust (Domingos & Pazzani, 1997; Rish, 2001).

Bayesian Belief Networks

A Bayesian belief network, also known as a Bayesian network or a belief network, is a directed acyclic graph whose nodes represent variables and whose arcs represent dependence relations among the variables. If there is an arc from node A to another node B, then we say that A is a parent of B and B is a descendant of A. Each variable is conditionally independent of its nondescendants in the graph, given its parents. The variables may correspond to actual attributes given in the data or to hidden variables believed to form a relationship. A variable in the network can be selected as the class attribute. The classification process can return a probability distribution for the class attribute based on the network structure and some conditional probabilities estimated from the training data, which predicts the probability of each class.

The Bayesian network provides an intermediate approach between naïve Bayesian classification and Bayesian classification without any independence assumptions. It describes dependencies among attributes, but allows conditional independence among subsets of attributes.

The training of a belief network depends on the scenario. If the network structure is known and the variables are observable, training the network only consists of estimating some conditional probabilities from the training data, which is straightforward. If the network structure is given and some of the variables are hidden, a method of gradient descent can be used to train the network (Russell, Binder, Koller, & Kanazawa, 1995). Algorithms also exist for learning the network structure from training data given observable variables (Buntine, 1994; Cooper & Herskovits, 1992; Heckerman, Geiger, & Chickering, 1995).

The k-Nearest Neighbour Classifier

The k-nearest neighbour classifier assigns an unknown example to the most common class among its k nearest neighbours in the training data. It assumes all the examples correspond to points in an n-dimensional space. A neighbour is deemed nearest if it has the smallest distance, in the Euclidean sense, in the n-dimensional feature space. When k = 1, the unknown example is classified into the class of its closest neighbour in the training set. The k-nearest neighbour method stores all the training examples and postpones learning until a new example needs to be classified. This type of learning is called instance-based or lazy learning.

The k-nearest neighbour classifier is intuitive, easy to implement and effective in practice. It can construct a different approximation to the target function for each new example to be classified, which is advantageous when the target function is very complex but can be described by a collection of less complex local approximations (Mitchell, 1997). However, its cost of classifying new examples can be high, due to the fact that almost all the computation is done at classification time. Some refinements to the k-nearest neighbour method include weighting the attributes in the distance computation and weighting the contribution of each of the k neighbours during classification according to their distance to the example to be classified.

Neural Networks

Neural networks, also referred to as artificial neural networks, are studied to simulate the human brain, although brains are much more complex than any artificial neural network developed so far. A neural network is composed of a few layers of interconnected computing units (neurons or nodes). Each unit computes a simple function. The inputs of the units in one layer are the outputs of the units in the previous layer. Each connection between units is associated with a weight. Parallel computing can be performed among the units in each layer. The units in the first layer take input and are called the input units. The units in the last layer produce the output of the network and are called the output units. When the network is in operation, a value is applied to each input unit, which then passes its given value to the connections leading out from it, and on each connection the value is multiplied by the weight associated with that connection. Each unit in the next layer then receives a value which is the sum of the values produced by the connections leading into it, and in each unit a simple computation is performed on the value; a sigmoid function is typical. This process is then repeated, with the results being passed through subsequent layers of nodes until the output nodes are reached. Neural networks can be used for both regression and classification. To model a classification function, we can use one output unit per class. An example can be classified into the class corresponding to the output unit with the largest output value.

Neural networks differ in the way in which the neurons are connected, in the way the neurons process their input, and in the propagation and learning methods used (Nürnberger, Pedrycz, & Kruse, 2002). Learning a neural network is usually restricted to modifying the weights based on the training data; the structure of the initial network is usually left unchanged during the learning process. A typical network structure is the
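The k-nearest-neighbour procedure described above can be sketched as follows; the training points are illustrative assumptions:

```python
# Sketch of a k-nearest-neighbour classifier using Euclidean distance.
# The labeled training points below are made up for illustration.

from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_classify(training, query, k=3):
    """training: list of (point_tuple, label). Classifies `query` by the
    majority class among its k nearest training examples."""
    neighbours = sorted(training, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
            ((5.0, 5.0), "B"), ((5.1, 4.9), "B")]
print(knn_classify(training, (0.3, 0.3), k=3))
print(knn_classify(training, (4.8, 5.2), k=3))
```

As the article notes, all the work happens at classification time: `knn_classify` scans and sorts the stored training set for every query, which is why refinements such as distance weighting or indexing structures matter in practice.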
multilayer feed-forward neural network, in which none of the connections cycles back to a unit of a previous layer. The most widely used method for training a feed-forward neural network is backpropagation (Rumelhart, Hinton, & Williams, 1986).

Support Vector Machines

The support vector machine (SVM) is a recently developed technique for multidimensional function approximation. The objective of support vector machines is to determine a classifier or regression function which minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error) (Vapnik, 1998).

Given a set of N linearly separable training examples S = {xi ∈ R^n | i = 1, 2, …, N}, where each example belongs to one of the two classes, represented by yi ∈ {+1, −1}, the SVM learning method seeks the optimal hyperplane w · x + b = 0 as the decision surface, which separates the positive and negative examples with the largest margin. The decision function for classifying linearly separable data is

f(x) = sign(w · x + b),

where w and b are found from the training set by solving a constrained quadratic optimization problem. The final decision function is

f(x) = sign(Σi αi yi (xi · x) + b), where the sum runs over i = 1, …, N.

The function depends only on the training examples for which αi is non-zero. These examples are called support vectors. Often the number of support vectors is only a small fraction of the original dataset. The basic SVM formulation can be extended to the nonlinear case by using nonlinear kernels that map the input space to a high-dimensional feature space. In this high-dimensional feature space, linear classification can be performed. The SVM classifier has become very popular due to its high performance in practical applications such as text classification and pattern recognition.

FUTURE TRENDS

Classification is a major data mining task. As data mining becomes more popular, classification techniques are increasingly applied to provide decision support in business, biomedicine, financial analysis, telecommunications, and so on. For example, there are recent applications of classification techniques to identify fraudulent usage of credit cards based on credit card transaction databases, and various classification techniques have been explored to identify highly active compounds for drug discovery. To better solve application-specific problems, there has been a trend toward the development of more application-specific data mining systems (Han & Kamber, 2001).

Traditional classification algorithms assume that the whole training data can fit into the main memory. As automatic data collection becomes a daily practice in many businesses, large volumes of data that exceed the memory capacity become available to the learning systems. Scalable classification algorithms become essential. Although some scalable algorithms for decision tree learning have been proposed, there is still a need to develop scalable and efficient algorithms for other types of classification techniques, such as decision rule learning.

Previously, the study of classification techniques focused on exploring various learning mechanisms to improve the classification accuracy on unseen examples. However, recent study of imbalanced data sets has shown that classification accuracy is not an appropriate measure of classification performance when the data set is extremely unbalanced, that is, when almost all the examples belong to one or more larger classes and far fewer examples belong to a smaller, usually more interesting, class. Since many real-world data sets are unbalanced, there has been a trend toward adjusting existing classification algorithms to better identify examples in the rare class.

Another issue that has become more and more important in data mining is privacy protection. As data mining tools are applied to large databases of personal records, privacy concerns are rising. Privacy-preserving data mining is currently one of the hottest research topics in data mining and will remain so in the near future.

CONCLUSION

Classification is a form of data analysis that extracts a model from data to classify future data. It has been studied in parallel in statistics and machine learning, and is currently a major technique in data mining with a broad application spectrum. Since many application problems can be formulated as classification problems and the volume of the available data has become overwhelming, developing scalable, efficient, domain-specific, and privacy-preserving classification algorithms is essential.
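Returning to the Support Vector Machines section, the decision function f(x) = sign(Σi αi yi (xi · x) + b) can be evaluated as in this sketch; the support vectors, αi values, and b below are assumed for illustration rather than obtained by actually solving the quadratic optimization problem:

```python
# Sketch of evaluating the SVM decision function
#   f(x) = sign(sum_i alpha_i * y_i * (x_i . x) + b).
# The support vectors, alphas and b are illustrative assumptions,
# not the solution of a real quadratic program.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_decision(support_vectors, alphas, labels, b, x):
    """Evaluate sign(sum_i alpha_i * y_i * (x_i . x) + b)."""
    s = sum(a * y * dot(xi, x)
            for a, y, xi in zip(alphas, labels, support_vectors)) + b
    return 1 if s >= 0 else -1

# A hyperplane roughly x1 + x2 - 3 = 0, expressed via two support vectors:
# w = sum_i alpha_i * y_i * x_i = (0.5, 0.5) with the values below.
svs    = [(1.0, 1.0), (2.0, 2.0)]
labels = [-1, +1]
alphas = [0.5, 0.5]
b      = -1.5
print(svm_decision(svs, alphas, labels, b, (3.0, 3.0)))
print(svm_decision(svs, alphas, labels, b, (0.0, 0.0)))
```

Only the non-zero-α examples (the support vectors) enter the sum, which is why the classifier is compact relative to the full training set.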
…in X and pj denotes the probability of the jth class in X). Intuitively, the information gain measures the decrease of the weighted average impurity of the partitions E1, …, En, compared with the impurity of the complete set of examples E.

…approximations of a class. It can be used to reduce the feature set and to generate decision rules.

Sigmoid Function: A mathematical function defined by the formula f(x) = 1 / (1 + e^(−x)).
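The information-gain computation described in this fragment can be sketched as follows; the class labels are a toy example, and entropy is used as the impurity measure:

```python
# Sketch: information gain = entropy(E) minus the weighted average
# entropy of the partitions E1, ..., En (toy labels).

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(E, partitions):
    """Decrease of the weighted average impurity of E1, ..., En
    compared with the impurity of the complete example set E."""
    weighted = sum(len(Ei) / len(E) * entropy(Ei) for Ei in partitions)
    return entropy(E) - weighted

E = ["yes", "yes", "no", "no"]
print(information_gain(E, [["yes", "yes"], ["no", "no"]]))  # pure split
print(information_gain(E, [["yes", "no"], ["yes", "no"]]))  # useless split
```

A perfectly class-separating split recovers the full entropy of E as gain, while a split that leaves the class mix unchanged yields zero gain.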
Closed-Itemset Incremental-Mining Problem
same transaction set. In fact, the longest itemset is the most precise characterization of that transaction set, all the others being partial and redundant definitions. Due to the observation that s∘t and t∘s are closure operators, the concepts in the lattice of concepts eliminate the presence of any unclosed itemsets. Under these circumstances, the FCA-based approaches have only subunitary-confidence association rules as results, due to the fact that all unitary-confidence association rules are considered redundant behaviors. All unitary-confidence association rules can be expressed through a base, the pseudo-intent set.

If we consider the data-mining process in the vision of Ankerst (2001), the resulting data model in Apriori is a long list of frequent itemsets, while in FCA it is a conceptual structure, namely a lattice of concepts, free of any redundancy.

While Apriori is a breadth-first result-building algorithm, most of the FCA-based algorithms are depth-first. Only ERA has a different strategy: Each item in the database is used to enlarge an already existing data model built upon the previously selected items, thus generating at all times a new and extended data model. This strategy generates results layer by layer, just like an onion.

The main difference between the depth-first strategy and the layer-based strategy is that interactivity is offered to the user. Just like peeling an onion, one can take a previously found data model, reduce it or enlarge it with some items, and reach the data view that is the most revealing to the individual.

In breadth-first as well as depth-first strategies, it is impossible to provide interactivity to the user, due to the fact that all items of interest for the mining process have to be available from the start.

FUTURE TRENDS

• The construction of small models of data, which makes them more understandable for the user; also, the response time is small
• The extension of data models with a set of new items returns to the user only the supplementary results, hence a smaller amount of results; the response time is considerably smaller than when building the model from scratch
• Whenever data models are incomprehensible, some of the items can be removed, thus obtaining an easy-to-understand data model
• The extension or reduction of a model spares the time spent building it, thus reusing knowledge

After many successful attempts to make it faster, the mining process becomes more interactive and flexible due to the increased number of human interventions.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993, May). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data (pp. 207-216), USA.

Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499), Chile.

Ankerst, M. (2001, May). Human involvement and interactivity of the next generation's data mining tools. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 178-188), USA.

Aumann, Y., & Lindell, Y. (1999, August). A statistical theory for quantitative association rules. Proceedings of the International Conference on Knowledge Discovery in Databases (pp. 261-270), USA.

Dumitriu, L. (2002). Interactive mining and knowledge reuse for the closed-itemset incremental-mining problem. Newsletter of the ACM SIG on Knowledge Discovery and Data Mining, 3(2), 28-36. Retrieved from http://www.acm.org/sigs/sigkdd/explorations/issue3-2/contents.htm

Ganter, B., & Wille, R. (1999). Formal concept analysis: Mathematical foundations. Berlin, Germany: Springer-Verlag.

Hong, T.-P., Kuo, C.-S., Chi, S.-C., & Wang, S.-L. (2000, March). Mining fuzzy rules from quantitative data based on the AprioriTid algorithm. Proceedings of the ACM Symposium on Applied Computing (pp. 534-536), Italy.

Imberman, S., & Domanski, B. (2001, August). Finding association rules from quantitative data using data Booleanization. Proceedings of the Seventh Americas Conference on Information Systems, USA.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999, January). Discovering frequent closed itemsets for association rules. Proceedings of the International Conference on Database Theory (pp. 398-416), Israel.

Pei, J., Han, J., & Mao, R. (2000, May). CLOSET: An efficient algorithm for mining frequent closed itemsets. Proceedings of the Conference on Data Mining and Knowledge Discovery (pp. 11-20), USA.

Valtchev, P., Missaoui, R., Godin, R., & Meridji, M. (2002). A framework for incremental generation of frequent closed itemsets using Galois (concept) lattice theory. Journal of Experimental and Theoretical Artificial Intelligence, 14(2/3), 115-142.

Wang, J., Han, J., & Pei, J. (2003, August). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 236-245), USA.

Webb, G. I. (2001, August). Discovering associations with numeric variables. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 383-388), USA.

Zaki, M. J., & Gouda, K. (2001). Fast vertical mining using diffsets (Tech. Rep. No. 01-1). Rensselaer Polytechnic Institute, Department of Computer Science.

Zaki, M. J., & Hsiao, C. J. (1999). CHARM: An efficient algorithm for closed association rule mining (Tech. Rep. No. 99-10). Rensselaer Polytechnic Institute, Department of Computer Science.

Zaki, M. J., & Ogihara, M. (1998, June). Theoretical foundations of association rules. Proceedings of the Third ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 7:1-7:8), USA.

Zheng, Z., Kohavi, R., & Mason, L. (2001, August). Real world performance of association rule algorithms. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 401-406), USA.
Closed-Itemset Incremental-Mining Problem

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Cluster Analysis in Fitting Mixtures of Curves
Figure 1. Four mixture examples containing (a) one, (b) one, (c) two, and (d) zero thunderstorm events plus background noise. The label 1 is for the first thunderstorm in the scene, 2 for the second, and so forth, and the highest integer label is reserved for the catch-all noise class. Therefore, in (d), because the highest integer is 1, there is no thunderstorm present (the mixture is all noise). [The four scatter-plot panels (a)-(d) are not reproduced here.]

from another cluster. The noise points are identified initially as those having low local density (away from the central portion of any cluster) but, during the extrapolation, can be judged to be a cluster member, if they lie near the extrapolated curve. To increase robustness, method 1 can be applied twice, each time using slightly different inputs (such as the decision threshold for the initial noise rejection and the criteria for accepting points into a cluster that are close to the extrapolated region of the cluster's curve). Then, only clusters that are identified both times are accepted.

Method 2 uses the minimized, integrated squared error (ISE, or L2 distance) (Scott, 2002; Scott & Szewczyk, 2002) and appears to be a good approach for fitting mixture models, including mixtures of regression models, as is our focus here. Qualitatively, the minimum L2 distance method tries to find the largest portion of the data that matches the model. In our context, at each stage, the model is all the points belonging … first seek cluster 1 having the most points, regard the remaining points as noise, remove the cluster, then …
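The stage-wise strategy just described, namely find the dominant cluster, treat the rest as noise, remove it, and repeat, can be illustrated with a deliberately simplified sketch. The robust single-component fit below is a RANSAC-style stand-in (random two-point line hypotheses and inlier counting), not the minimum-ISE estimator discussed in the text; all function names and thresholds are our own:

```python
import random

def fit_dominant_line(points, trials=200, tol=0.5, rng=None):
    """Find the line y = a*x + b supported by the most points.

    RANSAC-style stand-in for a robust single-component fit: repeatedly
    hypothesize a line through two random points and count inliers.
    """
    rng = rng or random.Random(0)
    best = (0.0, 0.0, [])
    for _ in range(trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical or degenerate hypothesis, skip
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [p for p in points if abs(p[1] - (a * p[0] + b)) <= tol]
        if len(inliers) > len(best[2]):
            best = (a, b, inliers)
    return best

def peel_clusters(points, min_size=10):
    """Stage-wise peeling: fit the largest component, remove it, repeat."""
    remaining, clusters = list(points), []
    while len(remaining) >= min_size:
        a, b, inliers = fit_dominant_line(remaining)
        if len(inliers) < min_size:
            break  # whatever is left is treated as background noise
        clusters.append((a, b, inliers))
        kept = set(map(tuple, inliers))
        remaining = [p for p in remaining if tuple(p) not in kept]
    return clusters, remaining

if __name__ == "__main__":
    # Two noiseless lines plus a few stray "noise" points (toy data).
    line1 = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(30)]
    line2 = [(x / 10.0, -1.0 * (x / 10.0) + 4.0) for x in range(30)]
    noise = [(9.0, 40.0), (-5.0, 33.0), (7.5, -20.0)]
    clusters, leftovers = peel_clusters(line1 + line2 + noise)
    print(len(clusters), len(leftovers))
```

Running the sketch twice with different tolerances and keeping only clusters found both times would mimic the consensus step of method 1.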
observations, x1, x2, …, xn, and C be a partition consisting of clusters C0, C1, …, CK, where the cluster Cj contains nj points. The noise cluster is C0 and assumes feature points are distributed uniformly along the true underlying feature, so that projections onto the feature's principal curve are randomly drawn from a uniform U(0, λj) distribution, where λj is the length of the jth curve. An approximation to the probability for 0, 1, …, 5 clusters is available from the Bayesian Information Criterion (BIC), which is defined as BIC = 2 log(L(X|θ)) − M log(n), where L is the likelihood of the data X, and M is the number of fitted parameters, so M = K(DF + 2) + K + 1. For each of K features, we fit 2 parameters (σj and λj, defined below) and a curve having DF degrees of freedom; there are K mixing proportions (πj, defined below), and the estimate of scene area is used to estimate the noise density. The likelihood L satisfies L(X|θ) = Π_{i=1..n} L(xi|θ), where L(xi|θ) = Σ_{j=0..K} πj L(xi|θ, xi ∈ Cj) is the mixture likelihood (πj is the probability that point i belongs to cluster j, and ‖xi − f(λij)‖ is the Euclidean distance from xi to its projection point f(λij) on curve j), and L(xi|θ, xi ∈ Cj) = 1/Area for the noise cluster. Space will not permit a complete description of the HPCC-CEM method, but briefly, the HPCC steps are as follows:

(1) Make an initial estimate of noise points and remove;
(2) form an initial clustering with at least seven points in each cluster;
(3) fit a principal curve to each cluster; and …

Therefore, it is similar to method 3 in that a mixture model is specified but differs in that the curve is fit using parametric regression rather than principal curves, so L(xi|θ, xi ∈ Cj) = fij = (1/σj) φ((yi − xi βj)/σj), where φ is the standard Gaussian density. Also, Turner's implementation did not introduce a clustering criterion, but it did attempt to estimate the number of clusters as follows. Introduce an indicator variable zi recording which component of the mixture generated observation yi, and iteratively maximize Q = Σ_{i=1..n} Σ_{k=1..K} γik ln(πk fik) with respect to θ, where θ is the complete parameter set θj for each class. The γik satisfy γik = πk fik / Σ_k πk fik. Then Q is maximized with respect to the βk by weighted regression of the yi on the xi with weights γik, and each σk² is given by σk² = Σ_i γik (yi − xi βk)² / Σ_i γik. In choosing the number of components in the mixture, each with unknown mixing probability, the difficulty is that the likelihood ratio statistic has an unknown distribution. Therefore, Turner (2000) implemented a bootstrap strategy to choose between 1 and 2 components, between 2 and 3, and so forth. The strategy to choose between K and K + 1 components is (a) calculate the log-likelihood ratio statistic Q for a model having K and for a model having K + 1 components; (b) simulate data from the fitted K-component model; (c) fit the K and K + 1 component models … Q*; (d) compute the p-value for Q as p = (1/n) Σ I(Q* ≥ Q), …
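The iterative scheme above, with responsibilities γik, a weighted regression for each slope, and a weighted residual update for each σk, can be sketched for a one-predictor mixture of regressions. This is an illustrative reconstruction under our own variable names, not the authors' code:

```python
import math

def normal_pdf(z):
    """Standard Gaussian density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def em_step(x, y, pi, beta, sigma):
    """One EM iteration for a K-component mixture of regressions y ~ beta_k * x.

    Mirrors the updates in the text: gamma_ik = pi_k f_ik / sum_k pi_k f_ik,
    weighted least squares for beta_k, and
    sigma_k^2 = sum_i gamma_ik (y_i - x_i beta_k)^2 / sum_i gamma_ik.
    """
    n, K = len(x), len(pi)
    # E-step: responsibilities gamma_ik.
    gamma = []
    for i in range(n):
        f = [pi[k] * (1.0 / sigma[k]) * normal_pdf((y[i] - beta[k] * x[i]) / sigma[k])
             for k in range(K)]
        s = sum(f)
        gamma.append([fk / s for fk in f])
    # M-step: mixing proportions, weighted regression, residual scale.
    new_pi = [sum(gamma[i][k] for i in range(n)) / n for k in range(K)]
    new_beta, new_sigma = [], []
    for k in range(K):
        w = [gamma[i][k] for i in range(n)]
        b = (sum(w[i] * x[i] * y[i] for i in range(n))
             / sum(w[i] * x[i] * x[i] for i in range(n)))
        var = sum(w[i] * (y[i] - b * x[i]) ** 2 for i in range(n)) / sum(w)
        new_beta.append(b)
        new_sigma.append(max(math.sqrt(var), 1e-6))  # guard against collapse
    return new_pi, new_beta, new_sigma

if __name__ == "__main__":
    # Two noiseless regression components with slopes 2 and -1 (toy data).
    x = [0.5 + 0.1 * i for i in range(20)] * 2
    y = [2.0 * v for v in x[:20]] + [-1.0 * v for v in x[20:]]
    pi, beta, sigma = [0.5, 0.5], [1.5, -0.5], [1.0, 1.0]
    for _ in range(30):
        pi, beta, sigma = em_step(x, y, pi, beta, sigma)
    print([round(b, 2) for b in beta])
```

On the toy data the two fitted slopes settle near the generating values 2 and -1 within a few dozen iterations.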
Hurn, M., Justel, A., & Robert, C. (2003). Estimating mixtures of regressions. Journal of Computational and Graphical Statistics, 12, 55-74.

Leroux, B. (1992). Consistent estimation of a mixing distribution. The Annals of Statistics, 20, 1350-1360.

Murtagh, F., & Raftery, A. (1984). Fitting straight lines to point patterns. Pattern Recognition, 17, 479-483.

Scott, D. (2002). Parametric statistical modeling by minimum integrated square error. Technometrics, 43(3), 274-285.

Scott, D., & Szewczyk, W. (2002). From kernels to mixtures. Technometrics, 43(3), 323-335.

Silverman, B. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.

S-PLUS Statistical Programming Language. (2003). Seattle, WA: Insightful Corp.

Stanford, D., & Raftery, A. (2000). Finding curvilinear features in spatial point patterns: Principal curve clustering with noise. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6), 601-609.

Titterington, D., Smith, A., & Makov, U. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.

Turner, T. (2000). Estimating the propagation rate of a viral infection of potato plants via mixtures of regressions. Applied Statistics, 49(3), 371-384.

KEY TERMS

Bayesian Information Criterion: An approximation to the Bayes Factor, which can be used to estimate the Bayesian posterior probability of a specified model.

Bootstrap: A resampling scheme in which surrogate data is generated by resampling the original data or sampling from a model that was fit to the original data.

Cluster Analysis: Dividing objects into groups using varying assumptions regarding the number of groups, and the deterministic and stochastic mechanisms that generate the observed values.

Estimation-Maximization Algorithm: An algorithm for computing maximum likelihood estimates from incomplete data. In the case of fitting mixtures, the group labels are the missing data.

Mixture of Distributions: A combination of two or more distributions in which observations are generated from distribution i with probability pi, where Σ pi = 1.

Poisson Process: A stochastic process for generating observations in which the number of observations in a region (a region in space or time, for example) is distributed as a Poisson random variable.

Principal Curve: A smooth, curvilinear summary of p-dimensional data. It is a nonlinear generalization of the first principal component line that uses a local average of p-dimensional data.

Probability Density Function: A function that can be summed (for discrete-valued random variables) or integrated (for interval-valued random variables) to give the probability of observing values in a specified set.

Probability Density Function Estimate: An estimate of the probability density function. One example is the histogram for densities that depend on one variable (or multivariate histograms for multivariate densities). However, the histogram has known deficiencies involving the arbitrary choice of bin width and locations. Therefore, the preferred density function estimator, which is a smoother estimator that uses local weighted sums with weights determined by a smoothing parameter, is free from bin width and location artifacts.

Scatter Plot: One of the most common types of plots, also known as an x-y plot, in which the first component of a two-dimensional observation is displayed in the horizontal dimension and the second component is displayed in the vertical dimension.
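The bootstrap calibration idea (steps a-d earlier, and the Bootstrap key term above) can be sketched with a toy one-component versus two-component comparison. The statistic below is our own stand-in, not the log-likelihood ratio used by Turner (2000):

```python
import random
import statistics

def split_gain(data):
    """Toy statistic: reduction in total sum of squares when the sorted
    sample is split into a low half and a high half around their own means."""
    xs = sorted(data)
    mid = len(xs) // 2
    def ss(chunk):
        m = statistics.fmean(chunk)
        return sum((v - m) ** 2 for v in chunk)
    return ss(xs) - (ss(xs[:mid]) + ss(xs[mid:]))

def bootstrap_p_value(data, n_boot=200, rng=None):
    """Calibrate split_gain by simulating from the fitted one-component
    (single Gaussian) model, mirroring steps (a)-(d) in the text."""
    rng = rng or random.Random(1)
    observed = split_gain(data)               # (a) statistic on the real data
    mu = statistics.fmean(data)               # fitted one-component model
    sd = statistics.stdev(data)
    exceed = 0
    for _ in range(n_boot):                   # (b) simulate from fitted model
        sim = [rng.gauss(mu, sd) for _ in data]
        if split_gain(sim) >= observed:       # (c) recompute the statistic
            exceed += 1
    return exceed / n_boot                    # (d) bootstrap p-value

if __name__ == "__main__":
    rng = random.Random(2)
    one_group = [rng.gauss(0.0, 1.0) for _ in range(60)]
    two_groups = ([rng.gauss(-5.0, 1.0) for _ in range(30)]
                  + [rng.gauss(5.0, 1.0) for _ in range(30)])
    print(bootstrap_p_value(one_group), bootstrap_p_value(two_groups))
```

For well-separated two-group data the p-value is small, while for a single Gaussian sample it is not; comparing K against K + 1 components proceeds the same way with the real statistic.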
Clustering Analysis and Algorithms
label assignment. Although CURE works with numerical attributes (particularly low-dimensional spatial data), the algorithm ROCK, developed by the same researchers (Guha, Rastogi, & Shim, 1999), targets hierarchical agglomerative clustering for categorical attributes.

Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters are formed to optimize a partitioning criterion, often called a similarity function, such as distance, so that the objects within a cluster are similar, whereas the objects of different clusters are dissimilar in terms of the database attributes.

Partitioning clustering algorithms have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitioning algorithm is the choice of the number of desired output clusters. A seminal paper (Dubes, 1987) provides guidance on this key design decision. The partitioning techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (defined over all the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all the runs is used as the output clustering. The most well-known and commonly used partitioning algorithms are k-means, k-medoids, and their variations.

K-Means Method

The k-means algorithm (Hartigan, 1975) is by far the most popular clustering tool used in scientific and industrial applications. It proceeds as follows. First, it randomly selects k objects, each of which initially represents a cluster mean or centre. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the squared-error criterion is used, defined as

E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − mi|²

where E is the sum of square-error for all objects in the database, p is a point in space representing a given object, and mi is the mean of cluster Ci (both p and mi are multidimensional).

In the k-medoids algorithm, a cluster is represented by one of its points. Instead of taking the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in a cluster. The basic strategy of the k-medoids clustering algorithms is to find k clusters in n objects by first arbitrarily finding a representative object (the medoid) for each cluster. Each remaining object is clustered with the medoid to which it is the most similar. The strategy then iteratively replaces one of the medoids by one of the nonmedoids as long as the quality of the resulting clustering is improved. This quality is estimated by using a cost function that measures the average dissimilarity between an object and the medoid of its cluster. It is important to understand that k-means is a greedy algorithm, but k-medoids is not.

Density-Based Clustering

Heuristic clustering algorithms (such as partitioning methods) work well for finding spherical-shaped clusters in databases that are not very large. To find clusters with complex shapes and for clustering very large data sets, partitioning-based algorithms need to be extended. Most partitioning-based algorithms cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. To discover clusters with arbitrary shape, density-based clustering algorithms have been developed. These algorithms typically regard clusters as dense regions of objects in the data space that are separated by regions of low density. The general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold. That is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) is a typical density-based algorithm that grows clusters according to a density threshold. OPTICS (Ankerst, Breunig, Kriegel, & Sander, 1999) is a density-based algorithm that computes an augmented cluster ordering for automatic and interactive cluster analysis. DENCLUE (Hinneburg & Keim, 1998) is another clustering algorithm based on a set of density distribution functions. It differs from partition-based algorithms not only by accepting arbitrary shape clusters but also by how it handles noise.

Grid-Based Clustering

Grid-based algorithms quantize the object space into a finite number of cells that form a grid structure on which all the operations for clustering are performed. To some extent, the grid-based methodology reflects a technical point of view. The category is eclectic: It contains both partitioning and hierarchical algorithms. The main advantage of this method is its fast processing time, which is typically independent of the number of data objects, yet dependent on only the number of cells in each dimension in the quantized space.

Some typical examples of the grid-based algorithms include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid and density-based approach for clustering in a high-dimensional data space. The algorithm STING (Wang, Yang, & Muntz, 1997) works with numerical attributes (spatial data) and is designed to facilitate region-oriented queries. In doing so, STING constructs data summaries in a way similar to BIRCH. However, it assembles statistics in a hierarchical tree of nodes that are grid-cells. The algorithm WaveCluster (Sheikholeslami, Chatterjee, & Zhang, 1998) works with numerical attributes and has an advanced multiresolution property. It is also known for some outstanding properties such as (a) a high quality of clusters, (b) the ability to work well in relatively high-dimensional spatial data, (c) the successful handling of outliers, and (d) O(N) complexity. WaveCluster, which applies wavelet transforms to filter the data, is based on ideas of signal processing. The algorithm CLIQUE (Agrawal, Gehrke, Gunopulos, & Raghavan, 1998) for numerical attributes is fundamental in subspace clustering. It combines the ideas of density-based clustering, grid-based clustering, and the induction through dimensions similar to the Apriori algorithm in association rule learning.

Model-Based Clustering

This approach assumes that the data are generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The Gaussian mixture model has been shown to be a powerful tool for many applications (Banfield & Raftery, 1993). With the underlying probability model, the problems of determining the number of clusters and of choosing an appropriate clustering algorithm become probabilistic model choice problems (Dasgupta & Raftery, 1998). This provides a great advantage over heuristic clustering algorithms, for which no established method to determine the number of clusters or the best clustering algorithm exists. Model-based clustering follows two major approaches: a probabilistic approach or a neural network approach.

Probabilistic Approach

In the probabilistic approach, data are considered to be samples independently drawn from a mixture model of several probability distributions (McLachlan & Basford, 1988). The main assumption is that data points are generated by (1) randomly picking a model j with probability πj and (2) drawing a point x from a corresponding distribution. The area around the mean of each distribution constitutes a natural cluster. We associate the cluster with the corresponding distribution's parameters such as mean, variance, and so forth. Each data point carries not only its observable attributes, but also a hidden cluster ID. Each point x is assumed to belong to one and only one cluster.

Probabilistic clustering has some important features. For example, it (a) can be modified to handle records of complex structure, (b) can be stopped and resumed with consecutive batches of data, and (c) results in an easily interpretable cluster system. Because the mixture model has a clear probabilistic foundation, the determination of the most suitable number of clusters k becomes a more tractable task. From a data-mining perspective, an excessive parameter set causes overfitting, but from a probabilistic perspective, the number of parameters can be addressed within the Bayesian framework. An important property of probabilistic clustering is that the mixture model can be naturally generalized to clustering heterogeneous data. However, statistical mixture models often require a quadratic space, and the EM algorithm converges relatively slowly, making scalability an issue.
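The k-means procedure and squared-error criterion E described earlier can be sketched in a few lines. This is a deliberately small one-dimensional Python illustration, with our own quantile-style initialization rather than the random start the text describes; production use would rely on a library implementation:

```python
def squared_error(clusters, means):
    """E = sum over clusters of squared distances to the cluster mean."""
    return sum((p - m) ** 2 for c, m in zip(clusters, means) for p in c)

def kmeans(points, k, iters=100):
    """Plain 1-D k-means: assign each point to the nearest mean, then
    recompute each mean; stop when the means are stable."""
    pts = sorted(points)
    n = len(pts)
    # Quantile-style initialization (duplicates possible on tiny data sets).
    means = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p - means[c]) ** 2)
            clusters[j].append(p)
        new_means = [sum(c) / len(c) if c else means[j]
                     for j, c in enumerate(clusters)]
        if new_means == means:
            break
        means = new_means
    return clusters, means

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2, 8.8]
    clusters, means = kmeans(data, 3)
    print(sorted(round(m, 2) for m in means))
    print(round(squared_error(clusters, means), 3))
```

The same loop structure carries over to k-medoids by replacing the mean update with a search for the most central member of each cluster.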
…works. ANNs have been used extensively over the past four decades for both classification and clustering (Jain & Mao, 1994). The ANN approach to clustering has two prominent methods: competitive learning and self-organizing feature maps. Both involve competing neural units. Some of the features of the ANNs that are important in pattern clustering are that they (a) process numerical vectors and so require patterns to be represented with quantitative features only, (b) are inherently parallel and distributed processing architectures, and (c) may learn their interconnection weights adaptively. The neural network approach to clustering tries to emulate actual brain processing. Further research is needed to make it readily applicable to very large databases, due to long processing times and the intricacies of complex data.

Other Clustering Techniques

Traditionally, each pattern belongs to one and only one cluster. Hence, the clusters resulting from this kind of clustering are disjoint. Fuzzy clustering extends this notion to associate each object with every cluster using a membership function (Zadeh, 1965). Another approach is constraint-based clustering, introduced by Tung, Ng, Lakshmanan, and Han (2001). This approach has important applications in clustering two-dimensional spatial data in the presence of obstacles. Another approach used in clustering analysis is the Genetic Algorithm (Goldberg, 1989). An example is the GGA (Genetically Guided Algorithm) for fuzzy and hard k-means (Hall, Ozyurt, & Bezdek, 1999).

FUTURE TRENDS

Choosing a clustering algorithm for a particular problem can be a daunting task. One major challenge in using a clustering algorithm on a specific problem lies not in performing the clustering itself, but rather in choosing the algorithm and the values of the associated parameters. Clustering algorithms also face problems of scalability, both in terms of computing time and memory requirements. Despite the ongoing exponential increases in the power of computers, scalability remains a major issue in many clustering applications. In commercial data-mining applications, the quantity of the data to be clustered can far exceed the main memory capacity of the computer, making both time and space efficiency critical; this issue is addressed by clustering systems in the database community such as BIRCH.

This leads to the following set of continuing research in clustering, and in particular data mining. (1) Extend clustering algorithms to handle very large databases, such as real-life terabyte data sets. (2) Gracefully eliminate the need for a priori assumptions about the data. (3) Use good sampling and data compression methods to improve efficiency and speed up clustering algorithms. (4) Cluster extremely large and high-dimensional data.

CONCLUSION

This article describes five major approaches to clustering, in addition to some other clustering algorithms. Each has both positive and negative aspects, and each is suitable for different types of data and different assumptions about the cluster structure of the input. Clustering is a process of grouping data items based on a measure of similarity. Clustering is also a subjective process; the same set of data items often needs to be partitioned differently for different applications. This subjectivity makes the process of clustering difficult, because a single algorithm or approach is not adequate to solve every clustering problem. However, clustering is an interesting, useful, and challenging problem. It has great potential in applications such as object representation, image segmentation, information filtering and retrieval, and analyzing gene expression data.

REFERENCES

Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. Proceedings of the ACM SIGMOD Conference (pp. 94-105), USA.

Ankerst, M., Breunig, M., Kriegel, H., & Sander, J. (1999). OPTICS: Ordering points to identify clustering structure. Proceedings of the ACM SIGMOD Conference (pp. 49-60), USA.

Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.

Breunig, M., Kriegel, H., Kröger, P., & Sander, J. (2001). Data bubbles: Quality preserving performance boosting for hierarchical clustering. Proceedings of the ACM SIGMOD Conference, USA.

Dasgupta, A., & Raftery, A. E. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93, 294-302.

Dubes, R. C. (1987). How many clusters are best? An experiment. Pattern Recognition, 20, 645-663.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., & Pregibon, D. (1999). Squashing flat files flatter. Proceedings of the ACM SIGKDD Conference (pp. 6-15), USA.

Ester, M., Kriegel, H., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the ACM SIGKDD Conference (pp. 226-231), USA.

Ghosh, J. (2002). Scalable clustering methods for data mining. In N. Ye (Ed.), Handbook of data mining. Lawrence Erlbaum.

Goldberg, D. (1989). Genetic algorithms in search, optimization and machine learning. Addison-Wesley.

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD Conference (pp. 73-84), USA.

Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. Proceedings of the 15th International Conference on Data Engineering (pp. 512-521), Australia.

Hall, L. O., Ozyurt, B., & Bezdek, J. C. (1999). Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation, 3(2), 103-112.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Hartigan, J. (1975). Clustering algorithms. New York: Wiley.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison Wesley Longman.

Hinneburg, A., & Keim, D. (1998). An efficient approach to clustering large multimedia databases with noise. Proceedings of the ACM SIGMOD Conference (pp. 58-65), USA.

Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall.

Jain, A., & Mao, J. (1994). Neural networks and pattern recognition. In J. M. Zurada, R. J. Marks, & C. J. Robinson (Eds.), Computational intelligence: Imitating life (pp. 194-212).

Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley.

Lu, S. Y., & Fu, K. S. (1978). A sentence-to-sentence clustering procedure for pattern analysis. IEEE Transactions on Systems, Man, and Cybernetics, 8, 381-389.

McLachlan, G. J., & Basford, K. D. (1988). Mixture models: Inference and application to clustering. New York: Dekker.

Mirkin, B. (1996). Mathematical classification and clustering. Kluwer Academic.

Ng, R., & Han, J. (1994). Efficient and effective clustering method for spatial data mining. Proceedings of the 20th Conference on Very Large Data Bases (pp. 144-155), Chile.

Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proceedings of the 24th Conference on Very Large Data Bases (pp. 428-439), USA.

Tung, A. K. H., Hou, J., & Han, J. (2001). Spatial clustering in the presence of obstacles. Proceedings of the 17th International Conference on Data Engineering (pp. 359-367), Germany.

Tung, A. K. H., Ng, R. T., Lakshmanan, L. V. S., & Han, J. (2001). Constraint-based clustering in large databases. Proceedings of the Eighth ICDT, London.

Wang, W., Yang, J., & Muntz, R. (1997). STING: A statistical information grid approach to spatial data mining. Proceedings of the 23rd Conference on Very Large Data Bases (pp. 186-195), Greece.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD Conference (pp. 103-114), Canada.

KEY TERMS

Apriori Algorithm: An efficient association rule mining algorithm developed by Agrawal in 1993. Apriori employs a breadth-first search and uses a hash tree structure to count candidate item sets efficiently. The algorithm generates candidate item sets of length k from (k−1)-length item sets. Then, the patterns that have an infrequent subpattern are pruned. Following that, the whole transaction database is scanned to determine frequent item sets among the candidates. For determining
frequent items in a fast manner, the algorithm uses a hash tree to store candidate item sets.

Association Rule: A rule in the form of "if this, then that." It states a statistical correlation between the occurrence of certain attributes in a database.

Customer Relationship Management: The process by which companies manage their interactions with customers.

Data Mining: The process of efficient discovery of actionable and valuable patterns from large databases.

Feature Selection: The process of identifying the most effective subset of the original features to use in data analysis, such as clustering.

Overfitting: The effect on data analysis, data mining, and biological learning of training too closely on limited available data and building models that do not generalize well to new unseen data.

Supervised Classification: Given a collection of labeled patterns, the problem in supervised classification is to label a newly encountered but unlabeled pattern. Typically, the given labeled patterns are used to learn the descriptions of classes that in turn are used to label a new pattern.
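The level-wise candidate generation and pruning described in the Apriori entry can be sketched as follows. This is a minimal in-memory version with our own function names; real implementations count candidates with a hash tree rather than a dictionary:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent itemsets via level-wise search: candidates of length k+1 are
    joined from frequent k-itemsets, pruned if any k-subset is infrequent,
    then counted against the transaction database."""
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    k = 1
    while level:
        # Count candidates against the database (a hash tree in real systems).
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        # Join step: merge frequent k-itemsets into (k+1)-candidates.
        keys = sorted(current, key=sorted)
        candidates = set()
        for a, b in combinations(keys, 2):
            u = a | b
            if len(u) == k + 1:
                # Prune step: every k-subset must itself be frequent.
                if all(frozenset(s) in current for s in combinations(u, k)):
                    candidates.add(u)
        level = sorted(candidates, key=sorted)
        k += 1
    return frequent

if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"},
          {"a", "b", "c", "d"}]
    freq = apriori(db, min_support=3)
    print(sorted("".join(sorted(s)) for s in freq))
```

On the five-transaction toy database above, the singletons a, b, c and the pairs ab, ac, bc are frequent at support 3, while abc falls below the threshold and is discarded after the final scan.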
Clustering in the Identification of Space Models

Adriano Moreira
University of Minho, Portugal

Sofia Carneiro
University of Minho, Portugal
number of clusters, which can be used for different purposes.

For each model, a quality metric is calculated after each iteration of the STICH algorithm. This metric, the ModelQuality, is defined as:

ModelQuality = Intracluster − Intercluster   (2)

and is based on the difference between the Intracluster and Intercluster similarities.

The Intracluster indicator is calculated as the sum of all distances between the several objects in a given cluster (l represents the total number of objects in a cluster) and the mean value (mi) of the cluster (Ci) in which the object (oj) resides. The total number of clusters identified in each iteration is represented by t. The Intracluster indicator is calculated as follows:

Intracluster = Σ_{i=1..t} Σ_{j=1..l} ‖oj − mi‖, with oj ∈ Ci   (3)

The Intercluster indicator is calculated as the sum of all distances existing between the centers of all the clusters identified in a given iteration. The Intercluster indicator is calculated as follows:

Intercluster = Σ_{i=1..t} Σ_{j=1..t, j≠i} ‖mi − mj‖   (4)

In the beginning of the clustering process, the Intracluster similarity value is low, because of the high similarity between the objects inside the clusters, and the Intercluster similarity value is high, because of the low similarity between the several clusters. As the process proceeds, the Intracluster value increases and the Intercluster value decreases. The minimum difference between these two values means that the objects are as separate as possible, considering a low number of clusters. After this point, the clustering process forces the aggregation of all the objects into the same cluster (the stop criterion of agglomerative hierarchical clustering methods), within a few more iterations.

At this minimum point (Figure 2), reached at some iteration of the clustering process, the resulting Space Model is the one where the outliers are pointed out, that is, where the model isolates in different clusters the regions that are very different from all other regions.

The Implementation

STICH was implemented in Visual Basic for Applications (VBA) and integrated in the Geographic Information System (GIS) ArcView 8.2 using ArcObjects. For the creation of Space Models, STICH considers two processes: the clustering of geographic regions based on a selected indicator, and the creation of a new space geometry where a new polygon is created for each resulting cluster by dissolving the borders of its members (the regions inside it). By using ArcGIS, both processes can be implemented in the same platform.
1 109
Intracluster similarity
Intercluster similarity
8 108 Diference
6 108
Min
4 108
2 108
Iteration
2 4 6 8 10
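Equations (2)-(4) can be transcribed directly. The following is an illustrative sketch only (the sample clusters are invented, not data from the article): Intracluster sums the distances from each object to the mean of its cluster, Intercluster sums the distances between all ordered pairs of cluster means, and ModelQuality is their difference, whose minimum over the iterations selects the Space Model.

```python
import math

# Sketch of equations (2)-(4). The sample clusters below are invented for
# illustration; in STICH this quantity is evaluated after every iteration and
# the iteration with the minimum ModelQuality is selected.
def mean(cluster):
    n = len(cluster)
    return [sum(x[d] for x in cluster) / n for d in range(len(cluster[0]))]

def model_quality(clusters):
    means = [mean(c) for c in clusters]
    # Eq. (3): distances between each object and its cluster mean.
    intra = sum(math.dist(obj, means[i])
                for i, c in enumerate(clusters) for obj in c)
    # Eq. (4): distances between the centers of all clusters (i != j).
    inter = sum(math.dist(means[i], means[j])
                for i in range(len(means)) for j in range(len(means)) if j != i)
    # Eq. (2): the difference of the two indicators.
    return intra - inter

clusters = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 0.0), (11.0, 0.0)]]
print(model_quality(clusters))  # 2.0 - 20.0 = -18.0
```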
…cators on a European-wide NUTS III Level) is a research project funded by the European Union through the Information Society Technologies program. STICH is a deliverable of this research project, which contributes to the better understanding of the European Environmental Quality and Quality of Life by delivering a tool aimed at generating environmental sustainability indices at NUTS-III level.

In the following example, an indicator collected in the EPSILON project is analyzed, namely the concentration in the air of Heavy Metals - Lead (the data is available in Table 1 for the 15 European countries that integrate the EPSILON database).

Using the k-means and STICH algorithms, and adopting the value k=3 in order to allow a comparison between the clusters achieved by each, the results obtained are systematized in Table 2. Analyzing these results, it is possible to see that the clusters generated with the k-means algorithm, using the implementation available in the Clementine Data Mining System v8.0, are not as homogeneous as the clusters obtained with STICH. The k-means approach integrates the 0.0231413 value, an outlier present in the dataset, with values like 0.00197921, leaving out of this cluster (Cluster 1) values that are closer to the latter. STICH obtains clusters that are more homogeneous and that separate values that are very different from the others.

Figure 3 shows the Space Model created by STICH as a result of the clustering process. This model, with three clusters, shows the spatial distribution of the concentration of Lead in the air for the 15 analyzed countries.

Figure 3. Space model: analysis of the Heavy Metals - Lead attribute

Table 1. Data available in the EPSILON database for the Heavy Metals - Lead attribute

Table 2 (fragments): Cluster 3 = {0.0032313}; Cluster 3 = {0.0231413}
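STICH itself is an agglomerative hierarchical method implemented in VBA, so the following is only a loose illustration of the behavior described above: how grouping a sorted 1-D series by its widest gaps isolates the extreme value 0.0231413 in its own cluster. Only 0.00197921, 0.0032313, and 0.0231413 appear in the text; the remaining values are invented stand-ins for the EPSILON table.

```python
# Illustration of outlier isolation on 1-D data: sort the values and split at
# the k-1 widest gaps, yielding k clusters. Only three of the values below
# come from the article; the rest are invented stand-ins.
def gap_clusters(values, k):
    xs = sorted(values)
    # indices of adjacent pairs, ordered by gap width; keep the k-1 widest
    gaps = sorted(range(len(xs) - 1), key=lambda i: xs[i + 1] - xs[i])
    cuts = sorted(i + 1 for i in gaps[len(xs) - k:])
    return [xs[a:b] for a, b in zip([0] + cuts, cuts + [len(xs)])]

lead = [0.0012, 0.0018, 0.00197921, 0.0032313, 0.0036, 0.0231413]
for cluster in gap_clusters(lead, 3):
    print(cluster)  # the extreme value 0.0231413 ends up alone
```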
Salber, D., Dey, A.K., & Abowd, G.D. (1999). The context toolkit: Aiding the development of context-enabled applications. In Proceedings of the 1999 Conference on Human Factors in Computing Systems (CHI 99) (pp. 434-441). Pittsburgh.

Zaït, M., & Messatfa, H. (1997). A comparative study of clustering methods. Future Generation Computer Systems, 13(2), 149-159.

KEY TERMS

Partitioning Clustering: A clustering method characterized by the division of the initial dataset in order to find clusters that maximize the similarity between the objects inside the clusters.

Space Model: A geometry of the geographic space obtained by the identification of geographic regions with similar behavior with respect to a specific metric.
Clustering of Time Series Data
Das, G., Lin, K.-I., Mannila, H., Renganathan, G., & Smyth, P. (1998, September). Rule discovery from time series. In Proceedings IEEE Int. Conf. on Data Mining, Rio de Janeiro, Brazil.

Denton, A. (2004, August). Density-based clustering of time series subsequences. In Proceedings of The Third Workshop on Mining Temporal and Sequential Data (TDM 04), in conjunction with The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA.

Džeroski, S., & Lavrač, N. (2001). Relational data mining. Berlin: Springer.

Jiang, D., Pei, J., & Zhang, A. (2003, March). DHC: A density-based hierarchical clustering method for time series gene expression data. In Proceedings 3rd IEEE Symposium on Bioinformatics and Bioengineering (BIBE03), Washington, D.C.

Eisen, M.B., Spellman, P.T., Brown, P.O., & Botstein, D. (1998, December). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA, 95(25) (pp. 14863-14868).

Gavrilov, M., Anguelov, D., Indyk, P., & Motwani, R. (2000). Mining the stock market (extended abstract): Which measure is best? In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 487-496), Boston, MA.

Gersho, A., & Gray, R.M. (1992). Vector quantization and signal compression. Boston, MA: Kluwer Academic Publishers.

Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Intelligent Information Systems Journal, 17(2-3), 107-145.

Hinneburg, A., & Keim, D.A. (2003, November). A general approach to clustering in large databases with noise. Knowledge and Information Systems, 5(4), 387-415.

Kalpakis, K., Gada, D., & Puttagunta, V. (2001). Distance measures for effective clustering of ARIMA time-series. In Proceedings IEEE International Conference on Data Mining (pp. 273-280), San Jose, CA.

Keogh, E.J., Lin, J., & Truppel, W. (2003, December). Clustering of time series subsequences is meaningless: Implications for previous and future research. In Proceedings IEEE International Conference on Data Mining (pp. 115-122), Melbourne, FL.

Jin, X., Lu, Y., & Shi, C. (2002). Similarity measure based on partial information of time series. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 544-549). Edmonton, AB, Canada.

Patel, P., Keogh, E., Lin, J., & Lonardi, S. (2002). Mining motifs in massive time series databases. In Proceedings 2002 IEEE International Conference on Data Mining, Maebashi City, Japan.

Reif, F. (1965). Fundamentals of statistical and thermal physics. New York: McGraw-Hill.

Roddick, J.F., & Spiliopoulou, M. (2002). A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4), 750-767.

Vlachos, M., Gunopoulos, D., & Kollios, G. (2002, February). Discovering similar multidimensional trajectories. In Proceedings 18th International Conference on Data Engineering (ICDE02), San Jose, CA.

Vaarandi, R. (2003). A data clustering algorithm for mining patterns from event logs. In Proceedings 2003 IEEE Workshop on IP Operations and Management, Kansas City, MO.

KEY TERMS

Dynamic Time Warping (DTW): Sequences are allowed to be extended by repeating individual time series elements, such as replacing the sequence X={x1,x2,x3} by X={x1,x2,x2,x3}. The distance between two sequences under dynamic time warping is the minimum distance that can be achieved by extending both sequences independently.

Kernel-Density Estimation (KDE): Consider the vector space in which the data points are embedded. The influence of each data point is modeled through a kernel function. The total density is calculated as the sum of the kernel functions for each data point.

Longest Common Subsequence Similarity (LCSS): Sequences are compared based on the assumption that elements may be dropped. For example, a sequence X={x1,x2,x3} may be replaced by X={x1,x3}. Similarity between two time series is calculated as the maximum number of matching time series elements that can be achieved if elements are dropped independently from both sequences. Matches in real-valued data are defined as lying within some predefined tolerance.

Partition-Based Clustering: The data set is partitioned into k clusters, and cluster centers are defined based on the elements of each cluster. An objective function is defined that measures the quality of the clustering based on the distance of all data points to the center of the cluster to which they belong. The objective function is minimized.

Principal Component Analysis (PCA): The projection of the data set onto a hyperplane that preserves the maximum amount of variation. Mathematically, PCA is equivalent to singular value decomposition of the covariance matrix of the data.

Random Walk: A sequence of random steps in an n-dimensional space, where each step is of fixed or randomly chosen length. In a random walk time series, time is advanced for each step and the time series element is derived using the prescription of a 1-dimensional random walk of randomly chosen step length.

Sliding Window: A time series of length n has (n-w+1) subsequences of length w. An algorithm that operates on all subsequences sequentially is referred to as a sliding window algorithm.

Time Series: A sequence of real numbers, collected at equally spaced points in time. Each number corresponds to the value of an observed quantity.

Vector Quantization: A signal compression technique in which an n-dimensional space is mapped to a finite set of vectors. Each vector is called a codeword, and the collection of all codewords a codebook. The codebook is typically designed using Linde-Buzo-Gray (LBG) quantization, which is very similar to k-means clustering.
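The DTW definition above corresponds to the standard dynamic-programming recurrence: extending a sequence by repeating an element is the same as letting one index stand still while the other advances. The sketch below is illustrative, not code from the encyclopedia.

```python
# Classic dynamic-programming DTW sketch for the definition above: both
# sequences may stretch by repeating elements, and the returned value is the
# minimum total cost of aligning them under such extensions.
def dtw(a, b):
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b's element
                                 cost[i][j - 1],      # repeat a's element
                                 cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]

x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 2.0, 3.0]        # x with one element repeated, as in the text
print(dtw(x, y))                # 0.0: the warping absorbs the repetition
print(dtw(x, [1.0, 2.0, 4.0]))  # 1.0
```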
Clustering Techniques
Sheng Ma
IBM T.J. Watson Research Center, USA
Tao Li
Florida International University, USA
BACKGROUND

Generally, clustering problems are determined by five basic components:

- Data Representation: What is the (physical) representation of the given dataset? What kind of attributes (e.g., numerical, categorical, or ordinal) are there?
- Data Generation: The formal model for describing the generation of the dataset. For example, a Gaussian mixture model is a model for data generation.

MAIN THRUST

We review some of the current clustering techniques in this section. Figure 1 gives a summary of clustering techniques. The following further discusses traditional clustering techniques, spectral-based analysis, model-based clustering, and co-clustering.

Traditional clustering techniques focus on one-sided clustering, and they can be classified as partitional, hierarchical, density-based, and grid-based (Han & Kamber, 2000). Partitional clustering attempts to directly decompose the dataset into disjoint classes, such that the data points in a class are nearer to one another than the data points in other classes. Hierarchical clustering proceeds successively by building a tree of clusters. Density-based clustering groups the neighboring points of a dataset into classes based on density conditions. Grid-based clustering quantizes the data space into a finite number of cells that form a grid structure and then performs clustering on the grid structure. Most of these algorithms use distance functions as objective criteria and are not effective in high-dimensional spaces.

As an example, we take a closer look at K-means algorithms. The typical K-means type algorithm is a widely used partition-based clustering approach. Basically, it first chooses a set of K data points as initial cluster representatives (e.g., centers) and then performs an iterative process that alternates between assigning the data points to clusters, based on their distances to the cluster representatives, and updating the cluster representatives, based on the new cluster assignments. The iterative optimization procedure of the K-means algorithm is a special form of EM-type procedure. The K-means type algorithm treats each attribute equally and computes the distances between data points and cluster representatives to determine cluster memberships.

Many algorithms have been developed recently to address the efficiency and performance issues present in traditional clustering algorithms. Spectral analysis has been shown to be tightly related to the clustering task. Spectral clustering (Ng, Jordan & Weiss, 2001; Weiss, 1999), closely related to latent semantic indexing (LSI), uses selected eigenvectors of the data affinity matrix to obtain a data representation that easily can be clustered or embedded in a low-dimensional space. Model-based clustering attempts to learn generative models, by which the cluster structure is determined, from the data. Tishby, Pereira, and Bialek (1999) and Slonim and Tishby (2000) developed the information bottleneck formulation, in which, given the empirical joint distribution of two variables, one variable is compressed so that the mutual information about the other is preserved as much as possible. Other recent developments of clustering techniques include ensemble clustering, support vector clustering, matrix factorization, high-dimensional data clustering, distributed clustering, and so forth.

Another interesting development is co-clustering, which conducts simultaneous, iterative clustering of both data points and their attributes (features) through utilizing the canonical duality contained in the point-by-attribute data representation. The idea of co-clustering of data points and attributes dates back to Anderberg (1973) and Nishisato (1980). Govaert (1985) researches simultaneous block clustering of the rows and columns of the contingency table. The idea of co-clustering also has been applied to cluster gene expressions and experiments (Cheng & Church, 2000). Dhillon (2001) presents a co-clustering algorithm for documents and words using a bipartite graph formulation and a spectral heuristic. Recently, Dhillon et al. (2003) proposed an information-theoretic co-clustering method for a two-dimensional contingency table. By viewing the non-negative contingency table as a joint probability distribution between two discrete random variables, the optimal co-clustering then maximizes the mutual information between the clustered random variables. Li and Ma (2004) recently developed Iterative Feature and Data (IFD) clustering by representing the data generation with data and feature coefficients. IFD enables an iterative co-clustering procedure for both data and feature assignments. However, unlike previous co-clustering approaches, IFD performs clustering using a mutually reinforcing optimization procedure that has a proven convergence property. IFD only handles data with binary features. Li, Ma, and Ogihara (2004b) further extended the idea to general data.

FUTURE TRENDS

Although clustering has been studied for many years, many issues, such as cluster validation, still need more investigation. In addition, new challenges, such as scalability, high dimensionality, and complex data types, have been brought by the ever-increasing growth of information exposure and data collection.

- Scalability and Efficiency: With the collection of huge amounts of data, clustering faces problems of scalability in terms of both computation time and memory requirements. To resolve the scalability issues, methods such as incremental and streaming approaches, sufficient statistics for data summary, and sampling techniques have been developed.
- Curse of Dimensionality: Another challenge is the high dimensionality of data. It has been shown that in a high-dimensional space, the distance between every pair of points is almost the same for a wide variety of data distributions and distance functions (Beyer et al., 1999). Hence, most algorithms do not work efficiently in high-dimensional spaces, due to the curse of dimensionality. Many feature selection techniques have been applied to reduce the dimensionality of the space. However, as demonstrated in Aggarwal et al. (1999), in many cases the correlations among the dimensions often are specific to data locality; in other words, some data points are correlated with a given set of features, and others are correlated with respect to different features. As pointed out in Hastie, Tibshirani, and Friedman (2001) and Domeniconi, Gunopulos, and Ma (2004), all methods that overcome the dimensionality problems use a metric for measuring neighborhoods, which is often implicit and/or adaptive.
- Complex Data Types: The problem of clustering becomes more challenging when the data contains complex types (e.g., when the attributes contain both categorical and numerical values). There are no inherent distance measures between data values. This is often the case in many applications where data are described by a set of descriptive or presence/absence attributes, many of which are not numerical. The presence of complex types also makes cluster validation and interpretation difficult.

More challenges also include clustering with multiple criteria (where clustering problems often require optimization over more than one criterion), clustering relational data (where data is represented with multiple relation tables), and distributed clustering (where data sets are geographically distributed across multiple sites).

CONCLUSION

Clustering is a classical topic concerned with segmenting and grouping similar data objects. Many algorithms have been developed in the past. Looking ahead, many challenges drawn from real-world applications will drive the search for efficient algorithms that are able to handle heterogeneous data, process large volumes of data, and scale to deal with large numbers of dimensions.

REFERENCES

Aggarwal, C. et al. (1999). Fast algorithms for projected clustering. Proceedings of the ACM SIGMOD Conference.

Anderberg, M.R. (1973). Cluster analysis for applications. Academic Press.

Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is nearest neighbor meaningful? Proceedings of the International Conference on Database Theory.

Brucker, P. (1977). On the complexity of clustering problems. In R. Henn, B. Korte, & W. Oletti (Eds.), Optimization and operations research (pp. 45-54). New York: Springer-Verlag.

Cheng, Y., & Church, G.M. (2000). Biclustering of expression data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB).

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1-38.

Dhillon, I. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. Technical Report 2001-05, Austin CS Dept.

Dhillon, I.S., Mallela, S., & Modha, D.S. (2003). Information-theoretic co-clustering. Proceedings of ACM SIGKDD.

Domeniconi, C., Gunopulos, D., & Ma, S. (2004). Within-cluster adaptive metric for clustering. Proceedings of the SIAM International Conference on Data Mining.

Govaert, G. (1985). Simultaneous clustering of rows and columns. Control and Cybernetics, 437-458.

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD Conference.

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.

Hartigan, J. (1975). Clustering algorithms. Wiley.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.

Jain, A.K., & Dubes, R.C. (1988). Algorithms for clustering data. Upper Saddle River, NJ: Prentice Hall.

Li, T., & Ma, S. (2004). IFD: Iterative feature and data clustering. Proceedings of the SIAM International Conference on Data Mining.

Li, T., Ma, S., & Ogihara, M. (2004a). Entropy-based criterion in categorical clustering. Proceedings of the International Conference on Machine Learning (ICML 2004).

Li, T., Ma, S., & Ogihara, M. (2004b). Document clustering via adaptive subspace iteration. Proceedings of the ACM SIGIR.

Linde, Y., Buzo, A., & Gray, R.M. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1), 84-95.

Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14.

Nishisato, S. (1980). Analysis of categorical data: Dual scaling and its applications. Toronto: University of Toronto Press.

Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. Proceedings of the ACM SIGIR.

Tishby, N., Pereira, F.-C., & Bialek, W. (1999). The information bottleneck method. Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing.

Weiss, Y. (1999). Segmentation using eigenvectors: A unifying view. Proceedings of ICCV (2).

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD Conference.

KEY TERMS

Cluster: A set of entities that are similar between themselves and dissimilar to entities from other clusters.

Clustering: The process of dividing the data into clusters.

Cluster Validation: Evaluates the clustering results and judges the cluster structures.

Co-Clustering: Performs simultaneous clustering of both points and their attributes by way of utilizing the canonical duality contained in the point-by-attribute data representation.

Curse of Dimensionality: This expression is due to Bellman; in statistics, it relates to the fact that the convergence of any estimator to the true value of a smooth function defined on a space of high dimension is very slow. It has been used in various scenarios to refer to the fact that the complexity of learning grows significantly with the dimensions.

Spectral Clustering: The collection of techniques that perform clustering tasks using eigenvectors of matrices derived from the data.

Subspace Clustering: An extension of traditional clustering techniques that seeks to find clusters in different subspaces within a given dataset.
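The K-means iteration described under Main Thrust (choose K representatives, assign points by distance, update the representatives from the new assignments) can be sketched as follows. This is a minimal illustrative toy, not code from the encyclopedia, and the points and initial centers are invented.

```python
import math

# Minimal K-means sketch following the Main Thrust description: alternate
# between (1) assigning each point to its nearest cluster representative and
# (2) recomputing each representative as the mean of its assigned points.
def kmeans(points, centers, iters=10):
    groups = [[] for _ in centers]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:  # assignment step
            i = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            groups[i].append(p)
        for i, g in enumerate(groups):  # update step
            if g:
                centers[i] = tuple(sum(x[d] for x in g) / len(g)
                                   for d in range(len(g[0])))
    return centers, groups

pts = [(0.0, 0.0), (0.5, 0.2), (9.0, 9.0), (9.5, 8.8)]
centers, groups = kmeans(pts, [(0.0, 0.0), (9.0, 9.0)])
print(centers)  # two centers, one per well-separated group
```

A production implementation would also randomize the initial representatives and stop when the assignments no longer change; the fixed iteration count here keeps the sketch short.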
Clustering Techniques for Outlier Detection

Frank Rehm
German Aerospace Center, Germany
with the largest distance to the mean vector, which is assumed to be an outlier, the value of the z-transformation for each of its components is compared to a critical value. If one of these values is higher than the respective critical value, then this vector is declared an outlier. One can use the Mahalanobis distance as in Santos-Pereira and Pires (2002), but because simple clustering techniques such as the fuzzy c-means algorithm tend to spherical clusters, we apply a modified version of the Grubbs test, not assuming correlated attributes within a cluster.

The critical value is a parameter that must be set for each attribute, depending on the specific definition of an outlier. One typical criterion can be the maximum number of outliers with respect to the amount of data (Klawonn, 2004). Eventually, large critical values lead to smaller numbers of outliers, and small critical values lead to very compact clusters. Note that the critical value is set for each attribute separately. This leads to an axes-parallel view of the data, which in cases of axes-parallel clusters leads to better outlier detection than the (hyper)spherical view of the data.

If an outlier is found, the feature vector has to be removed from the data set. With the new data set, the mean value and the standard deviation have to be calculated again for each attribute. With the vector that has the largest distance to the new centre vector, the outlier test will be repeated by checking the critical values. This procedure will be repeated until no outlier is found. The other clusters are treated in the same way.

Figure 1 shows the results of the proposed algorithm. The crosses in this figure are feature vectors that are recognized as outliers. As expected, only a few points are declared as outliers when approximating the feature space with only one prototype. The prototype will be placed in the centre of all feature vectors. Hence, only points on the edges are defined as outliers. Comparing the solutions with 3 and 10 prototypes, one can determine that both solutions are almost identical. Even in the border regions, where two prototypes compete for some data points, the algorithm would rarely identify these points as outliers, which they intuitively are not.

FUTURE TRENDS

Figure 1 shows that the algorithm can identify outliers in multivariate data in a stable way. With only a few parameters, the solution can be adapted to different requirements concerning the specific definition of an outlier. With the choice of the number of prototypes, it is possible to influence the result in such a way that, with many prototypes, even smaller data groups can be found. To avoid overfitting the data, it makes sense in certain cases to eliminate very small clusters. However, finding out the proper number of prototypes should be of interest for further investigations.

In the case of using a fuzzy clustering algorithm such as FCM (Bezdek, 1981) to partition the data, it is possible to assign a feature vector to different prototype vectors. In that way, one can consolidate whether a certain feature vector is an outlier if the algorithm decides for each single cluster that the corresponding feature vector is an outlier.

FCM provides membership degrees for each feature vector to every cluster. One approach could be to assign a feature vector to the corresponding clusters with the two highest membership degrees. The feature vector is considered an outlier if the algorithm makes the same decision in both clusters. In cases where the algorithm gives no definite answer, the feature vector can be labeled and processed by further analysis.

CONCLUSION

In this article, we describe a method to detect outliers in multivariate data. Because information about the number and shape of clusters is often not known in advance, it is necessary to have a method that is relatively robust with respect to these parameters. To obtain a stable algorithm, we combined approved clustering techniques, including FCM or k-means, with a statistical method to detect outliers. Because the complexity of the presented algorithm is linear in the number of points, it can be applied to large data sets.
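The iterative per-cluster test described above can be sketched as follows. For brevity this sketch uses a single critical value for all attributes (1.7, an arbitrary illustrative choice), whereas the article sets one critical value per attribute; the cluster data are invented.

```python
import statistics

# Sketch of the iterative test described above: z-transform each attribute,
# take the vector farthest from the cluster mean, and declare it an outlier
# if any attribute's |z| exceeds the critical value; then remove it,
# recompute mean and standard deviation, and repeat until no outlier is
# found. The data and the critical value (1.7) are illustrative only.
def find_outliers(cluster, critical=1.7):
    cluster = list(cluster)
    outliers = []
    while len(cluster) > 2:
        dims = list(zip(*cluster))
        mean = [statistics.mean(d) for d in dims]
        std = [statistics.stdev(d) or 1.0 for d in dims]  # guard zero spread
        far = max(cluster, key=lambda x: sum((x[d] - mean[d]) ** 2
                                             for d in range(len(x))))
        z = [abs(far[d] - mean[d]) / std[d] for d in range(len(far))]
        if max(z) <= critical:
            break                # the farthest vector passes the test
        outliers.append(far)
        cluster.remove(far)      # recompute the statistics without it
    return outliers

data = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 1.0), (8.0, 1.0)]
print(find_outliers(data))  # the isolated vector (8.0, 1.0)
```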
Knorr, E.M., & Ng, R.T. (1998). Algorithms for mining distance-based outliers in large datasets. Proceedings of the 24th International Conference on Very Large Data Bases (pp. 392-403).

Knorr, E.M., Ng, R.T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. VLDB Journal, 8(3-4), 237-253.

Krishnapuram, R., & Keller, J.M. (1993). A possibilistic approach to clustering. IEEE Transactions on Fuzzy Systems, 1(2), 98-110.

Outliers: Observations in a sample, so far separated in value from the remainder as to suggest that they are generated by another process or are the result of an error in measurement.

Overfitting: The phenomenon that a learning algorithm adapts so well to a training set that the random disturbances in the training set are included in the model as being meaningful. Consequently, as these disturbances do not reflect the underlying distribution, the performance on the test set, with its own but definitively other disturbances, will suffer from techniques that tend to fit well to the training set.
Combining Induction Methods with the Multimethod Approach

Peter Kokol
University of Maribor, FERI, Slovenia

Petra Povalej
University of Maribor, FERI, Slovenia

Milan Zorman
University of Maribor, FERI, Slovenia
Figure 1. An example of a decision tree induced using reused in other methods, we introduced methods on the
multi-method approach. Each node is induced with basis of operators. Therefore, we introduced the opera-
appropriate method (GAgenetic algorithm, ID3, Gini, tion on an individual as a function that transforms one or
Chi-square, J-measure, SVM, neural network, etc.) more individuals into a single individual. Operation can
be a part of one or more methods, like a pruning operator,
a boosting operator, and so forth. An operator-based
GA
view provides the ability simply to add new operations
Gini ID3, GA to the framework (Figure 2).
Usually, methods are composed of operations that
use different knowledge representations (e.g., neural networks and decision trees). In such cases, we have two alternatives: (1) to convert one knowledge representation to another using different already known methods or (2) to combine both knowledge representations in a single intelligent system. In both cases, knowledge transmutation is executed (Cox & Ram, 1999). In the first case, conversion between different knowledge representations must be implemented, which is usually not perfect, and some parts of the knowledge can be lost. On the other hand, it can provide a different view and a good starting point in the hypothesis search space.

The second approach, which is based on combining knowledge, requires some cut-points where knowledge representations can be merged. For example, in a decision tree, such cut-points are internal nodes, where the condition in an internal node can be replaced by another intelligent system (e.g., a support vector machine [SVM]). The same idea also can be applied in decision leaves (Figure 1).

[Figure 1: a decision tree whose nodes embed different methods, e.g., GA and ID3.]

Operators

Using the idea of the multi-method approach, we designed a framework that operates on a population of extracted knowledge representations (individuals). Since methods usually are composed of operations that can be reused in other methods, we introduced the operation on an individual: a function that transforms one or more individuals into a single individual. An operation can be part of one or more methods, like a pruning operator, a boosting operator, and so forth. The transformation to another knowledge representation also is introduced on the individual-operator level; therefore, the transition from one knowledge representation to another is presented as a method. An operator-based view provides us with the ability simply to add new operations to the framework (Figure 2).

Representation with individual operations is an effective and modular way to represent the result as a single individual, but, in general, the result of an operation also can be a population of individuals (e.g., the mutation operation in EA is defined on the individual level and on the population level). A single method is composed of population operations that use individual operations and is introduced as a strategy in the framework that improves individuals in a population (Figure 2). Population operators can be generalized with higher-order functions and thereby reused in different methods.

Meta-Level Control

An important concern of the multi-method framework is how to provide the meta-level services to manage the available resources and the application of the methods. We extended the quest for knowledge into another dimension; that is, the quest for the best application order of methods. The problem that arises is how to control the quality of resulting individuals and how to intervene in the case of bad results. Due to different knowledge representations, solutions cannot be compared trivially to each other, and the assessment of which method is better is hard to imagine. Individuals in a population cannot be evaluated explicitly.

Figure 2. Multi-method framework (the framework supports methods; each method is a strategy composed of population operators, which in turn apply individual operators to a population)
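The operator-based view lends itself to higher-order functions: an individual operator can be lifted into a population operator, and a strategy is simply a composition of population operators. The following is an illustrative sketch only; the names and the toy "individual" representation are ours, not the authors' implementation:

```python
from typing import Callable, List

Individual = dict   # toy stand-in for a knowledge representation
Population = List[Individual]

def prune(ind: Individual) -> Individual:
    """An individual operator (e.g., a pruning operator)."""
    return {**ind, "size": max(1, ind.get("size", 2) - 1)}

def to_rules(ind: Individual) -> Individual:
    """A transmutation operator: change the knowledge representation."""
    return {**ind, "kind": "rules"}

def lift(op: Callable[[Individual], Individual]) -> Callable[[Population], Population]:
    """Generalize an individual operator into a population operator."""
    return lambda pop: [op(ind) for ind in pop]

def strategy(pop: Population, population_ops) -> Population:
    """A method (strategy) is a composition of population operators."""
    for op in population_ops:
        pop = op(pop)
    return pop

pop = [{"kind": "tree", "size": 3}, {"kind": "tree", "size": 5}]
result = strategy(pop, [lift(prune), lift(to_rules)])
```

Because `lift` is a higher-order function, any new individual operator added to the framework is immediately usable by every strategy.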
Combining Induction Methods with the Multimethod Approach
Even if an explicit evaluation function could be found, it probably would be very time-consuming and computationally intensive. Therefore, the idea of classical evolutionary algorithms controlling the evolution cannot be applied.

To achieve self-adaptive behavior of the evolutionary algorithm, the strategy parameters have to be coded directly into the chromosome (Thrun & Pratt, 1998). But in our approach, the meta-level strategy does not know about the structure of the chromosome, and not all of the methods use the EA approach to produce a solution. Therefore, for meta-level chromosomes, the parameters of the method and its individuals are taken. When dealing with a self-adapting population with no explicit evaluation/fitness function, there is also the issue of the best or most promising individual (Šprogar et al., 2000). But, of course, the question of how to control the population size or increase selection pressure must be answered effectively. Our solution was to classify population operators into three categories: operators for reducing, operators for maintaining, and operators for increasing the population size.

CONCRETE COMBINATION TECHNIQUES

The multi-method approach searches for solutions in a huge (infinite) search space and exploits the acquisition technique of each integrated method. A population of individual solutions represents different aspects of extracted knowledge. Transmutation from one knowledge representation to another introduces new aspects. We can draw parallels between the multi-method approach and scientific discovery. In real life, based on observed phenomena, various hypotheses are constructed. Different scientific communities draw different conclusions (hypotheses) consistent with collected data. For example, there are many theories about the creation of the universe, but the current, widely accepted theory is the theory of the big bang. During the following phase, scientists discuss their theories, knowledge is exchanged, new aspects are encountered, and the collected data is reevaluated. In that manner, existing hypotheses are improved, and new, better hypotheses are constructed.

Decision Trees

Decision trees are very appropriate to use as glue between different methods. In general, condition nodes contain a classifier (usually a simple attribute comparison), which enables quick and easy integration of different methods. On the other hand, there are many different methods for decision-tree induction that all generate knowledge in the form of a tree. The most popular method is greedy heuristic induction of a decision tree, which produces a single decision tree with respect to the purity measure (heuristic) of each split in the tree. Altering the purity measure may produce totally different results with different aspects (hypotheses) for a given problem. Hypothesis induction is done with the use of evolutionary algorithms (EA). When designing an EA, the operators for mutation, crossover, and selection have to be carefully chosen (Podgorelec et al., 2001). Combining the EA approach with heuristic approaches dramatically reduces the hypothesis search space.

There is another issue when combining two methods: using another classifier for separation. For example, in Figure 3, there are two hypotheses, h1 and h2, that could be perfectly separately induced using an existing set of methods. Let's suppose that there is no method that is able to acquire both hypotheses in a single hypothesis. Therefore, we need a separation of the problem using another hypothesis, h3, which has no special meaning, to induce a successful composite hypothesis.

Figure 3. Hypothesis separation using GA

Problem Adaptation

In many domains, we encounter data with a very unbalanced class distribution. That is especially true for applications in the medical domain, where most of the instances are regular and only a small percent are assigned to an irregular class. Therefore, for most classifiers that want to achieve high accuracy and low complexity, it is most rational to classify all new instances into the majority class. But that feature is not desired, because we want to extract knowledge (especially when we want to explain a decision-making process) and determine reasons for the separation of classes. To cope with the presented problem, we introduced an instance-reweighting method that works in a similar manner to boosting but on a different level. Instances that are rarely correctly classified gain importance. The fitness criteria of individuals take importance into account and force competition among individual induction methods. Of course, there is a danger of over-adapting to noise.
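The reweighting idea can be sketched as follows. This is an illustrative sketch with hypothetical names; the authors' actual fitness criteria are not reproduced here:

```python
# Instances that keep being misclassified gain weight; an individual's
# fitness is the weighted share of instances it classifies correctly.
def reweight(weights, correct, factor=1.5):
    """Boost the weight of instances the current population misses."""
    boosted = [w * (1.0 if ok else factor) for w, ok in zip(weights, correct)]
    total = sum(boosted)
    return [w / total for w in boosted]          # renormalize to sum to 1

def weighted_fitness(correct, weights):
    """Weighted accuracy of one individual's predictions."""
    return sum(w for w, ok in zip(weights, correct) if ok)

weights = [0.25] * 4
correct = [True, True, True, False]              # the minority case is missed
weights = reweight(weights, correct)             # its weight rises above the rest
```

After a few rounds, an individual that finally captures the hard minority instances gains fitness relative to one that only predicts the majority class, which is exactly the competitive pressure described above.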
In that case, however, the overall classification ability would be decreased, and other induced classifiers can perform better classification (self-adaptation). We achieve an effect similar to boosting by concentrating on hard-to-learn instances while not dismissing already extracted knowledge.

EXPERIMENTAL RESULTS

Our current data-mining research is performed mainly in the medical domain; therefore, a knowledge representation in a human-understandable form is very important, so we are focused on decision-tree induction. To make an objective assessment of our method, a comparison of the extracted knowledge used for classification was made with the reference methods C4.5, C5/See5 without boosting, C5/See5 with boosting, and a genetic algorithm for decision-tree construction (Podgorelec et al., 2001). The following quantitative measures were used:

accuracy = (number of correctly classified objects) / (number of all objects)

accuracy_c = (number of correctly classified objects in class c) / (number of all objects in class c)

average class accuracy = (Σ_i accuracy_i) / (number of classes)

We decided to use average class accuracy instead of the sensitivity and specificity that are usually used when dealing with medical databases. Experiments have been made with seven real-world databases from the field of medicine. For that reason, we have selected only symbolic knowledge from the whole population of resulting solutions.

A detailed description of the databases can be found in Lenič and Kokol (2002). Other databases have been downloaded from the online repository of machine-learning datasets maintained at UCI. We compared two variations of our multi-method approach (MultiVeDec) with four conventional approaches; namely, C4.5 (Quinlan, 1993), C5/See5, Boosted C5, and a genetic algorithm (Podgorelec & Kokol, 2001). The results are presented in Table 1. Gray-marked fields represent the best method on a specific database.

FUTURE TRENDS

As confirmed in many application domains, methods that use only a single approach often can lead to local optima and do not necessarily provide the big picture of the problem. By applying different methods, the induction power of combined methods can supersede single methods. Of course, there is no single way to weave methods together. Our approach emphasizes modularity based on knowledge sharing. Implicit induction-method knowledge is shared with others via the produced hypothesis. The synergy of methods also can be improved by weaving the implicit knowledge of induction-method learning algorithms, which requires very tight coupling of two or more induction methods.

CONCLUSION

The success of the multi-method approach can be explained by the fact that some methods converge to local optima. With the combination of multiple methods (operators) in different orders, a better local (and hopefully global) solution can be found.
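Restating the quantitative measures used in the experiments above as plain code (an illustrative sketch; the toy labels below are not data from the article):

```python
from collections import defaultdict

def accuracies(y_true, y_pred):
    """Overall accuracy, per-class accuracy, and average class accuracy."""
    total_ok = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = total_ok / len(y_true)
    per_class_ok, per_class_n = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        per_class_n[t] += 1
        per_class_ok[t] += (t == p)
    class_acc = {c: per_class_ok[c] / per_class_n[c] for c in per_class_n}
    avg_class_acc = sum(class_acc.values()) / len(class_acc)
    return accuracy, class_acc, avg_class_acc

y_true = ["reg", "reg", "reg", "irr"]
y_pred = ["reg", "reg", "reg", "reg"]   # a majority-class classifier
acc, class_acc, avg = accuracies(y_true, y_pred)
```

On this unbalanced toy sample, the overall accuracy is 0.75 while the average class accuracy is only 0.5, which is exactly why average class accuracy is preferred for unbalanced medical data.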
Static hybrid systems usually work sequentially or in parallel on a fixed structure and order, performing whole tasks. On the other hand, the multi-method approach works simultaneously with several methods on a single task (i.e., some parts are induced with different classical heuristics, some parts with hybrid methods, and still other parts with evolutionary programming). The presented multi-method approach enables a quick and modular way to integrate different methods into an existing system and enables the simultaneous application of several methods. It also enables partial application of method operations to improve and recombine aspects and has no limitation on the order and number of applied methods.

REFERENCES

Auer, P., Holte, R.C., & Cohen, W.W. (1995). Theory and applications of agnostic PAC-learning with small decision trees. Proceedings of the 12th International Conference on Machine Learning.

Cox, M.T., & Ram, A. (1999). Introspective multistrategy learning: On the construction of learning strategies. Artificial Intelligence, 112, 1-55.

Dietterich, T.G. (2000). Ensemble methods in machine learning. Proceedings of the First International Workshop on Multiple Classifier Systems.

Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison Wesley.

Iglesias, C.J. (1996). The role of hybrid systems in intelligent data management: The case of fuzzy/neural hybrids. Control Engineering Practice, 4(6), 839-845.

Lenič, M., & Kokol, P. (2002). Combining classifiers with multimethod approach. Proceedings of the Second International Conference on Hybrid Intelligent Systems, Soft Computing Systems: Design, Management and Applications, Frontiers in Artificial Intelligence and Applications, Amsterdam.

McGarry, K., Wermter, S., & MacIntyre, J. (2001). The extraction and comparison of knowledge from local function networks. International Journal of Computational Intelligence and Applications, 1(4), 369-382.

Podgorelec, V., & Kokol, P. (2001). Evolutionary decision forests: Decision making with multiple evolutionary constructed decision trees. Problems in applied mathematics and computational intelligence. World Scientific and Engineering Society Press, 97-103.

Podgorelec, V., Kokol, P., Yamamoto, R., Masuda, G., & Sakamoto, N. (2001). Knowledge discovery with genetically induced decision trees. Proceedings of the International ICSC Congress on Computational Intelligence: Methods and Applications (CIMA2001).

Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.

Šprogar, M., et al. (2000). Vector decision trees. Intelligent Data Analysis, 4(3-4), 305-321.

Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.

Todorovski, L., & Dzeroski, S. (2000). Combining multiple models with meta decision trees. Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery.

Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.

Vapnik, V.N. (1995). The nature of statistical learning theory. New York: Springer Verlag.

Wolpert, D.H., & Macready, W.G. (1995). No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe, NM.

Zorman, M., Kokol, P., & Podgorelec, V. (2000). Medical decision making supported by hybrid decision trees. Proceedings of ISA2000.

KEY TERMS

Boosting: Creation of an ensemble of hypotheses to convert a weak learner into a strong one by modifying the expected instance distribution.

Induction Method: A process of learning, from cases or instances, resulting in a general hypothesis of a hidden concept in data.

Method Level Operator: A partial operation on the induction/knowledge-transformation level of a specific induction method.

Multi-Method Approach: Investigation of a research question using a variety of research methods, each of which may contain inherent limitations, with the expectation that combining multiple methods may produce convergent evidence.

Population Level Operator: A (usually parameterized) operation that applies method-level operators (its parameter) to part of or the whole evolving population.

Transmutator: A knowledge-transformation operator that modifies learner knowledge by exploring learner experience.
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Comprehensibility of Data Mining Algorithms
decision tree, which can be easily inspected. Some other classification algorithms are deemed incomprehensible because the patterns they mine are expressed in an implicit way. Representatives are artificial neural networks, which encode the mined patterns in real-valued connection weights. Actually, many methods have been developed to improve the comprehensibility of incomprehensible classification algorithms, especially artificial neural networks.

The main scheme for improving the comprehensibility of artificial neural networks is rule extraction, that is, extracting symbolic rules from trained artificial neural networks. It originates from Gallant's work on connectionist expert systems (Gallant, 1983). Good reviews can be found in Andrews, Diederich, and Tickle (1995) and Tickle, Andrews, Golea, and Diederich (1998). Roughly speaking, current rule extraction algorithms can be categorized into four categories, namely the decompositional, pedagogical, eclectic, and compositional algorithms. Each category is illustrated with an example below.

The decompositional algorithms extract rules from each unit in an artificial neural network and then aggregate them. A representative is the RX algorithm (Setiono, 1997), which prunes the network and discretizes the outputs of hidden units to reduce the computational complexity of examining the network. If a hidden unit has many connections, it is split into several output units, and some new hidden units are introduced to construct a subnetwork, so that the rule extraction process is iteratively executed. The RX algorithm is summarized in Table 1.

The pedagogical algorithms regard the trained artificial neural network as opaque and aim to extract rules that map inputs directly into outputs. A representative is the TREPAN algorithm (Craven & Shavlik, 1996), which regards the rule extraction process as an inductive learning problem and uses oracle queries to induce an ID2-of-3 decision tree that approximates the concept represented by a given network. The pseudo-code of this algorithm is shown in Table 2.

The eclectic algorithms incorporate elements of both the decompositional and pedagogical ones. A representative is the DEDEC algorithm (Tickle, Orlowski, & Diederich, 1996), which extracts a set of rules to reflect the functional dependencies between the inputs and the outputs of the artificial neural networks. Fig. 1 shows its working routine.

The compositional algorithms are not strictly decompositional because they do not extract rules from individual units with subsequent aggregation to form a global relationship, nor do they fit into the eclectic category because there is no aspect that fits the pedagogical profile. Algorithms belonging to this category are mainly designed for extracting deterministic finite-state automata (DFA) from recurrent artificial neural networks. A representative is the algorithm proposed by Omlin and Giles (1996), which exploits the phenomenon that the outputs of the recurrent state units tend to cluster; if each cluster is regarded as a state of a DFA, then the relationships between different outputs can be used to set up the transitions between different states. For example, assuming there are two recurrent state units s0 and s1 whose outputs appear as nine clusters, the working style of the algorithm is shown in Fig. 2.

During the past years, powerful classification algorithms have been developed in the ensemble learning area. An ensemble of classifiers works through training multiple classifiers and then combining their predictions, which is usually much more accurate than a single classifier (Dietterich, 2002). However, since the classification is made by a collection of classifiers, the comprehensibility of an ensemble is poor even when its component classifiers are comprehensible.

A pedagogical algorithm has been proposed by Zhou, Jiang, and Chen (2003) to improve the comprehensibility of ensembles of artificial neural networks; it utilizes the trained ensemble to generate instances and then extracts symbolic rules from them.
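The pedagogical scheme just described can be sketched as follows. This is an illustrative sketch only: the "networks" are toy threshold functions and the rule learner is a single threshold rule, standing in for the real ensemble and rule extractor:

```python
import random

def ensemble_predict(models, x):
    """Majority vote over the component classifiers."""
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

models = [lambda x: int(x[0] > 0.5),          # toy stand-ins for trained networks
          lambda x: int(x[0] > 0.45),
          lambda x: int(x[0] + x[1] > 1.2)]

# Step 1: generate instances and label them with the trained ensemble.
random.seed(0)
generated = [(random.random(), random.random()) for _ in range(200)]
labeled = [(x, ensemble_predict(models, x)) for x in generated]

# Step 2: extract a symbolic description that best mimics the ensemble
# (here: the single threshold on feature 0 with the highest agreement).
best_t = max((0.3, 0.4, 0.5, 0.6),
             key=lambda t: sum((x[0] > t) == bool(y) for x, y in labeled))
```

The extracted rule "IF x0 > best_t THEN class 1" is then judged by its fidelity to the ensemble, not by the original training labels.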
Table 2. The TREPAN algorithm

TREPAN(training_examples, features)
    Queue ← empty
    for each example E ∈ training_examples
        E.label ← ORACLE(E)
    initialize the root of the tree, T, as a leaf node
    put <T, training_examples, {}> into Queue
    while Queue is not empty and size(T) < tree_size_limit
        remove node N from head of Queue
        examples_N ← example set stored with N
        constraints_N ← constraint set stored with N
        use features to build set of candidate splits
        use examples_N and calls to ORACLE(constraints_N) to evaluate splits
        S ← best binary split
        search for best M-of-N split, S', using S as a seed
        make N an internal node with split S'
        for each outcome, s, of S'
            make C, a new child node of N
            constraints_C ← constraints_N ∪ {S' = s}
            use calls to ORACLE(constraints_C) to determine if C should remain a leaf
            otherwise
                examples_C ← members of examples_N with outcome s on split S'
                put <C, examples_C, constraints_C> into Queue
    return T
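The core pedagogical idea in TREPAN, relabel with the oracle and then induce an interpretable structure, can be shown in a much-reduced runnable form. In this sketch of ours, a single threshold "stump" stands in for the full M-of-N tree and a toy function stands in for the trained network:

```python
def oracle(x):
    """Stand-in for a trained network queried as an oracle."""
    return int(x[0] + x[1] > 1.0)

def best_stump(examples):
    """Pick the (accuracy, feature, threshold) best matching oracle labels."""
    labeled = [(x, oracle(x)) for x in examples]      # oracle queries
    best = None
    for f in range(len(examples[0])):
        for x, _ in labeled:
            t = x[f]
            acc = sum((xi[f] > t) == bool(y) for xi, y in labeled) / len(labeled)
            if best is None or acc > best[0]:
                best = (acc, f, t)
    return best

examples = [(0.2, 0.1), (0.9, 0.8), (0.4, 0.9), (0.8, 0.1)]
best = best_stump(examples)
```

The key point is that the induced structure is fit to the network's answers, not to the original class labels, so it approximates the concept the network actually represents.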
Figure 2. Outputs of the two recurrent state units s0 and s1 fall into clusters (DFA states); the panels show (a) all the possible transitions from state 1, (b) all the possible transitions from state 2, and (c) all the possible transitions from states 3 and 4
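The clustering idea behind this style of extraction can be sketched as follows (an illustrative sketch with hypothetical names, not the Omlin and Giles implementation): quantize each continuous (s0, s1) output into a grid cell, treat each cell as a DFA state, and record the transitions observed along a run.

```python
def cluster(state, grid=3):
    """Map a continuous output in [0, 1)^2 to one of grid*grid states."""
    return tuple(min(int(v * grid), grid - 1) for v in state)

def extract_transitions(run):
    """run: a trajectory [(state_vector, input_symbol), ...].
    Returns the observed transition map delta[(state, symbol)] -> state."""
    delta = {}
    for (s, sym), (s_next, _) in zip(run, run[1:]):
        delta[(cluster(s), sym)] = cluster(s_next)
    return delta

run = [((0.1, 0.1), "a"), ((0.8, 0.2), "b"), ((0.1, 0.9), "a")]
delta = extract_transitions(run)
```

A real extraction would merge clusters found by a clustering algorithm rather than a fixed grid, and would also mark accepting states, but the transition-building step is the same.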
The success of this algorithm suggests that research on improving the comprehensibility of artificial neural networks can give illumination to improving the comprehensibility of other complicated classification algorithms.

Recently, Zhou and Jiang (2003) proposed combining ensemble learning and rule induction algorithms to obtain accurate and comprehensible classifiers. Their algorithm uses an ensemble of artificial neural networks as a data preprocessing mechanism for the induction of symbolic rules. Later, they (Zhou & Jiang, 2004) presented a new decision tree algorithm and showed that when the ensemble is significantly more accurate than the decision tree directly grown from the original training set, and the original training set has not fully captured the target distribution, using an ensemble as the preprocessing mechanism is beneficial. These works suggest the twice-learning paradigm for developing accurate and comprehensible classifiers, that is, using coupled classifiers where one classifier is devoted to accuracy while the other is devoted to comprehensibility.

FUTURE TRENDS

It has been supposed that an algorithm that can produce explicitly expressed patterns is comprehensible. However, such a supposition might not be as valid as it appears. For example, is a decision tree containing hundreds of leaves comprehensible or not? A quantitative answer might be more feasible than a qualitative one. Thus, a quantitative measure of comprehensibility is needed. Such a measure can also help solve a long-standing problem, that is, how to compare the comprehensibility of different algorithms.

Since rule extraction is an important scheme for improving the comprehensibility of complicated data mining algorithms, frameworks for evaluating the quality of extracted rules are important. Actually, the FACC (Fidelity, Accuracy, Comprehensibility, Consistency) framework proposed by Andrews, Diederich, and Tickle (1995) has been used for almost a decade; it contains two important criteria, i.e., fidelity and accuracy. Recently, Zhou (2004) identified the fidelity-accuracy dilemma, which indicates that in some cases pursuing high fidelity and high accuracy simultaneously is impossible. Therefore, new evaluation frameworks have to be developed and employed; the ACC framework (eliminating Fidelity from FACC) suggested by Zhou (2004) might be a good candidate.

Most current rule extraction algorithms suffer from high computational complexity. For example, in decompositional algorithms, if all the possible relationships between the connection weights and units in a trained artificial neural network are considered, then combinatorial explosion is inevitable for even moderate-sized networks. Although many mechanisms such as pruning have been employed to reduce the computational complexity, the efficiency of most current algorithms is not good enough. In order to work well in real-world applications, effective algorithms with better efficiency are needed.

Until now, almost all work on improving the comprehensibility of complicated algorithms has relied on rule extraction. Although symbolic rules are relatively easy for human beings to understand, they are not the only comprehensible style that could be exploited. For example, visualization may provide good insight into a pattern. However, although there are a few works (Frank & Hall, 2003; Melnik, 2002) utilizing visualization techniques to improve the comprehensibility of data mining algorithms, few works attempt to exploit rule extraction and visualization together, which is evidently well worth exploring.

Previous research on comprehensibility has mainly focused on classification algorithms. Recently, some works on improving the comprehensibility of complicated regression algorithms have been presented (Saito & Nakano, 2002; Setiono, Leow, & Zurada, 2002). Since complicated algorithms exist extensively in data mining, more scenarios besides classification should be considered.

CONCLUSION

This short article briefly discusses comprehensibility issues in data mining. Although there is still a long way to go before patterns can be produced that common people can understand in any data mining task, endeavors to improve the comprehensibility of complicated algorithms have paved a promising way. It can be anticipated that the experiences and lessons learned from this research might give illumination on how to design data mining algorithms whose comprehensibility is good enough not to need further improvement. Only when comprehensibility is no longer a problem can the fruits of data mining be fully enjoyed.

REFERENCES

Andrews, R., Diederich, J., & Tickle, A.B. (1995). Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8(6), 373-389.

Craven, M.W., & Shavlik, J.W. (1995). Extracting comprehensible concept representations from trained neural networks. In Working Notes of the IJCAI'95 Workshop on Comprehensibility in Machine Learning (pp. 61-75), Montreal, Canada.
Craven, M.W., & Shavlik, J.W. (1996). Extracting tree-structured representations of trained networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems, 8 (pp. 24-30). Cambridge, MA: MIT Press.

Dietterich, T.G. (2002). Ensemble learning. In M.A. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed.). Cambridge, MA: MIT Press.

Frank, E., & Hall, M. (2003). Visualizing class probability estimation. In N. Lavrač, D. Gamberger, H. Blockeel, & L. Todorovski (Eds.), Lecture Notes in Artificial Intelligence, 2838 (pp. 168-179). Berlin: Springer.

Gallant, S.I. (1983). Connectionist expert systems. Communications of the ACM, 31(2), 152-169.

Melnik, O. (2002). Decision region connectivity analysis: A method for analyzing high-dimensional classifiers. Machine Learning, 48(1-3), 321-351.

Michalski, R. (1983). A theory and methodology of inductive learning. Artificial Intelligence, 20(2), 111-161.

Omlin, C.W., & Giles, C.L. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1), 41-52.

Saito, K., & Nakano, R. (2002). Extracting regression rules from neural networks. Neural Networks, 15(10), 1279-1288.

Setiono, R. (1997). Extracting rules from neural networks by pruning and hidden-unit splitting. Neural Computation, 9(1), 205-225.

Setiono, R., Leow, W.K., & Zurada, J.M. (2002). Extraction of rules from artificial neural networks for nonlinear regression. IEEE Transactions on Neural Networks, 13(3), 564-577.

Tickle, A.B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6), 1057-1067.

Tickle, A.B., Orlowski, M., & Diederich, J. (1996). DEDEC: A methodology for extracting rules from trained artificial neural networks. In Proceedings of the AISB'96 Workshop on Rule Extraction from Trained Neural Networks (pp. 90-102), Brighton, UK.

Zhou, Z.-H. (2004). Rule extraction: Using neural networks or for neural networks? Journal of Computer Science and Technology, 19(2), 249-253.

Zhou, Z.-H., & Jiang, Y. (2003). Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble. IEEE Transactions on Information Technology in Biomedicine, 7(1), 37-42.

Zhou, Z.-H., & Jiang, Y. (2004). NeC4.5: Neural ensemble based C4.5. IEEE Transactions on Knowledge and Data Engineering, 16(6), 770-773.

Zhou, Z.-H., Jiang, Y., & Chen, S.-F. (2003). Extracting symbolic rules from trained neural network ensembles. AI Communications, 16(1), 3-15.

KEY TERMS

Accuracy: The measure of how well a pattern can generalize. In classification, it is usually defined as the percentage of examples that are correctly classified.

Artificial Neural Networks: A system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or units.

Comprehensibility: The understandability of a pattern to human beings; the ability of a data mining algorithm to produce patterns understandable to human beings.

Decision Tree: A flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf represents a class or class distribution.

Ensemble Learning: A machine learning paradigm using multiple learners to solve a problem.

Fidelity: The measure of how well the rules extracted from a complicated model mimic the behavior of that model.

MOFN Expression: A boolean expression consisting of an integer threshold m and n boolean antecedents, which fires when at least m antecedents are fired. For example, the MOFN expression 2-of-{a, b, c} is logically equivalent to (a ∧ b) ∨ (a ∧ c) ∨ (b ∧ c).

Rule Extraction: Given a complicated model, such as an artificial neural network, and the data used to train it, produce a symbolic description of the model.

Symbolic Rule: A pattern explicitly comprising an antecedent and a consequent, usually in the form of IF-THEN.
Figure 1. A 3-D cube that consists of 1-D, 2-D, and 3-D cuboids (dimensions: product {toy, clothes, cosmetic}, year {2001, 2002, 2003}, location {NJ, NY, PA}; measure: total sales)

Figure 2. An example 2-D cuboid on (product, year) for the 3-D cube in Figure 1 (location = '*'); total sales needs to be aggregated (e.g., SUM)
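The aggregation implied by Figure 2, collapsing the location dimension, can be sketched directly (the base rows below are toy values, not data from the article):

```python
from collections import defaultdict

base = [  # base cells: (product, year, location, total_sales)
    ("toy", 2001, "NY", 1.0),
    ("toy", 2001, "NJ", 1.5),
    ("clothes", 2002, "PA", 2.0),
]

# 2-D cuboid on (product, year): aggregate the measure over location with SUM
cuboid = defaultdict(float)
for product, year, location, sales in base:
    cuboid[(product, year)] += sales     # location = '*'
```

The same loop with a different key tuple produces any other cuboid of the cube, which is why cube computation is essentially the computation of all such group-bys.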
Computation of OLAP Cubes
The top-down cube computation works with distributive or algebraic functions. These functions have the property that more detailed aggregates (i.e., more dimensions) can be used to compute less detailed aggregates. This property induces a partial ordering (i.e., a lattice) on all the group-bys of the cube. A group-by is called a child of some parent group-by if the parent can be used to compute the child.

• Cache-results: This optimization aims at ensuring that the result of a group-by is cached (in memory) so other group-bys can use it in the future.

• Amortize-scans: This optimization amortizes the cost of a disk read by computing the maximum possible number of group-bys together in memory.

• Share-sorts: For a sort-based algorithm, this aims at sharing sorting cost across multiple group-bys.

[Figure: the lattice of group-bys on dimensions A, B, C, D (ABCD; AB, AC, AD, BC, BD, CD; A, B, C, D; all), shown both as the full lattice and as a top-down computation tree.]
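The lattice and its parent/child relation can be enumerated directly (an illustrative sketch):

```python
from itertools import combinations

dims = ("A", "B", "C", "D")

# all 2^4 = 16 group-bys, from the empty group-by ('all') up to ABCD
groupbys = [frozenset(c) for r in range(len(dims) + 1)
            for c in combinations(dims, r)]

def children(parent):
    """Direct children in the lattice: one dimension fewer, contained
    in the parent (so the parent can be used to compute them)."""
    return [g for g in groupbys if g < parent and len(g) == len(parent) - 1]
```

For example, the children of AB are A and B; walking this relation from ABCD downward yields exactly the top-down computation order.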
A query Q() is view monotonic on a view V(Q, X) if, for any cell X in any database D such that V is the view for X, the condition Q being FALSE for X implies that Q is FALSE for all X' ⊆ X. An important property of view monotonicity is that the time and space required for checking it for a query depend on the number of terms in the query and not on the size of the database or the number of its attributes. Because most queries typically have few terms, it is useful in many practical situations. Imielinski et al. (2002) present a method for checking view monotonicity for a query that includes constraints of the type (Agg {<, >, =, !=} c), where c is a constant and Agg can be MIN, SUM, MAX, AVERAGE, COUNT, aggregates that are higher-order moments about the origin, or aggregates that are an integral of a function on a single attribute.

Hybrid Approach

Typically, for low-dimension, low-cardinality, dense datasets, the top-down approach is more applicable than the bottom-up one. However, combining the two approaches leads to an even more efficient algorithm (Xin, Han, Li, & Wah, 2003). On the global computation order, that work uses the top-down approach; at a sublayer underneath, it exploits the potential of the bottom-up model. Consider the top-down computation tree in Figure 4. Notice that the dimension ABC is included in all the cuboids of the leftmost subtree. Similarly, all the cuboids in the second subtree include the dimensions AB. These common dimensions are termed the shared dimensions of the particular subtrees and enable bottom-up computation. The observation is that if a query is FALSE and (view-)monotonic on the cell defined by the shared dimensions, then the rest of the cells generated from this shared dimension are unneeded. The critical requirement is that for every cell X, the cell for the shared dimensions must be computed first. The advantage of such an approach is that it allows for shared computation, as in the top-down approach, as well as for pruning, when possible.

Other Approaches

At each step, the greedy selection picks the group-by whose benefit exceeds the benefit of any other nonmaterialized group-by. Let S be the set of materialized group-bys. The benefit of including a group-by v in the set S is the total savings achieved for computing the group-bys not included in S by using v, versus the cost of computing them through some group-by already in S. Gupta, Harinarayan, Rajaraman, and Ullman (1997) further extend this work to include indices in the cost.

The subset of the cuboids selected for materialization is referred to as a partial cube. Efficient approaches have been suggested in the literature for computing the partial cube. In one such approach, suggested by Dehne, Eavis, and Rau-Chaplin (2004), the cuboids are computed in a top-down fashion. The process starts with the original lattice or the PipeSort spanning tree of the original lattice, organizes the selected cuboids into a tree of minimal cost, and then further tries to reduce the cost by possibly adding intermediate nodes.

After a set of cuboids has been materialized, queries are evaluated by using the materialized results. Park, Kim, and Lee (2001) describe a typical approach. Here, the OLAP queries are answered in a three-step process. In the first step, it selects the materialized results that will be used for rewriting and identifies the part of the query (region) that each materialized result can answer. Next, query blocks are generated for these query regions. Finally, the query blocks are integrated into a rewritten query.

FUTURE TRENDS

In this paper, I focus on the basic aspects of cube computation. The field is fairly recent, going back no more than 10 years. However, as the field is beginning to mature, the issues are becoming better understood. Some of the issues that will get more attention in future work include:

• Advanced data structures for organizing the input tuples of the input cuboids (Han, Pei, Dong, & Wang, 2001; Xin et al., 2003).

• Making use of inherent properties of the dataset to reduce computation of the data cubes. An example
Until now, I have considered computing the cuboids from is the range cubing algorithm, which utilizes the
the base data. Another commonly used approach is to correlation in the datasets to reduce the computa-
materialize the results of a selected set of group-bys and tion cost (Feng, Agrawal, Abbadi, & Metwally,
evaluate all queries by using the materialized results. 2004).
Harinarayan, Rajaraman, and Ullman (1996) describe an Compressing the size of the data cube and storing
approach to materialize a limit of k group-bys. The first it efficiently. A good example is a quotient cube,
group-by to materialize always includes the top group-by, which partitions the cube into classes such that
as none of the group-bys can be used to answer queries each cell in a class has the same aggregate value,
for this group-by. The next group-by to materialize is and the lattice generated preserves the original
included such that the benefit of including it in the set cubes semantics (Lakshmanan, Pei, & Han, 2002).
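The quotient-cube idea in the last point can be made concrete with a small sketch. This is an illustrative reading only, not Lakshmanan et al.'s published algorithm: cells that aggregate exactly the same set of base tuples (the same cover) necessarily share every aggregate value, so one representative per class suffices. The two-tuple fact table below is invented for illustration.

```python
from itertools import combinations

def cells(tuples, dims):
    """Enumerate every cube cell instantiated by the data: one cell per
    (subset of dimensions, combination of attribute values present)."""
    out = set()
    for t in tuples:
        for r in range(len(dims) + 1):
            for d in combinations(dims, r):
                out.add(tuple((k, t[k]) for k in d))
    return out

def quotient_classes(tuples, dims):
    """Group cells by their cover, i.e. the set of base tuples they
    aggregate; cells with equal covers have equal aggregates, so only one
    representative per class needs to be stored."""
    classes = {}
    for cell in cells(tuples, dims):
        cover = frozenset(i for i, t in enumerate(tuples)
                          if all(t[k] == v for k, v in cell))
        classes.setdefault(cover, []).append(cell)
    return classes
```

On a fact table with the two tuples {a=1, b=1} and {a=1, b=2}, the six instantiated cells collapse into three cover classes, illustrating the compression the quotient cube achieves.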
Computation of OLAP Cubes
Additionally, I believe that future work in this direction will attack the problem from multiple angles rather than with a single focus. Sismanis, Deligiannakis, Roussopoulus, and Kotidis (2002) describe one such work, Dwarf. Its architecture integrates multiple features, including the compression of data cubes, a tunable parameter for controlling the amount of materialization, and indexing and support for incremental updates (which is important when the underlying data are periodically updated).

CONCLUSION

This paper focuses on the methods for OLAP cube computation. All methods share the similarity that they make use of the ordering defined by the cube lattice to drive the computation. In the top-down approach, traversal occurs from the top of the lattice. This has the advantage that intermediate node results are used for the computation of successive descendant cuboids. The bottom-up approach traverses the lattice in the reverse direction. This method can no longer rely on intermediate node results; its advantage lies in the ability to prune cuboids in the lattice that do not lead to useful answers. The hybrid approach uses a combination of both methods, taking advantage of both.

The other major option for computing the cubes is to materialize only a subset of the cuboids and to evaluate queries by using this set. The advantage lies in storage costs, but additional issues are raised, such as the identification of the cuboids to materialize, algorithms for materializing them, and query rewrite algorithms.

REFERENCES

Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J. F., Ramakrishnan, R., et al. (1996). On the computation of multidimensional aggregates. Proceedings of the International Conference on Very Large Data Bases (pp. 506-521), India.

Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference (pp. 207-216), USA.

Beyer, K. S., & Ramakrishnan, R. (1999). Bottom-up computation of sparse and iceberg cubes. Proceedings of the ACM SIGMOD Conference (pp. 359-370), USA.

Dehne, F. K. H., Eavis, T., & Rau-Chaplin, A. (2004). Top-down computation of partial ROLAP data cubes. Proceedings of the Hawaii International Conference on System Sciences, USA.

Feng, Y., Agrawal, D., Abbadi, A. E., & Metwally, A. (2004). Range CUBE: Efficient cube computation by exploiting data correlation. Proceedings of the International Conference on Data Engineering (pp. 658-670), USA.

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. Proceedings of the International Conference on Data Engineering (pp. 152-159), USA.

Gupta, H., Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1997). Index selection for OLAP. Proceedings of the International Conference on Data Engineering (pp. 208-219), UK.

Han, J., Pei, J., Dong, G., & Wang, K. (2001). Efficient computation of iceberg cubes with complex measures. Proceedings of the ACM SIGMOD Conference (pp. 1-12), USA.

Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1996). Implementing data cubes efficiently. Proceedings of the ACM SIGMOD Conference (pp. 205-216), Canada.

Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). Cubegrades: Generalizing association rules. Data Mining and Knowledge Discovery, 6(3), 219-257.

Lakshmanan, L. V. S., Pei, J., & Han, J. (2002). Quotient cube: How to summarize the semantics of a data cube. Proceedings of the International Conference on Very Large Data Bases (pp. 778-789), China.

Park, C.-S., Kim, M. H., & Lee, Y.-J. (2001). Rewriting OLAP queries using materialized views and dimension hierarchies in data warehouses. Proceedings of the International Conference on Data Engineering, Germany.

Sismanis, Y., Deligiannakis, A., Roussopoulus, N., & Kotidis, Y. (2002). Dwarf: Shrinking the petacube. Proceedings of the ACM SIGMOD Conference (pp. 464-475), USA.

Xin, D., Han, J., Li, X., & Wah, B. W. (2003). Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. Proceedings of the International Conference on Very Large Data Bases (pp. 476-487), Germany.

Zhao, Y., Deshpande, P. M., & Naughton, J. F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. Proceedings of the ACM SIGMOD Conference (pp. 159-170), USA.
Concept Drift
Marcus A. Maloof
Georgetown University, USA
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
approaches is a critical issue. In the next two sections, I survey approaches for learning concepts that change over time and discuss issues of evaluating such approaches.

Survey of Approaches for Concept Drift

STAGGER (Schlimmer & Granger, 1986) was the first system for coping with concept drift. Its model consists of nodes, corresponding to features and class labels, linked together with probabilistic arcs representing the strength of association between features and class labels. As STAGGER processes new instances, it increases or decreases probabilities, and it may add nodes and arcs. To classify an unknown instance, STAGGER predicts the most probable class.

Partial-memory approaches maintain a store of partially built models, a portion of the previously encountered instances, or both. Such approaches vary in how they use this information to adjust current models. The FLORA systems (Widmer & Kubat, 1996) maintain a sequence of examples over a dynamically adjusted window of time. The Window Adjustment Heuristic (WAH) adjusts the size of this window in response to performance changes. Generally, if performance is decreasing or poor, then the heuristic reduces the window's size; if it is increasing or acceptable, then it increases the size. These systems also maintain a store of rules, including ones that are overly general, although these are not used for prediction. As the systems process instances, they create new rules or refine existing ones. To classify an instance, the FLORA systems select the rule that best matches and return its class label.

The AQ-PM systems maintain a set of examples over a window of time, but the systems select examples from the boundaries of rules, so they can retain examples that do not reoccur in the data stream. AQ-PM (Maloof & Michalski, 2000) builds new rules when new examples arrive, whereas AQ11-PM (Maloof & Michalski, 2004) refines existing rules. These systems maintain examples over a static window of time, but AQ11-PM+WAH (Maloof, 2003) incorporates Widmer and Kubat's (1996) WAH for dynamically sizing this window. Because these systems use rules, when classifying an instance, they return as their prediction the class label of the best matching rule.

The Concept-adapting Very Fast Decision Tree system (Hulten et al., 2001), or CVFDT, progressively grows a decision tree downward by splitting leaf nodes. It maintains frequency counts for attribute values by class and extends the tree when a statistical test indicates that a change has occurred. CVFDT also maintains at each node a list of alternate subtrees, which it swaps with the current subtree when it detects drift. To classify an unknown instance, the method uses the instance's values to traverse the current tree from the root to a leaf node, returning as the prediction the associated label.

The Concept Drift 3 system (Black & Hickey, 1999), or CD3, uses batches of instances annotated with a time stamp of either current or new to build a decision tree. When drift occurs, time becomes more relevant for prediction, so the time-stamp attribute will appear higher in the decision tree. After pruning, CD3 converts the tree to rules by enumerating all paths containing a new time stamp and then removing conditions involving the time stamp. CD3 predicts the class of the best matching rule.

Ensemble methods maintain a set of models and use a voting procedure to yield a global prediction. Blum's (1997) implementation of Weighted-Majority (Littlestone & Warmuth, 1994) uses as models histories of labels associated with pairs of features. If a model's features are present in an instance, then it predicts the most frequent label present in its history. The method initializes each model with a weight of 1 and reduces a model's weight if it predicts incorrectly. It predicts based on a weighted vote of the predictions of the models.

The Streaming Ensemble Algorithm (Street & Kim, 2001), or SEA, maintains a fixed-size collection of models, each built from a fixed number of instances. When a new batch of instances arrives, SEA builds a new model. If space exists in the collection, then it adds the new model. Otherwise, it replaces the worst performing model, if one exists, with the new model. SEA predicts the majority vote of the predictions of the models in the collection.

The Accuracy-Weighted Ensemble (Wang et al., 2003) also maintains a fixed-size collection of models, each built from a batch of instances. However, this method weights each classifier in the collection based on its performance on the most recent batch. When adding a new weighted model, if there is no space in the collection, then the method stores only the top weighted models. The method predicts based on a weighted-majority vote of the predictions of the models in the collection.

Dynamic Weighted Majority (Kolter & Maloof, 2003), or DWM, maintains a collection of weighted models but dynamically adds and removes models with changes in performance. Instead of building a single model with each batch, DWM uses new instances to refine all the models in the collection. Each time a model predicts incorrectly, DWM reduces its weight, and DWM removes a model from the collection if its weight falls below a threshold. Like the previous method, DWM predicts based on a weighted-majority vote of the predictions of the models, but if the global prediction
(i.e., the weighted-majority vote) is incorrect for an instance, then DWM adds a new model to the collection.

Evaluation

Evaluating systems for concept drift is a critical issue. Ideally, one wants to use a real-world data set, but finding such a data set in which concept drift is easy to identify is itself a challenge. After all, if it were easy to detect concept drift in large data sets, then the task of writing systems to cope with it would be trivial. Even if one establishes that drift is occurring in a data set, to conduct a proper evaluation, the phenomenon must produce a measurable effect in a method's performance. Moreover, the effect of concept drift on the method's performance must be greater than all other effects, such as that due to the variability from processing new examples of a target concept. As a result of these issues, the majority of evaluations have involved synthetic or artificial data sets.

A hallmark of synthetic data sets for concept drift is that they cycle through a series of target concepts. The first target concept persists for a period of time, then the second persists, and so on. For each time step of a period, one randomly generates a set of training examples, which the method uses to build or refine its models. The method evaluates the resulting model on the examples in a test set and computes a measure of performance, such as predictive accuracy (i.e., the percentage of the test examples the method predicted correctly). One generates test cases every time step or every time period. As the system processes examples, ideally, its performance on the test set improves at a certain rate and to a certain level of accuracy. Indeed, the slope and asymptote are critical characteristics of a method's performance.

Crucially, each time the target concept changes, one generates a new set of testing examples. Naturally, when the method applies the model built for the previous target concept to the examples of the new concept, the method's performance will be poor. However, as the method processes training examples of the new target concept, as before, performance will improve with some slope and to some asymptote.

By far, the most widely used synthetic data set is the so-called STAGGER Concepts. Originally used by Schlimmer and Granger (1986), it has been the centerpiece for many evaluations (e.g., Kolter & Maloof, 2003; Maloof & Michalski, 2000, 2004; Widmer, 1997; Widmer & Kubat, 1996). There are three attributes: size, color, and shape. Size can be small, medium, or large; color can be red, blue, or green; shape can be triangle, circle, or rectangle. There are three target concepts, and the presentation of examples lasts for 120 time steps. The target concept for the first 40 time steps is [size = small] & [color = red]. For the next 40, it is [color = green] ∨ [shape = circle]. For the final 40, it is [size = medium ∨ large]. At each time step, one generates a single training example and 100 test cases of the target concept. Naturally, the method updates its model by using the training example and evaluates it by using the testing examples, calculating accuracy. One presentation of the STAGGER Concepts is not sufficient for a proper evaluation, so researchers conduct multiple runs, averaging accuracy at each time step.

Because this is clearly a small problem with only 27 possible examples, researchers have recently proposed larger synthetic data sets involving concept drift (e.g., Hulten et al., 2001; Street & Kim, 2001; Wang et al., 2003). I do not have the space here to survey the details, similarities, and differences of these synthetic problems, but they are all based on the same ideas present in the STAGGER Concepts. For instance, researchers have used rotating (Hulten et al., 2001) and shifting (Street & Kim, 2001) hyperplanes as changing target concepts.

As noted previously, there have also been evaluations involving concept drift in real data sets. Blum (1997) used a calendar-scheduling task in which a user's preference for meetings changed over time. Lane and Brodley (1998) examined an intrusion-detection application, mining sequences of UNIX commands. Finally, Black and Hickey (2002) studied concept drift in the phone records of customers of British Telecom.

FUTURE TRENDS

Tracking concept drift is a rich problem that has led to a diverse set of approaches and evaluation methodologies, and there are many opportunities for further investigation. One is the development of better methods for evaluating systems that cope with evolutionary drift. I have already discussed the difficult nature of this problem: To evaluate how systems cope with drift, there must be a measurable effect. Slow or evolutionary drift may not produce an effect different enough from that of mining the sequence of instances. If noise is present, then it is even more difficult to distinguish among the variability due to instances, drift, and noise. Existing systems are probably capable of tracking slow concept drift, but we need evaluation methodologies for measuring how such drift affects performance.

I have made the case for the importance of strong evaluations, and in this regard, researchers need to better place their work in context with past efforts. Presently, researchers often choose not to use data sets from past studies, and create new ones for their investigation. Because new studies typically consist of
new methods evaluated on new data sets, it is difficult to place new methods in context with previous work. As a consequence, it is presently impossible to truly understand the strengths and weaknesses of existing methods, both old and new. This is not to say that researchers should not create or introduce new data sets. Indeed, I have already mentioned that the STAGGER Concepts is a small problem, so creating a new, larger one is required if, say, researchers want to establish how methods scale to larger data streams. Nonetheless, by first evaluating new methods on existing problems, researchers will be able to make stronger conclusions about the performance of their method, and the community will be able to better understand the contribution of the method.

Finally, we need to develop a systems theory of concept drift. Presently, we have little to guide the development of future methods or to guide the selection of a particular method for a new application. However, before developing such a theory, we will need to develop better methods for evolutionary concept drift, and we will need stronger evaluations that place new work in context with that of the past.

CONCLUSION

Tracking concept drift is important for many applications, from e-mail sorting to market-basket analysis. Concepts may change quickly or gradually; they may occur once or at regular intervals. The diversity and complexity of coping with concept drift has led researchers to propose an equally varied set of approaches. Some modify their models, some do so with partial memory of the past, and some rely on groups of models. Although there have been evaluations of these approaches involving real data sets, the majority have involved synthetic data sets, which give researchers great control in testing hypotheses. With stronger evaluations that place new systems in context with past work, we will be able to propose theories for systems that cope with concept drift, theories that we can test and that will lead to new systems for this critical problem for many applications.

REFERENCES

Black, M., & Hickey, R. (1999). Maintaining the performance of a learned classifier under concept drift. Intelligent Data Analysis, 3, 453-474.

Black, M., & Hickey, R. (2002). Classification of customer call data in the presence of concept drift and noise. In Lecture Notes in Computer Science: Vol. 2311. Software 2002: Computing in an imperfect world (pp. 74-87). New York: Springer.

Blum, A. (1997). Empirical support for Winnow and Weighted-Majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26, 5-23.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 97-106).

Klenner, M., & Hahn, U. (1994). Concept versioning: A methodology for tracking evolutionary concept drift in dynamic concept systems. Proceedings of the 11th European Conference on Artificial Intelligence (pp. 473-477).

Klinkenberg, R., & Joachims, T. (2000). Detecting concept drift with support vector machines. Proceedings of the 17th International Conference on Machine Learning (pp. 487-494), USA.

Kolter, J., & Maloof, M. (2003). Dynamic weighted majority: A new ensemble method for tracking concept drift. Proceedings of the Third IEEE International Conference on Data Mining (pp. 123-130), USA.

Kuh, A., Petsche, T., & Rivest, R. L. (1991). Learning time-varying concepts. In Advances in neural information processing systems 3 (pp. 183-189). San Francisco: Morgan Kaufmann.

Lane, T., & Brodley, C. (1998). Approaches to online learning and concept drift for user identification in computer security. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 259-263), USA.

Littlestone, N., & Warmuth, M. K. (1994). The Weighted Majority algorithm. Information and Computation, 108, 212-261.

Maloof, M. (2003). Incremental rule learning with partial instance memory for changing concepts. Proceedings of the International Joint Conference on Neural Networks (pp. 2764-2769), USA.

Maloof, M., & Michalski, R. (2000). Selecting examples for partial memory learning. Machine Learning, 41, 27-52.

Maloof, M., & Michalski, R. (2004). Incremental learning with partial instance memory. Artificial Intelligence, 154, 95-126.
Condensed Representations for Data Mining
borders S and G and has been applied successfully to feature extraction in the domain of molecular fragment finding. In this case, a conjunction of a minimal frequency in one set of molecules (e.g., the active ones) and a maximal frequency in another set of molecules (e.g., the inactive ones) is used. This kind of research is related to so-called emerging pattern discovery (Dong & Li, 1999).

Considering the extended theory for frequent itemsets, it is clear that, given the maximal frequent sets and their frequencies, we have an approximate condensed representation of the frequent itemsets. Without looking at the data, we can regenerate the whole collection of the frequent itemsets (subsets of the maximal ones), and we have a bounded error on their frequencies: when considering a subset of a maximal σ-frequent itemset, we know that its frequency is in [σ, 1]. Even though more precise bounds can be computed, this approximation is useless in practice. Indeed, when using borders, users have other applications in mind (e.g., feature construction).

The maximal frequent itemsets can still be computed in cases where such large frequent itemsets hold that the regeneration process becomes impossible. Typically, a maximal frequent itemset of size 30 would lead to the regeneration of around 2^30 (roughly 10^9) frequent sets.

The closed sets are not the only interesting elements of the equivalence classes: it is possible to consider their generators. Interestingly, these generators constitute condensed representations, as well. An important characterization is the one of free sets (Boulicaut et al., 2000), which has been proposed independently under the name key patterns (Bastide et al., 2000). By definition, the closures of (frequent) free sets are (frequent) closed sets. Given the equivalence classes we quoted earlier, free sets are their minimal elements, and the freeness property can lead to efficient pruning, thanks to its anti-monotonicity. Computing the frequent free sets plus an extra collection of some non-free itemsets (part of the so-called negative border of the frequent free sets), it is possible to regenerate the whole collection of the frequent itemsets and their frequencies (Boulicaut et al., 2000; Boulicaut et al., 2003). On the one hand, we often have many more frequent free sets than frequent closed sets; on the other hand, they are smaller. The concept of freeness has been generalized for other exact condensed representations, like the disjunct-free itemsets (Bykowski & Rigotti, 2001), the non-derivable itemsets (Calders & Goethals, 2002), and the minimal k-free representations of frequent sets (Calders & Goethals, 2003). Regeneration algorithms and translations between condensed representations have been studied, as well (Kryszkiewicz et al., 2004).
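The approximate regeneration from the maximal frequent itemsets described earlier can be sketched in a few lines. This is an illustrative sketch, not a published algorithm: every subset of a maximal σ-frequent itemset is frequent, and without rescanning the data its frequency is bounded below by the largest frequency among its maximal supersets (and above by 1). The itemsets and threshold in the test are invented.

```python
from itertools import combinations

def regenerate(maximal, sigma):
    """Enumerate all frequent itemsets from the maximal sigma-frequent
    ones, attaching frequency bounds: freq(X) lies in
    [max frequency of a maximal superset of X, 1]."""
    bounds = {}
    for m, f in maximal.items():
        for r in range(len(m) + 1):
            for sub in combinations(sorted(m), r):
                s = frozenset(sub)
                # tighten the lower bound with this maximal superset's frequency
                lo = max(bounds.get(s, (sigma,))[0], f)
                bounds[s] = (lo, 1.0)
    return bounds
```

Note that an itemset below several maximal sets inherits the best (largest) of their frequencies as its lower bound.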
Multiple Uses of Condensed Representations

…representations are quite general and might be considered with success for many other pattern domains.
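The compression that closed sets achieve over the full collection of frequent sets, discussed earlier, can be made visible with a brute-force sketch on a toy transaction database. The data and minimum support below are invented for illustration; real miners such as CHARM avoid this exhaustive enumeration.

```python
from itertools import combinations

def frequent_itemsets(db, minsup):
    """Exhaustively enumerate itemsets with support >= minsup."""
    items = sorted({i for t in db for i in t})
    freq = {}
    for r in range(1, len(items) + 1):
        for c in combinations(items, r):
            s = frozenset(c)
            sup = sum(1 for t in db if s <= t)
            if sup >= minsup:
                freq[s] = sup
    return freq

def closed_itemsets(freq):
    """A frequent itemset is closed if no proper superset has
    the same support; closed sets determine all other supports."""
    return {s: n for s, n in freq.items()
            if not any(s < t and freq[t] == n for t in freq)}
```

On four transactions (abc three times, ab once), seven frequent itemsets collapse to two closed ones; on dense correlated data, the gap can span orders of magnitude.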
Casali, A., Cicchetti, R., & Lakhal, L. (2003). Cube lattices: A framework for multidimensional data mining. Proceedings of the SIAM International Conference on Data Mining, San Francisco, California.

de Raedt, L. (2002). A perspective on inductive databases. SIGKDD Explorations, 4(2), 66-77.

de Raedt, L., Jäger, M., Lee, S.D., & Mannila, H. (2002). A theory of inductive query answering. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan.

de Raedt, L., & Kramer, S. (2001). The levelwise version space algorithm and its application to molecular fragment finding. Proceedings of the International Joint Conference on Artificial Intelligence, Seattle, Washington.

Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Diego, California.

Goethals, B., & Zaki, M.J. (2004). Advances in frequent itemset mining implementations. SIGKDD Explorations, 6(1), 109-117.

Imielinski, T., & Mannila, H. (1996). A database perspective on knowledge discovery. Communications of the ACM, 39(11), 58-64.

Jeudy, B., & Boulicaut, J.-F. (2002). Optimization of association rule mining queries. Intelligent Data Analysis, 6(4), 341-357.

Kryszkiewicz, M., Rybinski, H., & Gajek, M. (2004). Dataless transitions between concise representations of frequent patterns. Intelligent Information Systems, 22(1), 41-70.

Mannila, H., & Toivonen, H. (1996). Multiple uses of frequent sets and condensed representations. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, Oregon.

Mannila, H., & Toivonen, H. (1997). Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3), 241-258.

Mitchell, T.M. (1982). Generalization as search. Artificial Intelligence, 18, 203-226.

Ng, R., Lakshmanan, L.V.S., Han, J., & Pang, A. (1998). Exploratory mining and pruning optimizations of constrained associations rules. Proceedings of the International Conference on Management of Data, Seattle, Washington.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999). Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1), 25-46.

Pei, J., Dong, G., Zou, W., & Han, J. (2002). On computing condensed frequent pattern bases. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan.

Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining closed sequential patterns in large databases. Proceedings of the SIAM International Conference on Data Mining, San Francisco, California.

Zaki, M.J., & Hsiao, C.J. (2002). CHARM: An efficient algorithm for closed itemset mining. Proceedings of the SIAM International Conference on Data Mining, Arlington, Texas.

KEY TERMS

Condensed Representations: Alternative representations of the data that preserve crucial information for being able to answer some kinds of queries. The most studied example concerns frequent sets and their frequencies. Their condensed representations can be several orders of magnitude smaller than the collection of the frequent itemsets.

Constraint-Based Data Mining: Concerns the active use of constraints that specify the interestingness of patterns. Technically, it needs strategies to push the constraints, or at least part of them, deeply into the data-mining algorithms.

Inductive Databases: An emerging research domain where knowledge discovery processes are considered as querying processes. Inductive databases contain both data and patterns, or models, which hold in the data. They are queried by means of more or less ad-hoc query languages.

Pattern Domains: A pattern domain is the definition of a language of patterns, a collection of evaluation functions that provide properties of patterns in database instances, and the kinds of constraints that can be used to specify pattern interestingness.
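The constraint pushing mentioned under Constraint-Based Data Mining, and the anti-monotone pruning discussed earlier for freeness, can be sketched with a minimal levelwise (Apriori-style) miner. This is an illustrative sketch of the generic technique, not any one cited system: a candidate of size k is generated only if all of its size-(k-1) subsets are already known to be frequent, so the frequency constraint prunes the search space before the data are scanned.

```python
from itertools import combinations

def apriori(db, minsup):
    """Levelwise mining that pushes the anti-monotone frequency
    constraint into candidate generation."""
    support = lambda s: sum(1 for t in db if s <= t)
    # level 1: frequent single items
    items = sorted({i for t in db for i in t})
    result = {frozenset([i]): support(frozenset([i])) for i in items}
    result = {s: n for s, n in result.items() if n >= minsup}
    level = set(result)
    while level:
        k = len(next(iter(level))) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # anti-monotone pruning: drop candidates with an infrequent subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in result
                             for s in combinations(c, k - 1))}
        level = set()
        for c in candidates:
            n = support(c)
            if n >= minsup:
                result[c] = n
                level.add(c)
    return result
```

On the transactions {ab, abc, ac, b} with minimum support 2, the candidate abc is pruned without counting because its subset bc is infrequent.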
Odej Kao
University of Paderborn, Germany
INTRODUCTION

Sensing and processing multimedia information is one of the basic traits of human beings: The audiovisual system registers and transports surrounding images and sounds. This complex recording system, complemented by the senses of touch, taste, and smell, enables perception and provides humans with data for analysing and interpreting the environment. Imitating this perception and simulating the processing was and still is one of the major leitmotifs of multimedia technology developments. The goal is to find a representation for every type of knowledge which makes the reception and processing of information as easy as possible. The need to process given information, deliver it, and explain it to a certain audience exists in nearly all areas of day-to-day life: commerce, science, education, and entertainment (Smeulders, Worring, Santini, Gupta, & Jain, 2000).

The development of digital technologies and applications allowed the production of huge amounts of multimedia data. This information has to be systematically collected, registered, organised, and classified. Furthermore, search procedures, methods to formulate queries, and ways to visualise the results have to be provided. In early years, this task was handled by existing database management systems (DBMS) with multimedia extensions. The basis for representing and modelling multimedia data is so-called binary large objects, which store images, video, and audio sequences without any formatting and analysis done by the system. Often, however, only a reference to the object is handled within the DBMS. For the utilisation of the stored multimedia data, user-defined functions (e.g., content analysis) access the actual data and integrate their results in the existing database. Hence, content-based retrieval becomes possible. A survey of existing retrieval systems was presented, for example, by Naphade and Huang (2002).

This article provides an overview of the complex relations and interactions among the different aspects of a content-based retrieval system, whereby the scope is purposely limited to images. The main issues of data description, similarity expression, and access are addressed and illustrated for an actual system.

BACKGROUND

The concept of content-based retrieval is datacentric per se; that is, the design of a system has to reflect the characteristics of the data. Hence, neither does an optimal solution exist that can span all kinds of multimedia data, nor is addressing the variety of data characteristics within one type even possible. However, there are parallels that lay the foundation, which then require tailor-made adaptation and specialisation. This section provides the general groundwork by pointing out the different types of so-called metainformation, which describes the raw data:

- Technical information refers to the details of the recording, conversion, and saving process (i.e., format and name of the stored media).
- Extracted attributes are those that have been deduced by analysing the media content. They are usually called features and emphasise a certain aspect of the media. Simple features describe, for instance, statistical values of the contained information, while complex features and their weighted combinations attempt to describe the entire media content.
- Knowledge-based information links the objects, people, scenarios, and so forth, detected in the media to entities in the real world.
- World-oriented information encompasses information on the producer of the media, the date, location, and so forth. Manually added keywords belong especially in this group, which makes a primitive description and characterisation of the content possible.

As can be seen from this classification, technical and world-oriented information can be modelled straightforwardly in traditional database structures. Organising and searching can be done by using existing database functions. The utilisation of the extracted attributes and knowledge-based information is more complex in nature. Although most of the currently available DBMSs can be extended with multimedia add-ins, in many cases these are not sufficient, because they cannot describe the stored data to the required degree of retrieval accuracy. However, only these two latter types of metainformation lift the system to an abstract level that allows the full exploitation of the content.

For an in-depth overview of content-based image retrieval techniques and systems, refer to Deb and Zhang (2004), Kalipsiz (2000), Smeulders et al. (2000), Vasconcelos and Kunt (2001), and Xiang and Huang (2000).

MAIN THRUST

The goal of multimedia retrieval is the selection of one or more images whose metainformation meets certain requirements or is similar to a given sample media instance. Searching the metainformation is usually based on a full-text search among the assigned keywords. Furthermore, content references, such as colour distributions in an image, or more complex information, such as wavelet coefficients, can be used. To solve the issue of having the desired search characteristic in the first place, most systems prefer to use a query with an example media item. The systems use this media as a starting point for the search and process it in the same manner as the other media objects were processed when they were inserted in the database. The content is then analysed with the selected procedures, and the media is mapped to a vector consisting of (semi-)automatically extracted features. Hereafter, the raw data is only needed for display purposes, and all further processing focuses on analysing and comparing the representative vectors. The result of this comparison is a similarity ranking. The following interfaces can be used to specify a query in a multimedia database:

- Browsing: Beginning with a predefined data set, the user can navigate in any desired direction by using a browser until a suitable media sample is found. This approach is often used when no suitable starting media is available.
- Search with keywords: Technical and world-oriented data are represented by alphanumerical fields. These can be searched for a given keyword. Choosing these keywords is extraordinarily difficult for abstract structures such as textures, partially due to the subjectivity of human perception.
- Similarity search: The similarity search is based on comparing features extracted from the raw data. Most of these features do not exhibit immediate references to the image, making them highly abstract for users without special knowledge (Assfalg, Del Bimbo, & Pala, 2002; Brunelli & Mich, 2000). Depending on the availability and characteristic of the query medium, one differentiates between query by pictorial example, query by painting, selection from standards, and image montage.

All approaches have their individual advantages as well as disadvantages, and a suitable selection depends on the domain. For example, a fingerprint database is best realised by using the query-by-pictorial-example technique, but selection from standards is a suitable candidate for a comic strip database with a limited number of characters. However, the similarity search, in particular the query-by-pictorial-example approach, is one of the most powerful methods because it provides the greatest degree of flexibility. Thus, it determines the focus hereafter.

Many different methods for feature extraction were developed and can be classified by various criteria. Based on the point in time at which the features are extracted, a-priori and dynamically extracted features are distinguished. While the first group is extracted during insertion of the corresponding media object in the database, the latter kind is generated at query time. The advantage of dynamic feature extraction is that the user can define relevant elements in the sample image, so the remaining parts of the query image do not distract from the actual search objective. Note that both approaches can be combined.

Regardless of the chosen approach, the actual features have to be extracted from the considered data. Examples for this step are histogram-based methods, calculation of statistical colour information (Mojsilovic, Hu, & Soljanin, 2002), contour descriptors (Berretti, Del Bimbo, & Pala, 2000), texture analysis (Gevers, 2002), and wavelet coefficient selection (Albuz, Kocalar, & Khokhar, 2001). The gained information, possibly from different algorithms, is combined in a so-called feature vector that is, by orders of magnitude, smaller than the raw data. This reduction in volume enables not only a suitable handling within the DBMS but also a higher level of abstraction. Therefore, it can often be utilised directly by semantic-based approaches (Djeraba, 2003; Fan, Luo, & Elmagarmid, 2004; Lu, Zhang, Liu, & Hu, 2003) and data-mining techniques (Datcu, Daschiel, & Pelizzari, 2003; Li & Narayanan, 2004).

The similarity of two multimedia objects in the content-based retrieval process is determined by comparing the representing feature vectors. Over the years, a large variety of metrics and similarity functions was developed for this purpose, whereby the best-known methods compute a multi-dimensional distance between the vectors: The smaller the distance, the higher the similarity of the corresponding media objects. Through the introduction of weights for the individual positions within the feature vectors, it is possible to emphasise and/or suppress desirable and undesirable query characteristics, respectively. In particular, the approach can help to particularise the query in iterative retrieval systems; that is, the users select suitable and unsuitable retrievals, which are used by the system for adaptation in the next iteration (Jing, Li, Zhang, H.-J., & Zhang, B., 2004).
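The weighted distance comparison of feature vectors described above can be sketched as follows. This is a minimal illustration, not the article's actual system: the four-bin colour-histogram vectors, image identifiers, and uniform weights are hypothetical.

```python
import math

def weighted_distance(q, v, w):
    """Weighted Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum(wi * (qi - vi) ** 2 for qi, vi, wi in zip(q, v, w)))

def rank_by_similarity(query, database, weights):
    """Return (image_id, distance) pairs, most similar (smallest distance) first."""
    ranking = [(image_id, weighted_distance(query, vec, weights))
               for image_id, vec in database.items()]
    return sorted(ranking, key=lambda pair: pair[1])

# Hypothetical 4-bin colour-histogram feature vectors.
db = {
    "img_a": [0.10, 0.40, 0.30, 0.20],
    "img_b": [0.70, 0.10, 0.10, 0.10],
}
query = [0.15, 0.35, 0.30, 0.20]
weights = [1.0, 1.0, 1.0, 1.0]   # uniform; relevance feedback would adapt these
print(rank_by_similarity(query, db, weights))
```

In an iterative system of the kind described in the text, the weight vector is what the relevance-feedback loop would adjust between query rounds.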
However, the application of distance-based similarity measures is not undisputed, because the meaning of distance in a multidimensional vector space spanned by arbitrary features can be ambiguous. Proposed alternatives to the general problem include angular measures or the incorporation of the data statistic in the actual distance measure. Nevertheless, the application of even the simplest measures can be successful for all intents and purposes if they are tailor-made with respect to the actual database. Still, if this requirement is not met, the retrieval can nevertheless be sufficient as long as a sufficient number of matches for all the kinds of possible queries exist.

Hitherto, the discussion on the evaluation of similarity among different feature vectors was limited to fairly simple structures; that is, the order and length of the vectors were clearly defined. However, these assumptions are insufficient when dealing with more complex feature extraction algorithms. An example is given by algorithms that result in an arbitrary number of features solely depending on the content of the multimedia object. Thus, feature vectors of different lengths have to be compared. To circumvent this problem, most of the current systems limit the dimensionality to a fixed extent, either by the choice of extraction algorithms or through a reducing postprocessing. Instances of the latter approach are the principal component analysis (PCA) and the vector quantisation (VQ). The retrieval results are acceptable, but the overall system performance suffers tremendously in systems with a high insertion rate, that is, either by constantly repeating the PCA or by optimising the codebook for the VQ dynamically, because a permanent adaptation is required. Meanwhile, a variety of alternative measures that can handle more complex feature vectors were developed, and one of the most prominent cases is the Earth Mover's Distance (EMD) (Rubner, Tomasi, & Guibas, 2000).

Selected features of an object, file, or other data structures are stored in indices, offering accelerated access. This fact implies that the construction and maintenance method of an index is of utmost importance for database efficiency. Data structures employed to support such queries are called multidimensional index structures. Well-known examples are k-d-trees, grid files, R/R*-trees, SS/SR-trees, VP-trees, and VA files. A general overview in the context of multimedia retrieval can be found in Lu (2002).

FUTURE TRENDS

Future trends in image retrieval are manifold, covering areas such as detailed search for image objects via consideration of semantic information from identified entities on the image, as well as the extension of the retrieval methods to special applications such as remote sensing. A sample application of the latter is given as an example in order to clarify the necessary adaptations.

The developments and applications of space-borne remote sensing platforms result in the production of huge amounts of image data. In particular, the step towards higher spatial and spectral resolution increases the obtained amount of data by orders of magnitude. To enable maintenance as well as retrieval, a large number of operational DBMS for remotely sensed imagery is available (Bretschneider, Cavet, & Kao, 2002; Datcu et al., 2003). However, due to the approach of basing the queries on world-oriented descriptions (e.g., location, scanner, and date), the systems are not always suitable for users with little expertise in remote sensing. Furthermore, content-related inquiries are not feasible in situations like the following scenario:

A certain region reveals symptoms of saltification, and a corresponding satellite image of the area was purchased. It is of interest to find other regions which suffered from the same phenomenon and which were successfully recovered or preserved, respectively. Thus, the applied strategies in these regions can help to develop effective counteractions for the specific case under investigation.

Generic content-based retrieval systems as introduced earlier in this paper are not suitable, because their extracted features do not consider the special characteristic of the satellite imagery; that is, for these systems, all remotely sensed images exhibit a high degree of similarity that prohibits an appropriate differentiation. The Grid Retrieval System for Remotely Sensed Imagery, G(RS)2I, is a tailor-made retrieval system that consists of highly specialised feature extraction modules, corresponding similarity evaluation, and an adapted indexing technique under the umbrella of a web-accessible DBMS.

Most of the image content in terms of remote sensing is contained in the spectral characteristic of an obtained scene, whereby, in contrast to generic images, the number of colour bands can vary between a few and several hundred. For the description of such data, a ground cover classification approach is most suitable, because it is not only an abstract description of the content but also is easily linkable with the human understanding of observed features, for example, water surfaces, forests, and urban areas. For data that consists of multiple bands, this approach leads to a highly precise retrieval. Secondly, the spatial arrangement of the detected regions, as well as their textural composition, are retrieved as features. Last but not least, highly specialised extraction techniques analyse the data for specific features such as airports, rivers, and road networks.
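The Earth Mover's Distance mentioned above is, in general, the solution of a transportation problem between two signatures (Rubner et al., 2000). As a minimal sketch, for one-dimensional histograms of equal total mass it collapses to the accumulated difference of the cumulative distributions; the example histograms below are illustrative assumptions.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms of equal total mass.

    In one dimension, the transportation problem reduces to the area between
    the two cumulative distribution functions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total = cum_p = cum_q = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        total += abs(cum_p - cum_q)
    return total

# Shifting mass by one bin costs exactly the amount of mass moved.
print(emd_1d([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # -> 1.0
```

Unlike a fixed-length vector distance, this formulation compares distributions rather than positions, which is why EMD-style measures suit the variable-length signatures discussed in the text.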
Due to the large spatial coverage of a satellite scene (covering hundreds of kilometres and therefore containing highly varying landscapes), the extraction of several feature vectors at different positions within a scene is required, because a global descriptor provides only insufficient accuracy. To solve this issue, the G(RS)2I uses feature functions that describe the underlying data; that is, the multidimensional feature vectors are approximated by a hyper surface, which is modelled by radial basis functions (Bretschneider & Li, 2003). These analytically described surfaces enable powerful access approaches to the underlying data. Hence, the search for a best match of an extracted feature vector from a query image becomes an optimisation problem that is easily solvable, due to the explicit existence of the first derivative of the feature function.

For the measurement of similarity, throughout the entire system the G(RS)2I uses a modified version of the EMD (Li & Bretschneider, 2003), because the amount of content in a satellite scene, and therefore the length of the feature vector, is generally not predictable. This concept extends even within the indexing of the data, which is based on a VP-tree, and the realisation of the iterative search engine. With respect to the latter aspect, the problem in content-based retrieval is that one feature vector often cannot describe the desired search characteristic precisely enough. Instead of adapting weights obtained through the user's feedback regarding the relevance of the previously retrieved data, the G(RS)2I is not limited to moving a single query point in the search space. The approach is to actually fuse the information content from the positively rated feature vectors by analysing the corresponding EMD flow matrix (Li & Bretschneider, 2003).

CONCLUSION

Content-based retrieval is beneficial for most types of data because it resembles the human approach of accessing the respective medium. In particular, this idea holds true for data that mankind can directly process. The actual conceptual and technical realisation of this natural process is a major challenge, because knowledge is fairly limited with respect to the way this is accomplished.

REFERENCES

Albuz, E., Kocalar, E., & Khokhar, A. A. (2001). Scalable color image indexing and retrieval using vector wavelets. IEEE Transactions on Knowledge and Data Engineering, 13(5), 851-861.

Assfalg, J., Del Bimbo, A., & Pala, P. (2002). Three-dimensional interfaces for querying by example in content-based image retrieval. IEEE Transactions on Visualization and Computer Graphics, 8(4), 305-318.

Berretti, S., Del Bimbo, A., & Pala, P. (2000). Retrieval by shape similarity with perceptual distance and effective indexing. IEEE Transactions on Multimedia, 2(4), 225-239.

Bretschneider, T., Cavet, R., & Kao, O. (2002). Retrieval of remotely sensed imagery using spectral information content. Proceedings of the Geoscience and Remote Sensing Symposium, 4 (pp. 2253-2256).

Bretschneider, T., & Li, Y. (2003). On the problems of locally defined content vectors in image databases for large images. Proceedings of the Pacific-Rim Conference on Multimedia, 3 (pp. 1604-1608).

Brunelli, R., & Mich, O. (2000). Image retrieval by examples. IEEE Transactions on Multimedia, 2(3), 164-171.

Datcu, M., Daschiel, H., & Pelizzari, A. (2003). Information mining in remote sensing image archives: System concepts. IEEE Transactions on Geoscience and Remote Sensing, 41(12), 2923-2936.

Deb, S., & Zhang, Y. (2004). An overview of content-based image retrieval techniques. Proceedings of the International Conference on Advanced Information Networking and Applications, 1 (pp. 59-64).

Djeraba, C. (2003). Association and content-based retrieval. IEEE Transactions on Knowledge and Data Engineering, 15(1), 118-135.

Fan, J., Luo, H., & Elmagarmid, A. K. (2004). Concept-oriented indexing of video databases: Toward semantic sensitive retrieval and browsing. IEEE Transactions on Image Processing, 13(7), 974-992.

Gevers, T. (2002). Image segmentation and similarity of color-texture objects. IEEE Transactions on Multimedia, 4(4), 509-516.

Jing, F., Li, M., Zhang, H.-J., & Zhang, B. (2004). Relevance feedback in region-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 14(5), 672-681.

Kalipsiz, O. (2000). Multimedia databases. Proceedings of the IEEE International Conference on Information Visualization (pp. 111-115).

Li, J., & Narayanan, R. M. (2004). Integrated spectral and spatial information mining in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 42(3), 673-685.
Li, Y., & Bretschneider, T. (2003). Supervised content-based satellite image retrieval using piecewise defined signature similarities. Proceedings of the Geoscience and Remote Sensing Symposium, 2 (pp. 734-736).

Lu, G.-J. (2002). Techniques and data structures for efficient multimedia retrieval based on similarity. IEEE Transactions on Multimedia, 4(3), 372-384.

Lu, Y., Zhang, H., Liu, W., & Hu, C. (2003). Joint semantics and feature based image retrieval using relevance feedback. IEEE Transactions on Multimedia, 5(3), 339-347.

Mojsilovic, A., Hu, H., & Soljanin, E. (2002). Extraction of perceptually important colors and similarity measurement for image matching, retrieval and analysis. IEEE Transactions on Image Processing, 11(11), 1238-1248.

Mojsilovic, A., Kovacevic, J., Hu, J.-Y., Safranek, R. J., & Ganapathy, S. K. (2000). Matching and retrieval based on the vocabulary and grammar of color patterns. IEEE Transactions on Image Processing, 9(1), 38-54.

Naphade, M. R., & Huang, T. S. (2002). Extracting semantics from audio-visual content: The final frontier in multimedia retrieval. IEEE Transactions on Neural Networks, 13(4), 793-810.

Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), 99-121.

Smeulders, A., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380.

Vasconcelos, N., & Kunt, M. (2001). Content-based retrieval from image databases: Current solutions and future directions. Proceedings of the IEEE International Conference on Image Processing, 3 (pp. 6-9).

Xiang, S. Z., & Huang, T. S. (2000). Image retrieval: Feature primitives, feature representation, and relevance feedback. Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries (pp. 10-14).

KEY TERMS

Content-Based Retrieval: The search for suitable objects in a database based on the content; often used to retrieve multimedia data.

Dynamic Feature Extraction: Analysis and description of the media content at the time of querying the database. The information is computed on demand and discarded after the query has been processed.

Feature Vector: Data that describes the content of the corresponding multimedia object. The elements of the feature vector represent the extracted descriptive information with respect to the utilised analysis.

Index Structures: Adapted data structures to accelerate the retrieval. The a-priori extracted features are organised in such a way that the comparisons can be focused on a certain area around the query.

Multimedia Database: A multimedia database system consists of a high-performance database management system and a database with a large storage capacity, and supports and manages, in addition to alphanumerical data types, multimedia objects regarding storage, querying, and searching.

Query by Pictorial Example: The query is formulated by using a user-provided example for the desired retrieval. Both the query and stored media objects are analysed in the same way.

Similarity: Correspondence of two data objects of the same medium. The similarity is determined by comparing the corresponding feature vectors, for example, by a metric or distance function.
Continuous Auditing and Data Mining

Joseph B. O'Donnell
Canisius College, USA

G. Lawrence Sanders
State University of New York at Buffalo, USA
More robust auditing tools, using more sophisticated data mining methods, are needed for mining large databases and to help auditors meet auditing requirements. According to auditing standards (Statement on Auditing Standards 99), auditors should incorporate unpredictability in the procedures performed (Ramos, 2003). Otherwise, perpetrators of frauds may become familiar with common audit procedures and conceal fraud by placing it in areas where auditors are least likely to look.

MAIN THRUST

It is critical for the modern auditor to understand the nature of CA and the capabilities of different data mining methods in designing an effective audit approach. Toward this end, this paper addresses these issues through a discussion of CA and a comparison of data mining methods, and we also provide a potential CA and data mining architecture. Although it is beyond the scope of this paper to provide an in-depth technical discussion of the details of the proposed architecture, we hope this stimulates technical research in this area and provides a starting point for CA system designers.

Continuous Auditing (CA)

Audits involve three major components: audit planning, conducting the audit, and reporting on audit findings (Konrath, 2002). The CA approach can be used for the audit planning and conducting-the-audit phases. According to Pushkin (2003), CA is useful for the strategic audit planning component that addresses "the strategic risk of reaching an inappropriate conclusion by not integrating essential activities into the audit plan" (p. 27). Strategic information may be captured from the entity's Intranets and from the global Internet using intelligent agents (Pushkin, 2003, p. 28).

CA is also useful for performing the audit, or what Pushkin (2003) refers to as the tactical component of the audit. Tactical activities "are most often directed at obtaining transactional evidence as a basis on which to assess the validity of assertions embodied in account balances" (p. 27). For example, CA is useful in testing that entities comply with financial performance measures of debt covenants in loan agreements (Woodroof & Searcy, 2001).

CA requires prompt responses to high-risk transactions and the ability to identify financial trends from large volumes of data. Intelligent agents can be used to promptly identify and respond to erroneous transactions. Understanding the capabilities of data mining methods in identifying financial trends is useful in selecting an appropriate data mining approach for CA.

Comparing Methods of Data Mining

Data mining is a process by which one discovers previously unknown information from large sets of data. Data mining algorithms can be divided into three major groups: (1) mathematical-based methods, (2) logic-based methods, and (3) distance-based methods (Weiss & Indurkhya, 1998). Common examples from each major category are described below.

Mathematical-Based Methods

Neural Networks

An Artificial Neural Network (ANN) is a network of nodes modeled after a neuron or neural circuit. The neural network mimics the processing of the human brain. In a neural network, neurons are grouped into layers or slabs (Lam, 2004). An input layer consists of neurons that receive input from the external environment. The output layer communicates the results of the ANN to the user or external environment. The ANN may also consist of a number of intermediate (hidden) layers. The processing of an ANN starts with inputs being received by the input layer; upon being excited, the neurons fire and produce outputs to the other layers of the system. The nodes or neurons are interconnected, and a neuron will send signals, or fire, only if the signals it receives exceed a certain threshold value. The value of a node is a non-linear (usually logistic) function of the weighted sum of the values sent to it by the nodes that are connected to it (Spangler, May, & Vargas, 1999).

Programming a neural network to process a set of inputs and produce the desired output is a matter of designing the interactions among the neurons. This process consists of the following: (1) arranging neurons in various layers; (2) deciding the connections among neurons of different layers, as well as among the neurons within a layer; (3) determining the way a neuron receives input and produces output (e.g., the type of function used); and (4) determining the strength of connections within the network by selecting and using a training data set so that the ANN can determine the appropriate values of the connection weights (Lam, 2004).

Prior neural network research has addressed the audit areas of risk assessment, errors and fraud, going-concern audit opinion, financial distress, and bankruptcy prediction (Lin, Hwang, & Becker, 2003). Research has identified successful uses of ANN; however, there are still many other issues to address. For instance, ANN is effective for analytical review procedures, although there is no clear guideline for the performance measures to use for this analysis (Koskivaara, 2004). Neural network research found differences between the patterns of quantitative and qualitative (textual) information from financial statements (Back, Toivenen, Vanharanta, & Visa, 2001).
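The layered processing described above, where each neuron emits a logistic function of the weighted sum of its inputs, can be sketched as a minimal feed-forward pass. The network shape and weights here are hypothetical; a real audit application would learn the weights from labelled transaction data, as step (4) above indicates.

```python
import math

def logistic(x):
    """The non-linear (logistic) activation mentioned in the text."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weight_rows, biases):
    """Each neuron outputs a logistic function of the weighted sum of its inputs."""
    return [logistic(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weight_rows, biases)]

def feed_forward(inputs, layers):
    """Propagate the input-layer values through each (weights, biases) layer."""
    signal = inputs
    for weight_rows, biases in layers:
        signal = layer_output(signal, weight_rows, biases)
    return signal

# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 1 output score.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                   # output layer
]
score = feed_forward([0.9, 0.2], layers)[0]
print(round(score, 3))
```

Training (e.g., by backpropagation) would adjust the weight and bias values so that the output score discriminates, say, fraudulent from routine transactions.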
Auditors would benefit from advancement in identifying appropriate financial inputs for ANN and from developing models that integrate quantitative and textual information.

…clusters, which can be described as regions of this space containing points that are close to each other (Maltseva et al., 2000).

Logic-Based Methods
and often compromise between predictive accuracy, level of understandability, and computational demand (Apte et al., 2003).

Indeed, when one examines issues such as scalability (i.e., how well the data mining method works regardless of the size of the data set), accuracy (i.e., how well the information extracted remains stable and constant beyond the boundaries of the data from which it was extracted, or trained), robustness (i.e., how well the data mining method works in a wide variety of domains), and interpretability (i.e., how well the data mining method provides understandable information and insight of value to the end user), it becomes clear that no data mining method currently excels in all areas (Apte et al., 2003).

The specific demands of CA raise a number of issues that relate to the selection of data mining methods and their appropriate application in this domain. The next section explores these issues in greater detail.

The CA Challenge: Difficulties and Issues

The demand for more timely communication of auditing information to business stakeholders requires auditors to find new ways to monitor, assemble, and analyze audit information. A number of continuous audit tools and techniques will need to be developed to enable auditors to assess risk, evaluate internal controls, extract data, download data for analytical review, and identify exceptions and unusual transactions.

One of the challenges in developing a CA capability is developing a technology infrastructure for extracting data with differing file formats and record structures from heterogeneous platforms (Rezaee et al., 2002). Another consideration in developing a technology infrastructure to gather transactions and other activity for auditing is the degree of automation employed. The degree of automation can vary depending on the audit system design, but at least three possibilities are:

1. Embedded audit modules, where audit programs are tightly integrated with application source code to constantly monitor and report on exceptional conditions (limited use of this approach due to potential adverse effects on entity processing; see the Background section for further discussion);
2. The automatic capture and transformation of data and storage in data warehouses, but still requiring auditor involvement in running queries to isolate exceptions and detect unusual patterns (automatic capture but with auditor intervention);
3. The automatic capture and transformation of data and storage in data warehouses and the integration of intelligent agents to modify data capture routines, exception reporting, and trend spotting via multiple data mining methods (automatic, modified capture and modified data mining for trend analysis).

Finally, the appropriate selection of a data mining method is critical for determining unusual trends and spotting fraudulent behavior. Important considerations include: (1) the size of the data set, (2) the accuracy of the given data mining method in the particular domain, and (3) the interpretability of the data mining output information.

Figure 1. (Figure not reproduced; its elements include the entity's databases and transactions, environmental monitoring, data transformation, and data marts.)
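The automation options above can be illustrated with a minimal exception-monitoring sketch in the spirit of option 3. Everything here is an illustrative assumption, not part of the proposed architecture: the rule names, the transaction fields (`id`, `amount`, `weekday`), and the thresholds. The unpredictability called for by SAS 99 is approximated by randomly varying which tests run and which threshold they use on each pass.

```python
import random

# Hypothetical audit tests; each takes a transaction and a threshold.
AUDIT_TESTS = {
    "large_amount": lambda t, limit: t["amount"] > limit,
    "weekend_entry": lambda t, limit: t["weekday"] >= 5,   # Sat/Sun posting
    "round_amount": lambda t, limit: t["amount"] % 1000 == 0,
}

def audit_pass(transactions, rng):
    """Run a randomly chosen subset of tests and return flagged transaction ids."""
    chosen = rng.sample(sorted(AUDIT_TESTS), k=2)       # unpredictable test mix
    limit = rng.choice([5000, 7500, 10000])             # unpredictable threshold
    flagged = set()
    for t in transactions:
        if any(AUDIT_TESTS[name](t, limit) for name in chosen):
            flagged.add(t["id"])
    return flagged

txns = [
    {"id": 1, "amount": 12000, "weekday": 2},
    {"id": 2, "amount": 250, "weekday": 6},
    {"id": 3, "amount": 4000, "weekday": 1},
]
print(audit_pass(txns, random.Random(42)))
```

Because the test mix and thresholds vary from pass to pass, audited parties cannot easily learn which transactions will escape scrutiny, which is the point made in the following paragraph.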
Regardless of the particular context, in the area of CA a prime consideration is the frequent and automatic alteration of the data mining technique and the appropriate selection and changing of the auditing information to be reviewed. These are important considerations because the audited parties should not be able to anticipate the audit tests and thus create erroneous transactions that might go undetected (Ramos, 2003).

A Potential Architecture

In order to handle the issues identified above, a proposed architecture for the CA process is presented in Figure 1. Two important features of the architecture are: (1) the use of data warehousing and data marts, and (2) the use of intelligent agents to periodically modify the data extraction and data mining methods. The architecture involves the transfer of data from the entity's systems to the auditor's systems through the use of XBRL and XML. The auditor's systems include environmental monitoring of political, economic, and technological factors to address strategic audit issues, and data mining of transactions to address tactical audit issues.

FUTURE TRENDS

In the future, auditing and information systems researchers will need to identify additional tools and data mining methods that are most appropriate for this new application area. Emerging innovations in Web technology and the general public's demand for reliable performance reporting are expected to spur demand for the use of data …

… reliability of performance reporting. Auditors need to understand the capabilities of different data mining approaches to ensure effective continuous auditing. Looking ahead, researchers will need to further develop data mining approaches and tools for the expanding needs of the audit profession.

REFERENCES

Apte, C.V., Hong, S.J., Natarajan, R., Pednault, E.P.D., Tipu, F.A., & Weiss, S.M. (2003). Data-intensive analytics for predictive modeling. IBM Journal of Research and Development, 47(1), 17-23.

Back, B., Toivenen, J., Vanharanta, H., & Visa, A. (2001). Comparing numerical data and text information from annual reports using self-organizing maps. International Journal of Accounting Information Systems, 2(2001), 249-269.

Bierstaker, J.L., Burnaby, P., & Hass, S. (2003). Recent changes in internal auditors' use of technology. Internal Auditing, 18(4), 39-45.

David, J.S., & Steinbart, P.J. (2000). Data warehousing and data mining: Opportunities for internal auditors. Altamonte Springs, FL: The Institute of Internal Auditors Research Foundation.

Kogan, A., Sudit, E.F., & Vasarhelyi, M.A. (2003). Continuous online auditing: An evolution (pp. 1-25). Unpublished workpaper.

Konrath, L.F. (2002). Auditing: A risk analysis approach. Cincinnati, OH: South-Western.
mining for CA. Koskivaara, E. (2004). Artificial neural networks in ana-
Expected growth in data mining for CA will provide lytical procedures. Managerial Auditing Journal,
opportunities and issues for auditors and researchers in 19(2), 191-223.
several areas. Advancement in the areas of data mining of
textual information will be useful to CA. Recognizing Lam, M. (2004). Neural network techniques for finan-
patterns in qualitative information provides a broader cial performance prediction: Integrating fundamental
perspective and richer understanding of the entitys and technical analysis. Decision Support Systems, 37(4),
internal and external situation (Back et al., 2001). Also, for 567-581.
risk based auditing ANN model builders will need quali-
tative and quantitative factors that capture political, eco- Liang, D., Fengyi, L., & Wu, S. (2001). Electronically
nomic, and technological factors, as well as balanced auditing EDP systems with the support of emerging infor-
scorecard metrics that signal the extent to which an entity mation technologies. International Journal of Account-
is achieving its strategic objectives (Lin et al., 2003, p. 230). ing Information Systems, 2, 130-147.
Lin, J.W., Hwang, M.I., & Becker, J.D. (2003). A fuzzy
neural network for assessing the risk of fraudulent finan-
CONCLUSION cial reporting. Managerial Auditing Journal, 18(8),
657-665.
The use of data mining for continuous auditing is poised
to improve the effectiveness of audits, and ultimately the
221
TEAM LinG
Maltseva, E., Pizzuti, C., & Talia, D. (2000). Indirect knowledge discovery by using singular value decomposition. In Data Mining II. Southampton, UK: WIT Press.

Nigrini, M.J. (2002). Analysis of digits and number patterns. In J.C. Robertson (Ed.), Fraud examination for managers and auditors (pp. 495-518). Austin, TX: Atex Austin, Inc.

Pushkin, A.B. (2003). Comprehensive continuous auditing: The strategic component. Internal Auditing, 18(1), 26-33.

Ramos, M. (2003). Auditors' responsibility for fraud detection. Journal of Accountancy, 195(1), 28-36.

Rezaee, Z., Sharbatoghlie, A., Elam, R., & McMickle, P.L. (2002). Continuous auditing: Building automated auditing capability. Auditing: A Journal of Practice & Theory, 21(1), 147-163.

Spangler, W.E., May, J.H., & Vargas, L.G. (1999). Choosing data mining methods for multiple classification: Representational and performance measurement implications for decision support. Journal of Management Information Systems, 16(1), 37-62.

Weiss, S.M., & Indurkhya, N. (1998). Predictive data mining. San Francisco, CA: Morgan Kaufmann.

Woodroof, J., & Searcy, D. (2001). Continuous audit model development and implementation within a debt covenant compliance domain. International Journal of Accounting Information Systems, 2, 169-191.

KEY TERMS

Auditing: Systematic process of objectively obtaining and evaluating evidence of assertions about economic actions and events to ascertain the correspondence between those assertions and established criteria, and communicating the results to interested parties.

Clustering: Data mining approach that partitions large sets of data objects into homogeneous groups.

Computer-Assisted Auditing Techniques (CAATs): Software applications that are used to improve the efficiency of an audit.

Continuous Auditing: Type of auditing that produces audit results simultaneously with, or a short period of time after, the occurrence of the relevant events.

Discriminant Analysis: Statistical methodology used for classification that is based on the general regression model and uses a nominal or ordinal dependent variable.

eXtensible Business Reporting Language (XBRL): Markup language that allows for the tagging of data and is designed for performance reporting. It is a variant of XML.

eXtensible Markup Language (XML): Markup language that allows for the tagging of data in order to add meaning to the data.

Tree and Rule Induction: Data mining approach that uses an algorithm (e.g., the ID3 algorithm) to induce a decision tree from a file of individual cases, where each case is described by a set of attributes and the class to which the case belongs.
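The digit-analysis technique cited above (Nigrini, 2002) suggests one concrete exception test a CA system might automate. The following is a minimal, illustrative sketch, not taken from the article itself: it screens a batch of transaction amounts against Benford's law of leading digits. The sample amounts and the flagging threshold are hypothetical.

```python
import math
from collections import Counter

def benford_deviation(amounts):
    """Compare observed leading-digit frequencies of transaction
    amounts against Benford's law and return the maximum absolute
    deviation (a rough exception-screening statistic)."""
    digits = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts if a]
    counts = Counter(digits)
    n = len(digits)
    worst = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)   # Benford probability of leading digit d
        observed = counts.get(d, 0) / n
        worst = max(worst, abs(observed - expected))
    return worst

# Hypothetical usage: flag a batch whose leading digits stray too far.
batch = [231.50, 118.00, 1450.75, 132.10, 187.40, 2210.00, 141.99, 165.25]
if benford_deviation(batch) > 0.15:        # threshold is illustrative only
    print("flag batch for auditor review")
```

In a CA setting such a test would run continuously against incoming transactions; in practice a formal goodness-of-fit test and a much larger sample would be used.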
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Data Driven vs. Metric Driven Data Warehouse Design
derived, because the summaries are based upon existing data. However, it is not without flaws. First, the integration of multiple data sources may be difficult. These operational data sources may have been developed independently, and their semantics may not agree. It is difficult to resolve these conflicting semantics without a known end state to aim for. But the more damaging problem is epistemological. The summary data derived from the operational systems represent something, but the exact nature of that something may not be clear. Consequently, the meaning of the information that describes that something may also be unclear. This is related to the semantic disintegrity problem in relational databases: a user asks a question of the database and gets an answer, but it is not the answer to the question that the user asked. When the somethings that are represented in the database are not fully understood, then answers derived from the data warehouse are likely to be applied incorrectly to known somethings. Unfortunately, this also undermines data mining. Data mining helps people find hidden relationships in the data. But if the data do not represent something of interest in the world, then those relationships do not represent anything interesting, either.

Research problems in data warehousing currently reflect this data-driven view. Current research in data warehousing focuses on a) data extraction and integration, b) data aggregation and production of summary sets, c) query optimization, and d) update propagation (Jarke, Lenzerini, Vassiliou, & Vassiliadis, 2000). All these issues address the production of summary data based on operational data stores.

A Poverty of Epistemology

The primary flaw in data-driven data warehouse design is that it is based on an impoverished epistemology. Epistemology is the branch of philosophy concerned with theories of knowledge and the criteria for valid knowledge (Fetzer & Almeder, 1993; Palmer, 2001). That is to say, when you derive information from a data warehouse based on the data-driven approach, what does that information mean? How does it relate to the work of the organization? To see this issue, consider the following example. If I asked each student in a class of 30 for their ages, then summed those ages and divided by 30, I should have the average age of the class, assuming that everyone reported their age accurately. If I were to generate a list of 30 random numbers between 20 and 40 and took the average, that average would be the average of the numbers in that data set and would have nothing to do with the average age of the class. In between those two extremes are any number of options. I could guess the ages of students based on their looks. I could ask members of the class to guess the ages of other members. I could rank the students by age and then use the ranking number instead of age. The point is that each of these attempts is somewhere between the two extremes, and the validity of my data improves as I move closer to the first extreme. That is, I have measurements of a specific phenomenon, and those measurements are likely to represent that phenomenon faithfully. The epistemological problem in data-driven data warehouse design is that data is collected for one purpose and then used for another purpose. The strongest validity claim that can be made is that any information derived from this data is true about the data set, but its connection to the organization is tenuous. This not only creates problems with the data warehouse; all subsequent data-mining discoveries are suspect as well.

METRIC-DRIVEN DESIGN

The metric-driven approach to data warehouse design begins by defining key business processes that need to be measured and tracked in order to maintain or improve the efficiency and productivity of the organization. After these key business processes are defined, they are modeled in a dimensional data model, and then further analysis is done to determine how the dimensional model will be populated. Hopefully, much of the data can be derived from operational data stores, but the metrics are the driver, not the availability of data from operational data stores.

A relational database models the entities or objects of interest to an organization (Teorey, 1999). These objects of interest may include customers, products, employees, and the like. The entity model represents these things and the relationships between them. As occurrences of these entities enter or leave the organization, that addition or deletion is reflected in the database. As these entities change in state, those state changes are also reflected in the database. So, theoretically, at any point in time, the database faithfully represents the state of the organization. Queries can be submitted to the database, and the answers to those queries should, indeed, be the answers to those questions if they were asked and answered with respect to the organization.

A data warehouse, on the other hand, models the business processes in an organization in order to measure and track those processes over time. Processes may include sales, productivity, the effectiveness of promotions, and the like. The dimensional model contains facts that represent measurements over time of a key business process. It also contains dimensions that are attributes of these facts. The fact table can be thought of as the dependent variable in a statistical model, and the dimensions can be thought of as the independent variables. So the data warehouse becomes a longitudinal data set tracking key business processes.

A Parallel with Pre-Relational Days

You can see certain parallels between the state of data warehousing and the state of databases prior to the relational model. The relational model was introduced in 1970 by Codd but was not realized in a commercial product until the early 1980s (Date, 2004). At that time, a large number of nonrelational database management systems existed. All these products handled data in different ways, because they were software products developed to handle the problem of storing and retrieving data. They were not developed as implementations of a theoretical model of data. When the first relational product came out, the world of databases changed almost overnight. Every nonrelational product attempted, unsuccessfully, to claim that it was really a relational product (Codd, 1985). But no one believed the claims, and the nonrelational products lost their market share almost immediately.

Similarly, a wide variety of data warehousing products are on the market today. Some are based on the dimensional model, and some are not. The dimensional model provides a basis for an underlying theory of data that tracks processes over time rather than the current state of entities. Admittedly, this model of data needs quite a bit of work, but the relational model did not come into dominance until it was coupled with entity theory, so the parallel still holds. We may never have an announcement in data warehousing as dramatic as Codd's paper in relational theory. It is more likely that a theory of temporal dimensional data will accumulate over time. However, in order for data warehousing to become a major force in the world of databases, an underlying theory of data is needed and will eventually be developed.

The Implications for Research

The implications for research in data warehousing are rather profound. Current research focuses on issues such as data extraction and integration, data aggregation and summary sets, and query optimization and update propagation. All these problems are applied problems in software development and do not advance our understanding of the theory of data.

But a metric-driven approach to data warehouse design introduces some problems whose resolution can make a lasting contribution to data theory. Research problems in a metric-driven data warehouse include: a) How do we identify key business processes? b) How do we construct appropriate measures for these processes? c) How do we know those measures are valid? d) How do we know that a dimensional model has accurately captured the independent variables? e) Can we develop an abstract theory of aggregation so that the data aggregation problem can be understood and advanced theoretically? And, finally, f) Can we develop an abstract data language so that aggregations can be expressed mathematically by the user and realized by the machine?

Initially, both data-driven and metric-driven designs appear to be legitimate competing paradigms for data warehousing. The epistemological flaw in the data-driven approach is a little difficult to grasp, and the distinction (that information derived from a data-driven model is information about the data set, while information derived from a metric-driven model is information about the organization) may also be a bit elusive. However, the implications are enormous. The data-driven model has little future, in that it is founded on a model of data exploitation rather than a model of data. The metric-driven model, on the other hand, is likely to have some major impacts and implications.

FUTURE TRENDS

The Impact on White-Collar Work

The data-driven view of data warehousing limits the future of data warehousing to the possibilities inherent in summarizing large collections of old data without a specific purpose in mind. The metric-driven view of data warehousing opens up vast new possibilities for improving the efficiency and productivity of an organization by tracking the performance of key business processes. The introduction of quality management procedures in manufacturing a few decades ago dramatically improved the efficiency and productivity of manufacturing processes, but such improvements have not occurred in white-collar work.

The reason that we have not seen such an improvement in white-collar work is that we have not had metrics to track the productivity of white-collar workers. And even if we did have the metrics, we did not have a reasonable way to collect them and track them over time. The identification of measurable key business processes and the modeling of those processes in a data warehouse provide the opportunity to perform quality management and process improvement on white-collar work.

Subjecting white-collar work to the same rigorous definition as blue-collar work may seem daunting, and indeed that level of definition and specification will not come easily. So what would motivate a business to do this? The answer is simple: businesses will have to do this when the competitors in their industry do it. Whoever does this first will achieve such productivity gains that competitors will have to follow suit in order to compete. In the early 1970s, corporations were not revamping their internal procedures because computerized accounting systems were fun. They were revamping their internal procedures because they could not protect themselves from their competitors without the information for decision making and organizational control provided by their accounting information systems. A similar phenomenon is likely to drive data warehousing.

Dimensional Algebras

The relational model introduced Structured Query Language (SQL), an entirely new data language that allowed nontechnical people to access data in a database. SQL also provided a means of thinking about record selection and limited aggregation. Dimensional models can be exploited by a dimensional query language such as MDX (Spofford, 2001), but much greater advances are possible.

Research in data warehousing will likely yield some sort of dimensional algebra that will provide, at the same time, a mathematical means of describing data aggregation and correlation and a set of concepts for thinking about aggregation and correlation. To see how this could happen, think about how the relational model led us to think about the organization as a collection of entity types, or how statistical software made the concepts of correlation and regression much more concrete.

A Unified Theory of Data

In the organization today, the database administrator and the statistician seem worlds apart. Of course, the statistician may have to extract some data from a relational database in order to do his or her analysis. And the statistician may engage in limited data modeling in designing a data set for analysis by using a statistical tool. The database administrator, on the other hand, will spend most of his or her time designing, populating, and maintaining a database. A limited amount of time may be devoted to statistical thinking when counts, sums, or averages are derived from the database. But these two individuals will largely view themselves as participating in greatly differing disciplines.

With dimensional modeling, the gap between database theory and statistics begins to close. In dimensional modeling we have to begin thinking in terms of construct validity and temporal data. We need to think about correlations between dependent and independent variables. We begin to realize that the choice of data types (e.g., interval or ratio) will affect the types of analysis we can do on the data and will hence potentially limit the queries. So the database designer has to address concerns that have traditionally been the domain of the statistician. Similarly, the statistician cannot afford the luxury of constructing a data set for a single purpose or a single type of analysis. The data set must be rich enough to allow the statistician to find relationships that may not have been considered when the data set was being constructed. Variables must be included that may potentially have impact, may have impact at some times but not others, or may have impact in conjunction with other variables. So the statistician has to address concerns that have traditionally been the domain of the database designer.

What this points to is the fact that database design and statistical exploitation are just different ends of the same problem. After these two ends have been connected by data warehouse technology, a single theory of data must be developed to address the entire problem. This unified theory of data would include entity theory and measurement theory at one end and statistical exploitation at the other. The middle ground of this theory will show how decisions made in database design affect the potential exploitations, so that intelligent design decisions can be made that allow full exploitation of the data to serve the organization's need to model itself in data.

CONCLUSION

Data warehousing is undergoing a theoretical shift from a data-driven model to a metric-driven model. The metric-driven model rests on a much firmer epistemological foundation and promises a much richer and more productive future for data warehousing. It is easy to gloss over the differences between, and the significance of, these two approaches today. The purpose of this article was to show the potentially dramatic, if somewhat speculative, implications of the metric-driven approach.

REFERENCES

Artz, J. (2003). Data push versus metric pull: Competing paradigms for data warehouse design and their implications. In M. Khosrow-Pour (Ed.), Information technology and organizations: Trends, issues, challenges and solutions. Hershey, PA: Idea Group Publishing.

Codd, E.F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.
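The fact-and-dimension arrangement described in this article can be made concrete with a small sketch. This example is illustrative only: the table names and data are hypothetical, and it uses SQL's limited aggregation (as discussed under Dimensional Algebras) rather than a true dimensional algebra. A fact table of sales measurements is joined to a time dimension and grouped, yielding a longitudinal view of a key business process.

```python
import sqlite3

# In-memory star schema: one fact table (sales) and one dimension (time).
# Table names and rows are hypothetical, for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, quarter TEXT);
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        amount  REAL);
    INSERT INTO dim_time VALUES (1, '2005Q1'), (2, '2005Q2');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 150.0), (2, 300.0);
""")

# Aggregate the fact (the "dependent variable") by a dimension attribute
# (an "independent variable") to track the process over time.
rows = con.execute("""
    SELECT t.quarter, SUM(f.amount)
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.quarter ORDER BY t.quarter
""").fetchall()
print(rows)   # [('2005Q1', 250.0), ('2005Q2', 300.0)]
```

Adding further dimensions (store, product, promotion) to the GROUP BY clause is exactly the kind of aggregation a dimensional algebra would aim to describe mathematically.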
INTRODUCTION

Data management, in its general sense, refers to activities that involve the acquisition, storage, and retrieval of data. Traditionally, information retrieval is facilitated through queries, such as exact search, nearest neighbor search, range search, etc. In the last decade, data mining has emerged as one of the most dynamic fields at the frontier of data management. Data mining refers to the process of extracting useful knowledge from data. Popular data mining techniques include association rule discovery, frequent pattern discovery, classification, and clustering. In this chapter, we discuss data management for a specific type of data, i.e., three-dimensional structures. While research on text and multimedia data management has attracted considerable attention and substantial progress has been made, data management for three-dimensional structures is still in its infancy (Castelli & Bergman, 2001; Paquet & Rioux, 1999). Data management in 3D structures raises several interesting problems:

1. Similarity search
2. Pattern discovery
3. Classification
4. Clustering

Given a database of 3D structures and a query 3D structure, similarity search looks for those structures in the database that match the query structure within a range of tolerable errors. The similarity could be defined by two different measurements. The first measurement compares the data structure with the query structure in their entirety, i.e., a point-to-point match. We will call this aggregate similarity search. The second measurement compares only the contours or shapes of the data structure with those of the query structure. This is generally referred to as shape-based similarity search. The range of tolerable errors specifies how close the match should be when the data structure is aligned with the query structure. Pattern discovery is concerned with similar substructures that occur in multiple structures. Classification and clustering, when applied to these domains, attempt to group together 3D structures with similar shapes or containing similar patterns.

BACKGROUND

Three-dimensional structures can be used to describe data in different domains. In biology and chemistry, for example, a molecule is represented as a 3D structure with connections. Each point is the center of an atom, and the connections are bonds between atoms. In computer-aided design, an object is specified as a set of 3D vectors that describes the shape of the object (Veltkamp, 2001; Suzuki & Sugimoto, 2003). In computer vision, the shape of a 3D object can be captured by X-ray or ultrasonic scanning devices. The result is a set of 3D points (Hilaga, Shinagawa, Kohmura, & Kunii, 2001). In medical imaging, 3D images of tissues or tumors can be collected using magnetic resonance imaging or computer tomography (Akutsu, Arakawa, & Murase, 2002). With advances in the Internet, scanning devices, and storage, the World Wide Web is becoming a huge reservoir of all kinds of data. The number of 3D models available over the Internet has dramatically increased in the last two decades. Similarity search is a highly desirable technique in all these domains. Classification and clustering of biological data or chemical compounds have special significance. For example, proteins are traditionally classified into families according to their specific functions. Recently, however, many approaches have been proposed to classify proteins according to their structures. Some of these approaches achieve very high accuracy when compared with their biological counterparts. Classification and clustering can also help build index structures in 3D model retrieval to speed up similarity search.

Currently, there is no universal model or framework for the representation, storage, and retrieval of three-dimensional structures. Most of these data are stored in plain text files in some specific format. The format differs from application to application. Likewise, the existing techniques for information retrieval and data mining in three-dimensional structures are rooted in the areas of application. Two main areas of application are computer vision and scientific data mining, where computation-intensive techniques have been developed and are still in demand. We focus on data management in these two areas.
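To illustrate the aggregate (point-to-point) similarity measurement introduced above, here is a minimal sketch, not taken from the chapter itself: it computes the root-mean-square deviation between two pre-aligned 3D point sets and accepts a match within a caller-supplied range of tolerable error. The coordinates and tolerance are hypothetical, and alignment of the two structures is assumed to have been done beforehand.

```python
import math

def rmsd(points_a, points_b):
    """Root-mean-square deviation between two pre-aligned 3D point
    sets with point-to-point correspondence (aggregate similarity)."""
    if len(points_a) != len(points_b):
        raise ValueError("point-to-point matching needs equal-size sets")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(points_a, points_b)
    )
    return math.sqrt(total / len(points_a))

def matches(query, candidate, tolerance):
    """True if the candidate structure matches the query within
    the range of tolerable errors."""
    return rmsd(query, candidate) <= tolerance

# Hypothetical structures: a query and a slightly perturbed candidate.
query = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0)]
candidate = [(0.1, 0.0, 0.0), (1.4, 0.1, 0.0), (1.5, 1.2, 0.1)]
print(matches(query, candidate, tolerance=0.5))   # True
```

Shape-based similarity search would instead compare derived surface or contour descriptors, so no point-to-point correspondence would be required.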
Data Management in Three-Dimensional Structures
bioinformatics, few practical approaches have been devel- domain expert can choose an optimal value according to
oped. The latest results are a performance study based on the context. A new index structure, B+ Tree, was intro-
2D structures that was conducted by Gavrilov, Indyk, duced by Wang in 2002 to overcome these weaknesses
Motwani and Venkatasubramanian at Stanford (1999) and (Wang, 2002). The B+ Trees preserve precise informa-
a novel algorithm that was developed by Wang and col- tion of the data, allow pattern discovery with variable
laborators in (Wang & Wang, 2000). The algorithm discov- ranges of tolerable errors, and remove false matches
ers patterns in 3D data with no assumptions of edges. The totally. In an attempt to compare the shapes of 3D
research group headed by Karypis at University of Minne- structures, Wang also invented a new notion called the
sota studies pattern discovery in graphs, with vertices and a-surface to capture the surface of a 3D structure in
edges (Kuramochi & Karypis, 2002). variable details (Wang, 2001).
Pattern discovery is another highly desirable tech-
nique in these domains. A motif is a substructure in
proteins that has specific geometric arrangement and, in FUTURE TRENDS
many cases, is associated with a particular function, such
as DNA binding. Active sites are another type of patterns A universal model or framework for the representation,
in protein structures. They play an important role through storage, and retrieval of three-dimensional structures is
protein-protein and protein-ligand interaction, i.e. the bind- highly desirable. However, due to the different nature
ing process. In drug design, scientists try to determine the and applications in the two areas, a model or framework
binding sites of the target molecules and seek inhibitors that fits both is not feasible. In computer vision and 3D
that can be bound to it. For example in determining the model retrieval, the number of points in each object is
structure of HIV protease and looking for effective inhibi- huge and a single point does not impact perception
tors, over 120 of these structure determinations have been substantially. In fact, a point that does not get along with
done and at least two inhibitors of HIV protease are now other points is likely to be considered an outlier or noise.
being regularly used to treat AIDS. In the past two de- The number of objects increases tremendously everyday
cades, the number of 3D protein structures dramatically in different sources over the Internet. We are likely to see
increased. The Protein Data Bank maintains 26,800 entries a 3D model search engine much like the search engine for
as of August 2004. New structures are deposited every- text documents. A centralized database for 3D models is
day. Performing similarity search and pattern discovery in possible, but will be only a small portion of the available
such an enormous data set urgently demands highly effi- search space. The search engine will be capable to access
cient computational tools. Wang and coauthors (2002) data in different formats to answer the user query. Shape-
developed a framework for discovering frequently occurring patterns in 3D structures and applied the approach to scientific data mining. The algorithm is a variant of the geometric hashing technique invented for model-based recognition in computer vision in 1988 (Lamdan & Wolfson, 1988). Similar approaches were also introduced in (Verbitsky, Nussinov, & Wolfson, 1999). Since hash functions map floating point numbers to integers when calculating the hash bin addresses, they do not preserve precise information about the data. As a consequence, these approaches are not suitable for similarity search that allows variable ranges of tolerable errors. Furthermore, false matches must be filtered out via a verification process. It is well known in the literature that geometric hashing is too sensitive to noise in the data. Due to the regularity of biological and chemical structures, dissimilarity is often very subtle. Inaccuracy introduced by the scanning devices adds noise to the data (Ankerst, Kastenmüller, Kriegel, & Seidl, 1999). It is extremely difficult to choose a fixed range of tolerable errors, especially when the data are collected by different domain experts using different equipment, as in the case of the Protein Data Bank (Berman et al., 2000; Westbrook et al., 2002). It is critical that the range of tolerable errors be set to a tunable parameter, so that the

based statistical approaches will remain a mainstream. On the other hand, in scientific data, each individual point is very important. For example, it may represent the center of an atom. Each point may also be associated with different properties. Thus blindly searching through different sources for data in different formats is not practical. Instead, we are likely to see large centralized data reservoirs like the Protein Data Bank. For most of the applications, pattern discovery, classification, and clustering are the most desirable techniques.

CONCLUSION

The human interest in three-dimensional structures began even before Euclid. After centuries of investigation and learning, it became clear that although we have a better perception of 3D objects compared with our comprehension of text and multimedia data, our techniques in the representation, storage, search, and discovery of 3D structures lag far behind. In computer vision, 3D object recognition is a well-known difficult problem. With advances in computational power and storage devices, maintaining an enormous amount of 3D
Data Management in Three-Dimensional Structures
structures is not only feasible but also inexpensive. However, few effective approaches have been developed to facilitate similarity search and pattern discovery in a very large database of 3D structures. Applications of 3D structures to bioinformatics pose even more intricate challenges, because these structures often need to be compared point-to-point. Fortunately, research in these topics has attracted more and more attention from computer scientists in different areas, such as computer vision, database systems, artificial intelligence, etc. It can be foreseen that substantial progress will be achieved and novel techniques will emerge in the near future.

REFERENCES

Akutsu, T., Arakawa, K., & Murase, H. (2002). Shape from contour using adaptive image selection. Systems and Computers in Japan, 33(11), 50-60.

Ankerst, M., Kastenmüller, G., Kriegel, H.-P., & Seidl, T. (1999). Nearest neighbor classification in 3D protein databases. Proc. of the 7th International Conference on Intelligent Systems for Molecular Biology (pp. 34-43), Heidelberg, Germany.

Belongie, S., Malik, J., & Puzicha, J. (2001). Matching shapes. Proc. of the Eighth International Conference on Computer Vision (pp. 454-463), Los Alamitos, California.

Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509-522.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., et al. (2000). The protein data bank. Nucleic Acids Research, 28(1), 235-242.

Bespalov, D., Shokoufandeh, A., Regli, W.C., & Sun, W. (2003). Scale-space representation of 3D models and topological matching. ACM Symposium on Solid Modeling and Applications (pp. 208-215).

Castelli, V., & Bergman, L. (2001). Image databases: Search and retrieval of digital imagery. John Wiley & Sons.

Chui, H., & Rangarajan, A. (2000). A new algorithm for non-rigid point matching. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 44-51), Hilton Head Island, South Carolina.

Elad, M., Tal, A., & Ar, S. (2001). Content based retrieval of VRML objects: An iterative and interactive approach. Proc. of the Eurographics Workshop on Multimedia (pp. 97-108), Manchester, UK.

Funkhouser, T.A., Min, P., Kazhdan, M.M., Chen, J., Halderman, A., Dobkin, D.P., et al. (2003). A search engine for 3D models. ACM Transactions on Graphics, 22(1), 83-105.

Gavrilov, M., Indyk, P., Motwani, R., & Venkatasubramanian, S. (1999). Geometric pattern matching: A performance study. Proc. of the Fifteenth Annual Symposium on Computational Geometry (pp. 79-85), Miami Beach, Florida.

Hilaga, M., Shinagawa, Y., Kohmura, T., & Kunii, T.L. (2001). Topology matching for fully automatic similarity estimation of 3D shapes. SIGGRAPH, 203-212.

Johnson, A.E., & Hebert, M. (1999). Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5), 433-449.

Keim, D.A. (1999). Efficient geometry-based similarity search of 3D spatial databases. Proc. of ACM SIGMOD International Conference on Management of Data (pp. 419-430), Philadelphia, Pennsylvania.

Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., & Protopapas, Z. (1998). Fast and effective retrieval of medical tumor shapes. IEEE Transactions on Knowledge and Data Engineering, 10(6), 889-904.

Kuramochi, M., & Karypis, G. (2002). Discovering geometric frequent subgraphs. Proc. of the 2002 IEEE International Conference on Data Mining (pp. 258-265), Maebashi, Japan.

Lamdan, Y., & Wolfson, H. (1988). Geometric hashing: A general and efficient model-based recognition scheme. Proc. of International Conference on Computer Vision (pp. 237-249).

Lehmann, T.M., Plodowski, B., Spitzer, K., Wein, B.B., Ney, H., & Seidl, T. (2004). Extended query refinement for content-based access to large medical image databases. Proc. of SPIE Medical Imaging (pp. 90-98).

Lou, K., Prabhakar, S., & Ramani, K. (2004). Content-based three-dimensional engineering shape search. Proc. of the 2004 IEEE International Conference on Data Engineering (pp. 754-765), Boston.

Osada, R., Funkhouser, T., Chazelle, B., & Dobkin, D. (2001). Matching 3D models with shape distributions. Proc. of the International Conference on Shape Modeling and Applications, Genova, Italy.

Paquet, E., & Rioux, M. (1999). Crawling, indexing and retrieval of three-dimensional data on the web in the framework of MPEG-7. VISUAL, 179-18.
Saupe, D., & Vranic, D.V. (2001). 3D model retrieval with spherical harmonics and moments. DAGM-Symposium 2001 (pp. 392-397).

Suzuki, M.T., & Sugimoto, Y.Y. (2003). A search method to find partially similar triangular faces from 3D polygonal models. Modeling and Simulation, 323-328.

Veltkamp, R.C. (2001). Shape matching: Similarity measures and algorithms. IEEE Shape Modeling International, 188-197.

Veltkamp, R.C., & Hagedoorn, M. (1999). State-of-the-art in shape matching. Technical Report UU-CS-1999-27, Utrecht.

Verbitsky, G., Nussinov, R., & Wolfson, H.J. (1999). Flexible structural comparison allowing hinge bending and swiveling motions. Proteins, 34, 232-254.

Wang, X. (2001). α-Surface and its application to mining protein data. Proc. of the IEEE International Conference on Data Mining (pp. 659-662), San Jose, California.

Wang, X. (2002). B+ tree: Indexing 3D point sets for pattern discovery. Proc. of the IEEE International Conference on Data Mining (pp. 701-704), Maebashi, Japan.

Wang, X., & Wang, J.T.L. (2000). Fast similarity search in three-dimensional structure databases. Journal of Chemical Information and Computer Sciences, 40(2), 442-451.

Wang, X., Wang, J.T.L., Shasha, D., Shapiro, B.A., Rigoutsos, I., & Zhang, K. (2002). Finding patterns in three dimensional graphs: Algorithms and applications to scientific data mining. IEEE Transactions on Knowledge and Data Engineering, 14(4), 731-749.

Westbrook, J., Feng, Z., Jain, S., Bhat, T.N., Thanki, N., Ravichandran, V., et al. (2002). The protein data bank: Unifying the archive. Nucleic Acids Research, 30(1), 245-248.

KEY TERMS

B+ Tree: An index structure that decomposes a 3D structure into point-triplets and indexes the triplets in a three-dimensional B+ tree.

α-Surface: The surface of a 3D structure that is constructed by rolling a solid ball with radius α along the contour and extracting every point the solid ball touches.

Aggregate Similarity Search: The search operation in 3D structures that matches the structures point-to-point in their entirety.

Feature Vector: A vector in which every dimension represents a property of a 3D structure. A good feature vector captures the similarity and dissimilarity of 3D structures.

Histogram: The original term refers to a bar graph that represents a distribution. In information retrieval, it also refers to a weighted vector that describes the properties of an object or the comparison of properties between two objects. Like a feature vector, a good histogram captures the similarity and dissimilarity of the objects of interest.

Shape-Based Similarity Search: The search operation in 3D structures that matches only the surfaces of the structures without referring to the points inside the surfaces.

Superimposition: The process of matching two 3D structures by alignment through rigid translations and rotations.

The Curse of Dimensionality: The original term refers to the exponential growth of hyper-volume as a function of dimensionality. In information retrieval, it refers to the phenomenon that the performance of an index structure for nearest neighbor search and ε-range search deteriorates rapidly due to the growth of hyper-volume.
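As a rough illustration of the feature vector and histogram terms above, the following sketch (pure Python; the point sets, bin count, and distance cutoff are hypothetical choices, not taken from the article) builds a normalized histogram of pairwise distances for a 3D point set and compares histograms with an L1 dissimilarity, in the spirit of shape-based similarity search:

```python
import math
import random

def distance_histogram(points, n_bins=8, max_dist=4.0):
    """Histogram of pairwise Euclidean distances in a 3D point set,
    normalized so the bins sum to 1 (a simple shape descriptor)."""
    counts = [0] * n_bins
    total = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            b = min(int(d / max_dist * n_bins), n_bins - 1)  # clamp to last bin
            counts[b] += 1
            total += 1
    return [c / total for c in counts]

def l1_distance(h1, h2):
    """Dissimilarity of two histograms: sum of per-bin differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

random.seed(0)
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
# A slightly perturbed copy of the cube: small scanner-like noise.
noisy = [(x + random.uniform(-0.05, 0.05),
          y + random.uniform(-0.05, 0.05),
          z + random.uniform(-0.05, 0.05)) for x, y, z in cube]
line = [(i, 0, 0) for i in range(8)]  # a very different shape

h_cube, h_noisy, h_line = map(distance_histogram, (cube, noisy, line))
# The noisy cube stays far closer to the cube than the line does.
assert l1_distance(h_cube, h_noisy) < l1_distance(h_cube, h_line)
```

Note how the descriptor tolerates small noise, unlike the fixed integer hash bins of geometric hashing discussed in the body of the article.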
Data Mining and Decision Support for Business and Science

Amar Gupta
University of Arizona, USA

Shiraj Khan
University of South Florida, USA
INTRODUCTION

Analytical Information Technologies

Information by itself is no longer perceived as an asset. Billions of business transactions are recorded in enterprise-scale data warehouses every day. Acquisition, storage, and management of business information are commonplace and often automated. Recent advances in remote or other sensor technologies have led to the development of scientific data repositories. Database technologies, ranging from relational systems to extensions like spatial, temporal, time series, text, or media, as well as specialized tools like geographical information systems (GIS) or online analytical processing (OLAP), have transformed the design of enterprise-scale business or large scientific applications. The question increasingly faced by the scientific or business decision-maker is not how one can get more information or design better information systems but what to make of the information and systems already in place. The challenge is to be able to utilize the available information, to gain a better understanding of the past, and to predict or influence the future through better decision making. Researchers in data mining technologies (DMT) and decision support systems (DSS) are responding to this challenge. Broadly defined, data mining (DM) relies on scalable statistics, artificial intelligence, machine learning, or knowledge discovery in databases (KDD). DSS utilize available information and DMT to provide a decision-making tool, usually relying on human-computer interaction. Together, DMT and DSS represent the spectrum of analytical information technologies (AIT) and provide a unifying platform for an optimal combination of data-dictated and human-driven analytics.
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
2002; Wang & Jain, 2003; Yurkiewicz, 2003) usually need to read data from a variety of sources like online transactional processing (OLTP) systems, historical data warehouses and data marts, syndicated data vendors, legacy systems, or public domain sources like the Internet, as well as in the form of real-time or incremental data entry from external or internal collaborators, expert consultants, planners, decision makers, and/or executives. Data from disparate sources usually are mapped to a predefined common data model and incorporated through extraction, transformation, and loading (ETL) tools. End users are provided GUI-based access to define application contexts and settings, structure business workflows and planning cycles, and format data models for visualization, judgmental updates, or analytical and predictive modeling. The parameters of the embedded data mining models might be preset, calculated dynamically based on data or user inputs, or specified by a power user. The results of the data mining models can be utilized automatically for optimization and recommendation systems and/or can be used to serve as baselines for planners and decision makers. Tools like BI, reports, and OLAP (Hammer, 2003) are utilized to help planners and decision makers visualize key metrics and predictive modeling results, as well as to utilize alert mechanisms and selection tools to manage by exception or by objectives. Judgmental updates at various levels of aggregation and their reconciliation, collaboration among internal experts and external trading partners, as well as managerial review processes and adherence to corporate directives, are aided by allocation and consolidation engines, tools for simulation, ad hoc and predefined reports, user-defined business workflows, audit trails with comments and reason codes, and flexible information transfer and data-handling capabilities.

Emerging technologies include the use of automated DMT for aiding traditional DSS tasks; for example, the use of data mining to zero down on the cause of aggregate exceptions in multidimensional OLAP cubes. The end results of the planning process are usually published in a pre-defined placeholder (e.g., a relational database table), which, in turn, can be accessed by execution systems or other planning applications. The use of elaborate mechanisms for user-driven analysis and judgmental or collaborative decisions, as opposed to reliance on automated DMT, remains a guiding principle for the current genre of business-planning applications. The value of collaborative decision making and global visibility of information is near axiomatic for business applications. However, researchers need to design better DMT applications that can utilize available information from disparate sources through advanced analytics and account for specific domain knowledge, constraints, or bottlenecks. Valuable and/or scarce human resources can be conserved by automating routine tasks and by reserving expert resources for high value-added jobs (e.g., after a Pareto classification) or for exceptional situations (e.g., large prediction variance or significant risks). In addition, certain research studies have indicated that judgmental overrides may not improve upon the results of automated predictive models on the (longer-term) average.

Scientists and engineers traditionally have utilized advanced quantitative approaches for making sense of observations and experimental results, formulating theories and hypotheses, and designing experiments. For users of statistical and numerical approaches in these domains, DMT often seems like the proverbial old wine in new bottles. However, innovative use of DMT includes the development of algorithms, systems, and practices that can not only apply novel methodologies, but also can scale to large scientific data repositories (Conover et al., 2003; Graves, 2003; Han et al., 2002; He et al., 2003; Ramachandran et al., 2003). While scientific and business data mining have a lot in common, the incorporation of domain knowledge is probably more critical in scientific applications. When appropriately combined with domain-specific knowledge about the physics or the data sources/uncertainties, DMT approaches have the potential to revolutionize the processes of scientific discovery, verification, and prediction (Han et al., 2002; Karypis, 2002). This potential has been demonstrated by recent applications in diverse areas like remote sensing (Hinke et al., 2000), material sciences (Curtarolo et al., 2003), bioinformatics (Graves, 2003), and the earth sciences (Ganguly, 2002b; Kamath et al., 2002; Potter et al., 2003; Thompson et al., 2002; see http://datamining.itsc.uah.edu/adam/). Besides physical and data-dictated methods, human-computer interaction retains a significant role in real-world scientific decision making. This necessitates the use of DSS, where the results of DMT can be combined with expert judgment and techniques from simulation, operations research (OR), and other DSS tools. The Reviews of Geophysics (American Geophysical Union, 1995) provides a slightly dated discussion on the use of data assimilation, estimation, and OR, as well as DSS (http://www.agu.org/journals/rg/rg9504S/contents.html#hydrology). Examples of decision support systems and tools in scientific and engineering applications also can be found in dedicated journals like Decision Support Systems or Journal of Decision Systems (see Vol. 8, Number 2, 1998, and the latest issues), as well as in journals or Web sites dealing with scientific and engineering topics (McCuistion & Birk, 2002) (see the NASA air traffic control Web site at http://www.asc.nasa.gov/aatt/dst.html; NASA research Web sites for a global carbon DSS at http://geo.arc.nasa.gov/website/cquestwebsite/index.html; and institutes like
MIT Lincoln Laboratories at http://www.ll.mit.edu/AviationWeather/index2.html).

Business applications have focused on DSS with embedded and scalable implementations of relatively straightforward DMT. Scientific applications have focused traditionally on advanced DMT in prototype applications with sample data. Researchers and practitioners of the future need to utilize advanced DMT for business applications, and scalable DMT and DSS for scientists and engineers. This provides a perfect opportunity for innovative and multi-disciplinary collaborations.

CONCLUSION

The power of information technologies has been utilized to acquire, manage, store, retrieve, and represent data in information repositories, and to share, report, process, collaborate on, and move data in scientific and business applications. Database management and data warehousing technologies have matured significantly over the years. Tools for building custom and packaged applications, including, but not limited to, workflow technologies, Web servers, and GUI-based data entry and viewing forms, are steadily maturing. There is a clear and present need to exploit the available data and technologies to develop the next generation of scientific and business applications, which can combine data-dictated methods with domain-specific knowledge. Analytical information technologies, which include DMT and DSS, are particularly suited for these tasks. These technologies can facilitate both automated (data-dictated) and human expert-driven knowledge discovery and predictive analytics, and can also be made to utilize the results of models and simulations that are based on process physics or business insights. If DMT and DSS were to be defined broadly, a broad statement can perhaps be made that, while business applications have scalable but straightforward DMT embedded within DSS, scientific applications have utilized advanced DMT but focused less on scalability and DSS. Multidisciplinary research and development efforts are needed in the future for maximal utilization of analytical information technologies in the context of these applications.

ACKNOWLEDGMENTS

Most of the work was completed while the first author was on a visiting faculty appointment at the University of South Florida and the second author was a member of the faculty at the MIT Sloan School of Management.

REFERENCES

Agosta, L., Orlov, L.M., & Hudson, R. (2003). The future of data mining: Predictive analytics. Forrester Brief.

Apte, C., Liu, B., Pednault, E.P.D., & Smyth, P. (2002, August). Business applications of data mining. Communications of the ACM, 45(8), 49-53.

Bradley, P. et al. (2002). Scaling mining algorithms to large databases. Communications of the ACM, 45(8), 38-43.

Carlsson, C., & Turban, E. (2002). DSS: Directions for the next decade. Decision Support Systems, 33(2), 105-110.

Conover, H. et al. (2003). Data mining on the TeraGrid. Proceedings of the Supercomputing Conference, Phoenix, Arizona.

Curtarolo, S. et al. (2003). Predicting crystal structures with data mining of quantum calculations. Physical Review Letters, 91(13).

Fayyad, U., & Uthurusamy, R. (2002, August). Evolving data mining into solutions for insights. Communications of the ACM, 45(8), 28-31.

Ganguly, A.R. (2002a). Software review: Data mining components. ORMS Today, 29(5), 56-59.

Ganguly, A.R. (2002b). A hybrid approach to improving rainfall forecasts. Computers in Science and Engineering, 4(4), 14-21.

Geoffrion, A.M., & Krishnan, R. (Eds.). (2003). E-business and management science: Mutual impacts (Parts 1 and 2). Management Science, 49(10-11).

Graves, S.J. (2003). Data mining on a bioinformatics grid. Proceedings of the SURA BioGrid Workshop, Raleigh, North Carolina.

Grossman, R. et al. (2001). Data mining for scientific and engineering applications. Kluwer.

Grossman, R.L., Hornick, M.F., & Meyer, G. (2002). Data mining standards initiative. Communications of the ACM, 45(8), 59-61.

Grossman, R.L., & Mazzucco, M. (2002). DataSpace: A data Web for the exploratory analysis and mining of data. Computers in Science and Engineering, 4(4), 44-51.

Hammer, J. (Ed.). (2003). Advances in online analytical processing. Data & Knowledge Engineering, 45(2), 127-256.
Han, J. et al. (2002). Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54-58.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

He, Y. et al. (2003). Framework for mining and analysis of space science data. Proceedings of the SIAM International Conference on Data Mining, San Francisco, California.

Hinke, T., Rushing, J., Ranganath, H.S., & Graves, S.J. (2000). Techniques and experience in mining remotely sensed satellite data. Artificial Intelligence Review, 14(6), 503-531.

Kamath, C. et al. (2002). Classification of bent-double galaxies. Computers in Science and Engineering, 4(4), 52-60.

Karypis, G. (2002). Guest editor's introduction: Data mining. Computers in Science and Engineering, 4(4), 12-13.

Kohavi, R., Rothleder, N.J., & Simoudis, E. (2002). Emerging trends in business analytics. Communications of the ACM, 45(8), 45-48.

Linden, A., & Fenn, J. (2003). Hype cycle for advanced analytics, 2003. Gartner Strategic Analysis Report.

McCuistion, J.D., & Birk, R. (2002). From observations to decision support: The new paradigm for satellite data. NASA Technical Report. Retrieved from http://www.iaanet.org/symp/berlin/IAA-B4-0102.pdf

Potter, C. et al. (2003). Global teleconnections of ocean climate to terrestrial carbon flux. Journal of Geophysical Research, 108(D17), 4556.

Ramachandran, R. et al. (2003). Flexible framework for mining meteorological data. Proceedings of the American Meteorological Society's (AMS) 19th International Conference on Interactive Information Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, Long Beach, California.

Shim, J. et al. (2002). Past, present, and future of decision support technology. Decision Support Systems, 33(2), 111-126.

Smyth, P., Pregibon, D., & Faloutsos, C. (2002). Data-driven evolution of data mining algorithms. Communications of the ACM, 45(8), 33-37.

Thompson, D.S. et al. (2002). Physics-based feature mining for large data exploration. Computers in Science and Engineering, 4(4), 22-30.

Wang, G.C.S., & Jain, C.L. (2003). Regression analysis: Modeling and forecasting. Institute of Business Forecasting.

Yurkiewicz, J. (2003). Forecasting software survey: Predicting which product is right for you. ORMS Today.

KEY TERMS

Analytical Information Technologies (AIT): Information technologies that facilitate tasks like predictive modeling, data assimilation, planning, or decision making through automated data-driven methods, numerical solutions of physical or dynamical systems, human-computer interaction, or a combination. AIT includes DMT, DSS, BI, OLAP, GIS, and other supporting tools and technologies.

Business and Scientific Applications: End-user modules that are capable of utilizing AIT along with domain-specific knowledge (e.g., business insights or constraints, process physics, engineering know-how). Applications can be custom-built or pre-packaged and are often distinguished from other information technologies by their cognizance of the specific domains for which they are designed. This can entail the incorporation of domain-specific insights or models, as well as pre-defined information and process flows.

Business Intelligence (BI): Broad set of tools and technologies that facilitate management of business knowledge, performance, and strategy through automated analytics or human-computer interaction.

Data Assimilation: Statistical and other automated methods for parameter estimation, followed by prediction and tracking.

Data Mining Technologies (DMT): Broadly defined, these include all types of data-dictated analytical tools and technologies that can detect generic and interesting patterns, scale (or can be made to scale) to large data volumes, and help in automated knowledge discovery or prediction tasks. These include determining associations and correlations, clustering, classifying, and regressing, as well as developing predictive or forecasting models. The specific tools used can range from traditional or emerging statistics and signal or image processing, to machine learning, artificial intelligence, and knowledge discovery from large databases, as well as econometrics, management science, and tools for modeling and predicting the evolutions of nonlinear dynamical and stochastic systems.
Decision Support Systems (DSS): Broadly defined, these include technologies that facilitate decision making. These can embed DMT and utilize these through automated batch processes and/or user-driven simulations or what-if scenario planning. The tools for decision support include analytical or automated approaches like data assimilation and operations research, as well as tools that help the human experts or decision makers manage by objectives or by exception, like OLAP or GIS.

Geographical Information Systems (GIS): Tools that rely on data management technologies to manage, process, and present geospatial data, which, in turn, can vary with time.

Online Analytical Processing (OLAP): Broad set of technologies that facilitate drill-down or aggregate analyses, as well as presentation, allocation, and consolidation of information along multiple dimensions (e.g., product, location, and time). These technologies are well suited for management by exceptions or objectives, as well as automated or judgmental decision making.

Operations Research (OR): Mathematical and constraint programming and other techniques for mathematically or computationally determining optimal solutions for objective functions in the presence of constraints.

Predictive Modeling: The process through which mathematical or numerical technologies are utilized to understand or reconstruct past behavior and predict expected behavior in the future. Commonly utilized tools include statistics, data mining, and operations research, as well as numerical or analytical methodologies that rely on domain knowledge.
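As a minimal sketch of the aggregation-along-dimensions idea in the OLAP definition above (the fact table, dimension names, and sales values are hypothetical, not from the article):

```python
from collections import defaultdict

# Hypothetical fact table: (product, location, quarter, sales).
facts = [
    ("widget", "north", "Q1", 100),
    ("widget", "north", "Q2", 120),
    ("widget", "south", "Q1", 80),
    ("gadget", "north", "Q1", 50),
    ("gadget", "south", "Q2", 70),
]

def roll_up(facts, keep):
    """Aggregate sales over the dimensions not listed in `keep`
    (an OLAP roll-up; drill-down is the inverse: keep more dimensions)."""
    dims = {"product": 0, "location": 1, "quarter": 2}
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[dims[d]] for d in keep)
        totals[key] += row[3]
    return dict(totals)

by_product = roll_up(facts, ["product"])
by_prod_loc = roll_up(facts, ["product", "location"])  # drill down by location
assert by_product[("widget",)] == 300      # 100 + 120 + 80
assert by_prod_loc[("widget", "north")] == 220
```

Managing by exception would then amount to flagging aggregates that cross a threshold before drilling down into the contributing cells.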
Data Mining and Warehousing in Pharma Industry

Shital C. Shah
The University of Iowa, USA
A future event can be predicted in two major ways: population-based and individual-based. The population-based prediction says, for example, a drug A has been effective in treating 80% of patients in the population P, as symbolically illustrated in Figure 2. Of course, any patient would like to belong to the 80% rather than the 20% category before the drug A is administered. Statistics and other tools have been widely used in support of the population paradigm, among others in medicine and the pharmaceutical industry.

Figure 2. Population-based paradigm (drug A administered to population P, confidence = 95%: 80% of P cured, 20% of P not cured)

The individual-based approach supported by numerous DM algorithms emphasizes an individual patient rather than the population (Kusiak et al., 2005). One of many decision-making scenarios is illustrated in Figure 3, where the original population P of patients has been partitioned into two segments 1 and 2. The decisions for each patient in Segment 1 are made with high confidence, say 99%, while the decisions for Segment 2 are predicted with lower confidence.

Figure 3. Individual-based paradigm using data mining tools (drug A administered to two population segments; Segment 1, confidence = 99%: 70% of P cured, 15% of P not cured; Segment 2, confidence = 85%: 10% of P cured, 5% of P not cured)

It is quite possible that Segment 2 patients would seek an alternative drug or a treatment. There are different ways of using DM algorithms. They cover the range between the population- and individual-based paradigms. The existing DM algorithms can be grouped into the following basic ten classes (Kusiak, 2001):

A. Classical statistical methods (e.g., linear, quadratic, and logistic discriminant analyses)
B. Modern statistical techniques (e.g., projection pursuit classification, density estimation, k-nearest neighbor, Bayes algorithm)
C. Neural network (Mitchell, 1997)
D. Support vector machines
E. Decision tree algorithms [C4.5 (Quinlan, 1992)]
F. Decision rule algorithms [Rough set algorithms (Pawlak, 1991)]
G. Association rule algorithms
H. Learning classifier systems
I. Inductive learning algorithms
J. Text learning algorithms

Each class contains numerous algorithms; for example, there are more than 100 implementations of the decision tree algorithm (class E).

Data Warehouse Design

A warehouse has to be designed to meet users' requirements. DM, online analytical processing (OLAP), and reporting are the top items on the list of requirements. Systems design methodologies and tools can be used to facilitate the requirements capture. Examples of methodologies for analysis of data warehouse (DW) requirements include AND/OR graphs and the house of quality (Kusiak, 2000).

The architecture of a typical DW embedded in a pharmaceutical environment is shown in Figure 4. The pharmaceutical data is extracted from numerous sources and preprocessed to minimize inconsistencies. Also, data transformation will capture intricate solution spaces to improve knowledge discovery. The cleaned and transformed data is loaded and refreshed directly into a DW or data marts. A data mart might be a precursor to the full-fledged DW or function as a specialized DW. A special purpose (exploratory) data mart or a DW might be created for exploratory data analysis and research. The warehouse and data marts serve various applications that justify the development and maintenance cost in this data storage technology. The range of services that could be developed off a DW could be expanded beyond OLAP and DM into almost all pharmaceutical business areas, including interactions with federal agencies and other businesses.

Data Flow Analysis

A task that may parallel the capture of requirements for a DW involves analysis of data flow. A warehouse is to integrate various streams of data that have to be identified. The information analysts and users need to feel comfortable with the data flow methodology selected for the data handling and management. An example of a methodology that can be used to model data flow is the
Figure 4. Architecture of a data warehouse in a pharmaceutical environment (raw data from data sources flows through data transformation and loading into a data warehouse and an exploratory data warehouse, which serve data mining, OLAP, and information and knowledge services)
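A minimal sketch of the extraction, transformation, and loading step in this architecture (the source layouts, field names, and unit conversion are hypothetical, not from the article):

```python
# Two hypothetical raw sources with inconsistent layouts and units.
source_a = [{"patient": "P1", "weight_kg": 70.0},
            {"patient": "P2", "weight_kg": None}]
source_b = [{"patient": "P3", "weight_lb": 154.0}]

def transform(record):
    """Map a raw record onto the common data model (weight in kg),
    dropping inconsistent records (here: missing weight)."""
    if "weight_lb" in record:
        record = {"patient": record["patient"],
                  "weight_kg": round(record["weight_lb"] * 0.4536, 1)}
    if record.get("weight_kg") is None:
        return None  # minimize inconsistencies before loading
    return record

warehouse = []  # the load target; a real DW would be a database table
for record in source_a + source_b:
    cleaned = transform(record)
    if cleaned is not None:
        warehouse.append(cleaned)

assert [r["patient"] for r in warehouse] == ["P1", "P3"]
```

In practice this cleaning and loading would be scheduled as a refresh job feeding both the main warehouse and any exploratory data marts.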
Table 2. Noisy genetic data

ID  Treatment  SNP1  SNP2  SNP3  SNP4  SNP5  Decision
1   Drug       A/T   C/T   A/C   G/T   C/T   Improved
2   Drug       A/T   T/T   A/A   G/G   T/T   Improved
3   Drug       A/T   C/T   A/C   G/T   T/T   Not_Improved
4   Drug       A/A   C/C   A/A   T/T   C/T   Not_Improved
5   Drug       A/T   C/T   A/C   G/T   C/T   Not_Improved
6   Placebo    A/T   C/T   C/C   G/T   T/T   Improved
7   Placebo    A/A   C/T   A/C   G/G   C/T   Improved
8   Placebo    A/T   C/C   A/C   G/T   C/T   Improved
9   Placebo    A/T   C/T   A/C   G/T   C/T   Not_Improved
10  Placebo    T/T   T/T   C/C   T/T   C/C   Not_Improved

Data Warehouse Characteristics

A DW is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of users (Elmasri & Navathe, 2000). The prominent feature of the DW is to support a large volume of data and a relatively small number of users with relatively long interactions, high performance levels, a multidimensional view, usability, manageability, and flexible reporting. It must efficiently extract and process data.

Data cuboids in the DW are multidimensional constructs reflecting the relationships in the underlying data. A three-dimensional cuboid is called a data cube and is illustrated in Figure 5. The data cube represents three dimensions, namely cancer type, medication, and time factor. The data cubes can be viewed through multiple orientations using a technique called pivoting (Elmasri & Navathe, 2000). Thus, the data cube can be pivoted to form a separate medication and time factor table for each cancer type. This data structure allows for roll-up and drill-down capabilities. Roll-up moves up the hierarchy, for example, grouping the cancer type dimension to present a single medication and time factor view. Similarly, drill-down gives a finer grained view; for example, the medications can be broken down into cardiac, respiratory, and skin medications.

Figure 5. Data cube [diagram: a cube with the dimensions cancer type, medication, and time]

The DW is normally designed with the star schema or the snowflake schema. The star schema is designed with a fact table (tuples arranged one per recorded fact) and a single table for each dimension (Elmasri & Navathe, 2000). The snowflake schema is a variation of the star schema in which the dimensional tables are organized into a hierarchy by normalizing the tables. Another important issue while designing the DW, especially for the pharmaceutical industry, is security, that is, avoiding unauthorized access.

Data Modeling

The data stored in a DW is relatively error free and compact. This data contains hidden information, which needs to be extracted. It forms a staging ground for knowledge discovery and decision-making (Elmasri & Navathe, 2000). The desired functionality of DW access tools is as follows:

Tabular reporting and information mapping
Complex queries and sophisticated criteria search
Ranking
Multivariable and time series analysis
Data visualization, graphing, charting, and pivoting
Complex textual search
Advanced statistical analysis
Trend discovery and analysis
Pattern and associations discovery

There are three main streams for data analysis, namely OLAP, statistics, and DM. OLAP supports various viewpoints at different levels of abstraction, which helps in data visualization and analysis. Population-based statistical analysis of the data can be performed through various tools ranging from simple regression to complex multivariate analysis. DM offers tools (decision trees, decision rules, support vector machines, neural networks, association rules, and so on) for discovery of new knowledge by converting hidden information into business models. It provides tools for identifying valid, novel, potentially useful, and ultimately understandable patterns from data and constructs high-confidence predictions for individuals (Fayyad et al., 1997). This may represent valuable knowledge that might lead to medical discoveries, for example, certain ranges of parameter values leading to longer survival time.

Thus, OLAP and DM complement each other in the analysis and provide different but much needed functionality for data understanding, visualization, and making individualized decisions. Data modeling provides analytical reasoning for decision-making to solve specific patient and business issues.
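The cube operations described in this section can be sketched in a few lines of Python: roll-up aggregates away a dimension, drill-down retains the full dimension triple, and pivoting reorients the same facts. The fact records below (cancer types, medication groups, quarters, counts) are illustrative values, not data from the text.

```python
from collections import defaultdict

# Illustrative fact records: (cancer_type, medication, quarter, patient_count).
facts = [
    ("lung", "cardiac", "Q1", 12),
    ("lung", "respiratory", "Q1", 7),
    ("skin", "cardiac", "Q1", 4),
    ("skin", "respiratory", "Q2", 9),
    ("lung", "cardiac", "Q2", 6),
]

def roll_up(facts, keep):
    """Roll-up: aggregate counts over the dimensions not listed in `keep`.

    `keep` holds indices into the (cancer_type, medication, quarter)
    dimension triple; keep=(1, 2) groups away cancer type, yielding a
    single medication x quarter view, as described in the text.
    """
    out = defaultdict(int)
    for *dims, count in facts:
        out[tuple(dims[i] for i in keep)] += count
    return dict(out)

# Roll up the cancer-type dimension: one medication/quarter table.
print(roll_up(facts, keep=(1, 2)))

# Drill-down is the inverse direction: keep all three dimensions (base cuboid).
print(roll_up(facts, keep=(0, 1, 2)))

# Pivoting: the same facts, reoriented as a separate medication/quarter
# table per cancer type.
per_type = defaultdict(dict)
for ctype, med, quarter, count in facts:
    per_type[ctype][(med, quarter)] = count
print(dict(per_type))
```

A real OLAP engine precomputes many such cuboids; the point here is only that roll-up, drill-down, and pivot are different aggregations and orientations of one underlying fact table.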
Applications
DM, OLAP, and decision-making algorithms will lead to outcome predictions, identification of significant patterns and parameters, patient-specific decisions, and ultimately process/goal optimization. Prediction, identification, classification, and optimization are key to drug discovery, adverse effect analysis, outcome predictions, and individualized protocols (Figure 6).

The issues highlighted in this article are illustrated with various medical informatics research projects. DM has led to identifying predictive parameters, formulating individualized treatment protocols, categorizing model patients, and developing a decision-making model for predictions of survival for dialysis patients (Kusiak et al., 2005). A gene/SNP selection approach (Shah & Kusiak, 2004) was developed using weighted decision trees and a genetic algorithm. The high-quality significant gene/SNP subset can be targeted for drug development and validation. An intelligent DM system determined the wellness score for infants with hypoplastic left heart syndrome based on 73 physiologic, laboratory, and nurse-assessed parameters (Kusiak et al., 2003). For cancer-related analysis, a patient acceptance model can be developed to predict the treatment type, drug toxicity, and the length of disease-free status after treatment. The concepts discussed in this chapter can be applied across all medical topics, including phenotypic and genotypic data.

Other examples are Merck gene sequencing (Eckman et al., 1998), epidemiological and clinical toxicology (Helma et al., 2000), prediction of rodent carcinogenicity bioassays (Bahler et al., 2000), gene expression level analysis (Dudoit et al., 2000), predicting risk of coronary artery disease (Tham et al., 2003), and the VA health care system (Smith & Joseph, 2003).

FUTURE TRENDS

The future of data mining, warehousing, and modeling in the pharmaceutical industry offers numerous challenges. The first one is the scale of the data, measured by the number of features. New scalable schemes for parameter selection and knowledge discovery will be developed. There is a need to develop new tools for rapid evaluation of parameter relevancy. Dynamic clinical studies would provide an additional facet to interact and intervene, resulting in clean, error-free, and necessary data collection and analysis.

CONCLUSION

Data warehousing, data modeling, data mining, OLAP, and decision-making algorithms will ultimately lead to targeted drug discovery and individualized treatments with minimum adverse effects. The success of the pharmaceutical industry will largely depend on following the course outlined by the new data paradigm.

REFERENCES

Bahler, D., Stone, B., Wellington, C., & Bristol, D.W. (2000). Symbolic, neural, and Bayesian machine learning models for predicting carcinogenicity of chemical compounds. Journal of Chemical Information and Computer Sciences, 40(4), 906-914.

Berndt, D.J., Fisher, J.W., Hevner, R.A., & Studnicki, J. (2001). Healthcare data warehousing and quality assurance. IEEE Computer, 34(12), 56-65.

Dudoit, S., Fridlyand, J., & Speed, T.P. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data. Technical Report 576. Berkeley, CA: Department of Statistics, University of California.

Eckman, B.A., Aaronson, J.S., Borkowski, J.A., Bailey, W.J., Elliston, K.O., Williamson, A.R., & Blevins, R.A. (1998). The Merck gene index browser: An extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics, 14, 2-13.

Elmasri, R., & Navathe, S.B. (2000). Fundamentals of database systems. New York: Addison-Wesley.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1997). Advances in knowledge discovery and data mining. Cambridge, MA: MIT Press.

Helma, C., Gottmann, E., & Kramer, S. (2000). Knowledge discovery and data mining in toxicology. Statistical Methods in Medical Research, 9(4), 329-358.

King, S.Y., Kung, M.S., & Fung, H.L. (1984). Statistical prediction of drug stability based on nonlinear parameter estimation. Journal of Pharmaceutical Sciences, 73(5), 657-662.

Kusiak, A. (2000). Computational intelligence in design and manufacturing. New York: John Wiley.

Kusiak, A. (2001). Feature transformation methods in data mining. IEEE Transactions on Electronics Packaging Manufacturing, 24(3), 214-221.

Kusiak, A., Caldarone, C.A., Kelleher, M.D., Lamb, F.S., Persoon, T.J., Gan, Y., & Burns, A. (2003, April). Mining temporal data sets: Hypoplastic left heart syndrome case study. In Proceedings of the SPIE Conference on Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, 5098 (pp. 93-101). Bellingham, WA: SPIE.

Kusiak, A., Dixon, B., & Shah, S. (2005). Predicting survival time for kidney dialysis patients: A data mining approach. Computers in Biology and Medicine, 35(4), 311-327.

Kusiak, A., Kern, J.A., Kernstine, K.H., & Tseng, T.L. (2000). Autonomous decision-making: A data mining approach. IEEE Transactions on Information Technology in Biomedicine, 4(4), 274-284.

Mitchell, T.M. (1997). Machine learning. New York: McGraw-Hill.

Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Boston, MA: Kluwer.

Quinlan, R. (1992). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Shah, S.C., & Kusiak, A. (2004). Data mining and genetic algorithm based gene/SNP selection. Artificial Intelligence in Medicine, 31(3), 183-196.

Smith, M.W., & Joseph, G.J. (2003). Pharmacy data in the VA health care system. Medical Care Research and Review: MCRR, 60(3 Suppl), 92S-123S.

Tham, C.K., Heng, C.K., & Chin, W.C. (2003). Predicting risk of coronary artery disease from DNA microarray-based genotyping using neural networks and other statistical analysis tools. Journal of Bioinformatics and Computational Biology, 1(3), 521-539.

Tye, H. (2004). Application of statistical design of experiments methods in drug discovery. Drug Discovery Today, 9(11), 485-491.

KEY TERMS

Adverse Effects: Any untoward medical occurrences that may be life threatening and require in-patient hospitalization.

Clustering: Clustering algorithms discover similarities and differences among groups of items. They divide a dataset so that patients with similar content are in the same group, and groups are as different as possible from each other.

Customized Protocols: A specific set of treatment parameters and their values that are unique to an individual. Customized protocols are derived from discovered knowledge patterns.

Data Visualization: The method or end result of transforming numeric and textual information into a graphic format. Visualizations are used to explore large quantities of data holistically in order to understand trends or principles.

Decision Trees: A decision-tree algorithm creates rules based on decision trees or sets of if-then statements to maximize interpretability.

Drug Discovery: A research process that identifies molecules with desired biological effects so as to develop new therapeutic drugs.

Feature Reduction Methods: The goal of feature reduction methods is to identify the minimum set of non-redundant features (e.g., SNPs, genes) that are useful in classification.

Knowledge Discovery: Knowledge discovery in databases is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns/models in data.

Neural Networks: Neural networks are sets of simple units (neurons) that receive a number of real-valued inputs, which are processed through the network to produce a real-valued output.

OLAP: Online analytical processing is a category of software tools that provides analysis of data stored in a database.
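To make the Decision Trees entry concrete in the setting of Table 2, here is a minimal set of hand-written if-then rules over SNP-style records, standing in for one path of a learned tree. The records and the rules are invented for illustration only; they are not findings from the cited studies, and on genuinely noisy genetic data (where identical genotypes can have different outcomes) no rule set would classify perfectly.

```python
# Records shaped like Table 2: treatment, two genotype attributes, outcome.
# The data and the rules below are illustrative only.
records = [
    {"treatment": "Drug", "SNP2": "C/T", "SNP5": "C/T", "decision": "Improved"},
    {"treatment": "Drug", "SNP2": "T/T", "SNP5": "T/T", "decision": "Improved"},
    {"treatment": "Drug", "SNP2": "C/T", "SNP5": "T/T", "decision": "Not_Improved"},
    {"treatment": "Placebo", "SNP2": "C/T", "SNP5": "T/T", "decision": "Improved"},
]

def rule_classify(record):
    """A hypothetical decision-tree path expressed as if-then statements,
    the interpretable form the Decision Trees key term describes."""
    if record["treatment"] == "Drug" and record["SNP2"] == "T/T":
        return "Improved"
    if record["treatment"] == "Drug" and record["SNP5"] == "T/T":
        return "Not_Improved"
    return "Improved"

accuracy = sum(rule_classify(r) == r["decision"] for r in records) / len(records)
print(accuracy)
```

Interpretability is the design point: each prediction can be traced to an explicit condition on treatment and genotype, which is why the article highlights decision trees and rules for clinical audiences.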
Data Mining for Damage Detection in Engineering Structures

Aleksandar Lazarevic
University of Minnesota, USA

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
& Ambur, 2003; Lazarevic et al., 2004; Ni, Wang & Ko, 2002; Sandhu et al., 2001; Yun & Bahng, 2000; Zhao, Ivan & DeWolf, 1998), support vector machines (SVMs) (Mita & Hagiwara, 2003), and decision trees (Sandhu et al., 2001) have been applied successfully to structural damage detection problems, thus showing that they can be potentially useful for such a class of problems. This success can be attributed to the numerous disciplines integrated with data mining, such as pattern recognition, machine learning, and statistics. In addition, it is well known that data mining techniques can effectively handle noisy, partially incomplete, and faulty data, which is particularly useful, since in damage detection applications, measured data are expected to be incomplete, noisy, and corrupted.

The intent of this paper is to provide a survey of emerging data mining techniques for damage detection in structures. Although the field of damage detection is very broad and includes a vast literature that is not based on data mining techniques, this survey focuses predominantly on data mining techniques for damage detection based on changes in properties of the structure. However, the large body of literature on fault detection and diagnosis in application-specific systems, such as rotating machinery, is not within the scope of this paper.

CATEGORIZATION OF STRUCTURAL DAMAGE

The damage in structures can be classified as linear or nonlinear. Damage is considered linear if the undamaged linear-elastic structure remains linear-elastic after damage. However, if the initially linear-elastic structure behaves in a nonlinear manner after the damage initiation, then the damage is considered nonlinear. It is also possible that the damage is linear at the damage initiation phase but, after prolonged growth in time, becomes nonlinear. For example, loose connections between the structures at the joints or joints that rattle (Sekhar, 2003) are considered nonlinear damages. Examples of such nonlinear damage detection systems are described in Adams and Farrar (2002) and Kerschen and Golinval (2004).

Most of the modal data in the literature are proposed for the linear case. They are based on the following three levels of damage identification: (1) Recognition: a qualitative indication that damage might be present in the structure; (2) Localization: information about the probable location of the damage in the structure; and (3) Assessment: an estimate of the extent of severity of the damage in the structure. Such linear damage detection techniques can be found in Yun and Bahng (2000), Ni et al. (2002), and Lazarevic et al. (2004).

CLASSIFICATION OF DAMAGE DETECTION TECHNIQUES

We provide several different criteria for classification of damage detection techniques based on data mining.

In the first classification, damage detection techniques can be categorized into continuous (Keller & Ray, 2003) and periodic (Patsias & Staszewski, 2002) damage detection systems. Continuous techniques usually employ an integrated approach that consists of a data acquisition process, feature extraction from large amounts of data collected from real-time sensors, and a damage detection process. In periodic techniques, the feature extraction process is optional, since the amount of data that needs to be processed is not large and does not necessarily require data mining techniques for feature extraction.

In the second classification, we distinguish between application-based and application-independent techniques. Application-based techniques are generally applicable to a specific structural system, and they typically assume that the monitored structure responds in some predetermined manner that can be accurately modeled by (i) numerical techniques such as finite element (Sandhu et al., 2001) or boundary element analysis (Anderson, Lemoine & Ambur, 2003) and/or (ii) the behavior of the response of the structures based on physics-based models (Keller & Ray, 2003). Most of the damage detection techniques that exist in the literature belong to the application-based approach, where the minimization of the residue between the experimental and the analytical model is built into the system. Often, this type of data is not available and can render application-based methods impractical for certain applications, particularly for structures that are designed and commissioned without such models. On the other hand, application-independent techniques do not depend on a specific structure, and they are generally applicable to any structural system. However, the literature on these techniques is very sparse, and the research in this area is at a very nascent stage (Bernal & Gunes, 2000; Zang, Friswell & Imregun, 2004).

In the third classification, damage detection techniques are split into signature-based and non-signature-based methods. Signature-based techniques extensively use signatures of known damages in the given structure that are provided by human experts. These techniques commonly fall into the category of recognition of damage detection, which only provides the qualitative indication that damage might be present in the structure (Friswell, Penny & Wilson, 1994) and, to a certain extent, the localization of the damage (Friswell, Penny & Garvey, 1997). Non-signature methods are not based on signatures of known damages, and they not only recognize but also localize and assess the extent of damage. Most of the damage detection techniques in the literature fall into this
category (Lazarevic et al., 2004; Ni et al., 2002; Yun & Bahng, 2000).

In the fourth classification, damage detection techniques are classified into local (Sekhar, 2003; Wang, 2003) and global (Fritzen & Bohle, 2001) techniques. Typically, the damage is initiated in a small region of the structure and, hence, can be considered a local phenomenon. One could employ local or global damage detection features that are derived from the local or global response or properties of the structure. Although local features can detect the damage effectively, these features, such as higher natural frequencies and mode shapes of the structure, are very difficult to obtain from the experimental data (Ni et al., 2002). In addition, since the vicinity of damage is not known a priori, the global methods that can employ only global damage detection features, such as lower natural frequencies of the structure (Lazarevic et al., 2004), are preferred.

Finally, damage detection techniques can be classified as traditional and emerging data mining techniques. Traditional analytical techniques employ mathematical models to approximate the relationships between specific damage conditions and changes in the structural response or dynamic properties. Such relationships can be computed by solving a class of so-called inverse problems. The major drawbacks of the existing approaches are as follows: (i) the more sophisticated methods involve computationally cumbersome system solvers, which are typically solved by singular value decomposition techniques, non-negative least-squares techniques, bounded variable least-squares techniques, and so forth; and (ii) all computationally intensive procedures need to be repeated for any newly available measured test data for a given structure. A brief survey of these methods can be found in Doebling et al. (1996). On the other hand, data mining techniques are applied to model an explicit inverse relation between damage detection features and damage by minimization of the residue between the experimental and the analytical model at the training level. For example, the damage detection features could be natural frequencies, mode shapes, mode curvatures, and so forth. It should be noted that data mining techniques are also applied to detect features in large amounts of measurement data. In the next few sections, we provide a short description of several types of data mining algorithms used for damage detection.

Classification

Data mining techniques based on classification have been successfully applied to identify damage in structures. For example, decision trees have been applied to detect damage in an electrical transmission tower (Sandhu et al., 2001). It has been found that in this approach, decision trees can be easily understood, while many interesting rules about the structural damage were found. In addition, a method using support vector machines (SVMs) has been proposed to detect local damages in a building structure (Mita & Hagiwara, 2003). The method is verified to have the capability to identify not only the location of damage but also the magnitude of damage with satisfactory accuracy, employing modal frequency patterns as damage features.

Pattern Recognition

Pattern recognition techniques are also applied to damage detection by various researchers. For example, statistical pattern recognition has been applied to damage detection employing relatively few measurements of modal data collected from three scale model reinforced concrete bridges (Haritos & Owen, 2004), but the method was only able to indicate that damage had occurred. In addition, independent component analysis, a multivariate statistical method also known as proper orthogonal decomposition, has been applied to damage detection problems on time history data to capture essential patterns of the measured vibration data (Zang, Friswell & Imregun, 2004).

Neural Networks

Prediction-based techniques such as neural networks (Lazarevic et al., 2004; Ni et al., 2002; Zhao et al., 1998) have been successfully applied to detect the existence, location, and quantification of damage in the structure employing modal data. Neural networks have been extremely popular in recent years due to their capabilities as universal approximators.

In damage detection approaches based on neural networks, the damage location and severity are simultaneously identified using a one-stage scheme, also called the direct method (Zhao et al., 1998), where the neural network is trained with different damage levels at each possible damage location. However, these studies were restricted to very small models with a small number of target variables (order of 10), and the development of a predictive model that could correctly identify the location and severity of damage in practical large-scale complex structures using this direct approach was a considerable challenge. Increased geometric complexity of the structure caused an increase in the number of target variables, thus resulting in data sets with a large number of target variables. Since the number of prediction models that needs to be built for each continuous target variable increases, the number of training data records required for effective training of neural networks also increases, thus requiring more computational time for training neural networks, but also more time for data
generation, since each damage state (data record) requires an eigen solver to generate the natural frequencies and mode shapes of the structure. The earlier direct approach, employed by numerous researchers, required the prediction of the material property, namely the Young's modulus of elasticity, considering all the elements in the domain individually or simultaneously. However, this approach does not scale to situations in which thousands of elements are present in the complex geometry of the structure or when multiple elements in the structure have been damaged simultaneously.

To reduce the size of the system under consideration, several substructure-based approaches have been proposed (Sandhu et al., 2002; Yun & Bahng, 2000). These approaches partition the structure into logical substructures and then predict the existence of damage in each of them. However, pinpointing the location and extent of the damage is not resolved completely in these approaches. Recently, these issues have been addressed in two hierarchical approaches (Lazarevic et al., 2004; Ni et al., 2002). In the former, neural networks are hierarchically trained using one-level damage samples to first locate the position of the damage, and then the network is retrained by an incremental weight update method using additional samples corresponding to different damage degrees, but only at the location identified in the first stage. The input attributes of the neural networks are designed to depend only on damage location, and they consisted of several natural frequencies and a few incomplete modal vectors. Since measuring mode shapes is difficult, global methods based only on natural frequencies are highly preferred. However, employing natural frequencies as features traditionally has many drawbacks (e.g., two symmetric damage locations cannot be distinguished using only natural frequencies). To overcome these drawbacks, Lazarevic et al. (2004) proposed hierarchical and localized clustering approaches based only on natural frequencies as features, where symmetrical damage locations as well as spatial characteristics of structural systems are integrated in building the model.

Other Techniques

Other data mining based approaches have also been applied to different problems in structural health monitoring. For example, outlier-based analysis techniques (Worden, Manson & Fieller, 2000) have been used to detect the existence of damage; wavelet-based approaches (Wang, 2003) have been used to detect damage features; and a combination of independent component analysis and artificial neural networks (Zang, Friswell & Imregun, 2004) has been applied successfully to detect damages in structures.

FUTURE TRENDS

Damage detection is increasingly becoming an indispensable and integral component of any comprehensive structural health monitoring program for mechanical and large-scale civilian, aerospace, and space structures. Although a variety of techniques have been developed for detecting damages, there are still a number of research issues concerning the prediction performance and efficiency of the techniques that need to be addressed (Auwerarer & Peeters, 2003; De Boe & Golinval, 2001).

CONCLUSION

In this paper, a survey of emerging data mining techniques for damage detection in structures is provided. This survey reveals that the existing data mining techniques are based predominantly on changes in properties of the structure to classify, localize, and predict the extent of damage.

REFERENCES

Adams, D., & Farrar, C. (2002). Classifying linear and nonlinear structural damage using frequency domain ARX models. Structural Health Monitoring, 1(2), 185-201.

Anderson, T., Lemoine, G., & Ambur, D. (2003). An artificial neural network based damage detection scheme for electrically conductive composite structures. Proceedings of the 44th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, Norfolk, Virginia.

Auwerarer, H., & Peeters, B. (2003). International research projects on structural health monitoring: An overview. Structural Health Monitoring, 2(4), 341-358.

Bernal, D., & Gunes, B. (2000). Extraction of system matrices from state-space realizations. Proceedings of the 14th Engineering Mechanics Conference, Austin, Texas.

De Boe, P., & Golinval, J.-C. (2001). Damage localization using principal component analysis of distributed sensor array. In F.K. Chang (Ed.), Structural health monitoring: The demands and challenges (pp. 860-861). Boca Raton, FL: CRC Press.

Doebling, S., Farrar, C., Prime, M., & Shevitz, D. (1996). Damage identification and health monitoring of structural systems from changes in their vibration characteristics: A literature review [Report LA-12767-MS]. Los Alamos, NM: Los Alamos National Laboratory.
Doherty, J. (1987). Nondestructive evaluation. In A.S. Kobayashi (Ed.), Handbook on experimental mechanics (ch. 12). Englewood Cliffs, NJ: Society of Experimental Mechanics, Inc.

Friswell, M., Penny, J., & Garvey, S. (1997). Parameter subset selection in damage location. Inverse Problems in Engineering, 5(3), 189-215.

Friswell, M., Penny, J., & Wilson, D. (1994). Using vibration data and statistical measures to locate damage in structures. Modal Analysis: The International Journal of Analytical and Experimental Modal Analysis, 9(4), 239-254.

Fritzen, C., & Bohle, K. (2001). Vibration based global damage identification: A tool for rapid evaluation of structural safety. In F.K. Chang (Ed.), Structural health monitoring: The demands and challenges (pp. 849-859). Boca Raton, FL: CRC Press.

Haritos, N., & Owen, J.S. (2004). The use of vibration data for damage detection in bridges: A comparison of system identification and pattern recognition approaches. Structural Health Monitoring, 3(2), 141-163.

Keller, E., & Ray, A. (2003). Real-time health monitoring of mechanical structures. Structural Health Monitoring, 2(3), 191-203.

Kerschen, G., & Golinval, J.-C. (2004). Feature extraction using auto-associative neural networks. Smart Materials and Structures, 13, 211-219.

Khoo, L., Mantena, P., & Jadhav, P. (2004). Structural damage assessment using vibration modal analysis. Structural Health Monitoring, 3(2), 177-194.

Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2003a). Localized prediction of continuous target variables using hierarchical clustering. Proceedings of the Third IEEE International Conference on Data Mining, Florida.

Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2003b). Damage prediction in structural mechanics using hierarchical localized clustering-based approach. Proceedings of Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, Orlando, Florida.

Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2004). Effective localized regression for damage detection in large complex mechanical structures. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington.

Maalej, M., Karasaridis, A., Pantazopoulou, S., & Hatzinakos, D. (2002). Structural health monitoring of smart structures. Smart Materials and Structures, 11, 581-589.

Mita, A., & Hagiwara, H. (2003). Damage diagnosis of a building structure using support vector machine and modal frequency patterns. Proceedings of SPIE, 5057, San Diego, CA.

Ni, Y., Wang, B., & Ko, J. (2002). Constructing input vectors to neural networks for structural damage identification. Smart Materials and Structures, 11, 825-833.

Patsias, S., & Staszewski, W.J. (2002). Damage detection using optical measurements and wavelets. Structural Health Monitoring, 1(1), 5-22.

Sandhu, S., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2001). Damage prediction and estimation in structural mechanics based on data mining. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining/Fourth Workshop on Mining Scientific Datasets, San Francisco, California.

Sandhu, S., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2002). A sub-structuring approach via data mining for damage prediction and estimation in complex structures. Proceedings of the SIAM International Conference on Data Mining, Arlington, Virginia.

Sekhar, S. (2003). Identification of a crack in rotor system using a model-based wavelet approach. Structural Health Monitoring, 2(4), 293-308.

Wang, W. (2003). An evaluation of some emerging techniques for gear fault detection. Structural Health Monitoring, 2(3), 225-242.

Worden, K., Manson, G., & Fieller, N. (2000). Damage detection using outlier analysis. Journal of Sound and Vibration, 229(3), 647-667.

Yun, C., & Bahng, E.Y. (2000). Sub-structural identification using neural networks. Computers & Structures, 77, 41-52.

Zang, C., Friswell, M.I., & Imregun, M. (2004). Structural damage detection using independent component analysis. Structural Health Monitoring, 3(1), 69-83.

Zhao, J., Ivan, J., & DeWolf, J. (1998). Structural damage detection using artificial neural networks. Journal of Infrastructure Systems, 4(3), 93-101.
Data Mining for Damage Detection in Engineering Structures
KEY TERMS

Boundary Element Method: Numerical method for solving differential equations with boundary/initial conditions over the surface of a domain.

Finite Element Method: Numerical method for solving differential equations with boundary/initial conditions over a domain.

Modal Properties: Natural frequencies, mode shapes, and mode curvatures together constitute the modal properties.

Mode Shapes: Eigenvectors associated with the natural frequencies of the structure.

Natural Frequency: An eigenvalue of the mass and stiffness matrix system of the structure.

Smart Structure: A structure with a structurally integrated fiber optic sensing system.

Structures and Structural System: The word "structure" is used loosely in this article: a structure refers to the continuum material, whereas a structural system consists of structures connected at joints.
Data Mining for Intrusion Detection
tion systems (e.g., tcpdump and netflow data for network intrusion detection, syslogs or system calls for host intrusion detection). However, such collected data is often available only in a raw format and needs to be processed before it can be used by data mining techniques. For example, in the MADAM ID project (Lee, 2000, 2001) at Columbia University, association rules and frequent episodes were extracted from network connection records to construct three groups of features: (i) content-based features that describe intrinsic characteristics of a network connection (e.g., number of packets, acknowledgments, data bytes from source to destination); (ii) time-based traffic features that compute the number of connections in some recent time interval (e.g., the last few seconds); and (iii) connection-based features that compute the number of connections from a specific source to a specific destination in the last N connections (e.g., N = 1000).

When the feature construction step is complete, the obtained features may be used in any data mining technique.

Misuse Detection

In misuse detection based on data mining, each instance in a data set is labeled as normal or attack/intrusion, and a learning algorithm is trained over the labeled data. These techniques are able to automatically retrain intrusion detection models on different input data that include new types of attacks, as long as they have been labeled appropriately. Unlike signature-based intrusion detection systems, data mining based misuse detection models are created automatically and can be more sophisticated and precise than manually created signatures. Although misuse detection models have a high degree of accuracy in detecting known attacks and their variations, their obvious drawback is the inability to detect attacks whose instances have not yet been observed. In addition, labeling data instances as normal or intrusive may require enormous amounts of time from many human experts.

Standard data mining techniques are not directly applicable to the intrusion detection problem, since it involves a skewed class distribution (attacks/intrusions correspond to a class of interest that is much smaller, i.e., rarer, than the class representing normal behavior) and learning from data streams (attacks/intrusions very often represent a sequence of events). A number of researchers have therefore developed specially designed data mining algorithms suitable for intrusion detection. Research in misuse detection has focused mainly on classification of network intrusions using various standard data mining algorithms (Barbara, 2001; Ghosh, 1999; Lee, 2001; Sinclair, 1999), rare-class predictive models (Joshi, 2001), and association rules (Barbara, 2001; Lee, 2000; Manganaris, 2000).

MADAM ID (Lee, 2000, 2001) was one of the first projects that applied data mining techniques to the intrusion detection problem. In addition to the standard features that were available directly from the network traffic (e.g., duration, start time, service), the three groups of constructed features were also used by the RIPPER algorithm to learn intrusion detection rules from the DARPA 1998 data set (Lippmann, 1999). Other classification algorithms that have been applied to the intrusion detection problem include standard decision trees (Bloedorn, 2001; Sinclair, 1999), modified nearest-neighbor algorithms (Ye, 2001b), fuzzy association rules (Bridges, 2000), neural networks (Dao, 2002; Lippmann, 2000a), naïve Bayes classifiers (Schultz, 2001), genetic algorithms (Bridges, 2000), and genetic programming (Mukkamala, 2003a), among others. Most of these approaches attempt to apply standard techniques directly to publicly available intrusion detection data sets (Lippmann, 1999, 2000b), assuming that the labels for normal and intrusive behavior are already known. Since this is not a realistic assumption, misuse detection based on data mining has not been very successful in practice.

Anomaly Detection

Anomaly detection creates profiles of normal, legitimate computer activity (e.g., normal behavior of users, hosts, or network connections) using different techniques and then uses a variety of measures to detect deviations from the defined normal behavior as potential anomalies. Anomaly detection models often learn from a set of normal (attack-free) data, but this requires cleaning the data of attacks and labeling only normal data records. Other anomaly detection techniques detect anomalous behavior without using any knowledge about the training data; such models typically assume that the data records that do not belong to the majority behavior correspond to anomalies.

The major benefit of anomaly detection algorithms is their ability to potentially recognize unforeseen and emerging cyber attacks. However, their major limitation is a potentially high false alarm rate, since deviations detected by anomaly detection algorithms may not necessarily represent actual attacks, but new or unusual, yet still legitimate, network behavior.

Anomaly detection algorithms can be classified into several groups: (i) statistical methods; (ii) rule-based methods; (iii) distance-based methods; (iv) profiling methods; and (v) model-based approaches (Lazarevic, 2004). Although anomaly detection algorithms are quite diverse in nature, and thus may fit into more than one proposed category, most of them employ certain data mining or artificial intelligence techniques.
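The time-based and connection-based traffic features described earlier in this article can be sketched as follows. The record schema ("time", "src", "dst") and the window sizes are our own illustrative assumptions, not the actual MADAM ID implementation.

```python
from collections import deque

def traffic_features(records, window_sec=2.0, last_n=5):
    """Augment connection records with two constructed feature groups:
    - time-based: number of connections in the preceding window_sec seconds
    - connection-based: connections with the same (src, dst) pair among
      the last_n preceding connections.
    Each record is a dict with hypothetical 'time', 'src' and 'dst' keys.
    """
    recent = deque()                   # (time, src, dst) inside the time window
    history = deque(maxlen=last_n)     # the last_n preceding connections
    out = []
    for r in records:
        t, src, dst = r["time"], r["src"], r["dst"]
        # Drop connections that fell out of the sliding time window
        while recent and t - recent[0][0] > window_sec:
            recent.popleft()
        time_based = len(recent)
        conn_based = sum(1 for (_, s, d) in history if (s, d) == (src, dst))
        out.append(dict(r, time_based=time_based, conn_based=conn_based))
        recent.append((t, src, dst))
        history.append((t, src, dst))
    return out

conns = [
    {"time": 0.0, "src": "a", "dst": "x"},
    {"time": 0.5, "src": "a", "dst": "x"},
    {"time": 1.0, "src": "b", "dst": "x"},
    {"time": 4.0, "src": "a", "dst": "x"},
]
feats = traffic_features(conns)
```

The augmented records could then be fed to any classifier, as the section above notes.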
Statistical methods: Statistical methods monitor the user or system behavior by measuring certain variables over time (e.g., login and logout time of each session). The basic models keep averages of these variables and detect whether thresholds are exceeded based on the standard deviation of the variable. More advanced statistical models compute profiles of long-term and short-term user activities by employing different techniques, such as the Kolmogorov-Smirnov test (Cabrera, 2000), chi-square (χ²) statistics (Ye, 2001a), probabilistic modeling (Yamanishi, 2000), and likelihood of data distributions (Eskin, 2000).

Rule-based systems: Rule-based systems were used in earlier anomaly detection based IDSs to characterize the normal behavior of users, networks and/or computer systems by a set of rules. Examples of such rule-based IDSs include ComputerWatch (Dowell, 1990) and Wisdom & Sense (Liepins, 1992).

Distance-based methods: Most statistical approaches have limitations when detecting outliers in higher-dimensional spaces, since it becomes increasingly difficult and inaccurate to estimate the multidimensional distributions of the data points. Distance-based approaches attempt to overcome these limitations and detect outliers by computing distances among points. Several distance-based outlier detection algorithms recently proposed for detecting anomalies in network traffic (Lazarevic, 2003) are based on computing the full-dimensional distances of points from one another using all the available features, and on computing the densities of local neighborhoods. Values of categorical features are converted into the frequencies of their occurrences and then treated as continuous. MINDS (Minnesota Intrusion Detection System) (Ertoz, 2004) employs outlier detection algorithms to assign an anomaly score to each network connection; a human analyst then has to look at only the most anomalous connections to determine whether they are actual attacks or other interesting behavior. Experiments on live network traffic have shown that MINDS is able to routinely detect various suspicious behaviors (e.g., policy violations), worms, as well as various scanning activities.

In addition, several clustering-based techniques, such as fixed-width and canopy clustering (Eskin, 2002), have been used to detect network intrusions in the DARPA 1998 data sets as small clusters when compared to the large clusters that correspond to normal behavior. In another interesting approach (Fan, 2001), artificial anomalies are generated around the edges of the sparsely populated data regions, thus forcing the learning algorithm to discover the specific boundaries that distinguish these regions from the rest of the data.

Profiling methods: In profiling methods, profiles of normal behavior are built for different types of network traffic, users, programs, etc., and deviations from them are considered intrusions. Profiling methods vary greatly, ranging from different data mining techniques to various heuristic-based approaches.

For example, ADAM (Audit Data and Mining) (Barbara, 2001) is a hybrid anomaly detector trained on both attack-free traffic and traffic with labeled attacks. The system uses a combination of association rule mining and classification to discover novel attacks in tcpdump data by using the pseudo-Bayes estimator. The recently reported IDDM system (Abraham, 2001) is an off-line IDS, where intrusions are detected only when sufficient amounts of data have been collected and analyzed. The IDDM system describes profiles of network data at different times, identifies any large deviations between these data descriptions, and produces alarms in such cases. PHAD (packet header anomaly detection) (Mahoney, 2002) monitors network packet headers and builds profiles for 33 different fields from these headers by observing attack-free traffic and building contiguous clusters for the values observed for each field. ALAD (application layer anomaly detection) (Mahoney, 2002) uses the same method for calculating the anomaly scores as PHAD, but it monitors TCP data and builds TCP streams when the destination port is smaller than 1024.

Finally, several recently proposed commercial products use profiling-based anomaly detection techniques. For example, Antura from System Detection (System Detection, 2003) uses data mining based user profiling, while Mazu Profiler from Mazu Networks (Mazu Networks, 2003) and Peakflow X from Arbor Networks (Arbor Networks, 2003) use rate-based and connection-profiling anomaly detection schemes.

Model-based approaches: Many researchers have used different types of data mining models, such as replicator neural networks (Hawkins, 2002) or unsupervised support vector machines (Eskin, 2002; Lazarevic, 2003), to characterize the normal behavior of the monitored system.
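As a minimal sketch of the distance-based idea (scoring each point by its average distance to its k nearest neighbors), the following is a generic illustration under our own assumptions, not the actual MINDS algorithm:

```python
import math  # math.dist requires Python 3.8+

def knn_anomaly_scores(points, k=2):
    """Score each point by the average Euclidean distance to its k
    nearest neighbors; larger scores suggest outliers."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# A tight cluster plus one far-away point: the last point scores highest
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_anomaly_scores(pts)
print(scores.index(max(scores)))  # → 4
```

A real system would, as the text notes, first convert categorical features to occurrence frequencies so that all attributes are numeric.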
Summarization techniques use frequent itemsets or association rules to characterize normal and anomalous behavior in the monitored computer systems. For example, association patterns generated at different times were used to study significant changes in the network traffic characteristics at different periods of time (Lee, 2001). Association pattern analysis has also been shown to be beneficial in constructing profiles of normal network traffic behavior (Manganaris, 1999). MINDS (Ertoz, 2004) uses association patterns to provide a high-level summary of the network connections that are ranked highly anomalous by the anomaly detection module. These summaries allow a human analyst to examine a large number of anomalous connections quickly and provide templates from which signatures of novel attacks can be built to augment the database of signature-based intrusion detection systems.

FUTURE TRENDS

Intrusion detection techniques have improved dramatically over time, especially in the past few years. IDS technology is developing rapidly, and its near-term future is very promising. Data mining techniques for intrusion detection are increasingly becoming an indispensable and integral component of any comprehensive enterprise security program, since they successfully complement traditional security mechanisms.

REFERENCES

Arbor Networks. (2003). Intelligent network management with Peakflow Traffic. Retrieved from http://www.arbornetworks.com/products_sp.php

Barbara, D., Wu, N., & Jajodia, S. (2001). Detecting novel network intrusions using Bayes estimators. In Proceedings of the First SIAM Conference on Data Mining, Chicago, IL.

Bloedorn, E., Christiansen, A., Hill, W., Skorupka, C., Talbot, L., & Tivel, J. (2001). Data mining for network intrusion detection: How to get started. MITRE Technical Report. Retrieved from www.mitre.org/work/tech_papers/tech_papers_01/bloedorn_datamining

Bridges, S., & Vaughn, R. (2000). Fuzzy data mining and genetic algorithms applied to intrusion detection. In Proceedings of the 23rd National Information Systems Security Conference, Baltimore, MD.

Cabrera, J., Ravichandran, B., & Mehra, R. (2000). Statistical traffic modeling for network intrusion detection. In Proceedings of the 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

Dao, V., & Vemuri, R. (2002). Computer network intrusion detection: A comparison of neural networks methods. Differential Equations and Dynamical Systems, Special Issue on Neural Networks.
Dowell, C., & Ramstedt, P. (1990). The ComputerWatch data reduction tool. In Proceedings of the 13th National Computer Security Conference, Washington, DC.

Ertoz, L., Eilertson, E., Lazarevic, A., Tan, P., Srivastava, J., Kumar, V., & Dokas, P. (2004). MINDS: Minnesota Intrusion Detection System. In A. Joshi, H. Kargupta, K. Sivakumar, & Y. Yesha (Eds.), Next generation data mining. Boston: Kluwer Academic Publishers.

Eskin, E. (2000). Anomaly detection over noisy data using learned probability distributions. In Proceedings of the International Conference on Machine Learning, Stanford University, CA.

Eskin, E., Arnold, A., Prerau, M., Portnoy, L., & Stolfo, S. (2002). A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In S. Jajodia & D. Barbara (Eds.), Applications of data mining in computer security, advances in information security. Boston: Kluwer Academic Publishers.

Fan, W., Lee, W., Miller, M., Stolfo, S.J., & Chan, P.K. (2001). Using artificial anomalies to detect unknown and known network intrusions. In Proceedings of the First IEEE International Conference on Data Mining, San Jose, CA.

Ghosh, A., & Schwartzbard, A. (1999). A study in using neural networks for anomaly and misuse detection. In Proceedings of the Eighth USENIX Security Symposium (pp. 141-151).

Hawkins, S., He, H., Williams, G., & Baxter, R. (2002). Outlier detection using replicator neural networks. In Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery (pp. 170-180). Lecture Notes in Computer Science 2454. Aix-en-Provence, France.

Joshi, M., Agarwal, R., & Kumar, V. (2001). PNrule, mining needles in a haystack: Classifying rare classes via two-phase rule induction. In Proceedings of the ACM SIGMOD Conference on Management of Data, Santa Barbara, CA.

Lazarevic, A., Ertoz, L., Ozgur, A., Srivastava, J., & Kumar, V. (2003). A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA.

Lazarevic, A., Kumar, V., & Srivastava, J. (2004). Intrusion detection: A survey. In V. Kumar, J. Srivastava, & A. Lazarevic (Eds.), Managing cyber threats: Issues, approaches and challenges. Boston: Kluwer Academic Publishers.

Lee, W., & Stolfo, S.J. (2000). A framework for constructing features and models for intrusion detection systems. ACM Transactions on Information and System Security, 3(4), 227-261.

Lee, W., Stolfo, S.J., & Mok, K. (2001). Adaptive intrusion detection: A data mining approach. Artificial Intelligence Review, 14, 533-567.

Liepins, G., & Vaccaro, H. (1992). Intrusion detection: Its role and validation. Computers and Security, 347-355.

Lippmann, R., & Cunningham, R. (2000a). Improving intrusion detection performance using keyword selection and neural networks. Computer Networks, 34(4), 597-603.

Lippmann, R., Haines, J.W., Fried, D.J., Korba, J., & Das, K. (2000b). The 1999 DARPA off-line intrusion detection evaluation. Computer Networks.

Lippmann, R.P., Cunningham, R.K., Fried, D.J., Graf, I., Kendall, K.R., Webster, S.E., & Zissman, M.A. (1999). Results of the DARPA 1998 offline intrusion detection evaluation. In Proceedings of the Workshop on Recent Advances in Intrusion Detection.

Mahoney, M., & Chan, P. (2002). Learning nonstationary models of normal network traffic for detecting novel attacks. In Proceedings of the Eighth ACM International Conference on Knowledge Discovery and Data Mining (pp. 376-385), Edmonton, Canada.

Manganaris, S., Christensen, M., Zerkle, D., & Hermiz, K. (2000). A data mining analysis of RTID alarms. Computer Networks, 34(4), 571-577.

Mazu Networks. (2003). Mazu Profiler: An overview. Retrieved from www.mazunetworks.com/solutions/white_papers/download/Mazu_Profiler.pdf

Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., & Weaver, N. (2003). The spread of the Sapphire/Slammer worm. Retrieved from www.cs.berkeley.edu/~nweaver/sapphire

Mukkamala, S., Sung, A., & Abraham, A. (2003a). A linear genetic programming approach for modeling intrusion. In Proceedings of the IEEE Congress on Evolutionary Computation, Perth, Australia.

Ryan, J., Lin, M-J., & Miikkulainen, R. (1997). Intrusion detection with neural networks. In Proceedings of the AAAI Workshop on AI Approaches to Fraud Detection and Risk Management (pp. 72-77), Providence, RI.

Schultz, M., Eskin, E., Zadok, E., & Stolfo, S. (2001). Data mining methods for detection of new malicious executables. In Proceedings of the IEEE Symposium on Security and Privacy (pp. 38-49), Oakland, CA.
Sinclair, C., Pierce, L., & Matzner, S. (1999). An application of machine learning to network intrusion detection. In Proceedings of the 15th Annual Computer Security Applications Conference (pp. 371-377).

System Detection. (2003). Anomaly detection: The Antura difference. Retrieved from http://www.sysd.com/library/anomaly.pdf

Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 320-324), Boston, MA.

Ye, N., & Chen, Q. (2001a). An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International Journal, 17(2), 105-112.

Ye, N., & Li, X. (2001b, June). A scalable clustering technique for intrusion signature recognition. In Proceedings of the IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY.

KEY TERMS

Anomaly Detection: Analysis strategy that identifies intrusions as unusual behavior that differs from the normal behavior of the monitored system.

Intrusion: Malicious, externally induced operational fault in a computer system.

Intrusion Detection: Identifying a set of malicious actions that compromise the integrity, confidentiality, and availability of information resources.

Misuse Detection: Analysis strategy that looks for events or sets of events that match a predefined pattern of a known attack.

Signature-Based Intrusion Detection: Analysis strategy in which monitored events are matched against a database of attack signatures to detect intrusions.

Tcpdump: Computer network debugging and security tool that allows the user to intercept and display TCP/IP packets being transmitted over a network to which the computer is attached.

Worms: Self-replicating programs that aggressively spread through a network by taking advantage of automatic packet sending and receiving features found on many computers.
Data Mining in Diabetes Diagnosis and Detection
registries and databases with systematically collected patient information. A large-scale research effort was carried out by applying data mining techniques to a diabetic data warehouse from an integrated health care system in the New Orleans area with 30,383 diabetic patients (Breault et al., 2002).

To prepare the data tables, structured query language (SQL) statements were executed on the data warehouse to form the flat file used as input to the data mining software. For the data mining itself, CART software was used with a binary target variable and ten predictors: age, sex, emergency department visits, office visits, comorbidity index, dyslipidemia, hypertension, cardiovascular disease, retinopathy, and end-stage renal disease. The outcome showed that the most important variable associated with bad glycemic control is younger age, not the comorbidity index or whether patients have related diseases. The total classification error (40.5%) was substantial.

PIMA Indian Case: Machine Learning Mining Approach

Another special group in the USA that deserves separate mention because of the importance of the findings is the Pima Indian case. Because of the very high rate of diabetes among Pima Indians, the Pima Indian Diabetes Database (PIDD), with 768 diabetes patients, was established by the National Institutes of Health.

There have been many studies applying data mining techniques to the PIDD. Some well-known examples of the data mining techniques used include the multi-stream dependency detection (MSDD) algorithm, Bayesian neural networks, and multiplier-free feed-forward neural networks. Although the cited examples use somewhat different subgroups of the PIDD, accuracy for predicting diabetes ranges from 66% to 81%. It is interesting to see that, with a wide variety of prediction tools available, efficiency and accuracy can be improved greatly in diabetes data mining. Recently, several modified data mining techniques have been used successfully on the same database. These include decision trees with an augmented splitting criterion (Buja & Lee, 2001). The use of the fuzzy naïve Bayes method yielded a best-case accuracy of 76.95% (Tang et al., 2002). The application of shunting inhibitory neural networks, where the neurons can act as adaptive non-linear filters, resulted in an accuracy of over 80% and generally performed better than multi-layer perceptrons (Arulampalam & Bouzerdoum, 2001).

Poland Case: Rough Sets Technique

In Poland, more than 1.5 million people suffer from diabetes, and more than 41,000 have Type 1 diabetes (Medtronic Limited, 2002). Rough sets have been applied to mine data in a Polish diabetes database using the ROSETTA software (Øhrn, 1997). The rough sets approach investigates structural relationships in the data rather than probability distributions and produces decision tables rather than trees.

Figure 1. Image of ROSETTA software on training and test data sets

A recent study from a Polish medical school used a dataset of 107 patients aged five to 22 who were suffering from insulin-dependent diabetes. In this study, it was found that the minimal subsets of attributes that are efficient for rule making included age at disease diagnosis, microalbuminuria (yes/no), and disease duration in years. With the use of rough sets techniques, decision rules were generated to predict microalbuminuria. The best predictor was age <7, predicting no microalbuminuria at 83.3% accuracy.

Singapore Case: Data Cleansing and Interaction With Domain Experts

Around 10% of the population in Singapore is diabetic. In the diabetes data mining exercise conducted in Singapore, the objective was to find rules that could be used by physicians to understand more about diabetes and to find special patterns about a particular patient population (Hsu et al., 2000). In order to deal with noisy data, a semi-automatic data cleaning system was utilized to reconcile format differences among tables with user mapping input. A sorted neighborhood method was used to remove duplicate records.

In a particular case at the National University of Singapore, the researchers mined the data via a classification with association rule tool, using thresholds of 1% support and 50% confidence. They generated 700 rules. The physicians were overwhelmed with the large number and also wanted to know causal connections rather than associations.
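Support and confidence thresholds like the 1%/50% values used in the Singapore study can be computed as follows; the record attributes here are invented for illustration and are not from the actual study.

```python
def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent
    over a list of item sets (each transaction is a Python set)."""
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = both / n                      # fraction containing both sides
    confidence = both / a if a else 0.0     # P(consequent | antecedent)
    return support, confidence

# Hypothetical patient records coded as attribute sets
records = [
    {"obese", "hypertension", "diabetic"},
    {"obese", "diabetic"},
    {"obese"},
    {"hypertension"},
]
sup, conf = rule_stats(records, {"obese"}, {"diabetic"})
print(sup, conf)  # → 0.5 0.6666666666666666
```

A mining tool would keep only rules whose support and confidence exceed the chosen thresholds.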
It was realized that post-processing was needed in order to make the results usable for the physicians. The tree method was employed to generate general rules, giving the underlying trends in the data that the physicians already knew, and exception rules, giving deviations from these trends. The physicians found the exception rules especially helpful in understanding how sub-population trends differ from those of the main population. With the help of the physicians' domain knowledge, the data mining process was optimized by reducing the size of the data and hypothesis space and by removing unnecessary query operations (Owrang, 2000).

FUTURE TRENDS

An important step in using data mining for diagnosing diabetes is the creation of a centralized database that can store most diabetic patients' health data. The larger the data pool, the more accurate the results from data mining are likely to be. The experiences and the many challenges faced in building a data warehouse for a not-for-profit organization called Christiana Care in the state of Delaware, USA, are described by Ewen et al. (1998), who report that the creation of the data warehouse was responsible for a gain in operating revenue in 1997. Though the task is not easy, for successful data mining it is almost imperative to create data warehouses for sharing diabetes-related information among physicians across the world. The major obstacle is that only a small number of diabetic patients in the world have access to shared care, while many others remain undiagnosed, untreated, or suboptimally treated (Chan, 2000). This is a major challenge that has to be overcome in the future.

The case studies clearly indicate that no single method has proved effective for mining diabetes databases. Many tools and techniques, such as statistical methods, machine learning algorithms, neural networks, Bayesian neural networks, and rough sets, have been employed to discover useful rules and patterns that help physicians combat the disease. There is further need in the future to use other techniques, such as case-based reasoning for assessing the risk of complications for individual diabetes patients (Armengol et al., 2004; Montani et al., 2003), or hybrid techniques such as rule induction using simulated annealing for discovering associations between observations made of patients on their first visit and early mortality (Richards et al., 2001).

Moreover, besides mining diabetes databases themselves, it is also important that the same work be carried out to study the complications of diabetes and to understand its relationship to other diseases. By applying different data-mining techniques and tools, more knowledge about associations between diabetes and other diseases can be uncovered to improve public health in the future. Though Type 1 and Type 2 diabetes have been studied frequently, more effort is needed in the future to study the diagnosis and detection of gestational diabetes.

CONCLUSION

The occurrence of diabetes is increasing at an alarming rate all over the world, and its development is believed to involve the interplay of many unknown factors, such as genetic and environmental ones. In view of this, data-mining technology can play a very important role in analyzing existing diabetes databases and identifying useful rules that help diabetes prevention and control.

It is worthwhile to recognize that successful application of data-mining technologies to diabetes prevention and control requires: (1) preparing a comprehensive diabetes database for input into data mining software, to avoid "garbage in, garbage out"; this will require data cleansing and transformations from a relational data warehouse to a data mining data table that is usable by data mining tools; (2) selecting and skillfully applying the appropriate data-mining software and techniques; (3) intelligently sifting through the software output to prioritize the areas that will provide the most cost savings or outcomes improvement; and (4) interacting with domain experts to select the best-fit rules and patterns that optimize the efficiency and effectiveness of the data-mining process. In this article, we have studied how the above steps are applied for mining diabetes databases in general and, in particular, have discussed the various methods adopted by researchers in different countries for diagnosis of diabetes.

REFERENCES

Apte, C., Liu, B., Pednault, E.P.D., & Smyth, P. (2002). Business applications of data mining. Communications of the ACM, 45(8), 49-53.

Armengol, E., Palaudaries, A., & Plaza, E. (2004). Individual prognosis of diabetes long-term risks: A CBR approach. Technical Report IIIA.

Arulampalam, G., & Bouzerdoum, A. (2001). Application of shunting inhibitory artificial neural networks to medical diagnosis. Proceedings of the Seventh Australian and New Zealand Intelligent Information Systems Conference.
Breault, J.L. (2001). Data mining diabetic databases: Are rough sets a useful addition? [Electronic version]. Computing Science and Statistics, 33.

Breault, J.L. (2002). Mathematical challenges of variable transformations in data mining diabetic data warehouses. Retrieved from http://www.ipam.ucla.edu/publications/sdm2002/sdm2002_jbreault_poster.pdf

Breault, J.L., Goodall, C.R., & Fos, P.J. (2002). Data mining a diabetic data warehouse. Artificial Intelligence in Medicine, 26, 37-54.

Buja, A., & Lee, Y-S. (2001). Data mining criteria for tree-based regression and classification. San Francisco, CA: KDD 2001.

Chan, J.C.N. (2000). Heterogeneity of diabetes mellitus in the Hong Kong Chinese population. Hong Kong Medical Journal, 6(1), 77-84.

Diabetes Facts. (2004). Retrieved from http://diabetes.mdmercy.com/about_diabetes/facts.html

Ewen, E.F., et al. (1998). Data warehousing in an integrated health system: Building the business case. Washington, D.C.: DOLAP 98.

Hsu, W., Lee, M.L., Liu, B., & Ling, T.W. (2000). Exploration mining in diabetic patients databases: Findings and conclusions. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts.

Lee, S.C., et al. (2000). Diabetes in Hong Kong Chinese: Evidence for familial clustering and parental effects. Diabetes Care, 23, 1365-1368.

Medtronic Limited. (2002). Diabetes overview. Retrieved from http://www.medtronic.com/UK/health/diabetes/diabetes_overview.html

Montani, S., et al. (2003). Integrating model-based decision support in a multi-modal reasoning system for managing type 1 diabetic patients. Artificial Intelligence in Medicine, 29, 131-151.

Øhrn, A., & Komorowski, J. (1997). ROSETTA: A rough set toolkit for analysis of data. Proceedings of the Joint Conference of Information Sciences, Durham, North Carolina.

Owrang, M.M. (2000). Using domain knowledge to optimize the knowledge discovery process in databases. International Journal of Intelligent Systems, 15, 45-60.

Richards, G., Rayward-Smith, V.J., Sonksen, P.H., Carey, S., & Weng, C. (2001). Data mining for indicators of early mortality in a database of clinical records. Artificial Intelligence in Medicine, 22, 215-231.

Tang, Y., Pan, W., Qiu, X., & Xu, Y. (2002). The identification of fuzzy weighted classification system incorporated with fuzzy Naïve Bayes from data. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics.

KEY TERMS

Body Mass Index (BMI): A measure of the mass of an individual, calculated as weight divided by height squared.

Data Warehouse: A database, frequently very large, that can access all of a company's information. It contains data about how the warehouse is organized, where the information can be found, and any connections between existing data.

Gestational Diabetes: This form of diabetes develops in 2% to 5% of all pregnancies but disappears when the pregnancy is over.

Neural Networks: Computer processors or software based on the human brain's mesh-like neuron structure. Neural networks can learn to recognize patterns and program themselves to solve related problems on their own.

Rough Sets: A method of representation of uncertainty in the membership of a set. It is related to fuzzy sets and is a popular data mining technique in medicine and finance.

Type 1 Diabetes: This insulin-dependent diabetes mellitus has risk factors that are less well defined. Autoimmune, genetic, and environmental factors are involved in the development of this type of diabetes.

Type 2 Diabetes: This non-insulin-dependent diabetes mellitus accounts for about 90% to 95% of all diagnosed cases of diabetes. Risk factors include older age, obesity, family history, prior history of gestational diabetes, impaired glucose tolerance, physical inactivity, and race/ethnicity.
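The Body Mass Index definition above amounts to a one-line computation. As a minimal illustration (assuming weight in kilograms and height in metres, the units under which the formula is usually stated):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight divided by height squared."""
    return weight_kg / height_m ** 2

# For example, a person weighing 70 kg at 1.75 m tall:
print(round(bmi(70, 1.75), 1))  # 22.9
```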
Data Mining in Human Resources

Lori K. Long
Kent State University, USA

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
are not necessarily statistically significant or useful (Fayyad et al., 1996a). There has not been any specific exploration of applying these techniques to human resource applications; however, there are some guidelines in the process that are transferable to an HRIS. Feelders, Daniels and Holsheimer (2000) outline six important steps in the data mining process: 1) problem definition, 2) acquisition of background knowledge, 3) selection of data, 4) pre-processing of data, 5) analysis and interpretation, and 6) reporting and use. At each of these steps, we will look at important considerations as they relate to data mining human resources databases. Further, we will examine some specific legal and ethical considerations of data mining in the HR context.

The formulation of the questions to be explored is an important aspect of the data mining process. As mentioned earlier, with enough searching or application of sufficiently many techniques, one might be able to find useless or ungeneralizable patterns in almost any set of data. Therefore, the effectiveness of a data mining project is improved by establishing some general outlines of inquiry prior to starting the project. To this extent, data mining and the more traditional statistical studies are similar. Thus, careful attention to the scientific method and sound research methods is required. A widely respected source of guidelines on research methods is the book by Kerlinger and Lee (2000).

A certain level of expertise is necessary to carefully evaluate questions posed in a data mining project. Obviously, data mining and statistical expertise is a requirement, but one must also have some intimate understanding of the data that is available, along with its business context. Furthermore, some subject matter expertise is needed to determine useful questions, select relevant data and interpret results (Feelders et al., 2000). For example, a firm with an interest in evaluating the success of an affirmative action program needs to understand the Equal Employment Opportunity (EEO) classification system to know what data is relevant.

Another important consideration in developing a question to investigate is the role of causality (Feelders et al., 2000). A subject matter expert's involvement is important in interpreting the results of the data analysis. For example, a firm might find a pattern indicating a relationship between high compensation levels and extended length of service. The question then becomes: do employees stay with the company longer because they receive high compensation, or do employees receive higher compensation because they stay longer with the company? An expert in the area can take the relationship discovered and build upon it with additional information available in the organization to help understand the cause and effect of the specific relationship identified.

Selecting and preparing the data is the next step in the data mining process. Some organizations have independent Human Resource Information Systems that feature multiple databases that are not connected to each other. This type of system is sometimes selected to offer greater flexibility to remote organizational locations or sub-groups with unique information needs (Anthony et al., 1996). The possible inconsistency of the design of the databases could make data mining difficult when multiple databases exist. Data warehousing can prevent this problem, and an organization may need to create a data warehouse before it begins a data-mining project. The advantage gained in first developing the data warehouse or mart is that most of the data editing is effectively done in advance.

Another challenge in mining data is dealing with the issues of missing or noisy data. Data quality may be insufficient if data is collected without any specific analysis in mind (Feelders et al., 2000). This is especially true for human resource information. Typically, when HR data is collected, the purpose is some kind of administrative need such as payroll processing. The need of data for the required transaction is the only consideration in the type of data to collect; future analysis needs and the value in the data collected are not usually considered. Missing data may also be a problem, especially if the system administrator does not have control over data input. Many organizations have taken advantage of web-based technology to allow employees to input and update their own data (Hendrickson, 2003). Employees may choose not to enter certain types of data, resulting in missing data. However, a data warehouse or datamart may help to prevent or systematize the handling of many of these problems.

There are many types of algorithms in use in data mining. The choice of algorithm depends on the intended use of the extracted knowledge (Brodley, Lane, & Stough, 1999). The goals of data mining can be broken down into two main categories. Some applications seek to verify a hypothesis formulated by the user. The other main goal is the systematic discovery of new patterns (Fayyad et al., 1996a). Within discovery, the data can be used either to predict future behavior or to describe patterns in an understandable form. A complete discussion of data mining techniques is beyond the scope of this paper. However, the following techniques have the potential to be applicable for data mining of human resources information.

Clustering and classification are examples of data mining techniques borrowed from classical statistical methods that can help describe patterns in information. Clustering seeks to identify a small set of exhaustive and mutually exclusive categories to describe the data that is present (Fayyad et al., 1996a). This might be a useful application to human resource data if you were trying to
identify a certain set of employees with consistent attributes. For example, an employer may want to find out what main categories top-performing employees fall into, with an eye towards tailoring various programs to the groups or further study of such groups. One category may be more or less appropriate for one type of training program. A difficulty with clustering techniques is that no normative techniques are known that specify the correct number of clusters that should be formed. In addition, there exist many different logics that may be followed in forming the clusters. Therefore, the art of the analyst is critical.

Similarly, classification is a data mining technique that maps a data item into one of several pre-defined classes (Fayyad et al., 1996a). Classification may be useful in human resources to classify trends of movement through the organization for certain sets of successful employees. A company is at an advantage when recruiting if it can point out some realistic career paths for new employees. Being able to support those career paths with information reflecting employee success can make this a strong resource for those charged with hiring in an organization.

Decision Tree Analysis, also called tree or hierarchical partitioning, is a somewhat related technique, but it follows a very different logic and can be rendered somewhat more automatic. Here, a variable is chosen first in such a way as to maximize the difference or contrast formed by splitting the data into two groups. One group consists of all observations having a value higher than a certain value of the variable, such as the mean. Then the complement, namely those lower than that value, becomes the other group. Each half can then be subjected to successive further splits, with possibly different variables becoming important to different halves. For example, employees might first be split into two groups above and below average tenure with the firm. Then the statistics of the two groups can be compared and contrasted to gain insights about employee turnover factors. A further split of the lower-tenure group, say based on gender, may help prioritize those most likely to need special programs for retention. Thus, clusters or categories can be formed by binary cuts, a kind of divide-and-conquer approach. In addition, the order of variables can be chosen differently to make the technique more flexible. For each group formed, summary statistics can be presented and compared. This technique is a rather pure form of data mining and can be performed in the absence of specific questions or issues. It might be applied as a way of seeking interesting questions about a very large datamart.

Regression and related models, also borrowed from classical statistics, permit estimating a linear function of independent variables that best explains or predicts a given dependent variable. Since this technique is generally well known, we will not dwell on the details here. However, data warehouses and datamarts may be so large that direct use of all available observations is impractical for regression and similar studies. Thus, random sampling may be necessary to use regression analysis. Various nonlinear regression techniques are also available in commercial statistical packages and can be used in a similar way for data mining. Recently, a new model-fitting technique was proposed in Troutt et al. (2001). In this approach, the objective is to explain the highest or lowest performers, respectively, as a function of one or more independent variables.

The final step in the process emphasizes the value of the use of the information. The information extracted must be consolidated and resolved with previous information and then shared and acted upon (Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy, 1996b). Too often, organizations go through the effort and expense of collecting and analyzing data without any idea of how to use the information retrieved. Applying data mining techniques to an HRIS can help support the justification of the investment in the system. Therefore, the firm should have some expected use for the information retrieved in the process.

One use of human resource related information is to support decision-making in the organization. The results obtained from data mining may be used for a full range of decision-making steps. They can be used to provide information to support a decision, or can be fully integrated into an end-user application (Feelders et al., 2000). For example, a firm might be able to set up decision rules regarding employees based on the results of data mining. It might be able to determine when an employee is eligible for promotion or when a certain work group should be eligible for additional company benefits.

Organizational leaders must be aware of legislation concerning legal and privacy issues when making decisions about using personal data collected from individuals in organizations (Hubbard, Forcht, & Thomas, 1998). By their nature, systems that collect employee information run the risk of invading the privacy of employees by allowing others within the organization access to the information. Although there is no explicit constitutional right to privacy, certain amendments and federal laws have relevance to this issue, as they provide protection for employees from invasion of privacy and defamation (Fisher, Schoenfeldt, & Shaw, 1999). Further, recent epidemics of identity theft create a need for organizations to monitor access to employee information (Carlson, 2004). Organizations can protect themselves from these employee concerns by having solid business reasons for any data collected from employees and ensuring access to this data is restricted.

There are also some potential legal issues if a firm uses inappropriate information extracted from data mining to make employment-related decisions. Even if a
manager has an understanding of current laws, they could still face challenges as laws and regulations constantly change (Ledvinka & Scarpello, 1992). An extreme example that may violate equal opportunity laws is a decision to hire only females in a job classification because data mining uncovered that females were consistently more successful.

One research study found that an employee's ability to authorize disclosure of personal information affected their perceptions of fairness and invasion of privacy (Eddy & Stone-Romero, 1999). Therefore, it is recommended that firms notify employees upon hire that the information they provide may be used in data analyses. Another recommendation is to establish a committee or review board to monitor any activities relating to analysis of personal information (Osborn, 1978). This committee can review any proposed research and ensure compliance with any relevant employment or privacy laws. Often the employee's reaction to the use of their information is based upon their perceptions. If there is a perception that the company is analyzing the data to take negative actions against employees, employees are more apt to object to the use. However, if the employer takes the time to notify employees and obtain their permission, the perception of negativity may be removed. Even if there may be no legal consequences, employee confidence is something that employers need to maintain.

UPDATES AND ISSUES

As of this writing, there are still surprisingly few reported industry experiences with DM in HR. Of course, the newness of the area will require a time lag. In addition, many HR groups will not yet have the requisite experts and information systems capabilities. However, there are several forces at work that should improve the situation in the near future.

First, DM for the human resources area is rapidly becoming of interest to industry associations and software vendors. The Human Resources Benchmarking Association (http://www.hrba.org/roundtable.pdf) now provides links related to DM. In fact, this association is affiliated with and linked to the Data Mining Benchmarking Association. The former offers a variety of services related to DM, such as:

• Consortium studies with costs divided
• HR benchmarking efforts
• Data collection and database access
• Benchmarking studies of important data mining processes

Similarly, Dynamic Health Strategies (DHS, http://www.dhsgroup.com/) is an association that concentrates on group health benefit issues. It combines the use of proprietary technology, audit discipline, analytical software, and bio-statistical evaluation techniques to assess and improve the quality and performance of existing health care providers, health plans, and health systems. DHS performs analysis services for self-insured corporations, universities, government entities and group health management. The process allows DHS to pinpoint specific measures that enable clients to reduce costs, reduce healthcare risk exposure and improve quality of care. Group health analysis allows the monitoring of cost, quality and utilization of health care before problems become major issues. Clients and consultants are then able to apply the findings and recommendations to realize benefits such as: targeting of specific health and wellness issues, monitoring of utilization and cost over time, assessment of effectiveness and efficiency of health and wellness initiatives, design of health plan benefits to meet specific needs, and application of risk models and cost projections for budgeting and financial strategies.

Evidently, the group health insurance benefits area is spearheading interest in DM. We may postulate several reasons for this. First, this area represents one of considerable importance in terms of financial impacts. Next, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) (http://www.hipaadvisory.com/) created a requirement for administrative simplification, compliance, and reporting within healthcare entities. Third, the National Committee for Quality Assurance (NCQA) (http://www.ncqa.org/index.asp) has established a quality assurance system called the Health Plan Employer Data and Information Set (HEDIS). Information systems for reporting related to HIPAA and HEDIS serve as a ready data source for DM.

Software vendors have been developing new DM tools at an accelerating pace. At least one vendor, Lawson (http://www.bitpipe.com/), has developed a product specialized to provide HR reporting, monitoring, and decision support. The Data & Analysis Center for Software (DACS, http://www.dacs.dtic.mil/) website has collected links to a very large number of DM tools and vendors. Many of these purport to be designed for the nontechnical business user.

FUTURE TRENDS

The increasing interest in DM by HR-related associations will likely continue and should provide a strong impetus and guidance for individual member firms. These associations make it possible for member firms to pool their databases. Such pooled information gives members a
statistical leverage, in that their own data may be insufficient for particular studies. However, when their particular data are viewed in the context of the larger database, Bayesian methods might be brought to bear to make better inferences that may not be reliable with smaller data sets (Bickell & Doksum, 1977).

In addition to more firm-specific experience studies, evaluation studies and comparisons are needed for the various DM tools becoming available. Potential adopters need assessments of both effectiveness and ease of use.

CONCLUSION

The use of HR data beyond administrative purposes can provide the basis for a competitive advantage by allowing organizations to strategically analyze one of their most important assets: their employees. Organizations must be able to transform the data they have collected into useful information. Data mining provides an attractive opportunity that has not yet been adequately exploited. The application of DM techniques to HR requires organizational expertise and work to prepare the system for mining. In particular, a datamart for HR is a useful first step. With proper preparation and consideration, HR databases together with data mining create an opportunity for organizations to develop their competitive advantage through using that information for strategic decision-making. A number of current influences promise to increase the interest of firms in the application of DM to the Human Resources area. Trade association and software vendor activities should also facilitate an increased acceptance of and willingness to adopt DM.

REFERENCES

Bickell, P.J., & Doksum, K.A. (1977). Mathematical statistics: Basic ideas and selected topics. San Francisco: Holden Day, Inc.

Brodley, C.E., Lane, T., & Stough, T.M. (1999). Knowledge discovery and data mining. American Scientist, 87, 54-61.

Bussler, L., & Davis, E. (2002). Information systems: The quiet revolution in human resource management. Journal of Computer Information Systems, 17-20.

Carlson, L. (2004). Employers offering identity theft protection. Employee Benefit News, 18(3), 50-51.

Eddy, E.R., Stone, D.L., & Stone-Romero, E.F. (1999). The effects of information management policies on reactions to human resource information systems: An integration of privacy and procedural justice perspectives. Personnel Psychology, 52(2), 335-358.

Fayyad, U.M., Piatetsky-Shapiro, G., & Smyth, P. (1996a). From data mining to knowledge discovery in databases. AI Magazine, (7), 37-54.

Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996b). Advances in knowledge discovery and data mining. California: American Association for Artificial Intelligence.

Feelders, A., Daniels, H., & Holsheimer, M. (2000). Methodological and practical aspects of data mining. Information & Management, (37), 271-281.

Fisher, C.D., Schoenfeldt, L.F., & Shaw, J.B. (1999). Human resource management. Boston: Houghton Mifflin Company.

Hendrickson, A. (2003). Human resource information systems: Backbone technology of contemporary human resources. Journal of Labor Research, 24(3), 381-394.

Hubbard, J.C., Forcht, K.A., & Thomas, D.S. (1998). Human resource information systems: An overview of current ethical and legal issues. Journal of Business Ethics, (17), 1319-1323.

Kerlinger, F.N., & Lee, H.B. (2000). Foundations of behavioral research (3rd ed.). Orlando: Harcourt, Inc.

Ledvinka, J., & Scarpello, V.G. (1992). Federal regulation of personnel and human resource management. Belmont: Wadsworth Publishing Company.

Long, L.K., & Troutt, M.D. (2003). Data mining human resource information systems. In J. Wang (Ed.), Data mining: Opportunities and challenges (pp. 366-381). Hershey: Idea Group Publishing.

Osborn, J.L. (1978). Personal information: Privacy at the workplace. New York: AMACOM.

Patterson, B., & Lindsey, S. (2003). Mining the gold: Gain competitive advantage through HR data analysis. HR Magazine, 48(9), 131-136.

SAS Institute Inc. (2001). John Deere harvests HR records with SAS. Retrieved from http://www.sas.com/news/success/johndeere.html

Townsend, A., & Hendrickson, A. (1996). Recasting HRIS as an information resource. HR Magazine, 41(2), 91-96.

Troutt, M.D., Hu, M., Shanker, M., & Acar, W. (2003). Frontier versus ordinary regression models for data mining. In P.C. Pendharker (Ed.), Managing data mining
266
TEAM LinG
Data Mining in Human Resources
technologies in organizations: Techniques and applications (pp. 21-31). Hershey: Idea Group Publishing.

KEY TERMS

Benchmarking: To identify the "Best in Class" of business processes, which might then be implemented or adapted for use by other businesses.

Enterprise Resource Planning System: An integrated software system processing data from a variety of functional areas such as finance, operations, sales, human resources and supply-chain management.

Equal Employment Opportunity Classification: A job classification system set forth by the Equal Employment Opportunity Commission (EEOC) for demographic reporting requirements.

HEDIS: Health Plan Employer Data and Information Set, a quality assurance system established by the National Committee for Quality Assurance (NCQA).

HIPAA: The Health Insurance Portability and Accountability Act of 1996.

Human Resource Information System: An integrated system used to gather and store information regarding an organization's employees.

Subject Matter Expert: A person who is knowledgeable about the skills and abilities required for a specific domain such as Human Resources.
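The binary-cut procedure described under Decision Tree Analysis earlier in this article can be sketched in a few lines. The employee records below are hypothetical, invented purely for illustration; this is a simplified sketch of the split-at-the-mean idea, not the authors' implementation:

```python
# Binary cut: split employees at the mean of one variable (tenure),
# then compare a summary statistic of the two groups, as described in
# the Decision Tree Analysis discussion. All data is hypothetical.
employees = [
    {"tenure": 2, "salary": 40000},
    {"tenure": 3, "salary": 42000},
    {"tenure": 8, "salary": 60000},
    {"tenure": 12, "salary": 75000},
]

mean_tenure = sum(e["tenure"] for e in employees) / len(employees)
below = [e for e in employees if e["tenure"] < mean_tenure]
above = [e for e in employees if e["tenure"] >= mean_tenure]

def mean_salary(group):
    """Summary statistic compared across the two halves of the cut."""
    return sum(e["salary"] for e in group) / len(group)

print(mean_tenure)         # 6.25
print(mean_salary(below))  # 41000.0
print(mean_salary(above))  # 67500.0
```

Each half could then be re-split on a different variable (say, gender or department), yielding the divide-and-conquer tree the article describes.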
Data Mining in the Federal Government
legislation that would require agencies to report to Congress on data mining activities to support homeland security purposes (Miller, 2004).

Steer Clear of the "Guns Drawn" Mentality if Data Mining Unearths a Discovery

The DoD Defense Finance & Accounting Service's "Operation Mongoose" was a program aimed at discovering billing errors and fraud through data mining. About 2.5 million financial transactions were searched to locate inaccurate charges. This approach detected data patterns that might indicate improper use. Examples include purchases made on weekends and holidays, entertainment expenses, highly frequent purchases, multiple purchases from a single vendor, and other transactions that do not match the agency's past purchasing patterns. It turned up a cluster of 345 cardholders (out of 400,000) who had made suspicious purchases.

However, the process needs some fine-tuning. As an example, buying golf equipment appeared suspicious until it was learned that a manager of a military recreation center had the authority to buy the equipment. Also, a casino-related expense turned out to be a commonplace hotel bill. Nevertheless, the data mining results have shown sufficient potential that data mining will become a standard part of the Department's efforts to curb fraud.

Create a Business Case Based on Case Histories to Justify Costs

The FAA Aircraft Accident Data Mining Project involved the Federal Aviation Administration hiring the MITRE Corporation to identify approaches it can use to mine volumes of aircraft accident data to detect clues about their causes and how those clues could help avert future crashes (Bloedorn, 2000). One significant data mining finding was that planes with instrument displays that can be viewed without requiring a pilot to look away from the windshield suffered less damage in runway accidents than planes without this feature.

On the other hand, the government is careful about committing significant funds to data mining projects. "One of the problems is how do you prove that you kept the plane from falling out of the sky," said Trish Carbone, a technology manager at MITRE. It is difficult to justify data mining costs and relate them to benefits (Matthews, 2000).

One way to justify a data mining program is to look at past successes in data mining. Historically, fraud detection has been the highest payoff in data mining, but other areas, such as sales and marketing in the private sector, have also benefited from the approach. Statistics (dollars recovered) from efforts such as these can be used to support future data mining projects.

Use Data Mining for Supporting Budgetary Requests

The Veterans Administration Demographics System predicts demographic changes based on patterns among its 3.6 million patients as well as data gathered from insurance companies. Data mining enables the VA to provide Congress with much more accurate budget requests. The VA spends approximately $19 billion a year to provide medical care to veterans. All government agencies, such as the VA, are under increasing scrutiny to prove that they are operating effectively and efficiently. This is particularly true as a driving force behind the President's Management Agenda (Executive Office of the President, 2002). For many, data mining is becoming the tool of choice to highlight good performance or dig out waste.

The United States Coast Guard developed an executive information system designed for managers to see what resources are available to them and better understand the organization's needs. It is also used to identify relationships between Coast Guard initiatives and seizures of contraband cocaine, and to establish tradeoffs between the costs of alternative strategies. The Coast Guard has numerous databases whose content overlaps, and often only one employee understands each database well enough to extract information. In addition, field offices are organized geographically, but budgets are drawn up by programs that operate nationwide. Therefore, there is a disconnect between the organizational structure and the appropriations structure. The Coast Guard successfully used its data mining program to overcome these issues (Ferris, 2000).

The DOT Executive Reporting Framework (ERF) aims to provide complete, reliable and timely information in an environment that allows for the cross-cutting identification, analysis, discussion and resolution of issues. The ERF also manages grants and operations data. Grants data shows the taxes and fees that DOT distributes to the states' highway and bridge construction, airport development and transit systems. Operations data covers payroll, administrative expenses, travel, training and other operations costs. The ERF system accesses data from various financial and programmatic systems in use by operating administrators.

Before 1993 there was no financial analysis system to compare the department's budget with congressional appropriations. There was also no system to track performance against the budget or how the agency was doing with the budget. The ERF system changed this by tracking the budget and providing the ability to correct any areas that have been over-planned. Using ERF, adjustments were made within the quarter so the agency did not go over budget. ERF is being extended to manage budget projections, development and formulation. It can be used as a proactive tool, which allows an agency to project ahead more dynamically. The system has improved financial accountability in the agency (Ferris, 1999).

Give Users Continual Computer-Based Training

The DoD Medical Logistics Support System (DMLSS) built a data warehouse with front-end decision support and data mining tools to help manage the growing costs of health care, enhance health care delivery in peacetime, and promote wartime readiness and sustainability. DMLSS is responsible for the supply of medical equipment and medicine worldwide for DoD medical care facilities. The system received recognition for reducing the inventory in its medical depot system by 80 percent and reducing supply request response time from 71 to 15 days (Government Computer News, 2001). One major challenge faced by the agency was the difficulty of keeping up with the training of users because of the constant turnover of military personnel. It was determined that there is a need to provide quality computer-based training on a continuous basis (Olsen, 1997).

Provide the Right Blend of Technology, Human Capital Expertise and Data Security Measures

The General Accounting Office used data mining to identify numerous instances of illegal purchases of goods and

FUTURE TRENDS

Despite the privacy concerns, data mining continues to offer much potential for identifying waste and abuse, potential terrorist and criminal activity, and clues to improve efficiency and effectiveness within organizations. The approach will become more pervasive because of its integration with online analytical tools, the improved ease of use of data mining tools, and the appearance of novel visualization techniques for reporting results. Another driver is the emergence of a new branch of data mining, called text mining, which helps improve the efficiency of searching on the Web. This approach transforms textual data into a usable format that facilitates classifying documents, finding explicit relationships or associations among documents, and clustering documents into categories (SAS, 2004).

CONCLUSION

These lessons learned may or may not fit all environments, due to cultural, social and financial considerations. However, the careful review and selection of relevant lessons learned could result in addressing the required goals of the organization by improving the level of corporate knowledge.

A decision maker needs to think "outside the box" and move away from traditional approaches to successfully implement and manage their programs. Data mining poses a challenging but highly effective approach to improve business intelligence within one's domain.
services from restaurants, grocery stores, casinos, toy
REFERENCES
stores, clothing retailers, electronics stores, gentlemens
clubs, brothels, auto dealers and gasoline service sta- Bloedorn, E. (2000). Data mining for aviation safety.
tions. This was all part of their effort to audit and inves- MITRE Publications.
tigate federal government purchase and travel card and Executive Office of the President, Office of Manage-
related programs (General Accounting Office, 2003). ment and Budget. (2002). Presidents Management
Data mining goes beyond using the most effective Agenda. Fiscal Year 2002.
technology and tools. There must be well-trained indi-
viduals involved who know about the process, proce- Ferris, N. (1999). 9 Hot Trends for 99. Government
dures and culture of the system being investigated. They Executive.
need to understand the capabilities and limitation of data Ferris, N. (2000). Information is power. Government
mining concepts and tools. In addition, these individual Executive.
must recognize the data security issues associated with
the use of large, complex and detailed databases. General Accounting Office. (2003). Data mining:
Results and challenges for government program au-
dits and investigations. GAO-03-591T.
General Accounting Office. (2004). Data mining: Federal efforts cover a wide range of uses. GAO-04-584.

Gillmor, D. (2004). Data mining by government rampant. eJournal.

Government Computer News. (2001). Ten agencies honored for innovative projects. Government Computer News.

Hamblen, M. (1998). Pentagon to deploy huge medical data warehouse. Computer World.

Matthews, W. (2000). Digging digital gold. Federal Computer Week.

Miller, J. (2004). Lawmakers renew push for data-mining law. Government Computer News.

Olsen, F. (1997). Health record project hits pay dirt. Government Computer News.

SAS. (2004). SAS Text Miner.

Schwartz, A. (2000). Making the Web safe. Federal Computer Week.

Sullivan, A. (2004). U.S. government still data mining. Reuters.

KEY TERMS

Federal Government: The national government of the United States, established by the Constitution, which consists of the executive, legislative and judicial branches. The head of the executive branch is the President of the United States. The legislative branch consists of the United States Congress, and the Supreme Court of the United States is the head of the judicial branch.

Legacy System: Typically, a database management system in which an organization has invested considerable time and money and which resides on a mainframe or minicomputer.

Logistics Support System: A computer package that assists in planning and deploying the movement and maintenance of forces in the military. The package deals with the design and development, acquisition, storage, movement, distribution, maintenance, evacuation and disposition of material; the movement, evacuation and hospitalization of personnel; the acquisition or construction, maintenance, operation and disposition of facilities; and the acquisition or furnishing of services.

Purchase Cards: Credit cards used in the federal government by authorized government officials for small purchases, usually under $2,500.

Travel Cards: Credit cards issued to federal employees to pay for costs incurred on official business travel.
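The GAO purchase- and travel-card findings described in this article are, at heart, rule-based screens over transaction records. The sketch below is a minimal illustration of that idea; the field names and merchant categories are hypothetical, and the $2,500 threshold merely echoes the micro-purchase limit noted under Purchase Cards, not the GAO's actual audit criteria.

```python
# Illustrative rule-based screen for purchase-card transactions.
# The merchant categories and field names are hypothetical assumptions.
PROHIBITED_CATEGORIES = {"casino", "restaurant", "gentlemen's club"}
MICRO_PURCHASE_LIMIT = 2500.00  # echoes the limit cited under Key Terms

def flag_transactions(transactions):
    """Return (transaction, reasons) pairs that merit audit review."""
    flagged = []
    for t in transactions:
        reasons = []
        if t["category"] in PROHIBITED_CATEGORIES:
            reasons.append("prohibited merchant category")
        if t["amount"] > MICRO_PURCHASE_LIMIT:
            reasons.append("exceeds micro-purchase limit")
        if reasons:
            flagged.append((t, reasons))
    return flagged
```

A real audit program would combine such fixed rules with statistical outlier detection across cardholders and merchants.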
Shamik Sural
Indian Institute of Technology, Kharagpur, India
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Data Mining in the Soft Computing Paradigm
in recognizing patterns by developing mathematical structures with the ability to learn. An artificial neural network (ANN) learns through training. These are simple computer-based programs whose primary function is to construct models of a problem space based on trial and error. The process of training a neural net to associate certain input patterns with correct output responses involves the use of repetitive examples and feedback, much like the training of a human being.

Rough set theory finds application in studying imprecision, vagueness and uncertainty in data analysis, and is based on the establishment of equivalence classes within given training data. A rough set approximates a vague concept by two precise concepts, called the lower and upper approximations. These two approximations form a classification of the domain of interest into disjoint categories. The lower approximation is a description of the domain objects known with certainty to belong to the subset of interest, and the upper approximation is a description of the objects that may possibly belong to the subset.

Genetic algorithms (GAs) are computational models used as efficient, global search methods for optimality in problem solving. These search algorithms are based on the mechanics of natural genetics combined with Darwin's theory of survival of the fittest, and are particularly suitable for solving complex optimization problems as well as applications that require adaptive problem-solving strategies. In data mining, GAs find application in hypothesis testing and refinement.

With this background, we next present how soft computing techniques can be applied to specific data mining problems.

MAIN THRUST

Association Rule Mining: Following are some of the applications of soft computing tools in association rule mining.

Fuzzy Logic: A generalized association rule may involve binary, quantitative or categorical data and hierarchical relations. In quantitative or categorical association rule mining, irrespective of the methodology used, sharp boundaries remain a problem, under-estimating or over-emphasizing the elements near the boundaries. This may, therefore, lead to an inaccurate representation of the semantics. To deal with the problem, fuzzy sets and fuzzy items, usually in the form of labels or linguistic terms, are used and defined on the domains (Chien et al., 2001). In the fuzzy framework, conventional notions of support and confidence can be extended as well. The partial belongingness of an item in a subset is taken into account while computing the degree of support and the degree of confidence. The measures are similar in spirit to the count operator used for fuzzy cardinality. Subsequently, with these extended measures incorporated, several mining algorithms have been developed (Gyenesei, 2000; Gyenesei & Teuhola, 2001; Shu et al., 2001). Instead of dividing quantitative attributes into fixed intervals, linguistic terms can be used to represent regularities and exceptions in the way humans perceive reality. Chen et al. (2002) have developed an algorithm for fuzzy association rules that deals with partitioning quantitative data domains. Wei & Chen (1999) extended generalized association rules with fuzzy taxonomies, by which partial belongings could be incorporated. Furthermore, a recent effort incorporates linguistic hedges into existing fuzzy taxonomies (Chen et al., 1999; Chen et al., 2002a). Several fuzzy extensions have also been made to interestingness measures. A measure called Interestingness Degree has been proposed, which can be seen as the increase in the probability of an event Y caused by the occurrence of another event X. Attempts have been made to introduce thresholds for filtering databases in dealing with very low membership degrees (Hullermeier, 2001).

Genetic Algorithm: Min et al. (2001) have used a GA-based data mining approach in e-commerce to find association rules of IF-THEN form for adopters and non-adopters of e-purchasing. Association rules in IF-THEN form can also be mined in a way that provides a high degree of accuracy and coverage (Lopes et al., 1999).

Clustering: Following are some of the applications of soft computing tools in clustering.

Fuzzy Logic: A fuzzy clustering algorithm attempts to group prospects into categories based on their identifying characteristics. For example, for prospective customers of any business, the key attributes can include geographic data, psychographic data and others. Clusters expressed in linguistic terms can be easily handled using fuzzy sets. Using fuzzy sets, we can also find dependencies between data expressed in a qualitative format. Use of fuzzy logic can help in avoiding searching for less important, trivial or meaningless patterns in databases. Fuzzy clustering algorithms have been developed for mining telecommunications customer and prospect databases to gain customer information for deciding a marketing strategy (Russell et al., 1999).

Neural Network: The Self Organizing Map (SOM) is one of the most widely used unsupervised neural network models that employ competitive learning steps.
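The competitive learning step just mentioned can be sketched in a few lines. This is a didactic one-dimensional sketch under simplifying assumptions (scalar data, a fixed Gaussian neighborhood, linearly decaying learning rate), not the algorithm of any system cited here: for each input, the closest prototype wins and is pulled toward the input along with its map neighbors.

```python
import math

def train_som(data, n_units=4, epochs=50, lr=0.5, radius=1.0):
    """Minimal 1-D self-organizing map for scalar data: prototypes sit on
    a line; for each input the closest prototype (the winner) and its map
    neighbors are pulled toward the input (competitive learning)."""
    lo, hi = min(data), max(data)
    # Initialize prototypes evenly across the data range.
    protos = [lo + i * (hi - lo) / (n_units - 1) for i in range(n_units)]
    for epoch in range(epochs):
        alpha = lr * (1.0 - epoch / epochs)  # decaying learning rate
        for x in data:
            winner = min(range(n_units), key=lambda i: abs(protos[i] - x))
            for i in range(n_units):
                # Gaussian neighborhood: nearby map units move more.
                h = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                protos[i] += alpha * h * (x - protos[i])
    return protos
```

In the two-phase scheme of Vesanto & Alhoniemi (2000) mentioned in this article, prototypes trained this way are themselves clustered afterwards, for example by k-means or agglomerative clustering.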
An important data mining task is organizing data points with single or multiple dimensions into their natural clusters. Kohonen et al. (2000) have demonstrated the applicability of the self-organizing map, where large data sets can be partitioned in stages. A two-phase method can be used where, in the first phase, the step-wise strategy of the SOM is used, and then, in the second phase, the resulting prototypes of the SOM are clustered by an agglomerative clustering method or by k-means clustering (Vesanto et al., 2000). A dimension-independent algorithm has been developed that allows hierarchical clustering of SOMs based on a spread factor. This can be used as a controlling measure for generating maps with different dimensionality (Alahakoon et al., 2000).

Classification and Rule Extraction: Following are some of the applications of soft computing tools in classification and rule extraction.

Neural Network: For the purpose of classification or rule extraction, an ANN is used in the supervised learning paradigm. The most common supervised learning paradigm is error back-propagation. Here, a neural network receives an input example, generates a guess, and compares that guess with the expected result. The error between the guess and the desired result is fed back to improve the guess in an iterative manner. In this sense, the ANN is being supervised by the feedback, which shows the network where it made mistakes and how the correct result should look. The most common form of the back-propagation algorithm uses a sum-of-squared-errors approach to generate an aggregate measure of the error. A general framework for classification rule mining, called NEUCRUM (NEUral Classification RUle Mining), has been developed. It has two components: one is a specific neural classifier named FANNC, and the other is a novel rule extraction approach named STARE (Zhou et al., 2000).

Rough Set: Piasta (1999) has presented an approach called the ProbRough system to analyze business databases based on rule induction. The ProbRough system can induce decision rules from databases with a very high number of objects and attributes. Based on rough set theory, another approach to the selection of attributes for the construction of decision trees has been developed (Wei, 2003). According to Han & Kamber (2003), rough set theory can be applied for classification to discover structural relationships in imprecise or noisy data. It can be applied to discrete-valued attributes; hence, continuous-valued attributes must be discretized prior to its use. A classifier can be trained using a rough set learning algorithm for rule extraction, in IF-THEN form, from a decision table.

Neuro-Fuzzy Computing: Neuro-fuzzy computing combines the strong features of the neural network and fuzzy approaches. To deal with numerical and linguistic data and granular knowledge, a granular neural network can be designed (Zhang et al., 2000). High-level granular knowledge in the form of rules is generated by compressing the low-level granular data. An algorithm has been developed to mine mixed fuzzy rules involving both numeric and categorical attributes. Fuzzy set-driven computational techniques for data mining have been discussed by Pedrycz (1998), establishing the relationship between data mining and fuzzy modeling.

Other Hybrid Approaches: A hybrid prediction system based on a neural network, with its learning based on memory-based reasoning, can be designed for classification. It can also learn the dynamic behavior of a system over a period of time. It has been established by experimentation that a hybrid system has high potential in solving data mining problems (Shin et al., 2000). The concepts of fuzzy logic and rough sets can be applied in a Multi-layer Perceptron (MLP) neural network to extract rules from crude domain knowledge. The appropriate number of hidden nodes is automatically determined, and the dependency factors are used in the initial weight encoding.

Other Data Mining Applications: The soft computing paradigm has also been extended to some other types of data mining, as discussed below.

Genetic algorithms have been used in regression analysis. One basic assumption in traditional regression models is that there is no interaction amongst the attributes. To learn non-linear multi-regression from a set of training data, an adaptive GA can be used; a genetic algorithm can handle attribute interactions efficiently. GAs have been used to discover interesting rules in a dependency-modeling task (Noda et al., 1999). The system developed by Shin et al. (2000) for classification, as discussed above, can be applied for regression analysis as well.

Fuzzy functional dependencies are extensions of classical functional dependencies, aimed at dealing with fuzziness in databases and reflecting the semantics that close values of a collection of certain items depend on close values of a collection of different items. Generally, fuzzy functional dependencies take different forms depending on the different aspects of integrating fuzzy logic into classical functional dependencies. Fuzzy inference generalizes both imprecise and precise inference. An attempt has been made by Yang & Singhal (2001) to develop a framework linking fuzzy
functional dependencies and fuzzy association rules in a closer manner.

Discovering relationships among time series is an interesting application, since time series patterns reflect the evolution of changes in item values with sequential factors like time. The value of each time series item is viewed as a pattern over time, and the similarity between any two patterns is measured by pattern matching. Chen et al. (2001) have presented a method based on Dynamic Time Warping (DTW) to discover pattern associations.

Summarization is one of the major components of data mining. Lee & Kim (1997) have proposed an interactive top-down summary discovery process that utilizes fuzzy ISA hierarchies as domain knowledge. They have defined a generalized tuple as a representational form of a database summary including fuzzy concepts. By virtue of fuzzy ISA hierarchies, where fuzzy ISA relationships common in actual domains are naturally expressed, the discovery process comes up with more accurate database summaries. They have also presented an informativeness measure, based on Shannon's information theory, for distinguishing generalized tuples that deliver more information to users.

A GA-based approach can be used for discovering temporal trends by synthesizing Bayesian networks (Novobilski & Kamangar, 2002). Bonaventura et al. (2003) have developed a hybrid model for the prediction of the linguistic origin of surnames. It is a neural network module combining the results provided both by a lexical rule module and by a statistical module, used to compute the evidence for the classes.

FUTURE DIRECTIONS

In this paper, we have discussed various soft computing methods used in data mining. Our focus has been on four primary soft computing techniques, namely fuzzy logic, neural networks, rough sets and genetic algorithms. Although these techniques have not yet attained maturity to the extent that conventional data mining techniques have, it is expected that they will mature well enough to be dealt with as independent areas of data mining very soon.

CONCLUSION

Hybridization of fuzzy, neural and genetic algorithms in solving data mining problems seems to be an upcoming area in the field of data mining. However, more stress needs to be given to improving the efficiency of these soft computing techniques when applied to data mining problems. Processing terabytes of data with a reasonable response time is a primary requirement of any kind of data mining algorithm. Since soft computing techniques typically require more processing than traditional techniques, it remains to be seen how well they can be adapted successfully to the interesting and challenging field of data mining.

REFERENCES

Alahakoon, D., Halgamuge, S.K., & Srinivasan, B. (2000). Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions on Neural Networks, 11(3), 601-614.

Bonaventura, P., Marco, G., Marco, M., Franco, S., & Sheng, J. (2003). A hybrid model for the prediction of the linguistic origin of surnames. IEEE Transactions on Knowledge and Data Engineering, 15(3), 760-763.

Chen, G.Q., Wei, Q., & Kerre, E.E. (1999). Fuzzy data mining: Discovery of fuzzy generalized association rules. In Recent Research Issues on Management of Fuzziness in Databases. Berlin: Springer-Verlag.

Chen, G.Q., Yan, P., & Kerre, E.E. (2002a). Mining fuzzy implication-based association rules in quantitative databases. In International FLINS Conference on Computational Intelligent Systems for Applied Research, Belgium.

Chen, G.Q., Wei, Q., Liu, D., & Wets, G. (2002). Simple association rules (SAR) and the SAR-based rule discovery. Journal of Computer & Industrial Engineering, 43, 721-733.

Chen, G.Q., Wei, Q., & Zhang, H. (2001). Discovering similar time-series patterns with fuzzy clustering and DTW methods. In International Fuzzy Systems Association Conference, Vancouver, BC, Canada.

Chien, B.C., Lin, Z.L., & Hong, T.P. (2001). An efficient clustering algorithm for mining fuzzy quantitative association rules. In Ninth International Fuzzy Systems Association World Congress (pp. 1306-1311), Vancouver, Canada.

Gyenesei, A. (2000). A fuzzy approach for mining quantitative association rules. TUCS Technical Report 336. Department of Computer Science, University of Turku, Finland.

Gyenesei, A., & Teuhola, J. (2001). Interestingness measures for fuzzy association rules. In Principles and Practice of Knowledge Discovery in Databases, Freiburg, Germany.
Han, J., & Kamber, M. (2003). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.

Hullermeier, E. (2001). Fuzzy association rules: Semantic issues and quality measures. In Lecture Notes in Computer Science 2206 (pp. 380-391). Berlin & Heidelberg: Springer.

Kohonen, T., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3), 574-585.

Lee, D.H., & Kim, M.H. (1997). Database summarization using fuzzy ISA hierarchies. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 27(1).

Lopes, C., Pacheco, M., Vellasco, M., & Passos, E. (1999). Rule-Evolver: An evolutionary approach for data mining. In Seventh International Workshop on Rough Sets, Fuzzy Sets, Data Mining and Granular-Soft Computing (pp. 458-462), Yamaguchi, Japan.

Min, H., Smolinski, T., & Boratyn, G.A. (2001). A genetic algorithm-based data mining approach to profiling the adopters and non-adopters of e-purchasing. In Third International Conference on Information Reuse and Integration, Las Vegas, USA.

Mitra, S., Pal, S.K., & Mitra, P. (2002). Data mining in soft computing framework: A survey. IEEE Transactions on Neural Networks, 13, 3-14.

Noda, E., Freitas, A.A., & Lopes, H.S. (1999). Discovering interesting prediction rules with a genetic algorithm. In IEEE Congress on Evolutionary Computation (pp. 1322-1329).

Novobilski, A., & Kamangar, F. (2002). A genetic algorithm based approach for discovering temporal trends using Bayesian networks. In Sixth World Conference on Systemics, Cybernetics, and Informatics.

Pedrycz, W. (1998). Fuzzy set technology in knowledge discovery. Fuzzy Sets and Systems, 98, 279-290.

Piasta, Z. (1999). Analyzing business databases with the ProbRough rule induction system. In Workshop on Data Mining in Economics, Marketing and Finance (pp. 22-29), Chania, Greece.

Russell, S., & Lodwick, W. (1999). Fuzzy clustering in data mining for telco database marketing campaigns. In North American Fuzzy Information Processing Society Symposium (pp. 720-726), New York.

Shin, C.K., Yun, U.T., Kim, H.K., & Park, S.C. (2000). A hybrid approach of neural network and memory-based learning to data mining. IEEE Transactions on Neural Networks, 11(3), 637-646.

Shu, J.Y., Tsang, E.C.C., & Yeung, D.S. (2001). Query fuzzy association rules in relational databases. In Ninth International Fuzzy Systems Association World Congress, Vancouver, Canada.

Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586-600.

Wei, J.M. (2003). Rough set based approach to selection of node. International Journal of Computational Cognition, 1(2).

Wei, Q., & Chen, G.Q. (1999). Mining generalized association rules with fuzzy taxonomic structures. In Eighteenth International Conference of the North American Fuzzy Information Processing Society (pp. 477-481), New York, NY, USA.

Yang, Y., & Singhal, M. (2001). Fuzzy functional dependencies and fuzzy association rules. In First International Conference on Data Warehousing and Knowledge Discovery (pp. 229-240), Florence, Italy.

Zhang, Y.Q., Fraser, M.D., Gagliano, R.A., & Kandel, A. (2000). Granular neural networks for numerical-linguistic data fusion and knowledge discovery. IEEE Transactions on Neural Networks, 11, 658-667.

Zhou, Z.H., Yuan, J., & Chen, S.F. (2000). A general neural framework for classification rule mining. International Journal of Computers, Systems, and Signals, 1(2), 154-168.

KEY TERMS

Data Mining: A set of tools, techniques and methods used to find new, hidden or unexpected patterns in a large collection of data, typically stored in a data warehouse.

Fuzzy Set: A set that captures the different degrees of belongingness of different objects in the universe, instead of a sharp demarcation between objects that belong to a set and those that do not.

Genetic Algorithm: A search and optimization technique that uses the concept of survival of genetic materials over various generations of populations, much like the theory of natural evolution.

Hybrid Technique: A combination of two or more soft computing techniques used for data mining. Examples are neuro-fuzzy, neuro-genetic, etc.

Neural Network: A connectionist model that can be trained in supervised or unsupervised mode for learning patterns in data.
Rough Set: A method of modeling impreciseness and vagueness in data through two sets representing the upper bound and lower bound of the data set.

Soft Computing: Collection of methods and techniques, like fuzzy sets, neural networks, rough sets and genetic algorithms, for solving complex real-world problems.
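The Rough Set entry can be made concrete with a short sketch of the lower and upper approximations defined earlier in this article. It assumes the partition of the universe into equivalence classes (objects indistinguishable on the chosen attributes) is already given; it is a minimal illustration of the standard definitions, not the algorithm of any system cited here.

```python
def approximations(classes, target):
    """classes: iterable of frozensets partitioning the universe.
    target: set of objects forming the vague concept to approximate.
    Returns the (lower, upper) approximations as sets."""
    lower, upper = set(), set()
    for eq in classes:
        if eq <= target:   # wholly contained: certainly in the concept
            lower |= eq
        if eq & target:    # overlaps: possibly in the concept
            upper |= eq
    return lower, upper
```

The difference between the upper and lower approximations is the boundary region: the objects whose membership in the concept cannot be decided from the available attributes.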
Xiaohua Hu
Drexel University, USA
Given the exponential growth rate of medical data and the accompanying biomedical literature, more than 10,000 documents per week (Leroy et al., 2003), it has become increasingly necessary to apply data mining techniques to medical digital libraries in order to assess a more complete view of genes, their biological functions and diseases. Data mining techniques, as applied to digital libraries, are also known as text mining.

BACKGROUND

Text mining is the process of analyzing unstructured text in order to discover information and knowledge that are typically difficult to retrieve. In general, text mining involves three broad areas: Information Retrieval (IR), Natural Language Processing (NLP) and Information Extraction (IE). Each of these areas is defined as follows:

Natural Language Processing: a discipline that deals with various aspects of automatically processing written and spoken language.

Information Retrieval: a discipline that deals with finding documents that meet a set of specific requirements.

Information Extraction: a sub-field of NLP that addresses finding specific entities and facts in unstructured text.

MAIN THRUST

The current state of text mining in digital libraries is provided in order to facilitate continued research, which subsequently can be used to develop large-scale text mining systems. Specifically, an overview of the process, recent research efforts and practical uses of mining digital libraries, future trends and conclusions are presented.

Text mining can be viewed as a modular process that involves two modules, an information retrieval module and an information extraction module, with several distinct phases within the information retrieval module. The former module involves using NLP techniques to pre-process the written language and using techniques for document categorization in order to find relevant documents. The latter module involves finding specific and relevant facts within text. NLP consists of three distinct phases: (1) tokenization, (2) part-of-speech (PoS) tagging and (3) parsing. In the tokenization step, the text is decomposed into its subparts, which are subsequently tagged during the second phase with the part of speech that each token represents (e.g., noun, verb, adjective, etc.). It should be noted that generating the rules for PoS tagging is a very manual and labor-intensive task. Typically, the parsing phase utilizes shallow parsing in order to group syntactically related words together, because full parsing is both less efficient (i.e., very slow) and less accurate (Shatkay & Feldman, 2003). Once the documents have been pre-processed, they can be categorized.

There are two approaches to document categorization: Knowledge Engineering (KE) and Machine Learning (ML). Knowledge Engineering requires the user to manually define rules, which can consequently be used to categorize documents into specific pre-defined categories. Clearly, one of the drawbacks of KE is the time that it would take a person (or group of people) to manually construct and maintain the rules. ML, on the other hand, uses a set of training documents to learn the rules for classifying documents. Specific ML techniques that have successfully been used to categorize text documents include, but are not limited to, Decision Trees, Artificial Neural Networks, Nearest Neighbor and Support Vector Machines (SVM) (Stapley et al., 2002). Once the documents have been categorized, documents that satisfy specific search criteria can be retrieved.
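The machine learning route to document categorization described above can be sketched with a deliberately simple learner. The nearest-centroid rule and whitespace tokenization below are illustrative assumptions; production systems would use the techniques named in the text (decision trees, neural networks, SVMs) with proper term weighting.

```python
from collections import Counter
import math

def train_centroids(labeled_docs):
    """labeled_docs: list of (text, category) training pairs.
    Returns a per-category average word-count vector."""
    sums, counts = {}, Counter()
    for text, cat in labeled_docs:
        counts[cat] += 1
        sums.setdefault(cat, Counter()).update(Counter(text.lower().split()))
    return {cat: {w: n / counts[cat] for w, n in vec.items()}
            for cat, vec in sums.items()}

def classify(text, centroids):
    """Assign text to the category whose centroid is most similar
    (cosine similarity over the shared vocabulary)."""
    vec = Counter(text.lower().split())
    def cosine(c):
        dot = sum(vec[w] * c.get(w, 0.0) for w in vec)
        na = math.sqrt(sum(v * v for v in vec.values()))
        nb = math.sqrt(sum(v * v for v in c.values()))
        return dot / (na * nb) if na and nb else 0.0
    return max(centroids, key=lambda cat: cosine(centroids[cat]))
```

The training step plays the role of "learning the rules" from example documents; unseen words in a new document simply contribute nothing to the similarity.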
Data Mining Medical Digital Libraries
(Figure: the text mining process; the tokenization phase yields decomposed, untagged tokens, i.e., words.)
There are several techniques for retrieving documents that do not always hold true (Pearson, 2001). Furthermore,
that satisfy specific search criteria. The Boolean ap- this approach relies heavily on completeness of the list of
proach returns documents that contain the terms (or gene names and synonyms and summarizes the modular
phrases) contained in the search criteria; whereas, the process of text mining.
vector approach returns documents based upon the term
frequency-inverse document frequency (TF x IDF) for the term vectors that represent the documents. Variations of clustering and clustering ensemble algorithms (Iliopoulous et al., 2001; Hu, 2004), classification algorithms (Marcotte et al., 2001) and co-occurrence vectors (Stephens et al., 2001) have been successfully used to retrieve related documents. An important point is that the terms used to represent the search criteria, as well as the terms used to represent the documents, are critical to successfully and accurately returning related documents. However, terms often have multiple meanings (i.e., polysemy), and multiple terms can have the same meaning (i.e., synonyms). This represents one of the current issues in text mining, which will be discussed in the next section.

The last part of the text mining process is information extraction, of which the most popular technique is co-occurrence (Blaschke & Valencia, 2002; Jenssen et al., 2001). There are two disadvantages to this approach, each of which creates opportunities for further research. First, this approach depends upon assumptions regarding sentence structure, entity names, etc.

Research to Address Issues in Mining Digital Libraries

The issues in mining digital libraries, specifically medical digital libraries, include scalability, ambiguous English and biomedical terms, non-standard terms, and the structure of and inconsistencies between medical repositories (Shatkay & Feldman, 2003). Most of the current text mining research focuses on automating information extraction (Shatkay & Feldman, 2003). The scalability of text mining approaches is of concern because of the rapid rate of growth of the literature. As such, while most of the existing methods have been applied to relatively small sample sets, there has been an increase in the number of studies focused on scaling techniques that apply to large collections (Pustejovsky et al., 2002; Jenssen et al., 2002). One exception is the study by Jenssen et al. (2001), in which the authors used a predefined list of genes to retrieve all related abstracts from PubMed that contained the genes on the predefined list.
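The TF x IDF weighting used for the term vectors above is straightforward to make concrete. The following is a minimal sketch, not code from the cited studies; the toy documents and whitespace tokenization are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by term frequency times log(N / document
    frequency), the classic TF x IDF scheme from information retrieval."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()  # document frequency: number of docs containing each term
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

vecs = tfidf_vectors(["gene expression profiling",
                      "gene clustering of expression data"])
# terms appearing in every document (e.g., "gene") get weight 0;
# terms unique to one document get the highest weights
```

Under this weighting, terms shared by all documents carry no discriminating power, which is one reason ambiguous and synonymous terms are so problematic for retrieval.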
Data Mining Medical Digital Libraries
[Figure: the information retrieval (IR) step identifies and retrieves relevant documents; complicating factors include aliases, synonyms and homonyms.]
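To make the aliasing problem concrete: before documents can be compared, a term matcher must map aliases and synonyms onto canonical names. The sketch below uses a tiny hand-made table; real systems rely on curated ontologies such as the HUGO gene nomenclature mentioned below.

```python
# Toy alias table mapping gene aliases/synonyms to one canonical symbol.
# The entries are illustrative; production systems draw this mapping
# from curated ontologies rather than a hard-coded dictionary.
ALIASES = {
    "p53": "TP53",
    "trp53": "TP53",
    "tp53": "TP53",
    "her2": "ERBB2",
    "neu": "ERBB2",
}

def normalize_terms(terms):
    """Map each query or document term to its canonical form so that
    documents using different aliases can still be matched."""
    return [ALIASES.get(t.lower(), t) for t in terms]

print(normalize_terms(["p53", "HER2", "insulin"]))
# ['TP53', 'ERBB2', 'insulin']
```

Without this normalization step, a query for TP53 would miss every abstract that refers to the same gene as p53.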
Since mining digital libraries relies heavily on the ability to accurately identify terms, the issues of ambiguous terms, special jargon and the lack of naming conventions are not trivial. This is particularly true in the case of digital libraries, where the issue is further compounded by non-standard terms. In fact, a lot of effort has been dedicated to building ontologies to be used in conjunction with text mining techniques (Boeckmann et al., 2003; HUGO, 2003; Liu et al., 2001; NLM, 2003; Oliver et al., 2002; Pruitt & Maglott, 2001; Pustejovsky et al., 2002). Manually building and maintaining ontologies, however, is a time-consuming effort. In light of that, there have been several efforts to find ways of automatically extracting terms to incorporate into and build ontologies (Nenadic et al., 2002; Ono et al., 2001). The ontologies are subsequently used to match terms. For instance, Nenadic et al. (2002) developed the Tagged Information Management System (TIMS), an XML-based knowledge acquisition system that uses ontology for information extraction over large collections.

Uses of Text Mining in Medical Digital Libraries

There are many uses for mining medical digital libraries that range from generating hypotheses (Srinivasan, 2004) to discovering protein associations (Fu et al., 2003). For instance, Srinivasan (2004) developed MeSH-based text mining methods that generate hypotheses by identifying potentially interesting terms related to specific input. Further examples include, but are not limited to: uncovering uses for thalidomide (Weeber et al., 2003), discovering functional connections between genes (Chaussabel & Sher, 2002) and identifying viruses that could be used as biological weapons (Swanson et al., 2001). The table below summarizes some of the recent uses of text mining in medical digital libraries.

Use                                    Technique                                       Reference
Discovering uses for thalidomide       Mapping phrases to UMLS concepts                Weeber et al., 2001
Extracting and combining relations     Rule-based parser and co-occurrence             Leroy et al., 2003
Generating hypotheses                  Incorporating ontologies (e.g., mapping         Srinivasan, 2004; Weeber et al., 2003
                                       terms to MeSH)
Identifying biological virus weapons                                                   Swanson et al., 2001

FUTURE TRENDS

The large volume of genomic data, and the accompanying literature, resulting from the Human Genome Project is expected to continue to grow. As such, there will be a continued need for research to develop scalable and effective data mining techniques that can be used to analyze the growing wealth of biomedical data. Additionally, given the importance of gene names in the context of mining biomedical literature and the fact that there are a number of medical sources that use different naming conventions and structures, research
to further develop ontology will play an important part in mining medical digital libraries. Finally, it is worth mentioning that there has been some effort to link the unstructured text documents within medical digital libraries with their related structured data in data repositories.

CONCLUSION

Given the practical applications of mining digital libraries and the continued growth of available data, mining digital libraries will continue to be an important area that will help researchers and practitioners gain invaluable and undiscovered insights into genes, their relationships, biological functions, diseases and possible therapeutic treatments.

REFERENCES

Chaussabel, D., & Sher, A. (2002). Mining microarray expression data by literature profiling. Genome Biology, 3(10), research0055.1-0055.16.

De Bruijn, B., & Martin, J. (2002). Getting to the core of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67, 7-18.

Fu, Y., Mostafa, J., & Seki, K. (2003). Protein association discovery in biomedical literature. In Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 113-115).

Hu, X. (2004). Integration of cluster ensemble and text summarization for gene expression. In Proceedings of the 2004 IEEE Symposium on Bioinformatics and Bioengineering.

HUGO. (2003). HUGO (The Human Genome Organization) Gene Nomenclature Committee. Retrieved from http://www.gene.ucl.ac.uk/nomenclature

Iliopolous, I., Enright, A.J., & Ouzounis, C.A. (2001). Textquest: Document clustering of Medline abstracts for concept discovery in molecular biology. In Proceedings of the Pacific Symposium on Biocomputing (PSB) (pp. 384-395).

Jenssen, T.K., Laegrid, A., Komorowski, J., & Hovig, E. (2001). A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28(1), 21-28.

Leroy, G., Chen, H., Martinez, J.D., Eggers, S., Flasey, R.R., Kislin, K.L., Huang, Z., Li, J., Xu, J., McDonald, D.M., & Ng, G. (2003). Genescene: Biomedical text and data mining. In Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 116-118).

Liu, H., Lussier, Y.A., & Friedman, C. (2001). Disambiguating ambiguous biomedical terms in biomedical narrative text: An unsupervised method. Journal of Biomedical Informatics, 34(4), 249-261.

Pruitt, K.D., & Maglott, D.R. (2001). RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Research, 29(1), 137-140.

Pustejovsky, J., Castano, J., Zhang, J., Kotecki, M., & Cochran, B. (2002). Robust relational parsing over biomedical literature: Extracting inhibit relations. Proceedings of the Pacific Symposium on Biocomputing (PSB), 7, 362-373.

Shatkay, H., & Feldman, R. (2003). Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 10(6), 821-855.

Srinivasan, P. (2004). Text mining: Generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology, 55(5), 396-413.

Stapley, B.J., Kelley, L.A., & Sternberg, M.J. (2002). Predicting the sub-cellular location of proteins from text using support vector machines. Proceedings of the Pacific Symposium on Biocomputing (PSB), 7, 374-385.
Stephens, M., Palakal, M., Mukhopadhyay, S., Raje, R., & Mostafa, J. (2001). Detecting gene relations from Medline abstracts. In Proceedings of the Pacific Symposium on Biocomputing (PSB) (pp. 483-496).

Swanson, D.R., Smalheiser, N.R., & Bookstein, A. (2001). Information discovery from complementary literatures: Categorizing viruses as potential weapons. Journal of the American Society for Information Science, 52(10), 797-812.

Weeber, M., Klein, H., Berg, L., & Vos, R. (2001). Using concepts in literature-based discovery: Simulating Swanson's Raynaud-Fish Oil and Migraine-Magnesium discoveries. Journal of the American Society for Information Science, 52(7), 548-557.

Weeber, M., Vos, R., Klein, H., de Jong-Van den Berg, L.T.W., Aronson, A., & Molema, G. (2003). Generating hypotheses by discovering implicit associations in the literature: A case report for new potential therapeutic uses for thalidomide. Journal of the American Medical Informatics Association, 10(3), 252-259.

KEY TERMS

Bibliomining: Data mining applied to digital libraries to discover patterns in large collections.

Bioinformatics: Data mining applied to medical digital libraries.

Clustering: An algorithm that takes a dataset and groups the objects such that objects within the same cluster have a high similarity to each other, but are dissimilar to objects in other clusters.

Information Extraction: A sub-field of NLP that addresses finding specific entities and facts in unstructured text.

Information Retrieval: A discipline that deals with finding documents that meet a set of specific requirements.

Machine Learning: Artificial intelligence methods that use a dataset to allow the computer to learn models that fit the data.

Natural Language Processing: A discipline that deals with various aspects of automatically processing written and spoken language.

Supervised Learning: A machine learning technique that requires a set of training data, which consists of known inputs and a priori desired outputs (e.g., classification labels), and that can subsequently be used for either prediction or classification tasks.

Unsupervised Learning: A machine learning technique used to create a model based upon a dataset; unlike supervised learning, the desired output is not known a priori.
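The last two definitions can be contrasted with a deliberately tiny sketch. The one-dimensional data, the threshold rule and the largest-gap split are all illustrative assumptions, not methods from the article.

```python
# Supervised: learn a decision threshold from labeled (input, output) pairs.
labeled = [(0.2, "low"), (0.3, "low"), (1.8, "high"), (2.1, "high")]
low_max = max(x for x, y in labeled if y == "low")
high_min = min(x for x, y in labeled if y == "high")
threshold = (low_max + high_min) / 2  # midpoint between the two classes

def predict(x):
    return "high" if x > threshold else "low"

# Unsupervised: no labels; split the same inputs at the largest gap.
inputs = sorted(x for x, _ in labeled)
gaps = [b - a for a, b in zip(inputs, inputs[1:])]
split = gaps.index(max(gaps)) + 1
clusters = (inputs[:split], inputs[split:])

print(predict(1.9), clusters)
```

Both procedures recover the same two groups here, but only the supervised one can attach a label ("low"/"high") to a new observation, because only it was given the desired outputs a priori.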
Huan Liu
Arizona State University, USA
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Data Mining Methods for Microarray Data Analysis
MAJOR LINES OF RESEARCH AND DEVELOPMENT

In this part, we briefly review methods for each of the data-mining tasks identified earlier: gene clustering, sample clustering, sample class prediction, and gene selection. We discuss gene clustering and sample clustering together, for these two tasks are common; however, they are applied to microarray data from different directions.

Clustering

Clustering is a process of grouping similar samples, objects, or instances into clusters. Many clustering methods exist (for a review, see Jain, Murty, & Flynn, 1999; Parson, Ehtesham, & Liu, 2004). They can be applied to microarray data analysis for clustering genes or samples. In this article, we present three groups of frequently used clustering methods.

The use of hierarchical clustering for gene clustering was first described in Eisen et al. (1998). Each instance forms a cluster in the beginning, and the two most similar clusters are merged until all instances are in one single cluster. The clustering results in a tree structure, called a dendrogram, which can be broken at different levels by using domain knowledge. Tree structures are easy to understand and can reveal close relationships among the resulting clusters, but they do not provide a unique partition among all the instances, because different ways to determine a basic level in the dendrogram can result in different clustering results.

Unlike hierarchical clustering methods, partition-based clustering methods divide the whole data into a fixed number of clusters. Examples are K-means (Herwig et al., 1999), self-organizing maps (Tamayo et al., 1999), and graph-based partitioning (Xu, Olman, & Xu, 2002). K-means methods often require specification of the number of clusters, K, and the selection of K instances as the initial clusters. All instances are then partitioned into the K clusters, optimizing some objective function (e.g., inner-cluster similarity) by assigning each instance to the most similar cluster, which is determined by the distance between the instance and the mean of each cluster in the current iteration. Self-organizing maps (SOMs) are variations of K-means methods and require specification of the initial topology of K nodes to construct the map. In graph-based partitioning methods, a Minimum Spanning Tree (MST) is often constructed, and the clusters are generated by deleting the MST edges with the largest lengths. Graph-based partitioning methods do not heavily depend on the regularity of the geometric shape of cluster boundaries, as K-means and SOMs do.

Traditional clustering methods require that each instance belong to a single cluster, even though some instances may be only slightly relevant to the biological significance of their assigned clusters. Fuzzy C-means (Dembele & Kastner, 2003) applies a fuzzy partitioning method that assigns cluster membership values to instances; this process is called fuzzy clustering. It links each instance to all clusters via a real-valued vector of indexes. The value of each index lies between 0 and 1, where a value close to 1 indicates a strong association to the corresponding cluster, while a value close to 0 indicates no association. The vector of indexes thus defines the membership of an instance with respect to the various clusters.

Sample Class Prediction

Apart from clustering methods, which do not require a priori knowledge about the classes of available instances, a classification method requires training instances with labeled classes, learns patterns that discriminate be-
tween various classes, and ideally, correctly predicts the classes of unseen instances. Many classification methods can be applied to predict diseases or phenotypes of novel samples from microarray data. We present four commonly used methods in this section.

Linear discriminative analysis (LDA): For an m x n gene expression matrix X (m is the number of samples, and n is the number of genes), LDA seeks linear combinations, xa, of sample vectors xi = (xi1, ..., xin) with large ratios of between-class to within-class sums of squares. In other words, it tries to maximize the ratio aᵀBa/aᵀWa, where B and W denote, respectively, the n x n matrices of between-class and within-class sums of squares (Dudoit, Fridlyand, & Speed, 2000).

Nearest neighbor (NN) usually does not learn during the training phase. Only when it is required to classify a new sample does NN search the data to find the nearest neighbor for the new sample, using the class label of the nearest neighbor to predict the class label of the new sample. K-NN makes the prediction for a new sample based on the most common class label among the K training samples most similar to the new sample. Examples can be found in Pomeroy et al. (2002).

Decision trees classify samples by building a tree-like structure. Specifically, they recursively split samples into two child branches based on the values of a selected feature, starting with all the samples. Each leaf node of the tree is pure in terms of classes, and the resulting partition corresponds to a classifier. By limiting the number of consecutive branches, they can produce more generalized classifiers. Different forms of trees exist. In Wu et al. (2003), classification and regression trees are applied for sample classification.

Support vector machines (SVMs) have also been shown to be effective in sample classification (Brown et al., 2000). They try to separate a set of training samples of two different classes with a hyperplane in an n-dimensional space defined by n features (genes). If no separating hyperplane exists in the original space, a kernel function is used to map the samples into a higher dimensional space where a separating hyperplane exists. Complex kernel functions that provide nonlinear mappings result in nonlinear classifiers. SVMs avoid overfitting by selecting, from among the many hyperplanes that can separate the two classes, the hyperplane that is maximally distant from the training samples of the two classes, called the maximum margin separating hyperplane.

Gene Selection

The nature of relatively high dimensionality but small sample size in sample classification and clustering can cause the problem of the curse of dimensionality and overfitting of the training data (Dougherty, 2001). Therefore, selecting a small number of discriminative genes from thousands of genes is essential for successful sample classification and clustering. Feature selection methods (for a review, see Blum & Langley, 1997; Liu & Motoda, 1998) can be applied in microarray data analysis for gene selection.

Among gene selection methods, earlier methods often evaluate genes in isolation without considering gene-to-gene correlation. They rank genes according to their individual relevance or discriminative power with respect to the targeted classes and select top-ranked genes. Methods based on statistical tests or information gain have been employed in Golub et al. (1999) and Model (2001). However, a number of studies (Ding & Peng, 2003; Xion, Fang, & Zhao, 2001) point out that simply combining a highly ranked gene with another highly ranked gene often does not form a good gene set, because some highly correlated genes could be redundant. Removing redundant genes among the selected ones can achieve a better representation of the characteristics of the targeted classes and lead to improved classification accuracy. Methods that handle gene redundancy based on pair-wise correlation analysis among genes can be found in Ding and Peng (2003); Xing, Jordan, and Karp (2001); and Yu and Liu (2004). A gene selection method for unlabeled samples is also proposed and shown effective for sample clustering in Xing and Karp (2001).

FUTURE TRENDS

Traditional data-mining methods are often designed for data where the number of instances is significantly larger than the number of features. In microarray data analysis, however, the number of features (genes) is huge, and the number of instances (samples) is relatively small, for tasks of sample clustering or classification. This unique characteristic of microarray data presents a challenge to the scalability of current data-mining methods to high dimensionality. In addition, the relative shortage of instances in the context of high dimensionality often causes many methods to overfit the training data. Therefore, besides improving current data-mining methods, substantial research efforts are needed to come up with new methods specifically designed for microarray data.

CONCLUSION

Gene expression microarrays are a revolutionary technology with great potential to provide accurate medical diagnostics, develop cures for diseases, and produce a
detailed genome-wide molecular portrait of cellular states (Piatetsky-Shapiro & Tamayo, 2003). Data-mining methods are effective tools to turn massive raw data from microarray experiments into biologically important insights. In this article, we provide a brief introduction to microarray data analysis and a concise review of various data-mining methods for microarray data.

REFERENCES

Alon, U. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Acad. Sci., 96 (pp. 6745-6750), USA.

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245-271.

Brown, M., et al. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Acad. Sci., 97 (pp. 262-267), USA.

Dembele, D., & Kastner, P. (2003). Fuzzy C-means method for clustering microarray data. Bioinformatics, 19(8), 973-980.

Ding, C., & Peng, H. (2003). Minimum redundancy feature selection from microarray gene expression data. Proceedings of the Computational Systems Bioinformatics Conference (pp. 523-529).

Dougherty, E. R. (2001). Small sample issue for microarray-based classification. Functional Genomics, 2, 28-34.

Draghici, S. (2003). Data analysis tools for DNA microarrays. Chapman & Hall/CRC.

Dudoit, S., Fridlyand, J., & Speed, T. P. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data (Tech. Rep. No. 576). Berkeley, CA: University of California at Berkeley, Department of Statistics.

Eisen, M., et al. (1998). Clustering analysis and display of genome-wide expression patterns. Proceedings of the National Acad. Sci., 95 (pp. 14863-14868), USA.

Golub, T., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.

Herwig, R., et al. (1999). Large-scale clustering of cDNA fingerprints. Genomics, 66, 249-256.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Boston: Kluwer Academic.

Model, F. (2001). Feature selection for DNA methylation based cancer classification. Bioinformatics, 17, 154-164.

Parson, L., Ehtesham, H., & Liu, H. (2004). Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations, 6(1), 90-105.

Piatetsky-Shapiro, G., & Tamayo, P. (2003). Microarray data mining: Facing the challenges. SIGKDD Explorations, 5(2), 1-5.

Pomeroy, S. L., et al. (2002). Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature, 415, 436-442.

Schena, M. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467-470.

Tamayo, P., et al. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Acad. Sci., 96 (pp. 2907-2912), USA.

Wu, B., et al. (2003). Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 1636-1643.

Xing, E., Jordan, M., & Karp, R. (2001). Feature selection for high-dimensional genomic microarray data. Proceedings of the 18th International Conference on Machine Learning (pp. 601-608).

Xing, E., & Karp, R. (2001). CLIFF: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. Bioinformatics, 17, S306-S315.

Xion, M., Fang, Z., & Zhao, J. (2001). Biomarker identification by feature wrappers. Genome Research, 11, 1878-1887.

Xu, Y., Olman, V., & Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees. Bioinformatics, 18(4), 536-545.

Yu, L., & Liu, H. (2004). Redundancy based feature selection for microarray data. Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Data Mining with Cubegrades
be a database and X ⊆ D be a cell. A query is monotonic at X if the condition Q(X) is FALSE implies that Q(X') is FALSE for any X' ⊆ X.

However, as described by Imielinski et al. (2002), determining whether a query Q is monotonic in terms of this definition is an NP-hard problem for many simple classes of queries. To work around this problem, the authors introduced another notion of monotonicity, referred to as view monotonicity of a query. Suppose you have cuboids defined on a set S of dimensions and measures. A view V on S is an assignment of values to the elements of the set. If the assignment holds for the dimensions and measures in a given cell X, then V is a view for X on the set S. So, for example, if in a cell of rural buyers the average sales of bread for 20 buyers is 15, then the view on the set {areaType, COUNT(), AVG(salesBread)} for the cell is {areaType=rural, COUNT()=20, AVG(salesBread)=15}. Extending the definition, a view on a query is an assignment of values for the set of dimension and measure attributes of the query expression. A query Q() is view monotonic on view V if for any cell X in any database D such that V is the view for X, the condition Q is FALSE for X implies Q is FALSE for all X' ⊆ X. An important property of view monotonicity is that the time and space required for checking it for a query depend on the number of terms in the query, not on the size of the database or the number of its attributes. Because most queries typically have few terms, it is useful in many practical situations. The method presented can be used for checking view monotonicity for queries that include constraints of type (Agg {<, >, =, !=} c), where c is a constant and Agg can be MIN, SUM, MAX, AVERAGE, COUNT, an aggregate that is a higher order moment about the origin, or an aggregate that is an integral of a function on a single attribute.

Consider a hypothetical query asking for cubes with 1000 or more buyers and with total milk sales less than $50,000. In addition, the average milk sales per customer should be between $20 and $50, with maximum sales greater than $75. This query can be expressed as follows:

COUNT(*) >= 1000 and AVG(salesMilk) >= 20 and AVG(salesMilk) < 50 and
MAX(salesMilk) >= 75 and SUM(salesMilk) < 50K

Suppose, while performing bottom-up cube computation, you have a cell C with the following view V: (Count = 1200; AVG(salesMilk) = 50; MAX(salesMilk) = 80; MIN(salesMilk) = 30; SUM(salesMilk) = 60000). Using the method for checking view monotonicity, it can be shown that some subcell C' of C can exist (though this subcell is not guaranteed to exist in this database) with 1000 <= count < 1075 for which this query can be satisfied. Thus, the query cannot be pruned on the cell. However, if the view for C is (Count = 1200; AVG(salesMilk) = 57; MAX(salesMilk) = 80; MIN(salesMilk) = 30; SUM(salesMilk) = 68400), then it can be shown that there cannot exist any subcell C' of C in any database for which the original query can be satisfied. Thus, the query can be pruned on cell C.

After the source cells have been computed, the next task is to compute the set of target cells. This is done by performing a set of query conversions which make it possible to reduce cubegrade query evaluation to iceberg queries. Given a specific candidate source cell C, define Q[C] as the query which results from Q by source substitution. Here, Q is transformed by substituting into its where clause all the values of the measures, as well as the descriptors of the source cell C, and by performing the delta elimination step, which replaces all the relative delta-values (expressed as fractions) by regular less-than/greater-than conditions. This is possible because the values for the measures of C are known. With this, the delta-values can be expressed as conditions on values of the target. For example, if AVG(Salary) = 40K in cell C and the condition on DeltaAVG(Salary) is of the form DeltaAVG(Salary) > 1.10, this can be translated to AVG(Salary) > 44K, where AVG(Salary) references the target cell. The final step in source substitution is join transformation, where the join conditions (specializations, generalizations, and mutations) in Q are transformed into target conditions because the source cell is known. Notice, thus, that Q[C] is the cube query specifying the target cell.

Dong, Han, Lam, Pei, and Wang (2001) present an optimized version of target generation that is particularly useful when the source cells are few in number. The algorithm includes the following steps:

- Perform, for the set of identified source cells, the lowest common delta elimination such that the resulting target condition does not exclude any possible target cells.
- Perform a bottom-up iceberg query for the target cells based on the target condition. Define LiveSet(T) of a target cell T as the candidate set of source cells that can possibly match or join with the target. A target cell T may identify a source S in its LiveSet to be prunable based on its monotonicity and thus removable from its LiveSet. In such a case, all descendants of T would also not include S in their LiveSets.
- Perform a join of the target and each of its LiveSet's source cells, and for the resulting cubegrade, check whether it satisfies the join criteria and the delta-value condition for the query.
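The delta elimination step lends itself to a small sketch. The dictionary representation below is an illustrative assumption, not the paper's data structure; it only shows how a relative delta condition on a known source cell becomes an absolute condition on the target cell.

```python
def delta_eliminate(source_measures, delta_conditions):
    """Rewrite relative delta conditions (target/source ratios) as
    absolute threshold conditions on the target cell's measures."""
    target_conditions = {}
    for measure, (op, ratio) in delta_conditions.items():
        # Delta<m> <op> ratio  becomes  <m>(target) <op> ratio * <m>(source)
        target_conditions[measure] = (op, ratio * source_measures[measure])
    return target_conditions

# As in the example above: AVG(Salary) = 40K in the source cell and
# DeltaAVG(Salary) > 1.10 translate to AVG(Salary) > 44K on the target.
source = {"AVG(Salary)": 40_000}
target = delta_eliminate(source, {"AVG(Salary)": (">", 1.10)})
```

Because every rewritten condition refers only to the target cell, the transformed query can be evaluated as an ordinary iceberg/cube query, which is the point of the source substitution.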
In a typical scenario, it is not expected that users would ask for cubegrades per se. Rather, it is more likely that they pose a query on how a given delta change affects a set of cells, which cells are affected by a given delta change, or what delta changes affect a set of cells in a prespecified manner.

Further, more complicated applications can be implemented by using cubegrades (Abdulghani, 2001). For example, one may be interested in finding cells that remain stable and are not significantly affected by generalization, specialization, or mutation. An illustration of such a situation would be to find cells that remain stable on the blood pressure measure and are not affected by a different specialization on age or area demographics. Another application could be to find effective factors: a set of specialization, generalization, or mutation descriptors that are effective in changing a measure value m by a significant ratio. For example, one may want to find effective factors in decreasing a cholesterol level across a set of selected cells.

FUTURE TRENDS

A major challenge for cubegrade processing is its computational complexity. Potentially, an exponential number of source/target cells can be generated. A positive development in this direction is the work done on Quotient Cubes (Lakshmanan, Pei, & Han, 2002). This work provides a method for partitioning and compressing the cells of a data cube into equivalence classes such that the resulting classes have cells covering the same set of tuples and preserving the cube's semantic roll-up/drill-down. In this context, we can reduce the number of cells generated for the source and target. Further, the pairings for the cubegrade can be reduced by restricting the source and target to different classes.

Another related challenge for cubegrades is to identify the set of interesting cubegrades (cubegrades that are somewhat surprising). Insights into this problem can be obtained from similar work done in the context of association rules (Bayardo & Agrawal, 1999; Liu, Ma, & Yu, 2001). The main difference is that for cubegrades, measures (possibly a combination of them) other than COUNT are involved, whereas for association rules, the interestingness functions are based on the count function. In addition, cubegrades have cell modification in multiple directions, but association rules are restricted to specializations.

As cubegrades are better understood, wider applications of the concept to various domains are expected. Association rules have been applied with success to such areas as intrusion detection (Lee, Stolfo, & Mok, 1998) and microarray data (Tuzhilin & Adomavicius, 2002). Applying the cubegrade paradigm in such domains provides an opportunity for richer mining results.

CONCLUSION

In this article, I look at a generalization of association rules referred to as cubegrades. These generalizations include allowing the evaluation of relative changes in measures other than just confidence, as well as allowing cell modifications to occur in different directions. The additional directions considered here include generalizations, which modify cells towards the more general cell with fewer descriptors, and mutations, which modify the descriptors of a subset of the attributes in the original cell definition with the others remaining the same. The paradigm allows you to ask queries that were not possible through association rules. The downside is that it comes at the price of relatively increased computation/storage costs that need to be tackled with innovative methods.

REFERENCES

Abdulghani, A. (2001). Cubegrades: Generalization of association rules to mine large datasets. Doctoral dissertation, Rutgers University, New Brunswick, NJ. Dissertation Abstracts International, DAI-B 62/10, UMI Number 3027950.

Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference (pp. 207-216), USA.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the International Conference on Very Large Data Bases (pp. 487-499), Chile.

Bayardo, R., & Agrawal, R. (1999). Mining the most interesting rules. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 145-154), USA.

Beyer, K. S., & Ramakrishnan, R. (1999). Bottom-up computation of sparse and iceberg CUBEs. Proceedings of the ACM SIGMOD Conference (pp. 359-370), USA.

Dong, G., Han, J., Lam, J. M., Pei, J., & Wang, K. (2001). Mining multi-dimensional constrained gradients in data cubes. Proceedings of the International Conference on Very Large Data Bases (pp. 321-330), Italy.
291
TEAM LinG
Data Mining with Cubegrades
Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). KEY TERMS
Data cube: A relational aggregation operator generalizing
group-by, cross-tab, and sub-total. Proceedings of the Cubegrade: A cubegrade is a 5-tuple (source, target,
International Conference on Data Engineering (pp. 152- measures, value, delta-value) where
159), USA.
source and target are cells
Han, J., Pei, J., Dong, G., & Wang, K. (2001). Efficient measures is the set of measures that are evaluated
computation of iceberg cubes with complex measures. both in the source as well as in the target
Proceedings of the ACM SIGMOD Conference (pp. 1-12), value is a function, value: measures R, that
USA. evaluates measure m measures in the source
Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). delta-value is a function, delta-value: measures
Cubegrades: Generalizing association rules. Journal of R, that computes the ratio of the value of m
Data Mining and Knowledge Discovery, 6(3), 219-257. measures in the target versus the source
Lakshmanan, L. V. S., Pei, J., & Han, J. (2002). Quotient Drill-Down: A cube operation that allows users to
cube: How to summarize the semantics of a data cube. navigate from summarized cells to more detailed cells.
Proceedings of the International Conference on Very
Large Data Bases (pp. 778-789), China. Generalizations: A cubegrade is a generalization if
the set of descriptors of the target cell are a subset of the
Lee, W., Stolfo, S. J., & Mok, K. W. (1998). Mining audit set of attribute-value pairs of the source cell.
data to build intrusion detection models. Proceedings of
the ACM International Conference on Knowledge Dis- Iceberg Cubes: The set of cells in a cube that satisfies
covery and Data Mining (pp. 66-72), USA. an iceberg query.
Liu, B., Ma, Y., & Yu, S. P. (2001). Discovering unexpected Iceberg Query: A query on top of a cube that asks for
information from your competitors Web sites. Proceed- aggregates above a certain threshold.
ings of the ACM International Conference on Knowl- Mutations: A cubegrade is a mutation if the target and
edge Discovery and Data Mining (pp. 144-153), USA. source cells have the same set of attributes but differ on
Sarawagi, S. (2000). User-adaptive exploration of multidi- the values.
mensional data. Proceedings of the International Con- Query Monotonicity: () is monotonic at a cell X if the
ference on Very Large Data Bases (pp. 307-316). condition Q(X) is FALSE implies that Q(X) is FALSE for
Sarawagi, S., Agrawal, R., & Megiddo, N.(1998). Discov- any cell X`X.
ery driven exploration of OLAP data cubes. Proceedings Roll-Up: A cube operation that allows users to aggre-
of the International Conference on Extending Database gate from detailed cells to summarized cells.
Technology (pp. 168-182), Spain.
Specialization: A cubegrade is a specialization if the
Tuzhilin, A., & Adomavicius, G. (2002). Handling very set of attribute-value pairs of the target cell is a superset
large numbers of association rules in the analysis of of the set of attribute-value pairs of the source cell.
microarray data. Proceedings of the ACM International
Conference on Knowledge Discovery and Data Mining View: A view V on S is an assignment of values to the
(pp. 396-404), Canada. elements of the set. If the assignment holds for the
dimensions and measures in a given cell X, then V is a view
Xin, D., Han, J., Li, X., & Wah, B. W.(2003). Star-cubing: for X on the set S.
Computing iceberg cubes by top-down and bottom-up
integration. Proceedings of the International Confer- View Monotonicity: A query Q is view monotonic on
ence on Very Large Data Bases (pp. 476-487), Germany. view V if for any cell X in any database D such that V is
the view for X for query Q , the condition Q is FALSE for
X implies that Q is FALSE for all X` X.
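As an illustration, the relationships among the cells in these definitions can be sketched in Python. The dict-based cell and measure representations below are hypothetical, invented for this sketch and not taken from the article:

```python
# Hypothetical sketch of the cubegrade 5-tuple from the Key Terms.
# A cell is a dict of attribute -> value; measures map names to numbers.

def classify_cubegrade(source: dict, target: dict) -> str:
    """Classify a source -> target cell change as generalization,
    specialization, or mutation, per the Key Terms definitions."""
    src, tgt = set(source.items()), set(target.items())
    if tgt < src:
        return "generalization"   # target descriptors are a subset of source's
    if tgt > src:
        return "specialization"   # target descriptors are a superset of source's
    if set(source) == set(target) and src != tgt:
        return "mutation"         # same attributes, different values
    return "other"

def delta_value(source_measures: dict, target_measures: dict) -> dict:
    """delta-value: ratio of each measure's value in the target vs. the source."""
    return {m: target_measures[m] / source_measures[m]
            for m in source_measures if m in target_measures}

src = {"region": "East", "product": "tea"}
tgt = {"region": "East"}                       # a descriptor is dropped
print(classify_cubegrade(src, tgt))            # generalization
print(delta_value({"AVG_sales": 100.0}, {"AVG_sales": 120.0}))  # {'AVG_sales': 1.2}
```

A real implementation would evaluate cells against the cube itself; this sketch only captures the set-containment relationships between source and target descriptors.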
Data Mining with Incomplete Data

Shouhong Wang
University of Massachusetts Dartmouth, USA

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

BACKGROUND

There have been three traditional approaches to handling missing data in statistical analysis and data mining. One of the convenient solutions to incomplete data is to eliminate from the data set those records that have missing values (Little & Rubin, 2002). This, however, ignores potentially useful information in those records. In cases where the proportion of missing data is large, the data mining conclusions drawn from the screened data set are more likely misleading.

MAIN THRUST

There have been two primary approaches to data mining with incomplete data: conceptual construction and enhanced data mining.

Conceptual Construction with Incomplete Data

Conceptual construction with incomplete data reveals the patterns of the missing data as well as the potential
impacts of these missing data on the mining results based only on the complete data. Conceptual construction on incomplete data is a knowledge development process. To construct new concepts on incomplete data, the data miner needs to identify a particular problem as a base for the construction. According to Wang and Wang (2004), conceptual construction is carried out through two phases. First, data mining techniques (e.g., cluster analysis) are applied to the data set with complete data to reveal the unsuspected patterns of the data, and the problem is then articulated by the data miner. Second, the incomplete data with missing values related to the problem are used to construct new concepts. In this phase, the data miner evaluates the impacts of missing data on the identification of the problem and develops knowledge related to the problem. For example, suppose a data miner is investigating the profile of the consumers who are interested in a particular product. Using the complete data, the data miner has found that variable i (e.g., income) is an important factor of the consumers' purchasing behavior. To further verify and improve the data mining result, the data miner must develop new knowledge through mining the incomplete data. Four typical concepts as results of knowledge discovery in data mining with incomplete data are described as follows:

(1) Reliability: The reliability concept reveals the scope of the missing data in terms of the problem identified based only on complete data. For instance, in the above example, to develop the reliability concept, the data miner can define index VM(i)/VC(i) where VM(i) is the number of missing values in variable i, and VC(i) is the number of samples used for the problem identification in variable i. Accordingly, the higher VM(i)/VC(i) is, the lower the reliability of the factor would be.

(2) Hiding: The concept of hiding reveals how likely an observation with a certain range of values in one variable is to have a missing value in another variable. For instance, in the above example, the data miner can define index VM(i)|x(j)∈(a,b) where VM(i) is the number of missing values in variable i, x(j) is the occurrence of variable j (e.g., education years), and (a,b) is the range of x(j); and use this index to disclose the hiding relationships between variables i and j, say, more than two thousand records have missing values in variable income given the value of education years ranging from 13 to 19.

(3) Complementing: The concept of complementing reveals what variables are more likely to have missing values at the same time; that is, the correlation of missing values related to the problem being investigated. For instance, in the above example, the data miner can define index VM(i,j)/VM(i) where VM(i,j) is the number of missing values in both variables i and j, and VM(i) is the number of missing values in variable i. This concept discloses the correlation of two variables in terms of missing values. The higher the value VM(i,j)/VM(i) is, the stronger the correlation of missing values would be.

(4) Conditional Effects: The concept of conditional effects reveals the potential changes to the understanding of the problem caused by the missing values. To develop the concept of conditional effects, the data miner assumes different possible values for the missing values, and then observes the possible changes of the nature of the problem. For instance, in the above example, the data miner can define index ΔP|z(i)=k where ΔP is the change of the size of the target consumer group perceived by the data miner, z(i) represents all missing values of variable i, and k is the possible value variable i might have for the survey. Typically, k ∈ {max, min, p} where max is the maximal value of the scale, min is the minimal value of the scale, and p is the random variable with the same distribution function of the values in the complete data. By setting different possible values of k for the missing values, the data miner is able to observe the change of the size of the consumer group and redefine the problem.

Enhanced Data Mining with Incomplete Data

The second primary approach to data mining with incomplete data is enhanced data mining, in which incomplete data are fully utilized. Enhanced data mining is carried out through two phases. In the first phase, observations with missing data are transformed into fuzzy observations. Since missing values make the observation fuzzy, according to fuzzy set theory (Zadeh, 1978), an observation with missing values can be transformed into fuzzy patterns that are equivalent to the observation. For instance, suppose there is an observation A = X(x1, x2, ..., xc, ..., xm) where xc is the variable with a missing value, and xc ∈ {r1, r2, ..., rp} where rj (j = 1, 2, ..., p) is the possible occurrence of xc. Let μj = Pj(xc = rj) be the fuzzy membership (or possibility) that xc belongs to rj (j = 1, 2, ..., p), and Σj μj = 1. Then μj[X|(xc = rj)] (j = 1, 2, ..., p) are fuzzy patterns that are equivalent to the observation A.

In the second phase of enhanced data mining, all fuzzy patterns, along with the complete data, are used for data mining using tools such as self-organizing maps (SOM) (Deboeck & Kohonen, 1998; Kohonen, 1989; Vesanto & Alhoniemi, 2000) and other types of neural
networks (Wang, 2000, 2002). These tools used for enhanced data mining are different from the original ones in that they are capable of retaining information of fuzzy membership for each fuzzy pattern. Wang (2003) has developed a SOM-based enhanced data mining model to utilize all fuzzy patterns and the complete data for knowledge discovery. Using this model, the data miner is allowed to compare the SOM based on complete data and the fuzzy SOM based on all incomplete data to perceive covert patterns of the data set. It also allows the data miner to conduct what-if trials by including different portions of the incomplete data to disclose more accurate facts. Wang (2005) has developed a Hopfield neural network based model (Hopfield & Tank, 1986) for data mining with incomplete survey data. The enhanced data mining method utilizes more information provided by fuzzy patterns, and thus makes the data mining results more accurate. More importantly, it produces rich information about the uncertainty (or risk) of the data mining results.

incomplete data. It provides effective techniques for knowledge development so that the data miner is allowed to interpret the data mining results based on the particular problem domain and his/her perception of the missing data. The other approach is fuzzy transformation. According to this approach, observations with missing values are transformed into fuzzy patterns based on fuzzy set theory. These fuzzy patterns, along with observations with complete data, are then used for data mining through, for example, data visualization and classification.

The inclusion of incomplete data for data mining would provide more information for the decision maker in identifying problems, verifying and improving the data mining results derived from observations with complete data only.

REFERENCES
Tseng, S., Wang, K., & Lee, C. (2003). A pre-processing method to deal with missing values by integrating clustering and regression techniques. Applied Artificial Intelligence, 17(5/6), 535-544.

Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586-600.

Wang, S. (2000). Neural networks. In M. Zeleny (Ed.), IEBM handbook of IT in business (pp. 382-391). London, UK: International Thomson Business Press.

Wang, S. (2002). Nonlinear pattern hypothesis generation for data mining. Data & Knowledge Engineering, 40(3), 273-283.

Wang, S. (2003). Application of self-organizing maps for data mining with incomplete data sets. Neural Computing & Applications, 12(1), 42-48.

Wang, S. (2005). Classification with incomplete survey data: A Hopfield neural network approach. Computers & Operations Research, 32(10), 2583-2594.

Wang, S., & Wang, H. (2004). Conceptual construction on incomplete survey data. Data and Knowledge Engineering, 49(3), 311-323.

Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3-28.

KEY TERMS

Aggregate Conceptual Direction: Aggregate conceptual direction describes the trend in the data along which most of the variance occurs, taking the missing data into account.

Conceptual Construction with Incomplete Data: Conceptual construction with incomplete data is a knowledge development process that reveals the patterns of the missing data as well as the potential impacts of these missing data on the mining results based only on the complete data.

Enhanced Data Mining with Incomplete Data: Data mining that utilizes incomplete data through fuzzy transformation.

Fuzzy Transformation: The process of transforming an observation with missing values into fuzzy patterns that are equivalent to the observation based on fuzzy set theory.

Hopfield Neural Network: A neural network with a single layer of nodes that have binary inputs and outputs. The output of each node is fed back to all other nodes simultaneously, and each node forms a weighted sum of inputs and passes the output result through a nonlinearity function. It applies a supervised learning algorithm, and the learning process continues until a stable state is reached.

Incomplete Data: A data set for data mining that contains some data entries with missing values. For instance, when surveys and questionnaires are partially completed by respondents, the entire response data set becomes incomplete data.

Neural Network: A set of computer hardware and/or software that attempts to emulate the information processing patterns of the biological brain. A neural network consists of four main components:

(1) Processing units (or neurons); each of them has a certain output level at any point in time.
(2) Weighted interconnections between the various processing units, which determine how the output of one unit leads to input for another unit.
(3) An activation rule, which acts on the set of inputs at a processing unit to produce a new output.
(4) A learning rule that specifies how to adjust the weights for a given input/output pair.

Self-Organizing Map (SOM): A two-layer neural network that maps high-dimensional data onto a low-dimensional grid of neurons through an unsupervised or competitive learning process. It allows the data miner to view the clusters on the output maps.
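The missing-value indices and the fuzzy transformation described in this article can be sketched in Python. The record layout, variable names, and membership values below are invented for illustration; `None` stands in for a missing survey answer:

```python
# Hypothetical sketch of the reliability index VM(i)/VC(i), the
# complementing index VM(i,j)/VM(i), and the fuzzy transformation.

data = [
    {"income": 50, "education_years": 16},
    {"income": None, "education_years": 14},
    {"income": None, "education_years": None},
    {"income": 70, "education_years": 12},
]

def reliability(var, records):
    """VM(i)/VC(i): missing count over complete count for one variable."""
    vm = sum(1 for r in records if r[var] is None)
    vc = sum(1 for r in records if r[var] is not None)
    return vm / vc

def complementing(i, j, records):
    """VM(i,j)/VM(i): how often variable j is also missing when i is missing."""
    vm_ij = sum(1 for r in records if r[i] is None and r[j] is None)
    vm_i = sum(1 for r in records if r[i] is None)
    return vm_ij / vm_i

def fuzzy_patterns(record, var, possible_values, memberships):
    """Expand one observation with a missing value into fuzzy patterns,
    each carrying a membership; the memberships must sum to 1."""
    assert abs(sum(memberships) - 1.0) < 1e-9
    return [(mu, {**record, var: v})
            for mu, v in zip(memberships, possible_values)]

print(reliability("income", data))                        # 2/2 = 1.0
print(complementing("income", "education_years", data))   # 1/2 = 0.5
print(fuzzy_patterns(data[1], "income", [40, 60], [0.3, 0.7]))
```

Each fuzzy pattern is a complete observation paired with its membership, which is the form that the SOM- and Hopfield-based enhanced data mining models are described as retaining.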
Data Quality in Cooperative Information Systems

Massimo Mecella
Università di Roma "La Sapienza", Italy

Monica Scannapieco
Università di Roma "La Sapienza", Italy

Antonino Virgillito
Università di Roma "La Sapienza", Italy
Figure 1. The overall architecture: the cooperative gateways of organizations Org1, Org2, ..., Orgn (each hiding the internals of the organization) are connected by a communication infrastructure hosting the Data Quality Broker (DQB), the Quality Notification Service (QNS), the Quality Factory (QF), and the Rating Service.
quality in CISs; this architecture allows the diffusion of data and related quality and exploits data replication to improve the overall quality of cooperative data. The interested reader can find a detailed description of the design and implementation of the architecture in Scannapieco, et al. (2004). Each organization offers services to other organizations on its own cooperative gateway and also specific services to its internal back-end systems. Therefore, cooperative gateways interface both internally and externally through services. Moreover, the communication infrastructure offers some specific services. Services are all identical and peer (i.e., they are instances of the same software artifacts and act both as servers and clients of the other peers, depending on the specific activities to be carried out). The overall architecture is depicted in Figure 1. Organizations export data and quality data according to a common model referred to as the Data and Data Quality (D2Q) model. It includes the definitions of (i) constructs to represent data, (ii) a common set of data quality properties, (iii) constructs to represent them, and (iv) the association between data and quality data.

In order to produce data and quality data according to the D2Q model, each organization deploys on its cooperative gateway a quality factory service that is responsible for evaluating the quality of its own data. The design of the quality factory has been addressed in Cappiello, et al. (2003).

The Data Quality Broker poses, on behalf of a requesting user, a data request over other cooperating organizations, also specifying a set of quality requirements that the desired data have to satisfy; this is referred to as the quality brokering function. Different copies of the same data received as responses to the request are reconciled, and a best-quality value is selected and proposed to organizations, which can choose to discard their data and to adopt higher quality ones; this is referred to as the quality improvement function. If the requirements specified in the request cannot be satisfied, then the broker initiates a negotiation with the user, who can optionally weaken the constraints on the desired data. The data quality broker is, in essence, a data integration system (Lenzerini, 2002) deployed as a peer-to-peer system, which allows posing a quality-enhanced query over a global schema and selecting data satisfying such requirements.

The Quality Notification Service is a publish/subscribe engine used as a quality message bus between services and/or organizations. More specifically, it allows quality-based subscriptions for users to be notified on changes of the quality of data. For example, an organization may want to be notified if the quality of some data it uses degrades below a certain threshold, or when high quality data are available. The quality notification service is also deployed as a peer-to-peer system.

The Rating Service associates trust values with each data source in the CIS. These values are used to determine the reliability of the quality evaluation performed by organizations. The rating service is a centralized service, to be provided by a third-party organization. The interested reader can find further details on the rating service design and implementation in De Santis, et al. (2003).

FUTURE TRENDS

The complete development of a framework for data quality management in CISs requires the solution of further issues. An important aspect concerns the techniques to be used for quality dimension measurement. In both the statistical and machine learning areas, some techniques could be usefully exploited. The general idea is to have quality values estimated with a certain probability instead of a deterministic quality evaluation. In this way, the task of assigning quality values to each data value could be considerably simplified.

The data quality broker covers some aspects of quality-driven query processing in CISs. Nevertheless, there is still the need to investigate instance reconciliation techniques; when quality values are not attached to exported data, how is it possible to select between two conflicting instances of the same data? Current data integration systems simply do not provide any answer in such cases, but a looser semantics for query answering is needed in order to make data integration systems actually work in real scenarios where errors and conflicts are present.

Some further open issues concern trusting data sources. Data ownership is an important one. From a data quality perspective, assigning a responsibility on data actually helps to engage improvement actions as well as to trust sources providing data. In some cases, laws help to assign responsibilities on data typologies, but this is not always possible. Models and techniques that allow trusting such sources with respect to provided data are an open and important issue, especially when data sources interact in open and dynamic environments like peer-to-peer systems.

CONCLUSION

Managing data quality in CISs requires solving problems from many research areas of computer science, such as databases, software engineering, distributed computing, security, and information systems. This implies that the proposal of integrated solutions is very challenging. In this article, an architecture to support data quality management in CISs has been described; such an architecture consists of modules that provide some solutions to the principal data quality problems in CISs; namely, quality-driven query answering, data quality access, data quality maintenance, and trust management. The architecture is suitable in all contexts in which data stored by different and distributed sources are overlapping and affected by data errors. Among such contexts, the architecture has been validated in the Italian e-government scenario, in which all public administrations store data about citizens that need to be corrected and reconciled.

REFERENCES

Batini, C., & Mecella, M. (2001). Enabling Italian e-government through a cooperative architecture. IEEE Computer, 34(2), 40-45.

Berti-Equille, L. (2003). Quality-extended query processing for distributed processing. Proceedings of the ICDT'03 International Workshop on Data Quality in Cooperative Information Systems (DQCIS'03), Siena, Italy.

Bertolazzi, P., & Scannapieco, M. (2001). Introducing data quality in a cooperative context. Proceedings of the 6th International Conference on Information Quality (ICIQ'01), Boston, Massachusetts.

Bressan, S. et al. (1997). The context interchange mediator prototype. Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona.

Cappiello, C. et al. (2003). Data quality assurance in cooperative information systems: A multi-dimension quality certificate. Proceedings of the ICDT'03 International Workshop on Data Quality in Cooperative Information Systems (DQCIS'03), Siena, Italy.

De Michelis, G. et al. (1997). Cooperative information systems: A manifesto. In M. P. Papazoglou & G. Schlageter (Eds.), Cooperative information systems: Trends & directions. London: Academic Press.

De Santis, L., Scannapieco, M., & Catarci, T. (2003). Trusting data quality in cooperative information systems. Proceedings of the 11th International Conference on Cooperative Information Systems (CoopIS'03), Catania, Italy.

Fan, W., Lu, H., Madnick, S., & Cheung, D. (2001). Discovering and reconciling value conflicts for numerical data integration. Information Systems, 26(8), 635-656.

Fellegi, I., & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.

Fellegi, I., & Sunter, A. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183-1210.

Galhardas, H. et al. (2000). An extensible framework for data cleaning. Proceedings of the 16th International Conference on Data Engineering (ICDE 2000), San Diego, California.

Hernandez, M., & Stolfo, S. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2), 9-37.

Jarke, M., Lenzerini, M., Vassiliou, V., & Vassiliadis, P. (Eds.). (1995). Fundamentals of data warehouses. Berlin, Heidelberg, Germany: Springer Verlag.
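The quality brokering and negotiation loop described above (query all peers, keep copies meeting the quality constraint, select the best-quality copy, and weaken the constraint when nothing qualifies) can be sketched in Python. The function names, quality scores, and fixed-step weakening policy below are invented for illustration and are not part of the DQB design:

```python
# Hypothetical sketch of a quality broker: each source is modeled as a
# callable returning a (value, quality) pair, with quality in [0, 1].

def broker_query(sources, key, min_quality, step=0.1):
    """Ask all cooperating sources for `key`, keep the copies meeting the
    quality constraint, and return the best-quality copy together with the
    threshold actually used. If no copy qualifies, weaken the constraint
    (the negotiation step) and retry."""
    copies = [source(key) for source in sources]   # (value, quality) pairs
    threshold = min_quality
    while threshold > 0:
        qualifying = [c for c in copies if c[1] >= threshold]
        if qualifying:
            return max(qualifying, key=lambda c: c[1]), threshold
        threshold = round(threshold - step, 10)    # user weakens the constraint
    return None, 0.0

org1 = lambda key: ("Rome, IT", 0.6)        # lower-quality copy
org2 = lambda key: ("Roma, Italia", 0.9)    # higher-quality copy
best, used = broker_query([org1, org2], "citizen-address", 0.8)
print(best, used)   # ('Roma, Italia', 0.9) 0.8
```

In the architecture itself the reconciliation of copies and the negotiation with the user are far richer (a quality-enhanced query over a global schema); this sketch only shows the select-best-then-weaken control flow.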
Data Quality in Data Warehouses
If R < Tλ, then designate the pair as a nonmatch.

The cutoff thresholds Tμ and Tλ are determined by a priori error bounds on false matches and false nonmatches. Rule 2 agrees with intuition. If γ ∈ Γ consists primarily of agreements, then γ would intuitively be more likely to occur among matches than nonmatches, and Ratio 1 would be large. On the other hand, if γ ∈ Γ consists primarily of disagreements, then Ratio 1 would be small. Rule 2 partitions the set Γ into three disjoint subregions. The region Tλ < R < Tμ is referred to as the no-decision region or clerical-review region. In some situations, resources are available to review pairs clerically.

Linkages can be error prone in the absence of strong or unique identifiers such as a verified social security number that identifies an individual record or entity. Weak identifiers such as name, address, and other nonuniquely identifying information are used. The combination of weak identifiers can determine whether a pair of records represents the same entity. If errors or differences exist in the representations of names and addresses, then many duplicates can erroneously be added to a warehouse. For instance, the name of a business may be "John K Smith and Company" in one file and "J. K. Smith, Inc." in another file. Without additional corroborating information such as addresses, it is difficult to determine whether the two names correspond to the same entity. With three addresses such as "123 E. Main Street," "123 East Main St.," and "P.O. Box 456" and the two names, the linkage can still be quite difficult. With suitable preprocessing methods, it may be possible to represent the names in forms in which the different components can be compared. To use addresses of the forms "123 E. Main Street" and "P.O. Box 456," it may be necessary to use an auxiliary file or expensive follow-up that indicates that the addresses have at some time been associated with the same entity.

If individual fields have a minor typographical error, then string comparators that account for such errors can allow effective comparisons (Winkler, 1995, 2004b; Cohen, Ravikumar, & Fienberg, 2003). Individual fields might be first name, last name, and street name, which are delineated by standardization software. Rule-based methods of standardization are available in commercial software for addresses and in other software for names (Winkler, 1995, 1999b). The probabilities in Equations 1 and 2 are referred to as matching parameters. If training data consisting of matched and unmatched pairs are available, then a supervised method requiring training data can be used for estimation of the matching parameters. Optimal matching parameters can sometimes be estimated via unsupervised learning methods, such as the EM algorithm. The parameters are known to vary significantly across files (Winkler, 1999b). They can even vary significantly across similar files representing an urban area and an adjacent suburban area. If two files each contain 1,000 or more records, then bringing together all pairs from the two files is impractical, due to the small number of potential matches within the total set of pairs. Blocking is the method of considering only pairs that agree exactly (character by character) on subsets of fields. For instance, a set of blocking criteria may be to consider only pairs that agree on the U.S. Postal zip code and the first character of the last name. Additional blocking passes may be needed to obtain matching pairs that are missed by earlier blocking passes (Newcombe et al., 1959; Hernandez & Stolfo, 1995; Winkler, 2004a).

Statistical Data Editing and Imputation

Correcting inconsistent information and filling in missing information needs to be efficient and cost effective. For single fields, edits are straightforward. A lookup table may yield correct diagnostic or zip codes. For multiple fields, an edit might require that an individual younger than 15 years of age must have a marital status of unmarried. If a record fails this edit, then a subsequent procedure would need to change either the age or the marital status.

Editing has been done extensively in statistical agencies since the 1950s. Early work was clerical. Later, computer programs applied if-then-else rules with logic similar to the clerical review. The main disadvantage was that edits that did not fail initially for a record would fail as the values in fields associated with edit failures were changed. Fellegi and Holt (1976) provided a theoretical model. In providing their model, they had three goals:

1. The data in each record should be made to satisfy all edits by changing the fewest possible variables (fields).
2. Imputation rules should derive automatically from edit rules.
3. When imputation is necessary, it should maintain the joint distribution of variables.

Fellegi and Holt (1976; Theorem 1) proved that implicit edits are needed for solving the problem of Goal 1. Implicit edits are those that can be logically derived from explicitly defined edits. Implicit edits provide information about edits that do not fail initially for a record but may fail as the values in fields that are associated with failing edits are changed. The following example illustrates some of the computational issues. An edit can be considered as a set of points. Let edit E =
{married & age ≤ 15}. Let r be a data record. Then r ∈ E => r fails the edit. This formulation is equivalent to "If age ≤ 15, then not married." If a record r fails a set of edits, then one field in each of the failing edits must be changed. An implicit edit E3 can be implied from two explicitly defined edits E1 and E2; i.e., E1 and E2 => E3.

E1 = {age ≤ 15, married, . }
E2 = { . , not married, spouse}
E3 = {age ≤ 15, . , spouse}

The edits restrict the fields age, marital status, and relationship to head of household. Implicit edit E3 is derived from E1 and E2. If E3 fails for a record r = {age ≤ 15, not married, spouse}, then necessarily either E1 or E2 fails. Assume that the implicit edit E3 is unobserved. If edit E2 fails for record r, then one possible correction is to change the marital status field in record r to married in order to obtain a new record r1. Record r1 does not fail for E2 but now fails for E1. The additional information from edit E3 assures that record r satisfies all the edits after changing one additional field. For larger data situations with more edits and more fields, the number of possibilities increases at a very high exponential rate.

In data warehouse situations, the ease of implementing the ideas of Fellegi and Holt (1976) by using generalized edit software is dramatic. An analyst who has knowledge of the edit situations might put together the edit tables in a relatively short time. Kovar and Winkler (1996) compared two edit systems on economic data. Both were installed and run in less than one day. In many

Chaudhuri, Ganjam, Ganti, and Motwani (2003) provide a method of indexing that significantly improves over brute-force methods, in which all pairs in two files are compared. Second, some research deals with better string comparators for comparing fields having typographical errors. Cohen et al. (2003) have methods that improve over the basic Jaro-Winkler methods (Winkler, 2004b). Third, another research area investigates methods to standardize and parse general names and address fields into components that can be more easily compared. Borkar, Deshmukh, and Sarawagi (2001), Churches, Christen, Lu, and Zhu (2002), and Agichtein and Ganti (2004) have Hidden Markov methods that work as well as or better than some of the rule-based methods. Although the Hidden Markov methods require training data, Churches et al. provide methods for quickly creating additional training data for new applications. The additional training data supplement a core set of generic training data. The training data consist of free-form records and the corresponding records that have been processed into components. Fourth, because training data are rarely available and optimal matching parameters are needed, some research investigates unsupervised learning methods that do not require training data. Ravikumar and Cohen (2004) have unsupervised learning methods that improve over the basic EM methods of Winkler (1995, 1999b, 2003b). Their methods are even competitive with supervised learning methods. Fifth, other research considers methods of (nearly) automatic error rate estimation with little or no training data. Winkler (2002) considers methods that use unlabeled data and very small
business situations, only a few simple edit rules might be subsets of labeled data (training data). Sixth, some
needed. In their books on data quality, Redman (1996), research considers methods for adjusting statistical and
English (1999), and Loshin (2001) have described edit- other types of analyses for matching error. Lahiri and
ing files to assure that business rules are satisfied. The Larsen (2004) have methods for improving the accuracy
authors have not noted the difficulties of applying of statistical analyses in the presence of matching error.
hardcoded if-then-else rules and the relative ease of Their methods extend ideas introduced by Scheuren and
applying Fellegi and Holts methods. Winkler (1993, 1997).
There are three trends for edit/imputation research.
The first trend comprises faster ways of determining
FUTURE TRENDS the minimum number of variables containing values
that contradict the edit rules. DeWaal (2003a, 2003b,
There are several trends in data cleaning. First, some 2003c) applies Fourier-Motzkin and cardinality con-
research considers better search strategies to compen- strained Chernikova algorithms that allow direct bound-
sate for typographical error. Winkler (2003b, 2004a) ing of the number of computational paths. Riera-
applies efficient blocking strategies for bringing to- Ledesma and Salazar-Gonzalez (2004) apply clever
gether pairs in a situation where one file and its indexes heuristics in the setup of the problems that allow direct
are held in memory. This allows the matching of a mod- integer programming methods to perform much faster.
erate size file of 100 million records (which will reside The second trend is to apply machine learning methods.
in 4 GB of memory) against large administrative files of Nordbotten (1995) applies neural nets to the basic
upwards of 4 billion records. This type of record linkage editing problem. Di Zio, Scanu, Coppola, Luzi, and
requires only one pass against both files, whereas con- Ponti (2004) apply Bayesian networks to the imputation
ventional record linkage requires many sorts, matching problem. The advantage of these approaches is that they
passes of both files, and large amounts of disk space. do not require the detailed rule elicitation of other edit
304
TEAM LinG
Data Quality in Data Warehouses
approaches; they depend on representative training data. De Waal, T. (2003b). A fast and simple algorithm for
The training data consists of unedited records and the automatic editing of mixed data. Journal of Official Sta- ,
corresponding edited records after review by subject tistics, 19(4), 383-402.
matter specialists. The third trend comprises methods
that preserve both statistical distributions and edit con- De Waal, T. (2003c). Processing of erroneous and un-
straints in the preprocessed data. Winkler (2003a) con- safe data. Rotterdam: ERIM Research in Management.
nects the generalized imputation methods of Little and Di Zio, M., Scanu, M., Coppola, L., Luzi, O., & Ponti, A.
Rubin (2002) with generalized edit methods of Winkler (2004). Bayesian networks for imputation. Journal of
(1999a, 2003b). The potential advantage is that the the Royal Statistical Society, A, 167(2), 309-322.
statistical properties needed for data mining may be
preserved. English, L. P. (1999). Improving data warehouse and
business information quality: Methods for reducing
costs and increasing profits. New York: Wiley.
CONCLUSION Fayyad, U., & Uthurusamy, R. (2002). Evolving data mining
into solutions for insights. Communications of the Asso-
To data mine effectively, data need to be preprocessed ciation of Computing Machinery, 45(8), 28-31.
in a variety of steps that include removing duplicates,
performing statistical data editing and imputation, and Fellegi, I. P., & Holt, D. (1976). A systematic approach
doing other cleanup and regularization of the data. If to automatic edit and imputation. Journal of the Ameri-
moderate errors exist in the data, data mining may waste can Statistical Association, 71, 17-35.
computational and analytic resources with little gain in Fellegi, I. P., & Sunter, A. B. (1969). A theory of record
knowledge. linkage. Journal of the American Statistical Associa-
tion, 64, 1183-1210.
Redman, T. C. (1996). Data quality in the information age. Boston, MA: Artech.

Riera-Ledesma, J., & Salazar-Gonzalez, J.-J. (2004). A branch-and-cut algorithm for the error localization problem in data cleaning (Tech. Rep.). Tenerife, Spain: Universidad de la Laguna.

Scheuren, F., & Winkler, W. E. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 19, 39-58.

Scheuren, F., & Winkler, W. E. (1997). Regression analysis of data files that are computer matched, II. Survey Methodology, 23, 157-165.

Winkler, W. E. (1995). Matching and record linkage. In B. G. Cox (Ed.), Business survey methods (pp. 355-384). New York: Wiley.

Winkler, W. E. (1999a). The state of statistical data editing. In Statistical data editing (pp. 169-187). Rome: ISTAT.

Winkler, W. E. (1999b). The state of record linkage and current research problems. Proceedings of the Survey Methods Section, Statistical Society of Canada (pp. 73-80).

Winkler, W. E. (2002). Record linkage and Bayesian networks. Proceedings of the Section on Survey Research Methods, American Statistical Association. Retrieved from http://www.census.gov/srd/www/byyear.html

Winkler, W. E. (2003a). A contingency table model for imputing data satisfying analytic constraints. Proceedings of the Section on Survey Research Methods, American Statistical Association. Retrieved from http://www.census.gov/srd/www/byyear.html

Winkler, W. E. (2003b). Data cleaning methods. Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, USA. Retrieved from http://csaa.byu.edu/kdd03cleaning.html

Winkler, W. E. (2004a). Approximate string comparator search strategies for very large administrative lists. Proceedings of the Section on Survey Research Methods, American Statistical Association.

Winkler, W. E. (2004b). Methods for evaluating and creating data quality. Information Systems, 29(7), 531-550.

KEY TERMS

Data Cleaning: The methodology of identifying duplicates in a single file or across a set of files by using a name, address, and other information.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, such as classification, prediction, estimation, or affinity grouping.

Edit Restraints: Logical restraints, such as business rules, that assure that an employee's listed salary in a job category is not too high or too low, or that certain contradictory conditions, such as a male hysterectomy, do not occur.

Imputation: The method of filling in missing data that sometimes preserves statistical distributions and satisfies edit restraints.

Preprocessed Data: In preparation for data mining, data that have been through preprocessing such as data cleaning or edit/imputation.

Rule Induction: The process of learning, from cases or instances, if-then rule relationships consisting of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

Training Data: A representative subset of records for which the truth of classifications and relationships is known and that can be used for rule induction in machine learning models.
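To make the implicit-edit example from this article concrete, the following sketch (illustrative only; the field names and record encoding are my own, not from the original) expresses E1, E2, and the implied edit E3 as failure predicates and shows that changing a single field of the failing record is not enough:

```python
# Toy encoding of the explicit edits E1, E2 and the implicit edit E3.
# An edit predicate returns True when a record FAILS that edit.
E1 = lambda r: r["age"] <= 15 and r["marital"] == "married"
E2 = lambda r: r["marital"] != "married" and r["relationship"] == "spouse"
E3 = lambda r: r["age"] <= 15 and r["relationship"] == "spouse"  # implied by E1 and E2

def failing_edits(record, edits):
    """Return the names of all edits that the record fails."""
    return [name for name, edit in edits.items() if edit(record)]

edits = {"E1": E1, "E2": E2, "E3": E3}
r = {"age": 15, "marital": "not married", "relationship": "spouse"}
r1 = dict(r, marital="married")     # one possible single-field correction

print(failing_edits(r, edits))      # ['E2', 'E3']
print(failing_edits(r1, edits))     # ['E1', 'E3']: a second field must change
```

Because E3 fails for both r and r1, the correction must also change the relationship field, exactly as the edit example argues.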
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Data Reduction and Compression in Database Systems
where Λ is a diagonal matrix of eigenvalues λn = sn²/M. We assume, without loss of generality, that the eigenvalues are in nonincreasing order, so that the transformation of coordinates into the principal components will yield Y = XV, whose columns are in decreasing order of their energy or variance. There is a reduction in the rank of the matrix when some eigenvalues are equal to zero. If we retain the first p columns of Y, the Normalized Mean Square Error (NMSE) is equal to the sum of the eigenvalues of the discarded columns divided by the trace of the matrix (the sum of the eigenvalues or diagonal elements of C, which remains invariant). A significant reduction in the number of columns can be attained at a relatively small NMSE, as shown in numerous studies (see Korn, Jagadish, & Faloutsos, 1997).

Higher dimensional data, as in the case of data warehouses, for example, (product, customer, date) → (dollars), can be reduced to two dimensions by appropriate transformations, for example, with products as rows and (customer × date) as columns.

The columns of dataset X may not be globally correlated; for example, high-income customers buy expensive items, and low-income customers buy economy items, so that the items bought by these two groups of customers are disjoint. Higher data compression (for a given NMSE) can be attained by first clustering the data, using an off-the-shelf clustering method, such as k-means (Dunham, 2003), and then applying SVD to the clusters (Castelli, Thomasian, & Li, 2003). More sophisticated clustering methods, which generate elliptical clusters, may yield higher dimensionality reduction. An SVD-friendly clustering method, which generates clusters amenable to dimensionality reduction, is proposed in Chakrabarti and Mehrotra (2000).

K-nearest-neighbor (k-NN) queries can be carried out with respect to a dataset that has been subjected to SVD by first transforming the query point to the appropriate coordinates by using the principal components. In the case of multiple clusters, we first need to determine the cluster to which the query point belongs. In the case of the k-means clustering method, the query point belongs to the cluster with the closest centroid. After determining the k nearest neighbors in the primary cluster, I need to determine if other clusters are to be searched. A cluster is searched if the hypersphere centered on the query point, with the k nearest neighbors inside it, intersects with the hypersphere of that cluster. This step is repeated until no more intersections exist.

Multidimensional scaling (MS) is another method for dimensionality reduction (Kruskal & Wish, 1978). Given the pair-wise distances or dissimilarities among a set of objects, the goal of MS is to represent them in k dimensions so that their distances are preserved. A stress function, which is the sum of squares of the difference between the distances of points with k dimensions and the original distances, is used to represent the goodness of the fit. The value of k should be selected to be as small as possible, while stress is maintained at an appropriately low level. A fast, approximate alternative is FASTMAP, whose goal is to find a k-dimensional space that matches the distances of an N×N matrix for N points (Faloutsos & Lin, 1995).

WAVELETS

According to Fourier's theorem, a continuous function can be expressed as the sum of sinusoidal functions. A discrete signal with n points can be expressed by the n coefficients of a Discrete Fourier Transform (DFT). According to Parseval's theorem, the energy in the time and frequency domains are equal (Faloutsos, 1996).

The DFT consists of the sum of sine and cosine functions. I am interested in transforms that can capture a vector with as few coefficients as possible. The Discrete Cosine Transform (DCT) achieves better energy concentration than DFT and also solves the frequency-leak problem that plagues DFT (Agrawal, Faloutsos, & Swami, 1993).

The Discrete Wavelet Transform (DWT) is also related to DFT but achieves better lossy data compression. The Haar transform is a simple wavelet transform that operates on a time sequence and computes the sum and difference of its halves, recursively. DWT can be applied to signals with multiple dimensions, one dimension at a time (Press, Teukolsky, Vetterling, & Flannery, 1996). To illustrate how a single-dimensional wavelet transform works, consider an image with four pixels having the following values: [9, 7, 3, 5] (Stollnitz, Derose, & Salesin, 1996). We obtain a lower resolution image by substituting pairs of pixel values with their average: [8, 4]. Information is lost due to down sampling. The original pixels can be recovered by storing the detail coefficients, given as 1 = 9 − 8 and −1 = 3 − 4, that is, [1, −1]. Another averaging and detailing step yields [6] and [2]. The wavelet transform of the original image is then [6, 2, 1, −1]. In fact, for normalization purposes, the last two coefficients have to be divided by the square root of 2. Wavelet compression is attained by not retaining all the coefficients.

As far as data compression in data warehousing is concerned, a k-d DWT can be applied to a k-d data cube to obtain a compressed approximation by saving a fraction of the strongest coefficients. An approximate computation of multidimensional aggregates for sparse data using wavelets is reported in Vitter and Wang (1999).
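The four-pixel example above can be reproduced with a short sketch of the unnormalized Haar transform; this is an illustration of the averaging-and-detailing recursion, not production code:

```python
def haar_1d(signal):
    """Unnormalized 1-D Haar wavelet transform: recursively replace the
    signal by pairwise averages, collecting the detail coefficients
    (half-differences) produced at each level."""
    data = list(signal)
    coeffs = []
    while len(data) > 1:
        averages = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        details = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        coeffs = details + coeffs   # coarser-level details end up in front
        data = averages
    return data + coeffs            # overall average comes first

print(haar_1d([9, 7, 3, 5]))        # [6.0, 2.0, 1.0, -1.0]
```

Dropping the smallest coefficients, and inverting the recursion on what remains, gives the lossy compression described above.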
numerous multidimensional indexing methods (MDIMs) have been proposed (Gaede & Guenther, 1998), although their usage in operational database systems remains rather limited. MDIMs have been classified into data-partitioning and space-partitioning methods. In the former case, the partitioning is carried out based on insertions and deletions of multidimensional points in the index. Examples are R-trees and their variations. Space-partitioning methods, such as quad-trees and k-d-b trees, recursively partition the space globally when local overflows occur. As a result, space-partitioning methods may not be balanced. I will discuss only data-partitioning methods in the remainder of this article.

MDIMs tend to be viable for a limited number of dimensions, and as the number of dimensions increases, the dimensionality curse sets in (Faloutsos, 1996). In effect, the index loses its effectiveness, because a rather large fraction of the pages constituting the index are touched as part of query processing.

Indexes need some extensions to provide summary data for data reduction. They can be considered as hierarchical histograms, but there is no easy way to extract this information from an index.

SAMPLING

Sampling achieves data compression by selecting an appropriate subset of a large dataset to which a (relational) query is applied. Sampling can be categorized as follows (Han & Kamber, 2001): (a) simple random sample without replacement, (b) simple random sample with replacement, (c) cluster sample, and (d) stratified sample (examples follow).

Consider a large university that maintains student records in a relational table. Two columns of this table are of interest: the year of study (i.e., freshman, sophomore, junior, senior) and the grade point average (GPA). The average GPA by year of study can be specified succinctly as an SQL GROUP BY query. Assuming that neither column is indexed, instead of scanning the table to obtain the average GPAs, the system may randomly select records from the table to carry out this task. Online query processing is possible in this context; that is, the system starts displaying to the user the average GPAs by year, based on the samples it has obtained so far, together with an error bound. The running of the query can be stopped as soon as the user is satisfied (Hellerstein, Haas, & Wang, 1997). When the data is not ordered by one of the attributes under consideration (for example, it is ordered alphabetically by student last names), then all the records in a page can be used in sampling to reduce the number of disk accesses. This is an example of a cluster sample. Stratified sampling could be attained if the records were indexed according to the year.

FUTURE TRENDS

With rapid increases in the volume of data being held for data mining and data warehousing, lossy yet accurate data compression methods are required. This is also important from the viewpoint of collecting data from remote sources with low-bandwidth transmission capability. The new field of data streams (Golab & Ozsu, 2003) uses summarization methods, such as transmitting averages rather than detailed values.

CONCLUSION

I have provided a summary of data reduction methods applicable to data warehousing and databases in general. I have also discussed data compression. Appropriate references are given for further study.

ACKNOWLEDGMENT

Supported by NSF through Grant 0105485 in Computer Systems Architecture.

REFERENCES

Agrawal, A., Faloutsos, C., & Swami, A. (1993). Efficient similarity search in sequence databases. Proceedings of the Foundations of Data Organization and Algorithms Conference (pp. 69-84), USA.

Barbara, D., et al. (1997). The New Jersey data reduction report. Data Engineering Bulletin, 20(4), 3-42.

Castelli, V., Thomasian, A., & Li, C. S. (2003). CSVD: Clustering and singular value decomposition for approximate similarity search in high dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 15(3), 671-685.

Chakrabarti, K., & Mehrotra, S. (2000). Local dimensionality reduction: A new approach to indexing high-dimensional spaces. Proceedings of the 26th International Conference on Very Large Data Bases (pp. 89-100), Egypt.

Dunham, M. H. (2003). Data mining: Introductory and advanced topics. Prentice-Hall.
Faloutsos, C. (1996). Searching multimedia databases by content. Kluwer Academic Publishers.

Faloutsos, C., & Lin, K. I. (1995). Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Proceedings of the ACM SIGMOD International Conference (pp. 163-174), USA.

Gaede, V., & Guenther, O. (1998). Multidimensional indexing methods. ACM Computing Surveys, 30(2), 170-231.

Golab, L., & Ozsu, M. T. (2003). Data stream management issues: A survey. ACM SIGMOD Record, 32(2), 5-14.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Hellerstein, J. M., Haas, P. J., & Wang, H. J. (1997). Online aggregation. Proceedings of the ACM SIGMOD International Conference (pp. 171-182), USA.

Korn, F., Jagadish, H., & Faloutsos, C. (1997). Efficiently supporting ad hoc queries in large datasets of time sequences. Proceedings of the ACM SIGMOD International Conference (pp. 289-300), USA.

Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills, CA: Sage Publications.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1996). Numerical recipes in C: The art of scientific computing. Cambridge University Press.

Ramakrishnan, K., & Gehrke, J. (2003). Database management systems (3rd ed.). McGraw-Hill.

Sayood, K. (2002). Introduction to data compression (2nd ed.). Elsevier.

Stollnitz, E. J., Derose, T. D., & Salesin, D. H. (1996). Wavelets for computer graphics: Theory and applications. Prentice-Hall.

Vetterli, M., & Kovacevic, J. (1995). Wavelets and subband coding. Prentice-Hall.

Vitter, J. S., & Wang, M. (1999). An approximate computation of multidimensional aggregates of sparse data using wavelets. Proceedings of the ACM SIGMOD International Conference (pp. 193-204), USA.

Witten, I. H., Bell, T., & Moffat, A. (1999). Managing gigabytes: Compressing and indexing documents and images (2nd ed.). Morgan Kaufmann.

KEY TERMS

Clustering: The process of grouping objects based on their similarity and dissimilarity. Similar objects should be in the same cluster, which is different from the cluster for dissimilar objects.

Histogram: A data structure that maintains one or more attributes or columns of a relational DBMS to assist the query optimizer.

Index Tree: Partitions the space in a single or multiple dimensions for efficient access to the subset of the data that is of interest.

Karhunen-Loeve Transform (KLT): Utilizes principal component analysis or singular value decomposition to minimize the distance error introduced for a given level of dimensionality reduction.

Principal Component Analysis (PCA): Computes the eigenvectors for the principal components and uses them to transform a matrix X into a matrix Y, whose columns are aligned with the principal components. Dimensionality is reduced by discarding the columns in Y with the least variance or energy.

Sampling: A technique for selecting units from a population so that by studying the sample, you may fairly generalize your results back to the population.

Singular Value Decomposition (SVD): Attains the same goal as PCA by decomposing the X matrix into a U matrix, a diagonal matrix of eigenvalues, and a matrix of eigenvectors the same as those obtained by PCA.

Wavelet Transform: A method to transform data so that it can be represented compactly.
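The GROUP BY sampling example from the SAMPLING section of this article can be sketched as follows; the student table is synthetic, and a simple random sample without replacement stands in for a full table scan:

```python
import random

# Synthetic stand-in for the student table: (year of study, GPA) pairs.
random.seed(7)
years = ["freshman", "sophomore", "junior", "senior"]
table = [(random.choice(years), round(random.uniform(1.0, 4.0), 2))
         for _ in range(100_000)]

# Simple random sample without replacement, instead of scanning the table.
sample = random.sample(table, 2_000)

# Approximate the SQL GROUP BY average from the sample alone.
totals, counts = {}, {}
for year, gpa in sample:
    totals[year] = totals.get(year, 0.0) + gpa
    counts[year] = counts.get(year, 0) + 1
estimates = {y: totals[y] / counts[y] for y in totals}
# Each per-year estimate approximates the exact average (about 2.5 here).
```

An online-aggregation system would keep refining such estimates, together with an error bound, as more samples arrive, letting the user stop the query early (Hellerstein, Haas, & Wang, 1997).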
Data Warehouse Back-End Tools
Dimitri Theodoratos
New Jersey Institute of Technology, USA
BACKGROUND
A Data Warehouse (DW) is a collection of technologies aimed at enabling the knowledge worker (executive, manager, analyst, etc.) to make better and faster decisions. Data warehouses typically are divided into the front-end part, concerning end users who access the data warehouse with decision-support tools, and the back-stage part, where the collection, integration, cleaning, and transformation of data takes place in order to populate the warehouse. The architecture of a data warehouse exhibits various layers of data in which data from one layer are derived from data of the previous layer (Figure 1). The processes that take part in the back stage of the data warehouse are data intensive, complex, and costly (Vassiliadis, 2000). Several reports mention that most of these processes are constructed through an in-house development procedure that can consume up to 70% of the resources for a data warehouse project (Gartner, 2003).

In order to facilitate and manage the data warehouse operational processes, commercial tools exist in the market under the general title Extraction-Transformation-Loading (ETL) tools. To give a general idea of the functionality of these tools, we mention their most prominent tasks, which include (a) the identification of relevant information at the source side; (b) the extraction of this information; (c) the customization and integration of the information coming from multiple sources into a common format; (d) the cleaning of the resulting data set on the basis of database and business rules; and (e) the propagation of the data to the data warehouse and/or data marts. In the sequel, we will adopt the general acronym ETL for all kinds of in-house or commercial tools and all the aforementioned categories of tasks.

In Figure 2, we abstractly describe the general framework for ETL processes. On the left side, we can observe the original data providers (sources). Typically, data providers are relational databases and files. The data from these sources are extracted by extraction routines, which provide either complete snapshots or differentials of the data sources. Then, these data are propagated to the Data Staging Area (DSA), where they are transformed and cleaned before being loaded to the data warehouse. Intermediate results, again in the form of (mostly) files or relational tables, are part of the data-staging area. The data warehouse is depicted in the right part of Figure 2 and comprises the target data stores (i.e., fact tables for the storage of information and dimension tables with the description and the multidimensional, roll-up hierarchies of the stored facts). The loading of the central warehouse is performed by the loading activities depicted right before the data warehouse data store.
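As a rough sketch of task categories (a)-(e), an ETL run can be pictured as a chain of small functions. All names, record layouts, and rules below are hypothetical illustrations, not the API of any ETL tool:

```python
def extract(source_rows):
    """(a) + (b): identify and extract the relevant source records."""
    return [r for r in source_rows if r.get("relevant", True)]

def integrate(rows):
    """(c): customize and integrate records into a common format."""
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def clean(rows):
    """(d): enforce a hypothetical business rule: amounts are non-negative."""
    return [r for r in rows if r["amount"] >= 0]

def load(warehouse, rows):
    """(e): propagate the cleaned data to the warehouse (a plain list here)."""
    warehouse.extend(rows)

source = [{"id": 1, "amount": "10.50"},
          {"id": 2, "amount": "-3.00"},              # violates the rule
          {"id": 3, "amount": "7.25", "relevant": False}]
dw = []
load(dw, clean(integrate(extract(source))))          # staging area elided
print(dw)                                            # [{'id': 1, 'amount': 10.5}]
```

In a real ETL workflow, each intermediate result would be materialized in the data staging area rather than passed directly between functions.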
Data Warehouse Back-End Tools
Figure 2. The environment of extract-transformation- Scalzo (2003) mentions that 90% of the problems in
load processes data warehouses arise from the nightly batch cycles that ,
load the data. At this stage, the administrators have to
Extract Transform Load deal with problems like (a) efficient data loading and (b)
& Clean
concurrent job mixture and dependencies. Moreover, ETL
processes have global time constraints, including the
time they must be initiated and their completion deadlines.
In fact, in most cases, there is a tight time window in the
night that can be exploited for the refreshment of the data
warehouse, since the source system is off-line or not
heavily used during this period. Other general problems
Sources DSA DW include the scheduling of the overall process, the finding
of the right execution order for dependent jobs and job
sets on the existing hardware for the permitted time
State of the Art schedule, and the maintenance of the information in the
data warehouse.
In the past, there have been research efforts toward the
design and optimization of ETL tasks. We mention three Phase I: Extraction and Transportation
research prototypes: (a) the AJAX system (Galhardas et
al., 2000); (b) the Potters Wheel system (Raman & During the ETL process, a first task that must be per-
Hellerstein, 2001); and (c) ARKTOS II (Arktos II, 2004). The formed is the extraction of the relevant information that
first two prototypes are based on algebras, which we find has to be propagated further to the warehouse
mostly tailored for the case of homogenizing Web data; (Theodoratos et al., 2001). In order to minimize the overall
the latter concerns the modeling and the optimization of processing time, this involves only a fraction of the
ETL processes in a customizable and extensible manner. source data that has changed since the previous execu-
An extensive review of data quality problems and tion of the ETL process, mainly concerning the newly
related literature, along with quality management method- inserted and possibly updated records. Usually, change
ologies, can be found in Jarke, et al. (2000). Rundensteiner detection is performed physically by the comparison of
(1999) offers a discussion of various aspects of data two snapshots (one corresponding to the previous ex-
transformations. Sarawagi (2000) offers a similar collec- traction and the other to the current one). Efficient algo-
tion of papers in the field of data, including a survey rithms exist for this task, like the snapshot differential
(Rahm & Do, 2000) that provides an extensive overview of algorithms presented in Labio and Garcia-Molina (1996).
the field, along with research issues and a review of some Another technique is log sniffing (i.e., the scanning of the
commercial tools and solutions on specific problems log file in order to reconstruct the changes performed
(Monge, 2000; Borkar et al., 2000). In a related but different since the last scan). In rare cases, change detection can
context, we would like to mention the IBIS tool (Cal et al., be facilitated by the use of triggers. However, this solu-
2003). IBIS is an integration tool following the global-as- tion is technically impossible for many of the sources that
view approach to answer queries in a mediated system. are legacy systems or plain flat files. In numerous other
Moreover, there is a variety of ETL tools in the market. cases, where relational systems are used at the source
Simitsis (2003) lists the ETL tools available at the time that side, the usage of triggers also is prohibitive due to the
this paper was written. performance degradation that their usage incurs and to
the need to intervene in the structure of the database.
Moreover, another crucial issue concerns the transporta-
MAIN THRUST tion of data after the extraction, where tasks like ftp,
encryption-decryption, compression-decompression, and
In this section, we briefly review the problems and con- so forth can possibly take place.
straints that concern the overall ETL process, as well as
the individual issues that arise separately in each phase Phase II: Transformation and Cleaning
of an ETL process (extraction and exportation, transfor-
mation and cleaning, and loading). Simitsis (2004) offers It is possible to determine typical tasks that take place
a detailed study on the problems described in this paper during the transformation and cleaning phase of an ETL
and presents a framework toward the modeling and the process. Rahm and Do (2000) further detail this phase in
optimization of ETL processes. the following tasks: (a) data analysis; (b) definition of
313
TEAM LinG
Data Warehouse Back-End Tools
transformation workflow and mapping rules; (c) verification; (d) transformation; and (e) backflow of cleaned data. In terms of the transformation tasks, we distinguish two main classes of problems (Lenzerini, 2002): (a) conflicts and problems at the schema level (e.g., naming and structural conflicts) and (b) data-level transformations (i.e., at the instance level).

The integration and transformation programs perform a wide variety of functions, such as reformatting data, recalculating data, modifying key structures of data, adding an element of time to data warehouse data, identifying default values of data, supplying logic to choose between multiple sources of data, summarizing data, merging data from multiple sources, and so forth.

In the sequel, we present four common ETL transformation cases as examples: (a) semantic normalization and denormalization; (b) surrogate key assignment; (c) slowly changing dimensions; and (d) string problems. The research prototypes presented in the previous section and several commercial tools have already made some progress toward tackling problems like these four. Still, their presentation in this paper aims to make the reader see that the process as a whole must be distinguished from the way integration issues have been resolved so far.

Surrogate Key Assignment

In the data warehouse, the production keys are replaced by uniquely generated keys, which we call a surrogate key (Kimball et al., 1998). The basic reasons for this replacement are performance and semantic homogeneity. Performance is affected by the fact that textual attributes are not the best candidates for indexed keys and need to be replaced by integer keys. More importantly, semantic homogeneity causes reconciliation problems, since different production systems might use different keys for the same object (synonyms) or the same key for different objects (homonyms), resulting in the need for a global replacement of these values in the data warehouse. Observe row (20,green) in table Src_1 of Figure 3. This row has a synonym conflict with row (10,green) in table Src_2, since they represent the same real-world entity with different IDs, and a homonym conflict with row (20,yellow) in table Src_2 (over attribute ID). The production key ID is replaced by a surrogate key through a lookup table of the form Lookup(SourceID,Source,SurrogateKey). The Source column of this table is required, because there can be synonyms in the different sources, which are mapped to different objects in the data warehouse (e.g., value 10 in tables Src_1 and Src_2). At the end of this process, the data warehouse table DW has globally unique, reconciled keys.
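The lookup-based replacement can be sketched in a few lines of Python. This is a hedged illustration only: the source rows follow the Figure 3 example described in the text, but the surrogate key values (100, 200) and the function and variable names are our own assumptions.

```python
# Sketch of surrogate key assignment via a Lookup(SourceID, Source, SurrogateKey)
# table. The lookup contents mirror the Figure 3 example described in the text:
# (20, Src_1) and (10, Src_2) are synonyms for the same green entity, while
# key 20 in Src_2 is a homonym that denotes a different (yellow) entity.
# The surrogate key values 100 and 200 are illustrative.
lookup = {
    (20, "Src_1"): 100,  # green entity as keyed in Src_1
    (10, "Src_2"): 100,  # the same green entity under a different production key
    (20, "Src_2"): 200,  # yellow entity: homonym of Src_1's key 20
}

def to_dw(rows, source):
    """Replace production keys with reconciled surrogate keys."""
    return [(lookup[(prod_id, source)], value) for prod_id, value in rows]

# Load both sources and eliminate duplicates: DW ends up with globally
# unique, reconciled keys.
dw = sorted(set(to_dw([(20, "green")], "Src_1") +
                to_dw([(10, "green"), (20, "yellow")], "Src_2")))
print(dw)  # [(100, 'green'), (200, 'yellow')]
```

The Source column of the lookup is what disambiguates homonyms: the same production key 20 maps to different surrogate keys depending on its source.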
Slowly Changing Dimensions

In the Type 1 policy, we need both the previous version of the dimension data, Dold, as they were received from the source, and their current version, Dnew. We discriminate the new and updated rows through the respective operators. The new rows are assigned a new surrogate key through a function application. The updated rows are assigned a surrogate key, which is the same as the one that their previous version had already been assigned. Then, we can join the updated rows with their old versions from the target table, which subsequently will be deleted, and project only the attributes with the new values.

In the Type 2 policy, we copy the previous version of the dimension record and create a new one with a new surrogate key. If there is no previous version of the dimension record, we create a new one from scratch; otherwise, we keep them both. This policy can be used whenever we want to track the history of dimension changes.

Finally, Type 3 processing is also very simple, since, again, we only have to issue update commands to existing dimension records. For each attribute A of the dimension table, which is checked for updates, we need to have an extra attribute called old_A. Each time we spot a new value for A, we write the current A value to the old_A field and then write the new value to attribute A. In this way, we can have both new and old values present at the same dimension record.

String Problems

A major challenge in ETL processes is the cleaning and the homogenization of string data (e.g., data that stands for addresses, acronyms, names, etc.). Usually, the approaches for the solution of this problem include the application of regular expressions for the normalization of string data to a set of reference values.

Phase III: Loading

The final loading of the data warehouse has its own technical challenges. A major problem is the ability to discriminate between new and existing data at loading time. This problem arises when a set of records has to be classified into (a) the new rows that need to be appended to the warehouse and (b) rows that already exist in the data warehouse, but whose value has changed and must be updated (e.g., with an UPDATE command). Modern ETL tools already provide mechanisms for this problem, mostly through language predicates. Simple SQL commands are not sufficient, since the open-loop-fetch technique, where records are inserted one by one, is extremely slow for the vast volume of data to be loaded in the warehouse. An extra problem is the simultaneous usage of the rollback segments and log files during the loading process. The option to turn them off carries some risk in the case of a loading failure. So far, the best technique seems to be the usage of the batch loading tools offered by most RDBMSs, which avoid these problems. Other techniques that facilitate the loading task involve the creation of tables at the same time as the creation of the respective indexes, the minimization of inter-process wait states, and the maximization of concurrent CPU usage.

FUTURE TRENDS

In terms of financial growth, a recent study (Giga Information Group, 2002) reports that the ETL market reached a size of $667 million for year 2001; still, the growth rate was a rather low 11% (compared to a rate of 60% growth for year 2000). More recent studies (Giga Information Group, 2002; Gartner, 2003) regard the ETL issue as a research challenge and pinpoint several topics for future work:

• Integration of ETL with XML adapters; EAI (Enterprise Application Integration) tools (e.g., MQ-Series); customized data quality tools; and the move toward parallel processing of the ETL workflows.
• Active ETL (Adzic & Fiore, 2003), meaning the need to refresh the warehouse with data that are as fresh as possible (ideally, online).
• Extension of the ETL mechanisms for non-traditional data, like XML/HTML, spatial, and biomedical data.

CONCLUSION

ETL tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. In all the phases of an ETL process (extraction and exportation, transformation and cleaning, and loading), individual issues arise and, along with the problems and constraints that concern the overall ETL process, make its lifecycle a very troublesome task. The key factors underlying the main problems of ETL workflows are (a) vastness of the data volumes; (b) quality problems, since data are not always clean and have to be cleansed; (c) performance, since the whole process has to take place within a specific time window; and (d) evolution of the sources and the data warehouse, which can eventually lead even to daily maintenance operations. Although the state of the art in the field of both research and commercial ETL tools includes some signs of progress, much work remains to be done before we can claim that this problem is resolved. In our opinion, there are several issues that are technologically
open and that present interesting topics of research for the future in the field of data integration in data warehouse environments.

REFERENCES

Adzic, J., & Fiore, V. (2003). Data warehouse population platform. Proceedings of the 5th International Workshop on the Design and Management of Data Warehouses (DMDW), Berlin, Germany.

Arktos II. (2004). A framework for modeling and managing ETL processes. Retrieved from http://www.dblab.ece.ntua.gr/~asimi

Borkar, V., Deshmuk, K., & Sarawagi, S. (2000). Automatically extracting structure from free text addresses. Bulletin of the Technical Committee on Data Engineering, 23(4).

Calì, A., et al. (2003). IBIS: Semantic data integration at work. Proceedings of the 15th CAiSE.

Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000). Ajax: An extensible data cleaning tool. Proceedings of the ACM SIGMOD International Conference on the Management of Data, Dallas, Texas.

Gartner. (2003). ETL magic quadrant update: Market pressure increases. Retrieved from http://www.gartner.com/reprints/informatica/112769.html

Giga Information Group. (2002). Market overview update: ETL. Technical Report RPA-032002-00021.

Inmon, W.-H. (1996). Building the data warehouse. New York: John Wiley & Sons, Inc.

Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (Eds.). (2000). Fundamentals of data warehouses. Springer-Verlag.

Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998). The data warehouse lifecycle toolkit: Expert methods for designing, developing, and deploying data warehouses. New York: John Wiley & Sons.

Labio, W., & Garcia-Molina, H. (1996). Efficient snapshot differential algorithms for data warehousing. Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB), Bombay, India.

Lenzerini, M. (2002). Data integration: A theoretical perspective. Proceedings of the 21st Symposium on Principles of Database Systems (PODS), Wisconsin.

Monge, A. (2000). Matching algorithms within a duplicate detection system. Bulletin of the Technical Committee on Data Engineering, 23(4).

Rahm, E., & Do, H.H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4).

Raman, V., & Hellerstein, J. (2001). Potter's Wheel: An interactive data cleaning system. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Rome, Italy.

Rundensteiner, E. (Ed.). (1999). Special issue on data transformations. Bulletin of the Technical Committee on Data Engineering, 22(1).

Sarawagi, S. (2000). Special issue on data cleaning. Bulletin of the Technical Committee on Data Engineering, 23(4).

Scalzo, B. (2003). Oracle DBA guide to data warehousing and star schemas. Upper Saddle River, NJ: Prentice Hall.

Simitsis, A. (2003). List of ETL tools. Retrieved from http://www.dbnet.ece.ntua.gr/~asimi/ETLTools.htm

Simitsis, A. (2004). Modeling and managing extraction-transformation-loading (ETL) processes in data warehouse environments [doctoral thesis]. National Technical University of Athens, Greece.

Theodoratos, D., Ligoudistianos, S., & Sellis, T. (2001). View selection for designing the global data warehouse. Data & Knowledge Engineering, 39(3), 219-240.

Vassiliadis, P. (2000). Gulliver in the land of data warehousing: Practical experiences and observations of a researcher. Proceedings of the 2nd International Workshop on Design and Management of Data Warehouses (DMDW), Sweden.

KEY TERMS

Data Mart: A logical subset of the complete data warehouse. We often view the data mart as the restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.

Data Staging Area (DSA): An auxiliary area of volatile data employed for the purpose of data transformation, reconciliation, and cleaning before the final loading of the data warehouse.

Data Warehouse: A subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data (Inmon, 1996).
ETL: Extract, transform, and load (ETL) are data warehousing functions that involve extracting data from outside sources, transforming them to fit business needs, and ultimately loading them into the data warehouse. ETL is an important part of data warehousing, as it is the way data actually gets loaded into the warehouse.

Online Analytical Processing (OLAP): The general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting that is exemplified by a number of OLAP vendors.

Source System: An operational system of record whose function is to capture the transactions of the business. A source system is often called a legacy system in a mainframe environment.

Target System: The physical machine on which the data warehouse is organized and stored for direct querying by end users, report writers, and other applications. A target system is often called a presentation server.
Data Warehouse Performance
Yu Hong
BearingPoint Inc., USA
Zu-Hsu Lee
Montclair State University, USA
INTRODUCTION

A data warehouse is a large electronic repository of information that is generated and updated in a structured manner by an enterprise over time to aid business intelligence and to support decision making. Data stored in a data warehouse are non-volatile and time variant and are organized by subjects in a manner to support decision making (Inmon, Rudin, Buss, & Sousa, 1998). Data warehousing has been increasingly adopted by enterprises as the backbone technology for business intelligence reporting, and query performance has become the key to the successful implementation of data warehouses. According to a survey of 358 businesses on reporting and end-user query tools, conducted by Appfluent Technology, data warehouse performance significantly affects the Return on Investment (ROI) of Business Intelligence (BI) systems and directly impacts the bottom line of these systems (Appfluent Technology, 2002). Even though in some circumstances it is very difficult to measure the benefits of BI projects in terms of ROI or dollar figures, management teams are still eager to have a single version of the truth, better information for strategic and tactical decision making, and more efficient business processes by using BI solutions (Eckerson, 2003).

Dramatic increases in data volumes over time and the mixed quality of data can adversely affect the performance of a data warehouse. Some data may become outdated over time and can be mixed with data that are still valid for decision making. In addition, data are often collected to meet potential requirements, but may never be used. Data warehouses also contain external data (e.g. demographic, psychographic, etc.) to support a variety of predictive data mining activities. All these factors contribute to the massive growth of data volume. As a result, even a simple query may become burdensome to process and cause overflowing system indices (Inmon, Rudin, Buss & Sousa, 2001). Thus, exploring the techniques of performance tuning becomes an important subject in data warehouse management.

BACKGROUND

There are inherent differences between a traditional database system and a data warehouse system, though to a certain extent, all databases are similarly designed to serve a basic administrative purpose, e.g., to deliver a quick response to transactional data processes such as entry, update, query and retrieval. For many conventional databases, this objective has been achieved by online transactional processing (OLTP) systems (e.g. Oracle Corp, 2004; Winter & Auerbach, 2004). In contrast, data warehouses deal with a huge volume of data that are more historical in nature. Moreover, data warehouse designs are strongly organized for decision making by subject matter rather than by defined access or system privileges. As a result, a dimension model is usually adopted in a data warehouse to meet these needs, whereas an Entity-Relationship model is commonly used in an OLTP system. Due to these differences, an OLTP query usually requires much shorter processing time than a data warehouse query (Raden, 2003). Performance enhancement techniques are, therefore, especially critical in the arena of data warehousing.

Despite the differences, these two types of database systems share some common characteristics. Some techniques used in a data warehouse to achieve a better performance are similar to those used in OLTP, while some are developed only in relation to data warehousing. For example, as in an OLTP system, an index is also used in a data warehouse system, though a data warehouse might have different kinds of indexing mechanisms based on its granularity. Partitioning is a technique which can be used in data warehouse systems as well (Silberstein, Eacrett, Mayer, & Lo, 2003).

On the other hand, some techniques are developed specifically to improve the performance of data warehouses. For example, aggregates can be built to provide a quick response time for summary information (e.g. Eacrett, 2003; Silberstein, 2003). Query parallelism can be implemented to speed up the query when data are queried from
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Data Warehouse Performance
several tables (Silberstein, et al., 2003). Caching and query statistics are unique to data warehouses, since the statistics will help to build a smart cache for better performance. Also, pre-calculated reports are useful to certain groups of users who are only interested in seeing static reports (Eacrett, 2003). Periodic data compression and archiving helps to cleanse the data warehouse environment. Keeping only the necessary data online will allow faster access (e.g. Kimball, 1996).

MAIN THRUST

As discussed earlier, performance issues play a crucial role in a data warehouse environment. This chapter describes ways to design, build, and manage data warehouses for optimum performance. The techniques of tuning and refining the data warehouse discussed below have been developed in recent years to reduce operating and maintenance costs and to substantially improve the performance of new and existing data warehouses.

Performance Optimization at the Data Model Design Stage

Adopting a good data model design is a proactive way to enhance future performance. In the data warehouse design phase, the following factors should be taken into consideration.

• Granularity: Granularity is the main issue that needs to be investigated carefully before the data warehouse is built. For example, does the report need the data at the level of stock keeping units (SKUs), or just at the brand level? These are sizing questions that should be asked of business users before designing the model. Since a data warehouse is a decision support system, rather than a transactional system, the level of detail required is usually not as deep as in the latter. For instance, a data warehouse does not need data at the document level, such as sales orders and purchase orders, which are usually needed in a transactional system. In such a case, data should be summarized before they are loaded into the system. Defining the data that are needed (no more and no less) will determine the performance in the future. In some cases, an Operational Data Store (ODS) will be a good place to store the most detailed, granular-level data, and those data can be provided on a jump-query basis.
• Cardinality: Cardinality means the number of possible entries of the table. By collecting business requirements, the cardinality of the table can be decided. Given a table's cardinality, an appropriate indexing method can then be chosen.
• Dimensional Models: Most data warehouse designs use dimensional models, such as star schema, snowflake, and starflake. A star schema is a dimensional model with fully denormalized hierarchies, whereas a snowflake schema is a dimensional model with fully normalized hierarchies. A starflake schema represents a combination of a star schema and a snowflake schema (e.g. Moody & Kortink, 2003). Data warehouse architects should consider the pros and cons of each dimensional model before making a choice.

Aggregates

Aggregates are subsets of the fact table data (Eacrett, 2003). The data from the fact table are summarized into aggregates and stored physically in a different table than the fact table. Aggregates can significantly increase the performance of an OLAP query, since the query will read fewer data from the aggregates than from the fact table. Database read time is the major factor in query execution time, so reducing the database read time will help query performance a great deal, since fewer data are being read. However, the disadvantage of using aggregates is its loading performance. The data loaded into the fact table have to be rolled up to the aggregates, which means any newly updated records will have to be updated in the aggregates as well to keep the data in the aggregates consistent with those in the fact table. Keeping the data as current as possible has presented a real challenge to data warehousing (e.g. Bruckner & Tjoa, 2002). The ratio of database records transferred to database records read is a good indicator of whether or not to use the aggregate technique. In practice, if the ratio is 1/10 or less, building aggregates will definitely help performance (Silberstein, 2003).

Database Partitioning

Logical partitioning means using criteria such as year, planned/actual data, and business regions to partition the database into smaller data sets. After logical partitioning, a database view is created to include all the partitioned tables. In this case, no extra storage is needed, and each partitioned table will be smaller, which accelerates the query (Silberstein, Eacrett, Mayer & Lo, 2003). Take a multinational company as an example. It is better to put the data from different countries into different data targets, such as cubes or data marts, than to put the data from all the countries into one data target. By logical partitioning (splitting the data into smaller cubes), the query can read the
smaller cubes instead of large cubes, and several parallel processes can read the small cubes at the same time. Another benefit of logical partitioning is that each partitioned cube is less complex to load and easier to administer. Physical partitioning can also reach the same goal as logical partitioning. Physical partitioning means that the database table is cut into smaller chunks of data. The partitioning is transparent to the user. The partitioning will allow parallel processing of the query, and each parallel process will read a smaller set of data separately.

Indexing

In a relational database, indexing is a well-known technique for reducing database read time. By the same token, in a data warehouse dimensional model, the use of indices on the fact table, dimension table, and master data table will improve the database read time.

• The Fact Table Index: By default, the fact table will have the primary index on all the dimension keys. However, a secondary index can also be built to fit a different query design (e.g. McDonald, Wilmsmeier, Dixon, & Inmon, 2002). Unlike a primary index, which includes all the dimension keys, the secondary index can be built to include only some dimension keys to improve the performance. By having the right index, the query read time can be dramatically reduced.
• The Dimension Table Index: In a dimensional model, the size of the dimension table is the deciding factor affecting query performance. Thus, the index of the dimension table is important to decrease the master data read time, and thus to improve filtering and drill down. Depending on the cardinality of the dimension table, different index methods will be adopted to build the index. For a low cardinality dimension table, a Bit-Map index is usually adopted. In contrast, for a high cardinality dimension table, the B-Tree index should be used (e.g. McDonald, Wilmsmeier, Dixon & Inmon, 2002).
• The Master Data Table Index: Since the data warehouse commonly uses dimensional models, the query SQL plan always starts from reading the master data table. Using indices on the master data table will significantly enhance the query performance.

Caching Technology

Pre-calculation is one of the techniques by which the administrator can distribute the workload to off-peak hours and have the result sets ready for faster access (Eacrett, 2003). There are several benefits to using pre-calculated reports. The user will have a faster response time, since no calculation needs to take place on the fly. Also, the system workload is balanced and shifted to off-peak hours. Lastly, the reports can be available offline.

Use Statistics to Further Tune up the System

In a real-world data warehouse system, OLAP statistics data are collected by the system. The statistics provide such information as what the most used queries are and how the data are selected. The statistics can help further tune the system. For example, examining the descriptive statistics of the queries will reveal the most commonly used drill-down dimensions as well as the combinations of the dimensions. The OLAP statistics will also indicate what the major time component is out of the total query time; it could be database read time or OLAP calculation time. Based on these data, one can build aggregates or offline reports to increase the query performance.
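As an illustration of how such statistics can drive tuning decisions, the two rules of thumb discussed above (the 1/10 transfer-to-read ratio for aggregates, and comparing database read time against OLAP calculation time) can be sketched as follows. The field names, suggestion messages, and sample figures are hypothetical; only the 1/10 threshold comes from the text (Silberstein, 2003).

```python
# Hypothetical sketch of statistics-driven tuning suggestions.
# Field names and messages are assumptions made for illustration.
def suggest_tuning(stats):
    actions = []
    ratio = stats["records_transferred"] / stats["records_read"]
    if ratio <= 1 / 10:  # few records survive summarization: aggregates pay off
        actions.append("build an aggregate")
    if stats["db_read_time"] >= stats["olap_calc_time"]:
        actions.append("reduce database read time (indices, partitioning)")
    else:
        actions.append("pre-calculate reports (OLAP calculation dominates)")
    return actions

sample = {"records_read": 1_000_000, "records_transferred": 5_000,
          "db_read_time": 42.0, "olap_calc_time": 3.5}
print(suggest_tuning(sample))
# ['build an aggregate', 'reduce database read time (indices, partitioning)']
```

A real system would of course weigh many more signals, but the shape of the decision (compare where the time goes, then pick the matching technique) is the same.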
Beitler, S.S., & Leary, R. (1997). Sears' EPIC transformation: Converting from mainframe legacy systems to On-Line Analytical Processing (OLAP). Journal of Data Warehousing, 2, 5-16.

Bruckner, R.M., & Tjoa, A.M. (2002). Capturing delays and valid times in data warehouses: Towards timely consistent analyses. Journal of Intelligent Information Systems, 19(2), 169-190.

Eacrett, M. (2003). Hitchhiker's guide to SAP business information warehouse performance tuning. SAP White Paper.

Eckerson, W. (2003). BI StatShots. Journal of Data Warehousing, 8(4), 64.

Grim, R., & Thorton, P.A. (1997). A customer for life: The warehouse-MCI approach. Journal of Data Warehousing, 2, 73-79.

Inmon, W.H., Imhoff, C., & Sousa, R. (2001). Corporate information factory. New York: John Wiley & Sons, Inc.

Inmon, W.H., Rudin, K., Buss, C.K., & Sousa, R. (1998). Data warehouse performance. New York: John Wiley & Sons, Inc.

Kimball, R. (1996). The data warehouse toolkit. New York: John Wiley & Sons, Inc.

Ma, C., Chou, D.C., & Yen, D.C. (2000). Data warehousing, technology assessment and management. Industrial Management + Data Systems, 100, 125.

McDonald, K., Wilmsmeier, A., Dixon, D.C., & Inmon, W.H. (2002, August). Mastering the SAP business information warehouse. Hoboken, NJ: John Wiley & Sons, Inc.

Moody, D., & Kortink, M.A.R. (2003). From ER models to dimensional models, part II: Advanced design issues. Journal of Data Warehousing, 8, 20-29.

Oracle Corp. (2004). Largest transaction processing db on Unix runs oracle database. Online Product News, 23(2).

Peterson, S. (1994). Stars: A pattern language for query optimized schema. Sequent Computer Systems, Inc. White Paper.

Raden, N. (2003). Real time: Get real, part II. Intelligent Enterprise, 6(11), 16.

Silberstein, R. (2003). Know how network: SAP BW performance monitoring with BW statistics. SAP White Paper.

Silberstein, R., Eacrett, M., Mayer, O., & Lo, A. (2003). SAP BW performance tuning. SAP White Paper.

Uhle, R. (2003). Data aging with mySAP business intelligence. SAP White Paper.

Watson, H., Gerard, J., Gonzalez, L.E., Haywood, M.E., & Fenton, D. (1999). Data warehousing failures: Case studies and findings. Journal of Data Warehousing, 4, 44-55.

Winter, R., & Auerbach, K. (2004). Contents under pressure: Scalability challenges for large databases. Intelligent Enterprise, 7(7), 18-25.

KEY TERMS

Cache: A region of a computer's memory which stores recently or frequently accessed data so that the time of repeated access to the same data can decrease.

Granularity: The level of detail or complexity at which an information resource is described.

Indexing: In data storage and retrieval, the creation and use of a list that inventories and cross-references data. In database operations, a method to find data more efficiently by indexing on primary key fields of the database tables.

ODS (Operational Data Stores): A system with capability of continuous background update that keeps up with individual transactional changes in operational systems, versus a data warehouse that applies a large load of updates on an intermittent basis.

OLAP (Online Analytical Processing): A category of software tools for collecting, presenting, delivering, processing and managing multidimensional data (i.e., data that has been aggregated into various categories or dimensions) in order to provide analytical insights for business management.

OLTP (Online Transaction Processing): A standard, normalized database structure designed for transactions in which inserts, updates, and deletes must be fast.

Service Management: The strategic discipline for identifying, establishing, and maintaining IT services to support the organization's business goal at an appropriate cost.

SQL (Structured Query Language): A standard interactive programming language used to communicate with relational databases in order to retrieve, update, and manage data.
Data Warehousing and Mining in Supply Chains
Reuven R. Levary
Saint Louis University, USA
Data Warehousing and Mining in Supply Chains
items that are appealing to consumers. The items are classified according to consumers' socioeconomic backgrounds and interests, the sale price that consumers are willing to pay, and the location of the point of sale.

Data regarding the return of sold products are used to identify potential problems with the products and their uses. These data include information about product quality, consumer disappointment with the product, and legal consequences. Data-mining techniques can be used to identify patterns in returns so that retailers can better determine which type of product to order in the future and from which supplier it should be purchased. Retailers are also interested in collecting data regarding competitors' sales so that they can better promote their own product and establish a competitive advantage.

Data related to political and economic conditions in supplier countries are of interest to retailers. Data-mining techniques can be used to identify political and economic patterns in countries. Information can help retailers choose suppliers who are situated in countries where the flow of products and funds is expected to be stable for a reasonably long period of time.

Manufacturers collect data regarding a) particular products and their manufacturing process, b) suppliers, and c) the business environment. Data regarding the product and the manufacturing process include the characteristics of products and their component parts obtained from CAD/CAM systems, the quality of products and their components, and trends in the research and development (R&D) of relevant technologies. Data-mining techniques can be applied to identify patterns in the defects of products, their components, or the manufacturing process. Data regarding suppliers include availability of raw materials, labor costs, labor skills, technological capability, manufacturing capacity, and lead time of suppliers. Data related to qualified teleimmigrants (e.g., engineers and computer software developers) are valuable to many manufacturers. Data-mining techniques can be used to identify those teleimmigrants having unique knowledge and experience. Data regarding the business environment of manufacturers include information about competitors, potential legal consequences regarding a product or service, and both political and economic conditions in countries where the manufacturer has either facilities or business partners. Data-mining techniques can be used to identify possible liability concerning a product or service as well as trends in political and economic conditions in countries where the manufacturer has business interests.

Retailers, manufacturers, and suppliers are all interested in data regarding transportation companies. These data include transportation capacity, prices, lead time, and reliability for each mode of transportation.

MAIN THRUST

Data Aggregation in Supply Chains

Large amounts of data are being accumulated and stored by companies belonging to supply chains. Data aggregation can improve the effectiveness of using the data for operational, tactical, and strategic planning models. The concept of data aggregation in manufacturing firms is called group technology (GT). Nonmanufacturing firms are also aggregating data regarding products, suppliers, customers, and markets.

Group Technology

Group technology is a concept of grouping parts, resources, or data according to similar characteristics. By grouping parts according to similarities in geometry, design features, manufacturing features, materials used, and/or tooling requirements, manufacturing efficiency can be enhanced, and productivity increased. Manufacturing efficiency is enhanced by

• Performing similar activities at the same work center so that setup time can be reduced
• Avoiding duplication of effort both in the design and manufacture of parts
• Avoiding duplication of tools
• Automating information storage and retrieval (Levary, 1993)

Effective implementation of the GT concept necessitates the use of a classification and coding system. Such a system codes the various attributes that identify similarities among parts. Each part is assigned a number or alphanumeric code that uniquely identifies the part's attributes or characteristics. A part's code must include both design and manufacturing attributes.

A classification and coding system must provide an effective way of grouping parts into part families. All parts in a given part family are similar in some aspect of design or manufacture. A part may belong to more than one family.

A part code is typically composed of a large number of characters that allow for identification of all part attributes. The larger the number of attributes included in a part code, the more difficult the establishment of standard procedures for classifying and coding. Although numerous methods of classification and coding have been developed, none has emerged as the standard method. Because different manufacturers have different requirements regarding the type and composition of part codes,
324
TEAM LinG
Data Warehousing and Mining in Supply Chains
customized methods of classification and coding are gen- Products intended to be used or consumed in a
erally required. Some of the better known classification specific season of the year ,
and coding methods are listed by Groover (1987). Volume and speed of product movement
After a code is established for each part, the parts are The methods of the transportation of the products
grouped according to similarities and are assigned to part from the suppliers to the retailers
families. Each part family is designed to enhance manufac- The geographical location of suppliers
turing efficiency in a particular way. The information re- Method of transaction handling with suppliers; for
garding each part is arranged according to part families in example, EDI, Internet, off-line
a GT database. The GT database is designed in such a way
that users can efficiently retrieve desired information by As in the case of GT, retailing products may belong
using the appropriate code. to more than one family.
Consider part families that are based on similarities of
design features. A GT database enables design engineers Aggregation of Data Regarding
to search for existing part designs that have characteris- Customers of Finished Products
tics similar to those of a new part that is to be designed. The
search begins when the design engineer describes the To effectively market finished products to customers, it
main characteristics of the needed part with the help of a is helpful to aggregate customers with similar character-
partial code. The computer then searches the GT database istics into families. Examples of customers of finished
for all the items with the same code. The results of the product families include
search are listed on the computer screen, and the designer
can then select or modify an existing part design after Customers residing in a specific geographical re-
reviewing its specifications. Selected designs can easily gion
be retrieved. When design modifications are needed, the Customers belonging to a specific socioeconomic
file of the selected part is transferred to a CAD system. group
Such a system enables the design engineer to effectively Customers belonging to a specific age group
modify the parts characteristics in a short period of time. Customers having certain levels of education
In this way, efforts are not duplicated when designing Customers having similar product preferences
parts. Customers of the same gender
The creation of a GT database helps reduce redun- Customers with the same household size
dancy in the purchasing of parts as well. The database
enables manufacturers to identify similar parts produced Similar to both GT and retailing products, customers
by different companies. It also helps manufacturers to of finished products may belong to more than one family.
identify components that can serve more than a single
function. In such ways, GT enables manufacturers to
reduce both the number of parts and the number of suppli-
ers. Manufacturers that can purchase large quantities of a
FUTURE TRENDS
few items rather than small quantities of many items are
able to take advantage of quantity discounts. Supply Chain Decision Databases
Aggregation of Data Regarding Retailing The enterprise database systems that support supply
chain management are repositories for large amounts of
Products transaction-based data. These systems are said to be
data rich but information poor. The tremendous amount
Retailers may carry thousands of products in their stores. of data that are collected and stored in large, distributed
To effectively manage the logistics of so many products, database systems has far exceeded the human ability for
product aggregation is highly desirable. Products are comprehension without analytic tools. Shapiro estimates
aggregated into families that have some similar character- that 80% of the data in a transactional database that
istics. Examples of product families include the following: supports supply chain management is irrelevant to deci-
sion making and that data aggregations and other analy-
Products belonging to the same supplier ses are needed to transform the other 20% into useful
Products requiring special handling, transportation, information (2001). Data warehousing and online analyti-
or storage cal processing (OLAP) technologies combined with tools
Products intended to be used or consumed by a for data mining and knowledge discovery have allowed
specific group of customers the creation of systems to support organizational deci-
sion making.
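The partial-code search over a GT database, described in the group technology discussion above, can be sketched as follows. This is a minimal illustration only: the five-character code scheme, the part identifiers, and the use of `?` as an "attribute unspecified" wildcard are all hypothetical assumptions, not a standard coding method.

```python
# Sketch of a GT (group technology) search by partial part code.
# The 5-character code scheme and the sample parts are hypothetical.
parts = {
    "BRKT-01": "RSM3C",  # e.g., rotational, steel, milled, size 3, coated
    "BRKT-02": "RSM3U",
    "SHFT-07": "RST5U",
}

def find_family(partial_code: str) -> list[str]:
    """Return IDs of parts whose code matches the partial code.
    A '?' in the partial code matches any character (attribute unspecified)."""
    def matches(code: str) -> bool:
        return all(p in ("?", c) for p, c in zip(partial_code, code))
    return [pid for pid, code in parts.items() if matches(code)]

# A designer specifies only the first three attributes of the needed part:
print(find_family("RSM??"))  # both brackets share the RSM prefix
```

In this sketch, a designer who knows only some attributes of a needed part retrieves every existing design in the same family, which is the reuse effect the text attributes to GT databases.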
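The aggregation step Shapiro describes, condensing raw transactions into the small decision-relevant fraction, can be sketched minimally with an in-memory SQL database. The `orders` table, its columns, and the sample rows are hypothetical; a real supply chain decision database would aggregate along many more dimensions.

```python
import sqlite3

# Minimal sketch: aggregate raw order transactions (hypothetical schema)
# into per-product monthly demand, the kind of summary a decision
# database holds instead of full transaction detail.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (product TEXT, month TEXT, qty INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("widget", "2006-01", 10), ("widget", "2006-01", 15),
    ("widget", "2006-02", 7),  ("gadget", "2006-01", 3),
])
summary = con.execute(
    "SELECT product, month, SUM(qty) FROM orders "
    "GROUP BY product, month ORDER BY product, month"
).fetchall()
print(summary)
# [('gadget', '2006-01', 3), ('widget', '2006-01', 25), ('widget', '2006-02', 7)]
```

The GROUP BY query is the simplest form of the "data aggregation" the text refers to; OLAP tools generalize it to many dimensions and levels of summarization.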
The supply chain management (SCM) data warehouse must maintain a significant amount of data for decision making. Historical and current data are required from supply chain partners and from various functional areas within the firm in order to support decision making in regard to planning, sourcing, production, and product delivery. Supply chains are dynamic in nature. In a supply chain environment, it may be desirable to learn from an archived history of temporal data that often contains some information that is less than optimal. In particular, SCM environments are typically characterized by variable changes in product demand, supply levels, product attributes, machine characteristics, and production plans. As these characteristics change over time, so does the data in the data warehouses that support SCM decision making. We should note that Kimball and Ross (2002) use a supply value chain and a demand supply chain as the framework for developing the data model for all business data warehouses.

The data warehouses provide the foundation for decision support systems (DSS) for supply chain management. Analytical tools (simulation, optimization, and data mining) and presentation tools (geographic information systems and graphical user interface displays) are coupled with the input data provided by the data warehouse (Marakas, 2003). Simchi-Levi, Kaminsky, and Simchi-Levi (2000) describe three DSS examples: logistics network design, supply chain planning, and vehicle routing and scheduling. Each DSS requires different data elements, has specific goals and constraints, and utilizes special graphical user interface (GUI) tools.

The Role of Radio Frequency Identification (RFID) in Supply Chains Data Warehousing

The emerging RFID technology will generate large amounts of data that need to be warehoused and mined. Radio frequency identification (RFID) is a wireless technology that identifies objects without requiring either contact with or sight of them. RFID tags can be read despite environmentally difficult conditions such as fog, ice, snow, paint, and widely fluctuating temperatures. Optically read technologies, such as bar codes, cannot be used in such environments. RFID can also identify objects that are moving.

Passive RFID tags have no external power source. Rather, their operating power is generated from a reader device. Passive RFID tags are very small and inexpensive. Further, they have a virtually unlimited operational life. These characteristics make passive RFID tags ideal for tracking materials through supply chains. Wal-Mart has required manufacturers, suppliers, distributors, and carriers to incorporate RFID tags into both products and operations. Other large retailers are following Wal-Mart's lead in requesting RFID tags to be installed in goods along their supply chain. The tags follow products from the point of manufacture to the store shelf. RFID technology will significantly increase the effectiveness of tracking materials along supply chains and will also substantially reduce the loss that retailers accrue from thefts. Nonetheless, civil liberty organizations are trying to stop RFID tagging of consumer goods, because this technology has the potential of affecting consumer privacy. RFID tags can be hidden inside objects without customer knowledge, so RFID tagging would make it possible for individuals to read the tags without the consumers even having knowledge of the tags' existence.

Sun Microsystems has designed RFID technology to reduce or eliminate drug counterfeiting in pharmaceutical supply chains (Jaques, 2004). This technology will make the copying of drugs extremely difficult and unprofitable. Delta Air Lines has successfully used RFID tags to track pieces of luggage from check-in to planes (Brewin, 2003). The luggage-tracking success rate of RFID was much better than that provided by bar code scanners.

Active RFID tags, unlike passive tags, have an internal battery, and they can be rewritten and/or modified. The read/write capability of active RFID tags is useful in interactive applications such as tracking work in process or maintenance processes. Active RFID tags are larger and more expensive than passive RFID tags. Both passive and active tags have a large, diverse spectrum of applications and have become the standard technologies for automated identification, data collection, and tracking. A vast amount of data will be recorded by RFID tags. The storage and analysis of this data will pose new challenges to the design, management, and maintenance of databases as well as to the development of data-mining techniques.

CONCLUSION

A large amount of data is likely to be gathered from the many activities along supply chains. This data must be warehoused and mined to identify patterns that can lead to better management and control of supply chains. The more RFID tags installed along supply chains, the easier data collection becomes. As the tags become more popular, the data collected by them will grow significantly. The increased popularity of the tags will bring with it new possibilities for data analysis as well as new warehousing and mining challenges.
Data Warehousing Search Engine
Charles Greenidge
University of the West Indies, Barbados
True decision making embraces as many pertinent sources of information as possible so that a holistic perspective of facts, trends, and individual pieces of data can be obtained. Increasingly, the growth of commerce and business on the Internet has meant that, in addition to traditional modes of disseminating information, the Internet has become a forum for ready posting of information of all kinds (Hall, 1999).

The prevalence of so much information on the Internet means that it is potentially a superior source of external data for the data warehouse. Since such data typically originate from a variety of sources (sites), the data have to undergo a merging and transformation process before they can be used in the data warehouse. In the case of internal data, which forms the core of the data warehouse, previous manual methods of applying ad-hoc tools and techniques to the data cleaning process are being replaced by more automated forms such as ETL (Extract, Transform, Load). Although current standardized ETL packages are expensive, they offer productivity gains in the long run (Earls, 2003; Songini, 2004).

The existence of the so-called invisible Web and ongoing efforts to gain access to these untapped sources suggest that the future of external data retrieval will enjoy the same interest as that shown in internal data (Inmon, 2002; Sherman & Price, 2003; Smith, 2001). The need for reliable and consistent external data provides the motivation for an intermediate layer between raw data gathered from the Internet and external data storage areas lying within the domain of the data warehouse (Agosta, 2000).

Until there is a maturing of widely available tools with the ability to access the invisible Web, there will be a continued reliance on information retrieval techniques, as contrasted with data retrieval techniques, to gather external data (van Rijsbergen, 1979). The need for three environments to be present to process external data from Web sources into the warehouse suggests a three-tier solution to this problem. Accordingly, we propose a tri-partite model called the Data Warehouse Search Engine Model (DWSE), which has an intermediate data extraction/cleaning layer, functionally called the Meta-Data Engine, sandwiched between the data warehouse and search engine environments.

The DWSE Model

The data warehouse and search engine environments serve two distinct and important roles at the current time, but there is scope to utilize the strengths of both in conjunction for maximum usefulness (Barquin & Edelstein, 1997; Sonnenreich & Macinta, 1998). Our proposed DWSE model seeks to allow cooperative links between data warehouse and search engine with an aim of satisfying external data requirements. The model consists of (1) the data warehouse (DW), (2) the meta-data engine (MDE), and (3) the search engine (SE). The MDE is the component that provides a bridge over which information must pass from one environment to the other. The MDE enhances queries coming from the warehouse and also captures, merges, and formats information returned by the search engine (Devlin, 1998).

The new model, through the MDE, seeks to augment the operations of both by allowing external data to be collected for the business analyst, while improving the search engine searches through modifications of queries emerging from the data warehouse.

The generalized process is as follows. A query originates in the warehouse environment and is modified by the MDE so that it is specific and free of nonsense words. A word that has a high occurrence in a text but conveys little specific information about the subject of the text is deemed to be a nonsense word. Typically, these words include pronouns, conjunctions, and proper names. This term is synonymous with noise word, as found in information retrieval texts (Belew, 2000). The modified query is transmitted to the search engine, which performs its operations and retrieves its result documents. The documents returned are analyzed by the MDE, and information is prepared for return to the warehouse. Finally, the information relating to the answer to the query is returned to the warehouse environment. The entire process may take days or weeks, as both warehouse and search engine operate independently according to their own schedules.

Figure 2. The DWSE model (enterprise-wide sources feed extract/transform processing and meta-data storage in the data warehouse, whose DSS tools and end users are linked through the meta-data engine to the search engine, its files, and the Internet)

Figure 3. Bridging the architectures: The meta-data engine (the meta-data engine sits between internal data and the Internet)

The design of both the search engine and the data warehouse is a skill-intensive and specialized task. Prudent judgment dictates that nothing should be done to add to the burden of building and maintaining these complex systems. History shows that many information technology (IT) projects fail when expediency overtakes deliberate design. In Figure 2, we see the cooperation among components of the DWSE.

The DWSE model considers only the external data requirements of the data warehouse. The SE component performs actual searches but no analysis of documents. The indexing of the retrieved documents and other analysis is done by the MDE.

Meta-Data Engine Model

To cope with the radical differences between the SE and DW designs, we propose a meta-data engine to coordinate all activities. Typical commercial SEs are composed of a crawler (spider) and an indexer (mite). The indexer is used to codify results into a database for easy querying. Our approach reduces the complexity of the search engine by moving the indexing tasks to the meta-data engine. The meta-data engine seeks to form a bridge between the diverse SE and DW environments. The purpose of the meta-data engine is to facilitate an automatic information retrieval mode. Figure 3 illustrates the concept of the bridge between DW and SE environments, and Figure 4 outlines the main features of the MDE.

The following are some of the main functions of the meta-data engine:

• Capture and transform data arising from the SE (e.g., handle HTML, XML, .pdf, and other document formats).
• Index words retrieved during SE crawls.
• Manage the scheduling and control of tasks arising from the operation of both DW and SE.
• Provide a neutral platform so that SE and DW operational cycles and architectures cannot interfere with each other.
• Track meta-data that may be used to enhance the quality of queries and results in the future (e.g., monitor the use of domain-specific words/phrases such as jargon for a particular domain).

Of major interest in Figure 4 is the automatic information retrieval (AIR) component. AIR seeks to highlight the existence of documents that are related to a given query (Belew, 2000) using a process that is different from data retrieval. Data retrieval (van Rijsbergen, 1979), as used in relational database management systems (RDBMSs), requires an exact match of terms, uses an artificial query language (e.g., SQL), and uses deduction. Information retrieval, on the other hand (as is common for Internet searches), uses partial matching and a natural query language, and inductively produces relevant results (Spertus & Stein, 1999). This idea of relevance distinguishes the IR search from normal data retrieval, where a matching of terms is done.

ANALYSIS OF MODEL

Strengths

The major strengths of this model are:

Independence

This model carries logical independence. There is a deliberate attempt to ensure that the best practices of each individual model are adhered to by insistence that

Figure 4. Characteristics of the meta-data engine (the MDE operates in AIR mode, processes queries emerging from the data warehouse, provides a neutral platform, and processes meta-data)

ment to take place in three languages. For example, when there is a query language for the warehouse, Perl for the meta-data engine, and Java for the search engine, the strengths of each can be utilized effectively.

The new model allows for security and integrity to be maintained. The warehouse need not expose itself to dangerous integrity questions or to increasingly malicious threats on the Internet. By isolating the sector that interfaces with the Internet from the sector that carries vital internal data, the prospects for good security are improved.

Relieves Information Overload
Figure 5. Hidden dangers in DWSE model:
• Large volumes of redundant/unusable data may be stored.
• End-user analysis may become skewed by wrong use of external data.
• Validity and accuracy of external data are not always known, so there are risks involved in using the external data.

3. Central importance of the Internet and its related technologies (Goodman, 2000).
4. Increasing sophistication and focus of Web sites.
5. Increased need to automate analysis and decision support on the growing volumes of data, both internal and external to the organization.
6. Adoption of specialized search engines (Raab, 1999).
7. Adoption of specialized sites (vortals) catering to a particular category of user (Rupley, 2000).
8. Ability to mine the invisible Web, increasing potential benefits from a data warehouse search engine model (Sherman & Price, 2003; Smith, 2001; Soudah, 2000; Sullivan, 2000).

The DWSE model is poised to gain prominence as practitioners realize the benefits of three distinct and architecturally independent layers. We are confident that with vital and open architectures and models of clarity, data warehousing efforts can be successful in the long term.

CONCLUSION

More research is needed to investigate how to enable the data warehouse to take advantage of the growing external data stores on the Internet. Comparison of results obtained using the three-tiered approach of the DWSE model with the traditional external data approaches must be carried out. Unclear also is how traditional Online Transaction Processing (OLTP) systems will funnel data toward this model. If operational systems start to retain significant quantities of external data, it may mitigate against this model, as more and more external data will be available in the warehouse through normal integration processes. On the other hand, the costs of dealing with such external data in the warehouse may be so prohibitive that an alternate path into the meta-data engine of the new model may be required.

In the case of the second outcome, a new model may be proposed with this process (OLTP external data to meta-data engine) as an integral part. Current data warehouse models already allow for the establishment of data marts and querying by external tools. If these new processes are considered desirable components of the data warehousing process, then a new data warehouse model could be proposed.

Relevance remains a thorny issue. The problem of providing relevant data without anyone to manually verify its relevance is challenging. Semantic Web developments promise to address questions of relevance on the Internet.

REFERENCES

Agosta, L. (2000). The essential guide to data warehousing. New Jersey: Prentice-Hall.

Barquin, R., & Edelstein, H. (Eds.). (1997). Planning and designing the data warehouse. New Jersey: Prentice-Hall.

Belew, R.K. (2000). Finding out about: A cognitive perspective on search engine technology and the WWW. New York: Cambridge University Press.

Berson, A., & Smith, S.J. (1997). Data warehousing, data mining and OLAP. New York: McGraw-Hill.

Brake, D. (1997). Lost in cyberspace [Electronic version]. New Scientist.

Celko, J. (1995). Don't warehouse dirty data. Datamation, 42-53.

Devlin, B. (1998). Meta-data: The warehouse atlas. DB2 Magazine, 3(1), 8-9.

Earls, A.R. (2003). ETL: Preparation is the best bet. Computerworld, 37(34), 25-27.

Goodman, A. (2000). Searching for a better way [Electronic version].

Greenfield, L. (1996). Don't let data warehousing gotchas getcha. Datamation, 76-77.

Hall, C. (1999). Enterprise information portals: Hot air or hot technology [Electronic version]. Business Intelligence Advisor, 111(11).

Higgins, K.J. (2003). Warehouse data earns its keep. Network Computing, 14(8), 111-115.

Inmon, W.H. (2002). Building the data warehouse. New York: John Wiley & Sons.

Inmon, W.H. (2003). The story so far. Computerworld, 37(15), 26-27.

Kimball, R. (1996). Dangerous preconceptions [Electronic version].

Kimball, R. (1997). A dimensional modeling manifesto [Electronic version]. DBMS Magazine.

Madria, S.K., et al. (1999). Research issues in Web data mining. Proceedings of Data Warehousing and Knowledge Discovery, First International Conference.

Pfaffenberger, B. (1996). Web search strategies. New York: MIS Press.

Raab, D.M. (1999). Enterprise information portals [Electronic version]. Relationship Marketing Report.

Rupley, S. (2000). From portals to vortals. PC Magazine.

Sander-Beuermann, W., & Schomburg, M. (1998). Internet information retrieval: The further development of meta-search engine technology. Proceedings of the Internet Summit, Internet Society, Geneva, Switzerland.

Schwartz, E. (2003). Data warehouses get active. InfoWorld, 25(48), 12-13.

Sherman, C., & Price, G. (2003). The invisible Web: Uncovering sources search engines can't see. Library Trends, 52(2), 282-299.

Smith, C.B. (2001). Getting to know the invisible Web. Library Journal, 126(11), 16-19.

Songini, M.L. (2004). ETL. Computerworld, 38(5), 23-24.

Sonnenreich, W., & Macinta, T. (1998). Web developer.com guide to search engines. New York: Wiley Computer Publishing.

Soudah, T. (2000). Search, and you shall find [Electronic version].

Spertus, E., & Stein, L.A. (1999). Squeal: A structured query language for the Web [Electronic version].

Sullivan, D. (2000). Invisible Web gets deeper [Electronic version]. The Search Engine Report.

van Rijsbergen, C.J. (1979). Information retrieval [Electronic version]. In Finding out about [CD-ROM]. Richard Belew.

Wixom, B.H., & Watson, H.J. (2001). An empirical investigation of the factors affecting data warehousing success. MIS Quarterly, 25(1), 17-39.

KEY TERMS

Data Retrieval: Denotes the standardized database methods of matching a set of records, given a particular query (e.g., use of the SQL SELECT command on a database).

Decision Support System (DSS): An interactive arrangement of computerized tools tailored to retrieve and display data regarding business problems and queries.

ETL (Extract/Transform/Load): This term specifies a category of software that efficiently handles three essential components of the warehousing process. First, data must be extracted (removed from the originating system), then transformed (reformatted and cleaned), and third, loaded (copied/appended) into the data warehouse database system.

External Data: A broad term indicating data that is external to a particular company. Includes electronic and non-electronic formats.

Information Retrieval: Denotes the attempt to match a set of related documents to a given query using semantic considerations (e.g., library catalogue systems often employ information retrieval techniques).

Internal Data: Previously cleaned warehouse data that originated from the daily information processing systems of a company.

Invisible Web: Denotes those significant portions of the Internet storing data that are inaccessible to the major search engines. The invisible Web represents an often ignored/neglected source of potential online information.

Metadata: Data about data; in the data warehouse, it describes the contents of the data warehouse.
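The contrast drawn in this article between data retrieval (exact matching, deductive) and information retrieval (partial matching, ranked by relevance) can be sketched as follows. The documents and the term-overlap score are illustrative assumptions only; real IR systems use far richer relevance models.

```python
# Sketch: exact-match data retrieval vs. ranked information retrieval.
# The documents and the term-overlap scoring are illustrative only.
docs = {
    1: "data warehouse design and meta data",
    2: "search engine crawler and indexer",
    3: "warehouse data cleaning with etl",
}

def data_retrieval(query: str) -> list[int]:
    """Exact match: return docs containing the query as a literal phrase."""
    return [d for d, text in docs.items() if query in text]

def information_retrieval(query: str) -> list[int]:
    """Partial match: rank docs by the number of shared query terms."""
    terms = set(query.split())
    scores = {d: len(terms & set(text.split())) for d, text in docs.items()}
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: -scores[d])

print(data_retrieval("warehouse data"))        # only the literal phrase matches
print(information_retrieval("warehouse data")) # partial matches, ranked
```

The exact matcher behaves like a database predicate and misses documents that share the query's vocabulary in a different order; the ranked matcher returns them all, ordered by a crude relevance score, which is the behavior the AIR component of the MDE is intended to provide.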
Data Warehousing Solutions for Reporting Problems

lyzed, and used as a basis for the reports (Begg & Connolly, 2002). A data warehouse provides decision support to organizations with the help of analytical databases and Online Analytical Processing (OLAP) tools (Gorla, 2003). A data warehouse (see Figure 1) receives data from the operational databases on a regular basis, and new data is added to the existing data. The warehouse contains both detailed aggregated data and summarized data to speed up the queries. It is typically organized in smaller units called data marts, which support the specific analysis needs of a department or business unit (Bonifati, Cattaneo, Ceri, Fuggetta, & Paraboschi, 2001).

Figure 1. Data warehousing (operational databases feed the data warehouse through cleaning and reformatting; the warehouse holds summarized and detailed data, filtered into data marts, from which analytical tools produce valuable information for processes)

In the case organizations, the idea of the data warehouse has been discussed, but so far no data warehouses exist, although in one case a data warehouse pilot is in use. The rationale for these discussions is that, at the moment, the reporting and analyzing possibilities are not serving the organizations very well. Indeed, the interviewees identified many problems in reporting.

In the SOK Corporation, the interviewees complained that information is distributed across numerous information systems; thus, building a comprehensive view of the information is difficult. Another problem is in financial reporting. A financial report taken from different information systems gives different results, though they should be equal. A reason for this inequality is that the data is not harmonized and processed similarly. In the restaurant business of SOK Corporation, an essential piece of information is the sales figures of the products. It should be possible to analyze which products have been bought, where, and in what quantities. In the whole SOK Corporation, analyzing different customers and their behavior in detail is, at the moment, impossible. The interviewees also mentioned that a common database containing all products of the co-operative society might help in reporting, but defining a common classification of the products will be a demanding task. Centralization of the data is also one topic that has been discussed in SOK Corporation, which has been justified with improvements in reporting.

In Salon Seudun Puhelin, Ltd., the interviewees mentioned that the major information system is used inconsistently in some respects. Therefore, the data is not consistent, and this influences the reporting. This company has developed its own reporting application with Microsoft Access, but the program is not capable of managing files over 1 GB, which reduces the possibilities of using the system. According to the interviewees, this limit prevents, for example, the follow-up of daily sales. Another problem concerning the reporting system is that users are incapable of defining their own reports when their needs change. Analyzing customer data is also difficult, because collecting all customer data together is a very burdensome task. Therefore, Salon Seudun Puhelin, Ltd., has also discussed a data warehouse solution for three reasons: a) to get rid of the size limits, b) to provide a system in which users can easily define new reports, and c) to gain more versatile analysis possibilities.

The State Provincial Office of Western Finland is a joint regional administrative authority of seven ministries. One of their yearly responsibilities is to evaluate the basic service in their region. In practice, this responsibility means that they gather and analyze a large amount of data. The first problem is that they have not used a special data management tool. The lack of an adequate tool for data management makes it difficult to do any time-series analysis, which many of the interviewees hoped for. Another problem is that the results should be easily distributed in the form of different reports, but at the moment, this is not the case.

In TS-Group, Ltd., a data warehouse pilot has been implemented. The pilot enables versatile reporting and, as the interviewees mentioned, this opportunity should not be lost. However, some reporting problems still exist. For example, the distribution and the format of the reports should be solved. In the department of financial management, the reporting system does not support the latest operating systems and, therefore, only some computers are capable of using the system.

In Optiroc, Ltd., reports are generated directly from the operational databases, which are not designed for reporting purposes. One problem attendant on this design is that the reports run slowly. In principle, users should also be able to create their reports by themselves, but in reality, only a few of them are able to. Maybe this is one reason that the interviewees presented very critical comments on reporting. The interviewees mentioned as well that the implementation of a data warehouse system is strongly supported and is seen as a solution for problems in reporting. A data warehouse was also justified because, with it, the company could serve customers better and could produce customized reports. At the moment, customer reporting is analyzed to develop reporting and to define necessary tools.

MAIN THRUST

The case organizations have plans to start exploiting a data warehouse, but before the introduction of a data warehouse, plenty of design must be accomplished in all these cases. Designing a data warehouse requires quite different techniques than the design of an operational

need to pay attention to the quality of the source data in the operational databases (Finnegan & Sammon, 2000). One problem with MDS is that when the business environment changes, the evolution of multidimensional schemas is not as manageable as with normalized schemas (Martyn, 2004). It is also said that any architecture not based on third normal form can cause the failure of a data warehouse project (Gardner, 1998). On the other hand, a dimensional model provides a better solution for a decision support application than a pure normalized relational model does (Loukas & Spencer, 1999). All of the above is actually related to efficiency, because a large amount of data is processed during analysis. Typically, this is a question about needed joins in the database
database (Golfarelli & Rizzi, 1998). Modeling a data level. Usually, a star schema is the most efficient design
model for a data warehouse is seen as one of the most for a data warehouse, because the denormalized tables
critical phases in the development process (Bonifati et require fewer joins (Martyn, 2004). However, recent de-
al., 2001). This data modeling has specific features that velopments in storage technology, access methods (such
distinguish it from normal data modeling (Busborg, as bitmap indexes), and query optimization indicate that
Christiansen, & Tryfona, 1999). At the beginning, the the performance with the third normalized form should be
content of the operational information systems, the in- tested before moving to multidimensional schemas
terconnections between them, and the equivalent entities (Martyn, 2004). From this 3NF schema, a natural step
should be understood (Blackwood, 2000). In practice, toward MDS is to use denormalization, which will
this entails studying the data models of the operational support both efficiency and flexibility issues (Finnegan
databases and developing an integrated schema to en- & Sammon, 2000). Still, it is possible to define neces-
hance the data interoperability (Bonifati et al., 2001). sary SQL views on top of the 3NF schema without
The data modeling of a data warehouse is called dimen- denormalization (Martyn, 2004). Finally, during the
sionality modeling (DM) (Golfarelli & Rizzi, 1998; design, the issues of physical and logical design should
Begg & Connolly, 2002). Dimensional models were be separated; physical design is about performance, and
developed to support analytical tasks (Loukas & Spen- logical design is about understandability (Kimball,
cer, 1999). Dimensionality modeling concentrates on 2001).
facts and the properties of the facts and dimensions The OLAP tools, which are based on MDS views,
connected to facts (Busborg et al., 1999). Facts are access data warehouses for complex data analysis and
numeric, and quantitative data of the business and dimen- decision support activities (Kambayashi, Kumar,
sions describe different dimensions of the business Mohania, & Samtani, 2004). These tools typically in-
(Bonifati et al., 2001). Fact tables contain all the busi- clude assessing the effectiveness of a marketing cam-
ness events to be analyzed, and dimension tables define paign, forecasting product sales, and planning capacity.
how to analyze fact information (Loukas & Spencer, The architecture of the underlying database of the data
1999). The result of the dimensionality modeling is warehouse categorizes the different analysis tools
typically presented in a star model or in a snowflake (Begg & Connolly, 2002). Depending on the schema
model (Begg & Connolly, 2002). Multidimensional type, the terms Relational OLAP (ROLAP), Multidi-
schema (MDS) is a more generic term that collectively mensional OLAP (MOLAP), and Hybrid OLAP (HOLAP)
refers to both schemas (Martyn, 2004). When a star are used (Kroenke, 2004). ROLAP is a preferable
model is used, the fact tables are normalized, but dimen- choice when a) the information needs change frequently,
sion tables are not. When dimension tables are normal- b) the information should be as current as possible, and
ized too, the star model turns into a snowflake model c) the users are sophisticated computer users (Gorla,
(Bonifati et al., 2001). 2003). The main differences between ROLAP and
Ideally, an information system such as a data ware- MOLAP are in the currency of data and in the data
house should be correct, fast, and friendly (Martyn, storage processing capacity. MOLAP populates its own
2004). Correctness is especially important in data ware- structure of the original data when it is loaded from the
houses to ensure that decisions are based on accurate operational databases (Dodds, Hasan, Hyland, &
information. Actually, an estimated 30% to 50% of Veeraraghavan, 2000). In MOLAP, the data is stored in
information in a typical database is either missing or a special-purpose MDS (Begg & Connolly, 2002). ROLAP,
incorrect (Blackwood, 2000). This idea emphasizes the on the other hand, analyzes the original data, and the
users can drill down to the unit data level (Dodds et al., 2000). ROLAP uses a meta-data layer to avoid the creation of an MDS. It typically utilizes SQL extensions, such as CUBE and ROLLUP in Oracle DBMS (Begg & Connolly, 2002).

FUTURE TRENDS

In the future, the SOK Corporation needs to build a comprehensive view of the information. They can achieve this goal first by building an enterprise data model and then by modifying the existing information systems. From the reporting point of view, the operational data should be further modeled into a data warehouse solution. After these steps, the problems relating to reporting should be solved.

In order to improve the consistency of the data at Salon Seudun Puhelin, Ltd., most attention needs to be paid to the correctness of the data and the ways users work with the information systems. Until then, exploiting a data warehouse is not rational. For this company, it is reasonable to evaluate the true requirements of a data warehouse for reporting, because other solutions that are easier than starting a data warehouse project might be on hand.

In the State Provincial Office of Western Finland (WEST), the most important issue is acquiring a suitable tool for managing the collected data. After that step, the emphasis can shift to reporting, analysis, and time series. Basically, the environment of WEST is ideal for exploiting a data warehouse. The processes and the functions are heavily dependent on the analysis of developments in time series.

In TS-Group, Ltd., a data warehouse could solve the problems in reporting as well. Introducing a data warehouse would define the standard formats of the reports, which are currently causing a problem. At the same time, the distribution problems of the reports would be solved.

In Optiroc, Ltd., a data warehouse can solve the slowness of running reports and the difficulties in analysis. One reason for the slowness is that the reports run directly from the operational databases; moving the origin of the data to a data warehouse might offer improvements. Another problem deals with the analysis and the possibilities to define necessary reports on the fly. Introducing the necessary OLAP tools and training the users sufficiently will solve this problem.

In Statistics Finland, the interviewees mentioned that reporting is not a problem. However, a data warehouse might offer extra value in data analysis. At the moment, though, data warehousing is not the most acute topic in information technology development in Statistics Finland.

The presented cases reflect the overall situation in different organizations quite well and might predict what will happen in the future. The cases show that organizations have clear problems in analyzing the operational data. They have developed their own applications to ease the reporting but are still living with inadequate solutions. Data warehousing has been identified as a possible solution for the reporting problems, and the future will show whether data warehouses diffuse in the organizations.

CONCLUSION

The case organizations have recognized the possibilities of a data warehouse to solve problems in reporting. Their initiatives are based on business requirements, which is a good starting point because, to be successful, a data warehouse should be a business-driven initiative in partnership with the information technology department (Gardner, 1998; Finnegan & Sammon, 2000). However, the first step should be the analysis of the real needs for a data warehouse. At the same time, the present problems in reporting should be solved if possible.

To really support reporting, the operational data should be organized the way it is in a data warehouse. In practice, this means studying and analyzing the existing operational databases. As a result, a data model describing the necessary elements of the data warehouse should be achieved. A suggestion is to first produce an enterprise-level data model to describe all the data and their interconnections. This enterprise-level data model will be the basis for the data warehouse design. As the theory suggested at the beginning of this article, a normalized data model with SQL views should be produced and tested. This model can further be denormalized into a snowflake or a star model when performance requirements are not met with the normalized schema.

Producing a data model for the data warehouse is only one part of the process. In addition, special attention should be paid to designing data extraction from the operational databases to the data warehouse. The enterprise-level data model helps in understanding the data in these databases and thus eases the development of data extraction solutions. When the data warehouse is in use, automated tools are necessary to speed up the loading of the operational data into the data warehouse (Finnegan & Sammon, 2000). Before loading data into the data warehouse, the organizations should also analyze the correctness and the quality of the data in the operational databases.

Finally, as this article shows, a data warehouse is a relevant alternative for solving reporting problems that
the case organizations are currently facing. After the implementation of a data warehouse, organizations must ensure that the possible users of the system are educated in order to fully take advantage of the new possibilities that the data warehouse offers. Of course, organizations should remember that there is no quick jump to data warehouse exploitation.

REFERENCES

Begg, C., & Connolly, T. (2002). Database systems: A practical guide to design, implementation, and management. Addison-Wesley.

Blackwood, P. (2000). Eleven steps to success in data warehousing. Business Journal, 14(44), 26-27.

Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A., & Paraboschi, S. (2001). Designing data marts for data warehouses. ACM Transactions on Software Engineering and Methodology, 10(4), 452-483.

Busborg, F., Christiansen, J. G. B., & Tryfona, N. (1999). StarER: A conceptual model for data warehouse design. Proceedings of the ACM International Workshop on Data Warehousing and OLAP, USA.

Dodds, D., Hasan, H., Hyland, P., & Veeraraghavan, R. (2000). Approaches to the development of multidimensional databases: Lessons from four case studies. ACM SIGMIS Database, 31(3), 10-23.

Elmasri, R., & Navathe, S. B. (2000). Fundamentals of database systems. Reading, MA: Addison-Wesley.

Finnegan, P., & Sammon, D. (2000). The ten commandments of data warehousing. ACM SIGMIS Database, 31(4), 82-91.

Gardner, S. R. (1998). Building the data warehouse. Communications of the ACM, 41(9), 52-60.

Golfarelli, M., & Rizzi, S. (1998). A methodological framework for data warehouse design. Proceedings of the ACM International Workshop on Data Warehousing and OLAP, USA.

Gorla, N. (2003). Features to consider in a data warehousing system. Communications of the ACM, 46(11), 111-115.

Inmon, W. H. (1992). Building the data warehouse. New York: Wiley.

Kambayashi, Y., Kumar, V., Mohania, M., & Samtani, S. (2004). Recent advances and research problems in data warehousing. Lecture Notes in Computer Science, 1552, 81-92.

Kimball, R. (2001). A trio of interesting snowflakes. Intelligent Enterprise, 4, 30-32.

Kroenke, D. M. (2004). Database processing: Fundamentals, design and implementation. Upper Saddle River, NJ: Pearson Prentice Hall.

Loukas, T., & Spencer, T. (1999, October). From star to snowflake to ERD: Comparing data warehouse design approaches. Enterprise Systems.

Martyn, T. (2004). Reconsidering multi-dimensional schemas. ACM SIGMOD Record, 33(1), 83-88.

KEY TERMS

Data Extraction: A process in which data is transferred from operational databases to a data warehouse.

Dimensionality Modeling: A logical design technique that aims to present data in a standard, intuitive form that allows for high-performance access.

Normalization/Denormalization: Normalization is a technique for producing a set of relations with desirable properties, given the data requirements of an enterprise. Denormalization is a step backward in the normalization process, for example, to improve performance.

OLAP: The dynamic synthesis, analysis, and consolidation of large volumes of multidimensional data.

Snowflake Model: A variant of the star schema in which dimension tables do not contain denormalized data.

Star Model: A logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data.
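To make the star and snowflake structures described above concrete, here is a minimal sketch in SQLite. The table names, columns, and figures are illustrative assumptions, not data from the case organizations: a central fact table holds the numeric business events, and each denormalized dimension table is one join away.

```python
import sqlite3

# Minimal star schema: one fact table referencing two denormalized
# dimension tables. All names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT, category TEXT          -- denormalized: category kept in-line
);
CREATE TABLE dim_store (
    store_id INTEGER PRIMARY KEY,
    city TEXT, region TEXT            -- denormalized: region kept in-line
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    sale_date  TEXT,
    amount     REAL                   -- the numeric fact (measure)
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                 [(1, "Coffee", "Beverages"), (2, "Bread", "Bakery")])
conn.executemany("INSERT INTO dim_store VALUES (?,?,?)",
                 [(1, "Helsinki", "South"), (2, "Oulu", "North")])
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                 [(1, 1, "2005-01-10", 3.0), (1, 2, "2005-01-11", 2.5),
                  (2, 1, "2005-01-12", 1.5)])

# A typical analysis query: total sales per category and region.
# Each dimension is reached with a single join from the fact table.
rows = conn.execute("""
    SELECT p.category, s.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.category, s.region
    ORDER BY p.category, s.region
""").fetchall()
print(rows)
```

Moving category and region into tables of their own would normalize the dimensions and turn this star into a snowflake, at the price of an extra join per query, which is the efficiency trade-off discussed above.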
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Database Queries, Data Mining, and OLAP
returns all the customers from the above customer table that spent more than $100:

SELECT * FROM CUSTOMER_TABLE WHERE TOTAL_SPENT > $100;

This query returns a list of all instances in the table where the value of the attribute Total Spent is larger than $100. As this example highlights, queries act as filters that allow the user to select instances from a table based on certain attribute values. It does not matter how large or small the database table is; a query will simply return all the instances from a table that satisfy the attribute value constraints given in the query. This straightforward approach to retrieving data from a database also has a drawback. Assume for a moment that our example store is a large store with tens of thousands of customers (perhaps an online store). Firing the above query against the customer table in the database will most likely produce a result set containing a very large number of customers, and not much can be learned from this query except for the fact that a large number of customers spent more than $100 at the store. Our innate analytical capabilities are quickly overwhelmed by large volumes of data.

This is where the differences between querying a database and mining a database surface. In contrast to a query, which simply returns the data that fulfills certain constraints, data mining constructs models of the data in question. The models can be viewed as high-level summaries of the underlying data and are in most cases more useful than the raw data, since in a business sense they usually represent understandable and actionable items (Berry & Linoff, 2004). Depending on the questions of interest, data mining models can take on very different forms. They include decision trees and decision rules for classification tasks, association rules for market basket analysis, as well as clustering for market segmentation, among many other possible models. Good overviews of current data mining techniques and models can be found in Berry & Linoff (2004), Han & Kamber (2001), Hand, Mannila, & Smyth (2001), and Hastie, Tibshirani, & Friedman (2001).

To continue our store example, in contrast to a query, a data mining algorithm that constructs decision rules might return the following set of rules for customers that spent more than $100 from the store database:

IF AGE > 35 AND CAR = MINIVAN THEN TOTAL SPENT > $100
OR
IF SEX = M AND ZIP = 05566 THEN TOTAL SPENT > $100

These rules are understandable because they summarize hundreds, possibly thousands, of records in the customer database, and it would be difficult to glean this information off the query result. The rules are also actionable. Consider that the first rule tells the storeowner that adults over the age of 35 that own a minivan are likely to spend more than $100. Having access to this information allows the storeowner to adjust the inventory to cater to this segment of the population, assuming that this represents a desirable cross-section of the customer base. Similarly, by the second rule, male customers that reside in a certain ZIP code are likely to spend more than $100. Looking at census information for this particular ZIP code, the storeowner could again adjust the store inventory to also cater to this population segment, presumably increasing the attractiveness of the store and thereby increasing sales.

As we have shown, the fundamental difference between database queries and data mining is the fact that, in contrast to queries, data mining does not return raw data that satisfies certain constraints but returns models of the data in question. These models are attractive because in general they represent understandable and actionable items. Since no such modeling ever occurs in database queries, we do not consider running queries against database tables as data mining, no matter how large the tables are.

Database Queries vs. OLAP

In a typical relational database, queries are posed against a set of normalized database tables in order to retrieve instances that fulfill certain constraints on their attribute values (Date, 2000). The normalized tables are usually associated with each other via primary/foreign keys. For example, a normalized database of our store with multiple store locations or sales units might look something like the database given in Figure 2. Here, PK and FK indicate primary and foreign keys, respectively. From a user perspective, it might be interesting to ask some of the following questions:

How much did sales unit A earn in January?
How much did sales unit B earn in February?
What was their combined sales amount for the first quarter?

Even though it is possible to extract this information with standard SQL queries from our database, the normalized nature of the database makes the formulation of the appropriate SQL queries very difficult. Furthermore, the query process is likely to be slow due to the fact that it must perform complex joins and multiple
Figure 2. Normalized database schema for a store (Source: Craig et al., 1999, Figure 3.2)
[The figure shows the normalized store tables, linked to each other by primary keys (PK) and foreign keys (FK).]

scans of entire database tables in order to compute the desired aggregates.

By rearranging the database tables in a slightly different manner and using a process called pre-aggregation, or computing cubes, the above questions can be answered with much less computational power, enabling a real-time analysis of aggregate attribute values, that is, OLAP (Craig et al., 1999; Kimball, 1996; Scalzo, 2003). In order to enable OLAP, the database tables are usually arranged into a star schema, where the innermost table is called the fact table and the outer tables are called dimension tables. Figure 3 shows a star schema representation of our store organized along the main dimensions of the store business: customers, sales units, products, and time.

The dimension tables give rise to the dimensions in the pre-aggregated data cubes. The fact table relates the dimensions to each other and specifies the measures that are to be aggregated. Here the measures are dollar_total, sales_tax, and shipping_charge. Figure 4 shows a three-dimensional data cube pre-aggregated from the star schema in Figure 3 (in this cube we ignored the customer dimension, since it is difficult to illustrate four-dimensional cubes). In the cube building process, the measures are aggregated along the smallest unit in each dimension, giving rise to small pre-aggregated segments in a cube.

Data cubes can be seen as a compact representation of pre-computed query results.1 Essentially, each segment in a data cube represents a pre-computed query result to a particular query within a given star schema. The efficiency of cube querying allows the user to interactively move from one segment in the cube to another, enabling the inspection of query results in real time. Cube querying also allows the user to group and ungroup segments, as well as project segments onto given dimensions. This corresponds to such OLAP operations as roll-ups, drill-downs, and slice-and-dice, respectively (Gray, Bosworth, Layman, & Pirahesh, 1997). These specialized operations in turn provide answers to the kind of questions mentioned above.

As we have seen, OLAP is enabled by organizing a relational database in a way that allows for the pre-aggregation of certain query results. The resulting data cubes hold the pre-aggregated results, giving the user the ability to analyze these aggregated results in real time using specialized OLAP operations. In a larger context, we can view OLAP as a methodology for the organization of databases along the dimensions of a business, making the database more comprehensible to the end user.
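The pre-aggregation and roll-up ideas above can be sketched in a few lines of plain Python (the sales figures and dimension values are made-up illustrations): a cube is built in one pass by summing a measure along the smallest unit of each dimension, and coarser OLAP views are then obtained by summing segments of the cube rather than rescanning the raw facts.

```python
from collections import defaultdict

# Raw facts: (product, sales_unit, month, dollar_total). Hypothetical data.
facts = [
    ("coffee", "A", "Jan", 100.0), ("coffee", "A", "Jan", 50.0),
    ("coffee", "B", "Feb", 80.0),  ("bread",  "A", "Feb", 40.0),
    ("bread",  "B", "Jan", 60.0),
]

# Pre-aggregation: one pass over the facts builds the cube, keyed by the
# smallest unit in each dimension. Each entry is a pre-computed segment.
cube = defaultdict(float)
for product, unit, month, total in facts:
    cube[(product, unit, month)] += total

def roll_up(cube, keep):
    """Project the cube onto the dimension positions in `keep`,
    summing out the others (e.g. keep=(1, 2) groups by unit and month)."""
    out = defaultdict(float)
    for key, value in cube.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

# "How much did sales unit A earn in January?" -- answered from the cube,
# without touching the raw fact records again.
by_unit_month = roll_up(cube, keep=(1, 2))
print(by_unit_month[("A", "Jan")])   # 150.0
```

A real OLAP engine stores such segments persistently and indexes them, but the principle is the same: queries against the cube touch a handful of pre-aggregated values instead of the full fact table.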
database in such a way that it allows for the pre-computation of certain query results. OLAP itself is a way to look at these pre-aggregated query results in real time. However, OLAP is still simply a way to evaluate queries, which is different from building models of the data as in data mining. Therefore, from a technical point of view, we cannot consider OLAP to be data mining. Where data mining tools model data and return actionable rules, OLAP allows users to compare and contrast measures along business dimensions in real time.

It is interesting to note that recently a tight integration of data mining and OLAP has occurred. For example, Microsoft SQL Server 2000 not only allows OLAP tools to access the data cubes but also enables its data mining tools to mine data cubes (Seidman, 2001).

FUTURE TRENDS

... not seem surprising that all three tools are now routinely bundled.

REFERENCES

Berry, M. J. A., & Linoff, G. S. (2004). Data mining techniques: For marketing, sales, and customer relationship management (2nd ed.). New York: John Wiley & Sons.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.

Craig, R. S., Vivona, J. A., & Bercovitch, D. (1999). Microsoft data warehousing. New York: John Wiley & Sons.

Date, C. J. (2000). An introduction to database systems (7th ed.). Reading, MA: Addison-Wesley.
Seidman, C. (2001). Data mining with Microsoft SQL Server 2000 technical reference. Microsoft Press.

Yin, X., Han, J., Yang, J., & Yu, P. S. (2004). CrossMine: Efficient classification across multiple database relations. Paper presented at the 20th International Conference on Data Engineering (ICDE 2004), Boston, MA, USA.

KEY TERMS

Business Intelligence: Business intelligence (BI) is a broad category of technologies that allows for gathering, storing, accessing and analyzing data to help business users make better decisions. (Source: http://www.oranz.co.uk/glossary_text.htm)

Data Cubes: Also known as OLAP cubes. Data stored in a format that allows users to perform fast multi-dimensional analysis across different points of view. The data is often sourced from a data warehouse and relates to a particular business function. (Source: http://www.oranz.co.uk/glossary_text.htm)

OLAP (Online Analytical Processing): A category of applications and technologies for collecting, managing, processing and presenting multidimensional data for analysis and management purposes. (Source: http://www.olapreport.com/glossary.htm)

Query: This term generally refers to databases. A query is used to retrieve database records that match certain criteria. (Source: http://usa.visa.com/business/merchants/online_trans_glossary.html)

SQL (Structured Query Language): SQL is a standardized programming language for defining, retrieving, and inserting data objects in relational databases.

Star Schema: A database design that is based on a central detail fact table linked to surrounding dimension tables. Star schemas allow access to data using business terms and perspectives. (Source: http://www.ds.uillinois.edu/glossary.asp)
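The query-as-filter behavior this article contrasts with data mining can be reproduced in a few lines with SQLite; the customer rows below are invented purely for illustration.

```python
import sqlite3

# Hypothetical customer records; only the filtering behavior matters here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_table "
             "(name TEXT, age INTEGER, zip TEXT, total_spent REAL)")
conn.executemany("INSERT INTO customer_table VALUES (?,?,?,?)", [
    ("Ann", 42, "05566", 140.0),
    ("Bob", 28, "05566", 95.0),
    ("Cid", 37, "10001", 210.0),
])

# The article's example query: a pure filter over attribute values.
rows = conn.execute(
    "SELECT name FROM customer_table WHERE total_spent > 100 ORDER BY name"
).fetchall()
print([r[0] for r in rows])   # ['Ann', 'Cid']
```

The query returns every raw record that satisfies the constraint and nothing else; summarizing those records into rules such as IF AGE > 35 AND CAR = MINIVAN THEN TOTAL SPENT > $100 is the modeling step that belongs to data mining, not to the query itself.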
ENDNOTE
Database Sampling for Data Mining
sample S is composed of k partial samples s1, s2, ..., sk, each drawn randomly, with replacement or not, from one of the strata. Rao (2000) discusses several methods of allocating the number of sampled elements for each stratum. Bryant et al. (1960) argue that, if the sample is allocated to the strata in proportion to the number of elements in the strata, it is virtually certain that the stratified sample estimate will have a smaller variance than a simple random sample of the same size. The stratification of a sample may be done according to one criterion. Most commonly, though, there are several alternative criteria that may be used for stratification. When this is the case, the different criteria may all be employed to achieve multi-way stratification. Neyman (1934) argues that there are situations when it is very difficult to use an individual unit as the unit of sampling. For such situations, the sampling unit should be a group of elements, and each stratum should be composed of several groups. In comparison with stratified random sampling, where samples are selected from each stratum, in cluster sampling a sample of clusters is selected and observations/measurements are made on the clusters. Cluster sampling and stratification may be combined (Rao, 2000).

Database Sampling Methods

Database sampling has been practiced for many years for purposes of estimating aggregate query results, database auditing, query optimization, and obtaining samples for further statistical processing (Olken, 1993). Static sampling (Olken, 1993) and adaptive (dynamic) sampling (Haas & Swami, 1992) are two alternatives for obtaining samples for data mining tasks. In recent years, many studies have been conducted in applying sampling to inductive and non-inductive data mining (John & Langley, 1996; Provost et al., 1999; Toivonen, 1996).

Simple Random Sampling

Simple random sampling is by far the simplest method of sampling a database. Simple random sampling may be implemented using sequential random sampling or reservoir sampling. For sequential random sampling, the problem is to draw a random sample of size n without replacement from a file containing N records. The simplest sequential random sampling method is due to Fan et al. (1962) and Jones (1962). An independent uniform random variate [from the uniform interval (0,1)] is generated for each record in the file to determine whether the record should be included in the sample. If m records have already been chosen from among the first t records in the file, the (t+1)st record is chosen with probability (RQsize/RMsize), where RQsize = (n-m) is the number of records that still need to be chosen for the sample, and RMsize = (N-t) is the number of records in the file still to be processed. This sampling method is commonly referred to as method S (Vitter, 1987).

The reservoir sampling method (Fan et al., 1962; Jones, 1962; Vitter, 1985, 1987) is a sequential sampling method over a finite population of database records with an unknown population size. Olken (1993) discusses its use in sampling of database query outputs on the fly. This technique produces a sample of size S by initially placing the first S records of the database/file/query in the reservoir. Each subsequent kth database record is accepted with probability S/k. If accepted, it replaces a randomly selected record in the reservoir.

Acceptance/Rejection sampling (A/R sampling) can be used to obtain weighted samples (Olken, 1993). For a weighted random sample, the probabilities of inclusion of the elements of the population are not uniform. For database sampling, the inclusion probability of a data record is proportional to some weight calculated from the record's attributes. Suppose that one database record rj is to be drawn from a file of n records with the probability of inclusion being proportional to the weight wj. This may be done by generating a uniformly distributed random integer j, 1 ≤ j ≤ n, and then accepting the sampled record rj with probability pj = wj / wmax, where wmax is the maximum possible value for wj. The acceptance test is performed by generating another uniform random variate uj, 0 ≤ uj ≤ 1, and accepting rj iff uj < pj. If rj is rejected, the process is repeated until some rj is accepted.

Stratified Sampling

Density biased sampling (Palmer & Faloutsos, 2000) is a method that combines clustering and stratified sampling. In density biased sampling, the aim is to sample so that within each cluster points are selected uniformly, the sample is density preserving, and the sample is biased by cluster size. Density preserving in this context means that the expected sum of weights of the sampled points for each cluster is proportional to the cluster's size. Since it would be infeasible to determine the clusters a priori, groups are used instead to represent all the regions in n-dimensional space. Sampling is then done to be density preserving for each group. The groups are formed by placing a d-dimensional grid over the data. In the d-dimensional grid, the d dimensions of each cell are labeled either with a bin value for numeric attributes or by a discrete value for categorical attributes. The d-dimensional grid defines the strata for multi-way stratified sampling. A one-pass algorithm is used to perform the weighted sampling, based on the reservoir algorithm.
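The reservoir method described above is compact enough to state directly. This sketch keeps the first S records, then handles each later kth record by drawing one uniform index in [0, k): the index falls below S with probability S/k, which simultaneously decides acceptance and picks the reservoir slot to overwrite. The seeded generator is only an assumption for reproducibility.

```python
import random

def reservoir_sample(records, s, rng=None):
    """Simple random sample of size s from an iterable whose length
    need not be known in advance (Fan et al., 1962; Vitter, 1985)."""
    rng = rng or random.Random()
    reservoir = []
    for k, record in enumerate(records, start=1):
        if k <= s:
            reservoir.append(record)     # fill the reservoir first
        else:
            j = rng.randrange(k)         # uniform over 0 .. k-1
            if j < s:                    # happens with probability s/k
                reservoir[j] = record    # replace a random reservoir slot
    return reservoir

# Works on any stream, e.g. a query result cursor; here, 10,000 records.
sample = reservoir_sample(range(10_000), 5, random.Random(42))
print(len(sample))   # 5
```

Because every record enters the reservoir at most once, the result is a sample without replacement, and each record ends up in the final sample with equal probability s/N.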
Adaptive Sampling

Lipton et al. (1990) use adaptive sampling, also known as sequential sampling, for database sampling. In sequential sampling, a decision is made after each sampled element whether to continue sampling. Olken (1993) has observed that sequential sampling algorithms outperform conventional single-stage algorithms in terms of the number of sample elements required, since they can adjust the sample size to the population parameters. Haas & Swami (1992) have proved that sequential sampling uses the minimum sample size for the required accuracy.

John & Langley (1996) have proposed a method they call dynamic sampling, which combines database sampling with the estimation of classifier accuracy. The method is most efficiently applied to classification algorithms that are incremental, for example naïve Bayes and artificial neural-network algorithms such as backpropagation. They define the concept of probably close enough (PCE), which they use for determining when a sample size provides an accuracy that is probably good enough. Good enough in this context means that there is a small probability that the mining algorithm could do better by using the entire database. The smallest sample size n is chosen from a database of size N, so that Pr(acc(N) - acc(n) > ε) ≤ δ, where acc(n) is the accuracy after processing a sample of size n, and ε is a parameter that describes what close enough means. The method works by gradually increasing the sample size n until the PCE condition is satisfied.

Provost, Jensen, & Oates (1999) use progressive sampling, another form of adaptive sampling, and analyse its efficiency relative to induction with all available examples. The purpose of progressive sampling is to establish nmin, the size of the smallest sufficient sample. They address the issue of convergence, where convergence means that a learning algorithm has reached its plateau of accuracy. In order to detect convergence, they define the notion of a sampling schedule S as S = {n0, n1, ..., nk}, where ni is an integer that specifies the size of the sample, and S is a sequence of sample sizes to be provided to an inductive algorithm. They show that schedules in which ni increases geometrically, as {n0, a·n0, a^2·n0, ..., a^k·n0}, are asymptotically optimal. As one can see, progressive sampling is similar to the adaptive sampling method of John & Langley (1996), except that a non-linear increment for the sample size is used.

THE SAMPLING PROCESS

Several decisions need to be made when sampling a database. One needs to decide on a sampling method, a suitable sample size, and the level of accuracy that can be tolerated. These issues are discussed below.

Deciding on a Sampling Method

The data to be sampled may be balanced, imbalanced, clustered or unclustered. These characteristics will affect the quality of the sample obtained. While simple random sampling is very easy to implement, it may produce non-representative samples for data that is imbalanced or clustered. On the other hand, stratified sampling, with a good choice of strata cells, can be used to produce representative samples, regardless of the characteristics of the data. Implementation of one-way stratification should be straightforward; however, for multi-way stratification, there are many considerations to be made. For example, in density-biased sampling, a d-dimensional grid is used. Suppose each dimension has n possible values (or bins). The multi-way stratification will result in n^d strata cells. For large d, this is a very large number of cells. When it is not easy to estimate the sample size in advance, adaptive (or dynamic) sampling may be employed, if the data mining algorithm is incremental.

Determining the Representative Sample Size

For static sampling, the question must be asked: What is the size of a representative sample? A sample is considered statistically valid if it is sufficiently similar to the database from where it is drawn (John & Langley, 1996). Univariate sampling may be used to test that each field in the sample comes from the same distribution as the parent database. For categorical fields, the chi-squared test can be used to test the hypothesis that the sample and the database come from the same distribution. For continuous-valued fields, a large-sample test can be used to test the hypothesis that the sample and the database have the same mean. It must, however, be pointed out that obtaining fixed-size representative samples from a database is not a trivial task, and consultation with a statistician is recommended.

For inductive algorithms, the results from the theory of probably approximately correct (PAC) learning have been suggested in the literature (Valiant, 1984; Haussler, 1990). These have, however, been largely criticized for overestimating the sample size (e.g., Haussler, 1990). For incremental inductive algorithms, dynamic sampling (John & Langley, 1996; Provost et al., 1999) may be employed to determine when a sufficient sample has been processed. For association rule mining, the methods described by Toivonen (1996) may be used to determine the sample size.
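The geometric schedule and convergence detection of progressive sampling can be sketched as follows. The plateau test (stop once accuracy improves by no more than a tolerance eps) and the caller-supplied `accuracy` function are illustrative simplifications, not the convergence-detection procedure of Provost, Jensen, & Oates.

```python
import random

def progressive_sample(data, accuracy, n0=100, a=2, eps=0.01):
    """Progressive sampling sketch: evaluate a learner on geometrically
    growing samples of sizes n0, a*n0, a^2*n0, ... and stop once the
    accuracy gain falls below eps (a simple stand-in for detecting the
    plateau of accuracy). Returns (sample size, accuracy)."""
    n, last = n0, -1.0
    while n < len(data):
        acc = accuracy(random.sample(data, n))
        if acc - last <= eps:          # plateau of accuracy reached
            return n, acc
        last, n = acc, n * a
    return len(data), accuracy(data)
```

With a = 2 the schedule is {n0, 2·n0, 4·n0, ...}, the geometric form shown to be asymptotically optimal.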
Determining the Accuracy and Confidence of Results Obtained from Samples

For the task of inductive data mining, suppose we have estimated the classification error for a classifier constructed from sample S, to be error_S(h), as the proportion of the test examples that are misclassified. Statistical theory, based on the central limit theorem, enables us to conclude that, with approximately N% probability, the true error lies in the interval:

error_S(h) ± z_N · sqrt( error_S(h) · (1 - error_S(h)) / n )

where n is the number of test examples and z_N is the constant associated with a two-sided N% confidence interval (Mitchell, 1997).

CONCLUSION

Sampling can be used to reduce the amount of data presented to a data mining algorithm. More than fifty years of research has produced a variety of sampling algorithms for database tables and query result sets. The selection of a sampling method should depend on the nature of the data as well as the algorithm to be employed. Estimation of sample sizes for static sampling is a tricky issue. More research in this area is needed in order to provide practical guidelines. Stratified sampling would appear to be a versatile approach to sampling any type of data. However, more research is needed to address especially the issue of how to define the strata for sampling.
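The confidence interval discussed under Determining the Accuracy can be computed directly; z = 1.96 corresponds to an approximately 95% two-sided interval under the normal approximation. A minimal helper:

```python
import math

def error_confidence_interval(error_s, n, z=1.96):
    """Two-sided confidence interval for the true error of a classifier,
    given the test-set error error_s measured on n examples, based on
    the central-limit-theorem normal approximation (z = 1.96 for ~95%)."""
    half_width = z * math.sqrt(error_s * (1.0 - error_s) / n)
    return error_s - half_width, error_s + half_width
```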
Mitchell, T.M. (1997). Machine learning. McGraw-Hill.

Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97, 558-625.

Olken, F. (1993). Random sampling from databases. PhD thesis. University of California at Berkeley.

Palmer, C.R., & Faloutsos, C. (2000). Density biased sampling: An improved method for data mining and clustering. In Proceedings of the ACM SIGMOD Conference (pp. 82-92).

Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 23-32), San Diego, CA.

Rao, P.S.R.S. (2000). Sampling methodologies with applications. FL: Chapman & Hall/CRC.

Toivonen, H. (1996). Sampling large databases for association rules. In Proceedings of the Twenty-Second Conference on Very Large Databases (VLDB '96), Mumbai, India.

Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.

Vitter, J.S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11, 37-57.

Vitter, J.S. (1987). An efficient method for sequential random sampling. ACM Transactions on Mathematical Software, 13(1), 58-67.

KEY TERMS

Dynamic Sampling (Adaptive Sampling): A method of sampling where sampling and processing of data proceed in tandem. After processing each incremental part of the sample, a decision is made whether to continue sampling or not.

Reservoir Sampling: A database sampling method that implements uniform random sampling on a database table of unknown size, or a query result set of unknown size.

Sequential Random Sampling: A database sampling method that implements uniform random sampling on a database table whose size is known.

Simple Random Sampling: Simple random sampling involves selecting at random elements of the population to be studied. The sample S is obtained by selecting at random single elements of the population P.

Simple Random Sampling with Replacement (SRSWR): A method of simple random sampling where an element stands a chance of being selected more than once.

Simple Random Sampling without Replacement (SRSWOR): A method of simple random sampling where each element stands a chance of being selected only once.

Static Sampling: A method of sampling where the whole sample is obtained before processing begins. The user must specify the sample size.

Stratified Sampling: For this method, before the samples are drawn, the population P is divided into several strata, p1, p2, ..., pk, and the sample S is composed of k partial samples s1, s2, ..., sk, each drawn randomly, with replacement or not, from one of the strata.
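The Reservoir Sampling entry above can be illustrated with the basic reservoir scheme (Algorithm R, discussed by Vitter, 1985), which draws a uniform random sample from a stream whose size is not known in advance:

```python
import random

def reservoir_sample(stream, n):
    """Uniform random sample of size n from a stream of unknown size
    (the basic reservoir scheme, Algorithm R; see Vitter, 1985)."""
    reservoir = []
    for t, item in enumerate(stream):
        if t < n:
            reservoir.append(item)       # fill the reservoir first
        else:
            # item t is kept with probability n / (t + 1)
            j = random.randint(0, t)
            if j < n:
                reservoir[j] = item
    return reservoir
```

Vitter (1985, 1987) gives faster variants that skip over records rather than generating a random number per record.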
Luvai Motiwalla
University of Massachusetts Lowell, USA
M. Riaz Khan
University of Massachusetts Lowell, USA
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
DEA Evaluation of Performance of E-Business Initiatives
benchmarks where multiple measurements exist. It is rare that one single measure can suffice for the purpose of performance assessment. In our empirical study, there are multiple measures that characterize the performance of retail companies. This requires that the research tool used here have the flexibility to deal with changing production technology in the context of multiple performance measures. Data envelopment analysis (DEA) was originally developed to measure the relative efficiency of peer decision-making units (DMUs) in a multiple input-output setting. DEA has been proven to be an excellent methodology for performance evaluation and benchmarking (Zhu, 2003).

Based on Cooper, Seiford, and Zhu (2004), the specific reasons for using DEA are given as follows. First, DEA is a data-oriented approach for evaluating the performance of a set of peer DMUs, which convert multiple inputs into multiple outputs. In our case, the DMUs can be, for example, corporations that have launched EB activities. For each corporation, each year can be regarded as a DMU. Second, DEA is a methodology directed to frontiers rather than central tendencies. Instead of trying to fit a regression plane through the center of the data, as is done in statistical regression, for example, one floats a piecewise linear surface to rest on top of the observations. Because of this approach, DEA proves particularly adept at uncovering relationships that remain hidden in other methodologies. Third, DEA does not require explicitly formulated assumptions of functional form as in linear and nonlinear regression models. This flexibility allows us to identify the multi-dimensional efficient frontier without the need for explicitly expressing the technology change and organizational knowledge.

In order to discriminate the performance among the efficient DMUs, a super-efficiency DEA model in which a DMU under evaluation is excluded from the reference set is developed. However, the super-efficiency model has been restricted to the case of constant returns to scale (CRS), because the non-CRS super-efficiency DEA model can be infeasible (Seiford & Zhu, 1998, 1999; Zhu, 1996). It is difficult to precisely define infeasibility. As a result, one cannot rank the performance of a set of DMUs. In fact, an input-oriented super-efficiency DEA model measures the input super-efficiency when outputs are fixed at their current levels. Likewise, an output-oriented super-efficiency DEA model measures the output super-efficiency when inputs are fixed at their current levels. From the different uses of the super-efficiency concept, we see that super-efficiency can be interpreted as the degree of efficiency stability or input saving/output surplus achieved by an efficient DMU. If super-efficiency is used as an efficiency stability measure, then infeasibility means that an efficient DMU's efficiency classification is stable to any input changes, if an input-oriented super-efficiency DEA model is used (or any output changes, if an output-oriented super-efficiency DEA model is used). Therefore, we can use +∞ to represent the super-efficiency score (i.e., infeasibility means the highest super-efficiency). Chen (2004) shows that (i) if an efficient DMU does not possess any input super-efficiency (input saving), it must possess output super-efficiency (output surplus), and (ii) if an efficient DMU does not possess any output super-efficiency, it must possess input super-efficiency. We thus can use both input-oriented and output-oriented super-efficiency DEA models to fully characterize the super-efficiency.

Based on the above derivations, Chen et al. (2004) are able to rank the performance of a set of publicly held corporations in the retail industry over the period 1997-2000. Specifically, the objective of this study is to determine whether the financial data support the beneficial claims made in the popular literature that EB has boosted the bottom line.

MAIN THRUST

To present our DEA methodology, we assume that there are n DMUs to be evaluated. Each DMU consumes varying amounts of m different inputs to produce s different outputs. Specifically, DMUj consumes amount xij of input i and produces amount yrj of output r. We assume that xij ≥ 0 and yrj ≥ 0, and further assume that each DMU has at least one positive input and one positive output value. The input-oriented and output-oriented super-efficiency models whose frontier exhibits VRS can be expressed as in Seiford and Zhu (1999) in Box 1, where xio and yro are respectively the ith input and rth output for a DMUo under evaluation.

Let θo represent the score for characterizing the super-efficiency in terms of input saving; we have

θo = θo^{VRS-super*}, if the input-oriented super-efficiency model is feasible;
θo = 1, if the input-oriented super-efficiency model is infeasible.

Note that θo ≥ 1. If θo > 1, a specific efficient DMUo has input super-efficiency. If θo = 1, DMUo does not have input super-efficiency. Similarly, let φo represent the score for characterizing the output super-efficiency; we have

φo = φo^{VRS-super*}, if the output-oriented super-efficiency model is feasible;
φo = 1, if the output-oriented super-efficiency model is infeasible.
Box 1.

Input-oriented VRS super-efficiency model:

min θo^{VRS-super}
s.t.  Σ_{j=1, j≠o}^{n} λj xij ≤ θo^{VRS-super} xio,   i = 1, 2, ..., m;
      Σ_{j=1, j≠o}^{n} λj yrj ≥ yro,   r = 1, 2, ..., s;
      Σ_{j=1, j≠o}^{n} λj = 1;
      λj ≥ 0, j ≠ o.

Output-oriented VRS super-efficiency model:

max φo^{VRS-super}
s.t.  Σ_{j=1, j≠o}^{n} λj xij ≤ xio,   i = 1, 2, ..., m;
      Σ_{j=1, j≠o}^{n} λj yrj ≥ φo^{VRS-super} yro,   r = 1, 2, ..., s;
      Σ_{j=1, j≠o}^{n} λj = 1;
      λj ≥ 0, j ≠ o.
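Solving the VRS models in Box 1 requires a linear-programming solver. As a hand-computable illustration of the super-efficiency idea (excluding the evaluated DMU from the reference set), the special case of a single input, a single output, and constant returns to scale reduces to a productivity ratio; this simplified CRS case is an assumption for illustration, not the article's VRS model:

```python
def crs_super_efficiency(inputs, outputs, o):
    """Input-oriented super-efficiency in the single-input, single-output
    CRS special case: the productivity of DMU o relative to the best
    productivity among all OTHER DMUs. Scores above 1 indicate that
    DMU o is efficient with room to spare (input saving)."""
    productivity = [y / x for x, y in zip(inputs, outputs)]
    # Exclude DMU o itself from the reference set.
    best_other = max(p for j, p in enumerate(productivity) if j != o)
    return productivity[o] / best_other
```

For example, a DMU producing 4 units of output from 2 units of input, against peers producing 4 from 4 and 4 from 8, gets a super-efficiency score of 2.0: it could double its input and still remain efficient.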
Note that φo ≤ 1. If φo < 1, a specific efficient DMUo has output super-efficiency. If φo = 1, DMUo does not have output super-efficiency.

The DEA inputs and outputs can be developed from the financial ratios. For example, the inputs can include (i) number of employees, (ii) inventory cost, (iii) total current assets, and (iv) cost of sales; and the outputs can include (i) revenue and (ii) net income. In our study, 75% of the EB companies are efficient, and 57% of the non-EB companies are efficient. Thus, in general, the EB companies have performed better. In terms of the super-efficiency DEA model, the EB companies demonstrate a better performance than the companies that have not yet adopted the EB initiatives (Chen et al., 2004).

FUTURE TRENDS

It has been recognized that the link between IT investment and firm performance is indirect. Future research should focus on multi-stage performance of IT impacts on firm performance. For example, Chen and Zhu (2004) developed a preliminary DEA-based model to (i) characterize the indirect impact of IT on firm performance, (ii) identify the efficient frontier of two value-added stages related to IT investment and profit generation, and (iii) highlight firms that can be further analyzed for best practice benchmarking.

CONCLUSION

The analysis of the retail industry data indicates that there is some evidence that the EB initiatives have had some degree of favorable impact on the financial performance of companies that have moved in this direction. Results indicate that the EB companies performed better in some measures than the non-EB companies included in the sample. Further, by contrasting the performance of EB companies against the non-EB companies, the findings confirm that the EB companies have benefited from the innovation and the strategic integration of the EB technologies.

REFERENCES

Bingi, P., Mir, A., & Khamalah, J. (2000). The challenges facing global e-commerce. Information Systems Management, 6(2), 26-34.

Charnes, A., Cooper, W.W., & Rhodes, E. (1978). Measuring the efficiency of decision making units. European Journal of Operational Research, 2, 429-444.

Chen, Y. (2004). Ranking efficient units in DEA. OMEGA, 32, 213-219.

Chen, Y., Motiwalla, L., & Khan, M.R. (2004). Using super-efficiency DEA to evaluate financial performance of e-business initiative in the retail industry. International Journal of Information Technology and Decision Making, 3(2), 337-351.

Chen, Y., & Zhu, J. (2004). Measuring information technology's indirect impact on firm performance. Information Technology & Management Journal, 5(1), 9-22.

Choi, S., & Whinston, A. (2000). Benefits and requirements for interoperability in electronic marketplace. Technology in Society, 22, 33-44.

Cooper, W.W., Seiford, L.M., & Zhu, J. (2004). Handbook on data envelopment analysis. Boston: Kluwer Academic Publishers.
Hoffman, D., Novak, T., & Chatterjee, P. (1995). Commercial scenarios for the Web: Opportunities and challenges. Journal of Computer-Mediated Communication, 1(3).

Motiwalla, L., & Khan, M.R. (2002). Financial impact of e-business initiatives in the retail industry. Journal of Electronic Commerce in Organizations, 1(1), 55-73.

Pyle, R. (1996). Commerce and the Internet. Communications of the ACM, 39(6), 23.

Seiford, L.M., & Zhu, J. (1998). Sensitivity analysis of DEA models for simultaneous changes in all the data. Journal of the Operational Research Society, 49, 1060-1071.

Seiford, L.M., & Zhu, J. (1999). Infeasibility of super-efficiency data envelopment analysis models. INFOR, 37, 174-187.

Steinfield, C., & Whitten, P. (1999). Community level socio-economic impacts of electronic commerce. Journal of Computer-Mediated Communication, 5(2).

White, G. (1999, December 3). How GM, Ford think Web can make a splash on the factory floor. Wall Street Journal, 1.

Wigand, R., & Benjamin, R. (1995). Electronic commerce: Effects on electronic markets. Journal of Computer-Mediated Communication, 1(3).

KEY TERMS

Data Envelopment Analysis (DEA): A data-oriented mathematical programming approach that allows multiple performance measures in a single model.

Decision Making Unit (DMU): The subject under evaluation.

Efficient: Full efficiency is attained by any DMU if and only if none of its inputs or outputs can be improved without worsening some of its other inputs or outputs.

Electronic Business (EB): Business transactions involving exchange of goods and services with customers and/or business partners over the Internet.

Inputs/Outputs: Refer to the performance measures used in DEA evaluation. Inputs usually refer to the resources used, and outputs refer to the outcomes achieved by an organization or DMU.

Returns to Scale (RTS): RTS are considered to be increasing if a proportional increase in all the inputs results in a more than proportional increase in the single output. In DEA, the concept of returns to scale is extended to multiple inputs and multiple outputs situations.

Super-Efficiency: The input savings or output surpluses achieved by an efficient DMU.
Claudio Conversano
University of Cassino, Italy
Decision Tree Induction
depends on both the hypotheses to verify and their alternatives. For instance, in classification trees, the number of response classes and the prior distribution of cases among the classes influence the quality of the final decision rule. In the credit-scoring example, an induction procedure using a sample of 80% of good clients and 20% of bad clients likely will provide reliable rules to identify good clients and unreliable rules to identify bad ones.

MAIN THRUST

Exploratory trees can be fruitfully used to investigate the data structure, but they cannot be used straightforwardly for induction purposes. The main reason is that exploratory trees are accurate and effective with respect to the training data used for growing the tree, but they might perform poorly when applied to classifying/predicting fresh cases that have not been used in the growing phase.

DTI Main Tasks

DTI definitely has an important purpose represented by understandability: the tree structure for induction needs to be simple and not large; this is a difficult task since a predictor may reappear (even though in a restricted form) many times down a branch. At the same time, a further requirement is given by the identification issue: on one hand, terminal branches of the expanded tree reflect particular features of the training set, causing over-fitting; on the other hand, over-pruned trees necessarily do not allow identification of all the response classes/values (under-fitting).

Tree Model Building

Simplification method performance in terms of accuracy depends on the partitioning criterion used in the tree-growing procedure (Buntine & Niblett, 1992). Thus, exploratory trees become an important preliminary step for DTI. In tree model building, it is worth distinguishing between the optimality criterion for tree pruning (simplification method) and the criterion for selecting the best decision rule (decision rule selection). These criteria often use independent datasets (training set and test set). In addition, a validation set can be required to assess the quality of the final decision rule (Hand, 1997). In this respect, segmentation with pruning and assessment can be viewed as stages of any computational model-building process based on a supervised learning algorithm. Furthermore, growing the tree structure using a Fast Algorithm for Splitting Trees (FAST) (Mola & Siciliano, 1997) becomes a fundamental step to speed up the overall DTI procedure.

Tree Simplification: Pruning Algorithms

A further step is required for DTI relying on the hypothesis of uncertainty in the data due to noise and residual variation. Simplifying trees is necessary to remove the most unreliable branches and improve understandability. Thus, the goal of simplification is inferential (i.e., to define the structural part of the tree and reduce its size while retaining its accuracy). Pruning methods consist in simplifying trees in order to remove the most unreliable branches and improve the accuracy of the rule for classifying fresh cases.

The pioneering approach to simplification was presented in the Automatic Interaction Detection (AID) of Morgan and Sonquist (1963). It was based on arresting the recursive partitioning procedure according to some stopping rule (pre-pruning).

Alternative procedures consist in pruning algorithms working either from the bottom to the top of the tree (post-pruning) or vice versa (pre-pruning). CART (Breiman et al., 1984) introduced the idea of growing the totally expanded tree and then removing retrospectively some of the branches (post-pruning). This results in a set of optimally pruned trees for the selection of the final decision rule.

The main issue of pruning algorithms is the definition of a complexity measure that takes account of both the tree size and accuracy through a penalty parameter expressing the gain/cost of pruning tree branches. The training set is often used for pruning, whereas the test set is used for selecting the final decision rule. This is the case of both the error-complexity pruning of CART and the critical value pruning (Mingers, 1989b). Nevertheless, some methods require only the training set. This is the case of the pessimistic error pruning and the error-based pruning (Quinlan, 1987, 1993) as well as the minimum error pruning (Cestnik & Bratko, 1991) and the CART cross-validation method. Instead, other methods use only the test set, such as the reduced error pruning (Quinlan, 1987). These latter pruning algorithms yield just one best pruned tree, which thus represents the final rule.

In DTI, accuracy refers to the predictive ability of the decision tree to classify/predict an independent set of test data. In classification trees, the error rate, measured by the number of incorrect classifications of the tree on test data, does not reflect accuracy of predictions for classes that are not equally likely, and those with few cases are usually badly predicted. As an alternative to the CART pruning, Cappelli et al. (1998) provided a pruning algorithm based on the impurity-complexity measure to take account of the distribution of the cases over the classes.
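The error-complexity idea behind CART's post-pruning can be sketched as follows. For each internal node t, the complexity parameter alpha(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1) trades the error increase from collapsing t against the leaves saved, and the weakest link (smallest alpha) is pruned first. The nested-dict tree encoding used here is illustrative, not from the article:

```python
def subtree_stats(node):
    """Return (subtree error, number of leaves) for a tree encoded as a
    nested dict with key 'error' (resubstitution error if the node were
    made a leaf) and optional 'left'/'right' children."""
    if 'left' not in node:                       # terminal node
        return node['error'], 1
    left_err, left_leaves = subtree_stats(node['left'])
    right_err, right_leaves = subtree_stats(node['right'])
    return left_err + right_err, left_leaves + right_leaves

def weakest_link_alpha(node):
    """Smallest alpha(t) over internal nodes t: the penalty value at
    which the next branch would be pruned in error-complexity pruning."""
    if 'left' not in node:
        return float('inf')
    subtree_err, leaves = subtree_stats(node)
    alpha = (node['error'] - subtree_err) / (leaves - 1)
    return min(alpha,
               weakest_link_alpha(node['left']),
               weakest_link_alpha(node['right']))
```

Repeatedly collapsing the weakest link yields the nested sequence of optimally pruned subtrees from which the final decision rule is selected on test data.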
Mingers, J. (1989a). An empirical comparison of selection measures for decision tree induction. Machine Learning, 3, 319-342.

Mingers, J. (1989b). An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4, 227-243.

Mola, F., & Siciliano, R. (1997). A fast splitting algorithm for classification trees. Statistics and Computing, 7, 209-216.

Morgan, J.N., & Sonquist, J.A. (1963). Problems in the analysis of survey data and a proposal. Journal of the American Statistical Association, 58, 415-434.

Oliver, J.J., & Hand, D.J. (1995). On pruning and averaging decision trees. In Proceedings of the 12th International Workshop on Machine Learning. Berlin: Springer.

Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Quinlan, J.R. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, 27, 221-234.

Schapire, R.E., Freund, Y., Bartlett, P., & Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1651-1686.

Siciliano, R. (1998). Exploratory versus decision trees. In R. Payne & P. Green (Eds.), Proceedings in computational statistics (pp. 113-124). Heidelberg: Physica-Verlag.

Siciliano, R., Aria, M., & Conversano, C. (2004). Tree harvest: Methods, software and applications. In J. Antoch (Ed.), COMPSTAT 2004 Proceedings (pp. 1807-1814). Berlin: Springer.

Zhang, H., & Singer, B. (1999). Recursive partitioning in the health sciences. New York: Springer Verlag.

KEY TERMS

Adaptive Boosting (AdaBoost): An iterative bootstrap replication of the sample units of the training sample such that, at any iteration, misclassified or badly predicted cases have a higher probability of being included in the current bootstrap sample, and the final decision rule is obtained by majority voting.

Bagging (Bootstrap Aggregating): A bootstrap replication of the sample units of the training sample, each having the same probability to be included in the bootstrap sample, used to generate single prediction/classification rules that, being aggregated, provide a final decision rule consisting in either the average (for regression problems) or the modal class (for classification problems) among the single estimates.

Classification Tree: An oriented tree structure obtained by a recursive partitioning of a sample of cases on the basis of a sequential partitioning of the predictor space so as to obtain internally homogeneous groups and externally heterogeneous groups of cases with respect to a categorical variable.

Decision Rule: The result of an induction procedure providing the final assignment of a response class/value to a new object for which only the predictor measurements are known. Such a rule can be drawn in the form of a decision tree.

Ensemble: A combination, typically a weighted or unweighted aggregation, of single induction estimators able to improve the overall accuracy of any single induction method.

Exploratory Tree: An oriented tree graph formed by internal nodes and terminal nodes, the former allowing the description of the conditional interaction paths between the response variable and the predictors, whereas the latter are labeled by a response class/value.

FAST (Fast Algorithm for Splitting Trees): A splitting procedure to grow a binary tree using a suitable mathematical property of the impurity proportional reduction measure to find the optimal split at each node without necessarily trying out all candidate splits.

Partitioning Tree Algorithm: A recursive algorithm to form disjoint and exhaustive subgroups of objects from a given group in order to build up a tree structure.

Production Rule: A tree path characterized by a sequence of predictor interactions yielding a specific label class/value of the response variable.

Pruning: A top-down or bottom-up selective algorithm to reduce the dimensionality of a tree structure in terms of the number of its terminal nodes.

Random Forest: An ensemble of unpruned trees obtained by introducing two bootstrap resampling schemas, one on the objects and another on the predictors, such that an out-of-bag sample provides the estimation of the test-set error, and suitable measures of predictor importance are derived for the final interpretation.
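The Bagging entry above can be sketched for classification as follows. The `learn` and `predict` functions are caller-supplied assumptions standing in for any base induction method; each bootstrap replicate draws units with replacement, each unit equally likely, and the final rule is the modal class of the individual predictions:

```python
import random
from collections import Counter

def bagging_predict(train, learn, predict, x, b=25):
    """Bagging sketch for classification: fit the base learner on b
    bootstrap replicates of the training sample and return the modal
    class among the b predictions for the new case x."""
    votes = []
    for _ in range(b):
        # Bootstrap replicate: same size as train, drawn with replacement.
        boot = [random.choice(train) for _ in train]
        votes.append(predict(learn(boot), x))
    return Counter(votes).most_common(1)[0][0]
```

For regression the aggregation step would average the b predictions instead of taking the modal class.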
The National Academy of Sciences convened in 1995 for a conference on massive data sets. The presentation on health care noted that "massive applies in several dimensions . . . the data themselves are massive, both in terms of the number of observations and also in terms of the variables . . . there are tens of thousands of indicator variables coded for each patient" (Goodall, 1995, paragraph 18). We multiply this by the number of patients in the United States, which is hundreds of millions.

Diabetic registries have existed for decades. Data-mining techniques have recently been applied to them in an attempt to predict diabetes development or high-risk cases, to find new ways to improve outcomes, and to detect provider outliers in quality of care or in billing services (Breault, 2001; He, Koesmarno, Van, & Huang, 2000; Hsu, Lee, Liu, & Ling, 2000; Kakarlapudi, Sawyer, & Staecker, 2003; Stepaniuk, 1999; Tafeit, Moller, Sudi, & Reibnegger, 2000).

Diabetes is a major health problem. The long history of diabetic registries makes it a realistic and valuable target for data mining.

BACKGROUND

In-depth examination of one such diabetic data warehouse developed a method of applying data-mining techniques to this type of database (Breault, Goodall, & Fos, 2002). There are unique data issues and analysis problems with medical transactional databases. The lessons learned will be applicable to any diabetic database and perhaps to broader medical databases.

Methods for translating a complex relational medical database with time series and sequencing information to a flat file suitable for data mining are challenging. We used the classification tree approach with a binary target variable. While many data-mining methods (neural networks, logistic regression, etc.) could be used, classification trees have been noted to be appealing to physicians because much of medical diagnosis training operates in a fashion similar to classification trees.

Three major challenges are reviewed here: a) understanding and converting the diabetic databases into a data-mining data table, b) the data mining, and c) utilizing results to assist clinicians and managers in improving the health of the population studied.

The Diabetic Database

The diabetic data warehouse we studied included 30,383 diabetic patients during a 42-month period, with hundreds of fields per patient.

Understanding the data requires awareness of its limitations. These data were obtained for purposes other than research. Clinicians will be aware that billing codes are not always precise, accurate, and comprehensive. However, the codes are widely used in outcomes modeling. Epidemiologists and clinicians will be aware that important predictors of diabetic outcomes are missing from the database, such as body mass index, family history of diabetes, time since the onset of diabetes, diet, and exercise habits. These variables were not electronically stored and would require going to the paper chart and patient interviews to obtain.

Developing the Data-Mining Data Table

The major challenge is transforming the data from the relational structure of the diabetic data warehouse with its multiple tables to a form suitable for data mining (Nadeau, Sullivan, Teorey, & Feldman, 2003). Data-mining algorithms are most often based on a single table, within which is a record for each individual, and the fields contain variable values specific to the individual. We call this the data-mining data table. The most portable format for the data-mining data table is a flat file, with one line for each individual record.

SQL statements on the data warehouse create the flat file output that the data-mining software then reads. The steps are as follows:

• Review each table of the relational database and select the fields to export.
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
TEAM LinG
Diabetic Data Warehouses
Determine the interactions between the tables in the had at least two HgbA1c tests and at least two office
relational database. visits, the criteria we used for minimal continuity in this
Define the layout of the data-mining data table. 42-month period.
Specify patient inclusion and exclusion criteria.
What is the time interval? What are the minimum Data-Mining Technique
and maximum number of records (e.g., clinic vis-
its or outcome measures) each patient must have We used the classification tree approach as standard-
to be included? What relevant fields can be miss- ized in the CART software by Salford Systems. As
ing and still include the individual in the data- detailed in Hand, Mannila, and Smyth (2001), the prin-
mining data table? ciple behind all tree models is to recursively partition
Extract data, including the stripping of patient the input variable space to maximize purity in the termi-
identifiers to protect human subjects. nal tree nodes. The partitioning split in any cell is done
Determine how to handle missing values (Duhamel, by searching each possible threshold for each variable
Nuttens, Devos, Picavet, & Beuscart, 2003) to find the threshold split that leads to the greatest
Perform sanity checks on the data-mining data improvement in the purity score of the resultant nodes.
table, for example, that the minimum and maxi- Hence, this is a monothetic process, which may be a
mum of each variable make clinical sense. limitation of this method in some circumstances.
In CARTs defaults, the Gini splitting criteria are
Handling time series medical data is challenging for used, although other methods are options. This could
data-mining software. One example in our study is the recursively continue to the point of perfect purity,
HgbA1c, the key measure of glycemic control. This is which would sometimes mean only one patient in a
closely related to clinical outcomes and complication terminal node. But overfitting of the data does not help
rates in diabetes. Health care costs increase markedly in accurately classifying another data set. Therefore, we
with each 1% increase in baseline HgbA1c; patients with divide the data randomly into learning and test sets. The
an HgbA1c of 10% versus 6% had a 36% increase in 3- number of trees generated is halted or pruned back by
year medical costs (Blonde, 2001). How should this how accurately the classification tree created from the
time series variable be transformed from the relational learning set can predict classification in the test set.
database to a vector (column) in the data-mining data Cross-validation is another option for doing this, though
table? A given diabetic patient may have many of these in the CART softwares defaults this is limited to n =
HgbA1c results. We could pick the last one, the first, a 3000. This could be changed higher to use our full data
median or mean value. Because the trend over time for set, but some CART consultants note, The n-fold cross-
this variable is important, we could choose the slope of validation technique is designed to get the most out of
its regression line over time. However, a linear function datasets that are too small to accommodate a hold-out or
may be a good representation for some patients, but a test sample. Once you have 3,000 records or more, we
very bad one for others that may be better represented by recommend that a separate test set be used (Timberlake-
an upside down U curve. This difficulty is a problem for Consultants, 2001). The original CART creators recom-
most repeated laboratory tests. Some information will mended dividing the data into test and learning samples
be lost in the creation of the data-mining data table. whenever there were more than 1,000 cases, with cross-
We used the average HgbA1c for a given patient and validation being preferable in smaller data sets (Breiman,
excluded patients who did not have at least two HgbA1c Friedman, Olshen, & Stone, 1984).
results in the data warehouse. We repartitioned this The 10 predictor variables were used with the binary
average HgbA1c into a binary variable based on a mean- target variable of the HgbA1c average (cut-point of
ingful clinical cut-point of 9.5%. Experts agree that an 9.5%) in an attempt to find interesting patterns that may
HgbA1c >9.5% is a bad outcome, or a medical quality have management or clinical importance and are not
error, no matter what the circumstances (American already known.
Medical Association, Joint Commission on Accredita- The variables that are most important to classifica-
tion of Healthcare Organizations, & National Commit- tion in the optimal CART tree were age (100, where the
tee for Quality Assurance, 2001). most important variable is arbitrarily given a relative
Our final data-mining data table had 15,902 patients score of 100), number of office visits (51), comorbidity
(rows). Mean HgbA1c > 9.5% was the target variable, index (CMI) (44), cardiovascular disease (16), choles-
and the 10 predictors were age, sex, emergency depart- terol problems (17), number of emergency room visits
ment visits, office visits, comorbidity index, (7), and hypertension (0.6).
dyslipidemia, hypertension, cardiovascular disease, re- CART can be used for multiple purposes. Here we
tinopathy, and end stage renal disease. All these patients want to find clusters of deviance from glycemic control.
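The HgbA1c handling described above — averaging each patient's repeated results, requiring at least two tests, and binarizing at the 9.5% cut-point — can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the patient IDs and values are invented.

```python
# Illustrative sketch of building the data-mining data table's target
# variable from per-visit HgbA1c results. Patient IDs and values are toy data.
from statistics import mean

def build_flat_table(hgba1c_results, cutpoint=9.5, min_results=2):
    """hgba1c_results: dict patient_id -> list of HgbA1c values (%).
    Returns one row per patient: (patient_id, mean_hgba1c, bad_control)."""
    rows = []
    for pid, values in hgba1c_results.items():
        if len(values) < min_results:   # inclusion criterion: at least 2 tests
            continue
        avg = mean(values)              # collapse the time series to its mean
        rows.append((pid, avg, avg > cutpoint))   # binary target at 9.5%
    return rows

results = {"p1": [10.2, 9.6], "p2": [6.1, 5.9, 6.4], "p3": [11.0]}
table = build_flat_table(results)
# p3 is excluded (only one result); p1 is flagged as bad glycemic control
```

As the article notes, collapsing the series to a mean discards trend information; a slope or other summary could be substituted in the same place.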
With no analysis, the rate of bad glycemic control in the learning sample is 13.2%. We want to find nodes that have higher rates of bad glycemic control. The first split in node 1 of the tree is based on an age of 65.6 (CART sends all cases less than or equal to the cut-point to the left and greater than the cut-point to the right). In node 2 (<= 65.6 years of age), we have bad glycemic control in 19.4%.

If we look at the tree to find terminal nodes (TN) where the percentage of all the patients in those nodes having a bad HgbA1c is greater than the 19.4% true of those <= 65.6 years of age, we identify 4 of the 10 TNs in the learning sample.

The purer the nodes we limit ourselves to, the smaller the percentage of the overall population with bad glycemic control we capture. With no analysis in the learning set, we can capture all 1,052 patients with bad glycemic control but must target the entire group of 7,953. When we limit ourselves to the purer node 2, we capture 74% of those with bad glycemic control by targeting only 50% of the population. If we limit ourselves to TN1, we capture 49% of those with bad glycemic control by targeting only 27% of the population. If we use more complicated criteria by combining the 4 TNs with the worst glycemic control, we capture 54% of those with bad glycemic control by targeting only 30% of the population.

The classification errors in the learning and test samples are substantial, as a quarter of the bad glycemic control patients are missed in the CART analysis. CART is doing a good job with the 10 predictor variables it is given, but more accurate prediction requires additional variables not in our database.

Adjustment to defaults in CART can give better results, defined as capturing a larger percentage of those with bad glycemic control within a smaller percentage of the population. However, the complexity of the formulas needed to identify the population makes them difficult for managers to use. Not only must managers identify the persons, but they must get enough of a feel for the population characteristics to know what interventions are likely to be helpful. This is more intuitive for those who are younger than 55 or younger than 65 than it is for those who satisfy 0.451*(AGE) + 0.893*(CMI) <= 32.5576.

Results

From this CART analysis, the most important variable associated with a bad HgbA1c score is age less than 65.6. Those less than 65.6 years old are almost three times as likely to have bad glycemic control as those who are older. The odds ratio that someone is less than 65.6 years old if they have a bad HgbA1c (average reading > 9.5%) is 3.18 (95% CI: 2.87, 3.53) (Fos & Fine, 2000). Similarly, the odds ratio that someone is younger than 55.2 years rather than older than 65.6 if they have bad glycemic control is 4.11 (95% CI: 3.60, 4.69). This is surprising information to most clinicians. Similar findings were recently reported with vascular endpoints in diabetic patients (Miyaki, Takei, Watanabe, Nakashina, & Omae, 2002).

The case may be that those with the worst glycemic control die young and never make it to the older group. Although this is an interesting theoretical explanation, these numbers nevertheless represent real patients who need help with their glycemic control now.

If we want to target diabetics with bad HgbA1c values, the odds of finding them are 3.2 times as high in diabetic patients younger than 65.6 years as in those who are older, and 4.1 times as high in those who are younger than 55 as in those over 65. This information is clinically important because the younger group has so many more years of life left in which to develop diabetic complications from bad glycemic control. It is especially helpful because it tells us which population to target interventions at even before we have the HgbA1c values to show us. Health maintenance organizations and public health workers may want to explore what educational interventions can be successfully directed to younger diabetics (younger than 65, especially younger than 55), who are much more likely to have bad glycemic control than the geriatric patients.

FUTURE TRENDS

Areas that need further work to fully utilize data mining in health care include time-series issues, sequencing information, data-squashing technologies, and a tight integration of domain expertise and data-mining skills.

We have already discussed time-series issues. This has been investigated but needs further exploration in health care data mining (Bellazzi, Larizza, Magni, Montani, & Stefanelli, 2000; Goodall, 1999; Tsien, 2000).

The sequence of various events may hold meaning important to a study. For example, a patient may have better glycemic control, manifested in improved HgbA1c values, especially when the patient had an office visit with a physician within a month of a previous HgbA1c. Perhaps this is meaningful information that implies that the proper sequence of physician visits relative to HgbA1c measurements is an important predictor of good outcomes. All this information is located in the relational database, but we must ferret it out by having in advance an idea that a sequence of this sort may be important and then searching for such associations. There may be many such sequences involving interactions between hospital, clinic, pharmacy, and laboratory variables. Regrettably, no amount of data mining will be able to extract sequence associations where we have not thought to extract the prerequisite variables from the relational database into the data-mining data table. In the ideal data-mining scenario, software could interface directly with the relational database and extract all possibly meaningful sequences for us to review, and domain experts would then sort through the list. This issue has begun to get attention (Džeroski & Lavrač, 2001) and will need to be addressed in future health care data mining.

It has been shown that using a data-squashing algorithm to reduce a massive data set is more powerful and accurate than using a random sample (DuMouchel, Volinsky, Johnson, Cortes, & Pregibon, 1999). Squashing is a form of lossy compression that attempts to preserve statistical information (DuMouchel, 2001). These newer data-squashing techniques may be a better approach than random sampling in massive data sets. These techniques also protect human subjects' privacy.

Transactional health care data mining, exemplified in the diabetic data warehouses discussed previously, involves a number of tricky data transformations that require close collaboration between domain experts and data miners (Breault & Goodall, 2002). Even with ideal collaboration or overlapping expertise, we need to develop new ways to extract variables from relational databases containing time-series and sequencing information. Part of the answer lies in collaborative groups that can have additional insights. Part of the answer lies in the further development of data-mining tools that act directly on a relational database without transformation to explicit data arrays. In some circumstances, it may be useful to produce several data-mining data tables, data mine them independently, and then combine the results. This may be particularly useful when different granularities of attributes are glossed over by the use of a single data-mining data table.

CONCLUSION

Data mining is valuable in discovering novel associations in diabetic databases that can prove useful to clinicians and administrators. This may also be the case for many other health care problems.

REFERENCES

American Medical Association, Joint Commission on Accreditation of Healthcare Organizations, & National Committee for Quality Assurance. (2001). Coordinated performance measurement for the management of adult diabetes. Retrieved March 24, 2005, from http://www.ama-assn.org/ama/upload/mm/370/nr.pdf

Bellazzi, R., Larizza, C., Magni, P., Montani, S., & Stefanelli, M. (2000). Intelligent analysis of clinical time series: An application in the diabetes mellitus domain. Artificial Intelligence in Medicine, 20(1), 37-57.

Blonde, L. (2001). Epidemiology, costs, consequences, and pathophysiology of type 2 diabetes: An American epidemic. Ochsner Journal, 3(3), 126-131.

Breault, J. L. (2001). Data mining diabetic databases: Are rough sets a useful addition? In E. Wegman, A. Braverman, A. Goodman, & P. Smyth (Eds.), Computing science and statistics, 33 (pp. 597-606). Fairfax Station, VA: Interface Foundation of North America.

Breault, J. L., & Goodall, C. R. (2002, January). Mathematical challenges of variable transformations in data mining diabetic data warehouses. Paper presented at the Mathematical Challenges in Scientific Data Mining Conference, Los Angeles, CA.

Breault, J. L., Goodall, C. R., & Fos, P. J. (2002). Data mining a diabetic data warehouse. Artificial Intelligence in Medicine, 26(1-2), 37-54.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International.

Duhamel, A., Nuttens, M. C., Devos, P., Picavet, M., & Beuscart, R. (2003). A preprocessing method for improving data mining techniques: Application to a large medical diabetes database. Studies in Health Technology and Informatics, 95, 269-274.

DuMouchel, W. (2001). Data squashing: Constructing summary data sets. In E. Wegman, A. Braverman, A. Goodman, & P. Smyth (Eds.), Computing science and statistics. Fairfax Station, VA: Interface Foundation of North America.

DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., & Pregibon, D. (1999, August). Squashing flat files flatter. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA.

Džeroski, S., & Lavrač, N. (2001). Relational data mining. New York: Springer.

Fos, P. J., & Fine, D. J. (2000). Designing health care for populations: Applied epidemiology in health care administration. San Francisco: Jossey-Bass.

Goodall, C. (1995). Massive data sets in healthcare. In (Chair: Jon R. Kettenring), Massive data sets. Paper presented at the meeting of the Committee on Applied and Theoretical Statistics at the National Academy of Sciences, National Research Council, Washington, DC. Retrieved March 29, 2005, from http://bob.nap.edu/html/massdata/media/cgoodall-t.html

Goodall, C. R. (1999). Data mining of massive datasets in healthcare. Journal of Computational and Graphical Statistics, 8(3), 620-634.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

He, H., Koesmarno, Van, & Huang. (2000). Data mining in disease management: A diabetes case study. In R. Mizoguchi & J. K. Slaney (Eds.), Proceedings of the Sixth Pacific Rim International Conference on Artificial Intelligence: Topics in artificial intelligence (p. 799). New York: Springer.

Hsu, W. (2000, August). Exploration mining in diabetic patients' databases: Findings and conclusions. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA.

Kakarlapudi, V., Sawyer, R., & Staecker, H. (2003). The effect of diabetes on sensorineural hearing loss. Otology & Neurotology, 24(3), 382-386.

Miyaki, K., et al. (2002). Novel statistical classification model of type 2 diabetes mellitus patients for tailor-made prevention using data mining algorithm. Journal of Epidemiology, 12(3), 243-248.

Nadeau, T. P., et al. (2003). Applying database technology to clinical and basic research bioinformatics projects. Journal of Integrative Neuroscience, 2(2), 201-217.

Stepaniuk, J. (1999, June). Rough set data mining of diabetes data. In Z. Ras & A. Skowron (Eds.), Proceedings of the 11th International Symposium on Foundations of Intelligent Systems (pp. 457-465). New York: Springer.

Tafeit, E., Moller, Sudi, & Reibnegger. (2000). ROC and CART analysis of subcutaneous adipose tissue topography (SAT-Top) in type-2 diabetic women and healthy females. American Journal of Human Biology, 12, 388-394.

Timberlake Consultants. (2001). CART frequently asked questions. Retrieved October 21, 2001, from http://www.timberlake.co.uk/software/cart/cartfaq1.htm#q23

Tsien, C. L. (2000). Event discovery in medical time-series data. Proceedings of the AMIA Symposium, 858-862.

KEY TERMS

Co-Morbidity Index: A composite variable that gives a measure of how many other medical problems someone has in addition to the one being studied.

Data-Mining Data Table: The flat file constructed from the relational database that is the actual table used by the data-mining software.

Glycemic Control: Indicates how well controlled a diabetic patient's blood sugars are. Usually measured by HgbA1c.

HgbA1c: Blood test that measures the percent of receptors on a red blood cell that are saturated with glucose. This is translated into a measure of how sugars have averaged over the last few months. Normal is less than 6, depending on the laboratory standards.

Medical Transactional Database: The database created from the billing and required reporting transactions of a medical practice. Clinical experience is sometimes required to understand gaps and inadequacies in the collected data.

Monothetic Process: In a classification tree, when data at each node are split on just one variable rather than several variables.

Recursive Partitioning: The method used to divide data at each node of a classification tree. At the top node, every variable is examined at every possible value to determine which variable split will produce the maximum and minimum amounts of the target variable in the daughter nodes. This is done recursively for each additional node.
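The recursive partitioning search described above can be sketched as a single split step: examine every variable at every observed value and keep the split that most reduces Gini impurity (CART's default criterion). This is a generic illustration with invented data and variable names, not the CART implementation.

```python
# Minimal sketch of one monothetic split search, as in recursive partitioning.
# The toy data and variable names below are illustrative only.

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, target):
    """rows: list of dicts of numeric predictors; target: parallel 0/1 labels.
    Tries every variable at every observed value; returns the (variable,
    threshold, score) that most reduces the weighted impurity."""
    best = (None, None, gini(target))
    for var in rows[0]:
        for t in sorted({r[var] for r in rows}):
            left = [y for r, y in zip(rows, target) if r[var] <= t]
            right = [y for r, y in zip(rows, target) if r[var] > t]
            if not left or not right:
                continue   # degenerate split: one side empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(target)
            if score < best[2]:
                best = (var, t, score)
    return best

rows = [{"age": a, "visits": v} for a, v in
        [(42, 9), (50, 7), (58, 8), (68, 2), (72, 3), (80, 1)]]
target = [1, 1, 1, 0, 0, 0]          # 1 = bad glycemic control (toy labels)
var, threshold, _ = best_split(rows, target)
```

A full tree applies `best_split` recursively to each daughter node until a stopping or pruning rule halts growth, as described in the Data-Mining Technique section.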
Discovering an Effective Measure in Data Mining
INTRODUCTION

One of the most important issues in data mining is to discover an implicit relationship between words in a large corpus and labels in a large database. The relationship between words and labels often is expressed as a function of distance measures. An effective measure would be useful not only for achieving high precision in data mining, but also for saving time in data-mining operations. In previous research, many measures for calculating the one-to-many relationship have been proposed, such as the complementary similarity measure, the mutual information, and the Phi coefficient. Some research showed that the complementary similarity measure is the most effective. In this article, the author reviews previous research related to measures of one-to-many relationships and proposes a new idea for obtaining an effective one, based on a heuristic approach.

BACKGROUND

Generally, the knowledge discovery in databases (KDD) process consists of six stages: data selection, cleaning, enrichment, coding, data mining, and reporting (Adriaans & Zantinge, 1996). Needless to say, data mining is the most important part of KDD. There are various techniques, such as statistical techniques, association rules, and query tools in a database, for different purposes in data mining (Agrawal, Mannila, Srikant, Toivonen & Verkamo, 1996; Berland & Charniak, 1999; Caraballo, 1999; Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996; Han & Kamber, 2001).

When two words or labels in a large database have some implicit relationship with each other, one purpose is to find the two related words or labels effectively. In order to find relationships between words or labels in a large database, the author found at least six distance measures after reviewing previously conducted research.

The first one is the mutual information proposed by Church and Hanks (1990). The second one is the confidence proposed by Agrawal and Srikant (1995). The third one is the complementary similarity measure (CSM) presented by Hagita and Sawaki (1995). The fourth one is the dice coefficient. The fifth one is the Phi coefficient. The last two are both mentioned by Manning and Schütze (1999). The sixth one is the proposal measure (PM) suggested by Ishiduka, Yamamoto, and Umemura (2003). It is one of several new measures developed by them in their paper.

In order to evaluate these distance measures, formulas are required. Yamamoto and Umemura (2002) analyzed these measures and expressed them in four parameters, a, b, c, and d (Table 1). Suppose that there are two words or labels, x and y, and that they are associated together in a large database. The meanings of these parameters in the formulas are as follows:

a. The number of documents/records that have both x and y.
b. The number of documents/records that have x but not y.
c. The number of documents/records that do not have x but do have y.
d. The number of documents/records that have neither x nor y.
n. The total of the parameters a, b, c, and d (n = a + b + c + d).

Umemura (2002) pointed out the following in his paper: "Occurrence patterns of words in documents can be expressed as binary. When two vectors are similar, the two words corresponding to the vectors may have some implicit relationship with each other." Yamamoto and Umemura (2002) completed their experiment to test the validity of these indexes under Umemura's concept. The result of their experiment on distance measures without a noisy pattern can be seen in Figure 1 (Yamamoto & Umemura, 2002).

The experiment by Yamamoto and Umemura (2002) showed that the most effective measure is the CSM. They indicated in their paper: "All graphs showed that the most effective measure is the complementary similarity measure, and the next is the confidence and the third is asymmetrical average mutual information. And the least is the average mutual information" (Yamamoto & Umemura, 2002). They also completed their experiments with a noisy pattern and found the same result (Yamamoto & Umemura, 2002).
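The four parameters above can be computed directly from binary occurrence vectors of the two words. The sketch below is illustrative (the vectors are toy data, not from the article); the confidence and Phi formulas follow their usual textbook definitions and should be checked against the papers cited above.

```python
# Sketch: contingency parameters a, b, c, d for two words x and y over
# binary occurrence vectors (1 = the document contains the word).
import math

def contingency(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)          # both
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)      # x only
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)      # y only
    d = sum(1 for xi, yi in zip(x, y) if not xi and not yi)  # neither
    return a, b, c, d, a + b + c + d                          # n = a+b+c+d

def confidence(a, b):
    """Confidence of the rule x -> y (textbook form)."""
    return a / (a + b)

def phi(a, b, c, d):
    """Phi coefficient (textbook form)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

x = [1, 1, 1, 0, 0, 1, 0, 0]   # toy occurrence vectors
y = [1, 1, 0, 0, 1, 1, 0, 0]
a, b, c, d, n = contingency(x, y)
```

Any of the six measures discussed in the article can then be expressed as a function of these same four counts.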
measure, the author selected actual data of place names, such as the name of a prefecture and the name of a city in Japan, from the articles of a nationally circulated newspaper, the Yomiuri. The reasons for choosing place names are as follows: first, there are one-to-many relationships between the name of a prefecture and the names of cities; second, the one-to-many relationship can be checked easily from maps and telephone directories. Generally speaking, the first name, such as the name of a prefecture, comprises other names, such as the name of a city. For instance, Fukuoka City is geographically located in Fukuoka Prefecture, and Kitakyushu City also is included in Fukuoka Prefecture, so there are one-to-many relationships between the name of the prefecture and the names of the cities.

The distance measure would be calculated with a large database in the experiments. The experiments were executed as follows:

Step 1. Establish the database of the newspaper.
Step 2. Choose the prefecture names and city names from the database mentioned in Step 1.
Step 3. Count the parameters a, b, c, and d from the newspaper articles.
Step 4. Calculate the distance measure adopted.
Step 5. Sort the results calculated in Step 4 in descending order of the distance measure.
Step 6. List the top 2,000 from the result of Step 5.
Step 7. Judge whether each one-to-many relationship is correct or not.
Step 8. List the top 1,000 as output data and count the number of correct relationships.
Step 9. Finally, an effective measure will be found from the resulting correct number.

To uncover an effective measure, two methods should be considered. The first is to test various combinations of each variable in the distance measure and to find the best combination, depending upon the result. The second is to assume that the function is a stable one and that only part of the function can be varied.

The first experiment is with the PM. The total number of combinations of five variables is 3,125. The author calculated all combinations, except for the case when the denominator of the PM becomes zero. The result of the top 20 in the PM experiment, using a year's worth of articles from the Yomiuri in 1991, is calculated as follows. In Table 2, the No. 1 function, a11c1, has the highest correct number, which means S(F, T) = (a^1 * 1) / (c^1 + 1) = a / (1 + c). The No. 11 function, 1a1c0, means S(F, T) = (1 * a^1) / (c^1 + 0) = a / c. The rest of the functions in Table 2 can be read as mentioned previously.

This result appears to be satisfactory. But to prove whether it is satisfactory or not, another experiment should be done. The author adopted the Phi coefficient and calculated its correct number. To compare with the result of the PM experiment, the author iterated the exponential index of the denominator in the Phi coefficient from 0 to 2, in 0.01 steps, based upon the idea of fractal dimension in complexity theory, instead of fixing the exponential index at 0.5. The result of the top 20 in this experiment, using a year's worth of articles from the Yomiuri in 1991, just like the first experiment, can be seen in Table 3.

Compared with the result of the PM experiment mentioned previously, it is obvious that the number of correct relationships is less than that of the PM experiment; therefore, it is necessary to uncover a new, more effective measure. From the previous research done by Yamamoto and Umemura (2002), the author found that an effective measure is the CSM.

The author completed the third experiment using the CSM. The experiment began by iterating the exponential index of the denominator in the CSM from 0 to 1, in 0.01 steps, just like the second experiment. Table 4 shows the
result of the top 20 in this experiment, using a year's worth of articles from the Yomiuri from 1991-1997. (In Table 4, C.N. means the correct number, and E.I. means the exponential index.)

It is obvious from these results that the CSM is more effective than the PM and the Phi coefficient. The complete relationship between the exponential index of the denominator in the CSM and the correct number can be seen in Figure 2.

The most effective exponential index of the denominator in the CSM is from 0.73 to 0.85, not 0.5 as many researchers believe. It would be hard to get the best result with the usual method using the CSM.

Determine the Most Effective Exponential Index of the Denominator in the CSM

To discover the most effective exponential index of the denominator in the CSM, a calculation of the relationship between the exponential index and the total number of documents, n, was carried out, but no relationship was found. In fact, it is hard to believe that the exponential index would vary with differences in the number of documents.
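The sweep over the exponential index described above can be sketched as follows. Note that the CSM form used here, (a*d - b*c) / ((a + c)*(b + d))^e, is an assumption based on the commonly cited Hagita-Sawaki definition, where the standard index e = 0.5 recovers the usual measure; the contingency counts are invented, not taken from the article's tables.

```python
# Sketch: iterate the exponential index e of the CSM denominator from 0 to 1
# in 0.01 steps, as in the article's third experiment. The CSM form below is
# an assumed (commonly cited) definition; a, b, c, d are toy counts.

def csm(a, b, c, d, e):
    return (a * d - b * c) / ((a + c) * (b + d)) ** e

a, b, c, d = 120, 40, 15, 825                      # toy contingency counts
steps = [round(0.01 * i, 2) for i in range(101)]   # 0.00, 0.01, ..., 1.00
scores = {e: csm(a, b, c, d, e) for e in steps}
```

In the article, each index in the sweep is scored by how many correct one-to-many relationships it retrieves from the newspaper data; this sketch only shows the iteration itself.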
Figure 2. Relationships between the exponential indexes of the denominator in the CSM and the correct number

[Line chart: correct number (0-900) plotted against the exponential index of the denominator in the CSM (0-1), with one line per year from 1991 to 1997.]
The details of the four parameters of a, b, c, and d in Table 4 are listed in Table 5.

The author adopted the average of the cumulative exponential index to find the most effective exponential index of the denominator in the CSM. Based upon the results in Table 4, calculations were made for the average of the cumulative exponential index of the top 20, and those results are presented in Figure 3.

From the results in Figure 3, it is easy to understand that the exponential index converges at a constant value of about 0.78. Therefore, the exponential index can be fixed at a certain value, and it will not vary with the size of the documents.

In this article, the author sets the exponential index at 0.78 to forecast the correct number. The gap between the calculation result forecast by the revised method and the maximum result of the third experiment is illustrated in Figure 4.

FUTURE TRENDS

Many indexes have been developed for the discovery of an effective measure to determine an implicit relationship between words in a large corpus or labels in a large database. Ishiduka, Yamamoto, and Umemura (2003) referred to some of them in their research paper. Almost all of these approaches were developed from a variety of mathematical and statistical techniques, the conception of neural networks, and association rules.

For many things in the world, part of the economic phenomena obtain a normal distribution referred to in statistics, but others do not, so the author developed the CSM measure from the idea of fractal dimension in this article. Conventional approaches may be inherently inaccurate, because usually they are based upon linear mathematics. The events of a concurrence pattern of correlated pair words may be explained better by nonlinear mathematics. Typical tools of nonlinear mathematics are complex theories, such as chaos theory, cellular automata, percolation models, and fractal theory. It is not hard to predict that many more new measures will be developed in the near future based upon complexity theory.

CONCLUSION

Based upon previous research on distance measures, the author discovered an effective distance measure for one-to-many relationships with a heuristic approach. Three kinds of experiments were conducted, and it was confirmed that an effective measure is the CSM. In addition, it was discovered that the most effective exponential of the denominator in the CSM is 0.78, not 0.50, as many researchers believe.

A great deal of work still needs to be carried out, and one item is the meaning of the 0.78. The meaning of the most effective exponential of the denominator, 0.78, in the CSM should be explained and proved mathematically, although its validity has been evaluated in this article.
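The cumulative averaging behind Figure 3 can be sketched in code. Note an assumption: the sketch feeds in the yearly most-effective exponents from Table 5 for illustration; the article's actual top-20 cumulative series is not reproduced here.

```python
def cumulative_averages(values):
    """Running mean: the k-th entry averages the first k values,
    mirroring the cumulative exponential index plotted in Figure 3."""
    averages = []
    total = 0.0
    for k, v in enumerate(values, start=1):
        total += v
        averages.append(total / k)
    return averages

# Yearly most-effective exponents taken from Table 5 (illustrative input).
yearly = [0.73, 0.75, 0.79, 0.81, 0.85, 0.77]
series = cumulative_averages(yearly)
```

On this input the running mean settles near 0.78, the value at which the article fixes the exponent.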
Discovering an Effective Measure in Data Mining
Table 5. Relationship between the maximum correct number and their parameters

Year  Exponential Index  Correct Number      a       b      c      d       n
1991        0.73              850        2,284  46,242  3,930  1,329  53,785
1992        0.75              923          453  57,636  1,556     27  59,672
1993        0.79              883          332  51,649  1,321     27  53,329
1994        0.81              820          365  65,290  1,435     36  67,126
1995        0.85              854        1,500  67,914  8,042    190  77,646
1996        0.77              832          636  56,529  2,237  2,873  62,275
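The measure these parameters feed can be sketched in code. The article does not restate the CSM formula, so the sketch below assumes the complementary similarity measure of Hagita and Sawaki, (ad - bc) / ((a + c)(b + d))^E, with the adjustable exponent E applied to the denominator as discussed above; the cell layout of a, b, c, and d is the usual 2x2 co-occurrence one and is likewise an assumption.

```python
def csm(a, b, c, d, exponent=0.78):
    """Complementary similarity measure with an adjustable denominator
    exponent.  a, b, c, d are assumed to be the 2x2 co-occurrence counts
    of a word pair (a = both present, d = both absent).  exponent=0.5
    gives the conventional form; the article reports 0.78 as the most
    effective value."""
    denominator = ((a + c) * (b + d)) ** exponent
    return (a * d - b * c) / denominator

# Illustrative comparison of the conventional and revised exponents
# using the 1991 parameters from Table 5.
conventional = csm(2284, 46242, 3930, 1329, exponent=0.50)
revised = csm(2284, 46242, 3930, 1329, exponent=0.78)
```

Because the denominator base exceeds one, raising the exponent from 0.50 to 0.78 shrinks the magnitude of the score; the article's experiments indicate this revised scaling ranks one-to-many pairs more accurately.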
Figure 3. Relationships between the exponential index and the average of the number of the cumulative exponential index

[Plot: the exponential index (y-axis, 0.76 to 0.795) against the number of the cumulative exponential index (x-axis, 1 to 20)]
Figure 4. Gaps between the maximum correct number and the correct number forecasted by the revised method

[Plot: two series, "maximum correct number" and "correct number forecasted by the revised method" (y-axis, 0 to 1,000), for the years 1991 through 1997]
Fayyad, U.M. et al. (1996). Advances in knowledge discovery and data mining. AAAI Press/MIT Press.

Fayyad, U.M., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Glymour, C. et al. (1997). Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1(1), 11-28.

Hagita, N., & Sawaki, M. (1995). Robust recognition of degraded machine-printed characters using complementary similarity measure and error-correction learning. Proceedings of SPIE, the International Society for Optical Engineering.

Han, J., & Kamber, M. (2001). Data mining. Morgan Kaufmann Publishers.

Ishiduka, T., Yamamoto, E., & Umemura, K. (2003). Evaluation of a function presumes the one-to-many relationship [unpublished research paper] [Japanese edition].

Manning, C.D., & Schutze, H. (1999). Foundations of statistical natural language processing. MIT Press.

Umemura, K. (2002). Selecting the most highly correlated pairs within a large vocabulary. Proceedings of the COLING Workshop SemaNet02, Building and Using Semantic Networks.

Yamamoto, E., & Umemura, K. (2002). A similarity measure for estimation of one-to-many relationship in corpus [Japanese edition]. Journal of Natural Language Processing, 9, 45-75.

KEY TERMS

Complementary Similarity Measurement: An index developed experientially to recognize a poorly printed character by measuring the resemblance to the correct pattern of the character, expressed as a vector. In this article, the author uses it as a diversion index to identify the one-to-many relationship in the concurrence patterns of words in a large corpus or labels in a large database.

Confidence: An asymmetric index that shows the percentage of records in which A occurred within the group of records in which the other two, X and Y, actually occurred, under the association rule X, Y => A.

Correct Number: For example, if the city name is included in the prefecture name geographically, the author calls it correct. So, in this index, the correct number indicates the total number of correct one-to-many relationships calculated on the basis of the distance measures.

Distance Measure: One of the calculation techniques to discover the relationship between two implicit words in a large corpus or labels in a large database from the viewpoint of similarity.
Mutual Information: Shows the amount of information that one random variable x contains about another, y. In other words, it compares the probability of observing x and y together with the probabilities of observing x and y independently.

Phi Coefficient: One metric for corpus similarity measured upon the chi square test. It is an index calculated from the frequencies of the four parameters of a, b, c, and d in documents, based upon the chi square test with the frequencies expected for independence.

Proposal Measure: An index to measure the concurrence frequency of the correlated pair words in documents, based upon the frequency with which two implicit words appear. It will have a high value when the two implicit words in a large corpus or labels in a large database occur with each other frequently.
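The indexes defined above can be sketched from a 2x2 contingency table of co-occurrence counts. The sketch uses the textbook formulas for confidence, the phi coefficient, and (pointwise) mutual information; the mapping of a, b, c, and d to table cells is the conventional one and is an assumption, since the article does not spell it out.

```python
import math

def confidence(n_xya, n_xy):
    """Confidence of the rule X, Y => A: the fraction of records
    containing X and Y that also contain A."""
    return n_xya / n_xy

def phi_coefficient(a, b, c, d):
    """Phi coefficient from a 2x2 contingency table, where a = both
    words present, b and c = only one present, d = both absent
    (assumed cell layout)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def pointwise_mutual_information(a, b, c, d):
    """Compares the probability of observing x and y together with the
    probabilities of observing x and y independently; zero when the
    two words are statistically independent."""
    n = a + b + c + d
    p_xy = a / n
    p_x = (a + b) / n
    p_y = (a + c) / n
    return math.log2(p_xy / (p_x * p_y))
```

For instance, a perfectly diagonal table gives a phi coefficient of 1, and a uniform table (independence) gives a mutual information of 0.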
[Figure: Taxonomy of XML mining. XML Mining divides into XML Structure Mining, comprising Intra-Structure Mining and Inter-Structure Mining, and XML Content Mining, comprising Content Analysis and Structural Clarification.]
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Discovering Knowledge from XML Documents
The clustering task of data mining identifies similarities among various XML documents. A clustering algorithm takes a collection of schemas to group them together on the basis of self-similarity. These similarities are then used to generate a new schema. As a generalization, the new schema is a superclass to the training set of schemas. This generated set of clustered schemas can now be used in classifying new schemas. The superclass schema also can be used in the integration of heterogeneous XML documents for each application domain. This allows users to find, collect, filter, and manage information sources more effectively on the Internet.

Association data mining describes relationships between tags that tend to occur together in XML documents that can be useful in the future. By transforming the tree structure of XML into a pseudo-transaction, it becomes possible to generate rules of the form "if an XML document contains a <craft> tag, then 80% of the time it also will contain a <licence> tag." Such a rule then may be applied in determining the appropriate interpretation for homographic tags.

Interstructure Mining

Concerned with the structure between XML documents. Knowledge is discovered about the relationship between subjects, organizations, and nodes on the Web in this type of mining. The following mining tasks can be applied.

Clustering schemas involves identifying similar schemas. The clusters are used in defining hierarchies of schemas. The schema hierarchy overlaps instances on the Web, thus discovering authorities and hubs (Garofalakis et al., 1999). Creators of schemas are identified as authorities, and creators of instances are hubs. Additional mining techniques are required to identify all instances of schemas present on the Web. The following application of classification can identify the most likely places to mine for instances. Classification is applied with namespaces and URIs (Uniform Resource Identifiers). Having previously associated a set of schemas with a particular namespace or URI, this information is used to classify new XML documents originating from these places.

Content is the text between each start and end tag in XML documents. Mining for XML content is essentially mining for values (an instance of a relation), including content analysis and structural clarification.

Content Analysis

Concerned with analysing texts within XML documents. The following mining tasks can be applied to contents.

Classification is performed on XML content, labeling new XML content as belonging to a predefined class. To reduce the number of comparisons, pre-existing schemas classify the new document's schema. Then, only the instance classifications of the matching schemas need to be considered in classifying a new document.

Clustering on XML content identifies the potential for new classifications. Again, consideration of schemas leads to quicker clustering; similar schemas are likely to have a number of value sets. For example, all schemas concerning vehicles have a set of values representing cars, another set representing boats, and so forth. However, schemas that appear dissimilar may have similar content. Mining XML content inherits some problems faced in text mining and analysis. Synonymy and polysemy can cause difficulties, but the tags surrounding the content usually can help resolve ambiguities.

Structural Clarification

Concerned with distinguishing similarly structured documents based on contents. The following mining tasks can be performed.

Content provides support for alternate clustering of similar schemas. Two distinctly structured schemas may have document instances with identical content. Mining these avails new knowledge. Vice versa, schemas provide support for alternate clustering of content. Two XML documents with distinct content may be clustered together, given that their schemas are similar.

Content also may prove important in clustering schemas that appear different but have instances with similar content. Due to heterogeneity, the incidence of synonyms is increased. Are separate schemas actually describing the same thing, only with different terms? While thesauruses are vital, it is impossible for them to be exhaustive for the English language, let alone handle all languages. Conversely, schemas appearing similar may actually be completely different, given homographs. The similarity of the content does not distinguish the semantic intention of the tags. Mining, in this case, provides probabilities of a tag having a particular meaning or a relationship between meaning and a URI.

METHODS OF XML STRUCTURE MINING

Mining of structures from a well-formed or valid document is straightforward, since a valid document has a schema mechanism that defines the syntax and structure of the document. However, since the presence of a schema is not mandatory for a well-formed XML document, the
document may not always have an accompanying schema. To describe the semantic structure of the documents, schema extraction tools are needed to generate schemas for the given well-formed XML documents.

DTD Generator (Kay, 2000) generates the DTD for a given XML document. However, the DTD generator yields a distinct DTD for every XML document; hence, a set of DTDs is defined for a collection of XML documents rather than an overall DTD. Thus, the application of data mining operations will be difficult in this matter. Tools such as XTRACT (Garofalakis, 2000) and DTD-Miner (Moh et al., 2000) infer an accurate and semantically meaningful DTD schema for a given collection of XML documents. However, these tools depend critically on being given a relatively homogeneous collection of XML documents. In such a heterogeneous and flexible environment as the Web, it is not reasonable to assume that XML documents related to the same topic have the same document structure.

Due to a number of limitations of using DTDs as an internal structure, such as a limited set of data types, loose structure constraints, and the limitation of content to textual data, many researchers propose the extraction of XML schemas as an extension to XML DTDs (Feng et al., 2002; Vianu, 2001). In Chidlovskii (2002), a novel XML schema extraction algorithm is proposed, based on Extended Context-Free Grammars (ECFG) with a range of regular expressions. Feng et al. (2002) also presented a semantic network-based design to convey the semantics carried by the hierarchical data structures of XML documents and to transform the model into an XML schema. However, both of these proposed algorithms are very complex.

Mining of structures from ill-formed XML documents (those that lack any fixed and rigid structure) is performed by applying the structure extraction approaches developed for semi-structured documents. But not all of these techniques can effectively support the structure extraction from XML documents that is required for further application of data mining algorithms. For instance, the NoDoSE tool (Adelberg & Denny, 1999) determines the structure of a semi-structured document and then extracts the data. This system is based primarily on plain text and HTML files, and it does not support XML. Moreover, in Green et al. (2002), the proposed extraction algorithm considers both structure and contents in semi-structured documents, but the purpose is to query and build an index. These techniques are difficult to use without some alteration and adaptation for the application of data mining algorithms.

An alternative method is to approach the document as Object Exchange Model (OEM) data (Nestorov et al., 1999; Wang et al., 2000) by using the corresponding data graph to produce the most specific data guide (Nayak et al., 2002). The data graph represents the interactions between the objects in a given data domain. When extracting a schema from a data graph, the goal is to produce the most specific schema graph from the original graph. This way of extracting schemas is more general than using the schema for a guide, because most XML documents do not have a schema, and sometimes, if they have one, they do not conform to it.

METHODS OF XML CONTENT MINING

Before knowledge discovery in XML documents occurs, it is necessary to query XML tags and content to prepare the XML material for mining. An SQL-based query can extract data from XML documents. There are a number of query languages, some specifically designed for XML and some for semi-structured data in general. Semi-structured data can be described by the grammar of SSD (semi-structured data) expressions. The translation of XML to SSD expressions is easily automated (Abiteboul et al., 2000). Query languages for semi-structured data exploit path expressions. In this way, data can be queried to an arbitrary depth. Path expressions are elementary queries with the results returned as a set of nodes. However, the ability to return results as semi-structured data is required, which path expressions alone cannot do. Combining path expressions with SQL-style syntax provides greater flexibility in testing for equality, performing joins, and specifying the form of query results. Two such languages are Lorel (Abiteboul et al., 2000) and Unstructured Query Language (UnQL) (Farnandez et al., 2000). UnQL requires more precision and is more reliant on path expressions.

XML-QL, XML-GL, XSL, and XQuery are designed specifically for querying XML (W3C, 2004). XML-QL (Garofalakis et al., 1999) and XQuery bring together regular path expressions, SQL-style query techniques, and XML syntax. The great benefit is the construction of the result in XML and, thus, transforming XML data from one schema to another. Extensible Stylesheet Language (XSL) is not implemented as a query language but is intended as a tool for transforming XML to HTML. However, XSL's select pattern is a mechanism for information retrieval and, as such, is akin to a query (W3C, 2004). XML-GL (Ceri et al., 1999) is a graphical language for querying and restructuring XML documents.
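The path-expression style of querying described above can be sketched with the restricted XPath-like syntax of Python's standard-library ElementTree. This is a stand-in for languages such as Lorel, UnQL, or XQuery, not an implementation of them, and the sample document (with the <craft> and <licence> tags used earlier as an association-rule example) is invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<fleet>
  <craft><name>Gull</name><licence>A-17</licence></craft>
  <craft><name>Tern</name></craft>
</fleet>
""")

# A path expression is an elementary query: it returns the set of
# nodes reachable along the path, at an arbitrary depth.
names = [e.text for e in doc.findall(".//craft/name")]

# Tag co-occurrence of the kind association mining looks for:
# how often a <craft> element also contains a <licence> child.
crafts = doc.findall(".//craft")
with_licence = [c for c in crafts if c.find("licence") is not None]
support = len(with_licence) / len(crafts)
```

Here `names` comes back as a plain node set; building structured results from such queries is exactly what the SQL-style XML query languages above add on top of path expressions.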
FUTURE TRENDS

There has been extensive effort to devise new technologies to process and integrate XML documents, but a lot of open possibilities still exist. For example, integration of data mining, XML data models, and database languages will increase the functionality of relational database products, data warehouses, and XML products. Also, to satisfy the range of data mining users (from naive to expert users), future work should include mining user graphs, which are structural information of Web usage, as well as visualization of mined data. As data mining is applied to large semantic documents or XML documents, extraction of information should consider rights management of shared data. XML mining should have an authorization level to enforce security and restrict the discovery of classified information to appropriate users.

CONCLUSION

XML has proved effective in the process of transmitting and sharing data over the Internet. Companies want to bring this advantage into analytical data as well. As XML material becomes more abundant, the ability to gain knowledge from XML sources decreases due to their heterogeneity and structural irregularity; the idea behind XML data mining looks like a solution to put to work. Using XML data in the mining process has become possible through new Web-based technologies that have been developed. Simple Object Access Protocol (SOAP) is a new technology that has enabled XML to be used in data mining. For example, vTag Web Mining Server aims at monitoring and mining of the Web with the use of information agents accessed by SOAP (Vtag, 2003). Similarly, XML for Analysis defines a communication structure for an application programming interface, which aims at keeping client programming independent from the mechanics of data transport but, at the same time, providing adequate information concerning the data and ensuring that it is properly handled (XMLanalysis, 2003). Another development, YALE, is an environment for machine learning experiments that uses XML files to describe the setup of data mining experiments (Yale, 2004). The Data Miner's ARCADE also uses XML as the target language for all data mining tools within its environment (Arcade, 2004).

REFERENCES

Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web: From relations to semistructured data and XML. San Francisco, CA: Morgan Kaufmann.

Adelberg, B., & Denny, M. (1999). NoDoSE version 2.0. Proceedings of the ACM SIGMOD Conference on Management of Data, Seattle, Washington.

Arcade. (2004). http://datamining.csiro.au/arcade.html

Ceri, S. et al. (1999). XML-GL: A graphical language for querying and restructuring XML documents. Proceedings of the 8th International WWW Conference, Toronto, Canada.

Chidlovskii, B. (2002). Schema extraction from XML collections. Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, Portland, Oregon.

Dunham, M.H. (2003). Data mining: Introductory and advanced topics. Upper Saddle River, NJ: Prentice Hall.

Farnandez, M., Buneman, P., & Suciu, D. (2000). UnQL: A query language and algebra for semistructured data based on structural recursion. VLDB Journal: Very Large Data Bases, 9(1), 76-110.

Feng, L., Chang, E., & Dillon, T. (2002). A semantic network-based design methodology for XML documents. ACM Transactions on Information Systems (TOIS), 20(4), 390-421.

Garofalakis, M. et al. (1999). Data mining and the Web: Past, present and future. Proceedings of the Second International Workshop on Web Information and Data Management, Kansas City, Missouri.

Garofalakis, M.N. et al. (2000). XTRACT: A system for extracting document type descriptors from XML documents. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas.

Green, R., Bean, C.A., & Myaeng, S.H. (2002). The semantics of relationships: An interdisciplinary perspective. Boston: Kluwer Academic Publishers.

Kay, M. (2000). SAXON DTD generator: A tool to generate XML DTDs. Retrieved January 2, 2003, from http://home.iclweb.com/ic2/mhkay/dtdgen.html

Moh, C.-H., & Lim, E.-P. (2000). DTD-Miner: A tool for mining DTD from XML documents. Proceedings of the Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, California.

Nayak, R., Witt, R., & Tonev, A. (2002, June). Data mining and XML documents. Proceedings of the 2002 International Conference on Internet Computing, Nevada.

Nestorov, S. et al. (1999). Representative objects: Concise representation of semi-structured, hierarchical data. Proceedings of the IEEE Conference on Management of Data, Seattle, Washington.

Vianu, V. (2001). A Web odyssey: From Codd to XML. Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, California.

Vtag. (2003). http://www.connotate.com/csp.asp

W3C. (2004). XML query (XQuery). Retrieved March 18, 2004, from http://www.w3c.org/XML/Query

Wang, Q., Yu, X.J., & Wong, K. (2000). Approximate graph scheme extraction for semi-structured data. Proceedings of the 7th International Conference on Extending Database Technology, Konstanz.

XMLanalysis. (2003). http://www.intelligenteai.com/feature/011004/editpage.shtml

KEY TERMS

Well-Formed XML Documents: To be well-formed, an XML document must have properly nested tags, unique attributes (per element), one or more elements, exactly one root element, and a number of other syntactic constraints. Well-formed documents may have a schema but need not conform to it.

XML Content Analysis Mining: Concerned with analysing texts within XML documents.

XML Interstructure Mining: Concerned with the structure between XML documents. Knowledge is discovered about the relationship among subjects, organizations, and nodes on the Web.

XML Intrastructure Mining: Concerned with the structure within an XML document. Knowledge is discovered about the internal structure of XML documents.
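The well-formedness constraints listed above can be checked mechanically by any conforming XML parser; a minimal sketch using Python's standard library (the sample documents are invented for illustration):

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True when the parser accepts the document: properly
    nested tags, unique attributes per element, exactly one root."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = is_well_formed("<a><b x='1'/></a>")        # proper nesting, one root
bad_nesting = is_well_formed("<a><b></a></b>")  # improperly nested tags
two_roots = is_well_formed("<a/><b/>")          # more than one root element
```

Validity against a schema is a separate, stronger check, which is exactly the distinction the key term draws.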
Discovering Ranking Functions for Information Retrieval

Praveen Pathak
University of Florida, USA
INTRODUCTION

The field of information retrieval deals with finding relevant documents from a large document collection or the World Wide Web in response to a user's query seeking relevant information. Ranking functions play a very important role in the retrieval performance of such retrieval systems and search engines. A single ranking function does not perform well across different user

[…]

where w_di and w_qi (for i = 1 to t) are weights assigned to the different terms in the document and the query, respectively. The similarity between the two vectors is calculated as the cosine of the angle between the two vectors. It is expressed as (Salton & Buckley, 1988):

    Similarity(Q, D) = sum over i = 1 to t of (w_qi * w_di)
Figure 1. A sample tree representation for a ranking function

[Figure: a tree whose internal nodes are operators such as +, *, and log, and whose leaves are weighting features and constants such as tf, df, and N]

MAIN THRUST

In this chapter we present a systematic and automatic discovery process to discover ranking functions. The process is based on an artificial intelligence technique called genetic programming (GP). GP is based on genetic algorithms (GA) (Goldberg, 1989; Holland, 1992). Because of their intrinsic parallel search mechanism and powerful global exploration capability in high-dimensional space, both GA and GP have been used to solve a wide range of hard optimization problems. They are used in various optimal design and data-mining applications (Koza, 1992).

GP represents the solution to a problem as a chromosome (or an individual) in a population pool. It evolves the population of chromosomes in successive generations by applying genetic transformation operations such as reproduction, crossover, and mutation to discover chromosomes with better fitness values. A fitness function assigns a fitness value to each chromosome that represents how good the chromosome is at solving the problem at hand.

We use GP for discovering ranking functions for four reasons. First, in GP there is no stringent requirement for an objective function to be continuous. All that is needed is that the objective function should be able to differentiate good solutions from bad ones. This property allows us to use common IR performance measures, like average precision (P_Avg), which are non-linear in nature, as objective functions. Second, GP is well suited to represent the common tree-based representations for the solutions. A tree-based representation allows for easier parsing and implementation. An example of a term weighting formula using a tree structure is given in Figure 1. We will use such a tree-based representation in this chapter. Third, GP is very effective for non-linear function and structure discovery problems where traditional optimization methods do not seem to work well (Banzhaf, Nordin, Keller, & Francone, 1998). Finally, it has been empirically found that GP discovers better solutions than those obtained by conventional heuristic algorithms.

Ranking function discovery as presented in this chapter is different from classification in that we seek to find a function that will be used for ranking or prioritizing documents. There are efforts in IR that treat this as a classification problem in which a classifier or discriminant function is used for ranking. But evidence shows that ranking function discovery has yielded better retrieval results than the results obtained by treating this as a classification problem using Support Vector Machines and Neural Networks (Fan, Gordon, Pathak, Wensi, & Fox, 2004; Fuhr & Pfeifer, 1994). We now proceed to describe the discovery process using GP.

Ranking Function Discovery by GP

In order to apply GP in our context we need to define several components for it. We use the tree structure shown in Figure 1 to represent a term weighting formula. Components needed for such a representation are given in Table 1.

For the purpose of our discovery framework we will define these parameters as follows:

- An individual in the population is expressed in terms of a tree, which represents one possible ranking function. A population in a generation consists of P such trees.
- Terminals: We use the features mentioned in Table 2 and real constants as the terminals.
- Functions: We use +, -, *, /, and log as the allowed functions.
GP Parameter                Meaning
Terminals                   Leaf nodes in the tree data structure
Functions                   Non-leaf nodes used to combine the leaf nodes; typically numerical operations
Fitness Function            The objective function that needs to be optimized
Reproduction and Crossover  Genetic operators used to copy fit solutions from one generation to another and to introduce diversity in the population
- Fitness Function: We use P_Avg as the fitness function, which is defined in Equation (2):

      P_Avg = [ sum over i = 1 to |D| of r(d_i) * ( sum over j = 1 to i of r(d_j) ) / i ] / TRel        (2)

  where r(d_i), in {0, 1}, is the relevance score assigned to a document, it being 1 if the document is relevant and 0 otherwise. |D| is the total number of retrieved documents. TRel is the total number of relevant documents for the query. This equation incorporates both of the standard retrieval measures of precision and recall. It also takes into account the ordering of the relevant retrieved documents. For example, if there are 20 documents retrieved and only 5 of them are relevant, then the P_Avg score is higher if the relevant documents are higher up in the order (say, the top 5 retrieved documents are relevant) as compared to the P_Avg score with relevant documents that are lower in the retrieval order (say, the 16th to 20th documents). This property is very important in many retrieval scenarios where the user is willing to see only the top few retrieved documents. P_Avg is the most widely used measure in retrieval studies for comparing the performance of different systems.

- Reproduction: Reproduction copies the top (in terms of fitness) trees in the population into the next population. If P is the population size and the reproduction rate is rate_r, then the top rate_r * P trees are copied into the next generation. rate_r is set to 0.1.

- Crossover: We use tournament selection to select, with replacement, 6 random trees from the population. The top two among the six trees (in terms of fitness) are selected for crossover, and they exchange sub-trees to form trees for the next generation.

The detailed process for the ranking function discovery is given in Figure 2. It is an iterative process. First, a population of random ranking functions is created. The training documents are divided into a training set and a validation set. This two-dataset training methodology has been commonly used in machine learning experiments (Mitchell, 1997). Relevance judgments for a query for each of the documents in the training and validation sets are known. The random ranking functions are evaluated using relevance information for the training set. The performance measure used is given in Equation (2). The topmost ranking function in terms of performance is noted. The population is subjected to the genetic operations of selection, reproduction, and crossover to generate the next generation. The process is repeated for each generation. At the end of 30 generations, the thirty ranking functions are applied to the validation set, and the best performing ranking function is chosen as the discovered ranking function for the particular query.

It is to be noted that over the 30 generations the best ranking function in each generation does successively improve retrieval performance on the training data set. This improvement in retrieval performance is as expected. However, to avoid the overfitting problems that are very common in these techniques, we apply the best ranking function from each of the 30 generations to the unseen validation data set and choose the best performing ranking function on the validation data set to be applied to the test data set. Thus the final ranking function that we choose for applying to the test data set need not necessarily come from the last generation.

Experiments and Results

We have applied the discovery process described above on various TREC (Harman, 1996; Hawking, 2000; Hawking & Craswell, 2001) document collections. The TREC datasets were divided into training, validation, and test datasets. The discovered ranking function after the training and validation phase was applied to the test dataset. The performance results of our method were compared
Figure 2. Discovery process

INPUT: Query, training documents
OUTPUT: Discovered ranking function
Process:
- Training documents divided into a training set and a validation set
- Initial population of random ranking functions generated
- Apply the following on the training dataset for 30 generations:
  - Use each individual ranking function in the generation to rank documents in the training set
  - Create a new population by applying the genetic operators of reproduction and crossover on the existing population
- The top-performing individual from each generation is applied to the documents in the validation set, and the best-performing ranking function is selected
- This is the discovered ranking function

with the results obtained by applying well-known ranking functions in the literature (OKAPI, Pivoted TFIDF, and INQUERY) (Singhal, Salton, Mitra, & Buckley, 1996) to the same test dataset. These well-known functions essentially differ in their term-weighting strategies, i.e., the way they combine the various weighting features shown in Table 2. We have found that retrieval performance increases significantly by using our method of discovering ranking functions. The performance improvement, in terms of P_Avg, has been anywhere from 11% to 75%, with the results being statistically significant. More details about the performance comparisons are available elsewhere (Fan, Gordon, & Pathak, 2004a, 2004b).

FUTURE TRENDS

We believe the ranking function discovery outlined in this chapter is just the beginning of a promising avenue of research. This discovery process can be applied to either the routing search task (search specific to a particular query) or the ad-hoc search task (a generalized search for a set of queries). Thus the process could be made more personalized for an individual user with persistent information requirements, or for a group of users (for example, in a department in an organization) with similar, but not the same, information requirements.

Another promising avenue of research concerns intelligently leveraging both the structural and the statistical information available in documents, specifically HTML and XML documents. The work presented here utilizes just the statistical information in the documents, in terms of frequencies of terms in documents or across documents. But structural information, such as where the terms appear (e.g., in the title, anchor, body, or abstract of the document), can also be leveraged using the discovery process described in the chapter. Such leveraging, we believe, will yield even better retrieval performance. Finally, we believe the technique mentioned in this chapter can be combined with data-fusion techniques to combine successful traits from various algorithms to yield better retrieval performance.

CONCLUSION

In this chapter, we have presented a process for discovering new ranking functions. Ranking functions match the information in documents with that in queries to rank the documents in decreasing order of predicted relevance to the user. Although there are well-known ranking functions in the IR literature, we believe even better ranking functions can be discovered for specific queries or sets of queries. The discovery of ranking functions was accomplished using Genetic Programming, an artificial intelligence algorithm based on evolutionary theory. The results of GP-based retrieval have been found to significantly outperform the results obtained by well-known ranking functions. We believe this line of research will be potentially very rewarding in terms of much improved retrieval performance.

REFERENCES

Banzhaf, W., Nordin, P., Keller, R., & Francone, F. (1998). Genetic programming: An introduction - On the automatic evolution of computer programs and its applications. San Francisco, CA: Morgan Kaufmann Publishers.

Fan, W., Gordon, M., & Pathak, P. (2004a). Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering, 16(4), 523-527.

Fan, W., Gordon, M., & Pathak, P. (2004b). A generic ranking function discovery framework by genetic programming for information retrieval. Information Processing and Management, 40(4), 587-602.

Fan, W., Gordon, M. D., Pathak, P., Wensi, X., & Fox, E. (2004). Ranking function optimization for effective web search by genetic programming: An empirical study. In Proceedings of the 37th Hawaii International Conference on System Sciences, Big Island, Hawaii. IEEE.

Fuhr, N., & Pfeifer, U. (1994). Probabilistic information retrieval as combination of abstraction, inductive learning and probabilistic assumptions. ACM Transactions on Information Systems, 12, 92-115.
Hawking, D., & Craswell, N. (2001). Overview of the TREC-2001 Web track. In E. Voorhees & D. K. Harman (Eds.), Proceedings of the Tenth Text Retrieval Conference (Vol. 500-250, pp. 61-67). NIST.

Holland, J. H. (1992). Adaptation in natural and artificial systems (2nd ed.). MIT Press.

Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press.

Mitchell, T. M. (1997). Machine learning. McGraw Hill.

Salton, G. (1989). Automatic text processing. Reading, MA: Addison-Wesley Publishing Co.

Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.

Singhal, A., Salton, G., Mitra, M., & Buckley, C. (1996). Document length normalization. Information Processing and Management, 32(5), 619-633.

KEY TERMS

Document Frequency: The number of documents in the document collection that the term appears in.

Genetic Programming: A stochastic search algorithm based on evolutionary theory, with the aim of optimizing structure or functional form. A tree structure is commonly used for the representation of solutions.

Precision: The ratio of the number of relevant documents retrieved to the total number of documents retrieved.

Ranking Function: A function that matches the information in documents with that in the user query to assign a score to each document in the collection.

Recall: The ratio of the number of relevant documents retrieved to the total number of relevant documents in the document collection.

Term Frequency: The number of times a term appears in a document.

Vector Space Model (VSM): A common IR model where both documents and queries are represented as vectors of terms.
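The terms defined above fit together in a small, generic vector-space sketch: weight terms by tf * idf, rank by cosine similarity, and evaluate with precision and recall. This is the standard textbook formulation, not one of the GP-discovered ranking functions from the chapter; for simplicity the query is treated as an extra document when document frequencies are computed.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One dict per document mapping term -> tf * idf."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))  # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d.split())                            # term frequency
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs):
    """Ranking function: document indices in decreasing order of score."""
    vecs = tfidf_vectors(docs + [query])   # simplification: query joins the pool
    qv, dvs = vecs[-1], vecs[:-1]
    scores = [(cosine(qv, dv), i) for i, dv in enumerate(dvs)]
    return [i for _, i in sorted(scores, reverse=True)]

def precision_recall(retrieved, relevant):
    """Precision and recall exactly as defined in the key terms above."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)
```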
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Discovering Unknown Patterns in Free Text
the discovery of critical, yet hidden, business information. In the field of academic research, text mining can be used to scan large numbers of publications in order to select the most relevant literature and to propose new links between independent research results. Text mining is also needed to formulate and assess hypotheses arising in biomedical research and to help make policy decisions regarding technical innovation (Smallheiser, 2001, p. 690). Another application of text mining is in medical science: to discover gene interactions, functions and relations, to build and structure medical knowledge bases, and to find undiscovered relations between diseases and medications (De Bruijn & Martin, 2002, p. 8).

Types of Text Mining

Keyword-Based Association Analysis

Association analysis looks for correlations between texts based on the occurrence of related keywords or phrases. Texts with similar terms are grouped together. The pre-processing of the texts is very important and includes parsing and stemming, and the removal of words with minimal semantic content. Another issue is the problem of compounds and non-compounds: should the analysis be based on single words, or should word groups be accounted for? (cf. Han & Kamber, 2001, p. 433). Kostoff et al. (2002), for example, have measured the frequencies and proximities of phrases regarding electrochemical power to discover central themes and the relationships among them. This knowledge discovery, combined with the interpretation of human experts, can be regarded as an example of knowledge creation through intelligent text mining.

Automatic Document Classification

Electronic documents are classified according to a predefined scheme or training set. The user compiles and refines the classification parameters, which are then used by a computer program to categorise the texts in the given collection automatically (cf. Sullivan, 2001, p. 198). Classification can also be based on the analysis of collocation, "the juxtaposition or association of a particular word with another particular word or words" (The Oxford Dictionary, 1995): words that often appear together probably belong to the same class (Lopes et al., 2004). According to Perrin & Petry (2003), useful text structure and content can be systematically extracted by collocational lexical analysis with statistical methods. Text classification can be used by businesses, for example, to categorise customers' e-mails automatically and suggest the appropriate reply templates (Weng & Liu, 2004).

Similarity Detection

Texts are grouped according to their own content into categories that were not previously known. The documents are analysed by a clustering computer program, often a neural network, but the clusters still have to be interpreted by a human expert (Hearst, 1999). Document pre-processing (tagging of parts of speech, lemmatisation, filtering and structuring) precedes the actual clustering phase (Iiritano et al., 2004). The clustering program finds similarities between documents, for example, a common author, the same themes, or information from common sources. The program does not need a training set or taxonomy, but generates it dynamically (cf. Sullivan, 2001, p. 201). One example of the use of text clustering is found in the work of Fattori et al. (2003), whose text-mining tool processes patent documents into dynamic clusters to discover patenting trends, information that can be used as competitive intelligence.

Link Analysis

Link analysis is the process of "building up networks of interconnected objects through relationships in order to expose patterns and trends" (Westphal & Blaxton, 1998, p. 202). In text databases, link analysis is the finding of meaningful, high-level correlations between text entities. The user can, for example, suggest a broad hypothesis and then analyse the data in order to prove or disprove this hunch. It can also be an automatic or semi-automatic process, in which a surprisingly high number of links between two or more nodes may indicate relations that have hitherto been unknown. Link analysis can also refer to the use of algorithms to build and exploit networks of hyperlinks in order to find relevant and related documents on the Web (Davison, 2003). Yoon & Park (2004) use link analysis to construct a visual network of patents, which facilitates the identification of a patent's relative importance: "The coverage of the application is wide, ranging from new idea generation to ex post facto auditing" (p. 49). Text mining is also used to identify experts by finding and evaluating links between persons and areas of expertise (Ibrahim, 2004).

Sequence Analysis

A sequential pattern is the arrangement of a number of elements, in which one leads to the next over time (Wong et al., 2000). Sequence analysis is the discovery
of patterns that are related to time frames, for example, the origin and development of a news thread (cf. Montes-y-Gómez et al., 2001), or the tracking of a developing trend in politics or business. It can also be used to predict recurring events.

Anomaly Detection

Anomaly detection is the finding of information that violates the usual patterns, for example, a book that refers to a unique source, or a document lacking typical information. An example of anomaly detection is the detection of irregularities in news reports or different topic profiles in newspapers (Montes-y-Gómez et al., 2001).

Hypertext Analysis

Text mining is about looking for patterns in natural language text. Web mining is the slightly more general case of looking for patterns in hypertext; it often applies graph-theoretical approaches to detect and utilise the structure of web sites (New Zealand Digital Library, 2002). Marked-up language, especially XML tags, facilitates text mining because the tags can often be used to simulate database attributes and to convert data-centric documents into databases, which can then be exploited (Tseng & Hwung, 2002). Mark-up tags also make it possible to create artificial structures "[that] help us understand the relationship between documents and document components" (Sullivan, 2001, p. 51).

CRITICAL ISSUES

Many sources on text mining refer to text as unstructured data. However, it is a fallacy that text data are unstructured. Text is actually highly structured in terms of morphology, syntax, semantics and pragmatics. On the other hand, it must be admitted that these structures are not directly visible: text represents factual information "in a complex, rich, and opaque manner" (Nasukawa & Nagano, 2001, p. 967).

Authors also differ on the issue of natural language processing within text mining. Some prefer a more statistical approach (cf. Hearst, 1999), while others feel that linguistic parsing is an essential part of text mining. Sullivan (2001, p. 37) regards the representation of meaning by means of syntactic-semantic representations as essential for text mining: "Text processing techniques, based on morphology, syntax, and semantics, are powerful mechanisms for extracting business intelligence information from documents. We can scan text for meaningful phrase patterns and extract key features and relationships." According to De Bruijn & Martin (2002, p. 16), "[l]arge-scale statistical methods will continue to challenge the position of the more syntax-semantics oriented approaches, although both will hold their own place."

In the light of the various definitions of text mining, it should come as no surprise that authors also differ on what qualifies as text mining and what does not. Building on Hearst (1999), Kroeze, Matthee, & Bothma (2003) use the parameters of novelty and data type to distinguish between information retrieval, standard text mining and intelligent text mining (see Figure 1). Halliman (2001, p. 7) also hints at a scale of newness of information: "Some text mining discussions stress the importance of discovering new knowledge. And the new knowledge is expected to be new to everybody. From a practical point of view, we believe that business text should be mined for information that is new enough to give a company a competitive edge once the information is analyzed."

Another issue is the question of when text mining can be regarded as intelligent. Intelligent behavior is "the ability to learn from experience and apply knowledge acquired from experience, handle complex situations, solve problems when important information is missing, determine what is important, react quickly and correctly to a new situation, understand visual images, process and manipulate symbols, be creative and imaginative, and use heuristics" (Stair & Reynolds, 2001, p.
Figure 1. A differentiation between information retrieval, standard and intelligent metadata mining, and
standard and intelligent text mining (abbreviated from Kroeze, Matthee, & Bothma, 2003)
421). Intelligent text mining should therefore refer to the interpretation and evaluation of discovered patterns.

FUTURE TRENDS

Mack and Hehenberger (2002, p. S97) regard the automation of "human-like capabilities for comprehending complicated knowledge structures" as one of the frontiers of text-based knowledge discovery. Incorporating more artificial intelligence abilities into text-mining tools will facilitate the transition from mainly statistical procedures to more intelligent forms of text mining.

CONCLUSION

Text mining can be regarded as the next frontier in the science of knowledge discovery and creation, enabling businesses to acquire sought-after competitive intelligence, and helping scientists of all academic disciplines to formulate and test new hypotheses. The greatest challenges will be to select the most appropriate technology for specific problems and to popularise these new technologies so that they become instruments that are generally known, accepted and widely used.

REFERENCES

Chen, H. (2001). Knowledge management systems: A text mining perspective. Tucson, AZ: University of Arizona (Knowledge Computing Corporation).

Davison, B.D. (2003). Unifying text and link analysis. In Text-Mining & Link-Analysis Workshop of the 18th International Joint Conference on Artificial Intelligence. Retrieved from http://www-2.cs.cmu.edu/~dunja/TextLink2003/

De Bruijn, B., & Martin, J. (2002). Getting to the (c)ore of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67(1-3), 7-18.

Fattori, M., Pedrazzi, G., & Turra, R. (2003). Text mining applied to patent mapping: A practical business case. World Patent Information, 25(4), 335-342.

Halliman, C. (2001). Business intelligence using smart techniques: Environmental scanning using text mining and competitor analysis using scenarios and manual simulation. Houston, TX: Information Uncover.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.

Hearst, M.A. (1999). Untangling text data mining. In Proceedings of ACL'99: The 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26 (invited paper). Retrieved from http://www.ai.mit.edu/people/jimmylin/papers/Hearst99a.pdf

Ibrahim, A. (2004). Expertise location: Can text mining help? In N.F.F. Ebecken, C.A. Brebbia, & A. Zanasi (Eds.), Data mining IV (pp. 109-118). Southampton: WIT Press.

Iiritano, S., Ruffolo, M., & Rullo, P. (2004). Preprocessing method and similarity measures in clustering-based text mining: A preliminary study. In N.F.F. Ebecken, C.A. Brebbia, & A. Zanasi (Eds.), Data mining IV (pp. 73-79). Southampton: WIT Press.

Kostoff, R.N., Tshiteya, R., Pfeil, K.M., & Humenik, J.A. (2002). Electrochemical power text mining using bibliometrics and database tomography. Journal of Power Sources, 110(1), 163-176.

Kroeze, J.H., Matthee, M.C., & Bothma, T.J.D. (2003). Differentiating data- and text-mining terminology. In IT Research in Developing Countries - Proceedings of SAICSIT 2003 (Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists) (pp. 93-101), September 17-19, Pretoria. SAICSIT.

Lopes, M.C.S., Terra, G.S., Ebecken, N.F.F., & Cunha, G.G. (2004). Mining text databases on clients' opinion for oil industry. In N.F.F. Ebecken, C.A. Brebbia, & A. Zanasi (Eds.), Data mining IV (pp. 139-147). Southampton: WIT Press.

Mack, R., & Hehenberger, M. (2002). Text-based knowledge discovery: Search and mining of life-science documents. Drug Discovery Today, 7(11) (Suppl.), S89-S98.

Montes-y-Gómez, M., Gelbukh, A., & López-López, A. (2001). Mining the news: Trends, associations, and deviations. Computación y Sistemas, 5(1). Retrieved from http://ccc.inaoep.mx/~mmontesg/publicaciones/2001/NewsMining-CyS01.pdf

Montes-y-Gómez, M., Pérez-Coutiño, M., Villaseñor-Pineda, L., & López-López, A. (2004). Contextual exploration of text collections. Lecture Notes in Computer Science (Vol. 2945). Berlin: Springer-Verlag. Retrieved from http://ccc.inaoep.mx/~mmontesg/publicaciones/2004/ContextualExploration-CICLing04.pdf

Nasukawa, T., & Nagano, T. (2001). Text analysis and knowledge mining system. IBM Systems Journal, 40(4), 967-984.
New Zealand Digital Library, University of Waikato. (2002). Text mining. Retrieved from http://www.cs.waikato.ac.nz/~nzdl/textmining/

Perrin, P., & Petry, F.E. (2003). Extraction and representation of contextual information for knowledge discovery in texts. Information Sciences, 151, 125-152.

Rob, P., & Coronel, C. (2004). Database systems: Design, implementation, and management (6th ed.). Boston, MA: Course Technology.

Smallheiser, N.R. (2001). Predicting emerging technologies with the aid of text-based data mining: The micro approach. Technovation, 21(10), 689-693.

Stair, R.M., & Reynolds, G.W. (2001). Principles of information systems: A managerial approach (5th ed.). Boston, MA: Course Technology.

Sullivan, D. (2001). Document warehousing and text mining: Techniques for improving business operations, marketing, and sales. New York, NY: John Wiley.

Tseng, F.S.C., & Hwung, W.J. (2002). An automatic load/extract scheme for XML documents through object-relational repositories. Journal of Systems and Software, 64(3), 207-218.

Weng, S.S., & Liu, C.K. (2004). Using text classification and multiple concepts to answer e-mails. Expert Systems with Applications, 26(4), 529-543.

Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York, NY: John Wiley.

Wong, P.K., Cowley, W., Foote, H., Jurrus, E., & Thomas, J. (2000). Visualizing sequential patterns for text mining. In Proceedings of the IEEE Symposium on Information Visualization 2000 (p. 105). Retrieved from http://portal.acm.org/citation.cfm

Yoon, B., & Park, Y. (2004). A text-mining-based patent network: Analytical tool for high-technology trend. The Journal of High Technology Management Research, 15(1), 37-50.

KEY TERMS

Hypertext: A collection of texts containing links to each other to form an interconnected network (Sullivan, 2001, p. 46).

Information Retrieval: The searching of a text collection based on a user's request to find a list of documents organised according to its relevance, as judged by the retrieval engine (Montes-y-Gómez et al., 2004). Information retrieval should be distinguished from text mining.

Knowledge Creation: The evaluation and interpretation of patterns, trends or anomalies that have been discovered in a collection of texts (or data in general), as well as the formulation of its implications and consequences, including suggestions concerning reactive business decisions.

Knowledge Discovery: The discovery of patterns, trends or anomalies that already exist in a collection of texts (or data in general), but have not yet been identified or described.

Mark-Up Language: Tags that are inserted in free text to mark structure, formatting and content. XML tags can be used to mark attributes in free text and to transform free text into an exploitable database (cf. Tseng & Hwung, 2002).

Metadata: Information regarding texts, for example, author, title, publisher, date and place of publication, journal or series, volume, page numbers, key words, etc.

Natural Language Processing (NLP): The automatic analysis and/or processing of human language by computer software, focussed on understanding the contents of human communications. It can be used to identify relevant data in large collections of free text for a data mining process (Westphal & Blaxton, 1998, p. 116).

Parsing: An NLP process that analyses linguistic structures and breaks them down into parts, on the morphological, syntactic or semantic level.

Stemming: Finding the root form of related words, for example singular and plural nouns, or present and past tense verbs, to be used as key terms for calculating occurrences in texts.
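A minimal suffix-stripping stemmer illustrates the Stemming entry above. Production systems typically use the Porter algorithm; the suffix list and the three-letter minimum-stem rule here are assumed simplifications for illustration only.

```python
from collections import Counter

# Longest suffixes first, so "ies" is tried before "es", "es" before "s".
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "ied", "es", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            root = w[:-len(suffix)]
            # "studies"/"studied" -> "study": restore the final y.
            return root + "y" if suffix in ("ies", "ied") else root
    return w

def stemmed_term_frequencies(text):
    """Occurrence counts over stemmed tokens, for use as key terms."""
    return Counter(stem(token) for token in text.split())
```

So "mining", "mined" and "mines" all fold to the same key term, which is exactly what makes occurrence counts over texts comparable.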
Discovery Informatics
William W. Agresti
Johns Hopkins University, USA
it strikes at what is often an essential element for success and progress: discovery.

MAIN THRUST

Both the technology and application dimensions will be explored to help clarify the meaning of discovery informatics.

Discovery Across Technologies

The technology dimension is considered broadly to include automated hardware and software systems, theories, algorithms, architectures, techniques, methods, and practices. Included here are familiar elements associated with data mining and knowledge discovery, such as clustering, link analysis, rule induction, machine learning, neural networks, evolutionary computation, genetic algorithms, and instance-based learning (Wang, 2003). However, the discovery informatics viewpoint goes further, to activities and advances that are associated with other areas but should be seen as having a role in discovery. Some of these activities, like searching or knowledge sharing, are well known from everyday experience.

Conducting searches on the Internet is a common practice that needs to be recognized as part of a thread of information retrieval. Because it is practiced by essentially all Internet users and involves keyword search, there is a tendency to minimize its importance. In fact, search technology is extremely sophisticated (Baeza-Yates & Ribeiro-Neto, 1999). People always have some starting point for their searches. Often, it is not a keyword but a concept, so people are forced to perform the transformation from a notional concept of what is desired to a list of one or more keywords. The net effect can be the familiar many thousand hits from the search engine. Even though the responses are ranked for relevance (a rich and research-worthy subject itself), people may still find that the returned items do not match their intended concepts.

Offering hope for improved search are advances in concept-based search (Houston & Chen, 2004): more intuitiveness to a person's sense of "find me content like this", where "this" can be a concept embodied in an entire document or series of documents. For example, a person may be interested in learning which parts of a new process guideline are being used in practice in the pharmaceutical industry. Trying to obtain that information through keyword searches typically would involve trial and error on various combinations of keywords. What the person would like to do is to point a search tool to an entire folder of multimedia electronic content, ask the tool to effectively integrate over the folder contents, and then discover new items that are similar. Current technology can support this ability to associate a fingerprint with a document (Heintze, 2004) in order to characterize its meaning, thereby enabling concept-based searching. Discovery informatics recognizes that advances in search and retrieval enhance the discovery process.

This same semantic analysis can be exploited in other settings, such as within organizations. It is possible now to have your e-mail system prompt you, based on the content of messages you compose. When you click "send", the e-mail system may open a dialogue box (e.g., "Do you also want to send that to Mary?"). The system has analyzed the content of your message, determining that, for messages in the past having similar content, you also have sent them to Mary. So the system is now asking you if you have perhaps forgotten to include her. While this feature can certainly be intrusive and bothersome unless it is wanted, the point is that the same semantic analysis advances are at work here as with the Internet search example.

The "informatics" part of discovery informatics also conveys the breadth of science and technology needed to support discovery. There are commercially available computer systems and special-purpose software dedicated to knowledge discovery (see listings at http://www.kdnuggets.com/). The informatics support includes comprehensive hardware-software discovery platforms as well as advances in algorithms and data structures, which are core subjects of computer science. The latest developments in data sharing, application integration, and human-computer interfaces are used extensively in the automated support of discovery. Particularly valuable, because of the voluminous data and complex relationships, are advances in visualization (Marakas, 2003). Commercial visualization packages are used widely to display patterns and to enable expert interaction and manipulation of the visualized relationships.

Discovery Across Domains

Discovery informatics encourages a view that spans application domains. Over the past decade, the term has been associated most often with drug discovery in the pharmaceutical industry, mining biological data. The financial industry also was known for employing talented programmers to write highly sophisticated mathematical algorithms for analyzing stock trading data, seeking to discover patterns that could be exploited for financial gain. Retailers were prominent in developing large data warehouses that enabled mining across inventory, transaction, supplier, marketing, and demographic databases. The situation (marked by drug discovery informatics, financial discovery informatics, etc.) was
evolving into one in which "discovery informatics" was preceded by more and more words as it was being used in an increasing number of domain areas. One way to see the emergence of discovery informatics is to strip away the domain modifiers and recognize the universality of every application area and organization wanting to take advantage of its data.

Discovery informatics techniques are very influential across professions and domains:

Common Elements of Discovery Informatics

The only constant in discovery informatics is data and an interacting entity with an interest in discovering new information from it. What varies, and has an enormous effect on the ease of discovering new information, is everything else, notably the following:
sity of California at Berkeley, 2003). Our stockpiles of data are expanding rapidly in every field of endeavor. Businesses at one time were comfortable with operational data summarized over days or even weeks. Increasing automation led to point-of-decision data on transactions. With online purchasing, it is now possible to know the sequence of clickstreams leading up to the sale. So the granularity of the data is becoming finer, as businesses learn more about their customers and about ways to become more profitable. This business analytics is essential for organizations to be competitive.

A similar process of finer granular data exists in bioinformatics. In the human body, there are 23 pairs of human chromosomes, approximately 30,000 genes, and more than 1,000,000 proteins (Watkins, 2001). The advances in decoding the human genome are remarkable, but it is proteins that "ultimately regulate metabolism and disease in the body" (Watkins, 2001, p. 27). So the challenges for bioinformatics continue to grow along with the data.

CONCLUSION

Discovery informatics is an emerging methodology that promotes a crosscutting and integrative view. It looks across both technologies and application domains to identify and organize the techniques, tools, and models that improve data-driven discovery.

There are significant research questions as this methodology evolves. Continuing progress will be eagerly received from efforts on individual strategies for knowledge discovery and machine learning, such as the excellent contributions in Koza et al. (2003). An additional opportunity is to pursue the recognition of unifying aspects of practices now associated with diverse disciplines. While the anticipation of new discoveries is exciting, the evolving practical application of discovery methods needs to respect individual privacy and a diverse collection of laws and regulations. Balancing these requirements constitutes a significant and persistent challenge as new concerns emerge and as laws are drafted.

Looking ahead to the challenges and opportunities of the 21st century, discovery informatics is poised to help people and organizations learn as much as possible from the world's abundant and ever-growing data assets.

REFERENCES

Agresti, W.W. (2000). Knowledge management. Advances in Computers, 53, 171-283.

Agresti, W.W. (2003). Discovery informatics. Communications of the ACM, 46(8), 25-28.

Allen, J.E., Pertea, M., & Salzberg, S.L. (2004). Computational gene prediction using multiple sources of evidence. Genome Research, 14, 142-148.

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Reading, MA: Addison-Wesley.

Bergeron, B. (2003). Bioinformatics computing. Upper Saddle River, NJ: Prentice Hall.

Heintze, N. (2004). Scalable document fingerprinting. Carnegie Mellon University. Retrieved from http://www2.cs.cmu.edu/afs/cs/user/nch/www/koala/main.html

Houston, A., & Chen, H. (2004). A path to concept-based information access: From national collaboratories to digital libraries. University of Arizona. Retrieved from http://ai.bpa.arizona.edu/go/intranet/papers/Book7.pdf

Koza, J.R. et al. (Eds.) (2003). Genetic programming IV: Routine human-competitive machine intelligence. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Marakas, G.M. (2003). Modern data warehousing, mining, and visualization. Upper Saddle River, NJ: Prentice Hall.

Prahalad, C.K., & Hamel, G. (1990). The core competence of the corporation. Harvard Business Review, 3, 79-91.

Senator, T.E., Goldberg, H.G., & Wooton, J. (1995). The financial crimes enforcement network AI system (FAIS): Identifying potential money laundering from reports of large cash transactions. AI Magazine, 16, 21-39.

Tsantis, L., & Castellani, J. (2001). Enhancing learning environments through solution-based knowledge discovery tools: Forecasting for self-perpetuating systemic reform. Journal of Special Education Technology, 16, 39-52.

University of California at Berkeley. (2003). How much information? School of Information Management and Systems. Retrieved from http://www.sims.berkeley.edu/research/projects/how-much-info/how-much-info.pdf

Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

Watkins, K.J. (2001). Bioinformatics. Chemical & Engineering News, 79, 26-45.
Discovery Informatics
KEY TERMS

Clickstream: The sequence of mouse clicks executed by an individual during an online Internet session.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships such as classification, prediction, estimation, or affinity grouping.

Discovery Informatics: The study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data.

Evolutionary Computation: Solution approach guided by biological evolution, which begins with potential solution models and then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data.

Knowledge Management: The practice of transforming the intellectual assets of an organization into business value.

Neural Networks: Learning systems, designed by analogy with a simplified model of the neural connections in the brain, which can be trained to find nonlinear relationships in data.

Rule Induction: Process of learning from cases or instances the if-then rule relationships consisting of an antecedent (i.e., if-part, defining the preconditions or coverage of the rule) and a consequent (i.e., then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).
Discretization for Data Mining

Geoffrey I. Webb
Monash University, Australia

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
encompasses the entire value range, then repeatedly splits it into sub-intervals until some stopping criterion is satisfied. Merge discretization starts with each value in a separate interval, then repeatedly merges adjacent intervals until a stopping criterion is met. It is possible to combine both split and merge techniques. For example, initial intervals may be formed by splitting, and a merge process is then applied to post-process these initial intervals. Non-hierarchical discretization creates intervals without forming a hierarchy. For example, many methods form the intervals sequentially in a single scan through the data.

Univariate vs. Multivariate (Bay, 2000): Univariate methods discretize an attribute without reference to attributes other than the class. In contrast, multivariate methods consider relationships among attributes during discretization.

Disjoint vs. Non-Disjoint (Yang & Webb, 2002): Disjoint methods discretize the value range of an attribute into intervals that do not overlap. Non-disjoint methods allow overlap between intervals.

Global vs. Local (Dougherty, Kohavi, & Sahami, 1995): Global methods create a single mapping function that is applied throughout a given classification task. Local methods allow different mapping functions for a single attribute in different classification contexts. For example, decision tree learning may discretize a single attribute into different intervals at different nodes of a tree (Quinlan, 1993). Global techniques are more efficient, because one discretization is used throughout the entire data mining process, but local techniques may result in the discovery of more useful cut points.

Eager vs. Lazy (Hsu, Huang, & Wong, 2000, 2003): Eager methods generate the mapping function prior to classification time. Lazy methods generate the mapping function as it is needed during classification time.

Ordinal vs. Nominal: Ordinal discretization forms a mapping function from quantitative to ordinal qualitative data. It seeks to retain the ordering information implicit in quantitative attributes. In contrast, nominal discretization forms a mapping function from quantitative to nominal qualitative data, thereby discarding any ordering information. For example, if the value range 0-29 were discretized into three intervals 0-9, 10-19 and 20-29, and the intervals were treated as nominal, then a value in the interval 0-9 would be treated as being as dissimilar to one in 20-29 as it is to one in 10-19. In contrast, while ordinal discretization will treat the difference between 9 and either 10 or 19 as equivalent, it retains the information that this difference is less than the difference between 9 and 29.

Fuzzy vs. Non-fuzzy (Ishibuchi, Yamamoto, & Nakashima, 2001; Wu, 1999): Fuzzy discretization creates a fuzzy mapping function. A value may belong to multiple intervals, each with varying degrees of strength. Non-fuzzy discretization forms exact cut points.

Composite methods first generate a mapping function using an initial primary method. They then use other primary methods to adjust the initial cut points.

MAIN THRUST

The main thrust of this chapter deals with how to select a discretization method. This issue is particularly important since there exist a large number of discretization methods and no one of them can be universally optimal. When selecting between discretization methods it is critical to take account of the learning context, in particular of the learning algorithm, the nature of the data, and the learning objectives. Different learning contexts have different characteristics and hence have different requirements for discretization. It is unrealistic to pursue a universally optimal discretization approach that is blind to its learning context.

Many discretization techniques have been developed primarily in the context of a specific type of learning algorithm, such as decision tree learning, decision rule learning, naive-Bayes learning, Bayes network learning, clustering, and association learning. Different types of learning have different characteristics and hence require different strategies of discretization.

For example, decision tree learners can suffer from the fragmentation problem. If an attribute has many values, a split on this attribute will result in many branches, each of which receives relatively few training instances, making it difficult to select appropriate subsequent tests. Hence they may benefit more than other learners from discretization that results in few intervals. Decision rule learners may require pure intervals (containing instances dominated by a single class), while probabilistic learners such as naive-Bayes do not. The relations between attributes are key themes for association learning, and hence multivariate discretization that can capture the inter-dependencies among attributes is desirable. If coupled with lazy discretization, lazy learners can further save training effort. Non-disjoint discretization is not applicable if the learning algorithm, such as decision tree learning, requires disjoint attribute values.

In order to facilitate understanding of this issue, we contrast discretization strategies in two popular learning
contexts: decision tree learning and naive-Bayes learning. Although both are commonly used for data mining applications, they have very different inductive biases and learning mechanisms. As a result, they call for different discretization methodologies.

Discretization in Decision Tree Learning

In decision tree learning, the learned concept is represented by a decision tree. Each non-leaf node tests an attribute. Each branch descending from that node corresponds to one of the attribute's values. Each leaf node assigns a class label. A decision tree classifies instances by sorting them down the tree from the root to some leaf node (Mitchell, 1997). Algorithms such as ID3 (Quinlan, 1986) and its successor C4.5 (Quinlan, 1993) are well-known exemplars.

Fayyad and Irani (1993) proposed multi-interval entropy minimization discretization (MIEMD), which has been one of the most popular discretization mechanisms for decision tree learning over the years. Briefly speaking, MIEMD discretizes a quantitative attribute by calculating the class information entropy as if the classification used only that single attribute after discretization. This suits the divide-and-conquer strategy of decision tree learning, which handles one attribute at a time. However, it is not necessarily appropriate for other learning mechanisms such as naive-Bayes learning, which involves all the attributes simultaneously (Yang, 2003).

Furthermore, MIEMD uses the minimum description length criterion (MDL) as its termination condition, which decides when to stop further partitioning a quantitative attribute's value range. As An and Cercone (1999) indicate, this criterion tends to form qualitative attributes with few values. This effect is desirable for decision tree learning, since it helps avoid the fragmentation problem by minimizing the number of values of an attribute. If an attribute has many values, forming a split on those values fragments the data into small subsets with respect to which it is difficult to perform further learning (Quinlan, 1993). However, this minimization effect is not so welcome to naive-Bayes learning, because it brings an adverse impact that we detail in the next section.

Discretization in Naive-Bayes Learning

When classifying an instance, naive-Bayes learning applies Bayes' theorem to calculate the probability of each class given this instance. The most probable class is chosen as the class of this instance. In order to simplify the calculation, an attribute independence assumption is made, assuming attributes are conditionally independent of each other given the class. Although this assumption is often violated in real-world applications, naive-Bayes learning still achieves surprisingly good classification performance. Domingos and Pazzani (1997) suggested one reason is that the classification estimation under zero-one loss is only a function of the sign of the probability estimation. The classification accuracy can remain high even while the assumption violation causes poor probability estimation, so long as the highest estimate relates to the correct class. Because they are simple, effective, efficient, robust to noise, and support incremental training, naive-Bayes classifiers have been employed in numerous classification tasks.

Appropriate discretization mechanisms for naive-Bayes learning include fixed-frequency discretization (Yang, 2003), proportional discretization (Yang, 2003) and non-disjoint discretization (Yang, 2003). For example, when discretizing a quantitative attribute, fixed-frequency discretization (FFD) predefines a sufficient interval frequency k. It then discretizes the sorted values into intervals so that each interval has approximately the same number k of training instances with adjacent (possibly identical) values. By this means, FFD fixes an interval frequency that is not arbitrary but can ensure that each interval contains sufficient instances to allow reasonable probability estimates.

However, FFD may result in inferior performance for decision tree learning. FFD first ensures that each interval contains sufficient instances for estimating the naive-Bayes probabilities. On top of that, FFD tries to maximize the number of discretized intervals to reduce discretization bias (Yang, 2003). If employed in decision tree learning, this maximization effect of FFD tends to cause a severe fragmentation problem.

The other way around, MIEMD is effective for decision tree learning but not for naive-Bayes learning. Because of its attribute independence assumption, naive-Bayes learning is not subject to the fragmentation problem. MIEMD's tendency to minimize the number of intervals has a strong potential to reduce the classification variance but increase the classification bias. As the data size becomes large, it is very likely that the loss through bias increase will soon overshadow the gain through variance reduction, resulting in inferior learning performance (Yang, 2003). This impact is particularly undesirable since, due to its efficiency, naive-Bayes learning is very popular for learning from large data. Hence, MIEMD is not a desirable approach for discretization in naive-Bayes learning.

Discretization in Association Rule Discovery

Association rules (Agrawal, Imielinski, & Swami, 1993) have quite distinct discretization requirements to the
classification learning techniques discussed above. Because any attribute may appear in the consequent of an association rule, there is no single class variable with respect to which global supervised discretization might be performed. Further, there is no clear evaluation criterion for the sets of rules that are produced. Whereas it is possible to assess the expected error rate of a decision tree or a naive-Bayes classifier, there is no corresponding metric for comparing the quality of two alternative sets of association rules. In consequence, it is not apparent how one might assess the quality of two alternative discretizations of an attribute. It is not even possible to run the system with each alternative and evaluate the quality of the respective results. For these reasons, simple global discretization techniques such as fixed-frequency discretization are often used.

In contrast, Srikant and Agrawal (1996) present a hybrid global unsupervised and local multivariate supervised discretization technique. Each attribute is initially discretized using fixed-frequency discretization. Then, for each rule a locally optimal discretization is generated by considering new discretizations formed by joining neighboring intervals. This technique is more computationally demanding than a simple global approach, but may result in more useful rules.

FUTURE TRENDS

Many problems still remain open with regard to discretization.

The relationship between the nature of a learning task and the appropriate discretization strategies requires further investigation. One challenge is to identify what aspects of a learning task are relevant to the selection of discretization strategies. It would be useful to characterize tasks and discretization methods into abstract features that facilitate the selection.

Discretization for time-series data is another interesting topic. In time-series data, each instance is associated with a time stamp. The concept that underlies the data may drift over time, which implies that the appropriate discretization cut points may change. It will be time consuming if discretization has to be conducted from scratch each time the data changes. In this case, incremental discretization that needs only the old cut points and the new data to form new cut points can be of great utility.

A third trend is discretization for stream data. A fundamental difference between time-series data and stream data is that for stream data one only has access to data in the current time window, but not any previous data. The data may have large volume, change very fast and require fast response; examples include exchange data from the stock market and observation data from monitoring sensors. The key theme here is to boost discretization's efficiency (without any significant accuracy loss) and to quickly incorporate discretization's results into the learner.

CONCLUSION

The process that transforms quantitative data to qualitative data is discretization. The real world abounds in quantitative data. In contrast, many learning algorithms are more adept at learning from qualitative data. This gap can be shrunk by discretization, which makes discretization an important research area for knowledge discovery.

Numerous methods have been developed as understanding of discretization evolves. Different methods are tuned to different learning tasks. When seeking to employ an existing discretization method or to develop a new discretization mechanism, it is very important to understand the learning context within which the discretization lies. For example, what learning algorithm will make use of the discretized values? Different learning algorithms have different characteristics and require different discretization strategies. There is no universally optimal discretization solution.

There are significant research questions that still remain open in discretization. Continuing progress is both desirable and necessary.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining associations between sets of items in massive databases. Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data (pp. 207-216).

An, A., & Cercone, N. (1999). Discretization of continuous attributes for learning classification rules. Proceedings of the 3rd Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining (pp. 509-514).

Bay, S.D. (2000). Multivariate discretization of continuous variables for set mining. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 315-319).

Bluman, A.G. (1992). Elementary statistics: A step by step approach. Dubuque, IA: Wm. C. Brown Publishers.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the 12th International Conference on Machine Learning (pp. 194-202).

Fayyad, U.M., & Irani, K.B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence (pp. 1022-1027).

Hsu, C.N., Huang, H.J., & Wong, T.T. (2000). Why discretization works for naive Bayesian classifiers. Proceedings of the 17th International Conference on Machine Learning (pp. 309-406).

Hsu, C.N., Huang, H.J., & Wong, T.T. (2003). Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayesian classifiers. Machine Learning, 53(3), 235-263.

Ishibuchi, H., Yamamoto, T., & Nakashima, T. (2001). Fuzzy data mining: Effect of fuzzy discretization. Proceedings of the 2001 IEEE International Conference on Data Mining (pp. 241-248).

Kerber, R. (1992). ChiMerge: Discretization for numeric attributes. Proceedings of the 10th National Conference on Artificial Intelligence (pp. 123-128).

Mitchell, T.M. (1997). Machine learning. New York, NY: McGraw-Hill.

Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann Publishers.

Samuels, M.L., & Witmer, J.A. (1999). Statistics for the life sciences (2nd ed.). Prentice-Hall.

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. Proceedings of the 1996 ACM-SIGMOD International Conference on Management of Data (pp. 1-12).

Wu, X. (1999). Fuzzy interpretation of discretized intervals. IEEE Transactions on Fuzzy Systems, 7(6), 753-759.

Yang, Y. (2003). Discretization for naive-Bayes learning. PhD thesis, School of Computer Science and Software Engineering, Monash University, Melbourne, Australia.

Yang, Y., & Webb, G.I. (2002). Non-disjoint discretization for naive-Bayes classifiers. Proceedings of the 19th International Conference on Machine Learning (pp. 666-673).

KEY TERMS

Turning to the authority of introductory statistics textbooks (Bluman, 1992; Samuels & Witmer, 1999), the following definitions are adopted.

Continuous Data: Can assume all values on the number line within their value range. The values are obtained by measuring. An example is temperature.

Discrete Data: Assume values that can be counted. The data cannot assume all values on the number line within their value range. An example is the number of children in a family.

Discretization: A process that transforms quantitative data to qualitative data.

Nominal Data: Classified into mutually exclusive (non-overlapping), exhaustive categories in which no meaningful order or ranking can be imposed on the data. An example is the blood type of a person: A, B, AB, O.

Ordinal Data: Classified into categories that can be ranked. However, the differences between the ranks cannot be calculated by arithmetic. An example is assignment evaluation: fail, pass, good, excellent.

Qualitative Data: Also often referred to as categorical data; data that can be placed into distinct categories. Qualitative data sometimes can be arrayed in a meaningful order, but no arithmetic operations can be applied to them. Qualitative data can be further classified into two groups, nominal or ordinal.

Quantitative Data: Numeric in nature. They can be ranked in order. They also admit meaningful arithmetic operations. Quantitative data can be further classified into two groups, discrete or continuous.
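As a rough sketch of the fixed-frequency discretization (FFD) idea discussed in the main thrust, the following Python fragment groups sorted values into intervals of approximately k instances each; the handling of adjacent identical values in Yang's (2003) full method is deliberately omitted, and the data are invented for illustration:

```python
# Simplified sketch of fixed-frequency discretization (FFD): group the
# sorted values into intervals of roughly k training instances each.
# Tie handling for identical adjacent values is omitted for brevity.

def fixed_frequency_cut_points(values, k):
    """Return cut points so each interval holds approximately k values."""
    ordered = sorted(values)
    cut_points = []
    for i in range(k, len(ordered), k):
        # Place each cut point midway between neighbouring values.
        cut_points.append((ordered[i - 1] + ordered[i]) / 2)
    return cut_points

values = [1, 3, 4, 7, 8, 10, 15, 20, 21]
print(fixed_frequency_cut_points(values, k=3))  # [5.5, 12.5]
```

Note how k fixes the number of instances per interval rather than the number of intervals, so larger training sets simply produce more intervals, which is the variance-reduction property the text attributes to FFD.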
Discretization of Continuous Attributes

Ricco Rakotomalala
ERIC, Université Lumière - Lyon 2, France
attribute by associating it with its corresponding interval. There are many ways to realize this process.

One of these ways consists of realizing a discretization with a fixed number of intervals. In this situation, the user must choose the appropriate number a priori: Too many intervals will be unsuited to the learning problem, and too few intervals can risk losing some interesting information. A continuous attribute can be divided into intervals of equal width (see Figure 1) or equal frequency (see Figure 2). Other methods exist to constitute the intervals based on clustering principles, for example, k-means clustering discretization (Monti & Cooper, 1999).

Nevertheless, for supervised learning, these discretization methods ignore an important source of information: the instance labels of the class attribute. By contrast, the supervised discretization methods use the class label repartition to place the different cuts and find the more appropriate intervals. Figure 3 shows a situation where it is more efficient to have only two intervals for the continuous attribute instead of three: It is not relevant to separate two bordering intervals if they are composed of data of the same class. Therefore, the supervised or unsupervised quality of a discretization method is an important criterion to take into consideration.

Another important criterion to qualify a method is whether the discretization processes the different attributes one by one or takes into account the whole set of attributes for an overall cutting. The second case, called multivariate discretization, is particularly interesting when some interactions exist between the different attributes. In Figure 4, a supervised discretization that attempts to find the correct cuts by taking into account only one attribute independently of the others will fail: It is necessary to represent the data with the attributes X1 and X2 together to find the appropriate intervals on each attribute.

MAIN THRUST

The two criteria mentioned in the previous section (unsupervised/supervised and univariate/multivariate) will characterize the major discretization method families. In the following sections, we use these criteria to distinguish the particularities of each discretization method.

Univariate Unsupervised Discretization

The simplest discretization methods make no use of the instance labels of the class attribute. For example, equal width interval binning consists of observing the values of the dataset to identify the minimum and maximum values observed and dividing the continuous attribute into the number of intervals chosen by the user (Figure 1). Nevertheless, in this situation, if uncharacteristic extreme values exist in the dataset (outliers), the range will be stretched, and the intervals will be misplaced. To avoid this problem, the continuous attribute can be divided into intervals containing the same number of instances (Figure 2): This method is called equal frequency discretization.

Unsupervised discretization can be grasped as a problem of sorting and separating intermingled probability laws (Potzelberger & Felsenstein, 1993). The existence of an optimum analysis was studied by Teicher (1963) and by Yakowitz and Spragins (1968). Nevertheless, these methods are limited in their application in data mining due to overly strong statistical hypotheses seldom satisfied by real data.

Univariate Supervised Discretization

To improve the quality of a discretization in supervised data-mining methods, it is important to take into account the instance labels of the class attribute. Figure 3 shows the problem of constituting intervals without the information of the class attribute. The intervals that are better adapted to a discrete machine-learning method are the pure intervals containing only instances of a given class. To obtain such intervals, supervised discretization methods such as the state-of-the-art method Minimum Description Length Principle Cut (MDLPC) are based on statistical or information-theoretical criteria and heuristics (Fayyad & Irani, 1993). In a particular case, even if one supervised method can give better results than another (Kurgan & Krysztof, 2004), with real data, the improvements of one method
Figure 3. Supervised and unsupervised discretizations

Figure 4. Interaction between the attributes X1 and X2
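The equal width and equal frequency binnings described for Figures 1 and 2 can be sketched in a few lines of Python; the tiny dataset and the choice of three bins are invented for illustration:

```python
# Illustrative sketch of the two unsupervised binning schemes discussed
# in the text: equal-width intervals and equal-frequency intervals.

def equal_width_bins(values, n_bins):
    """Split the observed [min, max] range into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]  # interior cut points

def equal_frequency_bins(values, n_bins):
    """Cut the sorted values so each interval holds roughly the same count."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    # Each cut point is the last value falling into the corresponding bin.
    return [ordered[int(round(i * step)) - 1] for i in range(1, n_bins)]

data = [1, 2, 2, 3, 4, 10, 40, 41, 43]
print(equal_width_bins(data, 3))      # [15.0, 29.0]
print(equal_frequency_bins(data, 3))  # [2, 10]
```

Note how the extreme values 40-43 stretch the equal-width cut points so that six of the nine instances land in the first interval, which is exactly the outlier sensitivity the text warns about; the equal-frequency cuts are unaffected.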
to different classes are cut on the graph to constitute the clusters; third, the minimal and maximal values of each relevant cluster are used as cut-points on each predictive attribute. The intervals found by this method have the characteristic of being pure on a pavement of the whole representation space, even if purity is not guaranteed for an independent attribute. The combination of all predictive attribute intervals is what provides pure areas in the representation space.

FUTURE TRENDS

Today, the discretization field is well studied in the supervised and unsupervised cases for a univariate process. However, there is little work in the multivariate case. A related problem exists in the feature selection domain, which needs to be combined with the aforementioned multivariate case. This should bring improved and more pertinent progress. It is virtually certain that better results can be obtained for a multivariate discretization if all attributes of the representation space are relevant for the learning problem.

CONCLUSION

In a data-mining task, for a supervised or unsupervised learning problem, discretization turns out to be an essential preprocessing step on which the performance of the learning algorithm that uses the discretized attributes will depend.

Many methods, supervised or not, multivariate or not, exist to perform this pretreatment, more or less adapted to a given dataset and learning problem. Furthermore, a supervised discretization can also be applied in a regression problem when the attribute to be predicted is continuous (Ludl & Widmer, 2000b). The choice of a particular discretization method depends on (a) its algorithmic complexity (complex algorithms will take more computation time and will be unsuited to very large datasets), (b) its efficiency (the simple unsupervised univariate discretization methods are inappropriate for complex learning problems), and (c) its appropriate combination with the learning method using the discretized attributes (a supervised discretization is better adapted to a supervised learning problem). On the last point, it is also possible to significantly improve the performance of the learning method by choosing an appropriate discretization, for instance, a fuzzy discretization for the naive-Bayes algorithm (Yang & Webb, 2002). Nevertheless, it is unnecessary to employ a sophisticated discretization method if the learning method does not benefit from the discretized attributes (Muhlenbach & Rakotomalala, 2002).

ACKNOWLEDGMENT

Edited with the aid of Christopher Yukna.

REFERENCES

Bay, S.D. (2001). Multivariate discretization for set mining. Knowledge and Information Systems, 3(4), 491-512.

Chmielewski, M.R., & Grzymala-Busse, J.W. (1994). Global discretization of continuous attributes as preprocessing for machine learning. Proceedings of the Third International Workshop on Rough Sets and Soft Computing (pp. 294-301).

Divina, F., Keijzer, M., & Marchiori, E. (2003). A method for handling numerical attributes in GA-based inductive concept learners. Proceedings of the Genetic and Evolutionary Computation Conference (pp. 898-908).

Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the 12th International Conference on Machine Learning (pp. 194-202).

Fayyad, U.M., & Irani, K.B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence (pp. 1022-1027).

Fisher, W.D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798.

Frank, E., & Witten, I. (1999). Making better use of global discretization. Proceedings of the 16th International Conference on Machine Learning (pp. 115-123).

Friedman, N., & Goldszmidt, M. (1996). Discretization of continuous attributes while learning Bayesian networks from mixed data. Proceedings of the 13th International Conference on Machine Learning (pp. 157-165).

Grzymala-Busse, J.W., & Stefanowski, J. (2001). Three discretization methods for rule induction. International Journal of Intelligent Systems, 16, 29-38.

Hsu, H., Huang, H., & Wong, T. (2003). Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayes classifiers. Machine Learning, 53(3), 235-263.
Kurgan, L., & Cios, K. J. (2004). CAIM discretization algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2), 145-153.

Liu, H., Hussain, F., Tan, C., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6(4), 393-423.

Ludl, M., & Widmer, G. (2000a). Relative unsupervised discretization for association rule mining. Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery (pp. 148-158).

Ludl, M., & Widmer, G. (2000b). Relative unsupervised discretization for regression problems. Proceedings of the 11th European Conference on Machine Learning (pp. 246-253).

Macskassy, S. A., Hirsh, H., Banerjee, A., & Dayanik, A. A. (2001). Using text classifiers for numerical classification. Proceedings of the 17th International Joint Conference on Artificial Intelligence (pp. 885-890).

Mittal, A., & Cheong, L. (2002). Employing discrete Bayes error rate for discretization and feature selection tasks. Proceedings of the First IEEE International Conference on Data Mining (pp. 298-305).

Monti, S., & Cooper, G. F. (1999). A latent variable model for multivariate discretization. Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics.

Muhlenbach, F., & Rakotomalala, R. (2002). Multivariate supervised discretization: A neighborhood graph approach. Proceedings of the First IEEE International Conference on Data Mining (pp. 314-321).

Potzelberger, K., & Felsenstein, K. (1993). On the Fisher information of discretized data. Journal of Statistical Computation and Simulation, 46(3-4), 125-144.

Teicher, H. (1963). Identifiability of finite mixtures. Annals of Mathematical Statistics, 34, 1265-1269.

Yakowitz, S. J., & Spragins, J. D. (1968). On the identifiability of finite mixtures. Annals of Mathematical Statistics, 39, 209-214.

Yang, Y., & Webb, G. (2002). Non-disjoint discretization for naïve Bayes classifiers. Proceedings of the 19th International Conference on Machine Learning (pp. 666-673).

Yang, Y., & Webb, G. (2003). On why discretization works for naïve Bayes classifiers. Proceedings of the 16th Australian Joint Conference on Artificial Intelligence (pp. 440-452).

Zighed, D., Rakotomalala, R., & Feschet, F. (1997). Optimal multiple intervals discretization of continuous attributes for supervised learning. Proceedings of the Third International Conference on Knowledge Discovery in Databases (pp. 295-298).

KEY TERMS

Cut-Points: A cut-point (or split-point) is a value that divides an attribute into intervals. A cut-point has to be included in the range of the continuous attribute to discretize. A discretization process can produce zero or more cut-points.

Discrete/Continuous Attributes: An attribute is a quantity describing an example (or instance); its domain is defined by the attribute type, which denotes the values taken by the attribute. An attribute is discrete (or categorical, or symbolic) when the number of values is finite. A continuous attribute corresponds to real numerical values (for instance, a measurement). The discretization process transforms an attribute from continuous to discrete.

Instances: An instance is an example (or record) of the dataset; it is often a row of the data table. Instances of a dataset are usually seen as a sample of the whole population (the universe). An instance is described by its attribute values, which can be continuous or discrete.

Number of Intervals: The number of intervals corresponds to the number of distinct values of a discrete attribute resulting from the discretization process. The number of intervals is equal to the number of cut-points plus 1. The minimum number of intervals of an attribute is 1, and the maximum number of intervals is equal to the number of instances.

Representation Space: The representation space is formed by all the attributes of a learning problem. In supervised learning, it consists of the representation of the labeled instances in a multidimensional space, where each predictive attribute plays the role of a dimension.

Supervised/Unsupervised: A supervised learning algorithm searches for a functional link between a class attribute (or dependent attribute, or attribute to be predicted) and the predictive attributes (the descriptors). The supervised learning process aims to produce a predictive model that is as accurate as possible. In an unsupervised learning process, all attributes play the same role; the unsupervised learning method tries to group instances into clusters, where instances in the same cluster are similar and instances in different clusters are dissimilar.
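The Cut-Points and Number of Intervals definitions can be made concrete with a small sketch (the function names below are our own, purely illustrative): equal-width discretization into k intervals produces k - 1 interior cut-points, and each continuous value is mapped to the index of the interval it falls into.

```python
import bisect

def equal_width_cutpoints(values, n_intervals):
    """Return the interior cut-points that split the attribute's
    range into n_intervals equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    # n_intervals intervals need n_intervals - 1 interior cut-points
    return [lo + i * width for i in range(1, n_intervals)]

def discretize(value, cutpoints):
    """Map a continuous value to the index of its interval."""
    return bisect.bisect_right(cutpoints, value)

ages = [18, 22, 25, 31, 40, 47, 53, 66]
cuts = equal_width_cutpoints(ages, 4)        # 3 cut-points -> 4 intervals
labels = [discretize(a, cuts) for a in ages]
```

Supervised methods choose the cut-points differently (e.g., to maximize class purity), but the bookkeeping is identical: k cut-points always define k + 1 intervals.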
Distributed Association Rule Mining

David Taniar
Monash University, Australia

Kate A. Smith
Monash University, Australia

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
(a)

Site A                          Site B
TID   X-1   X-2   X-3           TID   X-1   X-2   X-3
1     1.1   2.2   3.1           8     1.5   2.1   3.1
2     1.1   2.2   3.1           9     1.6   2.2   3.2
3     1.3   2.3   3.3           10    1.3   2.1   3.3
4     1.2   2.5   3.2           11    1.4   2.4   3.4
5     1.7   2.5   3.3           12    1.5   2.4   3.5
6     1.6   2.6   3.6           13    1.6   2.6   3.6
7     1.7   2.7   3.7           14    1.7   2.7   3.7

(b)

Site A                Site B                Site C
TID   X-1   X-2       TID   X-1   X-3       TID   X-1   X-4
1     1.1   2.2       1     1.5   3.1       1     1.5   4.1
2     1.1   2.2       2     1.6   3.1       2     1.6   4.2
3     1.3   2.3       3     1.3   3.3       3     1.3   4.1
4     1.2   2.5       4     1.4   3.2       4     1.4   4.4
5     1.7   2.5       5     1.5   3.3       5     1.5   4.4
6     1.6   2.6       6     1.6   3.6       6     1.6   4.5
7     1.7   2.7       7     1.7   3.7       7     1.7   4.5
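For data partitioned across sites by rows, such as the layout in (a), the classic count-distribution idea behind many DARM algorithms can be sketched as follows (a simplified illustration with invented helper names, not code from any cited system): each site counts candidate itemsets locally, and only these compact count tables are exchanged and summed, never the raw transactions.

```python
from itertools import combinations
from collections import Counter

def local_counts(transactions, k):
    """Count all k-itemsets occurring in one site's transactions."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1
    return counts

def global_frequent(sites, k, min_support):
    """Sum the per-site counts; only these small count tables
    cross the network, never the transactions themselves."""
    total = Counter()
    for site in sites:
        total.update(local_counts(site, k))
    return {iset: c for iset, c in total.items() if c >= min_support}

site_a = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"}]
site_b = [{"bread", "milk"}, {"bread", "milk", "butter"}]
frequent_pairs = global_frequent([site_a, site_b], 2, min_support=3)
```

The communication cost of this scheme is proportional to the number of candidate itemsets rather than the number of transactions, which is the property distributed algorithms exploit.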
hierarchical taxonomy. If a different taxonomy level exists in datasets of different sites, as shown in Figure 3, it becomes very difficult to maintain the accuracy of global models.

FUTURE TRENDS

The DARM algorithms often consider the datasets of various sites as a single virtual table. On the other hand, such assumptions become incorrect when DARM uses different datasets that are not from the same domain. Enumerating rules using DARM algorithms on such datasets may cause a discrepancy if we assume that the semantic meanings of those datasets are the same. Future DARM algorithms will investigate how such datasets can be used to find meaningful rules without increasing the communication cost.

CONCLUSION

The widespread use of computers and the advances in database technology have provided a large volume of data distributed among various sites. The explosive growth of data in databases has generated an urgent need for efficient DARM to discover useful information and knowledge. Therefore, DARM has become one of the active subareas of data-mining research. It not only promises to generate association rules with minimal communication cost, but it also efficiently utilizes the resources distributed among different sites. However, the acceptability of DARM depends to a great extent on the issues discussed in this article.

Figure 3. A generalization scenario
Level 3: Beverage
Level 2: Coffee, Tea, Soft drink
Level 1: Ice Tea, Coke, Pepsi

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C.

Agrawal, R., & Shafer, J. C. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 962-969.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the International Conference on Very Large Databases, Santiago de Chile, Chile.

Ashrafi, M. Z., Taniar, D., & Smith, K. A. (2003). Towards privacy preserving distributed association rule mining. Proceedings of Distributed Computing, Lecture Notes in Computer Science, IWDC'03, Calcutta, India.

Ashrafi, M. Z., Taniar, D., & Smith, K. A. (2004). Reducing communication cost in privacy preserving distributed association rule mining. Proceedings of Database Systems for Advanced Applications, DASFAA'04, Jeju Island, Korea.

Ashrafi, M. Z., Taniar, D., & Smith, K. A. (2004). ODAM: An optimized distributed association rule mining algorithm. IEEE Distributed Systems Online, IEEE.

Schuster, A., & Wolff, R. (2002). Communication-efficient distributed mining of association rules. Proceedings of the ACM SIGMOD International Conference on Management of Data, California.
Cheung, D. W., Ng, V. T., Fu, A. W., & Fu, Y. (1996a). Efficient mining of association rules in distributed databases. IEEE Transactions on Knowledge and Data Engineering, 8(6), 911-922.

Cheung, D. W., Ng, V. T., Fu, A. W., & Fu, Y. (1996b). A fast distributed algorithm for mining association rules. Proceedings of the International Conference on Parallel and Distributed Information Systems, Florida.

Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas.

Kantarcioglu, M., & Clifton, C. (2002). Privacy preserving distributed mining of association rules on horizontally partitioned data. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Edmonton, Canada.

Rizvi, S. J., & Haritsa, J. R. (2002). Maintaining data privacy in association rule mining. Proceedings of the International Conference on Very Large Databases, Hong Kong, China.

Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada.

Zaki, M. J. (1999). Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4), 14-25.

Zaki, M. J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(2), 372-390.

Zaki, M. J., & Pan, Y. (2002). Introduction: Recent developments in parallel and distributed data mining. Journal of Distributed and Parallel Databases, 11(2), 123-127.

KEY TERMS

DARM: Distributed Association Rule Mining.

Data Center: A centralized repository for the storage and management of information, organized for a particular area or body of knowledge.

Frequent Itemset: An itemset whose support meets or exceeds the user-specified support threshold.

Network Intrusion Detection: A system that detects inappropriate, incorrect, or anomalous activity in a private network.

SMC: Secure Multiparty Computation. SMC computes a function f(x1, x2, ..., xn) that takes inputs from several parties such that, at the end, all parties know the result of f(x1, x2, ..., xn) and nothing else.

Taxonomy: A classification based on a pre-determined system that is used to provide a conceptual framework for discussion, analysis, or information retrieval.
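The SMC entry can be illustrated by its simplest instance, a secure sum based on additive secret sharing (a toy sketch, not a hardened protocol; all names below are ours): each party splits its private input into random shares, and only share sums are ever combined, so the parties learn the total and nothing else.

```python
import random

def make_shares(secret, n_parties, modulus=10**9):
    """Split an integer into n additive shares mod `modulus`."""
    shares = [random.randrange(modulus) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % modulus)
    return shares

def secure_sum(private_inputs, modulus=10**9):
    """Each party distributes one share of its input to every party;
    party i sums the shares it holds, and only these partial sums
    are combined to reconstruct the total."""
    n = len(private_inputs)
    all_shares = [make_shares(x, n, modulus) for x in private_inputs]
    partial = [sum(all_shares[p][i] for p in range(n)) % modulus
               for i in range(n)]
    return sum(partial) % modulus

counts = [17, 42, 5]           # e.g., local support counts of three sites
total = secure_sum(counts)     # equals sum(counts); no single count leaks
```

Secure-sum primitives of this kind are the building blocks that privacy-preserving DARM protocols use to combine local support counts without revealing any individual site's count.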
Distributed Data Management of Daily Car Pooling Problems

Fabio de Luigi
University of Ferrara, Italy

Palle Haastrup
European Commission, Italy

Vittorio Maniezzo
University of Bologna, Italy
tion, compared to private transportation, can be improved through an integrated process, based on data management facilities. Usually, potential customers of a new transportation system face serious problems of knowledge retrieval, which is the reason for devoting substantial efforts to providing clients with an easy and powerful tool to find information. Current information systems give the possibility of real-time data management and include the possibility to react to unscheduled events, as they can be reached from almost anywhere.

Architectures for information browsing, like the World Wide Web (WWW), are an easy and powerful means through which to deploy databases and, thus, represent obvious options for setting up services, such as the one advocated in this work. The WWW is useful for providing access to central data storage, for collecting remote data, and for allowing GIS and optimization software to interact. All these elements have an important impact on the implementation of new transport services (Ridematching, 2004). Based on the way they use the WWW and on the services they provide, car-pooling systems range between two extremes (Carpoolmatch, 2004; Carpooltool, 2004; Ridepro, 2004; SAC, 2004, etc.): one where there is a WWW site collecting information about trips, which are open to every registered user, and another where the users of the system are a restricted group and are granted more functionalities.

A main issue for the first type is to guarantee the reliability of the information. Their interface often is designed mainly to help set service-related geographical information (customer delivery points, customer pickup points, paths). Such systems only rarely suggest a matching between clients and servers but operate as a post-it wall, where users can consult or leave information about travel routes.

As for the latter type, the idea is normally to set up a car-pooling service among the users, usually employees of the same organization. This system is more structured. Moreover, the spontaneous user matching often is substituted by a solution found by means of an algorithmic approach. An example of this type of system is reported in Dailey et al. (1999) or in Lexington-Fayette County (2004). These systems use the WWW and react to unexpected events with e-mail messages.

MAIN THRUST

This article describes an integrated ICT system that supports the management of a car-pooling service in a real-world prototypical setting. An approach similar to Dailey et al. (1999) is suggested, and a complete system for supporting the operation of a car-pooling case as a prototype for a real-life application is described. The service is supported by a database of potential users (e.g., employees of a company) that daily commute from their houses to their workplace. A subset of them offers seats in their cars. Moreover, they specify the departure time (when they leave their house) and the mandatory arrival time at the office. The employees that offer seats in their cars are called servers. The employees asking for a lift are called clients. The set of servers and the set of clients need to be redefined once a day. The effectiveness of the proposed system is strictly related to the architecture and the techniques used to manage information.

The objective of this research is to prove that, at least in a particular site (the Joint Research Center [JRC] of the European Commission located in Ispra, northern Italy), it could be possible to reduce the number of transport vehicles without significantly changing either the number of commuters or their comfort level. The users of the system are commuters normally using their own cars for traveling between home and a workplace poorly served by public transportation.

System Architecture

The architecture of the system developed for this problem is shown in Figure 1 (interested readers can find a complete description of the whole system in Wolfler et al., 2004). The system consists of the following five main modules:

OPT: An optimization module that generates a feasible solution using an algorithm that defines the paths for the servers. The algorithm makes use of a heuristic approach and is used to assign the clients to the servers and to define the path for each server. Each path minimizes the travel time, maximizes the number of requests picked up, and satisfies the time and capacity constraints.

{M-, S-, W-}CAR: Three modules that receive, decrypt, and send SMS (SCAR), e-mail (MCAR), and Web pages (WCAR) to the users, respectively. The module SMS Car Pooling (SCAR) allows the server to send and receive SMS messages. The module Mail Car Pooling (MCAR) supports e-mail communication and uses POP3 and SMTP as protocols. The module WCAR is the gateway for the Web interaction. All modules filter the customers' access, allowing the entitled user to insert new data, to query and to modify the database (e.g., the desired departure and arrival times), and to access service data.

GUI: A graphical user interface based on ESRI ArcView. It generates a view (a digital map) of the current problem instance and provides all relevant data management.
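As a rough illustration of the kind of assignment heuristic the OPT module embodies (our own simplified sketch, not the actual JRC algorithm; Euclidean distance stands in for travel time, and a single radius stands in for the accepted delay), a greedy pass can assign each client to the nearest server that still has a free seat:

```python
def greedy_assign(servers, clients, max_delay):
    """servers: {name: {"pos": (x, y), "seats": k}}
    clients: {name: (x, y)}
    Returns {server: [assigned clients]}; clients with no feasible
    server are simply left unserviced."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    seats = {s: info["seats"] for s, info in servers.items()}
    pools = {s: [] for s in servers}
    for c, cpos in clients.items():
        # candidate servers: a free seat and a detour within max_delay
        options = [(dist(info["pos"], cpos), s)
                   for s, info in servers.items()
                   if seats[s] > 0 and dist(info["pos"], cpos) <= max_delay]
        if options:
            _, best = min(options)
            pools[best].append(c)
            seats[best] -= 1
    return pools

servers = {"s1": {"pos": (0, 0), "seats": 1}, "s2": {"pos": (10, 0), "seats": 2}}
clients = {"c1": (1, 0), "c2": (9, 0), "c3": (11, 0)}
pools = greedy_assign(servers, clients, max_delay=3.0)
```

The actual OPT module additionally optimizes each server's path and respects time windows; the sketch only captures the seat-capacity and detour-radius constraints.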
Figure 1. The architecture of the system
[Diagram: the employees interact with the system through SMS (module SCAR), e-mail (module MCAR), and the Web (module WCAR), submitting requests and offers; the GUI and the OPT module process the geography data and return the solutions.]

The system collects and uses data of two different types: geographic and alphanumeric. The geographic database contains the maps of the region of interest with a detailed road network, the geocoded client pickup sites, and all the geographic information needed to generate a GIS output on the Web. The alphanumeric data repository, maintained by a relational database, contains information about the employees and a representation of the road network of the area where the car-pooling service is active. Other data, input by the users, are strictly related to the daily situation: detailed information about the service, whether a user is a server or a client, the departure time from home and the maximal acceptable arrival time at work, the number of available seats in the cars offered by the servers, and the maximum accepted delay, which is the parameter used to specify how far out of the shortest way a server is willing to drive in order to pick up colleagues. The users are permitted to consult, modify, or delete their own entries in the database at any time.

The road network and all user-related data are processed by the optimization module in order to define user pools and car paths. The optimization algorithm can be activated on a regular schedule without anybody's immediate presence, searching a local database and presenting the results on the Web. All these actions are performed on a periodic basis on the evening before each work day.

The system is designed to support two types of users: the system administrator and the employee. The system administrator must update the static databases (users, maps) and guarantee all the functionalities. All other data are entered, edited, and deleted by the users through a distributed GUI.

Communication Subsystems

The system supports three different communication channels: Web, SMS, and e-mail. The Web is the main interface to connect a user to the system. The user, by means of a standard GUI, is allowed to insert transportation requests and offers; the system then generates Web pages containing text and maps presenting the results of the optimization algorithm. Both e-mail and SMS can also be used for submitting requests and for receiving results or variations of previously proposed results (following real-time notifications). The clients learn whether their requests could be matched and the details of their pickup, while the servers simply receive an SMS notifying them of the pool formation; travel details are sent separately via e-mail. SMS is also used by the servers (via a service relay) to inform clients of delays.

Operationally, a user sending an e-mail to the system must insert the message string in the subject field, which will be processed by a mail receiver agent and finally sent to the parser. Economic and privacy considerations suggested sending all SMS messages to an SMS engine, which is interfaced with the system and which transfers the received string to the parser. Each time the system receives a syntactically correct e-mail or SMS, it automatically sends an acknowledgment SMS message.

While the system generates messages to be sent directly to users, those that users generate, either by e-mail or SMS, must follow a rigidly structured format in order to allow automatic processing. We implemented a single parser for all these messages. The Web interface is obviously more user-friendly. Each employee can specify through an ASP page the set of days (possibly empty) when he or she is willing to drive and the set of days when he or she asks to be picked up.

The system then computes a matching of servers and clients. The result is a set of routes starting from the server houses, arriving at the workplace, and passing through the set of client houses without violating the time window constraints and the car capacity constraints.

These routes are made available both to clients and to servers. A well-known GIS module, ArcView, loads the route information given in alphanumeric format after the optimization. Then, it transforms this information into a set of GIS views through a set of queries to its database. The last step transforms the views into bitmaps displayed by the Web server. Moreover, the system alerts clients and servers in real time by means of SMS when any variation of the scheduling happens.

Optimization Algorithms and Implementation Details

One of the most interesting features of the approach described in this article, with respect to the other systems currently in use, is the set of algorithms used to obtain a matching of clients and servers. Due to the NP-hardness of the underlying problem (Varrentrapp et al., 2002) and to the size of the instances to solve, a heuristic
approach was in order. In designing it, efficiency was the main parameter, ensuring relatively fast response times also for larger instances. The matching obtained minimizes the total travel length of all servers going to the workplace and maximizes the number of clients serviced, while meeting the operational constraints. In addition, real-time services are supported; for example, the system sends warnings to employees when delays occur.

The system was implemented at the Joint Research Center of the European Commission, using standard commercial software and developing extra modules in C++ using the Microsoft Visual Studio C++ compiler. The database used is Microsoft Access. The geographical data are stored in shape files and used by an ArcView application (release 3.2). The system administrator can access data directly through MS Access and ArcView, which are both packages available and running on the same machine where the car-pooling application resides. Through ArcView, the system manager can start the optimization module.

The Joint Research Center of the European Commission (JRC) is situated in the northwest of Italy, not far from Milan. It covers a large area of 2 km², and its mission is to carry out research useful to the European Commission. The number of employees is about 2,000 people, divided into three main classes: administrative, staff, and researchers. They come from all around Europe and live in the area surrounding the center. The wide geographical area covered by the commuters is about 100 km². Since this is a sparsely populated area, public transportation (i.e., trains, buses, etc.) is definitely insufficient; thus, private cars are necessary and used.

Module OPT, in particular, was tested on a set of real-world problem instances derived from data provided by the JRC. The instances used are the same as those described in Baldacci et al. (2004); they are derived from the real-world instance defined by over 600 employees by randomly selecting the desired number of clients and servers.

Computational results show that the CPU time used by our heuristic increases approximately linearly with the problem dimension, and the number of unserviced requests is a constant proportion of the total number of employees. Moreover, comparing two instances with the same dimension but with different percentages of servers, one can see that the CPU time increases when fewer servers are available, as expected, while the total travel time decreases, since fewer paths are driven.

FUTURE TRENDS

Car-pooling services, as well as most other alternative public transportation services, are likely to become more and more common in the near future. Already, several software companies are marketing tailored solutions. Therefore, it is easy to envisage that systems featuring the services included in the prototype described in this work will define their own market niche. This, however, can happen only in parallel with an increasing sensibility of local governments, which alone can define incentive policies for drivers who share their cars. These policies will reflect features of the implemented systems. As shown in this article, all the needed technological infrastructure is already available.

CONCLUSION

This article presents the essential elements of a prototypical deployment of ICT support to a car-pooling service for a large-sized organization. The essential features that can make its actual use possible are in the technological infrastructure, essentially in the distributed data management and accompanying optimization, which helps to provide real-time response to user needs. The current possibility of using any data access means as a system interface (we used e-mail, SMS, and the Web, but new smart phones and, more in general, UMTS can provide other access points) can ease system acceptance and end-user interest. However, ease of usage alone would be of little interest if the solutions proposed, in terms of driver paths, were not of good quality. Current developments in optimization research, in the areas of both heuristic and exact approaches, provide a firm basis for designing support systems that can also deal with the dimension of the instances induced by large-sized organizations.

REFERENCES

Baldacci, R., Maniezzo, V., & Mingozzi, A. (2004). An exact method for the car pooling problem based on Lagrangean column generation. Operations Research, 52(3), 422-439.

Carpoolmatch. (2004). http://www.carpoolmatchnw.org/

Carpooltool. (2004). http://www.carpooltool.com/en/my/

Colorni, A., Cordone, R., Laniado, E., & Wolfler Calvo, R. (1999). Innovation in transports: Planning and management [in Italian]. In S. Pallottino & A. Sciomachen (Eds.), Scienze delle decisioni per i trasporti. Franco Angeli.

Cordeau, J.-F., & Laporte, G. (2003). The dial-a-ride problem (DARP): Variants, modeling issues and algorithms. Quarterly Journal of the Belgian, French and Italian Operations Research Societies, 1, 89-101.
Dailey, D. J., Loseff, D., & Meyers, D. (1999). Seattle smart traveler: Dynamic ridematching on the World Wide Web. Transportation Research Part C, 7, 17-32.

Hildmann, H. (2001). An ants metaheuristic to solve car pooling problems [master's thesis]. University of Amsterdam, The Netherlands.

Lexington-Fayette County. (2004). Ride matching services. Retrieved from http://www.lfucg.com/mobility/rideshare.asp

Maniezzo, V. (2002). Decision support for location problems. Encyclopedia of Microcomputers, 8, 31-52. NY: Marcel Dekker.

Maniezzo, V., Carbonaro, A., & Hildmann, H. (2004). An ANTS heuristic for the long-term car pooling problem. In G. C. Onwubolu & B. V. Babu (Eds.), New optimization techniques in engineering (pp. 411-430). Heidelberg, Germany: Springer-Verlag.

Mattarelli, M., Maniezzo, V., & Haastrup, P. (1998). A decision support system distributed on the Internet. Journal of Decision Systems, 6(4), 353-368.

Ridematching. (2004). University of South Florida, ridematching systems. Retrieved from http://www.nctr.usf.edu/clearinghouse/ridematching.htm

Ridepro. (2004). http://www.ridepro.net/index.asp

SAC. (2004). San Antonio College's carpool matching service. Retrieved from http://www.accd.edu/sac/carpool/

Varrentrapp, K., Maniezzo, V., & Stützle, T. (2002). The long term car pooling problem: On the soundness of the problem formulation and proof of NP-completeness. Technical Report AIDA-02-03. Darmstadt, Germany: Technical University of Darmstadt.

Wolfler Calvo, R., de Luigi, F., Haastrup, P., & Maniezzo, V. (2004). A distributed geographic information system for the daily car pooling problem. Computers & Operations Research, 31, 2263-2278.

KEY TERMS

Car Pooling: A collective transportation system based on the shared use of private cars (vehicles), with the objective of reducing the number of cars on the road.

Computational Problem: A relation between input and output data, where the input data are known (and correspond to all possible different problem instances) and the output data are to be identified, but the predicates or assertions they must verify are given.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, such as classification, prediction, estimation, or affinity grouping.

GIS: Geographic Information Systems; tools used to gather, transform, manipulate, analyze, and produce information related to the surface of the Earth.

Heuristic Algorithms: Optimization algorithms that do not guarantee to identify the optimal solution of the problem they are applied to, but which usually provide good-quality solutions in an acceptable time.

NP-Hard Problems: Optimization problems for which a solution can be verified in polynomial time, but no polynomial solution algorithm is known, even though no one so far has been able to demonstrate that none exists.

Optimization Problem: A computational problem for which an objective function associates a merit figure with each problem solution, and it is asked to identify a feasible solution that minimizes or maximizes the objective function.

ENDNOTE

1. This article is an abridged and updated version of the paper, A Distributed Geographic Information System for the Daily Car Pooling Problem, published as Wolfler et al., 2004.
Drawing Representative Samples from Large Databases

Hong Guo
Southern Illinois University, USA

Feng Yan
Williams Power, USA

Qiang Zhu
University of Michigan, USA
∑_{all x} w(x) = 1.            (2)

MAIN THRUST
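Equation (2) simply states that the weights form a normalized probability distribution over all objects. A minimal sketch of this normalization step (our own illustration, mapping raw weights W(x) to normalized weights w(x)):

```python
def normalize(raw_weights):
    """Scale raw per-object weights W(x) into w(x) so that the
    weights sum to 1 over all objects, as Equation (2) requires."""
    total = sum(raw_weights.values())
    return {x: w / total for x, w in raw_weights.items()}

w = normalize({"a": 2.0, "b": 1.0, "c": 1.0})
# w sums to 1, preserving the ratios of the raw weights
```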
the sample has a distribution close to that of the popula- distribution and the greater the Wmax / Wavg value. This
tion when the sample size is large enough. Let r l be a explains why the AR sampling becomes less efficient as
random variable denoting the number of occurrences the dimension increases. In comparison, our Metropolis
sampling method accepts one object in every trial. If r_l is the number of occurrences of object x_l (l = 1, 2, ..., k) in the sample, then

E(r_1) : E(r_2) : ... : E(r_k) ≈ W(x_1) : W(x_2) : ... : W(x_k)

when the sample size is large enough, where E(r_l) is the expected number of occurrences of object x_l in the sample. We also have E(r_1)/N ≈ w(x_1), E(r_2)/N ≈ w(x_2), ..., E(r_k)/N ≈ w(x_k).

Finally, it should be pointed out that when all objects have an equal weight, the Metropolis sampling degenerates to the simple random sampling.

Experimental Results

Now, we report the results of our empirical evaluations of the Metropolis sampling and the AR sampling. We compare the efficiency of the methods and the quality of the samples yielded.

To compare the methods quantitatively, we select a model for which we know the analytic values. Here, we have chosen the Gaussian model w(x̄) = (1/π)^(d/2) e^(−φ(x̄)), where d is the dimension and φ(x̄) = x̄². [...] Equations (5) and (6), and the vast remaining region has vanishing contributions. As the dimension increases, the peak becomes higher and narrower, and the distribution becomes more skewed. In our experiments, we let d = 1, 3, 10, and 20.

Sampling Efficiency

First, let us compare the cost of sampling. As shown in Figure 3, the AR method roughly needs 5, 10, 75, and 1,750 trials to accept just one object into the sample in the 1-, 3-, 10-, and 20-dimensional cases, respectively. It is noted that the higher the dimension, the more skewed the Gaussian distribution.

Quality of the Samples

Instead of showing the variance of the estimation ⟨σ²⟩, here we show the second moment of φ, ⟨φ²⟩. The second moment tells how different the estimates are from the actual values, especially when the estimate is biased.

As shown in Figures 4 and 5, both methods yield pretty accurate estimates of the population mean (= 0.5) and second moment (= 0.75) for the one-dimensional case. Hence, from Equation (7), they also give good estimates of the variance.

As shown in Figures 6 and 7 for d = 3, the AR sampling yields estimates around 1.0 and 1.7 for the mean and second moment, respectively, which are below their respective exact values 3/2 and 15/4. On the other hand, our Metropolis sampling yields quite accurate estimates. As the sample size gets larger, our estimates get closer to the analytic values, but those of the AR sampling do not.

Similar results are also observed for higher dimensions [...] the dimension is high. The low W(x̄)/W_max values could make the remote points not selectable when compared with the random numbers generated in the process.

The random number generators on most computers are based on the linear congruential method, which first generates a sequence of integers by the recurrence relation I_{j+1} = aI_j + c (mod m), where m is the modulus, and a and c are positive integers (Press, Teukolsky, Vetterling, & Flannery, 1994). The random numbers generated are I_1/m, I_2/m, I_3/m, ..., and the smallest number generated is 1/m. As a result, a trial point in the AR sampling whose acceptance probability W(x̄)/W_max is below 1/m can never be accepted.
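As a concrete illustration, the two samplers compared in this section can be sketched for the one-dimensional Gaussian model. The proposal range, step size, and sample sizes below are our own choices for the sketch, not values taken from the article:

```python
import math
import random

def w(x):
    """Target density for d = 1: w(x) = (1/pi)^(1/2) * exp(-x^2)."""
    return (1.0 / math.pi) ** 0.5 * math.exp(-x * x)

def ar_sample(n, rng, lo=-4.0, hi=4.0):
    """Acceptance-rejection (AR) sampling: propose uniformly on
    [lo, hi] and accept a trial point x with probability w(x)/w_max."""
    w_max = w(0.0)
    sample = []
    while len(sample) < n:
        x = rng.uniform(lo, hi)
        if rng.random() < w(x) / w_max:
            sample.append(x)
    return sample

def metropolis_sample(n, rng, step=1.0):
    """Metropolis sampling: accept a move from x to x' with
    probability min(1, w(x')/w(x)); on rejection, repeat x."""
    x = 0.0
    sample = []
    for _ in range(n):
        x_new = x + rng.uniform(-step, step)
        if rng.random() < w(x_new) / w(x):
            x = x_new
        sample.append(x)
    return sample

def phi_mean(sample):
    """Estimate <phi> with phi(x) = x^2 (exact value 0.5 for d = 1)."""
    return sum(x * x for x in sample) / len(sample)

rng = random.Random(0)
print(phi_mean(ar_sample(20000, rng)))          # close to 0.5
print(phi_mean(metropolis_sample(20000, rng)))  # close to 0.5
```

Both estimates should land near the analytic value 0.5; in higher dimensions the AR acceptance rate collapses, which is the effect reported in Figure 3.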
Drawing Representative Samples from Large Databases
Figure 3. Trials needed versus sample size (curves for d = 1, 3, 10, and 20)
Figure 4. Sample means for d = 1. ⟨φ⟩ = 0.5

Figure 5. Sample means of the second moment for d = 1. ⟨φ²⟩ = 0.75
Figure 6. Sample means for d = 3. ⟨φ⟩ = 1.5

Figure 7. Sample means of the second moment for d = 3. ⟨φ²⟩ = 15/4
Figure 8. Sample means for d = 20. ⟨φ⟩ = 10

Figure 9. Sample means of the second moment for d = 20. ⟨φ²⟩ = 110
Figure 10. Chi-square test on 1-d samples with χ²_0.95 = 12.6 and ν = 6

Figure 11. Chi-square test on 3-d samples with χ²_0.95 = 386 and ν = 342
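The statistic behind Figures 10 and 11 is the usual chi-square goodness-of-fit measure, χ² = Σ_i (O_i − E_i)²/E_i over the bins of a partition of the sample space; a sample is judged representative when the statistic stays below the critical value (12.6 at the 0.95 level for ν = 6 in the 1-d case). A sketch, where the binning and the uniform sample are invented for illustration:

```python
import random

def chi_square(observed, expected):
    """Chi-square statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical check: bin N draws from a sampler into 7 equal cells
# (nu = 7 - 1 = 6 degrees of freedom, as in the 1-d case of Figure 10)
# and compare the observed counts with the expected count N / 7.
rng = random.Random(1)
n_bins, n = 7, 2100
counts = [0] * n_bins
for _ in range(n):
    counts[min(int(rng.random() * n_bins), n_bins - 1)] += 1
stat = chi_square(counts, [n / n_bins] * n_bins)
print(stat)  # compare against the critical value 12.6
```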
large to handle. Sampling, therefore, becomes a necessity for analyses, surveys, and numerical calculations in these applications. In addition, for many modern applications in OLAP and data mining, where fast responses are required, sampling also becomes a viable approach for constructing an in-core representation of the data. A general sampling algorithm that applies to all distributions is needed more than ever.

[...]sional data for data mining application. In Proceedings of the ACM SIGMOD Conference (pp. 94-105).

Haas, P., & Swami, A. (1992). Sequential sampling procedures for query size estimation. In Proceedings of the ACM SIGMOD Conference (pp. 341-350).

Hou, W.-C., & Ozsoyoglu, G. (1991). Statistical estimators for aggregate relational algebra queries. ACM Transactions on Database Systems, 16(4), 600-654.
Rubinstein, R. (1981). Simulation and the Monte Carlo method. New York: John Wiley & Sons.

Spiegel, M. (1991). Probability and statistics. McGraw-Hill, Inc.

Wu, Y., Agrawal, D., & El Abbadi, A. (2001). Using the golden rule of sampling for query estimation. In Proceedings of the ACM SIGMOD Conference (pp. 279-290).

Xu, X., Ester, M., Kriegel, H., & Sander, J. (1998). A distribution-based clustering algorithm for mining in large spatial databases. In Proceedings of the IEEE ICDE Conference (pp. 324-331).

KEY TERMS

Metropolis Algorithm: Proposed in 1953 by Metropolis et al. for studying statistical physics. Since then it has become a powerful tool for investigating thermodynamics, solid state physics, biological systems, and so on. The algorithm is known as the most successful and influential Monte Carlo method.

Monte Carlo Method: The heart of this method is a random number generator. The term Monte Carlo method now stands for any sort of stochastic modeling.

OLAP: Online Analytical Processing.

Representative Sample: A sample whose distribution is the same as that of the underlying population.

Sample: A set of elements drawn from a population.

Selectivity: The ratio of the number of output tuples of a query to the total number of tuples in the relation.

Uniform Sampling: All objects or clusters of objects are drawn with equal probability.
Figure 1. From facts to data cubes and drill-down on the time dimension (phone calls in 2003, averaged on the duration attribute; dimensions: Call plan CP1-CP3, Region of Italy: Northern, Central, Southern, and Time: 2001, 2002, 2003, with 2003 drilled down into its four quarters)
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Efficient Computation of Data Cubes and Aggregate Views
all possible projections in lower dimensional spaces, constitutes the so-called data cube. In most cases, dimensions are structured in hierarchies, representing several granularity levels of the corresponding measures (Jagadish, Lakshmanan, & Srivastava, 1999). Hence a time dimension can be organized into days, months, quarters and years; a territorial dimension into towns, regions and countries; a product dimension into brands, families and types. When querying multidimensional data, the user specifies the measures of interest and the level of detail required by indicating the desired hierarchy level for each dimension. In a multidimensional environment querying is often an exploratory process, where the user moves along the dimension hierarchies by increasing or reducing the granularity of displayed data. The drill-down operation corresponds to an increase in detail, for example, by requesting the number of calls by region and quarter, starting from data on the number of calls by region or by region and year. Conversely, roll-up allows the user to view data at a coarser level of granularity.

Multidimensional querying systems are commonly known as OLAP (Online Analytical Processing) systems, in contrast to conventional OLTP (Online Transactional Processing) systems. The two types have several contrasting features, although they share the same requirement of fast online response times:

Number of records involved: One of the key differences between OLTP and multidimensional queries is the number of records required to calculate the answer. OLTP queries typically involve a rather limited number of records, accessed through primary key or other specific indexes, which need to be processed for short, isolated transactions or to be issued on a user interface. In contrast, multidimensional queries usually require the classification and aggregation of a huge amount of data.

Indexing techniques: Transaction processing is mainly based on the access of a few records through primary key or other indexes on highly selective attribute combinations. Efficient access is easily achieved by well-known and established indexes, particularly B+-tree indexes. In contrast, multidimensional queries require a more articulated approach, as different techniques are required and each index performs well only for some categories of queries/aggregation functions (Jürgens & Lenz, 1999).

Current state vs. historical DBs: OLTP operations require up-to-date data. Simultaneous information access/update is a critical issue and the database usually represents only the current state of the system. In OLAP systems, the data does not need to be the most recent available and should in fact be time-stamped, thus enabling the user to perform historical analyses with trend forecasts. However, the presence of this temporal dimension may cause problems in query formulation and processing, as schemes may evolve over time and conventional query languages are not adequate to cope with them (Vaisman & Mendelzon, 2001).

Target users: Typical OLTP system users are clerks, and the types of query are rather limited and predictable. In contrast, multidimensional databases are usually the core of decision support systems, targeted at management level. Query types are only partly predictable and often require a highly expressive (and complex) query language. However, the user usually has little experience even in easy query languages like basic SQL: the typical interaction paradigm is a spreadsheet-like environment based on iconic interfaces and the graphical metaphor of the multidimensional cube (Cabibbo & Torlone, 1998).

MAIN THRUST

In this section we briefly analyze the techniques proposed to compute data cubes and, more generally, materialized views containing aggregates. The focus here is on the exact calculation of the views from scratch. In particular, we do not consider (a) the problems of aggregate view maintenance in the presence of insertions, deletions and updates (Kotidis & Roussopoulos, 1999; Riedewald, Agrawal, & El Abbadi, 2003) and (b) the approximated calculation of data cube views (Wu, Agrawal, & El Abbadi, 2000; Chaudhuri, Das, & Narasayya, 2001), as they are beyond the scope of this paper.

(Efficiently) Computing Data Cubes and Materialized Views

A typical multidimensional query consists of an aggregate group by query applied to the join of the fact table with two or more dimension tables. In consequence, it has the form of an aggregate conjunctive query, for example:

SELECT D1.dim1, D2.dim2, AGG(F.measure)
FROM fact_table F, dim_table1 D1, dim_table2 D2
WHERE F.dimKey1 = D1.dimKey1
AND F.dimKey2 = D2.dimKey2
GROUP BY D1.dim1, D2.dim2    (Q1)

where AGG is an aggregation function, such as SUM, MIN, AVG, etc.

For example, in the above-mentioned phone call data warehouse the fact table Phone_calls may have the (simplified) schema (call_id, territ_id, call_plan_id, duration), and the dimension tables Terr and Call_pl the simplified schemas (territ_id, town, region) and (call_plan_id, call_plan_name), respectively. The aggregation illustrated in Figure 1 would be performed by the following query:

SELECT D1.region, D2.call_plan_name, AVG(F.duration)
FROM Phone_calls F, Terr D1, Call_pl D2
WHERE F.territ_id = D1.territ_id
AND F.call_plan_id = D2.call_plan_id
GROUP BY D1.region, D2.call_plan_name    (Q2)

Traditional query processing systems first perform all joins expressed in the FROM and WHERE clauses, and only afterwards perform the grouping on the result of the join and the aggregation on each group. The algorithms producing the groups can be broadly classified as techniques based on (a) sorting on the GROUP BY attributes and (b) hashing tables [see Graefe (1993)]. However, there are many common cases where an early evaluation of the GROUP BY is possible. This can significantly reduce calculation time, as it (i) reduces the input size of the join (usually very large in the context of multidimensional databases) and (ii) enables the query processing engine to use indexes to perform the (early) GROUP BY on the base tables.

In Chaudhuri & Shim (1995, 1996), some techniques and applicability conditions are proposed for transforming execution plans into equivalent (more efficient) ones. As is typical in query optimization, the technique is based on pull-up transformations, which delay the execution of a costly operation (e.g., a group by on a large dataset) by moving it towards the root of the query tree, and on push-down transformations, used for example to anticipate an aggregation, thus decreasing the size of a join.

Several transformations for multidimensional queries are also proposed in Gupta, Harinarayan, & Quass (1995), based on the concept of generalized projection (GP). Transformations enable the optimizer to (i) push a GP down the query tree; (ii) pull a GP up the query tree; (iii) coalesce two GPs into one, or conversely split one GP into two. Query tree transformations are also used in the rewriting process.

In Agarwal et al. (1996) some algorithms are proposed and compared to extend the traditional techniques for GROUP BY query evaluation to the processing of the CUBE operator. These are applicable in the case of distributive aggregate functions and are based on the property that, in this case, higher-level aggregates can be calculated from lower levels.

In MOLAP systems, however, the above methods for CUBE calculation are inadequate, as they are substantially based on sorting and hashing techniques, which cannot be applied to multidimensional arrays. In Zhao, Deshpande, & Naughton (1997) an algorithm is proposed to calculate CUBEs in a MOLAP environment. It is shown that this can be made significantly more efficient (and even more so than ROLAP-based calculations) by exploiting the inherently compressed data representation of MOLAP systems. The algorithm particularly benefits from the compactness of multidimensional arrays, enabling the query processor to transfer larger chunks of data to the main memory and efficiently process them.

In many practical cases GROUP BYs at the finest granularity level correspond to sparse data cubes, that is, cubes where a high percentage of points correspond to null values. Consider for instance the CUBE corresponding to the query Number of calls by customer, day and antenna: it is evident that a considerable number of combinations correspond to zero. Fast techniques to compute sparse data cubes are proposed in Ross & Srivastava (1997). They are based on (i) decomposing the fact table into fragments that can be stored in main memory; (ii) computing the data cube in main memory for each fragment; and finally (iii) combining the partial results obtained in (i) and (ii).

Using Indexes to Improve Efficiency

The most common indexes used in traditional DBMSs are probably B+-trees, a particular form of search tree labeled with index keyvalues and having a list of record identifiers on the leaf level; that is, a list of elements specifying the actual position of each record on the disk. Index keyvalues may consist of one or more columns of the indexed table. OLTP queries usually retrieve a very limited number of tuples (or even a single tuple accessed through the primary key index), and in these cases B+-trees have been demonstrated to be particularly efficient.

In contrast, OLAP queries typically involve aggregation of large groups of tuples, requiring specifically designed indexing structures. In contrast to the OLTP context, there is no universally good index for multidimensional queries, but rather a variety of techniques, each of which may perform well for specific data types and query forms but be inappropriate for others.

Let us again consider the typical multidimensional query expressed by the SQL query (Q1). The core operations related to its evaluation are: (1) the joins of the fact table with two or more dimension tables; (2) tuple grouping by various dimensional values; and (3) application of an aggregation function to each tuple group. An interesting index type which can be used to efficiently perform operation (1) is the join index; while conventional indexes map column values to records in one table, join indexes map them to records in two (or more) joined tables, thus constituting a particular form of materialized view. Join indexes in their original version cannot be used directly for efficient evaluation of OLAP queries, but can be very effective in combination with other indexing techniques, such as bitmaps and partitioning.
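The property exploited by the CUBE algorithms above, namely that for a distributive function such as SUM a coarser cuboid can be derived from a finer one rather than from the raw fact table, can be sketched as follows. The mini fact table and the helper names are invented for illustration:

```python
# Hypothetical mini fact table (values invented): rows are
# (region, call_plan, duration), mirroring the simplified Phone_calls schema.
facts = [
    ("North", "CP1", 10), ("North", "CP2", 20),
    ("South", "CP1", 30), ("South", "CP1", 40),
]

def group_by(rows, keep):
    """SUM aggregation grouped on the dimensions named in `keep`;
    a rolled-up dimension is marked with '*' (the ALL value)."""
    out = {}
    for region, plan, duration in rows:
        key = (region if "region" in keep else "*",
               plan if "call_plan" in keep else "*")
        out[key] = out.get(key, 0) + duration
    return out

# Finest-granularity cuboid, computed once from the fact table.
base = group_by(facts, {"region", "call_plan"})

def roll_up(cuboid, drop_pos):
    """A coarser cuboid of a distributive function (SUM) is computed
    from the finer cuboid instead of rescanning the raw facts."""
    out = {}
    for key, value in cuboid.items():
        coarse = tuple("*" if i == drop_pos else k for i, k in enumerate(key))
        out[coarse] = out.get(coarse, 0) + value
    return out

by_region = roll_up(base, 1)      # the GROUP BY region cuboid
print(by_region[("South", "*")])  # 70
```

This is exactly why distributive functions (SUM, COUNT, MIN, MAX) allow the whole data cube to be computed bottom-up from the finest GROUP BY, while holistic functions (e.g., MEDIAN) do not.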
Bitmap indexes (Chan & Ioannidis, 1998) are useful for tuple grouping and for performing some forms of aggregation. In practice, these indexes use a bitmap representation for the list of record identifiers at the tree's leaf level: if table t contains n records, then each leaf of the bitmap index (corresponding to a specific value c of the indexed column C) contains a sequence of n bits, where the i-th bit is set to 1 if t_i.C = c, and to zero otherwise. Bitmap representations are indicated when the number of distinct keyvalues is low and several predicates of the form (Column = value) are to be combined in AND/OR, as the operation can be efficiently performed by AND-/OR-ing the corresponding bitmap representations bit by bit. This operation can be performed in parallel and very efficiently on modern processors. Finally, bitmap indexes can be used for fast count query evaluation, as counting can be performed directly on the bitmap representation without even accessing the selected records.

In projection indexes (O'Neil & Quass, 1997) the tree access structure is coupled with a sort of materialized view, representing the projection of the table on the indexed column. The technique has some analogies with vertically partitioned tables and is indicated when the aggregate operations need to be performed on one or more indexed columns.

Bit-sliced indexes (O'Neil & Quass, 1997) can be considered as a combination of the two previous techniques. Values of the projected column are encoded, and a bitmap is associated with each resulting bit component. The technique has some analogies with bit transposed files (Wong, Li, Olken, Rotem, & Wong, 1986), which were proposed for query evaluation in very large scientific and statistical databases. These indexes work best for SUM and AVG aggregations, but are not well suited for aggregations involving more than one column.

In Chan & Ioannidis (1998, 1999) some variations on the general idea of bitmap indexes are presented. Encoding schemes, time-optimal and space-optimal indexes, and trade-off solutions are studied. A comparison of the use of STR-tree based indexes (a particular form of spatial index) and some variations of bitmap indexes in the context of OLAP range queries can be found in Jürgens & Lenz (1999).

FUTURE TRENDS

The efficiency of many evaluation techniques is strictly related to the adopted query language and storage technique (e.g., MOLAP and ROLAP). This stresses the importance of a standardization process: there is indeed a general consensus on data warehouse key concepts, but a common unified framework for multidimensional data querying is still lacking, particularly a standardized query language independent from the specific storage technique.

Although research results on the exact computation of data cubes in the last few years have been mainly incremental, several interesting techniques have been proposed for the approximated computation, particularly some based on wavelets, which certainly require further investigation.

In contrast to the OLTP context, we have shown that there is no universally good index for multidimensional queries, but rather a variety of techniques, each of which may perform well for specific data types and query forms but be inappropriate for others. Hence, an index selection algorithm for data warehouses should determine not only the several sets of attributes to be indexed, but also the index type(s) to be used. The definition of an indexing structure enabling a good trade-off in the several cases of interest is also an interesting issue for future research.

CONCLUSION

In this paper we have discussed the main issues related to the computation of data cubes and aggregate materialized views in a data warehouse environment. First of all, the main features of OLAP queries with respect to conventional OLTP queries have been summarized, particularly the number of records involved, the temporal aspects, and the specific indexes. Various techniques for the exact computation of materialized views and data cubes from scratch in both ROLAP and MOLAP environments have been discussed. Finally, the main indexing techniques for OLAP queries and their applicability have been illustrated.

REFERENCES

Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J.F., Ramakrishnan, R., & Sarawagi, S. (1996). On the computation of multidimensional aggregates. In International Conference on Very Large Data Bases (VLDB'96) (pp. 506-521).

Cabibbo, L., & Torlone, R. (1998). From a procedural to a visual query language for OLAP. In International Conference on Scientific and Statistical Database Management (SSDBM'98) (pp. 74-83).

Chan, C.Y., & Ioannidis, Y.E. (1998). Bitmap index design and evaluation. In ACM International Conference on Management of Data (SIGMOD'98) (pp. 355-366).

Chan, C.Y., & Ioannidis, Y.E. (1999). An efficient bitmap encoding scheme for selection queries. In ACM International Conference on Management of Data (SIGMOD'99) (pp. 215-226).

Chaudhuri, S., Das, G., & Narasayya, V. (2001). A robust, optimization-based approach for approximate answering of aggregate queries. In ACM International Conference on Management of Data (SIGMOD'01) (pp. 295-306).

Chaudhuri, S., & Shim, K. (1995). An overview of cost-based optimization of queries with aggregates. Data Engineering Bulletin, 18(3), 3-9.

Chaudhuri, S., & Shim, K. (1996). Optimizing queries with aggregate views. In International Conference on Extending Database Technology (EDBT'96) (pp. 167-182).

Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), 73-170.

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In International Conference on Data Engineering (ICDE'96) (pp. 152-159).

Gupta, A., Harinarayan, V., & Quass, D. (1995). Aggregate-query processing in data warehousing environments. In International Conference on Very Large Data Bases (VLDB'95) (pp. 358-369).

Jagadish, H.V., Lakshmanan, L.V.S., & Srivastava, D. (1999). What can hierarchies do for data warehouses? In International Conference on Very Large Data Bases (VLDB'99) (pp. 530-541).

Jürgens, M., & Lenz, H.J. (1999). Tree based indexes vs. bitmap indexes: A performance study. In International Workshop on Design and Management of Data Warehouses (DMDW'99). Retrieved from http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/paper1.pdf

Kotidis, Y., & Roussopoulos, N. (1999). DynaMat: A dynamic view management system for data warehouses. In ACM International Conference on Management of Data (SIGMOD'99) (pp. 371-382).

Lenzerini, M. (2002). Data integration: A theoretical perspective. In ACM Symposium on Principles of Database Systems (PODS'02) (pp. 233-246).

O'Neil, P.E., & Quass, D. (1997). Improved query performance with variant indexes. In ACM International Conference on Management of Data (SIGMOD'97) (pp. 38-49).

Riedewald, M., Agrawal, D., & El Abbadi, A. (2003). Dynamic multidimensional data cubes. In M. Rafanelli (Ed.), Multidimensional databases: Problems and solutions (pp. 200-221). Hershey, PA: Idea Group Publishing.

Ross, K.A., & Srivastava, D. (1997). Fast computation of sparse datacubes. In International Conference on Very Large Data Bases (VLDB'97) (pp. 116-125).

Vaisman, A.A., & Mendelzon, A.O. (2001). A temporal query language for OLAP: Implementation and a case study. In 8th International Workshop on Database Programming Languages (DBPL 2001) (pp. 78-96).

Wong, H.K.T., Li, J., Olken, F., Rotem, D., & Wong, L. (1986). Bit transposition for very large scientific and statistical databases. Algorithmica, 1(3), 289-309.

Wu, Y., Agrawal, D., & El Abbadi, A. (2000). Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes. In Conference on Information and Knowledge Management (CIKM'00) (pp. 414-421).

Zhao, Y., Deshpande, P., & Naughton, J.F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. In ACM International Conference on Management of Data (SIGMOD'97) (pp. 159-170).

KEY TERMS

Aggregate Materialized View: A materialized view (see below) in which the results of a query containing aggregations (like count, sum, average, etc.) are stored.

B+-Tree: A particular form of search tree in which the keys used to access data are stored in the leaves. Particularly efficient for key-access to data stored in slow memory devices (e.g., disks).

Data Cube: A collection of aggregate values classified according to several properties of interest (dimensions). Combinations of dimension values are used to identify the single aggregate values in the data cube.

Dimension: A property of the data used to classify it and navigate the corresponding data cube. In multidimensional databases dimensions are often organized into several hierarchical levels; for example, a time dimension may be organized into days, months and years.

Drill-Down (Roll-Up): Typical OLAP operation, by which aggregate data are visualized at a finer (coarser) level of detail along one or more analysis dimensions.

Fact (Multidimensional Datum): A single elementary datum in an OLAP system, the properties of which correspond to dimensions and measures.

Fact Table: A table of (integrated) elementary data grouped and aggregated in the multidimensional querying process.
Materialized View: A particular form of query whose answer is stored in the database to accelerate the evaluation of further queries.

Measure: A numeric value obtained by applying an aggregate function (such as count, sum, min, max or average) to groups of data in a fact table.

Multidimensional Query: A query on a collection of multidimensional data, which produces a collection of measures classified according to some specified dimensions.

OLAP System: A particular form of information system specifically designed for processing, managing and reporting multidimensional data.
Embedding Bayesian Networks in Sensor Grids
vast. In such scenarios, multiple sensors that are mounted at different locations can maximize the regions of scanning. Multisensor data fusion provides increased temporal coverage, as some sensors can provide information when others cannot.

Increased confidence: A single target location can be confirmed by more than one sensor, which increases the confidence in target detection.

Reduced ambiguity: Joint information from multiple sensors can reduce the set of beliefs about the data.

Decreased costs: Multiple, inexpensive sensors can replace expensive single-sensor architectures at a significant reduction of cost.

Improved detection: Integrating measurements from multiple sensors can reduce the signal-to-noise ratio, which ensures improved detection.

Bayesian-based or entropy-based algorithms can be used to construct efficient data structures, known as Bayesian networks, to represent the relations and the uncertainties in the domain (Pearl, 1988). After the Bayesian networks are created, they can act as hyperdimensional knowledge representations that can be used for probabilistic inference. In situations when the data is not as rich, the knowledge representations can still be created from statements of causality and independence formulated by expert opinions. Under this framework, activities such as vehicle control, maneuvering, and scheduling could be planned, and the effectiveness of those plans could be evaluated online as the actions of the plans are executed.

To illustrate these ideas, consider a domain composed of threats, assets, and grids of sensors. Although unmanned vehicles loaded with sensors might be able to detect potential targets and provide data to guide the distribution of assets, integrating and transforming those data into meaningful information that is amenable to intelligent decisions is very demanding. The data must be filtered, the relationships between seemingly unrelated data sets must be determined, and knowledge representations must be created to support wise and timely decisions, because conditions of uncertain and incomplete information are the norm, not the exception. Therefore, a solution is to endow sensors with embedded local and global intelligence, as shown in Figure 1. In the figure, friendly airplanes and tanks, in red, use Bayesian networks (BNs) to make decisions. The BNs are illustrated as red boxes containing graphs. Each node in the graphs corresponds to a variable in the domain. The data for each variable may come from sensors spread in the battlefield. The red nodes are variables related to the friendly resources, and the blue variables to the enemy resources. The dotted red arrows connecting the BNs represent wireless communication between the BNs.

Figure 1. A domain scenario

The goal is to recognize situations locally and globally, identify the available options, and make global and local decisions quickly in order to reduce or eliminate the threats and optimize the use of assets. The task is difficult due to the dynamics and uncertainties in the domain. Threats may change in many ways, targets may move, enemy forces may identify the sensing capabilities and eliminate them, and so forth. In most cases, sensing information will contain noise and most likely will be inaccurate, unreliable, and uncertain. These constraints suggest a distributed, bottom-up approach to match the natural dynamics and uncertainties of the problem. Thus, at the core of this problem, a theoretical framework that effectively balances local and global conditions is needed. Distributed Bayesian networks offer that balance (Xiang, 2002; Valtorta, Kim, & Vomlel, 2002).

MAIN THRUST

Embedding Bayesian Networks in Sensor Grids

Recent advances in the theory of Bayesian network inference (Darwiche, 2003; Castillo, Gutierrez, & Hadi, 1996; Utete, 1998) have resulted in algorithms that can perform probabilistic inference on very small-scale computing devices comparable to commercially available PDAs. The algorithms can encode, in real time, families of polynomial equations representing queries of the type p(e|h) involving sets of variables local to the device and its neighbors.

Using the knowledge representations locally encoded into these devices, larger, distributed systems can be interconnected. The devices can assess their local conditions given local observations and engage with other devices in the system to gain a better understanding of the global situation, to obtain more assets, or to convey information needed by other devices engaged in larger-scale tactical decisions.
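As a minimal illustration of the kind of local inference such a device performs, the following sketch answers a posterior query over a tiny two-sensor network by enumerating its joint distribution. All variable names and probabilities here are hypothetical, not taken from the article:

```python
# Hypothetical CPTs for binary variables: threat -> sensor_a, sensor_b.
p_threat = {1: 0.2, 0: 0.8}
p_a = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.15, (0, 0): 0.85}  # p(a | threat)
p_b = {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.05, (0, 0): 0.95}  # p(b | threat)

def joint(t, a, b):
    """p(threat, a, b) = p(threat) * p(a | threat) * p(b | threat)."""
    return p_threat[t] * p_a[(a, t)] * p_b[(b, t)]

def posterior_threat(a, b):
    """p(threat = 1 | a, b), obtained by enumerating the joint."""
    num = joint(1, a, b)
    den = sum(joint(t, a, b) for t in (0, 1))
    return num / den

# Two sensors reporting a detection sharply raise the belief in a threat.
print(round(posterior_threat(1, 1), 3))  # 0.955
```

The same enumeration, distributed over shared variables, is what allows each device to update its local beliefs and propagate them to its neighbors.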
information needed by other devices engaged in larger-scale tactical decisions. The devices can maintain models of the local parameters surrounding them, making them situationally aware and enabling them to participate as cells of a larger decision-making process. Sensory information might be redirected towards a device with no access to that sensor's information. For example, an airplane might have full access to information about the topology of a certain area, but another vehicle may have access to this information only if it can communicate with the airplane. Global information known by groups of devices can be sent to the relevant devices such that, whenever a device gets new input data, the findings are used to determine locally the most probable states for each of the variables within the devices and to propagate those determinations to the neighboring devices.

The situation is illustrated in Figure 2, in which the variables x3, x4, and x6 are shared by more than one device. The direction of the dotted red arrows indicates the direction of the flow of evidence.

Symbolic Propagation of Evidence

Methods for exact and approximate propagation of evidence in Bayesian network models are discussed in other sections of this encyclopedia. A common requirement for both types of methods is that all the parameters of the joint probability distribution (JPD) must be known prior to propagation. However, complete specifications might not always be available. This may happen when the number of observations in some combinations of variables is not sufficient to support exact quantifications of conditional probabilities, or in cases when domain experts may only know ranges but may not know the exact values of the parameters. In such cases, symbolic propagation can be used to obtain the values of the unknown parameters, propagate, and compute the probabilities of all the relevant variables in a query (Zhaoyu & D'Ambrosio, 1994; Castillo, Gutierrez, & Hadi, 1995; Castillo et al., 1996). Symbolic propagation of evidence consists of asserting polynomials involving the known and missing parameters and the available evidence. After those polynomials are found, customized code generators can produce code to solve the expressions in real time, obtain the exact values of the parameters, and complete the propagation of evidence.

In general, a node $X_i$ having a conditional probability $p(x_i \mid \pi_i)$ can be expressed as a parametric family of the form

$\theta_{ij\pi} = p(X_i = j \mid \Pi_i = \pi), \quad j \in \{0, \ldots, r_i\}$  (1)

where $i$ refers to the node number, $j$ refers to the state of the node, and $\pi$ is any instantiation of the parents $\Pi_i$ of $X_i$. Assume a JPD given on the binary variables $\{x_1, x_2, x_3, x_4\}$, as

$p(x_1, x_2, x_3, x_4) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2, x_3)$  (2)

The complete set of parameters for the binary variables for a Bayesian network compatible with the JPD is given in Table 1.

[Figure 2. Devices with local Bayesian networks over the nodes X1-X6; the variables x3, x4, and x6 are shared by more than one device.]

Table 1. Parameters for the binary variables:

$X_1$: $\theta_{10} = p(X_1 = 0)$, $\theta_{11} = 1 - \theta_{10}$
$X_2$: $\theta_{200} = p(X_2 = 0 \mid X_1 = 0)$, $\theta_{210} = 1 - \theta_{200}$; $\theta_{201} = p(X_2 = 0 \mid X_1 = 1)$, $\theta_{211} = 1 - \theta_{201}$
$X_3$: $\theta_{300} = p(X_3 = 0 \mid X_1 = 0)$, $\theta_{310} = 1 - \theta_{300}$; $\theta_{301} = p(X_3 = 0 \mid X_1 = 1)$, $\theta_{311} = 1 - \theta_{301}$
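Once numeric values for the parameters in Table 1 are available (whether specified directly or recovered by symbolic propagation), any query on the network of Equation (2) can be answered by enumeration. The sketch below uses illustrative parameter values chosen for the example, not values from the article:

```python
from itertools import product

# Hypothetical CPTs for the binary network of Equation (2):
# p(x1, x2, x3, x4) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2, x3).
p_x1 = {0: 0.3, 1: 0.7}
p_x2_given_x1 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x1 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}
p_x4_given_x2x3 = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.4, 1: 0.6},
                   (1, 0): {0: 0.3, 1: 0.7}, (1, 1): {0: 0.1, 1: 0.9}}

def joint(x1, x2, x3, x4):
    """Evaluate the JPD via the factorization of Equation (2)."""
    return (p_x1[x1] * p_x2_given_x1[x1][x2]
            * p_x3_given_x1[x1][x3] * p_x4_given_x2x3[(x2, x3)][x4])

def marginal(var_index, value):
    """p(X_var = value) by brute-force enumeration over the four binaries."""
    return sum(joint(*assign) for assign in product((0, 1), repeat=4)
               if assign[var_index] == value)

total = sum(joint(*a) for a in product((0, 1), repeat=4))
print(round(total, 10))           # 1.0: the parameters form a valid JPD
print(round(marginal(0, 1), 10))  # recovers p(x1 = 1) = 0.7
```

Enumeration is exponential in the number of variables, which is exactly why the junction-tree and symbolic methods discussed in this article matter for larger networks; the sketch is only meant to make the factorization tangible.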
now, the main barriers to the use of Bayesian networks in sensor grids have been the computational complexity of the inference process, the lack of efficient methods for learning the graph of a network, and adapting to the dynamics of the domain. The latter two, however, are problems that permeate the field of Bayesian networks and constitute fertile ground for future research.

CONCLUSION

This paper outlines recent work aimed at endowing sensor grids with local and global intelligence. I describe the advantages of knowledge representations known as Bayesian networks and argue that Bayesian networks offer an excellent theoretical foundation to accomplish this goal. The power of Bayesian networks lends them to decomposing a problem into a set of smaller, distributed problem solvers. Each variable in a network could be associated with a different sensing agent represented at the local level as a Bayesian network. The sensing agents could decide if and when to send observations to the other agents. As information flows between the agents, they in turn could decide to send their own observations or their local inferences, or to request more information from local or remote agents, given the available evidence. Another advantage of this approach is that it is scalable on demand; more agents can be added as the problem gets bigger. Recovery from loss of parts of the system is possible by introducing a set of redundant observations, each of which can be part of a local solver agent.

REFERENCES

Castillo, E., Gutierrez, J. M., & Hadi, A. S. (1995). Symbolic propagation in discrete and continuous Bayesian networks. In V. Keranen & P. Mitic (Eds.), Proceedings of the First International Mathematica Symposium: Vol. Mathematics with vision (pp. 77-84). Southampton, UK: Computational Mechanics Publications.

Castillo, E., Gutierrez, J. M., & Hadi, A. S. (1996). A new method for efficient symbolic propagation in discrete Bayesian networks. Networks, 28, 31-43.

Darwiche, A. (2003). Revisiting the problem of belief revision with uncertain evidence. Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico.

Goode, B. (2003, September). Having wonder full time. Sensors Magazine.

Goode, B. (2004, August). Keys to keynotes. Sensors Magazine.

Jensen, F. V. (2001). Bayesian networks and decision graphs. New York: Springer.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50, 157-224.

Microsoft launches smart personal object technology initiative. (2002, November 17). Retrieved from http://www.microsoft.com/presspass/features/2002/nov02/11-17SPOT.asp

Olesen, K. G., Lauritzen, S. L., & Jensen, F. V. (1992). Hugin: A system creating adaptive causal probabilistic networks. Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence (pp. 223-229), USA.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.

Pister, K. S. J., Kahn, J. M., & Boser, B. E. (1999). Smart dust: Wireless networks of millimeter-scale sensor nodes. Highlight Article in Electronics Research Laboratory Research Summary. Retrieved from http://www.xbow.com/Products/Wireless_Sensor_Networks.htm

Sensors Magazine. (Ed.). (2004, August). Best of sensors expo awards [Special issue]. Sensors Magazine.

Stone, L. D., Barlow, C. A., & Corwin, T. L. (1999). Bayesian multiple target tracking. Boston: Artech House.

Utete, S. W. (1998). Local information processing for decision making in decentralised sensing networks. Proceedings of the 11th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, IEA/AIE (pp. 667-676), Castellon, Spain.

Valtorta, M., Kim, Y. G., & Vomlel, J. (2002). Soft evidential update for probabilistic multiagent systems. International Journal of Approximate Reasoning, 29(1), 71-106.

Vargas, J. E., Tvarlapati, K., & Wu, Z. (2003). Target tracking with Bayesian estimation. In V. Lesser, C. Ortiz, & M. Tambe (Eds.), Distributed sensor networks. Kluwer Academic Press.

Vargas, J. E., & Wu, Z. (2003). Real-time multiple-target tracking using networked wireless sensors. Proceedings of the Second Conference on Autonomous Intelligent Networks and Systems, Palo Alto, CA.
Xiang, Y. (2002). Probabilistic reasoning in multiagent systems: A graphical models approach. Cambridge, UK: Cambridge University Press.

Zhaoyu, L., & D'Ambrosio, B. (1994). Efficient inference in Bayes networks as a combinatorial optimization problem. International Journal of Approximate Reasoning, 11, 55-81.

KEY TERMS

Bayesian Network: A directed acyclic graph (DAG) that encodes the probabilistic dependencies between the variables within a domain and is consistent with a joint probability distribution (JPD) for that domain. For example, a domain with variables {A,B,C,D}, in which the variables B and C depend on A and the variable D depends on C and B, would have the following JPD: P(A,B,C,D) = p(A)p(B|A)p(C|A)p(D|B,C) and the graph with arcs A→B, A→C, B→D, and C→D.

Junction Tree: … efficient data structure. A junction tree contains cliques, each of which is a set of variables from the domain. The junction tree is configured to maintain the probabilistic dependencies of the domain variables and provides a data structure over queries of the type "What is the most probable value of variable D given that the values of variables A, B, etc., are known?"

Knowledge Discovery: The process by which new pieces of information can be revealed from a set of data. For example, given a set of data for variables {A,B,C,D}, a knowledge discovery process could discover unknown probabilistic dependencies among variables in the domain, using measures such as Kullback's mutual information, which, for the discrete case, is given by the formula

$I(A : B) = \sum_{i} \sum_{j} p(a_i, b_j) \log \frac{p(a_i, b_j)}{p(a_i)\, p(b_j)}$

Smart Sensors: Transducers that convert some physical parameters into an electrical signal and are equipped with some level of computing power for signal processing.
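The mutual information formula in the Knowledge Discovery entry can be evaluated directly on a discrete joint distribution. The sketch below uses two invented joint distributions over binary variables to show the two extremes:

```python
from math import log2

def mutual_information(p_ab):
    """Kullback's mutual information I(A:B) for a discrete joint
    distribution given as a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in p_ab.items():      # accumulate the marginals
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in p_ab.items() if p > 0)

# Invented example distributions over two binary variables.
p_independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
p_dependent = {(0, 0): 0.5, (1, 1): 0.5}   # B is an exact copy of A

print(mutual_information(p_independent))   # 0.0 bits: A tells nothing about B
print(mutual_information(p_dependent))     # 1.0 bit: A determines B
```

A knowledge discovery process of the kind described above would compute such scores between pairs of variables and flag the high-scoring pairs as candidate dependencies.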
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Employing Neural Networks in Data Mining
implementation of the data-mining process. Another important data-mining issue is concerned with the relationship between theoretical data-mining research and data-mining applications. Data mining is an exponentially growing field with a strong emphasis on applications.

A further issue of great importance is the research in data-mining algorithms and the discussion of issues of scale (Hearst, 1997). The commonly used tools may not scale up to huge volumes of data. Scalable data-mining tools are characterized by the linear increase of their runtime with the increase of the number of data points within a fixed amount of available memory. An overview of scalable data-mining tools is given in Ganti, Gehrke, and Ramakrishnan (1999). In addition to scalability, robust techniques to model noisy data sets containing an unknown number of overlapping categories are of great importance (Krishnapuram et al., 2001).

MAIN THRUST

Exploiting Neural Networks in Data Mining

How is data mining able to tell you important things that you didn't know or tell you what is going to happen next? The technique that is used to perform these feats is called modeling. Modeling is simply the act of building a model based on data from situations where the answer is known, and then applying the model to other situations where the answers aren't known. Modeling techniques have been around for centuries, but it is only recently that the data storage and communication capabilities required to collect and store huge amounts of data, and the computational power to automate modeling techniques to work directly on the data, have become available. Modeling techniques used for data mining include decision trees, rule induction, genetic algorithms, nearest neighbor, artificial neural networks, and many other techniques (Chen, Han & Yu, 1996; Cios, Pedrycz & Swiniarski, 1998; Hand, Mannila & Smyth, 2000).

Exploiting artificial neural networks as a modeling technique for data mining is considered to be an important direction of research. Neural networks can be applied to a number of data-mining problems, including classification, regression, and clustering, and there are quite a few interesting developments and tools being developed in this field. Lu, Setiono, and Liu (1996) applied neural networks to mine symbolic classification rules from large databases. They report that neural networks were able to deliver a lower error rate and are more robust against noise than decision trees. Ainslie and Drèze (1996) show how effective data mining can be achieved by combining the power of neural networks with the rigor of more traditional statistical tools. They argue that this alliance can generate important synergies. Craven and Shavlik (1997) describe neural network learning algorithms for data mining that are able to produce comprehensible models and that do not require excessive training times. They argue that neural network methods deserve a place in the toolboxes of data-mining specialists. Mitra, Pal, and Mitra (2002) provide a survey of the available literature on data mining using soft computing methodologies, including neural networks. They came to the conclusion that neural networks are suitable in data-rich environments and are typically used for extracting embedded knowledge in the form of rules, quantitative evaluation of these rules, clustering, self-organization, classification, and regression. Vesely (2003) argues that, among the data-mining methods based on neural networks, Kohonen's self-organizing maps are the most promising, because, by using a self-organizing map, one can more easily visualize high-dimensional data. Self-organizing maps also outperform other conventional methods, such as the popular Principal Component Analysis (PCA) method, for screening analysis of high-dimensional data. Although highly successful in typical cases, PCA suffers from the drawback of being a linear method. Furthermore, real-world data manifolds, besides being nonlinear, often are corrupted by noise and embedded into high-dimensional spaces. Self-organizing maps are more robust against noise and are often used to provide representations that can be analyzed successfully using conventional methods like PCA.

In spite of their excellent performance in concept discovery, neural networks do suffer from some shortcomings. They are sensitive to the network topology, the initial weights, and the selection of attributes. If the number of layers is not selected suitably, the learning efficiency will be affected. Too many irrelevant nodes can cause unnecessary computational expense and overfitting (i.e., the network creates meaningless concepts); randomly selected initial weights sometimes can trap the nets in so-called pitfalls; that is, neural nets stabilize around local minima instead of the global minimum. Background knowledge remains unused in neural nets. The knowledge discovered by nets is not transparent to users. This is perhaps the main failing of neural networks, as they are unintelligible black boxes.

Our own work is focused on mining educational data to assist e-learning in a variety of ways. In the following section, we report on the experience with MASACAD (Multi-Agent System for ACademic ADvising), a data-mining, multi-agent system that advises students using neural networks.
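The back-propagation training discussed throughout this article can be illustrated in miniature. The network below is a toy 2-2-1 sigmoid network trained on the XOR problem, not the MASACAD network described in the case study; the final check only confirms that gradient descent reduces the squared error, since (as noted above) such nets may settle in a local minimum:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weight rows include a trailing bias weight (inputs are extended with 1.0).
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_output = [random.uniform(-1, 1) for _ in range(3)]

def forward(x):
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
         for row in w_hidden]
    o = sigmoid(sum(w * v for w, v in zip(w_output, h + [1.0])))
    return h, o

def train_step(x, target, lr=0.5):
    h, o = forward(x)
    delta_o = (o - target) * o * (1.0 - o)   # output delta for squared error
    # hidden deltas: error back-propagated through the output weights
    delta_h = [h[j] * (1.0 - h[j]) * delta_o * w_output[j] for j in range(2)]
    for i, v in enumerate(h + [1.0]):        # hidden -> output updates
        w_output[i] -= lr * delta_o * v
    for j in range(2):                       # input -> hidden updates
        for i, v in enumerate(x + [1.0]):
            w_hidden[j][i] -= lr * delta_h[j] * v

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

def total_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

error_before = total_error()
for _ in range(2000):
    for x, t in data:
        train_step(x, t)
error_after = total_error()
print(error_after < error_before)  # gradient descent reduces the error
```

The same delta-rule updates, applied layer by layer, scale to larger topologies such as the 85-100-100-85 network used in the case study below.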
Case Study: Academic Advising of Students

At the UAE University, there is enormous interest in the area of online education. Rigorous steps are being taken toward the creation of the technological infrastructure and the academic infrastructure for the improvement of teaching and learning. MASACAD, the academic advising system described in the following, is to be understood as a tool that mines educational data to support learning.

The general goal of academic advising is to assist students in developing educational plans that are consistent with academic, career, and life goals, and to provide students with the information and skills needed to pursue those goals. In order to improve the advising process and make it easier, an intelligent assistant in the form of a computer program would be of great interest. The goal of academic advising, as stated previously, is too general, because many experts are involved and a huge amount of expertise is needed. This makes the realization of such an assistant too difficult, if not impossible. Therefore, in the implemented system, the scope of academic advising was restricted. It was understood as just being intended to provide the student with an opportunity to plan programs of study, select appropriate required and elective classes, and schedule classes in a way that provides the greatest potential for academic success.

Data Mined by the Advising System MASACAD

To extract the academic advice (i.e., to provide the student with a set of appropriate courses he or she should register for in the coming term), MASACAD has to mine a huge amount of educational data available in different formats. The data contain the student profile, which includes the courses already attended, the corresponding grades, the desires of the student concerning the courses to be attended, and much other information. The part of the profile consisting of the courses already attended, the corresponding grades, and so forth, is maintained by the university administration in appropriate databases. The part of the profile consisting of the desires of the student concerning the courses to be attended should be asked for from the student before advising is performed. The data to be mined also include the courses that are offered in the semester for which advising is needed. This information is maintained by the university administration in appropriate Web sites. Finally, a very important component of the data that the system has to mine is expertise. For the problem of academic advising, expertise consists partly of the university laws concerning academic advising. These consist of all the details and regulations concerning courses, programs, and curricula. This kind of information is published in Web pages, in booklets, and in many other forms such as printouts and announcements. The sources of knowledge are many; however, the primary source will be a human expert, who should possess more complex knowledge than can be found in documented sources.

The Advising System MASACAD

MASACAD is a multi-agent system that offers academic advice to students by mining the educational data described in the previous section. It consists of a user system, a grading system, a course announcement system, and a mediation agent. The mediation agent provides the information-retrieving service. It moves from the site of one application to another, where it interacts with the agent wrappers. The agent wrappers manage the states of the applications they are wrapped around, invoking them when necessary. The application "grading system" is a database application for answering queries about the students and the courses they have already taken. The application "course announcement system" is a Web application for answering queries about the courses that are expected to be offered in the semester for which advising is needed. The application "user system" is the heart of the advising system, and it is here where the intelligence resides. The application gives students the opportunity to express their desires concerning the courses to be attended by choosing among the courses that are offered, initiating a query to obtain advice, and, finally, seeing the results returned by the advising system. The system also alerts the user automatically via e-mail when something changes in the offered courses or in the student profile. The advising procedure suggests courses according to university laws, in a way that provides the greatest potential for academic success, as seen by a human academic advisor. Taking into account the adequacy of the machine-learning approach for data mining, added to the availability of experience with advising students, made the adoption of a paradigm of supervised learning from examples using artificial neural networks interesting. For academic advising, the known information (input variables) consists of the profile of the student and of the offered courses. The unknown information (output variables) consists of the advice expected by the student. In order for the network to be able to infer the unknown information, prior training is needed. Training integrates the expertise in academic advising into the network. The back-propagation algorithm was used for training the neural network.

Information (i.e., training examples) is gained from information about students and the courses they really took in previous semesters. The selection of these courses
was made based on the advice of human experts specializing in academic advising. About 250 computer science students in different stages of study were available for the learning procedures. Each one of the 250 examples consisted of a pair of input-output vectors. The input vector summarized all the information needed for advising a particular student (85 real-valued components; each component encodes the information about one of the 85 courses of the curriculum). The output vector encodes the final decision concerning the courses in which the student actually enrolled, based on the advice of the human academic advisor (85 integer-valued components; each component represents a priority value for one of the 85 courses of the curriculum, and a higher priority value indicates a more appropriate course for the student). The aim of the learning phase was to determine the most suitable values for the learning rate, the size of the network (number of neurons; the number of hidden layers was set to 2), and the number of training cycles needed for the convergence of the network. Many experiments were conducted to obtain these parameters and to test the system. With a network topology of 85-100-100-85 and systematically selected network parameters (50 carefully chosen experiments were performed to obtain these parameters), the layered, fully connected back-propagation network was able to deliver considerable performance. Fifty students participated in the evaluation of the system. In 92% of the cases, the network was able to produce very appropriate advice according to human experts in academic advising. In the remaining 8% of the cases (4 cases), some unsatisfactory course suggestions were produced by the network (Hamdi, 2004).

FUTURE TRENDS

The rapid growth of business, industrial, and educational data sources has overwhelmed the traditional, interactive approaches to data analysis and created a need for a new generation of tools for intelligent and automated discovery in data. Neural networks are well suited for data-mining tasks due to their ability to model complex, multi-dimensional data. As data availability has magnified, so has the dimensionality of the problems to be solved, thus limiting many traditional techniques, such as manual examination of the data and some statistical methods. Although there are many techniques and algorithms that can be used for data mining, some of which can be used effectively in combination, neural networks offer many desirable qualities, such as the automatic search of all possible interrelationships among key factors, the automatic modeling of complex problems without prior knowledge of the level of complexity, and the ability to extract key findings much faster than many other tools. As computer systems become faster, the value of neural networks as a data-mining tool will only increase.

CONCLUSION

The amount of raw data stored in databases and on the Web is exploding. Raw data by itself, however, does not provide much information. One benefits when meaningful trends and patterns are extracted from the data. Data-mining techniques help to recognize significant facts, relationships, trends, patterns, exceptions, and anomalies that might otherwise go unnoticed. In this contribution, we have seen an example of how neural networks can be used to help mine data about students and courses with the aim of developing educational plans that are consistent with academic, career, and life goals, and providing students with the information and skills needed to pursue those goals. The neural network paradigm seems interesting and viable enough to be used as a data-mining tool.

REFERENCES

Ainslie, A., & Drèze, X. (1996). Data mining: Using neural networks as a benchmark for model building. Decision Marketing, 7, 77-86.

Chen, M. S., Han, J., & Yu, P. S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866-883.

Cios, K. J., Pedrycz, W., & Swiniarski, R. W. (1998). Data mining methods for knowledge discovery. Norwell, MA: Kluwer Academic Publishers.

Craven, M. W., & Shavlik, J. W. (1997). Using neural networks for data mining. Future Generation Computer Systems, 13(2-3), 211-229.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data mining. Menlo Park, CA: MIT Press.

Ganti, V., Gehrke, J., & Ramakrishnan, R. (1999). Mining very large databases. IEEE Computer, 32(8), 38-45.

Hamdi, M. S. (2004). MASACAD: A learning multi-agent system that mines the Web to advise students. Proceedings of the International Conference on Internet Computing (IC04), Las Vegas, Nevada.

Hand, D. J., Mannila, H., & Smyth, P. (2000). Principles of data mining. Cambridge, MA: MIT Press.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.

Hearst, M. (1997). Distinguishing between Web data mining and information access. Retrieved from http://www.sims.berkeley.edu/~hearst/talks/data-mining-panel/index.htm

Kobsa, A. (2002). Personalized hypermedia and international privacy. Communications of the ACM, 45(5), 64-67.

Krishnapuram, R., Joshi, A., Nasraoui, O., & Yi, L. (2001). Low-complexity fuzzy relational clustering algorithms for Web mining. IEEE Transactions on Fuzzy Systems, 9(4), 596-607.

Lu, H., Setiono, R., & Liu, H. (1996). Effective data mining using neural networks. IEEE Transactions on Knowledge and Data Engineering, 8(6), 957-961.

Mitra, S., Pal, S. K., & Mitra, P. (2002). Data mining in soft computing framework: A survey. IEEE Transactions on Neural Networks, 13(1), 3-14.

Musick, R., Fidelis, K., & Slezak, T. (1997). Large-scale data mining pilot project in human genome. Retrieved from http://home.comcast.net/~crmusick/papers/RDOFIS.html

Vesely, A. (2003). Neural networks in data mining. AGRIC.ECON.-CZECH, 49(9), 427-431.

KEY TERMS

Artificial Neural Networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Classification: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as close as possible to one another, and different groups are as far as possible from one another, where distance is measured with respect to the specific variable(s) one is trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "good" and "bad". Supervised classification is when we know the class labels and the number of classes.

Clustering: The process of dividing a dataset as in classification, but with the distance now measured with respect to all available variables. Unsupervised classification is when we do not know the class labels and may not know the number of classes.

Data Mining: The extraction of hidden predictive information from large databases.

Decision Tree: A tree-shaped structure that represents a set of decisions. These decisions generate rules for the classification of a dataset.

Linear Regression: A classic statistical problem is to try to determine the relationship between two random variables X and Y. For example, we might consider the height and weight of a sample of adults. Linear regression attempts to explain this relationship with a straight line fit to the data.

OLAP: Online analytical processing. Refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multidimensional databases.
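The Linear Regression entry can be made concrete with an ordinary least-squares fit. The height/weight values below are invented for the example and chosen to lie exactly on a line:

```python
# Ordinary least-squares fit of y = slope * x + intercept,
# illustrating the Linear Regression key term (data values are invented).
xs = [150.0, 160.0, 170.0, 180.0, 190.0]   # e.g., heights in cm
ys = [52.0, 58.0, 64.0, 70.0, 76.0]        # e.g., weights in kg

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form slope and intercept from the normal equations.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(slope, intercept)  # the invented data lie exactly on y = 0.6x - 38
```

With real, noisy samples the fitted line minimizes the sum of squared vertical distances to the points rather than passing through them exactly.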
Enhancing Web Search through Query Log Mining
individual user. Activities of the same user could be other on query session. Since queries with the same or
grouped by their IP addresses, agent types, site topolo- similar search intentions may be represented with differ- -
gies, cookies, user IDs, etc. The goal of session identifi- ent words and the average length of Web queries is very
cation is to divide the queries and page accesses of each short, content-based query clustering usually does not
user into individual sessions. Finding the beginning of a perform well.
query session is trivial: a query session begins when a Using query sessions mined from query logs to cluster
user submits a query to a search engine. However, it is queries is proved to be a more promising method (Wen,
difficult to determine when a search session ends. The Nie, & Zhang, 2002). Through query sessions, query
simplest method of achieving this is through a timeout, clustering is extended to query session clustering.
where if the time between page requests exceeds a certain The basic assumption here is that the activities following
limit, it is assumed that the user is starting a new session. a query are relevant to the query and represent, to some
extent, the semantic features of the query. The query text
Log-Based Query Clustering and the activities in a query session as a whole can
represent the search intention of the user more precisely.
Query clustering is a technique aiming at grouping users Moreover, the ambiguity of some query terms is elimi-
semantically (not syntactically) related queries in Web nated in query sessions. For instance, if a user visited a
query logs. Query clustering could be applied to FAQ few tourism Websites after submitting a query Java, it
detecting, index-term selection and query reformulation, is reasonable to deduce that the user was searching for
which are effective ways to improve Web search. First of information about Java Island, not Java programming
all, FAQ detecting means to detect Frequently Asked language or Java coffee. Moreover, query clustering
Questions (FAQs), which can be achieved by clustering and document clustering can be combined and reinforced
similar queries in the query logs. A cluster being made up with each other (Beeferman & Berger, 2000).
of many queries can be considered as a FAQ. Some search
engines (e.g. Askjeeves) prepare and check the correct Log-Based Query Expansion
answers for FAQs by human editors, and a significant
majority of users queries can be answered precisely in Query expansion involves supplementing the original
this way. Second, inconsistency between term usages in query with additional words and phrases, which is an
Enhancing Web Search through Query Log Mining

queries and those in documents is a well-known problem in information retrieval, and the traditional way of directly extracting index terms from documents will not be effective when the user submits queries containing terms different from those in the documents. Query clustering is a promising technique to provide a solution to the word-mismatching problem. If similar queries can be recognized and clustered together, the resulting query clusters will be very good sources for selecting additional index terms for documents. For example, if queries such as "atomic bomb," "Manhattan Project," "Hiroshima bomb," and "nuclear weapon" are put into a query cluster, this cluster, not the individual terms, can be used as a whole to index documents related to atomic bombs. In this way, any queries contained in the cluster can be linked to these documents. Third, most words in natural language have inherent ambiguity, which makes it quite difficult for users to formulate queries with appropriate words. Obviously, query clustering could be used to suggest a list of alternative terms for users to reformulate queries and thus better represent their information needs.

The key problem underlying query clustering is to determine an adequate similarity function so that truly similar queries can be grouped together. There are mainly two categories of methods to calculate the similarity between queries: one is based on query content, and the

effective way to overcome the term-mismatching problem and to improve search performance. Log-based query expansion is a new query expansion method based on query log mining. Taking query sessions in query logs as a bridge between user queries and Web pages, probabilistic correlations between terms in queries and those in pages can then be established. With these term-term correlations, relevant expansion terms can be selected from the documents for a query. For example, recent work by Cui, Wen, Nie, and Ma (2003) shows that, from query logs, some very good terms, such as "personal computer," "Apple Computer," "CEO," "Macintosh," and "graphical user interface," can be detected to be tightly correlated to the query "Steve Jobs," and using these terms to expand the original query can lead to more relevant pages.

Experiments by Cui, Wen, Nie, and Ma (2003) show that mining user logs is extremely useful for improving retrieval effectiveness, especially for very short queries on the Web. Log-based query expansion overcomes several difficulties of traditional query expansion methods because a large number of user judgments can be extracted from user logs, while eliminating the step of collecting feedback from users for ad-hoc queries. Log-based query expansion methods have three other important properties. First, the term correlations are pre-computed offline, and thus the performance is better than that of traditional local analysis methods, which need to calculate term correlations on the fly. Second, since user logs contain query sessions from different users, the term correlations can reflect the preferences of the majority of users. Third, the term correlations may evolve along with the accumulation of user logs. Hence, the query expansion process can reflect updated user interests at a specific time.

means that objects newly introduced into the system have not been rated by any users and can therefore not be recommended. Due to the absence of recommendations, users tend not to be interested in these new objects. This in turn has the consequence that the newly added objects remain in their state of not being recommendable. Combining collaborative and content-based filtering is therefore a promising approach to solving the above problems (Baudisch, 1999).
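The term-correlation idea behind log-based query expansion can be sketched in a few lines of Python. The toy log format, the conditional-probability estimate, and the scoring are simplified illustrations, not the estimation procedure of Cui et al. (2003):

```python
from collections import defaultdict

# Toy query log: each session pairs a query with terms from the
# documents the user clicked (format and contents are invented).
sessions = [
    ("steve jobs", ["apple", "macintosh", "ceo"]),
    ("steve jobs", ["apple", "personal", "computer"]),
    ("apple ceo", ["apple", "ceo", "macintosh"]),
]

def term_correlations(sessions):
    """Estimate P(doc_term | query_term) from co-occurrence counts."""
    pair = defaultdict(float)
    qcount = defaultdict(float)
    for query, doc_terms in sessions:
        for q in query.split():
            qcount[q] += 1
            for d in doc_terms:
                pair[(q, d)] += 1
    return {(q, d): c / qcount[q] for (q, d), c in pair.items()}

def expand(query, corr, k=3):
    """Rank candidate expansion terms by summed correlation."""
    score = defaultdict(float)
    qterms = query.split()
    for (qt, d), p in corr.items():
        if qt in qterms and d not in qterms:
            score[d] += p
    return [t for t, _ in sorted(score.items(), key=lambda x: -x[1])[:k]]

corr = term_correlations(sessions)
print(expand("steve jobs", corr))  # "apple" ranks first on this toy log
```

On this toy log, "apple" dominates because it co-occurs with both query terms in every session.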
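A combined query-similarity function of the kind discussed above can likewise be sketched. The Jaccard measures and the equal weighting are illustrative choices, not the article's definitions:

```python
def keyword_sim(q1, q2):
    """Content-based similarity: Jaccard overlap of query keywords."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def click_sim(docs1, docs2):
    """Session-based similarity: overlap of clicked documents."""
    a, b = set(docs1), set(docs2)
    return len(a & b) / len(a | b) if a | b else 0.0

def query_sim(q1, docs1, q2, docs2, alpha=0.5):
    """Weighted combination; alpha balances the two evidence sources."""
    return alpha * keyword_sim(q1, q2) + (1 - alpha) * click_sim(docs1, docs2)

# Two differently worded queries that led to the same pages score high
# even though they share no keywords.
s = query_sim("atomic bomb", ["d1", "d2"], "manhattan project", ["d1", "d2"])
print(s)  # 0.5: zero keyword overlap, identical clicks
```

The point of the combination is exactly the word-mismatch case above: click evidence links queries that content analysis alone would miss.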
Wen, J.-R., Nie, J.-Y., & Zhang, H.-J. (2002). Query clustering using user logs. ACM Transactions on Information Systems (ACM TOIS), 20(1), 59-81.

Xu, J., & Croft, W.B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems, 18(1), 79-112.

KEY TERMS

Collaborative Filtering: A method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating).

Log-Based Query Clustering: A technique aiming at grouping users' semantically related queries collected in Web query logs.

Log-Based Query Expansion: A new query expansion method based on query log mining. Probabilistic correlations between terms in the user queries and those in the documents can then be established through user logs. With these term-term correlations, relevant expansion terms can be selected from the documents for a query.

Log-Based Personalized Search: Personalized search aims to return results related to users' preferences. The core task of personalization is to obtain the preference of each individual user, which can be learned from query logs.

Query Log: A type of file keeping track of the activities of the users who are utilizing a search engine.

Query Log Mining: An application of data mining techniques to discover interesting knowledge from Web query logs. The mined knowledge is usually used to enhance Web search.

Query Session: A query submitted to a search engine together with the Web pages the user visits in response to the query. The query session is the basic unit of many query log mining tasks.
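The query-session notion defined above can be made concrete with a toy log parser; the tab-separated log format is invented for illustration:

```python
# A minimal sketch of assembling query sessions from a raw click log.
# Each event line pairs a query with one clicked URL (format invented).
raw_log = """\
atomic bomb\thttp://a.example/hiroshima
atomic bomb\thttp://a.example/manhattan
nuclear weapon\thttp://a.example/manhattan
"""

def build_sessions(log_text):
    """Group clicked pages under the query that produced them."""
    out = {}
    for line in log_text.splitlines():
        query, url = line.split("\t")
        out.setdefault(query, []).append(url)
    return out  # query -> list of visited pages

print(build_sessions(raw_log))
```

Each (query, visited-pages) entry is one query session in the sense above, the unit that log-based clustering and expansion both consume.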
Enhancing Web Search through Web Structure Mining

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

INTRODUCTION

The Web is an open and free environment for people to publish and get information. Everyone on the Web can be either an author, a reader, or both. The language of the Web, HTML (Hypertext Markup Language), is mainly designed for information display, not for semantic representation. Therefore, current Web search engines usually treat Web pages as unstructured documents, and traditional information retrieval (IR) technologies are employed for Web page parsing, indexing, and searching. The unstructured essence of Web pages seriously blocks more accurate search and advanced applications on the Web. For example, many sites contain structured information about various products. Extracting and integrating product information from multiple Web sites could lead to powerful search functions, such as comparison shopping and business intelligence. However, these structured data are embedded in Web pages, and there are no proper traditional methods to extract and integrate them. Another example is the link structure of the Web. If used properly, the information hidden in the links can be exploited to effectively improve search performance and make Web search go beyond traditional information retrieval (Page, Brin, Motwani, & Winograd, 1998; Kleinberg, 1998).

Although XML (Extensible Markup Language) is an effort to structuralize Web data by introducing semantics into tags, it is unlikely that common users are willing to compose Web pages using XML due to its complexity and the lack of standard schema definitions. Even if XML is extensively adopted, a huge number of pages will still be written in the HTML format and remain unstructured. Web structure mining is the class of methods to automatically discover structured data and information from the Web. Because the Web is dynamic, massive, and heterogeneous, automated Web structure mining calls for novel technologies and tools that may take advantage of state-of-the-art technologies from various areas, including machine learning, data mining, information retrieval, databases, and natural language processing.

BACKGROUND

Web structure mining can be further divided into three categories based on the kind of structured data used.

Web graph mining: Compared to a traditional document set in which documents are independent, the Web provides additional information about how different documents are connected to each other via hyperlinks. The Web can be viewed as a (directed) graph whose nodes are the Web pages and whose edges are the hyperlinks between them. There has been a significant body of work on analyzing the properties of the Web graph and mining useful structures from it (Page et al., 1998; Kleinberg, 1998; Bharat & Henzinger, 1998; Gibson, Kleinberg, & Raghavan, 1998). Because the Web graph structure spans multiple Web pages, it is also called interpage structure.

Web information extraction (Web IE): Although the documents in a traditional information retrieval setting are treated as plain texts with no or few structures, the content within a Web page does have inherent structures based on the various HTML and XML tags within the page. While Web content mining pays more attention to the content of Web pages, Web information extraction focuses on automatically extracting structures, with various accuracy and granularity, out of Web pages. Web content structure is a kind of structure embedded in a single Web page and is also called intrapage structure.

Deep Web mining: Besides Web pages that are accessible or crawlable by following hyperlinks, the Web also contains a vast amount of noncrawlable content. This hidden part of the Web, referred to as the deep Web or the hidden Web (Florescu, Levy, & Mendelzon, 1998), comprises a large number of online Web databases. Compared to the static surface Web, the deep Web contains a much larger amount of high-quality structured information (Chang, He, Li, & Zhang, 2003). Automatically discovering the structures of Web databases and matching semantically related attributes between them is critical to understanding the structures and semantics of deep Web sites and to facilitating advanced search and other applications.

MAIN THRUST

Mining the Web graph has attracted a lot of attention in the last decade. Some important algorithms have been proposed and have shown great potential in improving the performance of Web search. Most of these mining algorithms are based on two assumptions. (a) Hyperlinks convey human endorsement. If there exists a link from page A to page B, and these two pages are authored by different people, then the first author found the second page valuable. Thus, the importance of a page can be propagated to the pages it links to. (b) Pages that are co-cited by a certain page are likely related to the same topic. Therefore, the popularity or importance of a page is correlated with the number of incoming links to some extent, and related pages tend to be clustered together through dense linkages among them.

Hub and Authority

In the Web graph, a hub is defined as a page containing pointers to many other pages, and an authority is defined as a page pointed to by many other pages. An authority is usually viewed as a good page containing useful information about one topic, and a hub is usually a good source from which to locate information related to one topic. Moreover, a good hub should contain pointers to many good authorities, and a good authority should be pointed to by many good hubs. Such a mutual reinforcement relationship between hubs and authorities is exploited by an iterative algorithm called HITS (Kleinberg, 1998). HITS computes authority scores and hub scores for Web pages in a subgraph of the Web, which is obtained from the (subset of) search results of a query together with some predecessor and successor pages.

Bharat and Henzinger (1998) addressed three problems in the original HITS algorithm: mutually reinforced relationships between hosts (where certain documents conspire to dominate the computation), automatically generated links (where no human's opinion is expressed by the link), and irrelevant documents (where the graph contains documents irrelevant to the query topic). They assign each edge of the graph an authority weight and a hub weight to solve the first problem and combine connectivity and content analysis to solve the latter two. Chakrabarti, Joshi, and Tawde (2001) addressed another problem with HITS: regarding the whole page as a hub is not suitable, because a page often contains multiple regions in which the hyperlinks point to different topics. They proposed to disaggregate hubs into coherent regions by segmenting the DOM (document object model) tree of an HTML page.

The main drawback of the HITS algorithm is that the hub and authority scores must be computed iteratively from the query result on the fly, which does not meet the real-time constraints of an online search engine. To overcome this difficulty, Page et al. (1998) suggested using a random surfing model to describe the probability that a page is visited and taking this probability as the importance measurement of the page. They approximated this probability with the famous PageRank algorithm, which computes the probability scores in an iterative manner. The main advantage of the PageRank algorithm over the HITS algorithm is that the importance values of all pages are computed off-line and can be directly incorporated into the ranking functions of search engines.

Noisy links and topic drifting are two main problems of the classic Web graph mining algorithms. Some links, such as banners, navigation panels, and advertisements, can be viewed as noise with respect to the query topic and do not carry human editorial endorsement. Also, hubs may be mixed, which means that only a portion of the hub content may be relevant to the query. Most link analysis algorithms treat each Web page as an atomic, indivisible unit with no internal structure, which leads to false reinforcements in hub/authority and importance calculation. Cai, He, Wen, and Ma (2004) used a vision-based page segmentation algorithm to partition each Web page into blocks. By extracting the page-to-block and block-to-page relationships from the link structure and page layout analysis, a semantic graph can be constructed over the Web such that each node represents exactly one semantic topic. This graph can better describe the semantic structure of the Web. Based on block-level link analysis, they proposed two new algorithms, Block-Level PageRank and Block-Level HITS, whose performances are shown to exceed those of the classic PageRank and HITS algorithms.

Community Mining

Many communities, either in an explicit or implicit form, exist on the Web today, and their number is growing at a very fast speed. Discovering communities from a network environment such as the Web has recently become an interesting research problem. The Web can be abstracted into directed or undirected graphs with nodes and links. It is usually rather difficult to understand a network's nature directly from its graph structure, particularly when it is a large-scale complex graph. Data mining is a method to discover the hidden patterns and knowledge from a huge network. The mined knowledge can provide a higher logical view and more precise insight into the nature of a network and will also dramatically decrease the dimensionality when analyzing the structure and evolution of the network.

Quite a lot of work has been done on mining the implicit communities of users, Web pages, or scientific literature from the Web or document citation databases using content or link analysis. Several different definitions of community have also been proposed in the literature. In Gibson et al. (1998), a Web community is a number of representative authority Web pages linked by important hub pages that share a common topic. Kumar, Raghavan, Rajagopalan, and Tomkins (1999) define a Web community as a highly linked bipartite subgraph with at least one core containing a complete bipartite subgraph. In Flake, Lawrence, and Lee Giles (2000), a set of Web pages that link to more pages inside the community than outside of it can be defined as a Web community. Also, a research community can be based on a single most-cited paper and can contain all papers that cite it (Popescul, Flake, Lawrence, Ungar, & Lee Giles, 2000).

Web Information Extraction

Web IE has the goal of pulling out information from a collection of Web pages and converting it to a homogeneous form that is more readily digested and analyzed by both humans and machines. The results of IE could be used to improve the indexing process, because IE removes irrelevant information in Web pages, and to facilitate other advanced search functions due to the structured nature of the data. The degrees of structure of Web pages are diverse. Some pages can be treated as just plain text documents. Some pages contain a little loosely structured data, such as a product list in a shopping page or a price table in a hotel page. Some pages are organized with more rigorous structures, such as the home pages of the professors in a university. Other pages have very strict structures, such as the book description pages of Amazon, which are usually generated from a uniform template. Therefore, basically two kinds of Web IE techniques exist: IE from unstructured pages and IE from semistructured pages. IE tools for unstructured pages are similar to classical IE tools, which typically use natural language processing techniques such as syntactic analysis, semantic analysis, and discourse analysis. IE tools for semistructured pages are different from the classical ones, as they utilize available structured information, such as HTML tags and page layouts, to infer the data formats of pages. Such methods are also called wrapper induction (Kushmerick, Weld, & Doorenbos, 1997; Cohen, Hurst, & Jensen, 2002). In contrast to classic IE approaches, wrapper induction depends less on the specific contents of Web pages and mainly focuses on page structure and layout. Existing approaches for Web IE mainly include the manual approach, supervised learning, and unsupervised learning. Although some manually built wrappers exist, supervised learning and unsupervised learning are viewed as more promising ways to learn robust and scalable wrappers, because building IE tools manually is not feasible or scalable for the dynamic, massive, and diverse Web contents. Moreover, because supervised learning still relies on manually labeled sample pages and thus also requires substantial human effort, unsupervised learning is the most suitable method for Web IE. There have been several successful fully automatic IE tools using unsupervised learning (Arasu & Garcia-Molina, 2003; Liu, Grossman, & Zhai, 2003).

Deep Web Mining

In the deep Web, it is usually difficult or even impossible to directly obtain the structures (i.e., schemas) of the Web sites' backend databases without cooperation from the sites. Instead, the sites present two other distinguishing structures, the interface schema and the result schema, to users. The interface schema is the schema of the query interface, which exposes the attributes that can be queried in the backend database. The result schema is the schema of the query results, which exposes the attributes that are shown to users.

The interface schema is useful for applications such as a mediator that queries multiple Web databases, because the mediator needs complete knowledge about the search interface of each database. The result schema is critical for applications such as data extraction, where instances in the query results are extracted. In addition to the importance of the interface schema and result schema, attribute matching across different schemas is also important. First, matching between different interface schemas and matching between different result schemas (intersite schema matching) are critical for metasearching and data integration among related Web databases. Second, matching between the interface schema and the result schema of a single Web database (intrasite schema matching) enables automatic data annotation and database content crawling.
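To make the interface-schema, result-schema, and attribute-matching ideas concrete, here is a much-simplified instance-based matching sketch. The attribute names and sample values are invented, and real matchers (such as the query-probing approach cited below) use far richer statistics than plain value overlap:

```python
def value_overlap(vals_a, vals_b):
    """Jaccard overlap of the instance values observed under two attributes."""
    a, b = set(vals_a), set(vals_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_attributes(schema_a, schema_b, threshold=0.3):
    """Pair each attribute with the attribute whose instances overlap most.
    schema_* maps attribute name -> sample values; labels may be cryptic
    or missing, which is exactly why instances drive the match."""
    matches = []
    for name_a, vals_a in schema_a.items():
        best = max(schema_b, key=lambda n: value_overlap(vals_a, schema_b[n]))
        if value_overlap(vals_a, schema_b[best]) >= threshold:
            matches.append((name_a, best))
    return matches

# Toy intrasite case: cryptic interface labels vs. readable result labels.
interface = {"au": ["Knuth", "Ullman"], "ti": ["TAOCP", "Dragon Book"]}
result = {"Author": ["Knuth", "Aho", "Ullman"], "Title": ["TAOCP"]}
print(match_attributes(interface, result))
```

Even with no usable labels on the interface side, shared instances align "au" with "Author" and "ti" with "Title", which is the intrasite matching scenario described above.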
Most existing schema-matching approaches for Web databases primarily focus on matching query interfaces (He & Chang, 2003; He, Meng, Yu, & Wu, 2003; Raghavan & Garcia-Molina, 2001). They usually adopt a label-based strategy to identify attribute labels from the descriptive text surrounding interface elements and then find synonymous relationships between the identified labels. The performance of these approaches may suffer when no attribute description can be identified or when the identified description is not informative. In Wang, Wen, Lochovsky, and Ma (2004), an instance-based schema-matching approach was proposed to identify both the interface and result schemas of Web databases. Instance-based approaches depend on content overlap or statistical properties, such as data ranges and patterns, to determine the similarity of two attributes. Thus, they can effectively deal with the cases where attribute names or labels are missing or not available, which are common for Web databases.

FUTURE TRENDS

It is foreseen that the biggest challenge in the next several decades is how to effectively and efficiently dig out a machine-understandable information and knowledge layer from unorganized and unstructured Web data. However, Web structure mining techniques are still in their youth today. For example, the accuracy of Web information extraction tools, especially automatically learned tools, is still not satisfactory for some rigid applications. Also, deep Web mining is a new area, and researchers have many challenges and opportunities to further explore, such as data extraction, data integration, schema learning and matching, and so forth. Moreover, besides Web pages, various other types of structured data exist on the Web, such as e-mail, newsgroups, blogs, wikis, and so forth. Applying Web mining techniques to extract structures from these data types is also a very important future research direction.

CONCLUSION

greatly improve the effectiveness of current Web search and will enable much more sophisticated Web information retrieval technologies in the future.

REFERENCES

Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from Web pages. Proceedings of the ACM SIGMOD International Conference on Management of Data.

Bharat, K., & Henzinger, M. R. (1998). Improved algorithms for topic distillation in a hyperlinked environment. Proceedings of the 21st Annual International ACM SIGIR Conference.

Cai, D., He, X., Wen, J.-R., & Ma, W.-Y. (2004). Block-level link analysis. Proceedings of the 27th Annual International ACM SIGIR Conference.

Chakrabarti, S., Joshi, M., & Tawde, V. (2001). Enhanced topic distillation using text, markup tags, and hyperlinks. Proceedings of the 24th Annual International ACM SIGIR Conference (pp. 208-216).

Chang, C. H., He, B., Li, C., & Zhang, Z. (2003). Structured databases on the Web: Observations and implications (Tech. Rep. No. UIUCCDCS-R-2003-2321). Urbana-Champaign, IL: University of Illinois, Department of Computer Science.

Cohen, W., Hurst, M., & Jensen, L. (2002). A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of the 11th World Wide Web Conference.

Flake, G. W., Lawrence, S., & Lee Giles, C. (2000). Efficient identification of Web communities. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining.

Florescu, D., Levy, A. Y., & Mendelzon, A. O. (1998). Database techniques for the World Wide Web: A survey. SIGMOD Record, 27(3), 59-74.
e-commerce. Proceedings of the 29th International Conference on Very Large Data Bases.

Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proceedings of the Ninth ACM-SIAM Symposium on Discrete Algorithms (pp. 668-677).

Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling the Web for emerging cyber-communities. Proceedings of the Eighth International World Wide Web Conference.

Kushmerick, N., Weld, D., & Doorenbos, R. (1997). Wrapper induction for information extraction. Proceedings of the International Joint Conference on Artificial Intelligence.

Liu, B., Grossman, R., & Zhai, Y. (2003). Mining data records in Web pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web (Tech. Rep.). Stanford University.

Popescul, A., Flake, G. W., Lawrence, S., Ungar, L. H., & Lee Giles, C. (2000). Clustering and identifying temporal trends in document databases. Proceedings of the IEEE Conference on Advances in Digital Libraries.

Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden Web. Proceedings of the 27th International Conference on Very Large Data Bases.

Wang, J., Wen, J.-R., Lochovsky, F., & Ma, W.-Y. (2004). Instance-based schema matching for Web databases by domain-specific query probing. Proceedings of the 30th International Conference on Very Large Data Bases.

KEY TERMS

Community Mining: A Web graph mining algorithm to discover communities from the Web graph in order to provide a higher logical view and more precise insight into the nature of the Web.

Deep Web Mining: Automatically discovering the structures of Web databases hidden in the deep Web and matching semantically related attributes between them.

HITS: A Web graph mining algorithm to compute authority scores and hub scores for Web pages.

PageRank: A Web graph mining algorithm that uses the probability that a page is visited by a random surfer on the Web as a key factor for ranking search results.

Web Graph Mining: The mining techniques used to discover knowledge from the Web graph.

Web Information Extraction: The class of mining methods to pull out information from a collection of Web pages and convert it to a homogeneous form that is more readily digested and analyzed by both humans and machines.

Web Structure Mining: The class of methods used to automatically discover structured data and information from the Web.
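The PageRank term defined above can be illustrated with a minimal power-iteration sketch of the random-surfer model; the toy graph and parameter choices are illustrative:

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank on an adjacency dict {page: [outlinks]}.
    d is the damping factor of the random-surfer model."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
            else:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

# Toy Web graph: a -> b, a -> c, b -> c, c -> a.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(toy)
print(max(r, key=r.get))  # "c": it receives links from both a and b
```

Because every update redistributes the full probability mass, the scores stay a distribution over pages, matching the visit-probability interpretation used in the article.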
Ensemble Data Mining Methods

BACKGROUND

A supervised machine learning task involves constructing a mapping from input data (normally described by several features) to the appropriate outputs. In a classification learning task, each output is one or more classes to which the input belongs. The goal of classification learning is to develop a model that separates the data into the different classes, with the aim of classifying new examples in the future. For example, a credit card company may develop a model that separates people who defaulted on their credit cards from those who did not, based on other known information such as annual income. The goal would be to predict whether a new credit card applicant is likely to default on his or her credit card and thereby decide whether to approve or deny this applicant a new card. In a regression learning task, each output is a continuous value to be predicted (e.g., the average balance that a credit card holder carries over to the next month).

Many traditional machine learning algorithms generate a single model (e.g., a decision tree or neural network). Ensemble learning methods instead generate multiple models. Given a new example, the ensemble passes it to each of its multiple base models, obtains their predictions,

MAIN THRUST

We now discuss the key elements of an ensemble-learning method and ensemble model and, in the process, discuss several ensemble methods that have been developed.

Ensemble Methods

The example shown in Figure 1 is an artificial example. We cannot normally expect to obtain base models that misclassify examples in completely separate parts of the input space and ensembles that classify all the examples correctly. However, many algorithms attempt to generate a set of base models that make errors that are as uncorrelated as possible. Methods such as bagging (Breiman, 1994) and boosting (Freund & Schapire, 1996) promote diversity by presenting each base model with a different subset of training examples or different weight distributions over the examples. For example, in Figure 1, if the plusses in the top part of the figure were temporarily removed from the training set, then a linear classifier learning algorithm trained on the remaining examples would probably yield a classifier similar to C. On the other hand, removing the plusses in the bottom part of the figure would probably yield classifier B, or something similar. In this way, running the same learning algorithm on different subsets of training examples can yield very different classifiers, which can be combined to yield an effective ensemble.

Figure 1. An ensemble of linear classifiers

Input decimation ensembles (IDE) (Oza & Tumer, 2001; Tumer & Oza, 2003) and stochastic attribute selection committees (SASC) (Zheng & Webb, 1998) instead promote diversity by training each base model with the same training examples but different subsets of the input features. SASC trains each base model with a random subset of the input features. IDE selects, for each class, a subset of features that has the highest correlation with the presence of that class. Each feature subset is used to train one base model. However, in both SASC and IDE, all the training patterns are used with equal weight to train all the base models.

So far, we have distinguished ensemble methods by the way they train their base models. We can also distinguish methods by the way they combine their base models' predictions. Majority or plurality voting is frequently used for classification problems and is used in bagging. If the classifiers provide probability values, simple averaging is commonly used and is very effective (Tumer & Ghosh, 1996). Weighted averaging has also been used, and different methods for weighting the base models have been examined. Two particularly interesting methods for weighted averaging are mixtures of experts (Jordan & Jacobs, 1994) and Merz's use of principal components analysis (PCA) to combine models (Merz, 1999). In the mixtures-of-experts method, the weights in the weighted-average combination are determined by a gating network, which is a model that takes the same inputs that the base models take and returns a weight for each of the base models. The higher the weight for a base model, the more that base model is trusted to provide the correct answer. These weights are determined during training by how well the base models perform on the training examples. The gating network essentially keeps track of how well each base model performs in each part of the input space. The hope is that each model learns to specialize in different input regimes and is weighted highly when the input falls into its specialty. Merz's method uses PCA to lower the weights of base models that perform well overall but are redundant and, therefore, effectively give too much weight to one model. For example, in Figure 1, if an ensemble of three models instead had two copies of A and one copy of B, we may prefer to lower the weights of the two copies of A because, essentially, A is being given too much weight. Here, the two copies of A would always outvote B, thereby rendering B useless. Merz's method also increases the weight on base models that do not perform as well overall but perform well in parts of the input space where the other models perform poorly. In this way, a base model's unique contributions are rewarded.

When designing an ensemble learning method, in addition to choosing the method by which to bring about diversity in the base models and choosing the combining method, one has to choose the type of base model and base model learning algorithm to use. The combining method may restrict the types of base models that can be used. For example, to use average combining in a classification problem, one must have base models that can yield probability estimates. This precludes the use of linear discriminant analysis or support vector machines, which cannot return probabilities. The vast majority of ensemble methods use only one base model learning algorithm but use the methods described earlier to bring about diversity in the base models. Surprisingly little work has been done (e.g., Merz, 1999) on creating ensembles with many different types of base models.

Two of the most popular ensemble learning algorithms are bagging and boosting, which we briefly explain next.

Bagging

Bootstrap aggregating (bagging) generates multiple bootstrap training sets from the original training set (by using sampling with replacement) and uses each of them to generate a classifier for inclusion in the ensemble. The algorithms for bagging and sampling with replacement are given in Figure 2. In these algorithms, T is the original training set of N examples, M is the number of base models to be learned, L_b is the base model learning algorithm, the h_i's are the base models, random_integer(a, b) is a function that returns each of the integers from a to b with equal probability, and I(A) is the indicator function that returns 1 if A is true and 0 otherwise.

To create a bootstrap training set from an original training set of size N, we perform N multinomial trials, where in each trial, we draw one of the N examples. Each example has the probability 1/N of being drawn in each
Figure 2. Batch bagging algorithm and sampling with replacement

Bagging(T, M):
  For m = 1, 2, ..., M:
    T_m = Sample_With_Replacement(T, N)
    h_m = L_b(T_m)
  Return h_fin(x) = argmax_{y in Y} sum_{m=1}^{M} I(h_m(x) = y).

Sample_With_Replacement(T, N):
  S = {}
  For i = 1, 2, ..., N:
    r = random_integer(1, N)
    Add T[r] to S.
  Return S.

Figure 3. Batch boosting (AdaBoost) algorithm

AdaBoost(T, M):
  Initialize D_1(n) = 1/N for n = 1, ..., N.
  For m = 1, 2, ..., M:
    h_m = L_b(T, D_m)
    eps_m = sum of D_m(n) over all n with h_m(x_n) != y_n
    If eps_m >= 1/2, then set M = m - 1 and abort this loop.
    Update distribution D_m:
      D_{m+1}(n) = D_m(n) * 1/(2(1 - eps_m))  if h_m(x_n) = y_n
      D_{m+1}(n) = D_m(n) * 1/(2 eps_m)       otherwise.
  Return h_fin(x) = argmax_{y in Y} sum_{m=1}^{M} I(h_m(x) = y) log((1 - eps_m)/eps_m).
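The bagging and sampling-with-replacement procedures of Figure 2 can be sketched in Python as follows. The majority-class base learner is an invented stand-in for L_b; any function that maps a training set to a classifier would do.

```python
import random
from collections import Counter

def sample_with_replacement(T, N):
    """Draw N examples uniformly at random, with replacement, from T."""
    return [T[random.randrange(N)] for _ in range(N)]

def majority_class_learner(examples):
    """Toy stand-in for L_b: predicts the most common label in its sample."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: label

def bagging(T, M, base_learner):
    """Learn M base models, each on its own bootstrap sample of T."""
    N = len(T)
    models = [base_learner(sample_with_replacement(T, N)) for _ in range(M)]
    def h_fin(x):
        # Plurality vote over the base models' predictions.
        return Counter(h(x) for h in models).most_common(1)[0][0]
    return h_fin

T = [(0, "a"), (1, "a"), (2, "a"), (3, "b")]
ensemble = bagging(T, M=11, base_learner=majority_class_learner)
print(ensemble(0))  # plurality vote of the bootstrap models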
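The boosting loop of Figure 3 can likewise be sketched in Python. The threshold-stump base learner and the handling of a zero-error base model are our own illustrative assumptions; the article's L_b is any learner that accepts a weighted training set.

```python
import math
from collections import defaultdict

def adaboost(examples, M, weighted_learner):
    N = len(examples)
    D = [1.0 / N] * N                      # D_1: uniform weights
    models = []                            # pairs (h_m, log((1 - eps_m)/eps_m))
    for _ in range(M):
        h = weighted_learner(examples, D)
        eps = sum(d for d, (x, y) in zip(D, examples) if h(x) != y)
        if eps >= 0.5:                     # abort: h_m is no better than chance
            break
        if eps == 0.0:                     # assumption: stop on a perfect model
            models.append((h, float("inf")))
            break
        models.append((h, math.log((1 - eps) / eps)))
        # Halve the total weight of the correct and the incorrect examples.
        D = [d / (2 * (1 - eps)) if h(x) == y else d / (2 * eps)
             for d, (x, y) in zip(D, examples)]
    def h_fin(x):
        votes = defaultdict(float)         # class -> total weighted vote
        for h, w in models:
            votes[h(x)] += w
        return max(votes, key=votes.get)
    return h_fin

def stump_learner(examples, D):
    """Toy weighted learner: best single-threshold rule on 1-D inputs
    with hard-coded labels 'a' and 'b' (an illustrative assumption)."""
    best_err, best_h = None, None
    for t in [x for x, _ in examples]:
        for lo, hi in (("a", "b"), ("b", "a")):
            h = lambda x, t=t, lo=lo, hi=hi: lo if x <= t else hi
            err = sum(d for d, (x, y) in zip(D, examples) if h(x) != y)
            if best_err is None or err < best_err:
                best_err, best_h = err, h
    return best_h

data = [(1, "a"), (2, "a"), (3, "b"), (4, "b")]
clf = adaboost(data, M=5, weighted_learner=stump_learner)
print(clf(1), clf(4))  # → a b
```

The distribution update is exactly the one described in the text: correctly classified examples are scaled by 1/(2(1 - eps_m)) and misclassified ones by 1/(2 eps_m), so each group ends up with total weight 1/2.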
ated base models. If this condition is satisfied, then we calculate a new distribution, D_2, over the training examples as follows. Examples that were correctly classified by h_1 have their weights multiplied by 1/(2(1 - eps_1)). Examples that were misclassified by h_1 have their weights multiplied by 1/(2 eps_1). Note that because of our condition eps_1 < 1/2, correctly classified examples have their weights reduced, and misclassified examples have their weights increased. Specifically, examples that h_1 misclassified have their total weight increased to 1/2 under D_2, and examples that h_1 correctly classified have their total weight reduced to 1/2 under D_2. We then go into the next iteration of the loop to construct base model h_2 using the training set and the new distribution D_2. The point is that the next base model will be generated by a weak learner (i.e., the base model will have an error less than 1/2); therefore, at least some of the examples misclassified by the previous base model will have to be correctly classified by the current base model. In this way, boosting forces subsequent base models to correct the mistakes made by earlier models. We construct M base models in this fashion. The ensemble returned by AdaBoost is a function that takes a new example as input and returns the class that gets the maximum weighted vote over the M base models, where each base model's weight is log((1 - eps_m)/eps_m), which is proportional to the base model's accuracy on the weighted training set presented to it.

AdaBoost has performed very well in practice and is one of the few theoretically motivated algorithms that has turned into a practical algorithm. However, AdaBoost can perform poorly when the training data is noisy (Dietterich, 2000); that is, when the inputs or outputs have been randomly contaminated. Noisy examples are normally difficult to learn. Because of this, the weights assigned to noisy examples often become much higher than those of the other examples, often causing boosting to focus too much on those noisy examples at the expense of the remaining data. Some work has been done to mitigate the effect of noisy examples on boosting (Oza, 2003, 2004; Ratsch, Onoda, & Muller, 2001).

FUTURE TRENDS

The fields of machine learning and data mining are increasingly moving away from working on small datasets in the form of flat files that are presumed to describe a single process. They are changing their focus toward the types of data increasingly being encountered today: very large datasets, possibly distributed over different locations, describing operations with multiple regimes of operation; time-series data; online applications (the data is not a time series but nevertheless arrives continually and must be processed as it arrives); partially labeled data; and documents. Research in ensemble methods is beginning to explore these new types of data. For example, ensemble learning traditionally has required access to the entire dataset at once; that is, it performs batch learning. However, this approach is clearly impractical for very large datasets that cannot be loaded into memory all at once. Oza and Russell (2001) and Oza (2001) apply ensemble learning to such large datasets. In particular, this work develops online bagging and boosting; that is, they learn in an online manner. Whereas standard bagging and boosting require at least one scan of the dataset for every base model created, online bagging and online boosting require only one scan of the dataset, regardless of the number of base models. Additionally, as new data arrive, the ensembles can be updated without reviewing any past data. However, because of their limited access to the data, these online algorithms do not perform as well as their standard counterparts. Other work has also been done to apply ensemble methods to other types of data, such as time-series data (Weigend, Mangeas, & Srivastava, 1995). However, most of this work is experimental. Theoretical frameworks that can guide us in the development of new ensemble learning algorithms specifically for modern datasets have yet to be developed.

CONCLUSION

Ensemble methods began about 10 years ago as a separate area within machine learning and were motivated by the idea of leveraging the power of multiple models rather than trusting just one model built on a small training set. Significant theoretical and experimental developments have occurred over the past 10 years and have led to several methods, especially bagging and boosting, being used to solve many real problems. However, ensemble methods also appear to be applicable to current and upcoming problems of distributed data mining and online applications. Therefore, practitioners in data mining should stay tuned for further developments in this vibrant area. An excellent way to do so is to follow the series of workshops called the International Workshop on Multiple Classifier Systems. This series' balance between theory, algorithms, and applications of ensemble methods gives a comprehensive idea of the work being done in the field.

REFERENCES

Breiman, L. (1994). Bagging predictors (Tech. Rep. 421). Berkeley: University of California, Department of Statistics.
Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40, 139-158.

Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. Proceedings of the 13th International Conference on Machine Learning (pp. 148-156). Bari, Italy: Morgan Kaufmann Publishers.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.

Merz, C. J. (1999). A principal component approach to combining regression estimates. Machine Learning, 36, 9-32.

Oza, N. C. (2001). Online ensemble learning. Unpublished doctoral dissertation, University of California, Berkeley.

Oza, N. C. (2003). Boosting with averaged weight vectors. In T. Windeatt & F. Roli (Eds.), Proceedings of the Fourth International Workshop on Multiple Classifier Systems (pp. 15-24). Guildford, UK: Springer-Verlag.

Oza, N. C. (2004). AveBoost2: Boosting with noisy data. In F. Roli, J. Kittler, & T. Windeatt (Eds.), Proceedings of the Fifth International Workshop on Multiple Classifier Systems (pp. 31-40). Cagliari, Italy: Springer-Verlag.

Oza, N. C., & Russell, S. (2001). Experimental comparisons of online and batch versions of bagging and boosting. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA (pp. 359-364). ACM Press.

Oza, N. C., & Tumer, K. (2001). Input decimation ensembles: Decorrelation through dimensionality reduction. Proceedings of the Second International Workshop on Multiple Classifier Systems, Berlin (pp. 238-247). Springer-Verlag.

Ratsch, G., Onoda, T., & Muller, K. R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287-320.

Tumer, K., & Ghosh, J. (1996). Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4), 385-404.

Tumer, K., & Oza, N. C. (2003). Input decimated ensembles. Pattern Analysis and Applications, 6(1), 65-77.

Weigend, A. S., Mangeas, M., & Srivastava, A. N. (1995). Nonlinear gated experts for time-series: Discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6(4), 373-399.

Zheng, Z., & Webb, G. (1998). Stochastic attribute selection committees. Proceedings of the 11th Australian Joint Conference on Artificial Intelligence (pp. 321-332). Brisbane, Australia: Springer-Verlag.

KEY TERMS

Batch Learning: Learning by using an algorithm that views the entire dataset at once and can access any part of the dataset at any time and as many times as desired.

Decision Tree: A model consisting of nodes that contain tests on a single attribute and branches representing the different outcomes of the test. A prediction is generated for a new example by performing the test described at the root node and then proceeding along the branch that corresponds to the outcome of the test. If the branch ends in a prediction, then that prediction is returned. If the branch ends in a node, then the test at that node is performed and the appropriate branch selected. This continues until a prediction is found and returned.

Ensemble: A function that returns a combination of the predictions of multiple machine learning models.

Machine Learning: The branch of artificial intelligence devoted to enabling computers to learn.

Neural Network: A nonlinear model derived through analogy with the human brain. It consists of a collection of elements that linearly combine their inputs and pass the result through a nonlinear transfer function.

Online Learning: Learning by using an algorithm that examines the dataset only once, in order. This paradigm is often used in situations when data arrive continually in a stream and when predictions must be obtainable at any time.

Principal Components Analysis (PCA): Given a dataset, PCA determines the axes of maximum variance. For example, if the dataset were shaped like an egg, then the long axis of the egg would be the first principal component, because the variance is greatest in this direction. All subsequent principal components are found to be orthogonal to all previous components.

ENDNOTES

1. If L_b cannot take a weighted training set, then one can call it with a training set that is generated by sampling with replacement from the original training set according to the distribution D_m.

2. This requirement is perhaps too strict when more than two classes exist. AdaBoost has a multiclass
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Ethics of Data Mining
(No. 01-2000 1st Cir. Dec. 17, 2001), the First Circuit Court of Appeals in Massachusetts held that Explorica, a tour operator for students, improperly obtained confidential information about how rival EF's Web site worked and used that information to write software that gleaned data about student tour prices from EF's Web site in order to undercut EF's prices (Scott, 2002). In this case, Explorica probably violated the federal Computer Fraud and Abuse Act (18 U.S.C. Sec. 1030). Hence, the source of the data is important when data mining.

Typically, with applied ethics, a morally controversial practice, such as how data mining impacts privacy, is described and analyzed in descriptive terms; finally, moral principles and judgments are applied to it and moral deliberation takes place, resulting in a moral evaluation and, operationally, a set of policy recommendations (Brey, 2000, p. 10). Applied ethics is adopted by most of the literature on computer ethics (Brey, 2000). Data mining may appear to be morally neutral, but appearances in this case are deceiving. This paper takes an applied perspective to the ethical dilemmas that arise from the application of data mining in specific circumstances, as opposed to examining the technological artifacts (i.e., the specific software and how it generates inferences and predictions) used by data miners.

MAIN THRUST

Computer technology has redefined the boundary between public and private information, making much more information public. Privacy is the freedom granted to individuals to control their exposure to others. A customary distinction is between relational and informational privacy. Relational privacy is the control over one's person and one's personal environment, and concerns the freedom to be left alone without observation or interference by others. Informational privacy is one's control over personal information in the form of text, pictures, recordings, and so forth (Brey, 2000).

Technology cannot be separated from its uses. It is the ethical obligation of any information systems (IS) professional who finds out, through whatever means, that the data he or she has been asked to gather or mine is going to be used in an unethical way to act in a socially and ethically responsible manner. This might mean nothing more than pointing out why such a use is unethical. In other cases, more extreme measures may be warranted. As data mining becomes more commonplace and as companies push for even greater profits and market share, ethical dilemmas will be increasingly encountered. Ten common blunders that a data miner may commit, resulting in potential ethical or possibly legal dilemmas, are (Skalak, 2001):

1. Selecting the wrong problem for data mining.
2. Ignoring what the sponsor thinks data mining is and what it can and cannot do.
3. Leaving insufficient time for data preparation.
4. Looking only at aggregated results, never at individual records.
5. Being nonchalant about keeping track of the mining procedure and results.
6. Ignoring suspicious findings in a haste to move on.
7. Running mining algorithms repeatedly without thinking hard enough about the next stages of the data analysis.
8. Believing everything you are told about the data.
9. Believing everything you are told about your own data mining analyses.
10. Measuring results differently from the way the sponsor will measure them.

These blunders are hidden ethical dilemmas faced by those who perform data mining. In the next subsections, sample ethical dilemmas raised with respect to the application of data mining results in the public sector are examined, followed briefly by those in the private sector.

Ethics of Data Mining in the Public Sector

Many times, the objective of data mining is to build a customer profile based on two types of data: factual (who the customer is) and transactional (what the customer does) (Adomavicius & Tuzhilin, 2001). Often, consumers object to transactional analysis. What follows are two examples; the first (identifying successful students) creates a profile based primarily on factual data, and the second (identifying criminals and terrorists) primarily on transactional data.

Identifying Successful Students

Probably the most common and well-developed use of data mining is the attraction and retention of customers. At first, this sounds like an ethically neutral application. Why not apply the concept of students as customers to the academe? When students enter college, the transition from high school is overwhelming for many students, negatively impacting their academic performance. High school is a highly structured Monday-through-Friday schedule. College requires students to study at irregular hours that constantly change from week to week, depending on the workload at that particular point in the course. Course materials are covered at a faster pace; the duration of a single class period is longer; and subjects are often more difficult. Tackling the changes in a student's academic environment and living arrangement, as well as developing new interpersonal relationships, is daunting for students. Identifying students prone to difficulties and intervening early with support services could significantly improve student success and, ultimately, improve retention and graduation rates.

Consider the following scenario that realistically could arise at many institutions of higher education. Admissions at the institute has been charged with seeking applicants who are more likely to be successful (i.e., graduate from the institute within a five-year period). Someone suggests data mining existing student records to determine the profile of the most likely successful student applicant. With little more than this loose definition of success, a great deal of disparate data is gathered and eventually mined. The results indicate that the most likely successful applicant, based on factual data, is an Asian female whose family's household income is between $75,000 and $125,000 and who graduates in the top 25% of her high school class. Based on this result, admissions chooses to target market such high school students. Is there an ethical dilemma? What about diversity? What percentage of limited marketing funds should be allocated to this customer segment? This scenario highlights the importance of having well-defined goals before beginning the data mining process. The results would have been different if the goal were to find the most diverse student population that achieved a certain graduation rate after five years. In this case, the process was flawed fundamentally and ethically from the beginning.

Identifying Criminals and Terrorists

The key to the prevention, investigation, and prosecution of criminals and terrorists is information, often based on transactional data. Hence, government agencies increasingly desire to collect, analyze, and share information about citizens and aliens. However, according to Rep. Curt Weldon (R-PA), chairman of the House Subcommittee on Military Research and Development, there are 33 classified agency systems in the federal government, but none of them link their raw data together (Verton, 2002). As Steve Cooper, CIO of the Office of Homeland Security, said, "I haven't seen a federal agency yet whose charter includes collaboration with other federal agencies" (Verton, 2002, p. 5). Weldon lambasted the federal government for failing to act on critical data mining and integration proposals that had been authored before the terrorist attacks on September 11, 2001 (Verton, 2002).

Data to be mined is obtained from a number of sources. Some of these are relatively new and unstructured in nature, such as help desk tickets, customer service complaints, and complex Web searches. In other circumstances, data miners must draw from a large number of sources. For example, the following databases represent some of those used by the U.S. Immigration and Naturalization Service (INS) to capture information on aliens (Verton, 2002):

Employment Authorization Document System
Marriage Fraud Amendment System
Deportable Alien Control System
Reengineered Naturalization Application Casework System
Refugees, Asylum, and Parole System
Integrated Card Production System
Global Enrollment System
Arrival Departure Information System
Enforcement Case Tracking System
Student and Schools System
General Counsel Electronic Management System
Student Exchange Visitor Information System
Asylum Prescreening System
Computer-Linked Application Information Management System (two versions)
Non-Immigrant Information System

There are islands of excellence within the public sector. One such example is the U.S. Army's Land Information Warfare Activity (LIWA), which is credited with having "one of the most effective operations for mining publicly available information in the intelligence community" (Verton, 2002, p. 5).

Businesses have long used data mining. Recently, however, governmental agencies have shown growing interest in using data mining in national security initiatives (Carlson, 2003, p. 28). Two government data mining projects, the latter renamed by the euphemism "factual data analysis," have been under scrutiny (Carlson, 2003). These projects are the U.S. Transportation Security Administration's (TSA) Computer Assisted Passenger Prescreening System II (CAPPS II) and the Defense Advanced Research Projects Agency's (DARPA) Total Information Awareness (TIA) research project (Gross, 2003). TSA's CAPPS II will analyze the name, address, phone number, and birth date of airline passengers in an effort to detect terrorists (Gross, 2003). James Loy, director of the TSA, stated to Congress that, with CAPPS II, the percentage of airplane travelers going through extra screening is expected to drop significantly from the 15% that undergo it today (Carlson, 2003). Decreasing the number of false positive identifications will shorten lines at airports.

TIA, on the other hand, is a set of tools to assist agencies such as the FBI with data mining. It is designed to detect extremely rare patterns. The program will include terrorism scenarios "based on previous attacks, intelligence analysis, war games in which clever people imagine ways to attack the United States and its deployed forces," testified Anthony Tether, director of DARPA, to Congress (Carlson, 2003, p. 22). When asked how DARPA will ensure that personal information caught in TIA's net is correct, Tether stated that "we're not the people who collect the data. We're the people who supply the analytical tools to the people who collect the data" (Gross, 2003, p. 18). Critics of data mining say that while the technology is guaranteed to invade personal privacy, it is not certain to enhance national security. Terrorists do not operate under discernable patterns, critics say, and therefore the technology will likely be targeted primarily at innocent people (Carlson, 2003, p. 22). Congress voted to block funding of TIA, but privacy advocates are concerned that the TIA architecture, dubbed "mass dataveillance," may be used as a model for other programs (Carlson, 2003).

Systems such as TIA and CAPPS II raise a number of ethical concerns, as evidenced by the overwhelming opposition to these systems. One system, the Multistate Anti-TeRrorism Information EXchange (MATRIX), represents how data mining has acquired a bad reputation in the public sector. MATRIX is self-defined as "a pilot effort to increase and enhance the exchange of sensitive terrorism and other criminal activity information between local, state, and federal law enforcement agencies" (matrix-at.org, accessed June 27, 2004). Interestingly, MATRIX states explicitly on its Web site that it is not a data-mining application, although the American Civil Liberties Union (ACLU) openly disagrees. At the very least, the perceived opportunity for creating ethical dilemmas and, ultimately, abuse is something the public is very concerned about, so much so that the project felt that the disclaimer was needed. Due to the extensive writings on data mining in the private sector, the next subsection is brief.

Ethics of Data Mining in the Private Sector

Businesses discriminate constantly. Customers are classified, receiving different services or different cost structures. As long as discrimination is not based on protected characteristics such as age, race, or gender, discriminating is legal. Technological advances make it possible to track in great detail what a person does. Michael Turner, executive director of the Information Services Executive Council, states, "For instance, detailed consumer information lets apparel retailers market their products to consumers with more precision. But if privacy rules impose restrictions and barriers to data collection, those limitations could increase the prices consumers pay when they buy from catalog or online apparel retailers by 3.5% to 11%" (Thibodeau, 2001, p. 36). Obviously, if retailers cannot target their advertising, then their only option is to mass advertise, which drives up costs.

With this profile of personal details comes a substantial ethical obligation to safeguard this data. Ignoring any legal ramifications, the ethical responsibility is placed firmly on IS professionals and businesses, whether they like it or not; otherwise, they risk lawsuits and harming individuals. The data industry has come under harsh review. There is a raft of federal and local laws under consideration to control the collection, sale, and use of data. American companies have yet to match the tougher privacy regulations already in place in Europe, while personal and class-action litigation against businesses over data privacy issues is increasing (Wilder & Soat, 2001, p. 38).

FUTURE TRENDS

Data mining traditionally was performed by a trained specialist using a stand-alone package. This once-nascent technique is now being integrated into an increasing number of broader business applications and legacy systems used by those with little formal training, if any, in statistics and other related disciplines. Only recently have privacy and data mining been addressed together, as evidenced by the fact that the first workshop on the subject was held in 2002 (Clifton & Estivill-Castro, 2002). The challenge of ensuring that data mining is used in an ethically and socially responsible manner will increase dramatically.

CONCLUSION

Several lessons should be learned. First, decision makers must understand key strategic issues. The data miner must have an honest and frank dialog with the sponsor concerning objectives. Second, decision makers must not come to rely on data mining to make decisions for them. Even the best data mining is susceptible to human interpretation. Third, decision makers must be careful not to explain away with intuition data mining results that are counterintuitive. Decision making inherently creates ethical dilemmas, and data mining is but a tool to assist management in key decisions.

REFERENCES

Adomavicius, G., & Tuzhilin, A. (2001). Using data mining methods to build customer profiles. Computer, 34(2), 74-82.

Brey, P. (2000). Disclosive computer ethics. Computers and Society, 30(4), 10-16.

Carlson, C. (2003a). Feds look at data mining. eWeek, 20(19), 22.

Carlson, C. (2003b). Lawmakers will drill down into data mining. eWeek, 20(13), 28.

Clifton, C., & Estivill-Castro, V. (Eds.). (2002). Privacy, security and data mining. Proceedings of the IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, Maebashi City, Japan.

Edelstein, H. (2003). Description is not prediction. DM Review, 13(3), 10.

Feenberg, A. (1999). Questioning technology. London: Routledge.

Fule, P., & Roddick, J. F. (2004). Detecting privacy and ethical sensitivity in data mining results. Proceedings of the 27th Conference on Australasian Computer Science, Dunedin, New Zealand.

Gross, G. (2003). U.S. agencies defend data mining plans. ComputerWorld, 37(19), 18.

Sclove, R. (1995). Democracy and technology. New York: Guilford Press.

Scott, M. D. (2002). Can data mining be a crime? CIO Insight, 1(10), 65.

Skalak, D. (2001). Data mining blunders exposed! 10 data mining mistakes to avoid making today. DB2 Magazine, 6(2), 10-13.

Thibodeau, P. (2001). FTC examines privacy issues raised by data collectors. ComputerWorld, 35(13), 36.

Verton, D. (2002a). Congressman says data mining could have prevented 9-11. ComputerWorld, 36(35), 5.

Verton, D. (2002b). Database woes thwart counterterrorism work. ComputerWorld, 36(49), 14.

Wilder, C., & Soat, J. (2001). The ethics of data. InformationWeek, 1(837), 37-48.

Winner, L. (1980). Do artifacts have politics? Daedalus, 109, 121-136.

KEY TERMS

Applied Ethics: The study of a morally controversial practice, whereby the practice is described and analyzed, and moral principles and judgments are applied, resulting in a set of recommendations.

Ethics: The study of the general nature of morals and values as well as specific moral choices; it also may refer to the rules or standards of conduct that are agreed upon by cultures and organizations and that govern personal or professional conduct.

Factual Data: Data that include demographic information such as name, gender, and birth date. It also may contain information derived from transactional data, such as someone's favorite beverage.

Factual Data Analysis: Another term for data mining, often used by government agencies. It uses both factual and transactional data.

Informational Privacy: The control over one's personal information in the form of text, pictures, recordings, and such.

Mass Dataveillance: Suspicion-less surveillance of large groups of people.

Relational Privacy: The control over one's person and one's personal environment.

Transactional Data: Data that contain records of purchases over a given period of time, including such information as date, product purchased, and any special requests.
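As a small illustration of the Factual Data and Transactional Data key terms above, the following sketch derives a profile attribute (a favorite product) from transactional records and combines it with factual attributes. All names and records are invented for illustration; this is not a description of any system discussed in the article.

```python
from collections import Counter

# Factual data: who the customer is.
factual = {"name": "J. Doe", "gender": "F", "birth_date": "1970-01-01"}

# Transactional data: what the customer does.
transactions = [
    {"date": "2004-01-03", "product": "coffee"},
    {"date": "2004-01-10", "product": "coffee"},
    {"date": "2004-01-17", "product": "tea"},
]

# A derived fact (e.g., a favorite beverage) comes from transactional data.
favorite = Counter(t["product"] for t in transactions).most_common(1)[0][0]

profile = dict(factual, favorite_product=favorite,
               purchase_count=len(transactions))
print(profile["favorite_product"], profile["purchase_count"])  # → coffee 3
```

The point of the sketch is the direction of inference: factual attributes are stored as given, while profile attributes like `favorite_product` are mined from behavior, which is exactly the kind of transactional analysis consumers tend to object to.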
Ethnography to Define Requirements and Data Model
The domain analysis concept also is used in information systems to show how data relate to each other. Peter Chen (1976) proposed the Entity-Relationship (ER) model as a way to define and unify relational database views; it can be found in popular relational database management systems such as Oracle and Microsoft SQL Server. In the entity-relationship model, tables are shown as entities, data fields are shown as attributes, and lines show how the entities and attributes relate to each other. Entities are categories such as a person, place, or thing. The domain analysis identified in Figure 1 is based on the initial themes of reducing working capital and decreasing sales outstanding, where rectangles identify primary domain entities for both the internal PPG associates and external PPG customers.

Taxonomic Analysis

Further analysis of the fields in the domain analysis included a review of the specific terms that were considered critical because of the frequency and consistency with which they were used in the language. These terms eventually became the baseline data field attributes required for the decision support system model. A taxonomic analysis is a technique used to capture the essential terms, also called fields, in a hierarchical fashion, both to define the relationships among the fields and to understand which terms are subordinate in meaning to a grander, higher term. The design of the taxonomic analysis (see Figure 2) is based on a spiraling concept where all of the terms and relationships defined previously are dependent on the central theme; that is, to get dollars for sales through reducing working capital, which is noted in the center of the drawing. The taxonomic analysis is broken into five component areas: who, what, why, when, and how. Each area identifies the terms, relationships, and hierarchical structure of each term within the area. Structured questions allow the researcher to find out how informants have organized their knowledge. A review of the transcribed words from the interviews, additional notes, and the categorized terms from the field work began to identify key, structured questions that were important to the Credit Services Department. These in-depth, structured, thick-description questions began to emerge out of the terms and categories used in the language.

Componential Analysis

Spradley (1979) states that the development research sequence evolves through locating informants, conducting interviews, and collecting the terms and symbols used for language and communication. This research included the domain analysis process, with an overview of the terms that make up categories and the relationships among categories. The domain analysis provided an overview of the language in the cultural scene. The technique of taxonomic analysis identified relationships and differences among terms within each domain. The next technique in the process involved searching for and organizing the attributes associated with cultural terms and categories. Attributes are specific terms that have significant meaning in the language. This process is called componential analysis.

Spradley (1973) states that, in drafting a schematic diagram of domains, "This thinking process is one of the best strategies for discovering cultural themes" (p. 199). The schematic diagram in Figure 3 identifies the one central theme that motivates Credit Services and this research project, and it is based on one principle: understanding customer payment behavior.
[Figure: Credit Services department domain, linking the internal PPG audience (branch, field sales operations/service, customer service, distribution, strategic business units, credit, parent headquarters locations) and the external customer audience through activities such as cash application, account status, purchasing and selling products/services, presenting statements and invoices for payment, handling disputes, and receiving, banking and applying payments to the PPG account.]
Ethnography to Define Requirements and Data Model
[Figure 2. Taxonomic analysis centered on the theme "Get Dollars for Sales, Reduce Working Capital," organized into five component areas: WHO (branch, operations/distribution, customer service, field sales, customer, PPG Credit, 15 SBUs, PPG Auto Glass); WHAT (customer payment, cash application, dispute, invoice, credit limit status of on hold/active/release, invoices, account statements, current balance, past due, online aging, balance over credit limit); WHY (paid in full, dispute reasons such as unearned discount and price deduction, dispute resolution via deduct/chargeback, writeoff or credit memo); WHEN (invoice date, discount due date, term timeline); HOW (Oracle A/R: AG, CARS; knowledge management; SBU transaction processing; order entry; DSS/business intelligence; eBilling; online analytical processing measures).]
[Figure 3. Cultural themes converging on the central theme of understanding customer payment behavior: selling product to financially stable customers with accurate prices on invoices; reducing working capital at the enterprise (PPG) level; reviewing and monitoring the financial stability of customers; improving cash flow at the strategic business unit and department level; receiving cash and ensuring payments are received for services; improving invoice and statement integrity, resulting in reduced disputes/deducts and higher customer satisfaction; resolving disputes/deducts on discrepancies; knowledge-based decision support and business intelligence; and hybrid, integrated information systems.]
activity. The final stage in the ethnographic process was to define the decision support system model to assist PPG in tracking customer payments. The model was based on using an information systems-based approach to capture the data required to understand customer payment behavior and to provide trend analysis capabilities to gain knowledge and insight from that understanding.

REFERENCES

Boyd, J. (2001, April 16). Think ASPs make sense. Internet Week, 48.

Camsuzou, C. (2001). The ecommerce vision of the credit services department. PPG Industries IT Strategy, 15-17.

Chen, P. (1976). The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1(1), 9-36.

Craig, J., & Jutla, D. (2001). eBusiness readiness: A customer-focused framework. Upper Saddle River, NJ: Addison Wesley Publishers.

Creswell, J. (1994). Research design: Qualitative and quantitative approaches. Thousand Oaks, CA: Sage Publications.

Forcht, K.A., & Cochran, K. (1999). Using data mining and data warehousing techniques. Industrial Management and Data Systems, 189-196.

Geertz, C. (1973). Emphasizing interpretation from The interpretation of cultures. New York: Basic Books, Inc.

Inmon, W. (1996). Building the data warehouse. New York: Wiley Computer Publishing.

Spradley, J.P. (1979). The ethnographic interview. Orlando, FL: Harcourt Brace.

KEY TERMS

Drill Down: User interface technique to navigate into lower levels of information in decision support systems.

Ethnography: The work of describing a culture. Its essential core aims to understand another way of life from the native point of view.

Measurements: Dynamic, numeric values associated with dimensions, found through drilling down into lower levels of detail within decision support systems.

Online Analytical Processing Systems (OLAP): Technology-based solutions with data delivered in multiple dimensions to allow drilling down at multiple levels.

Requirements: Specifics based on defined criteria used as the basis for information systems design.

Systems Development Life Cycle: A controlled, phased approach to building information systems, from understanding wants and defining specifications through designing, coding and implementing the final solution.
Furthermore, it often happens that for a data problem it is possible to use more than one type of model class, with different underlying probabilistic assumptions. For example, for a problem of predictive classification it is possible to use logistic regression and tree models, as well as neural networks.

We also point out that model specification and, therefore, model choice is determined by the type of variables used. These variables can be the result of transformations or of the elimination of observations, following an exploratory analysis. We then need to compare models based on different sets of variables present at the start. For example, how do we compare a linear model with the original explanatory variables with one with a set of transformed explanatory variables?

To compare models, one can resort to the notion of distance between a model f, which underlies the data, and an approximating model g (see, for instance, Zucchini, 2000). Notable examples of distance functions are, for categorical variables: the entropic distance, which describes the proportional reduction of the heterogeneity of the dependent variable; the chi-squared distance, based on the distance from the case of independence; and the 0-1 distance, which leads to misclassification rates. For quantitative variables, the typical choice is the Euclidean distance, representing the distance between two vectors in a Cartesian space. Another possible choice is the uniform distance, applied when nonparametric models are being used.

Any of the previous distances can be employed to define the notion of discrepancy of a statistical model.
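As a concrete illustration of these distance functions, here is a minimal Python sketch (the function names and toy values are illustrative, not from the article):

```python
import math

def chi_squared_distance(f, g):
    # Chi-squared distance of a distribution g from a target f
    # (categorical case): sum of (f_i - g_i)^2 / g_i.
    return sum((fi - gi) ** 2 / gi for fi, gi in zip(f, g))

def zero_one_distance(actual, predicted):
    # The 0-1 distance leads directly to the misclassification rate.
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

def euclidean_distance(x, y):
    # Typical choice for quantitative variables: distance between
    # two vectors in a Cartesian space.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

f = [0.5, 0.3, 0.2]
g = [0.4, 0.4, 0.2]
print(round(chi_squared_distance(f, g), 4))        # 0.05
print(zero_one_distance([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0
```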
Evaluation of Data Mining Methods
The discrepancy of a model, g, can be obtained by comparing the unknown probabilistic model, f, and the best parametric statistical model. Since f is unknown, closeness can be measured with respect to a sample estimate of the unknown density f. A common choice of discrepancy function is the Kullback-Leibler divergence, which can be applied to any type of observations. In this context, the best model can be interpreted as the one with a minimal loss of information from the true unknown distribution.

It can be shown that the statistical tests used for model comparison are generally based on estimators of the total Kullback-Leibler discrepancy; the most used is the log-likelihood score. Statistical hypothesis testing is based on subsequent pairwise comparisons of the log-likelihood scores of alternative models. Hypothesis testing allows one to derive a threshold below which the difference between two models is not significant and, therefore, the simpler model can be chosen.

Therefore, with statistical tests it is possible to make an accurate choice among the models. The defect of this procedure is that it allows only a partial ordering of models, requiring a comparison between model pairs; therefore, with a large number of alternatives it is necessary to make heuristic choices regarding the comparison strategy (such as choosing among the forward, backward and stepwise criteria, whose results may diverge). Furthermore, a probabilistic model must be assumed to hold, and this may not always be possible.

Criteria Based on Scoring Functions

A less structured approach has been developed in the field of information theory, giving rise to criteria based on score functions. These criteria give each model a score, which puts them into some kind of complete order. We have seen how the Kullback-Leibler discrepancy can be used to derive statistical tests to compare models. In many cases, however, a formal test cannot be derived. For this reason, it is important to develop scoring functions that attach a score to each model. The Kullback-Leibler discrepancy estimator is an example of such a scoring function that, for complex models, can often be approximated asymptotically. A problem with the Kullback-Leibler score is that it depends on the complexity of a model as described, for instance, by the number of parameters. It is thus necessary to employ score functions that penalise model complexity.

The most important of such functions is the AIC (Akaike Information Criterion) (Akaike, 1974). From its definition, notice that the AIC score essentially penalises the log-likelihood score with a term that increases linearly with model complexity. The AIC criterion is based on the implicit assumption that q remains constant when the size of the sample increases. However, this assumption is not always valid, and therefore the AIC criterion does not lead to a consistent estimate of the dimension of the unknown model. An alternative, and consistent, scoring function is the BIC criterion (Bayesian Information Criterion), also called SBC, formulated by Schwarz (1978). As can be seen from its definition, the BIC differs from the AIC only in the second part, which now also depends on the sample size n. Compared to the AIC, when n increases the BIC favours simpler models. As n gets large, the first term (linear in n) will dominate the second term (logarithmic in n). This corresponds to the fact that, for a large n, the variance term in the mean squared error expression tends to be negligible. We also point out that, despite the superficial similarity between the AIC and the BIC, the first is usually justified by resorting to classical asymptotic arguments, while the second by appealing to the Bayesian framework.

To conclude, the scoring function criteria for selecting models are easy to calculate and lead to a total ordering of the models. From most statistical packages we can get the AIC and BIC scores for all the models considered. A further advantage of these criteria is that they can also be used to compare non-nested models and, more generally, models that do not belong to the same class (for instance, a probabilistic neural network and a linear regression model).

However, the limit of these criteria is the lack of a threshold, as well as the difficult interpretability of their measurement scale. In other words, it is not easy to determine whether the difference between two models is significant, and how it compares to another difference. These criteria are indeed useful in a preliminary exploratory phase. To examine these criteria and to compare them with the previous ones see, for instance, Zucchini (2000) or Hand, Mannila, & Smyth (2001).

Bayesian Criteria

A possible compromise between the previous two criteria is the Bayesian criteria, which can be developed in a rather coherent way (see, e.g., Bernardo & Smith, 1994). They appear to combine the advantages of the two previous approaches: a coherent decision threshold and a complete ordering. One of the problems that may arise is connected to the absence of general-purpose software. For data mining works using Bayesian criteria the reader could see, for instance, Giudici (2001), Giudici & Castelo (2003) and Brooks et al. (2003).

Computational Criteria

The intensive, widespread use of computational methods has led to the development of computationally intensive model comparison criteria. These criteria are usually
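Both scores follow directly from their standard definitions, AIC = -2 log L + 2q and BIC = -2 log L + q log(n). A small Python sketch (the log-likelihoods and parameter counts below are invented for illustration) shows how the BIC penalty depends on the sample size n:

```python
import math

def aic(log_likelihood: float, q: int) -> float:
    # AIC penalises the log-likelihood with a term linear in the
    # number of free parameters q.
    return -2.0 * log_likelihood + 2.0 * q

def bic(log_likelihood: float, q: int, n: int) -> float:
    # BIC differs only in the second part, which also depends on the
    # sample size n, so it favours simpler models as n grows.
    return -2.0 * log_likelihood + q * math.log(n)

# Toy comparison on the same n = 100 observations: model B fits a
# little better but uses many more parameters.
n = 100
loglik_a, q_a = -150.0, 3
loglik_b, q_b = -147.0, 10

print(aic(loglik_a, q_a))      # 306.0
print(aic(loglik_b, q_b))      # 314.0
print(bic(loglik_a, q_a, n))   # about 313.82
print(bic(loglik_b, q_b, n))   # about 340.05
```

Here both criteria prefer the simpler model, and the gap under the BIC is wider than under the AIC, illustrating the consistency argument in the text.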
based on using a dataset different from the one being analysed (external validation), and they are applicable to all the models considered, even when they belong to different classes (for example, in the comparison between logistic regression, decision trees and neural networks, even when the latter two are non-probabilistic). A possible problem with these criteria is that they take a long time to be designed and implemented, although general-purpose software has made this task easier.

The most common of such criteria is based on cross-validation. The idea of the cross-validation method is to divide the sample into two sub-samples: a training sample, with n - m observations, and a validation sample, with m observations. The first sample is used to fit a model and the second is used to estimate the expected discrepancy or to assess a distance. Using this criterion, the choice between two or more models is made by evaluating an appropriate discrepancy function on the validation sample. Notice that the cross-validation idea can be applied to the calculation of any distance function.

One problem regarding the cross-validation criterion is deciding how to select m, that is, the number of observations contained in the validation sample. For example, if we select m = n/2, then only n/2 observations would be available to fit a model. We could reduce m, but this would mean having few observations for the validation sampling group and would therefore reduce the accuracy with which the choice between models is made. In practice, proportions of 75% and 25% are usually used, respectively, for the training and the validation samples.

To summarise, these criteria have the advantage of being generally applicable but have the disadvantage of taking a long time to be calculated and of being sensitive to the characteristics of the data being examined. A way to overcome this problem is to consider model combination methods, such as bagging and boosting. For a thorough description of these recent methodologies, see Hastie, Tibshirani, & Friedman (2001).

Business Criteria

One last group of criteria seems specifically tailored to the data mining field. These are criteria that compare the performance of the models in terms of their relative losses, connected to the errors of approximation made by fitting data mining models. Criteria based on loss functions have appeared recently, although related ideas have long been known in Bayesian decision theory (see, for instance, Bernardo & Smith, 1994). They have great application potential, although at present they are mainly concerned with classification problems. For a more detailed examination of these criteria the reader can see, for example, Hand (1997), Hand, Mannila, & Smyth (2001), or the reference manuals on data mining software, such as that of SAS Enterprise Miner (SAS Institute, 2004).

The idea behind these methods is to focus the attention, in the choice among alternative models, on the utility of the obtained results. The best model is the one that leads to the least loss.

Most of the loss function based criteria are based on the confusion matrix. The confusion matrix is used as an indication of the properties of a classification rule. On its main diagonal it contains the number of observations that have been correctly classified for each class. The off-diagonal elements indicate the number of observations that have been incorrectly classified. If it is assumed that each incorrect classification has the same cost, the proportion of incorrect classifications over the total number of classifications is called the rate of error, or misclassification error, and it is the quantity that must be minimised. The assumption of equal costs can be replaced by weighting errors with their relative costs.

The confusion matrix gives rise to a number of graphs that can be used to assess the relative utility of a model, such as the lift chart and the ROC curve (see Giudici, 2003). The lift chart puts the validation set observations, in increasing or decreasing order, on the basis of their score, which is the probability of the response event (success), as estimated on the basis of the training set. Subsequently, it subdivides such scores into deciles. It then calculates and graphs the observed probability of success for each of the decile classes in the validation set. A model is valid if the observed success probabilities follow the same order (increasing or decreasing) as the estimated ones. Notice that, in order to be better interpreted, the lift chart of a model is usually compared with a baseline curve, for which the probability estimates are drawn in the absence of a model, that is, taking the mean of the observed success probabilities.

The ROC (Receiver Operating Characteristic) curve is a graph that also measures the predictive accuracy of a model. It is based on four conditional frequencies that can be derived from a model and the choice of a cut-off point for its scores: a) the observations predicted as events and effectively such (sensitivity); b) the observations predicted as events and effectively non-events; c) the observations predicted as non-events and effectively events; d) the observations predicted as non-events and effectively such (specificity). The ROC curve is obtained by representing, for any fixed cut-off value, a point in the plane having as x-value the false positive value (1-specificity) and as y-value the sensitivity value. Each point on the curve therefore corresponds to a particular cut-off. In terms of model comparison, the best curve is the one that is leftmost, the ideal one coinciding with the y-axis.
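The four conditional frequencies, the misclassification rate and a single ROC point can be sketched as follows (a hedged Python illustration; the coding of events as 1 and non-events as 0, and the toy data, are assumptions of the sketch):

```python
def confusion_counts(actual, predicted):
    # The four cells of the confusion matrix: tp and tn sit on the
    # main diagonal; fp and fn are the off-diagonal misclassifications.
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

def roc_point(actual, scores, cutoff):
    # One point on the ROC curve for a given cut-off:
    # x = 1 - specificity (false positive rate), y = sensitivity.
    predicted = [1 if s >= cutoff else 0 for s in scores]
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    return 1 - tn / (tn + fp), tp / (tp + fn)

actual = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]
tp, tn, fp, fn = confusion_counts(actual, predicted)
print((fp + fn) / len(actual))          # misclassification rate: 0.25
print(roc_point(actual, scores, 0.5))   # (0.25, 0.75)
```

Sweeping the cut-off from 1 down to 0 and plotting each such point traces the full ROC curve.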
To summarise, criteria based on loss functions have the advantage of being easy to understand but, on the other hand, they still need formal improvements and mathematical refinements. Comparison criteria based on business quantities are extremely useful.
…following equation:

BIC = -2 log L(θ; x_1, ..., x_n) + q log(n)

…parameters, which minimizes the distance with respect to f.

Chi-Squared Distance: The chi-squared distance of a distribution g from a target distribution f is:

χ²_d = Σ_i (f_i - g_i)² / g_i

Log-Likelihood Score: The log-likelihood score is defined by

-2 Σ_{i=1..n} log[p(x_i)]

Δ_2(f, p) = Σ_{i=1..n} (f(x_i) - p(x_i))²
Figure 1. An example showing how a histogram is formed from results obtained from a data cube
Evolution of Data Cube Computational Approaches
products at a range of the time-frame on a particular location.

The second problem relates to roll-up totals and sub-totals for drill-down. Reports commonly aggregate data initially at a coarse level, and then at successively finer levels. This type of report is difficult to produce with the normal SQL constructs. However, the cube-by operator is able to present the roll-up totals and sub-totals for drill-down easily.

The third problem relates to cross-tabulation, which is difficult to construct with the current standard SQL. The symmetric aggregation result is a cross-tabulation table, or cross tab for short (known as a pivot table in spreadsheets). Using the cube-by operator, cross tab data can be readily obtained and routinely displayed in the more compact format shown in Figure 2. This cross tab is a two-dimensional aggregation within the red-dotted line. If we add another location such as L002, it becomes a 3D aggregation.

In summary, representing aggregate data in a relational data model with standard SQL can be a difficult and daunting task. A six-dimensional cross-tab requires a 64-way union of 64 different group-by operators to build the underlying representation. This is an important reason why the use of group-bys is inadequate: the resulting representation of aggregation is too complex for optimal analysis.

MAIN THRUST

Birth of the Cube-by Operator

To overcome the difficulty with these SQL aggregation constructs, Gray et al. (1996) proposed using data cube operators (also known as cube-by) to conveniently support such aggregates. The data cube is identified as the core operator in data warehousing and OLAP (Lakshmanan, Pei & Zhao, 2003). The cube-by operator computes group-bys corresponding to all possible combinations in a list of attributes. An example of a data cube query is as follows:

SELECT Product, Year, City, SUM(amount)
FROM Sales
CUBE BY Product, Year, City

The above query produces the SUM of amount over all tuples in the database according to the 7 group-bys, i.e. (Product, Year, City), (Product, Year), (Product, City), (Year, City), (Product), (Year), (City). Lastly, the 8th group-by is denoted ALL; it contains an empty attribute list so as to make all group-by results union compatible. For example, a cube-by of three attributes (ABC) in a data cube query will generate eight, or 2³, group-bys ([ABC], [AB], [AC], [BC], [A], [B], [C] and [ALL]).

The most straightforward way to execute the data cube query is to rewrite it as a collection of eight group-by queries and execute them separately, as shown in Figure 3. This means that the eight group-by queries need to access the raw data eight times, which is likely to be quite expensive in execution time. As the number of dimension attributes increases, it becomes very expensive to compute the data cube, because the required computation cost grows exponentially with the number of dimension attributes. For instance, N dimension attributes in the cube-by will form 2^N group-bys. However, there are a number of ways in which this simple solution can be improved.
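The straightforward rewrite can be mimicked in a few lines of Python; this sketch (the `cube_by` helper and the toy Sales rows are illustrative, not from the article) scans the raw rows once per group-by, which is exactly the repeated-scan cost described above:

```python
from itertools import combinations

def cube_by(rows, dims, measure):
    """Naive CUBE BY: SUM(measure) for every subset of dims.

    Each of the 2^N group-bys re-scans the raw rows, mirroring the
    rewrite into 2^N separate GROUP BY queries.
    """
    result = {}
    for k in range(len(dims), -1, -1):
        for group in combinations(dims, k):
            agg = {}
            for row in rows:
                key = tuple(row[d] for d in group)  # () is the ALL group-by
                agg[key] = agg.get(key, 0) + row[measure]
            result[group] = agg
    return result

# Tiny Sales relation with dimensions Product, Year, City.
sales = [
    {"Product": "Cup", "Year": 2003, "City": "Morwell", "amount": 10},
    {"Product": "Cup", "Year": 2004, "City": "Morwell", "amount": 5},
    {"Product": "Plate", "Year": 2003, "City": "Churchill", "amount": 7},
]
cube = cube_by(sales, ["Product", "Year", "City"], "amount")
print(len(cube))                     # 8 group-bys for 3 attributes
print(cube[("Product",)][("Cup",)])  # 15
print(cube[()][()])                  # 22 (the ALL group-by)
```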
Figure 3. A straightforward way to execute each of the group-bys separately

Figure 4. A lattice of the cube-by operator
Table 1 illustrates the increase in the number of paths and group-bys whenever one attribute is added. For example, three attributes [ABC] result in a 3-D cuboid level and 12 paths; adding one more attribute [ABCD] generates a 4-D cuboid level and 32 paths. When the number of attributes is small, the number of group-bys and the number of paths are relatively similar. However, when the number of attributes increases, the number of paths increases significantly faster, by a factor of Np-1, where p is the path number. For instance, when Na is equal to 2, Ng and Np are both equal to 4; when Na is equal to 3, Ng is 2³ = 8 and Np = (2³ + 4) = 12. It is important to note that, with the number of different paths available, it is difficult to decide which path should be used, since some of the group-bys can be computed by using others.

Hierarchies in Lattice Diagram

In real-life environments, the dimensions of a data cube normally consist of more than one attribute, and the dimensions are organized as hierarchies of these attributes. Harinarayan, Rajaraman, & Ullman (1996) used a simple time dimension example to illustrate a hierarchy: day, month, and year in Figure 6. Hierarchies are very important, as they form the basis of two frequently used querying operations: drill-down and roll-up. Drill-down is the process of viewing data at gradually more detailed levels, while roll-up is just the opposite. More details can be found in (Ramakrishnan & Gehrke, 2003, p. 852). Figure 7 shows an example of a real-life data set, which can be viewed conceptually as a two-dimensional array with hierarchies on the dimensions. The Lis represent stores and the Pis represent products. Stores L1-L3 are in Churchill, while L4-L6 are in Morwell, and these roll up into two towns. Products P1-P3 are of type Cup, while products P4-P5 are of type Plate. Both cup and plate are further grouped into the category Kitchenware. The xs are sales volumes; entries that are blank correspond to (product, store) combinations for which there are no sales.

[Figure 6. A hierarchy example of time attributes: Day rolls up into Week and Month.]

In summary, there are two types of query dependencies: dimension dependency and attribute dependency. Dimension dependency is present when the different dimensions interact with one another, as shown in Figures 4 and 5. Attribute dependency is introduced within a dimension by attribute hierarchies, as shown in Figures 6 and 7.

Optimisations in Existing Approaches

Sarawagi, Agrawal, and Megiddo (1996) adapted these methods to compute group-bys by incorporating a number of optimizations:

Smallest-Parent: This optimization was first proposed in (Gray et al., 1997). It aims at computing a group-by from the smallest (in terms of either the closest parent group-by or the size of the parent group-by) of the previously computed group-bys. Each group-by can be computed from a number of other group-bys. Figures 8 and 9 show a three-attribute cube (ABC). There are a number of options for computing a group-by A from its parent group-bys (ABC, AB, or AC). For example, A can be computed from ABC, AB or AC. Figure 8 shows an example where it is a better choice to compute A from the smallest or closest parent group-bys, AB or AC, rather than from ABC. Another example shows a scenario where the size of the smallest parent is smaller than the other (Figure 9): AC is a better choice than AB in terms of size.
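These counts can be verified by materialising the lattice of group-bys. In the sketch below (illustrative Python; the function and variable names are mine), a three-attribute cube yields 2³ = 8 group-bys and 12 parent-to-child edges, and four attributes yield 16 group-bys and 32 edges, matching the counts above:

```python
from itertools import combinations

def search_lattice(attrs):
    # Vertices: every group-by (subset of attrs); level k holds the
    # group-bys with exactly k attributes.
    levels = {k: [frozenset(c) for c in combinations(attrs, k)]
              for k in range(len(attrs), -1, -1)}
    # Directed edge i -> j whenever j can be generated from i and j
    # has exactly one attribute less than i (i is the parent of j).
    edges = [(i, j)
             for k, nodes in levels.items()
             for i in nodes
             for j in levels.get(k - 1, [])
             if j < i]
    return levels, edges

levels, edges = search_lattice("ABC")
print(sum(len(v) for v in levels.values()))        # 8 group-bys
print(len(edges))                                  # 12 parent-child edges
print((frozenset("AB"), frozenset("A")) in edges)  # True
```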
Figure 8. AB or AC is clearly better choice of computing Figure 9. An example of different size in AB and AC
A
Figure 10. Reduction in disk I/O by computing A from AB Figure 11. Reduction of disk reads by computing as many
instead of AC group-bys as possible in the memory
other group-bys are computed to reduce disk I/O. has brought equal items together, and duplicate
Figure 10 shows an example AB is better choice to removal will then be easy (Figure 12).
compute A as compared to AC. The reason is that Share-Partitions: This optimization is specific to
A can be computed while AB is still in memory so the hash-based algorithms and share the partition
that there is no disk I/O cost involved. costs across multiple group-bys. Data to be aggre-
Amortize-Scans: This optimization aims at amortiz- gated is usually too large for the hash-tables to fit
ing disk reads by computing as many group-bys as in memory. Hence, the conventional way to deal
possible, together in memory. In Figure 11 as many with limited memory when constructing hash tables
group-bys as possible are computed such as AB, is to partition the data on one or more attributes. If
AC and A from ABC while ABC is still in the memory. data is partitioned on an attribute, say A, then all
Thus, there is no need to involve disk read. group-bys that contain A can be computed by
Share-Sorts: This optimization is specific to the independently grouping on each partition.
sort-based algorithms and aims at sharing sorting
cost across multiple group-bys. Share-sort is using Unfortunately, the above optimizations are often
data sorted in a particular order to compute all contradictory for OLAP databases especially when the
group-bys that are prefixes of that order. There is no size of the data to be aggregated is usually much larger
need to re-sort for the subsequent group-bys as it than the available main memory. For instance, a group-by
473
TEAM LinG
Evolution of Data Cube Computational Approaches
Figure 12. An example of share-sorts Figure 13. An example of sharing partitioning cost
A
A sorted
AB sorted AB AC
[A] can be computed from one of the several parent group- by for each group-by and this has resulted in biased
bys ([AB] [ABC] [AC]), but the bigger one AC (in term of towards optimizing for smallest-parent. Second
size) is in memory and the smallest one AB is not. In this optimization, share-partitions, is achieved by com-
case, based on the Cache-results optimization, AC maybe puting from the same partition all group-bys that
is a better choice. contain the partitioning attribute. Third and fourth
With the possible optimization techniques, Agarwal optimizations are achieved when computing a
et al. (1996) suggested two proposed approaches, which subtree, the algorithm maintains all hash-tables of
are basically sort-based PipeSort and hash-based group-bys in the subtree in memory until all its
PipeHash. However, there is a need for some global children are created and also for each group-by,
planning, which uses the search lattice introduced in therefore its children can be computed in one scan
(Harinarayan et al., 1996). of the group-by. However, the limitation of PipeHash
algorithm relates to the NP-Hard problem especially
Search Lattice in minimizing overall disk scan cost.
Overlap: Deshpande et al. (1998) proposed the
The search lattice is a graph where a vertex represents a OVERLAP algorithm for data cube computation and
group-by of the cube as shown in Figure 14. A directed it based on sorting-based method. The Overlap
edge connects group-by i to group-by j whenever j can be algorithm is executed in four stages. The overlap
generated from i and j has exactly one attribute less than algorithm has minimized the number of sorting steps
i (i is called the parent of j, for instance, AB is called the required to compute many sub-aggregates and also
parent of A). Level k of the search lattice denotes all minimized the number of disk accesses by overlap-
group-bys that contain exactly k attributes.

Data Cube Computation Methodology

PipeSort: Agrawal et al. (1996) have incorporated share-sorts, cache-results, and amortize-scans into the PipeSort algorithm. The aim is to obtain minimum total cost and to use pipelining to achieve cache-results and amortize-scans. The main limitation is that PipeSort does not scale well with respect to the number of cube-by attributes, as it performs one sort operation for the pipelined evaluation of each path. When the underlying relation is sparse and much larger than available memory, many of the cuboids that PipeSort sorts are also larger than the available memory; hence, performing PipeSort results in a considerable amount of I/O.

PipeHash: The PipeHash algorithm is able to include the four stated optimizations only if the memory is available. The first optimization is smallest-parent, where PipeHash has fixed the parent group-

ping the computation of the cuboids. However, the limitation of the Overlap algorithm is that the memory may not be large enough to store more than one such partition simultaneously. This is because each of the O(k) nodes has O(k) such descendants, so there are at least O(k²) nodes in the search tree that involve additional disk I/O. As a result, the total I/O cost of OVERLAP is at least quadratic in k for sparse data.

Partitioning: Ross et al. (1997) have taken into consideration the fact that real data is frequently sparse. Sparsity exists for two reasons: (a) large domain sizes of some cube-by attributes and (b) a large number of cube-by attributes in the data cube query. In Ross et al. (1997), large relations are partitioned into fragments that fit in memory, so there is always enough memory to hold the fragments of a large relation. The memory-cube is similar to PipeSort in that it computes the various cuboids (group-bys) of the data cube using the idea of pipelined paths. The memory-cube performs multiple in-memory sorts and does not incur any I/O beyond the input of the relation and the output of the data cube itself.
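The smallest-parent idea above can be made concrete with a small sketch (not any of the published algorithms; the function name `compute_cube` is illustrative). It computes every cuboid of a toy relation, materializing each k-attribute group-by from an already computed (k+1)-attribute parent chosen to be as small as possible, so only the full group-by ever scans the base relation:

```python
from itertools import combinations

def compute_cube(rows, dims, measure):
    """Compute every cuboid (group-by) of `rows` over subsets of `dims`.

    Each k-attribute cuboid is aggregated from an already computed
    (k+1)-attribute parent, picked by the smallest-parent heuristic,
    so only the full group-by ever scans the base relation.
    """
    full = tuple(sorted(dims))
    agg = {}
    for r in rows:                                   # single scan of the base data
        key = tuple(r[d] for d in full)
        agg[key] = agg.get(key, 0) + r[measure]
    cuboids = {full: agg}

    for k in range(len(full) - 1, -1, -1):           # largest subsets first
        for subset in combinations(full, k):
            parents = [p for p in cuboids
                       if len(p) == k + 1 and set(subset) <= set(p)]
            parent = min(parents, key=lambda p: len(cuboids[p]))  # smallest parent
            idx = [parent.index(d) for d in subset]
            agg = {}
            for key, v in cuboids[parent].items():   # aggregate the parent, not the base
                sub = tuple(key[i] for i in idx)
                agg[sub] = agg.get(sub, 0) + v
            cuboids[subset] = agg
    return cuboids
```

On a relation with dimensions store and item, this yields the four cuboids of the lattice, from the full group-by down to the grand total.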
Evolution of Data Cube Computational Approaches
Harinarayan, V., Rajaraman, A., & Ullman, J. (1996, June). Implementing data cubes efficiently. International ACM SIGMOD Conference (pp. 205-216), Montreal, Canada.

Lakshmanan, L.V.S., Pei, J., & Zhao, Y. (2003, September). Efficacious data cube exploration by semantic summarization and compression. International VLDB Conference (pp. 1125-1128), Berlin, Germany.

Lu, H.J., Yu, J.X., Feng, L., & Li, Z.X. (2003). Fully dynamic partitioning: Handling data skew in parallel data cube computation. Distributed & Parallel Databases, 13, 181-202.

Ramakrishnan, R., & Gehrke, J. (2003). Database management systems. NY: McGraw-Hill.

Ross, K.A., & Srivastava, D. (1997, August). Fast computation of sparse datacubes. International VLDB Conference (pp. 116-185), Athens, Greece.

Sarawagi, S., Agrawal, R., & Megiddo, N. (1998, March). Discovery-driven exploration of OLAP data cubes. International EDBT Conference (pp. 168-182), Valencia, Spain.

Silberschatz, A., Korth, H., & Sudarshan, S. (2002). Database system concepts. NY: McGraw-Hill.

Tan, R.B.N., Taniar, D., & Lu, G.J. (2003, March). Efficient execution of parallel aggregate data cube queries in data warehouse environments. International IDEAL Conference (pp. 709-716), Hong Kong, China.

Taniar, D., & Tan, R.B.N. (2002, May). Parallel processing of multi-join expansion_aggregate data cube query in high performance database systems. International I-SPAN Conference (pp. 51-58), Manila, Philippines.

Zhao, Y., Deshpande, P., Naughton, J., & Shukla, A. (1998, June). Simultaneous optimization and evaluation of multiple dimensional queries. International ACM SIGMOD Conference (pp. 271-282), Seattle, Washington.

KEY TERMS

Amortize-Scans: Amortizing disk reads by computing as many group-bys as possible simultaneously in memory.

Attribute Dependency: Arises within a dimension as a result of attribute hierarchies.

Cache-Results: Computing a group-by from the in-memory result of another group-by rather than rescanning the data.

Data Cube: The core operator in data warehousing and OLAP (Lakshmanan, Pei & Zhao, 2003).

Data Cube Operator: Computes the group-bys corresponding to all possible combinations of attributes in the cube-by clause.

Dependence Relation: Relates to a data cube query in which some of the group-by queries can be answered using the results of others.

Dimension Dependency: Arises when the different dimensions interact with one another.

OLAP: A technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis.

Smallest-Parent: The parent from which a group-by is computed, chosen in terms of either the closest parent group-by or the size of the parent group-by.
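The Amortize-Scans term can be illustrated with a minimal sketch (the function name `scan_once` is hypothetical): several group-bys are accumulated during a single pass over the relation instead of paying one scan per group-by:

```python
def scan_once(rows, groupings, measure):
    """Amortize-Scans: update several group-bys during one pass over the
    relation, instead of paying one scan per group-by."""
    results = {g: {} for g in groupings}
    for r in rows:                       # the single scan
        for g in groupings:
            key = tuple(r[a] for a in g)
            results[g][key] = results[g].get(key, 0) + r[measure]
    return results
```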
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

Evolutionary Computation and Genetic Algorithms
crossover in a largely uniform population only serves to propagate innovations originally found by mutation, and that in a non-uniform population, crossover is nearly always equivalent to a very large mutation, which is likely to be catastrophic.

In the field of GEC, basic building blocks for solutions to engineering problems primarily have been characterized using schema theory, which has been critiqued as being insufficiently exact to characterize the expected convergence behavior of a GA. Proponents of schema theory have shown that it provides useful normative guidelines for the design of GAs and automated control of high-level GA properties (e.g., population size, crossover parameters, and selection pressure).

Recent and current research in GEC relates certain evolutionary algorithms to ant colony optimization (Parpinelli, Lopes & Freitas, 2002).

CONCLUSION

Genetic algorithms provide a comprehensive search methodology for machine learning and optimization. They have been shown to be efficient and powerful through many data-mining applications that use optimization and classification. The current literature (Goldberg, 2002; Wikipedia, 2004) contains several general observations about the generation of solutions using a genetic algorithm:

GAs are sensitive to deceptivity, the irregularity of the fitness landscape. This includes locally optimal solutions that are not globally optimal, lack of a fitness gradient for a given step size, and jump discontinuities in fitness.

In general, GAs have difficulty with adaptation to dynamic concepts or objective criteria. This phenomenon, called concept drift in supervised learning and data mining, is a problem because GAs traditionally are designed to evolve highly fit solutions (populations containing building blocks of high relative and absolute fitness) with respect to stationary concepts.

GAs are not always effective at finding globally optimal solutions but can rapidly locate good solutions, even for difficult search spaces. This makes steady-state GAs (i.e., Bayesian optimization GAs that collect and integrate solution outputs after convergence to an accurate representation of building blocks) a useful alternative to generational GAs (maximization GAs that seek the best individual of the final generation after convergence).

Looking ahead to future opportunities and challenges in data mining, genetic algorithms are widely applicable to classification by means of inductive learning. GAs also provide a practical method for optimization of data preparation and data transformation steps. The latter includes clustering, feature selection and extraction, and instance selection. In data mining, GAs likely are most useful where high-level, fitness-driven search is needed. Non-local search (global search or search with an adaptive step size) and multi-objective data mining are also problem areas where GAs have proven promising.

REFERENCES

Atkinson-Abutridy, J., Mellish, C., & Aitken, S. (2003). A semantically guided and domain-independent evolutionary model for knowledge discovery from texts. IEEE Transactions on Evolutionary Computation, 7(6), 546-560.

Au, W.-H., Chan, K.C.C., & Yao, X. (2003). A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation, 7(6), 532-545.

Cano, J.R., Herrera, F., & Lozano, M. (2003). Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study. IEEE Transactions on Evolutionary Computation, 7(6), 561-575.

Cantú-Paz, E. (2000). Efficient and accurate parallel genetic algorithms. Norwell, MA: Kluwer.

Cantú-Paz, E., & Kamath, C. (2003). Inducing oblique decision trees with evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 7(1), 54-68.

De Jong, K.A., Spears, W.M., & Gordon, F.D. (1993). Using genetic algorithms for concept learning. Machine Learning, 13, 161-188.

Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.

Goldberg, D.E. (2002). The design of innovation: Lessons from and for competent genetic algorithms. Norwell, MA: Kluwer.

González, F.A., & Dasgupta, D. (2003). Anomaly detection using real-valued negative selection. Genetic Programming and Evolvable Machines, 4(4), 383-403.

Hall, L.O., Ozyurt, I.B., & Bezdek, J.C. (1999). Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation, 3(2), 103-112.
Zhou, C., Xiao, W., Tirpak, T.M., & Nelson, P.C. (2003). Evolving accurate and compact classification rules with gene expression programming. IEEE Transactions on Evolutionary Computation, 7(6), 519-531.

Permutation GA: A type of GA where individuals represent a total ordering of elements, such as cities to be visited in a minimum-cost graph tour (the Traveling Salesman Problem). Permutation GAs use specialized crossover and mutation operators compared to the more common bit-string GAs.
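A minimal sketch of the specialized operators a Permutation GA needs. Order crossover (OX) is one of several published permutation operators; this is an illustrative implementation, not a reference one:

```python
import random

def swap_mutation(tour, rng):
    """Exchange two cities; the result is still a valid permutation."""
    t = tour[:]
    i, j = rng.sample(range(len(t)), 2)
    t[i], t[j] = t[j], t[i]
    return t

def order_crossover(p1, p2, rng):
    """Order crossover (OX): copy a slice from one parent, then fill the
    remaining positions with the other parent's cities in their order."""
    n = len(p1)
    i, j = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = p1[i:j + 1]
    fill = [c for c in p2 if c not in child]         # cities not yet placed
    for k in range(n):
        if child[k] is None:
            child[k] = fill.pop(0)
    return child
```

Both operators preserve the permutation property, which plain bit-string crossover and mutation would destroy.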
Schema (pl. Schemata): An abstract building block of a GA-generated solution, corresponding to a set of individuals. Schemata typically are denoted by bit strings with don't-care symbols # (e.g., 1#01#00# is a schema with 2^3 = 8 possible instances, one for each instantiation of the # symbols to 0 or 1). Schemata are important in GA research because they form the basis of an analytical approach called schema theory, for characterizing building blocks and predicting their proliferation and survival probability across generations, thereby describing the expected relative fitness of individuals in the GA.

Selection: In biology, a mechanism by which the fittest individuals survive to reproduce, and the basis of speciation according to the Darwinian theory of evolution. Selection in GP involves evaluation of a quantitative criterion over a finite set of fitness cases, with the combined evaluation measures being compared in order to choose individuals.

ENDNOTE

1. Payoff-driven reinforcement learning describes a class of learning problems for intelligent agents that receive rewards, or reinforcements, from the environment in response to actions selected by a policy function. These rewards are transmitted in the form of payoffs, sometimes strictly non-negative. A GA acquires policies by evolving individuals, such as condition-action rules, that represent candidate policies.
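The schema notion above can be sketched directly (the helper names are illustrative):

```python
def matches(schema, individual):
    """True if the bit string is an instance of the schema: fixed positions
    must agree, '#' positions are unconstrained."""
    return all(s == "#" or s == b for s, b in zip(schema, individual))

def instances(schema):
    """Enumerate the 2**k bit strings a schema with k '#' wildcards denotes."""
    out = [""]
    for s in schema:
        choices = "01" if s == "#" else s
        out = [prefix + c for prefix in out for c in choices]
    return out
```

For the schema 1#01#00# from the definition above, `instances` enumerates its 8 instances, and `matches` verifies membership.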
Evolutionary Data Mining for Genomics

Clarisse Dhaenens
LIFL, University of Lille 1, France

El-Ghazali Talbi
LIFL, University of Lille 1, France
INTRODUCTION

Knowledge discovery from genomic data has become an important research area for biologists. Nowadays, a lot of data is available on the Web, but it is wrong to say that corresponding knowledge is also available. For example, the first draft of the human genome, which contains 3,000,000,000 letters, was achieved in June 2000, but, up to now, only a small part of the hidden knowledge has been discovered. This is the aim of bioinformatics, which brings together biology, computer science, mathematics, statistics, and information theory to analyze biological data for interpretation and prediction. Hence, many problems encountered while studying genomic data may be modeled as data mining tasks, such as feature selection, classification, clustering, or association rule discovery.

An important characteristic of genomic applications is the large amount of data to analyze, and, most of the time, it is not possible to enumerate all the possibilities. Therefore, we propose to model these knowledge discovery tasks as combinatorial optimization tasks in order to apply efficient optimization algorithms to extract knowledge from large datasets. To design an efficient optimization algorithm, several aspects have to be considered. The main one is the choice of the type of resolution method according to the characteristics of the problem. Is it an easy problem, for which a polynomial algorithm may be found? If the answer is yes, then let us design such an algorithm. Unfortunately, most of the time, the response to the question is no, and only heuristics that may find good but not necessarily optimal solutions can be used. In our approach, we focus on evolutionary computation, which has already shown an interesting ability to solve highly complex combinatorial problems.

In this article, we will show the efficacy of such an approach while describing the main steps required to solve data mining problems from genomics with evolutionary algorithms. We will illustrate these steps with a real problem.

BACKGROUND

Evolutionary data mining for genomics groups three important fields: evolutionary computation, knowledge discovery, and genomics.

It is now well known that evolutionary algorithms are well suited for some data mining tasks (Freitas, 2002). Here, we want to show the interest of dealing with genomic data thanks to evolutionary approaches. A first proof of this interest may be the recent book by Gary Fogel and David Corne, Evolutionary Computation in Bioinformatics, which groups several applications of evolutionary computation to problems in the biological sciences and, in particular, in bioinformatics (Fogel & Corne, 2002). In this article, several data mining tasks are addressed, such as feature selection or clustering, and solved thanks to evolutionary approaches.

Another proof of the interest of such approaches is the number of sessions around evolutionary computation in bioinformatics and computational biology that have been organized during the last Congress on Evolutionary Computation (CEC) in Portland, Oregon in 2004.

The aim of genomic studies is to understand the function of genes, to determine which genes are involved in a given process, and how genes are related. Hence, experiments are conducted, for example, to localize coding regions in DNA sequences and/or to evaluate the expression level of genes in certain conditions. Resulting from this, data available for the bioinformatics researcher may deal with DNA sequence information that is related to other types of data. The example used to illustrate this article may be classified in this category.

Another type of data deals with the recent technology called microarray, which allows the simultaneous measurement of the expression level of thousands of genes under different conditions (i.e., various time points of a process, absorption of different drugs, etc.). This new type of data requires specific data mining tasks, as the number of genes to study is very large and the
number of conditions may be limited. Classical questions are the classification or the clustering of genes based on their expression pattern, and commonly used approaches may vary from statistical approaches (Yeung & Ruzzo, 2001) to evolutionary approaches (Merz, 2002) and may use additional biological information, such as gene ontology (GO) (Speer, Spieth & Zell, 2004). Recently, bi-clustering, which allows the grouping of instances having similar characteristics for a subset of attributes (here, genes having the same expression patterns for a subset of conditions), has been applied to this type of data, and evolutionary approaches have been proposed (Bleuler, Prelić & Zitzler, 2004). In this context of microarray data analysis, association rule discovery also has been realized using evolutionary algorithms (Khabzaoui, Dhaenens & Talbi, 2004).

MAIN THRUST

In order to extract knowledge from genomic data using evolutionary algorithms, several steps have to be considered:

1. Identification of the knowledge discovery task from the biological problem under study;
2. Design of this task as an optimization problem;
3. Resolution using an evolutionary approach.

Hence, in this section, we will focus on each of these steps. First, we will present the genomic application that we will use to illustrate the rest of the article and indicate the knowledge discovery tasks that have been extracted. Then, we will show the challenges and some proposed solutions for the two other steps.

Genomics Application

The genomic problem under study is to formulate hypotheses on predisposition factors of different multi-factorial diseases, such as diabetes and obesity. In such diseases, one of the difficulties is that sane people can become affected during their life, so only the affected status is relevant. This work has been done in collaboration with the Biology Institute of Lille (IBL, France).

One approach aims to discover the contribution of environmental factors and genetic factors in the pathogenesis of the disease under study by discovering complex interactions, such as ([gene A and gene B] or [gene C and environmental factor D]), in one or more populations. The rest of the article will use this problem as an illustration.

To solve such a problem, the first thing is to formulate it into a classical data mining task. The difficulty of such a formulation is to identify the task. This work must be done through discussions and cooperation with biologists in order to agree on the objective of the problems. For example, in our data, identifying groups of people can be modeled as a clustering task, as we cannot take into account non-affected people. Moreover, a lot of loci have to be studied (3,652 points of comparison on the 23 chromosomes and two environmental factors), and classical clustering algorithms are not able to cope with so many points. So, we decided first to execute a feature selection in order to reduce the number of loci in consideration and to extract the most influential features that will be used for the clustering. Hence, the model of this problem is decomposed into two phases: feature selection and clustering.

From a Data Mining Task to an Optimization Problem

The most difficult aspect of turning a data mining task into an optimization problem is to define the criterion to optimize. The choice of the optimization criterion, which measures the quality of candidate knowledge to be extracted, is very important, and the quality of the results of the approach depends on it. Indeed, developing a very efficient method that does not use the right criterion will lead to obtaining the right answer to the wrong question. The optimization criterion can be either specific to the data mining task or dependent on the biological application. Several different choices exist. For example, considering gene clustering, the optimization criterion can be the minimization of the minimum sum-of-squares (MSS) (Merz, 2002), while for the determination of the members of a predictive gene group, the criterion can be the maximization of the classification success using a maximum likelihood (MLHD) classification method (Ooi & Tan, 2003).

Once the optimization criterion is defined, the second step of the design of the data mining task into an optimization problem is to define the encoding of a solution, which may be independent of the resolution method. For example, for clustering problems in gene expression mining with an evolutionary algorithm, Faulkenauer and Marchand (2001) use the specific CGA encoding that is dedicated to grouping problems and is well suited to clustering.

Regarding the genomic application used to illustrate this article, two phases have been isolated. For the feature selection, an optimization approach has been adopted, using an evolutionary algorithm (see next paragraph), whereas a classical approach (k-means) has been chosen for the clustering phase. Determining the optimization criterion for the feature selection was not an easy task, as it was difficult not to favor small sets of features.
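The two-phase model (feature selection, then clustering) can be sketched as follows. This is an illustrative toy, not the authors' system: the bit-string mask encodes a candidate feature subset, a plain k-means scores the projected data, and dividing by the subset size is only a crude stand-in for the corrective factor of Jourdan et al. (2002):

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iters=10):
    """Plain k-means; returns (labels, centers)."""
    centers = [list(c) for c in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def evaluate(mask, data, k, rng):
    """Score a feature subset encoded as a bit string: project the data
    onto the selected features, cluster, and return the within-cluster
    scatter normalized by subset size (so tiny subsets are not trivially
    favored; a crude stand-in for the article's corrective factor)."""
    feats = [i for i, bit in enumerate(mask) if bit]
    if not feats:
        return float("inf")                # empty subsets are invalid
    proj = [[row[i] for i in feats] for row in data]
    labels, centers = kmeans(proj, k, rng)
    scatter = sum(dist2(p, centers[l]) for p, l in zip(proj, labels))
    return scatter / len(feats)
```

An evolutionary algorithm would minimize `evaluate` over bit strings, and the best mask would then feed the final clustering phase.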
A corrective factor has been introduced (Jourdan et al., 2002).

Solving with Evolutionary Algorithms

Once the formalization of the data mining task into an optimization problem is done, resolution methods can be exact methods, specific heuristics, or metaheuristics. As the space of the potential knowledge is exponential in genomics problems (Zaki & Ho, 2000), exact methods are almost always discarded. The drawbacks of heuristic approaches are that it is difficult to cope with multiple solutions and not easy to integrate specific knowledge in a general approach. The advantage of metaheuristics is that you can define a general framework to solve the problem while specializing some agents in order to suit a specific problem. Genetic Algorithms (GAs), which represent a class of evolutionary methods, have given good results on hard combinatorial problems (Michalewicz, 1996).

In order to develop a genetic algorithm for knowledge discovery, we have to focus on the following:

Operators
Diversification mechanisms
Intensification mechanisms

Operators allow GAs to explore the search space and must be adapted to the problem. Generally, there are two classes of operators: mutation and crossover. The mutation allows diversity. For the feature selection task under study, the mutation flips n bits (Jourdan, Dhaenens & Talbi, 2001). The crossover produces one, two, or more children solutions by recombining two or more parents. The objective of this mechanism is to keep useful information from the parents in order to improve the solutions. In the considered problem, the subset-oriented common feature crossover operator (SSOCF) has been used. Its objective is to produce offspring that have the same distribution as the parents. This operator is well adapted for feature selection (Emmanouilidis, Hunter & MacIntyre, 2000). Another advantage of evolutionary algorithms is that you easily can use other data mining algorithms as an operator; for example, a k-means iteration may be used as an operator in a clustering problem (Krishna & Murty, 1999).

Working on knowledge discovery in a particular domain by optimization leads to the definition and use of several operators, where some may use domain knowledge, and others may be specific to the model and/or the encoding. To take advantage of all the operators, the idea is to use adaptive mechanisms (Hong, Wang & Chen, 2000) that help to adapt the application probabilities of these operators according to the progress they produce. Hence, operators that are less efficient are less used, which may change during the search if they become more efficient.

Diversification mechanisms are designed to avoid premature convergence. Several mechanisms exist. The more classical are sharing and the random immigrant. Sharing boosts the selection of individuals that lie in less crowded areas of the search space (Mahfoud, 1995). To apply such a mechanism, a distance between solutions has to be defined. In the feature selection for the genomic association discovery, a distance has been defined by integrating knowledge of the application domain. The distance is correlated to a Hamming distance, which integrates biological notions (chromosomal cut, inheritance notion, etc.).

A further approach to the diversification of the population, the random immigrant, introduces new individuals. An idea is to generate new individuals by recording statistics on previous selections.

Assessing the efficiency of such algorithms applied to genomic data is not easy, as most of the time, biologists have no exact idea about what must be found. Hence, one step in this analysis is to develop simulated data for which optimal results are known. In this manner, it is possible to measure the efficiency of the proposed method. For example, in the problem under study, predetermined genomic associations were constructed to form simulated data. Then, the algorithm was tested on these data and found these associations.

FUTURE TRENDS

There has been much work on evolutionary data mining for genomics. In order to be more efficient and to propose more interesting solutions for decision makers, researchers are investigating multi-criteria design of the data mining tasks. Indeed, we exposed that one of the critical phases was the determination of the optimization criterion, and it may be difficult to select a single one. In response to this problem, multi-criteria design allows us to take into account some criteria dedicated to a specific data mining task and some criteria coming from the application domain. Evolutionary algorithms that work on a population of solutions are well adapted to multi-criteria problems, as they can exploit Pareto approaches and propose several good solutions (i.e., solutions of best compromise).

For data mining in genomics, rule discovery has not been very well applied and should be studied carefully, as this is a very general model. Moreover, an interesting multi-criteria model has been proposed for this task and starts to give some interesting results by using multi-criteria genetic algorithms (Jourdan et al., 2004).
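The operators and the sharing mechanism described above can be sketched for a bit-string encoding. This is an illustrative toy: the real SSOCF operator and the biologically informed distance are more elaborate than what is shown here:

```python
import random

def flip_mutation(mask, n, rng):
    """Flip n randomly chosen bits of a feature-selection bit string."""
    m = mask[:]
    for i in rng.sample(range(len(m)), n):
        m[i] = 1 - m[i]
    return m

def subset_crossover(p1, p2, rng):
    """Subset-oriented crossover in the spirit of SSOCF (details assumed):
    features on which the parents agree are inherited; disputed features
    are taken from either parent at random."""
    return [a if a == b else rng.choice((a, b)) for a, b in zip(p1, p2)]

def shared_fitness(raw, population, index, radius):
    """Fitness sharing: divide a (maximized) raw fitness by a niche count
    built from Hamming distances, boosting less crowded individuals."""
    me = population[index]
    niche = 0.0
    for other in population:
        d = sum(x != y for x, y in zip(me, other))
        if d < radius:
            niche += 1.0 - d / radius    # triangular sharing kernel
    return raw / niche
```

With this scheme, an individual surrounded by near-duplicates sees its fitness divided by a larger niche count, while an isolated one keeps its raw fitness.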
Jourdan, L., Dhaenens, C., & Talbi, E.G. (2001). An optimization approach to mine genetic data. Proceedings of Biological Data Mining and Knowledge Discovery (METMBS'01).

Jourdan, L., Dhaenens, C., Talbi, E.-G., & Gallina, S. (2002). A data mining approach to discover genetic fac-

Bioinformatics: Field of science in which biology, computer science, and information technology merge into a single discipline.

Clustering: Data mining task in which the system has to classify a set of objects without any information on the characteristics of the classes.
Feature (or Attribute): Quantity or quality describing an instance.

Feature Selection: Task of identifying and selecting a useful subset of features from a large set of redundant, perhaps irrelevant, features.

Genetic Algorithm: Evolutionary algorithm using a population and based on the Darwinian principle of the survival of the fittest.

Multi-Factorial Disease: Disease caused by several factors. Often, multi-factorial diseases are due to two kinds of causality that interact: one is genetic (and often polygenetic) and the other is environmental.

Optimization Criterion: Criterion that gives the quality of a solution of an optimization problem.
Evolutionary Mining of Rule Ensembles
It is conjectured that maximizing the degree of interaction amongst the rules already available is critical for efficient learning (Kuncheva & Jain, 2000; Hand et al., 2001). A fundamental issue concerns then the extent to which tentative rules work together and are capable of influencing the learning of new rules. Conventional methods like Bagging and Boosting show at most moderate amounts of interaction in this sense. While Bagging and Boosting are useful, well-known data-mining tools, it is appropriate to explore other ensemble-learning ideas as well. In this article, I focus mainly on the CS algorithm. CS approaches provide interesting architectures and introduce complex nonlinear processes to model prediction and reinforcement. I discuss a specific CS algorithm and show how it opens interesting pathways for emergent cooperative behaviour.

Conventional Rule Assembly

In Bagging methods, different training samples are created by bootstrapping, and the same basic learning procedure is applied to each bootstrapped sample. In Bagging trees, predictions are decided by majority voting or by averaging the various opinions available in each case. This idea is known to reduce the basic instability of trees (Breiman, 1996).

A distinctive feature of the Boosting approach is the iterative calling of a basic weak learner (WL) algorithm (Schapire & Singer, 1998). Each time the WL is invoked, it takes as input the training set together with a dynamic (probability) weight distribution over the data and returns a single tree. The output of the algorithm is a weighted sum itself, where the weights are proportional to individual performance error. The WL algorithm needs only to produce moderately successful models. Thus, trees and simplified trees (stumps) constitute a popular choice. Several weight updating schemes have been proposed. Schapire and Singer update weights according to the success of the last model incorporated, whereas in their LogitBoost algorithm, Friedman, Hastie, and Tibshirani (2000) let the weights depend on overall probabilistic estimates. This latter idea better reflects the joint work of all classifiers available so far and hence should provide a more effective guide for the WL in general.

The notion of abstention brings a connection with the CS approach that will be apparent as I discuss the match set idea in the following sections. In standard boosting trees, each tree contributes a leaf to the overall prediction for any new x input data vector, so the number of expressing rules is the number of boosting rounds independently of x. In the system proposed by Cohen and Singer (1999), the WL essentially produces rules or single leaves C (rather than whole trees). Their classifiers are then maps taking only two values: a real number for those x verifying the leaf and 0 elsewhere. The final boosting aggregation for x is thus unaffected by all abstaining rules (with x ∉ C), so the number of expressing rules may be a small fraction of the total number of rules.

The General CS-Based Evolutionary Approach

The general classifier system (CS) architecture invented by John Holland constitutes perhaps one of the most sophisticated classes of evolutionary computation algorithms (Holland et al., 1986). Originally conceived as a model for cognitive tasks, it has been considered in many (simplified) forms to address a number of learning problems. The nowadays standard stimulus-response (or single-step) CS architecture provides a fascinating approach to the representation issue. Straightforward rules (classifiers) constitute the CS building blocks. CS algorithms maintain a population of such predictive rules whose conditions are hyperplanes involving the wild-card character #. If we generalize the idea of hyperplane to mean conjunctions of conditions on predictors, where each condition involves a single predictor, we see that these rules are also used by many other learning algorithms. Undoubtedly, hyperplane interpretability is a major factor behind this popularity.

Critical subsystems in CS algorithms are the performance, credit-apportionment, and rule discovery modules (Eiben & Smith, 2003). As regards credit-apportionment, the question has been raised recently about the suitability of endogenous reward schemes, where endogenous refers to the overall context in which classifiers act, versus other schemes based on intrinsic value measures (Booker, 2000). A well-known family of algorithms is XCS (and descendants), some of which have been previously advocated as data-mining tools (see, e.g., Wilson, 2001). The complexity of the CS dynamics has been analyzed in detail in Westerdale (2001).

The match set M = M(x) is the subset of matched (concurrently activated) rules, that is, the collection of all classifiers whose condition is verified by the input data vector x. The (point) prediction for a new x will be based exclusively on the information contained in this ensemble M.

A System Based on Support and Predictive Scoring

Support is a familiar notion in various data-mining scenarios. There is a general trade-off between support and
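The Bagging procedure described above can be sketched in a few lines. The 1-nearest-neighbor base learner is just an illustrative stand-in for the trees the article discusses:

```python
import random
from collections import Counter

def nn_learner(sample):
    """Toy base learner: 1-nearest-neighbor on (value, label) pairs."""
    def predict(x):
        return min(sample, key=lambda t: abs(t[0] - x))[1]
    return predict

def bagging_predict(train, x, learner, rounds, rng):
    """Bagging: fit the base learner on `rounds` bootstrap samples of the
    training set and predict by majority vote."""
    votes = []
    for _ in range(rounds):
        sample = [rng.choice(train) for _ in train]   # draw with replacement
        votes.append(learner(sample)(x))
    return Counter(votes).most_common(1)[0][0]
```

Because every bootstrap sample sees a slightly different training set, the vote smooths out the instability of any single fitted model.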
predictive accuracy: the larger the support, the lower the accuracy. The importance of explicitly bounding support (or deliberately seeking high support) has been recognized often in the literature (Greene & Smith, 1994; Friedman & Fisher, 1999; Muselli & Liberati, 2002). Because the world of generality intended by using only high-support rules introduces increased levels of uncertainty and error, statistical tools would seem indispensable for its proper modeling. To this end, classifiers in the BYPASS algorithm (Muruzábal, 2001) differ from other CS alternatives in that they enjoy (support-minded) probabilistic predictions (thus extending the more common single-label predictions). Support plays an outstanding role: a minimum support level b is input by the analyst at the outset, and consideration is restricted to rules with (estimated) support above b. The underlying predictive distributions, R, are easily constructed and coherently updated following a standard Bayesian Multinomial-Dirichlet process, whereas their toll on the system is minimal memory-wise. Actual predictions are built by first averaging the matched predictive distributions and then picking the maximum a posteriori label of the result. Hence, by promoting mixtures of probability distributions, the BYPASS algorithm connects readily with mainstream ensemble-learning methods.

We can sometimes find perfect regularities, that is, subsets C for which the conditional distribution of the response Y (given X ∈ C) equals 1 for some output class: P(Y=j | X ∈ C) = 1 for some j and 0 elsewhere. In the well-known multiplexer environment, for example, there exists a set of such classifiers such that 100% performance can be achieved. But in real situations, it will be difficult to locate strictly neat C unless its support is quite small. Moreover, putting too much emphasis on error-free behavior may increase the risk of overfitting; that is, we may infer rules that do not apply (or generalize poorly) over a test sample. When restricting the search to high-support rules, probability distributions are well equipped to represent high-uncertainty patterns. When the largest P(Y=j | X ∈ C) is small, it may be especially important to estimate P(Y=j | X ∈ C) for all j.

Furthermore, the use of probabilistic predictions R(j) for j=1, ..., k, where k is the number of output classes, makes possible a natural ranking of the M=M(x) assigned probabilities Ri(y) for the true class y related to the current x. Rules i with large Ri(y) (scoring high) are generally preferred in each niche. In fact, only a few rules are rewarded at each step, so rules compete with each other for the limited amount of resources. Persistent lack of reward means extinction; to survive, classifiers must get reward from time to time. Note that newly discovered, more effective rules with even better scores may cut dramatically the reward given previously to other rules in certain niches. The fitness landscape is thus highly dynamic, and rules with lower scores Ri(y) may get reward and survive provided they are the best so far at some niche. An intrinsic measure of fitness for a classifier (C, R) (such as the lifetime average score −log R(Y), where Y is conditioned by X ∈ C) could hardly play the same role.

It is worth noting that BYPASS integrates three learning modes in its classifiers: Bayesian at the data-processing level, reinforcement at the survival (competition) level, and genetic at the rule-discovery (exploration) level. Standard genetic algorithms (GAs) are triggered by system failure and always act circumscribed to M. Because the Bayesian updating guarantees that, in the long run, predictive distributions R reflect the true conditional probabilities P(Y=j | X ∈ C), scores become highly reliable to form the basis of learning engines such as reward or crossover selection. The BYPASS algorithm is sketched in Table 1. Note that utility reflects accumulated reward (Muruzábal, 2001). Because only matched rules get reward, high support is a necessary (but not sufficient) condition to have high utility. Conversely, low-uncertainty regularities need to comply with the (induced) bound on support. The background generalization rate P# (controlling the number of #s in the random C built along the run) is omitted for clarity, although some tuning of P# with regard to threshold u is often required in practice.
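The matching-and-averaging prediction scheme just described can be sketched in a few lines. This is an illustrative toy, not the published BYPASS implementation: the `Rule` class, the ternary condition strings, and the symmetric Dirichlet prior `alpha` are assumptions made for the example.

```python
import numpy as np

# Toy sketch of support-minded probabilistic rules: each rule carries a ternary
# condition over binary attributes ('#' = don't care) and per-class counts from
# which a Multinomial-Dirichlet predictive distribution R is derived.

class Rule:
    def __init__(self, condition, counts, alpha=1.0):
        self.condition = condition                     # e.g. "1#" over two binary attributes
        self.counts = np.asarray(counts, dtype=float)  # per-class observation counts
        self.alpha = alpha                             # symmetric Dirichlet prior (assumed value)

    def matches(self, x):
        return all(c == "#" or c == str(b) for c, b in zip(self.condition, x))

    def predictive(self):
        # Posterior mean of the Multinomial-Dirichlet model
        post = self.counts + self.alpha
        return post / post.sum()

def predict(rules, x):
    """Form the match set M(x), average the matched predictive
    distributions, and return the maximum a posteriori label."""
    match_set = [r for r in rules if r.matches(x)]
    if not match_set:
        return None                                    # the system abstains
    mixture = np.mean([r.predictive() for r in match_set], axis=0)
    return int(np.argmax(mixture))

rules = [Rule("1#", [8, 2]), Rule("#0", [1, 9]), Rule("##", [5, 5])]
print(predict(rules, (1, 0)))   # all three rules match here
```

Note how abstention falls out naturally: when no rule matches, the match set is empty and no point prediction is issued.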
BYPASS has been tested on various tasks under demanding b (and not very large h), and the results have been satisfactory in general. Comparatively smaller populations are used, and low-uncertainty but high-support rules are uncovered. Very high values for P# (as high as 0.975) have been tested successfully in some cases (Muruzábal, 2001). In the juxtaposed (or concatenated) multiplexer environment, BYPASS is shown to maintain a compact population of relatively high-uncertainty rules that solves the problem by bringing about appropriate match sets (nearly) all the time. Recent work by Butz, Goldberg, and Tharakunnel (2003) shows that XCS also solves this problem, although working at a lower level of support (generality).

To summarize, BYPASS does not rely on rule plurality for knowledge encoding because it uses compact probabilistic predictions (bounded by support). It requires no intrinsic value for rules and no added tailor-made heuristics. Besides, it tends to keep population size under control (with increased processing speed and memory savings). The ensembles (match sets) derived from evolution in BYPASS have shown good promise of cooperation.

FUTURE TRENDS

Quick interactive data-mining algorithms and protocols are nice when human judgment is available. When not, computer-intensive, autonomous algorithms capable of thoroughly squeezing the data are also nice for preliminary exploration and other purposes. In a sense, we should tend to rely on the latter to mitigate the nearly ubiquitous data overflow problem. Representation schemes and learning engines are crucial to the success of these unmanned agents and need, of course, further investigation. Ensemble methods have many appealing features and will be subject to further analysis and testing. Evolutionary algorithms will continue to rise and succeed in yet other application areas. Additional commercial spin-offs will keep coming. Although great progress has been made in identifying many key insights in the CS framework, some central points still need further discussion. Specifically, the idea of rules that perform their prediction following some kind of more elaborate computation is appealing, and indeed more functional representations of classifiers (such as multilayer perceptrons) have been proposed in the CS literature (see, e.g., Bull, 2002). On the theoretical side, a formal framework for more rigorous analysis in high-support learning is much needed. The task is not easy, however, because the target is somewhat more vague, and individual as well as collective interests should be brought to terms when evaluating the generality and uncertainty associated with rules. Also, further research should be conducted to clearly delineate the strengths of the various CS approaches against current alternative methods for rule ensemble formation and data mining.

CONCLUSION

Evolutionary rule mining is a successful, promising research area. Evolutionary algorithms constitute by now a very useful and wide class of stochastic optimization methods. The evolutionary CS approach is likely to provide interesting insights and cross-fertilization of ideas with other data-mining methods. The BYPASS algorithm discussed in this article has been shown to tolerate the high-support constraint well, leading to pleasant and unexpected results in some problems. These results stress the latent predictive power of the ensembles formed by high-uncertainty rules.

REFERENCES

Booker, L. B. (2000). Do we really need to estimate rule utilities in classifier systems? Lecture Notes in Artificial Intelligence, 1813, 125-142.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.

Bull, L. (2002). On using constructivism in neural classifier systems. Lecture Notes in Computer Science, 2439, 558-567.

Butz, M. V., Goldberg, D. E., & Tharakunnel, K. (2003). Analysis and improvement of fitness exploitation in XCS: Bounding models, tournament selection, and bilateral accuracy. Evolutionary Computation, 11(3), 239-277.

Cohen, W. W., & Singer, Y. (1999). A simple, fast, and effective rule learner. Proceedings of the 16th National Conference on Artificial Intelligence.

Eiben, A. E., & Smith, J. E. (2003). Introduction to evolutionary computing. Springer.

Folino, G., Pizzuti, C., & Spezzano, G. (2003). Ensemble techniques for parallel genetic programming based classifiers. Lecture Notes in Computer Science, 2610, 59-69.

Friedman, J. H., & Fisher, N. (1999). Bump hunting in high-dimensional data. Statistics and Computing, 9(2), 1-20.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337-407.

Greene, D. P., & Smith, S. F. (1994). Using coverage as a model building constraint in learning classifier systems. Evolutionary Computation, 2(1), 67-91.

Hand, D. J. (1997). Construction and assessment of classification rules. Wiley.

Hand, D. J., Adams, N. M., & Kelly, M. G. (2001). Multiple classifier systems based on interpretable linear classifiers. Lecture Notes in Computer Science, 2096, 136-147.

Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of inference, learning and discovery. MIT Press.

Koza, J. R., Keane, M. A., Streeter, M. J., Mydlowec, W., Yu, J., & Lanza, G. (Eds.). (2003). Genetic programming IV: Routine human-competitive machine intelligence. Kluwer.

Kuncheva, L. I., & Jain, L. C. (2000). Designing classifier fusion systems by genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(4), 327-336.

Liu, Y., Yao, X., & Higuchi, T. (2000). Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation, 4(4), 380-387.

Muruzábal, J. (2001). Combining statistical and reinforcement learning in rule-based classification. Computational Statistics, 16(3), 341-359.

Muselli, M., & Liberati, D. (2002). Binary rule generation via Hamming clustering. IEEE Transactions on Knowledge and Data Engineering, 14, 1258-1268.

Schapire, R. E., & Singer, Y. (1998). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3), 297-336.

Westerdale, T. H. (2001). Local reinforcement and recombination in classifier systems. Evolutionary Computation, 9(3), 259-281.

Wilson, S. W. (2001). Mining oblique data with XCS. Lecture Notes in Artificial Intelligence, 1996, 158-176.

KEY TERMS

Classification: The central problem in (supervised) data mining. Given a training data set, classification algorithms provide predictions for new data based on predictive rules and other types of models.

Classifier System: A rich class of evolutionary computation algorithms building on the idea of evolving a population of predictive (or behavioral) rules under the enforcement of certain competition and cooperation processes. Note that classifier systems can also be understood as systems capable of performing classification. Not all CSs in the sense meant here qualify as classifier systems in the broader sense, but a variety of CS algorithms concerned with classification do.

Ensemble-Based Methods: A general technique that seeks to profit from the fact that multiple rule generation followed by prediction averaging reduces test error.

Evolutionary Computation: The solution approach guided by artificial evolution, which begins with random populations (of solution models), then iteratively applies algorithms of various kinds to find the best or fittest models.

Fitness Landscape: Optimization space due to the characteristics of the fitness measure used to define the evolutionary computation process.

Predictive Rules: Standard if-then rules with the consequent expressing some form of prediction about the output variable.

Rule Mining: A computer-intensive task whereby data sets are extensively probed for useful predictive rules.

Test Error: Learning systems should be evaluated with regard to their true error rate, which in practice is approximated by the error rate on test data, or test error.
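The "Ensemble-Based Methods" entry above can be illustrated with a short simulation; the synthetic labels, the 30% per-rule error rate, and the 25 voting rules are arbitrary choices made for this sketch, not values from the article.

```python
import numpy as np

# Synthetic demonstration that averaging many imperfect rules lowers test error.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)               # "true" binary labels

def noisy_rule(y, flip=0.3):
    """A rule that predicts the true label but errs with probability `flip`."""
    errors = rng.random(y.size) < flip
    return np.where(errors, 1 - y, y)

predictions = np.stack([noisy_rule(y) for _ in range(25)])
ensemble = (predictions.mean(axis=0) > 0.5).astype(int)   # prediction averaging

individual_error = float((predictions != y).mean())       # close to the 0.3 flip rate
ensemble_error = float((ensemble != y).mean())            # far lower
print(individual_error, ensemble_error)
```

The effect rests on the rules erring independently; correlated errors would shrink the gain, which is one motivation for the diversity-seeking mechanisms discussed in the ensemble literature cited above.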
Yan Zhao
University of Regina, Canada
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Explanation-Oriented Data Mining

tists. There are some basic principles and techniques that are commonly used in most types of scientific investigations. We adopt the model of the research process from Graziano and Raulin (2000), and combine it with other models (Martella, et al., 1999). The basic phases and their objectives are summarized in Table 1. It is possible to combine several phases into one, or to divide one phase into more detailed steps. The division between phases is not clear-cut. The research process does not follow a rigid sequencing of the phases. Iteration of different phases may be necessary (Graziano & Raulin, 2000).

Many researchers have proposed and studied models of data mining processes (Fayyad, et al., 1996; Mannila, 1997; Yao, Zhao, et al., 2003; Zhong, Liu, & Ohsuga, 2001). A model that adds the explanation facility to the commonly used models has been recently proposed by Yao, Zhao, et al.; it is remarkably similar to the model of scientific research. The basic phases and their objectives are summarized in Table 2. Like the research process, the data mining process is also an iterative process and there is no clear-cut difference among the different phases. In fact, Zhong, et al. argue that it should be a dynamically organized process (Zhong, et al., 2001).

There are bi-directional benefits. The experiences and results from the studies of research methods can be applied to data mining problems; the data mining algorithms can be used to support scientific research.

MAIN THRUST

Explanations of data mining address several important questions. What needs to be explained? How to explain the discovered knowledge? Moreover, is an explanation correct and complete? By answering these questions, one can better understand explanation-oriented data mining. The ideas and processes of explanation construction and explanation evaluation are demonstrated by explanation-oriented association mining.

Figure 1. A framework of explanation-oriented data mining (phases: data preprocessing, data transformation, pattern discovery & evaluation, explanation construction & evaluation)
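The pipeline of Figure 1 can be traced on a toy example. The transactions, the "weekday/weekend" explanation profile, and the pattern {bread, milk} below are invented for illustration, and the brute-force condition search stands in for the supervised learning step described later in the article.

```python
# Toy trace of the framework: pattern discovery on a transaction table,
# then explanation construction & evaluation over an explanation profile.
transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"},
    {"milk"}, {"bread", "milk"}, {"eggs"},
]
profiles = ["weekday", "weekday", "weekend", "weekend", "weekday", "weekend"]

# Pattern discovery & evaluation: support of the itemset {bread, milk}
pattern = {"bread", "milk"}
holds = [pattern <= t for t in transactions]
support = sum(holds) / len(transactions)

# Explanation construction: pick the condition that most raises the support
def conditional_support(condition):
    rows = [i for i, p in enumerate(profiles) if p == condition]
    return sum(holds[i] for i in rows) / len(rows)

best = max(set(profiles), key=conditional_support)
print(support, best, conditional_support(best))
```

Here the pattern holds in half of all transactions but in every "weekday" transaction, so "weekday" is the condition a supervised learner would surface as an explanation.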
The Apriori-ID3 algorithm, which can be regarded as an example of an explanation-oriented association mining method, is described in Table 3.

Explanation Evaluation

Once explanations are generated, it is necessary to evaluate them. For explanation-oriented association mining, we want to compare a conditional association (explained association) with its unconditional counterpart, in addition to comparing different conditions.

Let T be a transaction table, and E be an explanation profile table associated with T. Suppose that for a desired pattern φ generated by an unsupervised learning algorithm from T, there is a set K of conditions (explanations) discovered by a supervised learning algorithm from E, and κ ∈ K is one explanation. Two points are noted. First, the set K of explanations can be different according to various explanation profile tables, or various supervised learning algorithms. Second, not all explanations in K are equally interesting. Different conditions may have different degrees of interestingness.

Suppose µ is a quantitative measure used to evaluate plausible explanations, which can be the support measure for an undirected association, the confidence or coverage measure for a one-way association, or the similarity measure for a two-way association (Yao & Zhong, 1999). A condition κ ∈ K provides an explanation of a discovered pattern φ if µ(φ | κ) > µ(φ). One can further evaluate explanations quantitatively based on several measures, such as absolute difference (AD), relative difference (RD) and ratio of change (RC):

AD(φ | κ) = µ(φ | κ) − µ(φ),
RD(φ | κ) = (µ(φ | κ) − µ(φ)) / µ(φ),
RC(φ | κ) = (µ(φ | κ) − µ(φ)) / (1 − µ(φ)).

The absolute difference represents the disparity between the pattern and the pattern under the condition. For a positive value, one may say that the condition supports φ; for a negative value, one may say that the condition rejects φ. The relative difference is the ratio of the absolute difference to the value of the unconditional pattern. The ratio of change compares the actual change and the maximum potential change.

Generality is the measure to quantify the size of a condition with respect to the whole data, defined by generality(κ) = |κ| / |U|. When the generality of conditions is essential, a compound measure should be applied. For example, one may be interested in discovering an accurate explanation with a high ratio of change and a high generality. However, it often happens that an explanation has a high generality but a low RC value, while another explanation has a low generality but a high RC value. A trade-off between these two explanations does not necessarily exist.

A good explanation system must be able to rank the constructed explanations and be able to reject the bad explanations. It should be realized that evaluation is a difficult process because so many different kinds of knowledge can come into play. In many cases, one must rely on domain experts to reject uninteresting explanations.

FUTURE TRENDS

Considerable research remains to be done for explanation construction and evaluation.

In this chapter, rule-based explanation is constructed by inductive supervised learning algorithms. Considering the structure of explanation, case-based explanations also need to be addressed. Based on the case-based explanation, a pattern is explained if an actual prior case is presented to provide compelling
support. One of the perceived benefits of case-based explanation is that the rule generation effort is saved. Instead, similarity functions need to be studied in order to evaluate the distance between the description of the new pattern and an existing case, and retrieve the most similar case as an explanation.

The constructed explanations of the discovered pattern provide conclusive evidence for the new instances. In other words, the new instances can be explained and implied by the explanations. This is normally true when the explanations are sound and complete. However, sometimes, the constructed explanations cannot guarantee that a certain instance is a perfect fit. Even worse, a new data set, as a whole, may show a change or a conflict with the learnt explanations. This is because the explanations may be context-dependent on certain spatial and/or temporal intervals. To consolidate the explanations we have constructed, we cannot simply logically AND, OR, or ignore the new explanation. Instead, a spatial-temporal reasoning model needs to be introduced to show the trend and evolution of the pattern to be explained.

The explanations we have introduced so far are not necessarily the causal interpretation of the discovered pattern, i.e., the relationships expressed in the form of deterministic and functional equations. They can be inductive generalizations, descriptions, or deductive implications. Explanation as causality is the strongest form of explanation and coherence.

We might think of Bayesian networks as an inference that unveils the internal relationships between attributes. Searching for an optimal model is difficult and NP-hard. Arrow direction is not guaranteed. Expert knowledge could be integrated in the a priori search function, such as the presence of links and orders.

REFERENCES

Brodie, M. & DeJong, G. (2001). Iterated phantom induction: A knowledge-based approach to learning control. Machine Learning, 45(1), 45-76.

Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27, 349-370.

Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. (Eds.). (1996). Advances in knowledge discovery and data mining. AAAI/MIT Press.

Graziano, A.M. & Raulin, M.L. (2000). Research methods: A process of inquiry (4th ed.). Boston: Allyn & Bacon.

Han, J. & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann.

Ling, C.X., Chen, T., Yang, Q. & Cheng, J. (2002). Mining optimal actions for profitable CRM. Proceedings of the International Conference on Data Mining (pp. 767-770).

Mannila, H. (1997). Methods and problems in data mining. Proceedings of the International Conference on Database Theory (pp. 41-55).

Martella, R.C., Nelson, R. & Marchand-Martella, N.E. (1999). Research methods: Learning to become a critical research consumer. Boston: Allyn & Bacon.

Mitchell, T. (1999). Machine learning and data mining. Communications of the ACM, 42(11), 30-36.

Quinlan, J.R. (1983). Learning efficient classification procedures. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. 463-482). Palo Alto, CA: Morgan Kaufmann.
Zhong, N., Liu, C. & Ohsuga, S. (2001). Dynamically organizing KDD processes. International Journal of Pattern Recognition and Artificial Intelligence, 15, 451-473.

KEY TERMS

Absolute Difference: A measure that represents the difference between an association and a conditional association based on a given measure. The condition provides a plausible explanation.

Explanation-Oriented Data Mining: A general framework that includes data pre-processing, data transformation, pattern discovery and evaluation, pattern explanation and explanation evaluation, and pattern presentation. This framework is consistent with the general model of scientific research processes.

Generality: A measure that quantifies the coverage of an explanation in the whole data set.

Goals of Scientific Research: The purposes of science are to describe and predict, to improve or to manipulate the world around us, and to explain our world. One goal of scientific research is to discover new and useful knowledge for the purpose of science. As a specific research field, data mining shares this common goal, and may be considered as a research support system.

Method of Explanation-Oriented Data Mining: The method consists of two main steps and uses two data tables. One table is used to learn a pattern. The other table, an explanation table, is used to explain one desired pattern. In the first step, an unsupervised learning algorithm is used to discover a pattern of interest. In the second step, by treating objects satisfying the pattern as positive instances, and treating the rest as negative instances, one can search for conditions that explain the pattern by a supervised learning algorithm.

Ratio of Change: A ratio of the actual change (absolute difference) to the maximum potential change.

Relative Difference: A measure that represents the difference between an association and a conditional association relative to the association, based on a given measure.

Scientific Research Processes: A general model consists of the following phases: idea generation, problem definition, procedure design/planning, observation/experimentation, data analysis, results interpretation, and communication. It is possible to combine several phases, or to divide one phase into more detailed steps. The division between phases is not clear-cut. Iteration of different phases may be necessary.
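The evaluation measures this article defines (absolute difference, relative difference, ratio of change, and generality) translate directly into code; the numeric values of the measure µ below are illustrative only.

```python
# Sketch of the evaluation measures; mu is the value of a measure (e.g. support)
# for the pattern alone, mu_cond its value under a candidate condition.

def absolute_difference(mu_cond, mu):
    return mu_cond - mu

def relative_difference(mu_cond, mu):
    return (mu_cond - mu) / mu

def ratio_of_change(mu_cond, mu):
    # actual change relative to the maximum potential change, 1 - mu
    return (mu_cond - mu) / (1 - mu)

def generality(n_matching_condition, n_universe):
    return n_matching_condition / n_universe

mu, mu_cond = 0.25, 0.75        # support rises under the condition (made-up values)
print(absolute_difference(mu_cond, mu),
      relative_difference(mu_cond, mu),
      ratio_of_change(mu_cond, mu))
```

A condition qualifies as an explanation when the absolute difference is positive; the ratio of change then rescales that gain by how much improvement was possible at all.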
Richard L. Peterson
Montclair State University, USA
Chen-Fu Chien
National Tsing Hua University, Taiwan
Ruben Xing
Montclair State University, USA
INTRODUCTION

The rapid growth and advances of information technology enable data to be accumulated faster and in much larger quantities (i.e., data warehousing). Faced with vast new information resources, scientists, engineers, and business people need efficient analytical techniques to extract useful information and effectively uncover new, valuable knowledge patterns.

Data preparation is the beginning activity of exploring for potentially useful information. However, there may be redundant dimensions (i.e., variables) in the data, even after the data are well prepared. In this case, the performance of data-mining methods will be affected negatively by this redundancy. Factor Analysis (FA) is known to be a commonly used method, among others, to reduce data dimensions to a small number of substantial characteristics.

FA is a statistical technique used to find an underlying structure in a set of measured variables. FA proceeds by finding new independent variables (factors) that describe the patterns of relationships among the original dependent variables. With FA, a data miner can determine whether or not some variables should be grouped as a distinguishing factor, based on how these variables are related. Thus, the number of factors will be smaller than the number of original variables in the data, enhancing the performance of the data-mining task. In addition, the factors may be able to reveal underlying attributes that cannot be observed or interpreted explicitly so that, in effect, a reconstructed version of the data is created and used to make hypothesized conclusions. In general, FA is used with many data-mining methods (e.g., neural networks, clustering).

BACKGROUND

The concept of FA was created in 1904 by Charles Spearman, a British psychologist. The term factor analysis was first introduced by Thurstone in 1931. Exploratory FA and confirmatory FA are the two main types of modern FA techniques. The goals of FA are (1) to reduce the number of variables and (2) to classify variables through detection of the structure of the relationships between variables. FA achieves these goals by creating a smaller number of new dimensions (i.e., factors) with potentially useful knowledge. The applications of FA techniques can be found in various disciplines in science, engineering, and the social sciences, such as chemistry, sociology, economics, and psychology. To sum up, FA can be considered a broadly used statistical approach that explores the interrelationships among variables and determines a smaller set of common underlying factors. Furthermore, the information contained in the original variables can be explained by these factors with a minimum loss of information.

MAIN THRUST

In order to represent the important structure of the data efficiently (i.e., in a reduced number of dimensions), there are a number of techniques that can be used for data mining. These generally are referred to as multi-dimensional scaling methods. The most basic one is Principal Component Analysis (PCA). Through transforming the original variables in the data into the same number of new ones, which are mutually orthogonal (uncorrelated), PCA sequentially extracts most of the variance (variability) of
Factor Analysis in Data Mining
the data. The hope is that most of the information in the data might be contained in the first few components. FA also extracts a reduced number of new factors from the original data set, although it has different aims from PCA.

FA usually starts with a survey or a number of observed traits. Before FA is applied, the assumptions of correlations in the data (normality, linearity, homogeneity of sample, and homoscedasticity) need to be satisfied. In addition, the factors to extract should all be orthogonal to one another. After defining the measured variables to represent the data, FA considers these variables as a linear combination of latent factors that cannot be measured explicitly. The objective of FA is to identify these unobserved factors, reflect what the variables share in common, and provide further information about them.

Mathematically, let X represent a column vector that contains p measured variables and has a mean vector µ, F stand for a column vector which contains q latent factors, and L be a p × q matrix that transforms F to X. The elements of L (i.e., factor loadings) give the weights that each factor contributes to each measured variable. In addition, let ε be a column vector containing p uncorrelated random errors. Note that q is smaller than p. The following equation illustrates the general model of FA (Johnson & Wichern, 1998):

X − µ = LF + ε.

FA and PCA yield similar results in many cases, but, in practice, PCA often is preferred for data reduction, while FA is preferred to detect structure in the data.

In any experiment, any one scenario may be delineated by a large number of factors. Identifying important factors and putting them into more general categories generates an environment or structure that is more advantageous to data analysis, reducing the large number of variables to smaller, more manageable, interpretable factors (Kachigan, 1986). Technically, FA allows the determination of the interdependency and pattern delineation of data. It untangles the linear relationships into their separate patterns, as each pattern will appear as a factor delineating a distinct cluster of interrelated data (Rummel, 2002, Section 2.1). In other words, FA attempts to take a group of interdependent variables and create separate descriptive categories and, after this transformation, thereby decrease the number of variables that are used in an experiment (Rummel, 2002). The analysis procedures can be performed through a geometrical presentation by plotting data points in a multi-dimensional coordinate axis (exploratory FA) or through mathematical techniques to test the specified model and suspected relationships among variables (confirmatory FA).

In order to illustrate how FA proceeds step by step, here is an example from a case study on the key variables (or characteristics) for induction machines, conducted by Maté and Calderón (2000). The sample of a group of motors was selected from a catalog published by Siemens (1988). It consists of 134 cases with no missing values and 13 variables: power (P), speed (W), efficiency (E), power factor (PF), current (I), locked-rotor current (ILK), torque (M), locked-rotor torque (MLK), breakdown torque (MBD), inertia (J), weight (WG), slip (S), and slope of the M-s curve (M_S). FA can be implemented in the following procedures using this sample data.

Step 1: Ensuring the Adequacy of the Data

The correlation matrix containing correlations between the variables is first examined to identify the variables that are statistically significant. In the case study, this matrix from the sample data showed that the correlations between the variables are satisfactory, and thus, all variables are kept for the next step. Meanwhile, preliminary tests, such as the Bartlett test, the Kaiser-Meyer-Olkin (KMO) test, and the Measures of Sampling Adequacy (MSA) test, are used to evaluate the overall significance of the correlation. Table 1 shows that the values of MSA (rounded to two decimal places) are higher than 0.5 for all variables but variable W. However, the MSA value of variable W is close to 0.5 (MSA should be higher than 0.5, according to Hair, et al. (1998)).

Step 2: Finding the Number of Factors

There are many approaches available for this purpose (e.g., common factor analysis, parallel analysis). The case study first employed the plot of eigenvalues vs. the factor number (the number of factors may be 1 to 13) and found that choosing three factors accounts for 91.3% of the total variance. Then, it suggested that the solution be checked
with the attempt to extract two or three more factors. Based on the comparison between results from different selected methods, the study ended up with five factors.

Step 3: Determining the Transformation (or Rotation) Matrix

A commonly used method is orthogonal rotation. Table 2 can represent the transformation matrix, if each cell shows the factor loading of each variable on each factor. For the sample size, only loadings with an absolute value bigger than 0.5 were accepted (Hair et al., 1998), and they are marked X in Table 2 (other loadings lower than 0.5 are not listed). From the table, J, I, M, S, M, WG, and P can be grouped to the first factor; PF, ILK, E, and S belong to the second factor; W and MLK can be considered another two single factors, respectively. Note that MBD can go to factor 2 or be an independent factor (factor 5) of the other four. The case study settled on the need to retain it in the fifth factor, based on the results obtained from other samples.

In this step, an oblique rotation is another method to determine the transformation matrix. Since it is a non-orthogonal rotation, the factors are not required to be uncorrelated with each other. This gives better flexibility than an orthogonal rotation. Using the sample data of the case study, a new transformation matrix may be obtained from an oblique rotation, which provides new loadings to group variables into factors.

Step 4: Interpreting the Factors

In the case study, the factor consisting of J, I, M, S, M, WG, and P was named size, because the higher value of the weight (WG) and power (P) reflects the larger size of the machine. The factor containing PF, ILK, E, and S is explained as global efficiency.

This example provides a general demonstration of the application of FA techniques to data analysis. Various methods may be incorporated within the concept of FA. Proper selection of methods should depend on the nature of the data and the problem to which FA is applied.

Garson (2003) points out the abilities of FA: determining the number of different factors needed to explain the pattern of relationships among the variables, describing the nature of those factors, knowing how well the hypothesized factors explain the observed data, and finding the amount of purely random or unique variance that each observed variable includes. Because of these abilities, FA has been used for various data analysis problems and may be used in a variety of applications in data mining, from science-oriented to business applications.

One of the most important uses is to provide a summary of the data. The summary facilitates learning the data structure via an economic description. For example, Pan et al. (1997) employed Artificial Neural Network (ANN) techniques combined with FA for spectroscopic quantization of amino acids. Through FA, the number of input nodes for neural networks was compressed effectively, which greatly sped up the calculations of neural networks. Tan and Wisner (2003) used FA to reduce a set of factors affecting operations management constructs and their relationships. Kiousis (2004) applied an exploratory FA to the New York Times news coverage of eight major political issues during the 2000 presidential election. FA identified two indices that measured the construct of the key independent variable in agenda-setting research, which could then be used in future investigations.

Screening variables is another important function of FA. A collinearity problem will appear if the factors of the variables in the data are very similar to each other. In order to avoid this problem, a researcher can group closely related variables into one category and then extract the one that would have the greatest use in determining a solution (Kachigan, 1986). For example, Borovec (1996) proposed a six-step sequential extraction
procedure and applied FA, which found three dominant trace elements from 12 surface stream sediments. These three factors accounted for 78% of the total variance. In another example, Chen et al. (2001) performed an exploratory FA on 48 financial ratios from 63 firms. Four critical financial ratios were concluded, which explained 80% of the variation in productivity.

FA can be used as a scaling method as well. Oftentimes, after the data are collected, the development of scales is needed among individuals, groups, or nations when they are intended to be compared and rated. As the characteristics are grouped into independent factors, FA assigns weights to each characteristic according to the observed relationships among the characteristics. For instance, Tafeit et al. (1999, 2000) provided a comparison between FA and ANN for low-dimensional classification of high-dimensional body fat topography data of healthy and diabetic subjects with a high-dimensional and partly highly intercorrelated set of data. They found that the analysis of the extracted weights yielded useful information about the structure of the data. As the weights for each characteristic are obtained by FA, the score (by summing characteristics times these weights) can be used to represent the scale of the factor to facilitate the rating of factors.

In addition, FA's ability to divide closely related variables into different groups is also useful for statistical hypothesis testing, as Rummel (2002) stated, when hypotheses are about dimensions that can be a group of highly intercorrelated characteristics, such as personality, attitude, social behavior, and voting. For instance, in a study of resource investments in tourism business, Morais et al. (2003) used confirmatory FA to find that pre-established resource investment scales could not fit their model well. They reexamined each subscale with exploratory FA to identify factors that should not have been included in the original model.

There have been controversies about the uses of FA. Hand et al. (2001) pointed out that one important reason is that FA's solutions are not invariant to various transformations. More precisely, the extracted factors are basically non-unique unless extra constraints are imposed (Hand et al., 2001, p. 84). The same information may reach different interpretations with personal judgment. Nevertheless, no method is perfect. In some situations, other statistical methods, such as regression analysis and cluster analysis, may be more appropriate than FA. However, FA is a well-known and useful tool among data-mining techniques.

FUTURE TRENDS

At the Factor Analysis at 100 Conference held in May 2004, the future of FA was discussed. Millsap and Meredith (2004) suggested further research in the area of ordinal measures in multiple populations and technical issues of small samples. These conditions can generate bias in current FA methods, causing results to be suspect. They also suggested further study of the impact of violations of factorial invariance and explanations for these violations. Wall and Amemiya (2004) feel that there are challenges in the area of non-linear FA. Although models exist for non-linear analysis, there are aspects of this area that are not fully understood.

However, the flexibility of FA and its ability to reduce the complexity of the data still make FA one of the commonly used techniques. Incorporated with advances in information technologies, the future of FA shows great promise for applications in the area of data mining.

CONCLUSION

FA is a useful multivariate statistical technique that has been applied in a wide range of disciplines. It enables researchers to effectively extract information from huge databases and attempts to organize and minimize the number of variables used in collecting or measuring data. However, the application of FA in business sectors (e.g., e-business) is relatively new.

Currently, the increasing volumes of data in databases and data warehouses are the key issue governing their future development. Allowing the effective mining of potentially useful information from huge databases with many dimensions, FA definitely is helpful in sorting out the significant parts of information for decision makers, if it is used appropriately.

REFERENCES

Borovec, Z. (1996). Evaluation of the concentrations of trace elements in stream sediments by factor and cluster analysis and the sequential extraction procedure. The Science of the Total Environment, 117, 237-250.

Chen, L., Liaw, S., & Chen, Y. (2001). Using financial factors to investigate productivity: An empirical study in Taiwan. Industrial Management & Data Systems, 101(7), 378-384.
Garson, D. (2003). Factor analysis. Retrieved from http://www2.chass.ncsu.edu/garson/pa765/factor.htm

Hair, J., Anderson, R., Tatham, R., & Black, W. (1998). Multivariate data analysis with readings. Englewood Cliffs, NJ: Prentice-Hall.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Johnson, R., & Wichern, D. (1998). Applied multivariate statistical analysis. Englewood Cliffs, NJ: Prentice Hall.

Kachigan, S. (1986). Statistical analysis: An interdisciplinary introduction to univariate and multivariate methods. New York: Radius Press.

Kiousis, S. (2004). Explicating media salience: A factor analysis of New York Times issue coverage during the 2000 U.S. presidential election. Journal of Communication, 54(1), 71-87.

Mate, C., & Calderon, R. (2000). Exploring the characteristics of rotating electric machines with factor analysis. Journal of Applied Statistics, 27(8), 991-1006.

Millsap, R., & Meredith, W. (2004). Factor invariance: Historical trends and new developments. Proceedings of the Factor Analysis at 100: Historical Developments and Future Directions Conference, Chapel Hill, North Carolina.

Morais, D., Backman, S., & Dorsch, M. (2003). Toward the operationalization of resource investments made between customers and providers of a tourism service. Journal of Travel Research, 41, 362-374.

Pan, Z., et al. (1997). Spectroscopic quantization of amino acids by using artificial neural networks combined with factor analysis. Spectrochimica Acta Part A, 53, 1629-1632.

Rummel, R.J. (2002). Understanding factor analysis. Retrieved from http://www.hawaii.edu/powerkills/UFA.HTM

Tafeit, E., Moller, R., Sudi, K., & Reibnegger, G. (1999). The determination of three subcutaneous adipose tissue compartments in non-insulin-dependent diabetes mellitus women with artificial neural networks and factor analysis. Artificial Intelligence in Medicine, 17, 181-193.

Tafeit, E., Moller, R., Sudi, K., & Reibnegger, G. (2000). Artificial neural networks compared to factor analysis for low-dimensional classification of high-dimensional body fat topography data of healthy and diabetic subjects. Computers and Biomedical Research, 33, 365-374.

Tan, K., & Wisner, J. (2003). A study of operations management constructs and their relationships. International Journal of Operations & Production Management, 23(11), 1300-1325.

Wall, M., & Amemiya, Y. (2004). A review of nonlinear factor analysis methods and applications. Proceedings of the Factor Analysis at 100: Historical Developments and Future Directions Conference, Chapel Hill, North Carolina.

Williams, R.H., Zimmerman, D.W., Zumbo, B.D., & Ross, D. (2003). Charles Spearman: British behavioral scientist. Human Nature Review. Retrieved from http://human-nature.com/nibbs/03/spearman.html

KEY TERMS

Cluster Analysis: A multivariate statistical technique that assesses the similarities between individuals of a population. Clusters are groups or categories formed so that members within a cluster are less different than members from different clusters.

Eigenvalue: The quantity representing the variance of a set of variables included in a factor.

Factor Score: A measure of a factor's relative weight to others, which is obtained using linear combinations of variables.

Homogeneity: The degree of similarity or uniformity among individuals of a population.

Homoscedasticity: A statistical assumption for linear regression models. It requires that the variations around the regression line be constant for all values of the input variables.

Matrix: An arrangement of rows and columns to display quantities. A p × q matrix contains p × q quantities arranged in p rows and q columns (i.e., each row has q quantities, and each column has p quantities).

Normality: A statistical assumption for linear regression models. It requires that the errors around the regression line be normally distributed for each value of the input variable.

Variance: A statistical measure of dispersion around the mean within the data. Factor analysis divides the variance of a variable into three elements: common, specific, and error.

Vector: A quantity having both direction and magnitude. This quantity can be represented by an array of components in a column (column vector) or in a row (row vector).
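As a concrete illustration of the eigenvalue-based choice of the number of factors described in Step 2 of the case study, the following minimal sketch computes the eigenvalues of a correlation matrix and picks the smallest number of factors whose cumulative variance share reaches a target. The toy data, the 0.9 target, and the function name are illustrative assumptions, not taken from the case study.

```python
import numpy as np

def factors_for_variance(data, target=0.9):
    """Smallest number of factors whose eigenvalues jointly explain at
    least `target` of the total variance (eigenvalue-plot style)."""
    corr = np.corrcoef(data, rowvar=False)             # correlation matrix of the variables
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # eigenvalues, largest first
    explained = np.cumsum(eigvals) / eigvals.sum()     # cumulative share of total variance
    return int(np.searchsorted(explained, target)) + 1, explained

# Toy data: three underlying signals observed through six correlated variables.
rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 3))
data = np.hstack([signals, signals @ rng.normal(size=(3, 3))])
data = data + 0.1 * rng.normal(size=data.shape)        # small measurement noise

k, explained = factors_for_variance(data)
print(k, np.round(explained, 3))
```

Parallel analysis, also mentioned above, would instead compare these eigenvalues against those obtained from random data of the same shape.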
Financial Ratio Selection for Distress Classification
Victor M. Becerra
University of Reading, UK
Magda Abou-Seada
Middlesex University, UK
Prediction of corporate financial distress is a subject that has attracted the interest of many researchers in finance. The development of prediction models for financial distress started with the seminal work by Altman (1968), who used discriminant analysis. Such a technique is aimed at classifying a firm as bankrupt or nonbankrupt on the basis of the joint information conveyed by several financial ratios.

The assessment of financial distress is usually based on ratios of financial quantities, rather than absolute values, because the use of ratios deflates statistics by size, thus allowing a uniform treatment of different firms. Moreover, such a procedure may be useful to reflect a synergy or antagonism between the constituents of the ratio.

BACKGROUND

The classification of companies on the basis of financial distress can be performed by using linear discriminant models (also called Z-score models) of the following form (Duda, Hart, & Stork, 2001):

Z(x) = (μ1 - μ2)^T S^-1 x   (1)

where x = [x1 x2 ... xn]^T is a vector of n financial ratios, μ1 and μ2 are the n-dimensional sample mean vectors of each group (continuing and failed companies), and S is the n × n common sample covariance matrix. Equation 1 can also be written as

Z = w1x1 + w2x2 + ... + wnxn = w^T x   (2)

where w = [w1 w2 ... wn]^T is a vector of coefficients obtained as

w = S^-1 (μ1 - μ2)   (3)

The optimal cut-off value for classification z_c can be calculated as

z_c = 0.5 (μ1 - μ2)^T S^-1 (μ1 + μ2)   (4)

A given vector x should be assigned to Population 1 if Z(x) > z_c, and to Population 2 otherwise.

The generalization (or prediction) performance of the Z-score model, that is, its ability to classify objects not used in the modeling phase, can be assessed by using an independent validation set or cross-validation methods (Duda et al., 2001). The simplest cross-validation technique, termed leave-one-out, consists of separating one of the m modeling objects and obtaining a Z-score model with the remaining m - 1 objects. This model is used to classify the object that was left out. The procedure is repeated for each object in the modeling set in order to obtain a total number of cross-validation errors.

Resampling techniques (Good, 1999) such as the Bootstrap method (Davison & Hinkley, 1997) can also be used to assess the sensitivity of the analysis to the choice of the training objects.

The Financial Ratio Selection Problem

The selection of appropriate ratios from the available financial information is an important and nontrivial stage in building distress classification models. The best choice of ratios will normally depend on the types of companies under analysis and also on the economic context. Although the analyst's market insight plays an important role at this point, the use of data-driven selection techniques can be of value, because the relevance of certain ratios may only become apparent when their joint contribution is considered in a multivariate
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Financial Ratio Selection for Distress Classification
context. Moreover, some combinations of ratios may not satisfy the statistical assumptions required in the modeling process, such as normal distribution and identical covariances in the groups being classified, in the case of standard linear discriminant analysis (Duda et al., 2001). Finally, collinearity between ratios may cause the model to have poor prediction ability (Naes & Mevik, 2001).

Techniques proposed for ratio selection include normality tests (Taffler, 1982) and clustering followed by stepwise discriminant analysis (Alici, 1996). Most of the works cited in the preceding paragraph begin with a set of ratios chosen from either popularity in the literature, theoretical arguments, or suggestions by financial analysts. However, this article shows that it is possible to select ratios on the basis of data taken directly from the financial statements. For this purpose, we compare two selection methods proposed by Galvão, Becerra, and Abou-Seada (2004). A case study involving 60 failed and continuing British firms in the period from 1997 to 2000 is employed for illustration.

MAIN THRUST

It is not always advantageous to include all available variables in the building of a classification model (Duda et al., 2001). Such an issue has been studied in depth in the context of spectrometry (Andrade, Gomez-Carracedo, Fernandez, Elbergali, Kubista, & Prada, 2003), in which the variables are related to the wavelengths monitored by an optical instrumentation framework. This concept also applies to the Z-score modeling process described in the preceding section. In fact, numerical ill-conditioning tends to increase with (m - n)^-1, where m is the size of the modeling sample and n is the number of variables (Tabachnick & Fidell, 2001). If n > m, matrix S becomes singular, thus preventing the use of Equation 1. In this sense, it may be more appropriate to select a subset of the available variables for inclusion in the classification model.

The selection procedures to be compared in this article search for a compromise between maximizing the amount of discriminating information available for the model and minimizing collinearity between the classification variables, which is a known cause of generalization problems (Naes & Mevik, 2001). These goals are usually conflicting, because the larger the number of variables, the more information is available, but also the more difficult it is to avoid collinearity.

Algorithm A (Preselection Followed by Exhaustive Search)

If N variables are initially available for selection, they can be combined in 2^N - 1 different subsets (each subset with a number of variables between 1 and N). Thus, the computational workload can be substantially reduced if some variables are preliminarily excluded.

In this algorithm, such a preselection is carried out according to a multivariate relevance index W(x) that measures the contribution of each variable x to the classification output when a Z-score model is employed. This index is obtained by using all variables to build a model as in Equation 1 and by multiplying the absolute value of each model weight by the sample standard deviation (including both groups) of the respective variable.

An appropriate threshold value for the relevance index W(x) can be determined by augmenting the modeling data with artificial uninformative variables (noise) and then obtaining a Z-score model. Those variables whose relevance is not considerably larger than the average relevance of the artificial variables are then eliminated (Centner, Massart, Noord, Jong, Vandeginste, & Sterna, 1996).

After the preselection phase, all combinations of the remaining variables are tested. Subsets with the same number of variables are compared on the basis of the number of classification errors on the modeling set for a Z-score model and the condition number of the matrix of modeling data. The condition number (the ratio between the largest and smallest singular values of the matrix) should be small to avoid collinearity problems (Navarro-Villoslada, Perez-Arribas, Leon-Gonzalez, & Polodiez, 1995). After the best subset has been determined for each given number of variables, a cross-validation procedure is employed to find the optimum number of variables.

Algorithm B (Genetic Selection)

The drawback of the preselection procedure employed in Algorithm A is that some variables that display a small relevance index when all variables are considered together could be useful in smaller subsets. An alternative to such a preselection consists of employing a genetic algorithm (GA), which tests subsets of variables in an efficient way instead of performing an exhaustive search (Coley, 1999; Lestander, Leardi, & Geladi, 2003).

The GA represents subsets of variables as individuals competing for survival in a population. The genetic
code of each individual is stored in a chromosome, which is a string of N binary genes, each gene associated with one of the variables available for selection. The genes with value 1 indicate the variables that are to be included in the classification model.

In the formulation adopted here, the measure F of the survival fitness of each individual is defined as follows. A Z-score model is obtained from Equation 1 with the variables indicated in the chromosome, and then F is calculated as

F = (e + λr)^-1   (5)

where e is the number of classification errors in the modeling set, r is the condition number associated to the variables included in the model, and λ > 0 is a design parameter that balances modeling accuracy against collinearity prevention. The larger λ is, the more emphasis is placed on avoiding collinearity.

After a random initialization of the population, the algorithm proceeds according to the classic evolutionary cycle (Coley, 1999). At each generation, the roulette method is used for mating pool selection, followed by the genetic operators of one-point crossover and point mutation. The population size is kept constant, with each generation replacing the previous one completely. However, the best-fitted individual is preserved from one generation to the next (elitism) in order to prevent good solutions from being lost.

CASE STUDY

This example employs financial data from 29 failed and 31 continuing British corporations in the period from 1997 to 2000. The data for the failed firms were taken from the last financial statements published prior to the start of insolvency proceedings. Eight financial quantities were extracted from the statements, allowing 28 ratios to be built, as shown in Table 1. Quantities WC, PBIT, EQ, S, TL, ARP, and TA are commonly found in the financial distress literature (Altman, 1968; Taffler, 1982; Alici, 1996), and the ratios marked with an asterisk in Table 1 are those adopted by Altman (1968). It is worth noting that the book value of equity was used rather than the market value of equity to allow the inclusion of firms not quoted in the stock market. Quantity RPY is not typically employed in distress models, but we include it here to illustrate the ability of the selection algorithms to discard uninformative variables. The data employed in this example are given in Galvão, Becerra, and Abou-Seada (2004).

The data set was divided into a modeling set (21 failed and 21 continuing firms) and a validation set (8 failed and 10 continuing firms). In what follows, the errors will be divided into Type 1 (failed company classified as continuing) and Type 2 (continuing company classified as failed).

Conventional Financial Ratios

Previous studies (Becerra, Galvão, & Abou-Seada, 2001) with this data set revealed that when the five conventional ratios are employed, Ratio 13 (PBIT/TA) is actually redundant and should be excluded from the Z-score model in order to avoid collinearity problems. Thus, Equation 1 was applied only to the remaining four ratios, leading to the results shown in Table 2. It is worth noting that if Ratio PBIT/TA is not discarded, the number of validation errors increases from four to seven.

Algorithm A

The preselection procedure was carried out by augmenting the 28 financial ratios with seven uninformative variables yielded by an N(0,1) random number generator. The relevance index thus obtained is shown in Figure 1. The threshold value, represented by a horizontal line, was set to five times the average relevance of the uninformative variables. As a result, 13 ratios were discarded.

After the preselection phase, combinations of the 15 remaining ratios were tested for modeling accuracy and

Table 1. Numbering of financial ratios (Num/Den). Conventional ratios are marked with an asterisk. WC = working capital, PBIT = profit before interest and tax, EQ = equity, S = sales, TL = total liabilities, ARP = accumulated retained profit, RPY = retained profit for the year, TA = total assets.

Den \ Num    WC    PBIT   EQ    S     TL    ARP   RPY
WC           -
PBIT         1
EQ           2     8
S            3     9      14
TL           4     10     15*   19
ARP          5     11     16    20    23
RPY          6     12     17    21    24    26
TA           7*    13*    18    22*   25    27*   28

Table 2. Results of a Z-score model using four conventional ratios

Data set           Type 1 errors   Type 2 errors   Percent accuracy
Modeling           2               7               79%
Cross-validation   3               8               74%
Validation         0               4               78%
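The Z-score model of Equations 1 and 4 and the leave-one-out procedure described in the Background section can be sketched as follows. The synthetic two-group data below are illustrative stand-ins, not the actual British-firm ratios.

```python
import numpy as np

def z_score_model(X1, X2):
    """Fit the linear discriminant of Equations 1 and 4: returns the
    weight vector w = S^-1 (mu1 - mu2) and the cut-off z_c."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled ("common") sample covariance of the two groups.
    S = (np.cov(X1, rowvar=False) * (len(X1) - 1)
         + np.cov(X2, rowvar=False) * (len(X2) - 1)) / (len(X1) + len(X2) - 2)
    w = np.linalg.solve(S, mu1 - mu2)
    z_c = 0.5 * w @ (mu1 + mu2)                # Equation 4
    return w, z_c

def loo_errors(X1, X2):
    """Leave-one-out cross-validation: refit with one object removed,
    then classify the removed object."""
    errors = 0
    for i in range(len(X1)):                   # objects from Population 1
        w, z_c = z_score_model(np.delete(X1, i, axis=0), X2)
        errors += X1[i] @ w <= z_c             # misclassified if Z(x) <= z_c
    for i in range(len(X2)):                   # objects from Population 2
        w, z_c = z_score_model(X1, np.delete(X2, i, axis=0))
        errors += X2[i] @ w > z_c
    return int(errors)

rng = np.random.default_rng(1)
X1 = rng.normal(loc=+1.0, size=(21, 4))        # e.g. continuing firms (illustrative)
X2 = rng.normal(loc=-1.0, size=(21, 4))        # e.g. failed firms (illustrative)
w, z_c = z_score_model(X1, X2)
print(loo_errors(X1, X2), "cross-validation errors out of", len(X1) + len(X2))
```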
Figure 1. Relevance index of the financial ratios. Log10 values are displayed for the convenience of visualization.

Table 3. Results of a Z-score model using the five ratios selected by Algorithm A

Data set           Type 1 errors   Type 2 errors   Percent accuracy
Modeling           1               3               90%
Cross-validation   2               5               83%
Validation         1               2               83%
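The noise-augmentation preselection behind Figure 1 can be sketched as follows: each variable's relevance is its weight magnitude in an all-variable Z-score model times its standard deviation over both groups, and variables not well above five times the average relevance of appended N(0,1) columns are dropped. The data set and helper names here are illustrative, not the 28-ratio case study.

```python
import numpy as np

def relevance_index(X1, X2):
    """W(x): |Z-score model weight| times the sample standard deviation
    of each variable over both groups, using all variables at once."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S = (np.cov(X1, rowvar=False) * (len(X1) - 1)
         + np.cov(X2, rowvar=False) * (len(X2) - 1)) / (len(X1) + len(X2) - 2)
    w = np.linalg.solve(S, mu1 - mu2)                 # weights of Equation 1
    std = np.vstack([X1, X2]).std(axis=0, ddof=1)     # std over both groups
    return np.abs(w) * std

rng = np.random.default_rng(2)
group1 = rng.normal(loc=+1.5, size=(50, 3))           # three informative variables
group2 = rng.normal(loc=-1.5, size=(50, 3))
noise = rng.normal(size=(100, 4))                     # appended N(0,1) noise columns
X1 = np.hstack([group1, noise[:50]])
X2 = np.hstack([group2, noise[50:]])

W = relevance_index(X1, X2)
threshold = 5 * W[3:].mean()           # five times the average noise relevance
keep = np.flatnonzero(W > threshold)
print("kept variable indices:", keep)
```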
Algorithm B
Resampling Study
Table 4. GA results for different values of the weight parameter λ

λ      Selected ratios           Modeling errors   Cross-validation errors
10     {7}                       11                11
1      {2,9,15}                  4                 5
0.1    {2,9,13,18,19,22,25}      3                 7

Table 6. Resampling results (average number of errors)

Data set     Conventional (4 ratios)   Algorithm A (5 ratios)   Algorithm B (3 ratios)
Modeling     8.52                      5.70                     5.36
Validation   4.69                      3.22                     2.89
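The fitness of Equation 5 balances modeling errors against the condition number of the selected columns. A sketch of evaluating one chromosome (a binary gene string) follows; the synthetic data and the weight parameter value are illustrative, and a full GA would add roulette selection, one-point crossover, point mutation, and elitism as described in the Main Thrust section.

```python
import numpy as np

def fitness(genes, X1, X2, lam=1.0):
    """Equation 5, F = 1 / (e + lam * r), for the variables whose gene is 1:
    e = modeling-set errors of the resulting Z-score model,
    r = condition number of the selected modeling-data columns."""
    cols = np.flatnonzero(genes)
    A1, A2 = X1[:, cols], X2[:, cols]
    mu1, mu2 = A1.mean(axis=0), A2.mean(axis=0)
    S = (np.cov(A1, rowvar=False) * (len(A1) - 1)
         + np.cov(A2, rowvar=False) * (len(A2) - 1)) / (len(A1) + len(A2) - 2)
    w = np.linalg.solve(S, mu1 - mu2)                        # Equation 1 weights
    z_c = 0.5 * w @ (mu1 + mu2)                              # Equation 4 cut-off
    e = int(np.sum(A1 @ w <= z_c) + np.sum(A2 @ w > z_c))    # classification errors
    r = np.linalg.cond(np.vstack([A1, A2]))                  # collinearity penalty
    return 1.0 / (e + lam * r)

rng = np.random.default_rng(3)
X1 = rng.normal(loc=+1.0, size=(21, 6))                      # illustrative group 1
X2 = rng.normal(loc=-1.0, size=(21, 6))                      # illustrative group 2
X1[:, 5] = X1[:, 0] + 1e-6 * rng.normal(size=21)             # nearly collinear copy
X2[:, 5] = X2[:, 0] + 1e-6 * rng.normal(size=21)

good = fitness(np.array([1, 1, 1, 0, 0, 0]), X1, X2)
collinear = fitness(np.array([1, 1, 1, 0, 0, 1]), X1, X2)
print(good, collinear)
```

Adding the nearly collinear variable drives the condition number up by several orders of magnitude, so its chromosome receives a much lower fitness even if its error count is similar.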
Coley, D. A. (1999). An introduction to genetic algorithms for scientists and engineers. Singapore: World Scientific.

Davison, A. C., & Hinkley, D. V. (Eds.). (1997). Bootstrap methods and their application. Cambridge, MA: Cambridge University Press.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.

Galvão, R. K. H., Becerra, V. M., & Abou-Seada, M. (2004). Ratio selection for classification models. Data Mining & Knowledge Discovery, 8, 151-170.

Good, P. I. (1999). Resampling methods: A practical guide to data analysis. Boston, MA: Birkhauser.

Lestander, T. A., Leardi, R., & Geladi, P. (2003). Selection of near infrared wavelengths using genetic algorithms for the determination of seed moisture content. Journal of Near Infrared Spectroscopy, 11(6), 433-446.

Naes, T., & Mevik, B. H. (2001). Understanding the collinearity problem in regression and discriminant analysis. Journal of Chemometrics, 15(4), 413-426.

Navarro-Villoslada, F., Perez-Arribas, L. V., Leon-Gonzalez, M. E., & Polodiez, L. M. (1995). Selection of calibration mixtures and wavelengths for different multivariate calibration methods. Analytica Chimica Acta, 313(1-2), 93-101.

Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston, MA: Allyn & Bacon.

Taffler, R. J. (1982). Forecasting company failure in the UK using discriminant analysis and financial ratio data. Journal of the Royal Statistical Society, Series A, 145, 342-358.

Wilson, R. L., & Sharda, R. (1994). Bankruptcy prediction using neural networks. Decision Support Systems, 11, 545-557.

KEY TERMS

Condition Number: Ratio between the largest and smallest singular values of a matrix, often employed to assess the degree of collinearity between variables associated to the columns of the matrix.

Cross-Validation: Resampling method in which elements of the modeling set itself are alternately removed and reinserted for validation purposes.

Financial Distress: A company is said to be under financial distress if it is unable to pay its debts as they become due, which is aggravated if the value of the firm's assets is lower than its liabilities.

Financial Ratio: Ratio formed from two quantities taken from a financial statement.

Genetic Algorithm: Optimization technique inspired by the mechanisms of evolution by natural selection, in which the possible solutions are represented as the chromosomes of individuals competing for survival in a population.

Linear Discriminant Analysis: Multivariate classification technique that models the classes under consideration by normal distributions with equal covariances, which leads to hyperplanes as the optimal decision surfaces.

Resampling: Validation technique employed to assess the sensitivity of the classification method with respect to the choice of modeling data.
Flexible Mining of Association Rules
INTRODUCTION BACKGROUND
The discovery of association rules showing conditions An association rule is called binary association rule if all
of data co-occurrence has attracted the most attention items (attributes) in the rule have only two values: 1 (yes)
in data mining. An example of an association rule is the or 0 (no). Mining binary association rules was the first
rule the customer who bought bread and butter also proposed data mining task and was studied most inten-
bought milk, expressed by T(bread; butter)T(milk). sively. Centralized on the Apriori approach (Agrawal et
Let I ={x1,x2,,xm} be a set of (data) items, called the al., 1993), various algorithms were proposed (Savasere et
domain; let D be a collection of records (transactions), al., 1995; Shen, 1999; Shen, Liang, & Ng, 1999; Srikant &
where each record, T, has a unique identifier and con- Agrawal, 1996). Almost all the algorithms observe the
tains a subset of items in I. We define itemset to be a set downward property that all the subsets of a frequent
of items drawn from I and denote an itemset containing k itemset must also be frequent, with different pruning
items to be k-itemset. The support of itemset X, denoted strategies to reduce the search space. Apriori works by
by (X/D), is the ratio of the number of records (in D) finding frequent k-itemsets from frequent (k-1)-itemsets
containing X to the total number of records in D. An iteratively for k=1, 2, , m-1.
association rule is an implication rule X Y, where X; Two alternative approaches, mining on domain parti-
Y ⊆ I and X ∩ Y = ∅. The confidence of X → Y is the ratio of σ(X ∪ Y/D) to σ(X/D), indicating the percentage of those containing X that also contain Y. Based on the user-specified minimum support (minsup) and confidence (minconf), the following statements are true: An itemset X is frequent if σ(X/D) ≥ minsup, and an association rule X → Y is strong if X ∪ Y is frequent and σ(X ∪ Y/D)/σ(X/D) ≥ minconf. The problem of mining association rules is to find all strong association rules, which can be divided into two subproblems:

1. Find all the frequent itemsets.
2. Generate all strong rules from all frequent itemsets.

Because the second subproblem is relatively straightforward (we can solve it by extracting every subset from an itemset and examining the ratio of its support), most of the previous studies (Agrawal, Imielinski, & Swami, 1993; Agrawal, Mannila, Srikant, Toivonen, & Verkamo, 1996; Park, Chen, & Yu, 1995; Savasere, Omiecinski, & Navathe, 1995) emphasized developing efficient algorithms for the first subproblem.

…tion (Shen, L., Shen, H., & Cheng, 1999) and mining based on knowledge network (Shen, 1999) were proposed. The first approach partitions items suitably into disjoint itemsets, and the second approach maps all records to individual items; both approaches aim to relieve the bottleneck of Apriori, which requires multiple phases of scans (reads) of the database.

Finding all the association rules that satisfy minimal support and confidence is undesirable in many cases for a user's particular requirements. It is therefore necessary to mine association rules more flexibly according to the user's needs. Mining different sets of association rules of a small size for the purposes of prediction and classification was proposed (Li, Shen, & Topor, 2001, 2002, 2004; Li, Topor, & Shen, 2002).

MAIN THRUST

Association rule mining can be carried out flexibly to suit different needs. We illustrate this by introducing important techniques to solve two interesting problems. This article introduces two important techniques for association rule mining: (a) finding the N most frequent itemsets and (b) mining multiple-level association rules.
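As a toy illustration, the support and confidence definitions above can be sketched in a few lines of Python; the transaction database D, the itemsets, and the thresholds are made-up examples, not taken from the article.

```python
# Support and confidence over a database D of transactions (Python sets).

def support(itemset, D):
    """sigma(itemset/D): fraction of transactions in D containing itemset."""
    return sum(1 for t in D if itemset <= t) / len(D)

def is_frequent(X, D, minsup):
    return support(X, D) >= minsup

def is_strong(X, Y, D, minsup, minconf):
    """X -> Y is strong if X u Y is frequent and its confidence >= minconf."""
    if X & Y:
        raise ValueError("X and Y must be disjoint")
    sup_xy = support(X | Y, D)
    return sup_xy >= minsup and sup_xy / support(X, D) >= minconf

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(support({"a", "b"}, D))                 # 0.5
print(is_strong({"a"}, {"b"}, D, 0.4, 0.6))   # True
```

Here {a} → {b} is strong because {a, b} occurs in half of the transactions and the confidence 0.5/0.75 ≈ 0.67 exceeds minconf.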
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Flexible Mining of Association Rules
Finding N Most Frequent Itemsets

Given x, y ⊆ I, we say that x is greater than y, or y is less than x, if σ(x/D) > σ(y/D). The largest itemset in D is the itemset that occurs most frequently in D. We want to find the N largest itemsets in D, where N is a user-specified number of interesting itemsets. Because users are usually interested in those itemsets with larger supports, finding the N most frequent itemsets is significant, and its solution can be used to generate an appropriate number of interesting itemsets for mining association rules (Shen, L., Shen, H., Pritchard, & Topor, 1998).

We define the rank of itemset x, denoted by ρ(x), as follows: ρ(x) = |{y | σ(y/D) > σ(x/D), y ⊆ I}| + 1. Call x a winner if ρ(x) ≤ N and σ(x/D) ≥ 1, which means that x is one of the N largest itemsets and it occurs in D at least once. We don't regard any itemset with support 0 as a winner, even if it is ranked below N, because we do not need to provide users with an itemset that doesn't occur in D at all.

Use W to denote the set of all winners, and call the support of the smallest winner the critical support, denoted by crisup. Clearly, W exactly contains all itemsets with support not less than crisup; we also have crisup ≥ 1. It is easy to see that |W| may be different from N: If the number of all itemsets occurring in D is less than N, |W| will be less than N; |W| may also be greater than N, as different itemsets may have the same support. The problem of finding the N largest itemsets is to generate W.

Let x be an itemset. Use Pk(x) to denote the set of all k-subsets (subsets with size k) of x. Use Uk to denote P1(I) ∪ … ∪ Pk(I), the set of all itemsets with a size not greater than k. Thus, we introduce the k-rank of x, denoted by ρk(x), as follows: ρk(x) = |{y | σ(y/D) > σ(x/D), y ∈ Uk}| + 1. Call x a k-winner if ρk(x) ≤ N and σ(x/D) ≥ 1, which means that among all itemsets with a size not greater than k, x is one of the N largest itemsets and also occurs in D at least once. Use Wk to denote the set of all k-winners. We define k-critical-support, denoted by k-crisup, as follows: If |Wk| < N, then k-crisup is 1; otherwise, k-crisup is the support of the smallest k-winner. Clearly, Wk exactly contains all itemsets with a size not greater than k and support not less than k-crisup. We present some useful properties of the preceding concepts as follows.

Property: Let k and i be integers such that 1 ≤ k < k+i ≤ |I|.

(1) Given x ∈ Uk, we have x ∈ Wk iff σ(x/D) ≥ k-crisup.
(2) If Wk-1 = Wk, then W = Wk.
(3) Wk+i ∩ Uk ⊆ Wk.
(4) 1 ≤ k-crisup ≤ (k+i)-crisup.

To find all the winners, the algorithm makes multiple passes over the data. In the first pass, we count the supports of all 1-itemsets, select the N largest ones from them to form W1, and then use W1 to generate potential 2-winners with size 2. Each subsequent pass k involves three steps: First, we count the support for potential k-winners with size k (called candidates) during the pass over D; then we select the N largest ones from a pool precisely containing supports of all these candidates and all (k-1)-winners to form Wk; finally, we use Wk to generate potential (k+1)-winners with size k+1, which will be used in the next pass. This process continues until we cannot get any potential (k+1)-winners with size k+1, which implies that Wk+1 = Wk. From Property (2), we know that the last Wk exactly contains all winners.

We assume that Mk is the number of itemsets with support equal to k-crisup and a size not greater than k, where 1 ≤ k ≤ |I|, and M is the maximum of M1, …, M|I|. Thus, we have |Wk| = N + Mk − 1 ≤ N + M. It was shown that the time complexity of the algorithm is proportional to the number of all the candidates generated in the algorithm, which is O(log(N+M) · min{N+M, |I|} · (N+M)) (Shen et al., 1998). Hence, the time complexity of the algorithm is polynomial for bounded N and M.

Mining Multiple-Level Association Rules

Although most previous research emphasized mining association rules at a single concept level (Agrawal et al., 1993; Agrawal et al., 1996; Park et al., 1995; Savasere et al., 1995; Srikant & Agrawal, 1996), some techniques were also proposed to mine rules at generalized abstract (multiple) levels (Han & Fu, 1995). However, they can only find multiple-level rules in a fixed concept hierarchy. Our study in this direction is motivated by the goal of mining multiple-level rules in all concept hierarchies (Shen, L., & Shen, H., 1998).

A concept hierarchy can be defined on a set of database attribute domains such as D(a1), …, D(an), where, for i ∈ [1, n], ai denotes an attribute and D(ai) denotes the domain of ai. The concept hierarchy is usually partially ordered according to a general-to-specific ordering. The most general concept is the null description ANY, whereas the most specific concepts correspond to the specific attribute values in the database. Given a set of domains D(a1), …, D(an), we define a concept hierarchy H as follows: H: Hn ⇒ Hn−1 ⇒ … ⇒ H0, where each level Hi is the cross-product of the domains of a subset of the attributes, for i ∈ [0, n], and the attribute set of each level is contained in the attribute set of the next more specific level. Here, Hn represents the set of concepts at the primitive level, Hn−1 represents the concepts at one level higher than those at Hn, and so forth; H0, the highest level of the hierarchy, may contain solely the most general concept,
ANY. We also use the sequence of attribute sets at levels n, n−1, …, 1 to denote H directly, and H0 may be omitted here.

We introduce FML items to represent concepts at any level of a hierarchy. Let *, called a trivial digit, be a don't-care digit. An FML item is represented by a sequence of digits, x = x1x2…xn, xi ∈ D(ai) ∪ {*}. The flat-set of x is defined as Sf(x) = {(i, xi) | i ∈ [1, n] and xi ≠ *}. Given two items x and y, x is called a generalized item of y if Sf(x) ⊆ Sf(y), which means that x represents a higher-level concept that contains the lower-level concept represented by y. Thus, *5* is a generalized item of 35* due to Sf(*5*) = {(2,5)} ⊆ Sf(35*) = {(1,3),(2,5)}. If Sf(x) = ∅, then x is called a trivial item, which represents the most general concept, ANY.

Let T be an encoded transaction table, t a transaction in T, x an item, and c an itemset. We say that (a) t supports x if an item y exists in t such that x is a generalized item of y and (b) t supports c if t supports every item in c. The support of an itemset c in T, σ(c/T), is the ratio of the number of transactions (in T) that support c to the total number of transactions in T. Given a minsup, an itemset c is large if σ(c/T) ≥ minsup; otherwise, it is small.

Given an itemset c, we define its simplest form as Fs(c) = {x ∈ c | there is no y ∈ c with Sf(x) ⊂ Sf(y)} and its complete form as Fc(c) = {x | Sf(x) ⊆ Sf(y), y ∈ c}.

Given an itemset c, we call the number of elements in Fs(c) its size and the number of elements in Fc(c) its weight. An itemset of size j and weight k is called a (j)-itemset, [k]-itemset, or (j)[k]-itemset. Let c be a (j)[k]-itemset. Use Gi(c) to indicate the set of all [i]-generalized-subsets of c, where i ≤ k. Thus, the set of all [k−1]-generalized-subsets of c can be generated as follows: Gk−1(c) = {Fs(Fc(c) − {x}) | x ∈ Fs(c)}; the size of Gk−1(c) is j. Hence, for k > 1, c is a (1)-itemset iff |Gk−1(c)| = 1. With this observation, we call a [k−1]-itemset a self-extensive if a (1)[k]-itemset b exists such that Gk−1(b) = {a}; at the same time, we call b the extension result of a. Thus, all self-extensive itemsets a can be generated from all (1)-itemsets b as follows: Fc(a) = Fc(b) − Fs(b).

Let b be a (1)-itemset and Fs(b) = {x}. From |Fc(b)| = |{y | Sf(y) ⊆ Sf(x)}| = 2^|Sf(x)|, we know that b is a (1)[2^|Sf(x)|]-itemset. Let a be the self-extensive itemset generated by b; that is, a is a [2^|Sf(x)| − 1]-itemset such that Fc(a) = Fc(b) − Fs(b). Thus, if |Sf(x)| > 1, there exist y, z ∈ Fs(a) with y ≠ z such that Sf(x) = Sf(y) ∪ Sf(z). Clearly, this property can be used to generate the corresponding extension result from any self-extensive [2^m − 1]-itemset, where m > 1. For simplicity, given a self-extensive [2^m − 1]-itemset a, where m > 1, we directly use Er(a) to denote its extension result. For example, Er({12*, 1*3, *23}) = {123}.

The algorithm makes multiple passes over the database. In the first pass, we count the supports of all [2]-itemsets and then select all the large ones. In pass k−1, we start with Lk−1, the set of all large [k−1]-itemsets, and use Lk−1 to generate Ck, a superset of all large [k]-itemsets. Call the elements in Ck candidate itemsets, and count the support for these itemsets during the pass over the data. At the end of the pass, we determine which of these itemsets are actually large and obtain Lk for the next pass. This process continues until no new large itemsets are found. Note that L1 = {{trivial item}}, because {trivial item} is the unique [1]-itemset and is supported by all transactions.

The computation cost of the preceding algorithm, which finds all frequent FML itemsets, is O(Σ over c ∈ C of (g(c) + s(c))), where C is the set of all candidates, g(c) is the cost of generating c as a candidate, and s(c) is the cost of counting the support of c (Shen, L., & Shen, H., 1998). The algorithm is optimal if the method of support counting is optimal.

After all frequent FML itemsets have been found, we can proceed with the construction of strong FML rules. Use r(l,a) to denote the rule Fs(a) → Fs(Fc(l) − Fc(a)), where l is an itemset and a ⊂ l. Use Fo(l,a) to denote Fc(l) − Fc(a), and say that Fo(l,a) is an outcome form of l or the outcome form of r(l,a). Note that Fo(l,a) represents only a specific form rather than a meaningful itemset, so it is not equivalent to any other itemset whose simplest form is Fs(Fo(l,a)). Outcome forms are also called outcomes directly. Clearly, the corresponding relationship between rules and outcomes is one to one. An outcome is strong if it corresponds to a strong rule. Thus, all strong rules related to a large itemset can be obtained by finding all strong outcomes of this itemset.

Let l be an itemset. Use O(l) to denote the set of all outcomes of l; that is, O(l) = {Fo(l,a) | a ⊂ l}. Thus, from O(l), we can output all rules related to l: Fs(Fc(l) − o) → Fs(o) (denoted by r(l,o)), where o ∈ O(l). Clearly, r(l,a) and r(l,o) denote the same rule if o = Fo(l,a). Let o, o′ ∈ O(l). We say two things: (a) o is a |k|-outcome of l if o exactly contains k elements, and (b) o′ is a sub-outcome of o versus l if o′ ⊂ o. Use Ok(l) to denote the set of all the |k|-outcomes of l and Vm(o,l) to denote the set of all the |m|-sub-outcomes of o versus l. Let o be an |m+1|-outcome of l and m ≥ 1. If |Vm(o,l)| = 1, then o is called an elementary outcome; otherwise, o is a non-elementary outcome.

Let r(l,a) and r(l,b) be two rules. We say that r(l,a) is an instantiated rule of r(l,b) if b ⊆ a. Clearly, b ⊆ a implies σ(b/T) ≥ σ(a/T) and hence σ(l/T)/σ(a/T) ≥ σ(l/T)/σ(b/T). Hence, all instantiated rules of a strong rule must also be strong. Let l be a large itemset, o1, o2 ∈ O(l), o1 = Fo(l,a), and o2 = Fo(l,b). The three straightforward conclusions are (a)
Shen, L., & Shen, H. (1998). Mining flexible multiple-level association rules in all concept hierarchies. Proceedings of the Ninth International Conference on Database and Expert Systems Applications (pp. 786-795).

Shen, L., Shen, H., & Cheng, L. (1999). New algorithms for efficient mining of association rules. Information Sciences, 118, 251-268.

Shen, L., Shen, H., Pritchard, P., & Topor, R. (1998). Finding the N largest itemsets. Proceedings of the IEEE International Conference on Data Mining, 19 (pp. 211-222).

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. ACM SIGMOD Record, 25(2), 1-8.

KEY TERMS

Association Rule: An implication rule X → Y that shows the conditions of co-occurrence of disjoint itemsets (attribute value sets) X and Y in a given database.

Concept Hierarchy: The organization of a set of database attribute domains into different levels of abstraction according to a general-to-specific ordering.

Confidence of Rule X → Y: The fraction of the database containing X that also contains Y, which is the ratio of the support of X ∪ Y to the support of X.

Flexible Mining of Association Rules: Mining association rules in user-specified forms to suit different needs, such as on dimension, level of abstraction, and interestingness.

Frequent Itemset: An itemset that has a support greater than the user-specified minimum support.

Strong Rule: An association rule whose support (of the union of its itemsets) and confidence are greater than the user-specified minimum support and confidence, respectively.

Support of Itemset X: The fraction of the database that contains X, which is the ratio of the number of records containing X to the total number of records in the database.
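As a toy illustration of the winner and critical-support definitions used in this article, the following brute-force Python sketch enumerates every itemset occurring in a tiny database and ranks it by support (measured here as an absolute occurrence count). This is not the level-wise multi-pass algorithm of the article, which exists precisely to avoid this exponential enumeration; the database D is a made-up example.

```python
# Brute-force sketch of winners, rank, and critical support (crisup).
from itertools import combinations

def n_largest_itemsets(D, N):
    # Collect every non-empty itemset occurring in at least one transaction.
    candidates = set()
    for t in D:
        for k in range(1, len(t) + 1):
            for c in combinations(sorted(t), k):
                candidates.add(frozenset(c))
    sup = {c: sum(1 for t in D if c <= t) for c in candidates}

    def rank(x):
        # rank(x) = 1 + number of itemsets with strictly larger support
        return 1 + sum(1 for y in candidates if sup[y] > sup[x])

    winners = {x for x in candidates if rank(x) <= N and sup[x] >= 1}
    crisup = min(sup[x] for x in winners)  # support of the smallest winner
    return winners, crisup

D = [{"a", "b"}, {"a"}, {"a", "b", "c"}]
winners, crisup = n_largest_itemsets(D, 2)
# {a} ranks 1st; {b} and {a, b} tie for 2nd, so |W| exceeds N here, as the
# article notes can happen when different itemsets share the same support.
```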
Formal Concept Analysis Based Clustering
Figure 2. Concept lattice for the context in Table 1 with reduced labeling
maximal set of features shared by a set of objects. It is easy to show that the intents of the concepts of a concept lattice are all closed feature sets.

The support of a set of features B is defined as the percentage of objects that possess every feature in B. That is, support(B) = |B′|/|G|, where B′ is the set of objects possessing every feature in B and |G| is the cardinality of G. Let minSupport be a user-specified threshold value for minimum support. A feature set B is frequent iff support(B) ≥ minSupport. A frequent closed feature set is a closed feature set which is also frequent. For example, for minSupport = 0.3, {a, f} is frequent, {a, d, f} is frequent closed, while {a, c, d, f} is closed but not frequent.

CLUSTERING BASED ON FCA

It is believed that the method described below is the first to use FCA for disjoint clustering. Using FCA for conceptual clustering to gain more information about data is discussed in Carpineto & Romano (1999) and Mineau & Godin (1995). In the remainder of this article we show how FCA can be used for clustering.

Traditionally, most clustering algorithms do not allow clusters to overlap. However, this is not a valid assumption for many applications. For example, in Web document clustering, many documents have more than one topic and need to reside in more than one cluster (Beil, Ester, & Xu, 2002; Hearst, 1999; Zamir & Etzioni, 1998). Similarly, in market basket data, items purchased in a transaction may belong to more than one category of items.

The concept lattice structure provides a hierarchical clustering of objects, where the extent of each node could be a cluster and the intent provides a description of that cluster. There are two main problems, though, that make it difficult to recognize the clusters to be used. First, not all objects are present at all levels of the lattice. Second, the presence of overlapping clusters at different levels is not acceptable for disjoint clustering. The techniques described in this chapter solve these problems. For example, for a node to be a cluster candidate, its intent must be frequent (meaning a minimum percentage of objects must possess all the features of the intent). The intuition is that the objects within a cluster must contain many
features in common. Overlapping is resolved by using a score function that measures the goodness of a cluster for an object and keeps the object in the cluster where it scores best.

Formalizing the Clustering Problem

Given a set of objects G = {g1, g2, …, gn}, where each object is described by the set of features it possesses (i.e., gi is described by β(gi)), C = {C1, C2, …, Ck} is a clustering of G if and only if each Ci is a subset of G and the union of all Ci equals G, where i ranges from 1 to k. For disjoint clustering, the additional condition Ci ∩ Cj = ∅ must be satisfied for all i ≠ j.

Our method for disjoint clustering consists of two steps. First, assign objects to their initial clusters. Second, make these clusters disjoint. For overlapping clustering, only the first step is needed.

Each frequent closed feature set (FCFS) is a cluster candidate, with the FCFS serving as a label for that cluster. Each object g is assigned to the cluster candidates described by the maximal frequent closed feature sets (MFCFS) contained in β(g). These initial clusters may not be disjoint because an object may contain several MFCFS. For example, for minSupport = 0.375, object 2 in Table 1 contains the MFCFSs agh and abg.

Notice that all of the objects in a cluster must contain all the features in the FCFS describing that cluster (which is also used as the cluster label). This is always true even after any overlapping is removed. This means that this method produces clusters with their descriptions. This is a desirable property. It helps a domain expert to assign labels and descriptions to clusters.

Making Clusters Disjoint

To make the clusters disjoint, we find the best cluster for each overlapping object g and keep g only in that cluster. To achieve this, a score function is used. The function score(g, Ci) measures the goodness of cluster Ci for object g. Intuitively, a cluster is good for an object g if g has many frequent features which are also frequent in Ci. On the other hand, Ci is not good for g if g has frequent features that are not frequent in Ci.

Define global-support(f) as the percentage of objects possessing f in the whole database, and cluster-support(f) as the percentage of objects possessing f in a given cluster Ci. We say f is cluster-frequent in Ci if cluster-support(f) is at least a user-specified minimum threshold value θ. For cluster Ci, let positive(g, Ci) be the set of features in β(g) which are both global-frequent (i.e., frequent in the whole database) and cluster-frequent. Also let negative(g, Ci) be the set of features in β(g) which are global-frequent but not cluster-frequent. The function score(g, Ci) is then given by the following formula:

score(g, Ci) = Σ over f ∈ positive(g, Ci) of cluster-support(f) − Σ over f ∈ negative(g, Ci) of global-support(f).

The first term in score(g, Ci) favors Ci for every feature in positive(g, Ci) because these features contribute to intra-cluster similarity. The second term penalizes Ci for every feature in negative(g, Ci) because these features contribute to inter-cluster similarities.

An overlapping object will be deleted from all initial clusters except for the cluster where it scores highest. Ties are broken by assigning the object to the cluster with the longest label. If this does not resolve ties, then one of the clusters is chosen randomly.

Consider the objects in Table 1. The closed feature sets are the intents of the concept lattice in Figure 1. For minSupport = 0.35, a feature set must appear in at least 3 objects to be frequent. The frequent closed feature sets are a, ag, ac, ab, ad, agh, abg, acd, and adf. These are the candidate clusters. Using the notation C[x] to indicate the cluster with label x, and assigning objects to MFCFS results in the following initial clusters: C[agh] = {2, 3, 4}, C[abg] = {1, 2, 3}, C[acd] = {6, 7, 8}, and C[adf] = {5, 6, 8}. To find the most suitable cluster for object 6, we need to calculate its score in each cluster containing it. For a cluster-support threshold value of 0.7, it is found that score(6, C[acd]) = 1 − 0.63 + 1 + 1 − 0.38 = 1.99 and score(6, C[adf]) = 1 − 0.63 − 0.63 + 1 + 1 = 1.74.

We will use score(6, C[acd]) to explain the calculation. All features in β(6) are global-frequent. They are a, b, c, d, and f, with global frequencies 1, 0.63, 0.63, 0.5, and 0.38, respectively. Their respective cluster-support values are 1, 0.33, 1, 1, and 0.67. For a feature to be cluster-frequent in C[acd], it must appear in at least ⌈θ · |C[acd]|⌉ = 3 of its objects. Therefore, a, c, d ∈ positive(6, C[acd]), and b, f ∈ negative(6, C[acd]). Substituting these values into the formula for the score function, it is found that score(6, C[acd]) = 1.99. Since score(6, C[acd]) > score(6, C[adf]), object 6 is assigned to C[acd].

Tables 2 and 3 show the score calculations for all overlapping objects. Table 2 shows initial cluster assignments. Features that are both global-frequent and cluster-frequent are shown in the column labeled positive(g, Ci), and features that are only global-frequent are in the column labeled negative(g, Ci). For elements in the column labeled positive(g, Ci), the cluster-support values are listed between parentheses after each feature name. The same format is used for global-support values
[Table 2 (fragment): rows for objects 4 (acghi), 3 (abcgh), and 8 (acdf) with per-feature values; the full table is not recoverable from the extraction.]

Table 3. Score calculations and final cluster assignments (minSupport = 0.37, θ = 0.7)

for features in the column labeled negative(g, Ci). It is only a coincidence that in this example all cluster-support values are 1 (try θ = 0.5 for different values). Table 3 shows the score calculations and final cluster assignments.

Notice that different threshold values may result in different clusters. The value of minSupport affects cluster labels and initial clusters, while that of θ affects the final elements in clusters. For example, for minSupport = 0.375 and θ = 0.5, we get the following final clusters: C[agh] = {3, 4}, C[abg] = {1, 2}, C[acd] = {7, 8}, and C[adf] = {5, 6}.

CONCLUSION

This chapter introduces formal concept analysis (FCA), a useful framework for many applications in computer science. We also showed how the techniques of FCA can be used for clustering. A global support value is used to specify which concepts can be candidate clusters. A score function is then used to determine the best cluster for each object. This approach is appropriate for clustering categorical data, transaction data, text data, Web documents, and library documents. These data usually suffer from the problem of high dimensionality, with only a few items or keywords being available in each transaction or document. FCA contexts are suitable for representing this kind of data.

REFERENCES

Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. In The 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002) (pp. 436-442).

Carpineto, C., & Romano, G. (1993). GALOIS: An order-theoretic approach to conceptual clustering. In Proceedings of the 1993 International Conference on Machine Learning (pp. 33-40).

Fung, B., Wang, K., & Ester, M. (2003). Large hierarchical document clustering using frequent itemsets. In Third SIAM International Conference on Data Mining (pp. 59-70).

Ganter, B., & Wille, R. (1999). Formal concept analysis: Mathematical foundations. Berlin: Springer-Verlag.

Gouda, K., & Zaki, M. (2001). Efficiently mining maximal frequent itemsets. In First IEEE International Conference on Data Mining (pp. 163-170). San Jose, USA.

Hearst, M. (1999). The use of categories and clusters for organizing retrieval results. In T. Strzalkowski (Ed.), Natural language information retrieval (pp. 333-369). Boston: Kluwer Academic Publishers.

Kryszkiewicz, M. (1998). Representative association rules. In Proceedings of PAKDD 98. Lecture Notes in Artificial Intelligence (Vol. 1394) (pp. 198-209). Berlin: Springer-Verlag.

Mineau, G., & Godin, R. (1995). Automatic structuring of knowledge bases by conceptual clustering. IEEE Transactions on Knowledge and Data Engineering, 7(5), 824-829.

Pei, J., Han, J., & Mao, R. (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM-SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 21-30). Dallas, USA.

Saquer, J. (2003). Using concept lattices for disjoint clustering. In The Second IASTED International Conference on Information and Knowledge Sharing (pp. 144-148).

Wang, K., Xu, C., & Liu, B. (1999). Clustering transactions using large items. In ACM International Conference on Information and Knowledge Management (pp. 483-490).

Wille, R. (1982). Restructuring lattice theory: An approach based on hierarchies of concepts. In I. Rival (Ed.), Ordered sets (pp. 445-470). Dordrecht-Boston: Reidel.

Yun, C., Chuang, K., & Chen, M. (2001). An efficient clustering algorithm for market basket data based on small large ratios. In The 25th COMPSAC Conference (pp. 505-510).

Zaki, M. J., & Hsiao, C. (2002). CHARM: An efficient algorithm for closed itemset mining. In Second SIAM International Conference on Data Mining.

Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In The 21st Annual International ACM SIGIR (pp. 46-54).

KEY TERMS

Cluster-Support of Feature f in Cluster Ci: Percentage of objects in Ci possessing f.

Concept: A pair (A, B) of a set A of objects and a set B of features such that B is the maximal set of features possessed by all the objects in A and A is the maximal set of objects that possess every feature in B.

Context: A triple (G, M, I) where G is a set of objects, M is a set of features, and I is a binary relation between G and M such that gIm if and only if object g possesses the feature m.

Formal Concept Analysis: A mathematical framework that provides formal and mathematical treatment of the notion of a concept in a given universe.

Negative(g, Ci): Set of features possessed by g which are global-frequent but not cluster-frequent.

Positive(g, Ci): Set of features possessed by g which are both global-frequent and cluster-frequent.

Support or Global-Support of Feature f: Percentage of object transactions in the whole context (or whole database) that possess f.
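The score-based disambiguation described in this article can be sketched in Python as follows. The global supports and the C[acd] cluster-supports are the worked example for object 6 from the text; the individual cluster-supports of b and c in C[adf] are placeholders (the article gives only the resulting score for that cluster), so treat those two values as assumptions.

```python
# Minimal sketch of score(g, Ci) for resolving overlapping clusters.

GLOBAL_SUPPORT = {"a": 1.0, "b": 0.63, "c": 0.63, "d": 0.5, "f": 0.38}

def score(features, cluster_support, theta):
    """Add cluster-support(f) for cluster-frequent features (positive set);
    subtract global-support(f) for the remaining global-frequent features
    (negative set)."""
    total = 0.0
    for f in sorted(features):
        cs = cluster_support.get(f, 0.0)
        if cs >= theta:          # f is in positive(g, Ci)
            total += cs
        else:                    # f is in negative(g, Ci)
            total -= GLOBAL_SUPPORT[f]
    return total

beta_6 = {"a", "b", "c", "d", "f"}                        # features of object 6
cs_acd = {"a": 1.0, "b": 0.33, "c": 1.0, "d": 1.0, "f": 0.67}
cs_adf = {"a": 1.0, "b": 0.33, "c": 0.33, "d": 1.0, "f": 1.0}  # b, c assumed
print(round(score(beta_6, cs_acd, 0.7), 2))   # 1.99
print(round(score(beta_6, cs_adf, 0.7), 2))   # 1.74
```

Because score(6, C[acd]) = 1.99 exceeds score(6, C[adf]) = 1.74, object 6 stays in C[acd], matching the article's worked example.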
INTRODUCTION

The results of data warehousing and data mining depend essentially on the quality of data. Usually data are assumed to be numbers or vectors, but this is often not realistic. In particular, the result of a measurement of a continuous quantity is never a precise number, but more or less non-precise. This kind of uncertainty is also called fuzziness and should not be confused with errors. Data mining techniques have to take care of fuzziness in order to avoid unrealistic results.

BACKGROUND

In standard data warehousing and data analysis, data are treated as numbers, vectors, words, or symbols. These data types do not take care of the fuzziness of data and of prior information. Whereas some methodology for fuzzy data analysis has been developed, statistical data analysis usually does not take care of fuzziness. Recently some methods for the statistical analysis of non-precise data were published (Viertl, 1996, 2003).

Historically, fuzzy sets were first introduced by K. Menger in 1951 (Menger, 1951). Later L. Zadeh made fuzzy models popular. For more information on fuzzy modeling compare (Dubois & Prade, 2000).

Most data analysis techniques are statistical techniques. Only in the last 20 years were alternative methods using fuzzy models developed. For a detailed discussion compare (Bandemer & Näther, 1992; Berthold & Hand, 2003).

MAIN THRUST

…precise. For details see (Viertl, 2002). This kind of uncertainty can best be described by a so-called fuzzy number. A fuzzy number x* is defined by a so-called characterizing function ξ: ℝ → [0,1] which obeys the following:

(1) There exists x0 ∈ ℝ with ξ(x0) = 1.
(2) For all δ ∈ (0,1], the so-called δ-cut Cδ[ξ(·)] := {x ∈ ℝ : ξ(x) ≥ δ} = [aδ, bδ] is a finite closed interval.

Examples of non-precise data are results on analogue measurement equipment as well as readings on digital instruments.

For continuous vector quantities, real measurements are not precise vectors but also non-precise. This imprecision can result in a vector (x1*, …, xk*) of fuzzy numbers xi*, or more generally, in a so-called k-dimensional fuzzy vector x*. Using the notation x = (x1, …, xk) ∈ ℝ^k, a k-dimensional fuzzy vector is defined by its vector-characterizing function ζ: ℝ^k → [0,1] obeying

(1) There exists x0 ∈ ℝ^k with ζ(x0) = 1.
(2) For all δ ∈ (0,1], the δ-cut Cδ[ζ(·)] := {x ∈ ℝ^k : ζ(x) ≥ δ} is a compact, simply connected subset of ℝ^k.
Fuzzy Information and Data Analysis
A fuzzy vector combining fuzzy numbers x1*, …, xk* with characterizing functions ξ1, …, ξk has the vector-characterizing function

ζ(x1, …, xk) = min{ξ1(x1), …, ξk(xk)} for all (x1, …, xk) ∈ ℝ^k.

For a sample x1*, …, xn* of non-precise observations and a class Kj, the generalized relative frequencies are given, for each δ ∈ (0,1], by the two counts

h_n,δ(Kj) = #{xi* : Cδ[ξi(·)] ⊆ Kj} / n,
h̄_n,δ(Kj) = #{xi* : Cδ[ξi(·)] ∩ Kj ≠ ∅} / n,

which bound the frequency of the class from below and above.

The generalized integral of a fuzzy valued function f*(·) defined on M is a fuzzy number I* denoted by ∮M f*(x) dx = 1+, where 1+ is a fuzzy number fulfilling 1 ∈ C1[1+] and Cδ[1+] ⊆ (0, ∞) for all δ ∈ (0,1].
…sion analysis, integration of fuzzy valued functions, fuzzy optimization, and fuzzy approximation. For more details compare (Bandemer & Näther, 1992).

FUTURE TRENDS

Up to now, fuzzy models were mainly used to describe vague data in the form of linguistic data. But precision measurements are also connected with uncertainty which is not only of a stochastic nature. Therefore hybrid methods of data analysis, which take care of the fuzziness of data as well as their statistical variation, have to be applied and further developed. In particular, research on how to obtain the characterizing functions of non-precise data is necessary.

REFERENCES

Ross, R., Booker, J., & Parkinson, J. (Eds.) (2002). Fuzzy logic and probability applications: Bridging the gap. Philadelphia: American Statistical Association and Society for Industrial and Applied Mathematics.

Viertl, R. (1996). Statistical methods for non-precise data. Boca Raton: CRC Press.

Viertl, R. (2002). On the description and analysis of measurements of continuous quantities. Kybernetika, 38, 353-362.

Viertl, R. (2003). Statistical inference with imprecise data. Encyclopedia of life support systems. Retrieved from UNESCO, Paris, Web site: www.eolss.unesco.org

Viertl, R., & Hareter, D. (2004). Fuzzy information and imprecise probability. Zeitschrift für Angewandte Mathematik und Mechanik, 84(10-11), 1-10.
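As a toy illustration of the characterizing-function definition above, the sketch below builds a trapezoidal fuzzy number and its δ-cuts; the corner points 1, 2, 3, 4 are made-up values, and the trapezoidal shape is one common choice, not the only one.

```python
# A trapezoidal characterizing function xi and its delta-cuts.

def trapezoid(a, b, c, d):
    """xi is 0 outside [a, d], rises linearly on [a, b], equals 1 on the
    core [b, c], and falls linearly on [c, d]."""
    def xi(x):
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return (x - a) / (b - a)
        if c < x < d:
            return (d - x) / (d - c)
        return 0.0
    return xi

def delta_cut(a, b, c, d, delta):
    """C_delta[xi] = {x : xi(x) >= delta}, a finite closed interval."""
    return (a + delta * (b - a), d - delta * (d - c))

xi = trapezoid(1.0, 2.0, 3.0, 4.0)
print(xi(2.5))                               # 1.0 (condition (1) holds)
print(delta_cut(1.0, 2.0, 3.0, 4.0, 0.5))    # (1.5, 3.5)
```

Every δ-cut here is a finite closed interval that shrinks toward the core [2, 3] as δ approaches 1, matching conditions (1) and (2) of the definition.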
A General Model for Data Warehouses
integrates existing models. This model can be used to There may be no fact attribute; in this case a fact
apprehend the sharing of dimensions in various ways and records the occurrence of an event or a situation. In such
to describe different relationships between fact types. cases, analysis consists in counting occurrences satis-
Using this model, we will also define the notion of well- fying a certain number of conditions.
formed warehouse structures. Such structures have de- For the needs of an application, it is possible to
sirable properties for applications. We suggest a graph introduce different fact types sharing certain dimen-
representation for such structures which can help the sions and having references between them.
users in designing and requesting a data warehouse.
Modelling Facts

A fact is used to record measures or states concerning an event or a situation. Measures and states can be analysed through different criteria organized in dimensions.

A fact type is a structure

fact_name[(fact_key),
    (list_of_reference_attributes),
    (list_of_fact_attributes)]

where

fact_name is the name of the type;
fact_key is a list of attribute names; the concatenation of these attributes identifies each instance of the type;
list_of_reference_attributes is a list of attribute names; each attribute references a member in a dimension or another fact instance;
list_of_fact_attributes is a list of attribute names; each attribute is a measure for the fact.

The set of referenced dimensions comprises the dimensions which are directly referenced through the list_of_reference_attributes, but also the dimensions which are indirectly referenced through other facts. Each fact attribute can be analysed along each of the referenced dimensions. Analysis is achieved through the computing of aggregate functions on the values of this attribute.

As an example, let us consider the following fact type for memorizing the sales in a set of stores.

Sales[(ticket_number, product_key),
    (time_key, product_key, store_key),
    (price_per_unit, quantity)]

The key is (ticket_number, product_key). This means that there is an instance of Sales for each different product of a ticket. There are three dimension references: time_key, product_key, store_key. There are two fact attributes: price_per_unit, quantity. The fact attributes can be analysed through aggregate operations by using the three dimensions.

Modelling Dimensions

The different criteria which are needed to conduct analysis along a dimension are introduced through members. A member is a specific attribute (or a group of attributes) taking its values on a well-defined domain. For example, the dimension TIME can include members such as DAY, MONTH, YEAR, etc. Analysing a fact attribute A along a member M means that we are interested in computing aggregate functions on the values of A for any grouping defined by the values of M. In the article we will also use the notation Mij for the j-th member of the i-th dimension.

Members of a dimension are generally organized in a hierarchy which is a conceptual representation of the hierarchies of their occurrences. Hierarchy in dimensions is a very useful concept that can be used to impose constraints on member values and to guide the analysis. Hierarchies of occurrences result from various relationships which can exist in the real world: categorization, membership of a subset, mereology. Figure 1 illustrates a typical situation which can occur. Note that a hierarchy is not necessarily a tree.

Figure 1. A typical hierarchy in a dimension: occurrences of cust_key (ck1, ck2, ck3, ck4, ...) roll up to occurrences of region (Texas, Ohio, ...) either through town (Dallas, Mesquite, Akron, ...) or through demozone (North, ...)

We will model a dimension according to a hierarchical relationship (HR) which links a child member Mij (e.g., town) to a parent member Mik (e.g., region), and we will use the notation Mij → Mik. For the following we consider only situations where a child occurrence is linked to a unique parent occurrence in a type. However, a child occurrence, as in case (b) or (c), can have several parent occurrences, but each of different types. We will also suppose that HR is reflexive, antisymmetric and transitive. This kind of relationship covers the great majority of real situations. Existence of this HR is very important since it means that the members of a dimension
can be organized into levels and correct aggregation of fact attribute values along levels can be guaranteed.

Each member of a dimension can be an entry for this dimension, i.e., can be referenced from a fact type. This possibility is very important since it means that dimensions can be shared between several fact types in various ways. In particular, it is possible to reference a dimension at different levels of granularity. A dimension root represents a standard entry. For the three dimensions in Figure 1, there is a single root. However, definition 3 authorizes several roots.

As in other models (Hüsemann, 2000), we consider property attributes which are used to describe the members. A property attribute is linked to its member through a functional dependence, but does not introduce a new member and a new level of aggregation. For example, the member town in the customer dimension may have property attributes such as population, administrative position, etc. Such attributes can be used in the selection predicates of requests to filter certain groups.

We now define the notion of member type, which incorporates the different features presented above. A member type is a structure:

member_name[(member_key),
    dimension_name,
    (list_of_reference_attributes)]

where

member_name is the name of the type;
member_key is a list of attribute names; the concatenation of these attributes identifies each instance of the type;
list_of_reference_attributes is a list of attribute names where each attribute is a reference to the successors of the member instance in the cover graph of the dimension.

First, a fact can directly reference any member of a dimension. Usually a dimension is referenced through one of its roots (as we saw above, a dimension can have several roots). But it is also interesting and useful to have references to members other than the roots. This means that a dimension can be used by different facts with different granularities. For example, a fact can directly reference town in the customer dimension and another can directly reference region in the same dimension. This second reference corresponds to a coarser granule of analysis than the first.

Moreover, a fact F1 can reference any other fact F2. This type of reference is necessary to model real situations. This means that a fact attribute of F1 can be analysed by using the key of F2 (acting as the grouping attribute of a normal member) and also by using the dimensions referenced by F2.

To formalise the interconnection between facts and dimensions, we thus suggest extending the HR relationship of section 3 to the representation of the associations between fact types and the associations between fact types and member types. We impose the same properties (reflexivity, anti-symmetry, transitivity). We also forbid cyclic interconnections. This gives a very uniform model since fact types and member types are considered equally. To maintain a traditional vision of data warehouses, we also ensure that the members of a dimension cannot reference facts.

Figure 2 illustrates the typical structures we want to model. Case (a) corresponds to the simple case, also known as a star structure, where there is a unique fact type F1 and several separate dimensions D1, D2, etc. Cases (b) and (c) correspond to the notion of facts of fact. Cases (d), (e) and (f) correspond to the sharing of the same dimension. In case (f) there can be two different paths starting at F2 and reaching the same member M of the sub-dimension D21. So analysis using these
two paths may not give the same results when reaching M. To address this problem we introduce the DWG and the path coherence constraint.

To represent data warehouse structures, we suggest using a graph representation called DWG (data warehouse graph). It consists in representing each type (fact type or member type) by a node containing the main information about this type, and representing each reference by a directed edge.

Suppose that in the DWG there are two different paths P1 and P2 starting from the same fact type F and reaching the same member type M. We can analyse instances of F by using P1 or P2. The path coherence constraint is satisfied if we obtain the same results when reaching M.

For example, in the case of Figure 1 this constraint means the following: for a given occurrence of cust_key, whether the town path or the demozone path is used, one always obtains the same occurrence of region.

We are now able to introduce the notion of well-formed structures.

Definition of a Well-Formed Warehouse Structure

A warehouse structure is well-formed when the DWG is acyclic and the path coherence constraint is satisfied for any couple of paths having the same starting node and the same ending node.

A well-formed warehouse structure can thus have several roots. The different paths from the roots can always be divided into two sub-paths: the first one with only fact nodes and the second one with only member nodes. So, roots are fact types.

Since the DWG is acyclic, it is possible to distribute its nodes into levels. Each level represents a level of aggregation. Each time we follow a directed edge, the level increases (by one or more depending on the used path). When using aggregate operations, this action corresponds to a ROLLUP operation (corresponding to the semantics of the HR) and the opposite operation to a DRILLDOWN. Starting from the reference to a dimension D in a fact type F, we can then roll up in the hierarchies of dimensions by following a path of the DWG.

Illustrating the Modelling of a Typical Case with Well-Formed Structures

Well-formed structures are able to model correctly and completely the different cases of Figure 2. We illustrate in this section the modelling for the star-snowflake structure. We have a star-snowflake structure when:

there is a unique root (which corresponds to the unique fact type);
each reference in the root points towards a separate subgraph in the DWG (this subgraph corresponds to a dimension).

Our model does not differentiate star structures from snowflake structures. The difference will appear with the mapping towards an operational model (the relational model, for example). The DWG of a star-snowflake structure is represented in Figure 3. This representation is well-formed. Such a representation can be very useful to a user for formulating requests.

Figure 3. The DWG of a star-snowflake structure (with member types such as year [(year_no)], family [(family_name)], and region [(region_name)])

FUTURE TRENDS

A first open problem concerns the property of summarizability between the levels of the different dimensions. For example, the total of the sales of a
product for 2001 must be equal to the sum of the totals for the sales of this product for all months of 2001. Any model of data warehouse has to respect this property. In our presentation we supposed that the function HR verified this property. In practice, various functions were proposed and used. It would be interesting and useful to begin a general formalization which would regroup all these propositions.

Another open problem concerns the elaboration of a design method for the schema of a data warehouse. A data warehouse is a data base, and one could think that its design does not differ from that of a data base. In fact, a data warehouse presents specificities which it is necessary to take into account, notably data loading and performance optimization. Data loading can be complex since source schemas can differ from the data warehouse schema. Performance optimization arises particularly when using a relational DBMS for implementing the data warehouse.

CONCLUSION

In this paper we propose a model which can describe various data warehouse structures. It integrates and extends existing models for sharing dimensions and for representing relationships between facts. It allows for different entries in a dimension corresponding to different granularities. A dimension can also have several roots corresponding to different views and uses. Thanks to this model, we have also suggested the concept of the Data Warehouse Graph (DWG) to represent a data warehouse schema. Using the DWG, we define the notion of well-formed warehouse structures, which guarantees desirable properties.

We have illustrated how typical structures such as star-snowflake structures can be advantageously represented with this model. Other useful structures like those depicted in Figure 2 can also be represented.

The DWG gathers the main information from the warehouse, and it can be very useful to users for formulating requests. We believe that the DWG can be used as an efficient support for a graphical interface to manipulate multidimensional structures through a graphical language.

The schema of a data warehouse represented with our model can be easily mapped into an operational model. Since our model is object-oriented, a mapping towards an object model is straightforward. But it is also possible to map the schema towards a relational model or an object-relational model. It appears that our model has a natural place between the conceptual schema of the application and an object-relational implementation of the warehouse. It can thus serve as a support for the design of relational data warehouses.

REFERENCES

Abelló, A., Samos, J., & Saltor, F. (2001). Understanding analysis dimensions in a multidimensional object-oriented model. Int'l Workshop on Design and Management of Data Warehouses, DMDW'01, Interlaken, Switzerland.

Agrawal, R., Gupta, A., & Sarawagi, S. (1997). Modelling multidimensional databases. International Conference on Data Engineering, ICDE'97 (pp. 232-243), Birmingham, UK.

Datta, A., & Thomas, H. (1999). The cube data model: A conceptual model and algebra for on-line analytical processing in data warehouses. Decision Support Systems, 27(3), 289-301.

Golfarelli, M., Maio, D., & Rizzi, S. (1998). Conceptual design of data warehouses from E/R schemes. 31st Hawaii International Conference on System Sciences, HICSS 1998.

Gyssens, M., & Lakshmanan, L.V.S. (1997). A foundation for multi-dimensional databases. Int'l Conference on Very Large Databases (pp. 106-115).

Hurtado, C., & Mendelzon, A. (2001). Reasoning about summarizability in heterogeneous multidimensional schemas. International Conference on Database Theory, ICDT'01.

Hüsemann, B., Lechtenbörger, J., & Vossen, G. (2000). Conceptual data warehouse design. Int'l Workshop on Design and Management of Data Warehouses, DMDW'00, Stockholm, Sweden.

Lehner, W., Albrecht, J., & Wedekind, H. (1998). Multidimensional normal forms. 10th Int'l Conference on Scientific and Statistical Data Management, SSDBM'98, Capri, Italy.

Nguyen, T., Tjoa, A.M., & Wagner, S. (2000). Conceptual multidimensional data model based on metacube. Int'l Conference on Advances in Information Systems (pp. 24-33), Izmir, Turkey.

Pedersen, T.B., & Jensen, C.S. (1999). Multidimensional data modelling for complex data. Int'l Conference on Data Engineering, ICDE'99.

Pourabbas, E., & Rafanelli, M. (1999). Characterization of hierarchies and some operators in OLAP environment. ACM Second International Workshop on Data Warehousing and OLAP, DOLAP'99 (pp. 54-59), Kansas City, USA.

Tsois, A., Karayannidis, N., & Sellis, T. (2001). MAC: Conceptual data modeling for OLAP. Int'l Workshop on
Design and Management of Data Warehouses, DMDW'01, Interlaken, Switzerland.

Vassiliadis, P., & Skiadopoulos, S. (2000). Modelling and optimisation issues for multidimensional databases. International Conference on Advanced Information Systems Engineering, CAiSE 2000 (pp. 482-497), Stockholm, Sweden.

KEY TERMS

Data Warehouse: A data base which is specifically elaborated to allow different analyses on data. Analysis generally consists in making aggregation operations (count, sum, average, etc.). A data warehouse is different from a transactional data base since it accumulates data along time and other dimensions. Data of a warehouse are loaded and updated at regular intervals from the transactional data bases of the company.

Dimension: Set of members (criteria) allowing to drive the analysis (example for the Product dimension: product type, manufacturer type). Members are used to drive the aggregation operations.

Drilldown: The opposite of the rollup operation: going down in a hierarchy to a less aggregated, more detailed level.

Fact: Element recorded in a warehouse (example: each product sold in a shop) and whose characteristics (i.e. measures) are the object of the analysis (example: quantity of a product sold in a shop).

Galaxy Structure: Structure of a warehouse in which two different types of facts share a same dimension.

Hierarchy: The members of a dimension are generally organized along levels into a hierarchy.

Member: Every criterion in a dimension is materialized through a member.

Rollup: Operation consisting in going up in a hierarchy to a more aggregated level.

Star Structure: Structure of a warehouse in which a fact is directly connected to several dimensions and can thus be analyzed according to these dimensions. It is the simplest and the most used structure.
Genetic Programming
William H. Hsu
Kansas State University, USA
INTRODUCTION

Genetic programming (GP) is a subarea of evolutionary computation first explored by John Koza (1992) and independently developed by Nichael Lynn Cramer (1985). It is a method for producing computer programs through adaptation according to a user-defined fitness criterion, or objective function.

GP systems and genetic algorithms (GAs) are related but distinct approaches to problem solving by simulated evolution. As in the GA methodology, GP uses a representation related to some computational model, but in GP, fitness is tied to task performance by specific program semantics. Instead of strings or permutations, genetic programs most commonly are represented as variable-sized expression trees in imperative or functional programming languages, as grammars (O'Neill & Ryan, 2001), or as circuits (Koza et al., 1999). GP uses patterns from biological evolution to evolve programs:

Crossover: Exchange of genetic material, such as program subtrees or grammatical rules.

Selection: The application of the fitness criterion in order to choose which individuals from a population will go on to reproduce.

Replication: The propagation of individuals from one generation to the next.

Mutation: The structural modification of individuals.

To work effectively, GP requires an appropriate set of program operators, variables, and constants. Fitness in GP typically is evaluated over fitness cases. In data mining, this usually means training and validation data, but cases also can be generated dynamically using a simulator or be directly sampled from a real-world problem-solving environment. GP uses evaluation over these cases to measure performance over the required task, according to the given fitness criterion.

This article begins with a survey of the design of GP systems and their applications to data-mining problems, such as pattern classification, optimization of representations for inputs and hypotheses in machine learning, grammar-based information extraction, and problem transformation by reinforcement learning. It concludes with a discussion of current issues in GP systems (i.e., scalability, human-comprehensibility, code growth and reuse, and incremental learning).

BACKGROUND

Although Cramer (1985) first described the use of crossover, selection, mutation, and tree representations for using genetic algorithms to generate programs, Koza (1992) is indisputably the field's most prolific and influential author (Wikipedia, 2004). In four books, Koza et al. (1992, 1994, 1999, 2003) have described GP-based solutions to numerous toy problems and several important real-world problems.

State of the Field: To date, GPs have been applied successfully to a few significant problems in machine learning and data mining, most notably symbolic regression and feature construction. The method is very computationally intensive, however, and it is still an open question in current research whether simpler methods can be used instead. These include supervised inductive learning, deterministic optimization, randomized approximation using non-evolutionary algorithms (e.g., Markov chain Monte Carlo approaches), genetic algorithms, and evolutionary algorithms. It is postulated by GP researchers that the adaptability of GPs to structural, functional, and structure-generating solutions of unknown forms makes them more amenable to solving complex problems. Specifically, Koza et al. (1999, 2003) demonstrate that, in many domains, GP is capable of human-competitive automated discovery of concepts deemed to be innovative through technical review such as patent evaluation.

MAIN THRUST

The general strengths of genetic programming lie in its ability to produce solutions of variable functional form, reuse partial solutions, solve multi-criterion optimization problems, and explore a large search space of solutions in parallel. Modern GP systems also are able to produce structured, object-oriented, and functional programming solutions involving recursion or iteration, subtyping, and higher-order functions.

A more specific advantage of GPs is their ability to represent procedural, generative solutions to pattern recognition and machine-learning problems. Examples of
this include image compression and reconstruction (Koza, 1992) and several of the recent applications surveyed in the following.

GP for Pattern Classification

GP in pattern classification departs from traditional supervised inductive learning in that it evolves solutions whose functional form is not determined in advance and in some cases can be theoretically arbitrary. Koza (1992, 1994) developed GPs for several pattern reproduction problems, such as the multiplexer and symbolic regression problems.

Since then, there has been continuing work on inductive GP for pattern classification (Kishore et al., 2000), prediction (Brameier & Banzhaf, 2001), and numerical curve fitting (Nikolaev & Iba, 2001). GP has been used to boost performance in learning polynomial functions (Nikolaev & Iba, 2001). More recent work on tree-based multi-crossover schemes has produced positive results in GP-based design of classification functions (Muni et al., 2004).

GP for Control of Inductive Bias, Feature Construction, and Feature Extraction

GP approaches to inductive learning face the general problem of optimizing inductive bias: the preference for some groups of hypotheses over others on bases other than pure consistency with training data or other fitness cases. Krawiec (2002) approaches this problem by using GP to preserve useful components of representation (features) during an evolutionary run, validating them using the classification data and reusing them in subsequent generations. This technique is related to the wrapper approach to knowledge discovery in databases (KDD), where validation data is held out and used to select examples for supervised learning or to construct or select variables given as input to the learning system. Because GP is a generative problem-solving approach, feature construction in GP tends to involve production of new variable definitions rather than merely selecting a subset.

Evolving dimensionally-correct equations on the basis of data is another area where GP has been applied. Keijzer and Babovic (2002) provide a study of how GP formulates its declarative bias and preferential (search-based) bias. In this and related work, it is shown that a proper units-of-measurement (strong typing) approach can capture declarative bias toward correct equations, whereas type coercion can implement even better preferential bias.

Grammar-Based GP for Data Mining

Not all GP-based approaches use expression-tree-based representations or functional program interpretation as the computational model. Wong and Leung (2000) survey data mining using grammars and formal languages. This general approach has been shown to be effective for some natural language learning problems, and extension of the approach to procedural information extraction is a topic of current research in the GP community.

GP Software Packages: Functionality and Research Features

A number of GP software packages are available publicly and commercially. General features common to most GP systems for research and development include a very high-period random number generator, such as the Mersenne Twister, for random constant generation and GP operations; a variety of selection, crossover, and mutation operations; and trivial parallelism (e.g., through multithreading).

One of the most popular packages for experimentation with GP is Evolutionary Computation in Java, or ECJ (Luke et al., 2004). ECJ implements the previously discussed features as well as parsimony, strongly-typed GP, migration strategies for exchanging individuals among subpopulations in island mode GP (a type of GP featuring multiple demes: local populations or breeding groups), vector representations, and reconfigurability using parameter files.

Other Applications: Optimization, Policy Learning

Like other genetic and evolutionary computation methodologies, GP is driven by fitness and suited to optimization approaches to machine learning and data mining. Its program-based representation makes it good for acquiring policies by reinforcement learning. Many GP problems are error-driven or payoff-driven (Koza, 1992), including the ant trail problems and foraging problems now explored more heavily by the swarm intelligence and ant colony optimization communities. A few problems use specific information-theoretic criteria, such as maximum entropy or sequence randomization.
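As a concrete illustration of the tree-based symbolic regression setting described above, here is a minimal GP sketch combining random tree generation, crossover, mutation, and tournament selection. The representation, function set, and parameters are illustrative choices of my own, not taken from any of the systems cited.

```python
import copy
import operator
import random

# A tiny function and terminal set for expression trees.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", 1.0, 2.0]

def random_tree(depth=3):
    """Grow a random expression tree: ['op', left, right] or a terminal."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return [random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1)]

def evaluate(tree, x):
    if tree == "x":
        return x
    if not isinstance(tree, list):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree, cases):
    """Squared error over the fitness cases; lower is better."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in cases)

def paths(tree, path=()):
    """All positions in the tree, as index paths."""
    yield path
    if isinstance(tree, list):
        for i in (1, 2):
            yield from paths(tree[i], path + (i,))

def subtree_at(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_at(tree, path, sub):
    if not path:
        return sub
    new = copy.deepcopy(tree)
    node = new
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = sub
    return new

def crossover(a, b):
    """Structural exchange: graft a random subtree of b into a."""
    pa = random.choice(list(paths(a)))
    pb = random.choice(list(paths(b)))
    return replace_at(a, pa, copy.deepcopy(subtree_at(b, pb)))

def mutate(tree):
    """Structural modification: overwrite a random position with a new subtree."""
    return replace_at(tree, random.choice(list(paths(tree))), random_tree(2))

def select(pop, cases, k=3):
    """Tournament selection: the fitness criterion decides who reproduces."""
    return min(random.sample(pop, k), key=lambda t: fitness(t, cases))

# Evolve approximations of x^2 + 1 for a few generations.
random.seed(1)
cases = [(x, x * x + 1.0) for x in range(-3, 4)]
pop = [random_tree() for _ in range(40)]
for generation in range(10):
    pop = [mutate(child) if random.random() < 0.2 else child
           for child in (crossover(select(pop, cases), select(pop, cases))
                         for _ in range(40))]
best = min(pop, key=lambda t: fitness(t, cases))
```

Note that nothing in this sketch bounds tree size, so repeated crossover exhibits exactly the code growth problem discussed later in this article; real systems counter it with parsimony pressure or depth limits.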
Kishore, J.K., Patnaik, L.M., Mani, V., & Agrawal, V.K. (2000). Application of genetic programming for multicategory pattern classification. IEEE Transactions on Evolutionary Computation, 4(3), 242-258.

Koza, J.R. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press.

Koza, J.R. (1994). Genetic programming II: Automatic discovery of reusable programs. Cambridge, MA: MIT Press.

Koza, J.R., Bennett III, F.H., André, D., & Keane, M.A. (1999). Genetic programming III: Darwinian invention and problem solving. San Mateo, CA: Morgan Kaufmann.

Koza, J.R. et al. (2003). Genetic programming IV: Routine human-competitive machine intelligence. San Mateo, CA: Morgan Kaufmann.

Krawiec, K. (2002). Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines, 3(4), 329-343.

Kushchu, I. (2002). Genetic programming and evolutionary generalization. IEEE Transactions on Evolutionary Computation, 6(5), 431-442.

Luke, S. (2000). Issues in scaling genetic programming: Breeding strategies, tree generation, and code bloat [doctoral thesis]. College Park, MD: University of Maryland.

Luke, S. et al. (2004). Evolutionary computation in Java v11. Retrieved from http://www.cs.umd.edu/projects/plus/ec/ecj/

Muni, D.P., Pal, N.R., & Das, J. (2004). A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation, 8(2), 183-196.

Nikolaev, N.Y., & Iba, H. (2001a). Regularization approach to inductive genetic programming. IEEE Transactions on Evolutionary Computation, 5(4), 359-375.

Nikolaev, N.Y., & Iba, H. (2001b). Accelerated genetic programming of polynomials. Genetic Programming and Evolvable Machines, 2(3), 231-257.

O'Neill, M., & Ryan, C. (2001). Grammatical evolution. IEEE Transactions on Evolutionary Computation.

Wikipedia. (2004). Genetic programming. Retrieved from http://en.wikipedia.org/wiki/Genetic_programming

Wong, M.L., & Leung, K.S. (2000). Data mining using grammar based genetic programming and applications. Norwell, MA: Kluwer.

KEY TERMS

Automatically-Defined Function (ADF): Parametric functions that are learned and assigned names for reuse as subroutines. ADFs are related to the concept of macro-operators or macros in speedup learning.

Code Growth (Code Bloat): The proliferation of solution elements (e.g., nodes in a tree-based GP representation) that do not contribute toward the objective function.

Crossover: In biology, a process of sexual recombination, by which two chromosomes are paired up and exchange some portion of their genetic sequence. Crossover in GP is highly stylized and involves structural exchange, typically using subexpressions (subtrees) or production rules in a grammar.

Evolutionary Computation: A solution approach based on simulation models of natural selection, which begins with a set of potential solutions and then iteratively applies algorithms to generate new candidates and select the fittest from this set. The process leads toward a model that has a high proportion of fit individuals.

Generation: The basic unit of progress in genetic and evolutionary computation, a step in which selection is applied over a population. Usually, crossover and mutation are applied once per generation, in strict order.

Individual: A single candidate solution in genetic and evolutionary computation, typically represented using strings (often of fixed length) and permutations in genetic algorithms, or using problem-solver representations (i.e., programs, generative grammars, or circuits) in genetic programming.

Island Mode GP: A type of parallel GP, where multiple subpopulations (demes) are maintained and evolve independently, except during scheduled exchanges of individuals.

Mutation: In biology, a permanent, heritable change to the genetic material of an organism. Mutation in GP involves structural modifications to the elements of a candidate solution. These include changes, insertion, duplication, or deletion of elements (subexpressions, parameters passed to a function, components of a resistor-capacitor-inductor circuit, non-terminals on the right-hand side of a production rule).

Parsimony: An approach in genetic and evolutionary computation related to minimum description length, which rewards compact representations by imposing a penalty on individuals in direct proportion to their size (e.g., number of nodes in a GP tree). The rationale for parsimony is that it promotes generalization in supervised inductive
Graph Transformations and Neural Networks
transformation systems can be used. Three possibilities In Figure 1, a sample application of a graph transfor-
are especially promising: first, it is interesting whether an algorithm is terminating. Though this question is undecidable in the general case, the formal methods of graph rewriting and general rewriting offer some chances to prove termination for neural network algorithms. The same holds for the question of whether the result produced by an algorithm is useful, whether the learning of a neural network was successful. It also helps to prove whether two algorithms are equivalent. Finally, possible parallelism in algorithms can be detected and described, based on results for graph transformation systems.

BACKGROUND

A Short Introduction to Graph Transformations

Despite the different approaches to handling graph transformations, there are some properties that all approaches have in common. When transforming a graph G somehow, it is necessary to specify what part of the graph, what subgraph L, has to be exchanged. For this subgraph, a new graph R must be inserted. When applying such a rule to a graph G, three steps are necessary:

Choose an occurrence of L in G.
Delete L from G.
Insert R into the remainder of G.

In Figure 1, an example of a graph transformation rule is shown. The left-hand side L consists of three nodes (1:, 2:, 3:) and three edges. This graph is embedded into a graph G. Numbers in G indicate how the nodes of L are matched. The embedding of edges is straightforward. In the next step, L is deleted from G, and R is inserted. If L is simply deleted from G, hanging edges remain: all edges ending/starting at 1:, 2:, 3: are missing one node after deletion. With the help of the numbers 1:, 2:, 3: in the right-hand side R, it is indicated how these hanging edges are attached to R when it is inserted in G/L. The resulting graph is H.

Simple graphs are not enough for modeling real-world applications. Among the different extensions, two are of special interest. First, graphs and graph rules can be labeled. When G is labeled with numbers, L is labeled with variables, and R is labeled with terms over L's variables. This way, calculations can be modeled. Taking our example and extending G with the numbers 1, 2, 3, the left-hand side L with the variables x, y, z and the right-hand side with the terms x+y, x-y, x*y, x^y is shown in Figure 1. When L is embedded in G, the variables are set to the numbers of the corresponding nodes. The nodes in H are labeled with the result of the terms in R when the variable settings resulting from the embedding of L in G are used.

Also, application conditions can be added, restricting the application of a rule. For example, the existence of a certain subgraph A in G can be allowed or forbidden. A rule can only be applied if A can be found (resp. not found) in G. Additionally, label-based application conditions are possible. The rule above could be extended by requiring x < y; only in this case would the rule be applied.

[Figure 1: the rule with left-hand side L (nodes 1:, 2:, 3: labeled with the variables x, y, z) and right-hand side R (labeled with the terms x+y, x-y, x*y, x^y), applied to a graph G labeled 1, 2, 3, yielding the graph H labeled 3, -1, 2, 1.]
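The three rule-application steps and the label calculus just described can be sketched in code. The dict-based graph encoding and all identifiers below are illustrative assumptions, not taken from the article; for brevity the sketch relabels three matched nodes with three of the terms from Figure 1.

```python
# Minimal sketch of applying a labeled graph rewrite rule. Because R reuses
# the node numbers of L, dangling edges stay attached automatically; only the
# labels change, to terms evaluated over the variable bindings.

def apply_rule(graph, match, rhs_terms):
    """graph: {node: numeric label}; match: rule variable -> graph node;
    rhs_terms: rule variable -> term, given as a function of the bindings."""
    # Step 1: choose an occurrence of L in G (given here as `match`)
    # and bind the variables x, y, z to the labels of the matched nodes.
    env = {var: graph[node] for var, node in match.items()}
    # Steps 2 and 3: delete L and insert R, i.e., relabel the matched nodes
    # with the right-hand side terms evaluated under `env`.
    new_graph = dict(graph)
    for var, node in match.items():
        new_graph[node] = rhs_terms[var](env)
    return new_graph

G = {"n1": 1, "n2": 2, "n3": 3}
match = {"x": "n1", "y": "n2", "z": "n3"}       # the embedding of L in G
rhs = {"x": lambda e: e["x"] + e["y"],          # node 1: becomes x+y
       "y": lambda e: e["x"] - e["y"],          # node 2: becomes x-y
       "z": lambda e: e["x"] * e["y"]}          # node 3: becomes x*y
H = apply_rule(G, match, rhs)                   # {"n1": 3, "n2": -1, "n3": 2}
```

With G labeled 1, 2, 3 this reproduces the labels 3, -1, and 2 seen on the corresponding nodes of H in Figure 1.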
Graph Transformations and Neural Networks
Combining Neural Networks and Graph Transformations

Various proposals exist as to how graph transformations and neural networks can be combined, having different goals in mind. Several ideas originate from evolutionary computing (Curran & O'Riordan, 2002; De Jong & Pollack, 2001; Siddiqi & Lucas, 1998); others stem from electrical engineering (Wan & Beaufays, 1998). The approach of Fischer (2000) has its roots in graph transformations itself. Despite these sources, only three really different ideas can be found:

In most of the papers (Curran & O'Riordan, 2002; De Jong & Pollack, 2001; Siddiqi & Lucas, 1998), some basic graph operators like insert-node, delete-node, change-attribute, and so forth are used. This set of operators differs between the approaches. In addition to these operators, some kind of application condition exists, stating which rule has to be applied when and on which nodes or edges. These application conditions can be directed acyclic graphs giving the sequence of the rule applications. They can also be tree-based, where newly created nodes are handed to different paths in the tree. The main application area of these approaches is to grow neural networks from just one node. In this field, other grammar-based approaches can also be found (Cantu-Paz & Kamath, 2002), where matrices are rewritten (Browse, Hussain & Smillie, 1999), taking attributed grammars. In Tsakonas and Dounias (2002), feed-forward neural networks are grown with the help of a grammar in Backus-Naur form.

In Wan and Beaufays (1998), signal flow graphs known from electrical engineering are used to model the information flow through the net. With the help of rewrite rules, the elements can be reversed, so that the signal flow goes in the opposite direction as before. This way, gradient descent-based training methods such as backpropagation, a famous training algorithm (Rojas, 2000), can be derived. A disadvantage of this method is that no topology-changing algorithms can be modeled.

In Fischer (2000), arbitrary neural nets are modeled as graphs and transformed by arbitrary transformation rules. Because this is the most general approach, it will be explained in detail in the following sections.

MAIN THRUST

In the remainder of this article, one special sort of neural network, the so-called probabilistic neural networks, together with training algorithms, are explained in detail.

Probabilistic Neural Networks

The main purpose of a probabilistic neural network is to sort patterns into classes. It always has three layers: an input layer, a hidden layer, and an output layer. The input neurons are connected to each neuron in the hidden layer. The neurons of the hidden layer are connected to one output neuron each. Hidden neurons are modeled with two nodes, as they have an input value and do some calculations on it, resulting in an output value. Neurons and connections are labeled with values resp. weights. In Figure 2, a probabilistic neural network is shown.

Calculations in Probabilistic Neural Networks

First, input is presented to the input neurons. The next step is to calculate the input value of the hidden neurons. The main purpose of the hidden neurons is to represent examples of classes. Each neuron represents one example. This example forms the weights from the input neurons to this special hidden neuron. When an input is presented to the net, each neuron in the hidden layer computes the probability that it is the example it models. Therefore, first, the Euclidean distance between the activation of the input neurons and the weight of the connection to the hidden layer's neuron is computed. The results coming via the connections are summed up within a hidden neuron. This is the distance that the current input has from the example modeled by the neuron. If the exact example is inserted into the net, the result is 0. In Figure 3, this calculation is shown in detail. The given graph rewrite rule can be applied to the net shown in Figure 2. The labels i are variables modeling the input values of the input neurons, w models the weights

Figure 2. A probabilistic neural network seen as a graph. Please note that not all edges are labeled, due to space reasons.
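The distance computation described above can be sketched as follows. The Gaussian used to turn the summed distance into a probability-like activation is a common choice for probabilistic neural networks and an assumption here, as are all names.

```python
import math

# Sketch of the hidden-layer calculation: each hidden neuron stores one
# training example as its weight vector, and its input value is the squared
# Euclidean distance between the presented pattern and those weights.

def hidden_distances(x, examples):
    """Squared Euclidean distance from input x to each stored example."""
    return [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in examples]

def hidden_activations(x, examples, sigma=1.0):
    """Gaussian of the distance (an assumption): an exact match yields 1."""
    return [math.exp(-d / (2 * sigma ** 2)) for d in hidden_distances(x, examples)]

examples = [(0.4, 0.7), (0.3, 0.2)]              # one hidden neuron per example
print(hidden_distances((0.4, 0.7), examples)[0])  # 0.0: the exact example
```

Presenting a stored example reproduces the property stated in the text: its distance is 0, so that hidden neuron responds maximally.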
A more sophisticated training method is called Dynamic Decay Adjustment (Silipo, 2002). The algorithm also starts with no neuron in the hidden layer. If a pattern is presented to the net, first, the activation of all existing hidden neurons is calculated. If there is a neuron whose activation is equal to or higher than a given threshold θ+, this neuron covers the input pattern. If this is not the case, a new hidden neuron is inserted into the net, as shown in Figure 4. With the help of this algorithm, fewer hidden neurons are inserted.

Figure 4. A new neuron is inserted into a net. This example is taken from Dynamic Decay Adjustment (Silipo, 2002).
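The insertion step just described can be sketched as follows. The Gaussian activation, the parameter name theta_plus, and the concrete threshold value are illustrative assumptions, not the article's code.

```python
import math

# Simplified sketch of the insertion step of Dynamic Decay Adjustment:
# a pattern only triggers a new hidden neuron when no existing neuron
# activates at or above the threshold theta_plus.

def covers(pattern, neurons, theta_plus, sigma=1.0):
    """True if some existing hidden neuron activates at or above theta_plus."""
    return any(
        math.exp(-sum((p - w) ** 2 for p, w in zip(pattern, n)) / sigma)
        >= theta_plus
        for n in neurons)

def train_step(pattern, neurons, theta_plus=0.4):
    """Insert a new hidden neuron only for uncovered patterns (cf. Figure 4)."""
    if not covers(pattern, neurons, theta_plus):
        neurons.append(tuple(pattern))
    return neurons

neurons = []
train_step((0.0, 0.0), neurons)   # no neuron yet -> inserted
train_step((0.0, 0.1), neurons)   # covered by the first neuron -> skipped
train_step((5.0, 5.0), neurons)   # far from every neuron -> inserted
print(len(neurons))               # 2
```

The second pattern lies close to the first stored neuron, so it is covered and no neuron is inserted, which is exactly how the algorithm keeps the hidden layer small.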
FUTURE TRENDS
rithms, graph transformation systems can present their full power. This might be of special interest for educational purposes, where it is useful to visualize step-by-step what algorithms do. Finally, the theoretical background of graph transformation and rewriting systems offers several possibilities for proving termination, equivalence, and the like, of algorithms.

REFERENCES

Blostein, D., & Schürr, A. (1999). Computing with graphs and graph transformation. Software Practice and Experience, 29(3), 1-21.

Browse, R.A., Hussain, T.S., & Smillie, M.B. (1999). Using attribute grammars for the genetic selection of backpropagation networks for character recognition. Proceedings of Applications of Artificial Neural Networks in Image Processing IV, San Jose, California.

Cantu-Paz, E., & Kamath, C. (2002). Evolving neural networks for the classification of galaxies. Proceedings of the Genetic and Evolutionary Computation Conference.

Curran, D., & O'Riordan, C. (2002). Applying evolutionary computation to designing neural networks: A study of the state of the art [technical report]. Galway, Ireland: Department of Information Technology, National University of Ireland.

De Jong, E., & Pollack, J. (2001). Utilizing bias to evolve recurrent neural networks. Proceedings of the International Joint Conference on Neural Networks, Washington, D.C.

Ehrig, H., Engels, G., Kreowski, H.-J., & Rozenberg, G. (Eds.). (1999a). Handbook of graph grammars and computing by graph transformation. Singapore: World Scientific.

Ehrig, H., Kreowski, H.-J., Montanari, U., & Rozenberg, G. (Eds.). (1999b). Handbook of graph grammars and computing by graph transformation. Singapore: World Scientific.

Fischer, I. (2000). Describing neural networks with graph transformations [doctoral thesis]. Nuremberg: Friedrich-Alexander University Erlangen-Nuremberg.

Klop, J.W., De Vrijer, R.C., & Bezem, M. (2003). Term rewriting systems. Cambridge: Cambridge University Press.

Nipkow, T., & Baader, F. (1999). Term rewriting and all that. Cambridge: Cambridge University Press.

Rojas, R. (2000). Neural networks: A systematic introduction. New York, NY: Springer Verlag.

Rozenberg, G. (Ed.). (1997). Handbook of graph grammars and computing by graph transformations. Singapore: World Scientific.

Siddiqi, A., & Lucas, S.M. (1998). A comparison of matrix rewriting versus direct encoding for evolving neural networks. Proceedings of the IEEE International Conference on Evolutionary Computation, Anchorage, Alaska.

Silipo, R. (2002). Artificial neural networks. In M. Berthold & D. Hand (Eds.), Intelligent data analysis (pp. 269-319). New York, NY: Springer Verlag.

Tsakonas, A., & Dounias, D. (2002). A scheme for the evolution of feedforward neural networks using BNF-grammar driven genetic programming. Proceedings of EUNITE (European Network on Intelligent Technologies for Smart Adaptive Systems), Algarve, Portugal.

Wan, E., & Beaufays, F. (1998). Diagrammatic methods for deriving and relating temporal neural network algorithms. In C. Giles & M. Gori (Eds.), Adaptive processing of sequences and data structures (pp. 63-98). Salerno, Italy: International Summer School on Neural Networks.

KEY TERMS

Confluence: A rewrite system is confluent if, no matter in which order rules are applied, they lead to the same result.

Graph: A graph consists of vertices and edges. Each edge is connected to a source node and a target node. Vertices and edges can be labeled with numbers and symbols.

Graph Production: Similar to productions in general Chomsky grammars, a graph production consists of a left-hand side and a right-hand side. The left-hand side is embedded in a host graph. Then it is removed, and in the resulting hole, the right-hand side of the graph production is inserted. To specify how this right-hand side is attached into this hole, that is, how edges are connected to the new nodes, some additional information is necessary. Different approaches exist for how to handle this problem.

Graph Rewriting: The application of a graph production to a graph is also called graph rewriting.

Neural Networks: Learning systems, designed by analogy with a simplified model of the neural connections in the brain, which can be trained to find nonlinear relationships in data. Several neurons are connected to form the neural networks.
Neuron: The smallest processing unit in a neural network.

Probabilistic Neural Network: One of the many different kinds of neural networks, with the application area of classifying input data into different classes.

Rewrite System: Consists of a set of configurations and a relation x → y denoting that the configuration x follows the configuration y with the help of a rule application.

Termination: A rewrite system terminates if it has no infinite chain.

Weight: Connections between neurons of neural networks have a weight. This weight can be changed during the training of the net.
Diane J. Cook
University of Texas at Arlington, USA
INTRODUCTION

Graph-based data mining represents a collection of techniques for mining the relational aspects of data represented as a graph. Two major approaches to graph-based data mining are frequent subgraph mining and graph-based relational learning. This article will focus on one particular approach embodied in the Subdue system, along with recent advances in graph-based supervised learning, graph-based hierarchical conceptual clustering, and graph-grammar induction.

Most approaches to data mining look for associations among an entity's attributes, but relationships between entities represent a rich source of information, and ultimately knowledge. The field of multi-relational data mining, of which graph-based data mining is a part, is a new area investigating approaches to mining this relational information by finding associations involving multiple tables in a relational database. Two main approaches have been developed for mining relational information: logic-based approaches and graph-based approaches.

Logic-based approaches fall under the area of inductive logic programming (ILP). ILP embodies a number of techniques for inducing a logical theory to describe the data, and many techniques have been adapted to multi-relational data mining (Dzeroski & Lavrac, 2001; Dzeroski, 2003). Graph-based approaches differ from logic-based approaches to relational mining in several ways, the most obvious of which is the underlying representation. Furthermore, logic-based approaches rely on the prior identification of the predicate or predicates to be mined, while graph-based approaches are more data-driven, identifying any portion of the graph that has high support. However, logic-based approaches allow the expression of more complicated patterns involving, for example, recursion, variables, and constraints among variables. These representational limitations of graphs can be overcome, but at a computational cost.

BACKGROUND

Graph-based data mining (GDM) is the task of finding novel, useful, and understandable graph-theoretic patterns in a graph representation of data. Several approaches to GDM exist based on the task of identifying frequently occurring subgraphs in graph transactions, that is, those subgraphs meeting a minimum level of support. Washio & Motoda (2003) provide an excellent survey of these approaches. We here describe four representative GDM methods.

Kuramochi and Karypis (2001) developed the FSG system for finding all frequent subgraphs in large graph databases. FSG starts by finding all frequent single and double edge subgraphs. Then, in each iteration, it generates candidate subgraphs by expanding the subgraphs found in the previous iteration by one edge. In each iteration the algorithm checks how many times the candidate subgraph occurs within an entire graph. The candidates whose frequency is below a user-defined level are pruned. The algorithm returns all subgraphs occurring more frequently than the given level.

Yan and Han (2002) introduced gSpan, which combines depth-first search and lexicographic ordering to find frequent subgraphs. Their algorithm starts from all frequent one-edge graphs. The labels on these edges together with labels on incident vertices define a code for every such graph. Expansion of these one-edge graphs maps them to longer codes. Since every graph can map to many codes, all but the smallest code are pruned. Code ordering and pruning reduces the cost of matching frequent subgraphs in gSpan. Yan & Han (2003) describe a refinement to gSpan, called CloseGraph, which identifies only subgraphs satisfying the minimum support, such that no supergraph exists with the same level of support.

Inokuchi et al. (2003) developed the Apriori-based Graph Mining (AGM) system, which searches the space of frequent subgraphs in a bottom-up fashion, beginning
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Graph-Based Data Mining
with a single vertex, and then continually expanding by a single vertex and one or more edges. AGM also employs a canonical coding of graphs in order to support fast subgraph matching. AGM returns association rules satisfying user-specified levels of support and confidence.

The last approach to GDM, and the one discussed in the remainder of this chapter, is embodied in the Subdue system (Cook & Holder, 2000). Unlike the above systems, Subdue seeks a subgraph pattern that not only occurs frequently in the input graph, but also significantly compresses the input graph when each instance of the pattern is replaced by a single vertex. Subdue performs a greedy search through the space of subgraphs, beginning with a single vertex and expanding by one edge. Subdue returns the pattern that maximally compresses the input graph. Holder & Cook (2003) describe current and future directions in this graph-based relational learning variant of GDM.

MAIN THRUST

As a representative of GDM methods, this section will focus on the Subdue graph-based data mining system. The input data is a directed graph with labels on vertices and edges. Subdue searches for a substructure that best compresses the input graph. A substructure consists of a subgraph definition and all its occurrences throughout the graph. The initial state of the search is the set of substructures consisting of all uniquely labeled vertices. The only operator of the search is the Extend Substructure operator. As its name suggests, it extends a substructure in all possible ways by a single edge and a vertex, or by only a single edge if both vertices are already in the subgraph.

Subdue's search is guided by the minimum description length (MDL) principle, which seeks to minimize the description length of the entire data set. The evaluation heuristic based on the MDL principle assumes that the best substructure is the one that minimizes the description length of the input graph when compressed by the substructure. The description length of the substructure S given the input graph G is calculated as DL(G,S) = DL(S) + DL(G|S), where DL(S) is the description length of the substructure, and DL(G|S) is the description length of the input graph compressed by the substructure. Subdue seeks a substructure S that minimizes DL(G,S).

The search progresses by applying the Extend Substructure operator to each substructure in the current state. The resulting state, however, does not contain all the substructures generated by the Extend Substructure operator. The substructures are kept on a queue and are ordered based on their description length (or sometimes referred to as value) as calculated using the MDL principle. The queue's length is bounded by a user-defined constant.

The search terminates upon reaching a user-specified limit on the number of substructures extended, or upon exhaustion of the search space. Once the search terminates and Subdue returns the list of best substructures found, the graph can be compressed using the best substructure. The compression procedure replaces all instances of the substructure in the input graph by single vertices, which represent the substructure's instances. Incoming and outgoing edges to and from the replaced instances will point to, or originate from, the new vertex that represents the instance. The Subdue algorithm can be invoked again on this compressed graph.

Figure 1 illustrates the GDM process on a simple example. Subdue discovers substructure S1, which is used to compress the data. Subdue can then run for a second iteration on the compressed graph, discovering substructure S2. Because instances of a substructure can appear in slightly different forms throughout the data, an inexact graph match, based on graph edit distance, is used to identify substructure instances.

Figure 1. Graph-based data mining: A simple example

Most GDM methods follow a similar process. Variations involve different heuristics (e.g., frequency vs. MDL) and different search operators (e.g., merge vs. extend).

Graph-Based Hierarchical Conceptual Clustering

Given the ability to find a prevalent subgraph pattern in a larger graph and then compress the graph with this pattern, iterating over this process until the graph can no longer be compressed will produce a hierarchical, conceptual clustering of the input data. On the ith iteration, the best subgraph Si is used to compress the input graph, introducing new vertices labeled Si in the graph input to the next iteration. Therefore, any subsequently-discovered subgraph Sj can be defined in terms of one or more of the Si, where i < j. The result is a lattice, where each cluster can be defined in terms of more than one parent subgraph.
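The compress-and-iterate loop behind this clustering can be sketched in a toy form. Collapsing graphs and substructures to labeled-edge lists and using raw frequency in place of Subdue's MDL-based value are simplifying assumptions, as are all names.

```python
from collections import Counter

# Toy sketch of the compress-and-iterate loop: each iteration discovers the
# most prevalent pattern, compresses it to a placeholder label, and the
# sequence of discovered patterns forms one path through the cluster lattice.

def cluster_hierarchy(edges, iterations=3):
    """Each iteration's discovered pattern becomes one level of the lattice."""
    levels = []
    for i in range(iterations):
        # Count only original labels so the toy does not re-pick its own
        # placeholders (real Subdue may build Sj out of earlier Si vertices).
        counts = Counter(e for e in edges if not e.startswith("S"))
        if not counts or counts.most_common(1)[0][1] < 2:
            break                       # nothing prevalent left to compress
        label = counts.most_common(1)[0][0]
        levels.append(label)
        # Compress: replace each instance by a single new placeholder label.
        edges = [f"S{i + 1}" if e == label else e for e in edges]
    return levels

print(cluster_hierarchy(["ab", "ab", "ab", "cd", "cd", "ef"]))  # ['ab', 'cd']
```

The loop stops when no remaining pattern occurs at least twice, mirroring the "until the graph can no longer be compressed" condition above.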
For example, Figure 2 shows such a clustering done on a DNA molecule. Note that the ordering of pattern discovery can affect the parents of a pattern. For instance, the lower-left pattern in Figure 2 could have used the C-C-O pattern, rather than the C-C pattern, but in fact, the lower-left pattern is discovered before the C-C-O pattern. For more information on graph-based clustering, see Jonyer et al. (2001).

Figure 2. Graph-based hierarchical, conceptual clustering of a DNA molecule

Graph-Based Supervised Learning

Extending a graph-based data mining approach to perform supervised learning involves the need to handle negative examples (focusing on the two-class scenario). In the case of a graph, the negative information can come in three forms. First, the data may be in the form of numerous smaller graphs, or graph transactions, each labeled either positive or negative. Second, data may be composed of two large graphs: one positive and one negative. Third, the data may be one large graph in which the positive and negative labeling occurs throughout. We will talk about the third scenario in the section on future directions.

The first scenario is closest to the standard supervised learning problem in that we have a set of clearly defined examples. Let G+ represent the set of positive graphs, and G- represent the set of negative graphs. Then, one approach to supervised learning is to find a subgraph that appears often in the positive graphs, but not in the negative graphs. This amounts to replacing the information-theoretic measure with simply an error-based measure. This approach will lead the search toward a small subgraph that discriminates well. However, such a subgraph does not necessarily compress well, nor represent a characteristic description of the target concept.

We can bias the search toward a more characteristic description by using the information-theoretic measure to look for a subgraph that compresses the positive examples, but not the negative examples. If I(G) represents the description length (in bits) of the graph G, and I(G|S) represents the description length of graph G compressed by subgraph S, then we can look for an S that minimizes I(G+|S) + I(S) + I(G-) - I(G-|S), where the last two terms represent the portion of the negative graph incorrectly compressed by the subgraph. This approach will lead the search toward a larger subgraph that characterizes the positive examples, but not the negative examples.

Finally, this process can be iterated in a set-covering approach to learn a disjunctive hypothesis. If using the error measure, then any positive example containing the learned subgraph would be removed from subsequent iterations. If using the information-theoretic measure, then instances of the learned subgraph in both the positive and negative examples (even multiple instances per example) are compressed to a single vertex. Note that the compression is a lossy one; that is, we do not keep enough information in the compressed graph to know how the instance was connected to the rest of the graph. This approach is consistent with our goal of learning general patterns, rather than mere compression. For more information on graph-based supervised learning, see Gonzalez et al. (2002).

Graph Grammar Induction

As mentioned earlier, two of the advantages of the logic-based approach to relational learning are the ability to learn recursive hypotheses and constraints among variables. However, there has been much work in the area of graph grammars, which overcome this limitation. Graph grammars are similar to string grammars except that terminals can be arbitrary graphs rather than symbols from an alphabet. While much of the work on graph grammars involves the analysis of various classes of graph grammars, recent research has begun to develop techniques for learning graph grammars (Doshi et al., 2002; Jonyer et al., 2002).

Figure 3b shows an example of a recursive graph grammar production rule learned from the graph in Figure 3a. A GDM approach can be extended to consider graph grammar productions by analyzing the instances of a subgraph to see how they are related to each other. If two or more instances are connected to each other by one or more edges, then a recursive production rule generating an infinite sequence of such connected subgraphs can be constructed. A slight modification to the information-theoretic measure taking into account the extra information needed to describe the recursive
component of the production is all that is needed to allow such a hypothesis to compete alongside simple subgraphs (i.e., terminal productions) for maximizing compression.

These graph grammar productions can include non-terminals on the right-hand side. These productions can be disjunctive, as in Figure 3c, which represents the final production learned from Figure 3a using this approach. The disjunction rule is learned by looking for similar, but not identical, extensions to the instances of a subgraph. A new rule can be constructed that captures the disjunctive nature of this extension, and included in the pool of production rules competing based on their ability to compress the input graph. With a proper encoding of this disjunction information, the MDL criterion will trade off the complexity of the rule with the amount of compression it affords in the input graph. An alternative to defining these disjunction non-terminals is to instead construct a variable whose range consists of the different disjunctive values of the production. In this way we can introduce constraints among variables contained in a subgraph by adding a constraint edge to the subgraph. For example, if the four instances of the triangle structure in Figure 3a each had another edge to a c, d, f and f vertex respectively, then we could propose a new subgraph where these two vertices are represented by variables, and an equality constraint is introduced between them. If the range of the variable is numeric, then we can also consider inequality constraints between variables and other vertices or variables in the subgraph pattern.

Figure 3. Graph grammar learning example with (a) the input graph, (b) the first grammar rule learned, and (c) the second and third grammar rules learned

Jonyer (2003) has developed a graph grammar learning approach with the above capabilities. The approach has shown promise both in handling noise and learning recursive hypotheses in many different domains, including learning the building blocks of proteins and communication chains in organized crime.

FUTURE TRENDS

The field of graph-based relational learning is still young, but the need for practical algorithms is growing fast. Therefore, we need to address several challenging scalability issues, including incremental learning in dynamic graphs. Another issue regarding practical applications involves the blurring of positive and negative examples in a supervised learning task; that is, the graph has many positive and negative parts, not easily separated, and with varying degrees of class membership.

Partitioning and Incremental Mining for Scalability

Scaling GDM approaches to very large graphs, graphs too big to fit in main memory, is an ever-growing challenge. Two approaches to address this challenge are being investigated. One approach involves partitioning the graph into smaller graphs that can be processed in a distributed fashion (Cook et al., 2001). A second approach involves implementing GDM within a relational database management system, taking advantage of user-defined functions and the optimized storage capabilities of the RDBMS.

A newer issue regarding scalability involves dynamic graphs. With the advent of real-time streaming data, many data mining systems must mine incrementally, rather than off-line from scratch. Many of the domains we wish to mine in graph form are dynamic domains. We do not have the time to periodically rebuild graphs of all the data to date and run a GDM system from scratch. We must develop methods to incrementally update the graph and the patterns currently prevalent in the graph. One approach is similar to the graph partitioning approach for distributed processing. New data can be stored in an increasing number of partitions. Information within partitions can be exchanged, or a repartitioning can be performed if the information loss exceeds some threshold. GDM can be used to search the new partitions, suggesting new subgraph patterns as they evaluate highly in new and old partitions.

In a highly relational domain, the positive and negative examples of a concept are not easily separated. Such a graph is called a supervised graph, in that the graph as a
International Conference on Knowledge Discovery and Data Mining.

KEY TERMS

Conceptual Graph: Graph representation described by a precise semantics based on first-order logic.

Dynamic Graph: Graph representing a constantly changing stream of data.

Frequent Subgraph Mining: Finding all subgraphs within a set of graph transactions whose frequency satisfies a user-specified level of minimum support.

Graph-Based Data Mining: Finding novel, useful, and understandable graph-theoretic patterns in a graph representation of data.

Graph Grammar: Grammar describing the construction of a set of graphs, where terminals and non-terminals represent vertices, edges or entire subgraphs.

Inductive Logic Programming: Techniques for learning a first-order logic theory to describe a set of relational data.

Minimum Description Length (MDL) Principle: Principle stating that the best theory describing a set of data is the one minimizing the description length of the theory plus the description length of the data described (or compressed) by the theory.

Multi-Relational Data Mining: Mining patterns that involve multiple tables in a relational database.

Supervised Graph: Graph in which each vertex and edge can belong to multiple categories to varying degrees. Such a graph complicates the ability to clearly define transactions on which to perform data mining.
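The level-wise search behind the Frequent Subgraph Mining key term (and FSG-style systems such as those surveyed in the BACKGROUND section) can be sketched in a toy form. Representing a pattern as a set of labeled edges and testing containment instead of true subgraph isomorphism is a deliberate simplification that is only sound when labels identify edges uniquely; all names are illustrative.

```python
# Toy level-wise frequent "subgraph" miner: start from frequent single edges,
# grow candidates one edge at a time, and prune candidates whose support
# falls below the user-specified minimum.

def frequent_subgraphs(transactions, minsup):
    """transactions: list of sets of labeled edges; returns frequent patterns."""
    def support(pattern):
        return sum(pattern <= t for t in transactions)

    edges = {e for t in transactions for e in t}
    levels = [{frozenset([e]) for e in edges if support(frozenset([e])) >= minsup}]
    while levels[-1]:
        # Expand each surviving pattern of the previous level by one edge.
        candidates = {p | {e} for p in levels[-1] for e in edges if e not in p}
        levels.append({c for c in candidates if support(c) >= minsup})
    return set().union(*levels)

txns = [{"ab", "bc", "cd"}, {"ab", "bc"}, {"ab", "cd"}]
print(len(frequent_subgraphs(txns, minsup=2)))  # 5 frequent patterns
```

The downward-closure pruning mirrors the Apriori idea used by FSG and AGM: a pattern is only generated by extending a pattern that was already frequent.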
Chengqi Zhang
University of Technology Sydney, Australia
Group Pattern Discovery Systems for Multiple Data Sources
and (2) a pattern discovery system and a post-mining system for solving the second problem.

MAIN THRUST

Group pattern discovery systems are able to (i) effectively enhance data quality for mining MDSs and (ii) automatically identify potentially useful patterns from the multi-dimensional data in MDSs.

Data Enhancement

Data enhancement includes the following:

1. The data cleaning system mainly includes these functions: recovering incomplete data (filling in missing values or expelling ambiguity); purifying data (consistency of data names, consistency of data formats, correcting errors, or removing outliers); and resolving data conflicts (using domain knowledge or expert decisions to settle discrepancies).
2. The logical system for enhancing data quality focuses on the following epistemic properties: veridicality, introspection, and consistency.
3. The logical system for resolving conflicts has the property of obeying the weighted majority principle in case of conflicts.
4. The fuzzy database clustering system generates good database clusters.

Identifying Interesting Patterns

A local pattern may be a frequent itemset, an association rule, a causal rule, a dependency, or some other expression. Local pattern analysis is an in-place strategy specifically designed for mining MDSs, providing a feasible way to generate globally interesting models from data in multi-dimensional spaces.

Based on our local pattern analysis, three key systems can be developed for automatically searching for potentially useful patterns from local patterns: (a) identifying high-vote patterns; (b) finding exceptional patterns; and (c) synthesizing patterns by weighting majority.

(a) Identifying High-Vote Patterns: Within an MDS environment, each data source, large or small, can have an equal power to vote for its patterns for the decision-making of a company. Some patterns can receive votes from most of the data sources. These patterns are referred to as high-vote patterns. High-vote patterns represent the commonness of the branches. Therefore, these patterns may be far more important in terms of decision-making within the company. The key problem is how to efficiently search for high-vote patterns of interest in multi-dimensional spaces. It can be attacked by mining the distribution of all patterns.

(b) Finding Exceptional Patterns: Like high-vote patterns, exceptional patterns are also regarded as novel patterns in multiple data sources, which reflect the individuality of data sources. While high-vote patterns are useful when a company is reaching common decisions, headquarters is also interested in viewing exceptional patterns when special decisions are made at only a few of the branches, perhaps for predicting the sales of a new product. Exceptional patterns can capture the individuality of branches. Therefore, although an exceptional pattern receives votes from only a few branches, it is extremely valuable information in MDSs. The key problem is how to construct efficient methods for measuring the interestingness of exceptional patterns.

(c) Searching for Synthesizing Patterns by Weighting Majority: Although each data source can have an equal power to vote for patterns for making decisions, data sources may differ in importance to a company. For example, if the sales of branch A are four times those of branch B, branch A is certainly more important than branch B in the company. (Here, each branch in a company is viewed as a data source in an MDS environment.) The decisions of the company are reasonably partial to high-sale branches. Also, local patterns may have different supports. For example, let the supports of patterns X1 and X2 be 0.9 and 0.4 in a branch. Pattern X1 is far more believable than pattern X2. These two examples present the importance of branches and patterns for decision making within a company. Therefore, synthesizing patterns is very useful.

Post Pattern Analysis

In an MDS environment, a pattern (e.g., a high-vote association rule) is attached to certain factors, including name, vote, vsupp, and vconf. For a very large set of data sources, a high-vote association rule may be supported by a great number of data sources. So, the sets of its support and confidence values in these data sources are too large to be browsed by users, and thus it is rather difficult for users to apply the rule to decision making. Therefore, post-pattern analysis is very important in MDS mining. The key problem is how to construct an effective partition for classifying the mined patterns.
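The weighting idea in (c) can be made concrete with a small sketch. The scheme below (branch weights proportional to sales, synthesized support computed as the weighted average of local supports) is an illustrative assumption, not the article's exact synthesizing formula:

```python
# Sketch: synthesizing local pattern supports by weighting majority.
# Assumption (not from the article): weights are branch sales normalized
# to sum to 1, and the synthesized support is the weighted average of
# each pattern's local supports.

def synthesize_supports(branch_sales, local_supports):
    """branch_sales: {branch: sales volume}.
    local_supports: {branch: {pattern: local support}}.
    Returns {pattern: synthesized support}."""
    total_sales = sum(branch_sales.values())
    weights = {b: s / total_sales for b, s in branch_sales.items()}
    synthesized = {}
    for branch, patterns in local_supports.items():
        for pattern, supp in patterns.items():
            synthesized[pattern] = synthesized.get(pattern, 0.0) + weights[branch] * supp
    return synthesized

# Branch A sells four times as much as branch B, so its patterns count more.
sales = {"A": 4.0, "B": 1.0}
supports = {"A": {"X1": 0.9, "X2": 0.4}, "B": {"X1": 0.5, "X2": 0.8}}
result = synthesize_supports(sales, supports)
print({p: round(s, 2) for p, s in result.items()})  # {'X1': 0.82, 'X2': 0.48}
```

With branch A weighted four times branch B, pattern X1's synthesized support stays close to A's local value, matching the intuition that decisions should be partial to high-sale branches.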
Vincent To-yee Ng
The Hong Kong Polytechnic University, Hong Kong
Heterogeneous Gene Data for Classifying Tumors
solution would be to meta-analyze multiple, heterogeneous gene expression data sets, forming meta-decisions from a number of individual decisions.

The last difficulty is to find common features in various cancer types. These features can be referred to as sets of significant genes which are most likely expressed in most cancer types, but which may be expressed differently in different cancer types. The study of human cancer has recently discovered that the development of antigen-specific cancer vaccines leads to the discovery of immunogenic genes. This group of tumor antigens has been introduced under the term cancer-testis (CT) antigen (Coulie et al., 2002). Discovered CT antigens have recently been grouped into distinct subsets and named cancer/testis (CT) immunogenic gene families. Some works show that most CT immunogenic gene families are expressed in more than one cancer type, but with various expression frequencies. Researchers have reviewed and summarized the current discovery as 44 CT immunogenic gene families consisting of 89 individual genes in total (Scanlan, Simpson, & Old, 2004).

MAIN THRUST

It is possible to make classification algorithms more reliable and robust by combining multiple, heterogeneous gene expression data sets. A simple combination method is to merge or append one data set to another. Unfortunately, this method is inflexible because data sets have various scales and ranges of variation. These are required to be the same in order to have consistent scales for comparison after the combination.

In this chapter, we discuss two approaches to combining data sets that vary in the microarray technologies used. The first, and simplest, approach is to normalize the expression levels of the genes in each data set to mean zero and standard deviation one (i.e., the standard normal distribution, N(0, 1)) according to the means and standard deviations across samples in the individual data sets. While this approach is simple to apply, it assumes that all genes have the same or similar expression rates. However, this assumption is incorrect. In fact, only a small subset of genes reflects the existence of tumors, and the remaining genes in a tumor are not epidemiologically significant. It should also be noted that the reflected genes do not all express at the same rate. Therefore, when all genes in the data sets are normalized to N(0, 1), the variations of the reflected genes may be underestimated and the variations of genes which are stable and irrelevant may be overestimated. This situation worsens as the number of genes in the data sets increases.

The second, and better, approach is to select a subset of reference genes, also known as significant genes, and to use the expression levels of these genes to estimate scaling factors, which are used to rescale the expression levels of genes in other data sets with the same set of reference genes as in the original subset. This approach has two advantages. The first is that it allows the effects of outliers caused by non-significant genes to be eliminated while using only a subset of significant genes. In a gene expression data set, only a proportion of genes is tumor-specific. Because gene expression data is high-dimensional, focusing on such tumor-specific genes in classification reduces computational costs. The second advantage is that it improves the quality of the normalization or re-scaling, since it avoids the underestimation of the expression levels of significant genes, a problem which may arise because of the presence of large numbers of non-significant genes. We also note that selection algorithms are the focus of much current research. Some works that utilize existing feature selection algorithms include Dudoit, Yang, Callow, and Speed (2002), Bloom et al. (2004), and Lee et al. (2003). New or enhanced algorithms have been proposed by Park et al. (2003); Ng, Tan, and Sundarajan (2003); Choi, Yu, Kim, and Yoo (2003); Storey and Tibshirani (2003); Chilingaryan, Gevorgyan, Vardanyan, Jones, and Szabo (2002); and Golub et al. (1999).

In recent years, detection of significant genes was mainly done using fold-change detection. This detection method is unreliable because it does not take statistical variability into account. Currently, however, most algorithms used to select significant genes apply statistical methods. In the rest of the chapter, we first present some recent works on the identification of significant genes using statistical methods. We then briefly describe our proposed measure, Impact Factors (IFs), which can be used to carry out tumor classification using heterogeneous gene expression data (Fung & Ng, 2003).

Statistical Methods

The most common statistical method for identifying significant genes is the two-sample t-test (Cui & Churchill, 2003). The advantage of this test is that, because it requires only one gene to be studied per t-test, it is insensitive to heterogeneity in variance across genes. However, while reliable t-values require large sample sizes, gene expression data sets normally have small sample sizes. This problem of small sample sizes can be overcome using global t-tests, but these assume that the variance is homogeneous across different genes (Tusher, Tibshirani, & Chu, 2001). Tusher, Tibshirani, and Chu (2001) proposed a modified t-test that they called significance analysis of microarrays (SAM). SAM identifies significant genes in microarray experiments by measuring fluctuations of the expression levels of genes across a number of microarray experiments. These fluctuations are estimated using permutation tests and, to avoid inflated t-values, are expressed as a constant in the denominator of the two-sample t-test. Tibshirani, Hastie, Narasimhan, and Chu (2002) proposed a modified nearest-centroid classification. For all genes, it uses a t-test to calculate a centroid distance, defined as the distance from the class centroids to the overall centroid among the classes of the genes. The centroid distance is then used to shrink the class centroids towards the overall centroid in order to reduce overfitting.

Correlation analysis is another common statistical method used to rank the significance of genes. Kuo, Jenssen, Butte, Ohno-Machado, and Kohane (2002) applied the Pearson linear and Spearman rank-order correlation coefficients to study the flexibility of cross-platform utilization of data from multiple gene expression data sets. Lee et al. (2003) also used the Pearson and Spearman correlation coefficients to study the correlation among NCI-60 cancer data sets consisting of different cancer types. They, however, proposed a measure to rank the correlation of correlations among various data sets. Later studies of correlation focused on multi-platform, multi-type tumor data sets. Bloom et al. (2004) used the Kruskal-Wallis H-test to identify significant genes within multi-type, multi-platform tumor data sets consisting of 21 data sets and 15 different cancer types.

Impact Factors

Recently, we proposed a dissimilarity measure called Impact Factors (IFs) that measures the inter-experimental variations between individual classes in training samples and heterogeneous testing samples (Fung & Ng, 2003). The calculation of IFs takes place in two stages of selection: selection for re-scaling and selection for classification. In the first stage, we use SAM to select a set of significant genes. From these, we calculate individual reference points corresponding to the different classes in the training set. These reference points are then used to calculate their own scaling factors for the corresponding classes. The factors are used to rescale the expression levels of all genes in the testing samples. There are two advantages to using individual scaling factors corresponding to different classes: they ensure that the different gene expression levels of one class are not underestimated or overestimated because of unbalanced sample sizes between classes, and they allow individual testing samples to be compared with individual classes.

The second stage of selection is selection for classification. This is done by calculating the differences between the rescaled testing samples (since there are two scaling factors, there are two rescaled samples in binary-class tumor classification) and the individual classes in the training set. It should be noted that, to improve the discriminative power of the IFs, only those genes with higher differences are selected and used in classification.

IFs have been integrated into classifiers to perform meta-classification of heterogeneous cancer gene expression data (Fung & Ng, 2003). For most classifiers using either similarity or dissimilarity measures for making classification decisions, IFs can be integrated by multiplying the IFs directly with the original measures. The actual multiplication to be carried out depends on whether the measures used by the classifier for making decisions are dissimilarity or similarity measures. If applied to dissimilarity measures, the IF of a class is multiplied by the measure having the same class as the corresponding IF. In contrast, if applied to similarity measures, the IF of a class is multiplied by the measure having another class as the corresponding IF.

FUTURE TRENDS

Although DNA microarrays can be used to predict patients' responses to medical treatment as well as clinical outcomes, tumor classification using gene expression data is as yet unreliable. This unreliability has multiple causes. First of all, while some international organizations such as the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) have created their own gene expression data repositories, there still exists no international benchmark data repository. This makes it difficult for researchers to validate findings. Certainly, there is a need for integrated gene expression databases which can help researchers validate their findings against data produced by different laboratories around the world, so that the efficiency and effectiveness of different proposed mining algorithms can be compared objectively.

Unreliability could also be said to arise from inadequate interdisciplinary communication between professionals and researchers in relevant fields. Recently, a number of promising mining algorithms have been proposed, but much work of this kind still remains to be analyzed and validated in molecular and biological terms (Sevenet & Cussenot, 2003). The developers of these algorithms would welcome such input, as it would be an invaluable assistance to them in constructing
Ng, S.K., Tan, S.H., & Sundarajan, V.S. (2003). On combining multiple microarray studies for improved functional classification by whole-dataset feature selection. Genome Informatics, 14, 44-53.

Park, T., Yi, S.G., Lee, S., Lee, S.Y., Yoo, D.H., Ahn, J.I., et al. (2003). Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics, 19(6), 694-703.

Ramaswamy, S., Ross, K.N., Lander, E.S., & Golub, T.R. (2003). Evidence for a molecular signature of metastasis in primary solid tumors. Nature Genetics, 33, 49-54.

Scanlan, M.J., Simpson, A.J.G., & Old, L.J. (2004). The cancer/testis genes: Review, standardization, and commentary. Cancer Immunity, 4, 1.

Sebastiani, P., Gussoni, E., Kohane, I.S., & Ramoni, M.F. (2003). Statistical challenges in functional genomics. Statistical Science, 18(1), 33-70.

Sevenet, N., & Cussenot, O. (2003). DNA microarrays in clinical practice: Past, present, and future. Clinical and Experimental Medicine, 3(1), 1-3.

Storey, J.D., & Tibshirani, R.J. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440-9445.

Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America, 99(10), 6567-6572.

Tusher, V.G., Tibshirani, R., & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America, 98(9), 5116-5121.

Zien, A., Fluck, J., Zimmer, R., & Lengauer, T. (2003). Microarrays: How many do you need? Journal of Computational Biology, 10(3-4), 653-667.

KEY TERMS

Bioinformatics: The integration of mathematical, statistical, and computational methods to analyze and handle biological, biomedical, biochemical, and biophysical information.

Cancer-Testis (CT) Antigen: A class of tumor antigens that are immunogenic in cancer patients, exhibit highly tissue-restricted expression, and are considered promising target molecules for cancer vaccines.

Classification: The process of distributing things into classes or categories of the same type by a learnt mapping function.

Gene Expression: The process by which the information encoded in a segment of DNA is converted, through transcription and translation, into proteins in a cell.

Microarrays: A technology for biological exploration which allows the amounts of mRNA of up to tens of thousands of genes to be measured simultaneously in a single experiment.

Normalization: In terms of gene expression data, a pre-processing step to minimize systematic bias and remove the impact of non-biological influences before data analysis is performed.

Probe Arrays: Arrays of labeled, single-stranded DNA or RNA molecules with specific nucleotide sequences, which are used to detect the complementary base sequence by hybridization.
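The per-gene t-statistic ranking described under Statistical Methods can be sketched as follows. The data, the fixed cutoff, and the Welch-style variance handling are illustrative assumptions; approaches such as SAM estimate significance by permutation rather than a fixed threshold:

```python
import math

# Sketch: ranking genes by a per-gene two-sample t-statistic, as in the
# common selection criterion described in the chapter. The expression
# values and the cutoff of 2.0 are made up for illustration.

def t_statistic(xs, ys):
    """Welch-style two-sample t-statistic for one gene's expression levels."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def select_significant(tumor, normal, threshold=2.0):
    """tumor/normal: {gene: [expression levels]}. Keep genes with |t| >= threshold."""
    stats = {g: t_statistic(tumor[g], normal[g]) for g in tumor}
    return {g: t for g, t in stats.items() if abs(t) >= threshold}

tumor = {"g1": [5.1, 5.3, 4.9, 5.2], "g2": [1.0, 1.4, 0.8, 1.2]}
normal = {"g1": [2.0, 2.2, 1.9, 2.1], "g2": [1.1, 1.3, 0.9, 1.1]}
print(select_significant(tumor, normal))  # g1 passes the cutoff; g2 does not
```

Because each gene is tested on its own, heterogeneous variances across genes do not interfere, but with only a handful of samples per class the individual t-values remain unstable, which is exactly the small-sample problem the chapter discusses.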
Ke Wang
Simon Fraser University, Canada
Martin Ester
Simon Fraser University, Canada
Hierarchical Document Clustering
DOCUMENT CLUSTERING METHODS

Hierarchical Clustering Methods

One popular approach in document clustering is agglomerative hierarchical clustering (Kaufman & Rousseeuw, 1990). Algorithms in this family build the hierarchy bottom-up by iteratively computing the similarity between all pairs of clusters and then merging the most similar pair. Different variations may employ different similarity measuring schemes (Karypis, 2003; Zhao & Karypis, 2001). Steinbach (2000) shows that the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (Kaufman & Rousseeuw, 1990) is the most accurate one in its category. The hierarchy can also be built top-down, which is known as the divisive approach. It starts with all the data objects in the same cluster and iteratively splits a cluster into smaller clusters until a certain termination condition is fulfilled.

Methods in this category usually suffer from their inability to perform adjustment once a merge or split has been performed. This inflexibility often lowers the clustering accuracy. Furthermore, due to the complexity of computing the similarity between every pair of clusters, UPGMA is not scalable for handling large data sets in document clustering, as experimentally demonstrated in (Fung, Wang, & Ester, 2003).

Partitioning Clustering Methods

K-means and its variants (Cutting, Karger, Pedersen, & Tukey, 1992; Kaufman & Rousseeuw, 1990; Larsen & Aone, 1999) represent the category of partitioning clustering algorithms that create a flat, non-hierarchical clustering consisting of k clusters. The k-means algorithm iteratively refines a randomly chosen set of k initial centroids, minimizing the average distance (i.e., maximizing the similarity) of documents to their closest (most similar) centroid. The bisecting k-means algorithm first selects a cluster to split, and then employs basic k-means to create two sub-clusters, repeating these two steps until the desired number k of clusters is reached. Steinbach (2000) shows that the bisecting k-means algorithm outperforms basic k-means as well as agglomerative hierarchical clustering in terms of accuracy and efficiency (Zhao & Karypis, 2002).

Both the basic and the bisecting k-means algorithms are relatively efficient and scalable, and their complexity is linear in the number of documents. As they are easy to implement, they are widely used in different clustering applications. A major disadvantage of k-means, however, is that an incorrect estimation of the input parameter, the number of clusters, may lead to poor clustering accuracy. Also, the k-means algorithm is not suitable for discovering clusters of largely varying sizes, a common scenario in document clustering. Furthermore, it is sensitive to noise that may have a significant influence on the cluster centroid, which in turn lowers the clustering accuracy. The k-medoids algorithm (Kaufman & Rousseeuw, 1990; Krishnapuram, Joshi, & Yi, 1999) was proposed to address the noise problem, but this algorithm is computationally much more expensive and does not scale well to large document sets.

Frequent Itemset-Based Methods

Wang et al. (1999) introduced a new criterion for clustering transactions using frequent itemsets. The intuition of this criterion is that many frequent items should be shared within a cluster, while different clusters should have more or less different frequent items. By treating a document as a transaction and a term as an item, this method can be applied to document clustering; however, the method does not create a hierarchy of clusters.

The Hierarchical Frequent Term-based Clustering (HFTC) method proposed by Beil, Ester, and Xu (2002) attempts to address the special requirements in document clustering using the notion of frequent itemsets. HFTC greedily selects the next frequent itemset, which represents the next cluster, minimizing the overlap of clusters in terms of shared documents. The clustering result depends on the order of selected itemsets, which in turn depends on the greedy heuristic used. Although HFTC is comparable to bisecting k-means in terms of clustering accuracy, experiments show that HFTC is not scalable (Fung, Wang, & Ester, 2003).

A Scalable Algorithm for Hierarchical Document Clustering: FIHC

A scalable document clustering algorithm, Frequent Itemset-based Hierarchical Clustering (FIHC) (Fung, Wang, & Ester, 2003), is discussed in greater detail because this method satisfies all of the requirements of document clustering mentioned above. We use item and term as synonyms below. In classical hierarchical and partitioning methods, the pairwise similarity between documents plays a central role in constructing a cluster; hence, those methods are document-centered. FIHC is cluster-centered in that it measures the cohesiveness of a cluster directly using frequent itemsets: documents in the same cluster are expected to share more common itemsets than those in different clusters.

A frequent itemset is a set of terms that occur together in some minimum fraction of documents. To illustrate the usefulness of this notion for the task of clustering, let us consider two frequent items, windows and apple. Documents that contain the word windows may relate to renovation. Documents that contain the word apple may relate to fruits. However, if both words occur together in many documents, then another topic, one that talks about operating systems, should be identified. By precisely discovering these hidden topics as the first step and then clustering documents based on them, the quality of the clustering solution can be improved. This approach is very different from HFTC, where the clustering solution greatly depends on the order of selected itemsets. Instead, FIHC assigns documents to the best cluster from among all available clusters (frequent itemsets). The intuition of the clustering criterion is that there are some frequent itemsets for each cluster in the document set, but different clusters share few frequent itemsets. FIHC uses frequent itemsets to construct clusters and to organize clusters into a topic hierarchy.

The following definitions are introduced in (Fung, Wang, & Ester, 2003): A global frequent itemset is a set of items that appear together in more than a minimum fraction of the whole document set. A global frequent item refers to an item that belongs to some global frequent itemset. A global frequent itemset containing k items is called a global frequent k-itemset. A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of documents in Ci. FIHC uses only the global frequent items in document vectors; thus, the dimensionality is significantly reduced.

The FIHC algorithm can be summarized in three phases: First, construct initial clusters. Second, build a cluster (topic) tree. Finally, prune the cluster tree in case there are too many clusters.

Figure 1. Initial clusters (clusters labeled {Sports}, {Sports, Ball}, {Sports, Tennis}, {Sports, Tennis, Ball}, and {Sports, Tennis, Racket}; a document Doc1 containing Sports, Tennis, Ball)

A document Doc1 containing the global frequent items Sports, Tennis, and Ball is assigned to the clusters {Sports}, {Sports, Ball}, {Sports, Tennis}, and {Sports, Tennis, Ball}. Suppose {Sports, Tennis, Ball} is the best cluster for Doc1, measured by some score function. Doc1 is then removed from {Sports}, {Sports, Ball}, and {Sports, Tennis}.

Building Cluster Tree

In the cluster tree, each cluster (except the root node) has exactly one parent. The topic of a parent cluster is more general than the topic of a child cluster, and they are similar to a certain degree (see Figure 2 for an example). Each cluster uses a global frequent k-itemset as its cluster label. A cluster with a k-itemset cluster label appears at level k in the tree. The cluster tree is built bottom-up by choosing the best parent at level k-1 for each cluster at level k. The parent's cluster label must be a subset of the child's cluster label. By treating all documents in the child cluster as a single document, the criterion for selecting the best parent is similar to the one for choosing the best cluster for a document.
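The first phase, including the disjointing step that Figure 1 illustrates, can be sketched as follows. The score function used here (prefer the largest matching cluster label) is a deliberate simplification; FIHC's actual score also weighs cluster-frequent items:

```python
# Sketch of FIHC's first phase: every global frequent itemset defines an
# initial cluster, a document joins each cluster whose label it contains,
# and is then kept only in its best cluster. "Best = largest matching
# label" is a simplifying assumption, not FIHC's actual score function.

def initial_clusters(docs, global_frequent_itemsets):
    """docs: {doc_id: set of global frequent items}. Returns {label: {doc_ids}}."""
    clusters = {frozenset(label): set() for label in global_frequent_itemsets}
    for doc_id, items in docs.items():
        for label in clusters:
            if label <= items:  # the document contains the whole cluster label
                clusters[label].add(doc_id)
    return clusters

def make_disjoint(docs, clusters):
    """Keep each document only in its highest-scoring cluster.
    Assumes every document matches at least one cluster label."""
    best = {}
    for doc_id in docs:
        candidates = [label for label in clusters if doc_id in clusters[label]]
        best[doc_id] = max(candidates, key=len)  # simplified score: label size
        for label in candidates:
            if label != best[doc_id]:
                clusters[label].discard(doc_id)
    return clusters, best

docs = {"Doc1": {"Sports", "Tennis", "Ball"}}
labels = [{"Sports"}, {"Sports", "Ball"}, {"Sports", "Tennis"},
          {"Sports", "Tennis", "Ball"}, {"Sports", "Tennis", "Racket"}]
clusters, best = make_disjoint(docs, initial_clusters(docs, labels))
print(sorted(best["Doc1"]))  # ['Ball', 'Sports', 'Tennis']
```

Run on the Figure 1 example, Doc1 initially lands in four clusters and is then removed from all but {Sports, Tennis, Ball}, reproducing the disjoint clusters the text describes.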
RELATED LINKS

The following are some clustering tools on the Internet:

Tools: FIHC implements Frequent Itemset-based Hierarchical Clustering.
Website: http://www.cs.sfu.ca/~ddm/

Tools: CLUTO implements Basic/Bisecting K-means and Agglomerative methods.

REFERENCES

Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. International Conference on Knowledge Discovery and Data Mining, KDD'02 (pp. 436-442), Edmonton, Alberta, Canada.

Cutting, D.R., Karger, D.R., Pedersen, J.O., & Tukey, J.W. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. International Conference on Research and Development in Information Retrieval, SIGIR'92 (pp. 318-329), Copenhagen, Denmark.

Fung, B., Wang, K., & Ester, M. (2003, May). Hierarchical document clustering using frequent itemsets. SIAM International Conference on Data Mining, SDM'03 (pp. 59-70), San Francisco, CA, United States.

Guha, S., Mishra, N., Motwani, R., & O'Callaghan, L. (2000). Clustering data streams. Symposium on Foundations of Computer Science (pp. 359-366).

Karypis, G. (2003). Cluto 2.1.1: A software package for clustering high dimensional datasets. Retrieved from http://www-users.cs.umn.edu/~karypis/cluto/

Kaufman, L., & Rousseeuw, P.J. (1990, March). Finding groups in data: An introduction to cluster analysis. New York: John Wiley & Sons, Inc.

Krishnapuram, R., Joshi, A., & Yi, L. (1999, August). A fuzzy relative of the k-medoids algorithm with application to document and snippet clustering. IEEE International Conference on Fuzzy Systems, FUZZ-IEEE '99, Korea.

Larsen, B., & Aone, C. (1999). Fast and effective text mining using linear-time document clustering. International Conference on Knowledge Discovery and Data Mining, KDD'99 (pp. 16-22), San Diego, California, United States.

Ordonez, C. (2003). Clustering binary data streams with K-means. Workshop on Research Issues in Data Mining and Knowledge Discovery, SIGMOD'03 (pp. 12-19), San Diego, California, United States.

Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. Workshop on Text Mining, SIGKDD'00.

van Rijsbergen, C.J. (1979). Information retrieval (2nd ed.). London: Butterworth Ltd.

Wang, K., Xu, C., & Liu, B. (1999). Clustering transactions using large items. International Conference on Information and Knowledge Management, CIKM'99 (pp. 483-490), Kansas City, Missouri, United States.

Wang, K., Zhou, S., & He, Y. (2001, April). Hierarchical classification of real life documents. SIAM International Conference on Data Mining, SDM'01, Chicago, United States.

Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: Experiments and analysis. Technical report, Department of Computer Science, University of Minnesota.

Zhao, Y., & Karypis, G. (2002, November). Evaluation of hierarchical clustering algorithms for document datasets. International Conference on Information and Knowledge Management (pp. 515-524), McLean, Virginia, United States.

KEY TERMS

Cluster Frequent Item: A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of documents in Ci.

Document Clustering: The automatic organization of documents into clusters or groups so that documents within a cluster have high similarity in comparison to one another, but are very dissimilar to documents in other clusters.

Document Vector: Each document is represented by a vector of the frequencies of the items remaining in the document after preprocessing.

Global Frequent Itemset: A set of words that occur together in some minimum fraction of the whole document set.

Inter-Cluster Similarity: The overall similarity among documents from two different clusters.

Intra-Cluster Similarity: The overall similarity among documents within a cluster.

Medoid: The most centrally located object in a cluster.

Stemming: For text mining purposes, morphological variants of words that have the same or similar semantic interpretations can be considered equivalent. For example, the words "computation" and "compute" can be stemmed into "comput".

Stop Words Removal: A preprocessing step for text mining. Stop words, like "the" and "this", which rarely help the mining process, are removed from the input data.
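The preprocessing named in the key terms (stop words removal followed by stemming) can be sketched as follows. The stop-word list and the crude suffix stripper are illustrative stand-ins for real resources such as a full stop-word lexicon and a Porter stemmer:

```python
# Sketch: stop words removal followed by naive suffix-stripping stemming.
# Both the stop-word set and the suffix list are tiny illustrative
# assumptions; a real pipeline would use a proper stemmer (e.g., Porter's).

STOP_WORDS = {"the", "this", "a", "of", "and"}
SUFFIXES = ["ation", "ing", "ed", "es", "s", "e"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 4 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, then stem the remaining tokens."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return [stem(w) for w in tokens]

print(preprocess("The computation and compute"))  # ['comput', 'comput']
```

As in the key-term example, "computation" and "compute" both reduce to the stem "comput", so the two morphological variants are counted as the same item when building document vectors.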
High Frequency Patterns in Data Mining
table and bitmap table. Bmonth means birth month; City means the location of the entities.

Next, we will interpret the bit-vectors in terms of set theory. A bit-vector can be viewed as a representation of a subset of V. For example, the bit-vector 100011100 of BusinesSize = TWENTY says that the first, fifth, sixth, and seventh entities have been selected; in other words, the bit-vector represents the subset {v1, v5, v6, v7}. The other two bit-vectors, for the values TEN and THIRTY, represent the subsets {v2, v3, v4} and {v8, v9}, respectively. We summarize such translations in Tables 2a, 2b, and 2c, and refer to these subsets as elementary granules.

Some easy observations:

1. The collection of elementary granules of an attribute (column) forms a partition; that is, all granules of this attribute are pairwise disjoint. This fact was observed by Pawlak (1982) and Tony Lee (1983).
2. From Tables 1 and 2, one can easily conclude that the relational table K, the bitmap table B, and the granular table G are isomorphic. Two tables are isomorphic if one can transform one table into the other by renaming all attribute values in a one-to-one fashion.
Granular Data Model (GDM): Uninterpreted Relational Table in Free Format

The middle columns of Tables 2a, 2b and 2c define 3 partitions. The universe and such 3 partitions, denoted by (V, {E_BusinesSize, E_Bmonth, E_City}), determine the granular table G and vice versa. More generally, a 3-tuple (V, E, C) is called a GDM, where E is a finite family of partitions, and C consists of the names of all elementary granules. A partition (equivalence relation) of V that is not in the given E is referred to as an uninterpreted attribute of GDM, and its elementary granules are uninterpreted attribute values.

GDM Theorem: The granular table G determines GDM and vice versa.

In view of the Isomorphic Theorem below, it is sufficient to do AM in GDM.

MAIN THRUST

Analysis of Association Mining (AM)

To understand the mathematical mechanics of AM, let us examine how the information has been created and processed. We will take the deductive data mining approach. First, let us set up some terminology. A symbol is a string of bits and bytes that represents a slice of real world; however, such a real world meaning does not participate in the formal processing or computing. We term such a processing computing with symbols. In AI, such a symbol is termed a semantic primitive (Feigenbaum, 1981). A symbol is termed a word if the intended real world meaning participates in the formal processing or computing. We term such a processing computing with words. Note that mathematicians use words (in group theory) as symbols; their words are our symbols.

Data Processing and Computing with Words

In traditional data processing (TDP), a relational table is a knowledge representation of a slice of real world. So each symbol of the table represents (to humans) a piece of the real world; however, such a representation is not implemented in the system. Nevertheless, the DBMS, under human commands, does process the data, for example, Bmonth (attribute), April, March (attribute values), with human-perceived semantics. So in TDP the relational table is a table of words; TDP is human-directed computing with words.

Data Mining and Computing with Symbols

In (automated) AM we use the table created in TDP. However, AM algorithms regard the TDP data as symbols; no real world meaning of each word participates in the process of AM. High frequency patterns are completely deduced from the counting of the symbols. AM is computing with symbols. The input data of AM is a relational table of symbols, whose real world meaning does not participate in formal computing.

Under such a circumstance, if we replace the given set of symbols by a new set, then we can derive new patterns by simply replacing the symbols in the old patterns. Formally, we have (Lin, 2002)

Isomorphic Theorem: Isomorphic relational tables have isomorphic patterns.

This theorem implies that the theory of AM is a syntactic theory.
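A minimal sketch of the Isomorphic Theorem in Python. The toy table, threshold, and renaming below are invented for illustration (they are not the article's Table 3); the point is that a one-to-one renaming of symbols renames the high frequency patterns and changes nothing else.

```python
# Hypothetical two-column table of symbols and a one-to-one renaming.
table = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
rename = {"a": "TWENTY", "b": "TEN", "x": "MAR", "y": "APR"}

def frequent_pairs(table, threshold):
    """Length-2 patterns: subtuples whose support meets the threshold."""
    counts = {}
    for row in table:
        counts[row] = counts.get(row, 0) + 1
    return {row for row, c in counts.items() if c >= threshold}

p1 = frequent_pairs(table, 2)
p2 = frequent_pairs([tuple(rename[s] for s in row) for row in table], 2)

# The patterns of the renamed table are exactly the renamed patterns.
assert p2 == {tuple(rename[s] for s in row) for row in p1}
print(p1, p2)   # {('a', 'x')} {('TWENTY', 'MAR')}
```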
Example: From Table 3, it should be clear that the one-to-one correspondence between K and K′ induces consistently a one-to-one correspondence between the two sets of distinct attribute values. We describe such a phenomenon by the statement: K and K′ are isomorphic.

In Table 4, we display the high frequency patterns of length 2 from Table K, K′ and GDM; the three sets of patterns are isomorphic to each other. So for AM, we can use any one of the three tables. An observation: in using K or K′ for AM, one needs to scan the table to get the support, while in using GDM, the support can be read from the cardinality of the granules; no database scan is required, which is one strength of GDM. Another observation: from the definition of elementary granules, it should be obvious that subtuples are mapped to the intersections of elementary granules.

Patterns and Granular Formulas

Implicitly, AM has assumed that high frequency patterns are expressions of the input symbols (elements of the input relational table). Such assumptions are not made in other techniques. In neural network techniques, the input data are numerical, but the patterns are not numerical expressions; they are essentially functions that are derived from activation functions (Lin, 1996; Park & Sandberg, 1991).

Let us return to AM, where the implicit assumption simplifies the problem. What are the possible expressions of the input symbols? There are two possible formalisms, logic formulas and set theoretical algebraic expressions. In logic form, we have several choices, deductive database systems, datalog, or decision logic, among others (Pawlak, 1991; Ullman, 1988-89); we choose decision logic because it is simpler. In set theoretical form, we use GDM (Lin, 2000).

Expressions of High Frequency Patterns

2. Set Based: A high frequency pattern in GDM is a granular expression, which is a set theoretical algebraic expression of elementary granules; when the expression is evaluated set theoretically, the cardinality of the resultant set is greater than or equal to the threshold; we will call such algebraic expressions granular patterns. Note that several distinct algebraic expressions of elementary granules may have the same resultant set.

Informally, a logical formula of a granular pattern is the logic formula of the names of elementary granules (Lin, 2000); more precisely, we translate elementary granules, ∪ and ∩, into their names, "or" and "and" respectively. Next, we note that there are only finitely many distinct subsets that can be generated by the intersections and unions of elementary granules in GDM. If we only consider the disjunctive normal form, the total number of possible high frequency patterns in AM is finite.

Finding High Frequency Patterns by Solving a Set of Linear Inequalities

Let B be the Boolean algebra generated by the elementary granules; the partial order is the set theoretical inclusion ⊆. Then B is the set of all granular expressions. Let O be the smallest element (it is not necessarily an empty set) and I the greatest element (I is the universe V). An element p is an atom if p ≠ O and there is no element x such that p ⊃ x ⊃ O. Each atom p is an intersection of some elementary granules. Let S(b) be the set of all atoms pj such that pj ⊆ b, and let s(b) be its cardinality. From (Birkhoff & MacLane, 1977, Chapter 11), we have

Proposition: Every b ∈ B can be expressed in the form b = p1 ∪ . . . ∪ p_s(b).

For convenience, let us define an operation of a binary number x and a set S. We write S*x to mean the following: S*1 = S and S*0 = O.
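The granule-based computation of support described above (subtuples map to intersections of elementary granules, so support is read off as a cardinality, with no table scan) can be sketched as follows, together with the S*x operation. The granule contents are assumptions consistent with the running example.

```python
# Assumed elementary granules, consistent with the article's running example.
granules = {
    "TWENTY": {"v1", "v5", "v6", "v7"},   # BusinesSize granule
    "MAR":    {"v1", "v5", "v6"},         # Bmonth granule (assumed)
}

def star(S, x):
    """The S*x operation: S*1 = S, S*0 = the smallest element O (here, empty)."""
    return S if x == 1 else set()

def support(subtuple, granules):
    """Subtuples map to intersections of elementary granules;
    the support is the cardinality of the intersection."""
    sets = [granules[v] for v in subtuple]
    out = sets[0]
    for s in sets[1:]:
        out = out & s
    return len(out)

print(support(("TWENTY", "MAR"), granules))   # 3, read off the granules
```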
Let p1, p2, . . . , pm be the set of all atoms in B. Then a granular expression b can be expressed as

b = p1*x1 ∪ . . . ∪ pm*xm,

and its cardinality can be expressed as

|b| = Σi |pi|*xi,

where |S| denotes the cardinality of the set S.

Main Theorem: Let s be the threshold. Then b is a high frequency pattern, if

|b| = Σi |pi|*xi ≥ s. (*)

In applications, the pi are readily computable; they are the elementary granules of the intersections of all partitions (defined by attributes); see Tables 1 and 2. So we only need to find all binary solutions of the xi. The generators of the solution can be enumerated along the hyperplanes of the inequalities of the constraints.

Observations

Theoretically, this is a remarkable theorem. It says all possible high frequency patterns can be found by solving linear inequalities. However, the practicality of the main theorem is highly dependent on the complexity of the problem. If both |pi| and s are small, then the number of solutions will be out of hand, simply due to the size of the number. We would like to stress that the difficulty is simply due to the size of the set of possible solutions, not the methodology. The result implies that the notion of high frequency patterns may not be tight enough. At this moment, (*) is useful only if the number of attributes under consideration is small.

FUTURE TRENDS

Tighter Notion of Patterns

While (TWENTY, MAR) from K has no meaning on its own, (20, SCREW) from K′ has a valid meaning. Let RW(K) be the Real World that K is representing. The summary implies that for the subtuple (TWENTY, MAR), even though it occurs very frequently in the table, there is no real world event corresponding to it. The data implies that three entities v1, v5, v6 have common properties encoded by TWENTY and MAR. In the table K, they are naively summarized into one concept (TWENTY, MAR). Unfortunately, in the real world RW(K), the three occurrences of TWENTY and MAR (from the three entities v1, v5, v6) do not integrate into an appropriate new concept (TWENTY, MAR). Such an error occurs because high frequency is an inadequate or inaccurate criterion. We need a tighter notion of patterns.

Semantic Oriented Data Mining

If we do know how to compute the semantics, then the computation should tell us that the two repeated words TWENTY and MAR cannot be combined into a new concept regardless of high repetitions, and should be dropped out. So semantic oriented data mining is needed (Lin & Louie, 2001, 2002). As ontology, the semantic web, and computing with words (semantic computing) are heating up, it could be the right time to move on to semantic oriented data mining.

New Notions of Patterns and Algorithmic Information Theory

In (Lin, 1993), based on algorithmic information theory or Kolmogorov complexity theory, we proposed that a non-random (compressible) string is a string with patterns, and the shortest Turing machine that generates this string is the pattern. We concluded, then, that a finite sequence (a relational table is a finite sequence) with long constant subsequences (the length of such a constant subsequence is the support) is trivially compressible (having a pattern). High frequency patterns are such patterns. Taking the same line of thought, what would be the next less trivial compressible finite sequences?
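The Main Theorem (*) above reduces the search for high frequency patterns to finding the binary solutions of a linear inequality. A brute-force sketch, practical only when the number of atoms is small, exactly as the Observations note; the atom cardinalities and threshold below are assumptions.

```python
from itertools import product

# Assumed atom cardinalities |p1|, ..., |p4| and threshold s.
atom_sizes = [3, 1, 2, 1]
s = 3

# Enumerate all binary vectors x and keep those satisfying (*):
# sum(|p_i| * x_i) >= s. Each solution encodes a granular pattern
# b = union of the chosen atoms.
solutions = [x for x in product((0, 1), repeat=len(atom_sizes))
             if sum(n * xi for n, xi in zip(atom_sizes, x)) >= s]

print(len(solutions))   # 11 of the 16 binary vectors satisfy (*)
```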
Garcia-Molina, H., Ullman, J. D., & Widom, J. (2002). Database systems: The complete book. Prentice Hall.

Lee, T.T. (1983). Algebraic theory of relational databases. The Bell System Technical Journal, 62(10), 3159-3204.

Lin, T.Y. (1993). Rough patterns in data - Rough sets and foundation of intrusion detection systems. Journal of Foundation of Computer Science and Decision Support, 18(3-4), 225-241.

Lin, T.Y. (1996, July). The power and limit of neural networks. In Proceedings of the 1996 Engineering Systems Design and Analysis Conference, 7 (pp. 49-53), Montpellier, France.

Lin, T.Y. (2000). Data mining and machine oriented modeling: A granular computing approach. Journal of Applied Intelligence, 13(2), 113-124.

Lin, T.Y., & Louie, E. (2001). Semantics oriented association rules. In 2002 World Congress of Computational Intelligence (pp. 956-961), Honolulu, Hawaii, May 12-17 (paper #5754).

Louie, E., & Lin, T.Y. (2000, October). Finding association rules using fast bit computation: Machine-oriented modeling. In Z. Ras & S. Ohsuga (Eds.), Foundations of intelligent systems, Lecture Notes in Artificial Intelligence #1932 (pp. 486-494). Springer-Verlag. 12th International Symposium on Methodologies for Intelligent Systems, Charlotte, NC.

Park, J., & Sandberg, I.W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3, 246-257.

Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341-356.

KEY TERMS

Association (Undirected Association Rule): A subtuple of a bag relation whose support is greater than a given threshold.

Bag Relation: A relation that permits repetition of tuples.

Computing with Symbols: The interpretations of the symbols do not participate in the formal data processing or computing.

Computing with Words: In this article, computing with words means one form of formal data processing or computing in which the interpretations of the symbols do participate. L.A. Zadeh uses this term in a much deeper way.

Deductive Data Mining: A data mining methodology that requires listing explicitly the input data and background knowledge. Roughly, it treats data mining as a deductive science (axiomatic method).

Granulation and Partition: A partition is a decomposition of a set into a collection of mutually disjoint subsets. Granulation is defined similarly, but allows the subsets to be generalized subsets, such as fuzzy sets, and permits overlapping.

Kolmogorov Complexity of a String: The length of the shortest program that can generate the given string.

Semantic Primitive: A symbol that has an interpretation, but the interpretation is not implemented in the system. So in automated computing, a semantic primitive is treated as a symbol. However, in interactive computing, it may (but need not) be treated as a word.

Support: The percentage of the tuples in which a subtuple occurs.
Homeland Security Data Mining and Link Analysis
individuals (Berry & Linoff; Han & Kamber, 2000; Thuraisingham, 1998).

Security Threats

Security threats have been grouped into many categories (Thuraisingham, 2003). These include information-related threats, where information technologies are used to sabotage critical infrastructures, and non-information-related threats, such as bombing buildings. Threats also may be real-time threats and non-real-time threats. Real-time threats are threats where attacks have timing constraints associated with them, such as "building X will be attacked within three days." Non-real-time threats are those threats that do not have timing constraints associated with them. Note that non-real-time threats could become real-time threats over time. Threats also include bioterrorism, where biological and possibly chemical weapons are used to attack, and cyberterrorism, where computers and networks are attacked. Bioterrorism could cost millions of lives, and cyberterrorism, such as attacks on banking systems, could cost millions of dollars. Some details on the threats and countermeasures are discussed in various texts (Bolz, 2001). The challenge is to come up with techniques to handle such threats. In this article, we discuss data-mining techniques for security applications.

MAIN THRUST

First, we will discuss data mining for homeland security. Then, we will focus on a specific data-mining technique called link analysis for homeland security. An aspect of homeland security is cyber security. Therefore, we also will discuss data mining for cyber security.

Applications of Data Mining for Homeland Security

Data-mining techniques are being examined extensively for homeland security applications. The idea is to gather information about various groups of people, study their activities, and determine if they are potential terrorists. As we have stated earlier, data-mining outcomes include making associations, linking analyses, forming clusters, classification, and anomaly detection. The techniques that result in these outcomes are based on neural networks, decision trees, market-basket analysis techniques, inductive logic programming, rough sets, link analysis based on graph theory, and nearest-neighbor techniques. The methods used for data mining include top-down reasoning, where we start with a hypothesis and then determine whether the hypothesis is true, and bottom-up reasoning, where we start with examples and then come up with a hypothesis (Thuraisingham, 1998). In the following, we will examine how data-mining techniques may be applied for homeland security applications. Later, we will examine a particular data-mining technique called link analysis (Thuraisingham, 2003).

Data-mining techniques include techniques for making associations, clustering, anomaly detection, prediction, estimation, classification, and summarization. Essentially, these are the techniques used to obtain the various data-mining outcomes. We will examine a few of these techniques and show how they can be applied to homeland security. First, consider association rule mining techniques. These techniques produce results such as "John and James travel together" or "Jane and Mary travel to England six times a year and to France three times a year." Essentially, they form associations between people, events, and entities. Such associations also can be used to form connections between different terrorist groups. For example, members from Group A and Group B have no associations, but Groups A and B have associations with Group C. Does this mean that there is an indirect association between A and B?

Next, let us consider clustering techniques. Clusters essentially partition the population based on a characteristic such as spending patterns. For example, those living in the Manhattan region form a cluster, as they spend over $3,000 on rent. Those living in the Bronx form another cluster, as they spend around $2,000 on rent. Similarly, clusters can be formed based on terrorist activities. For example, those living in region X bomb buildings, and those living in region Y bomb planes.

Finally, we will consider anomaly detection techniques. A good example here is learning to fly an airplane without wanting to learn to take off or land. The general pattern is that people want to get a complete training course in flying. However, there are now some individuals who want to learn to fly but do not care about taking off or landing. This is an anomaly. Another example is that John always goes to the grocery store on Saturdays. But on Saturday, October 26, 2002, he went to a firearms store and bought a rifle. This is an anomaly and may need some further analysis as to why he is going to a firearms store when he has never done so before. Some details on data mining for security applications have been reported recently (Chen, 2003).

Applications of Link Analysis

Link analysis is being examined extensively for applications in homeland security. For example, how do we
connect the dots describing the various events and make links and connections between people and events? One challenge to using link analysis for counterterrorism is reasoning with partial information. For example, agency A may have a partial graph, agency B another partial graph, and agency C a third partial graph. The question is, how do you find the associations between the graphs when no agency has the complete picture? One would argue that we need a data miner that would reason under uncertainty and be able to figure out the links between the three graphs. This would be the ideal solution, and the research challenge is to develop such a data miner. The other approach is to have an organization above the three agencies that will have access to the three graphs and make the links.

The strength behind link analysis is that by visualizing the connections and associations, one can have a better understanding of the associations among the various groups. Associations such as A and B, B and C, D and A, C and E, E and D, F and B, and so forth can be very difficult to manage if we assert them as rules. However, by using the nodes and links of a graph, one can visualize the connections and perhaps draw new connections among different nodes. Now, in the real world, there would be thousands of nodes and links connecting people, groups, events, and entities from different countries and continents as well as from different states within a country. Therefore, we need link analysis techniques to determine the unusual connection, such as a connection between G and P, for example, which is not obvious with simple reasoning strategies or by human analysis.

Link analysis is one of the data-mining techniques that is still in its infancy. That is, while much has been written about techniques such as association rule mining, automatic clustering, classification, and anomaly detection, very little material has been published on link analysis. We need interdisciplinary researchers such as mathematicians, computational scientists, computer scientists, machine-learning researchers, and statisticians working together to develop better link analysis tools.

Applications of Data Mining for Cyber Security

Data mining also has applications in cyber security, which is an aspect of homeland security. The most prominent application is in intrusion detection. For example, our computers and networks are being intruded upon by unauthorized individuals. Data-mining techniques, such as those for classification and anomaly detection, are being used extensively to detect such unauthorized intrusions. For example, data about normal behavior is gathered, and when something occurs out of the ordinary, it is flagged as an unauthorized intrusion. Normal behavior could be that John's computer is never used between 2:00 A.M. and 5:00 A.M. When John's computer is in use at 3:00 A.M., for example, then this is flagged as an unusual pattern.

Data mining is also being applied to other applications in cyber security, such as auditing. Here again, data on normal database access is gathered, and when something unusual happens, it is flagged as a possible access violation. Digital forensics is another area where data mining is being applied. Here again, by mining the vast quantities of data, one could detect the violations that have occurred. Finally, data mining is being used for biometrics. Here, pattern recognition and other machine-learning techniques are being used to learn the features of a person and then to authenticate the person based on those features.

FUTURE TRENDS

While data mining has many applications in homeland security, it also causes privacy concerns. This is because we need to collect all kinds of information about people, which causes private information to be divulged. Privacy and data mining have been the subject of much debate during the past few years, although some early discussions also have been reported (Thuraisingham, 1996).

One promising direction is privacy-preserving data mining. The challenge here is to carry out data mining but at the same time ensure privacy. For example, one could use randomization as a technique and give out approximate values instead of the actual values. The challenge is to ensure that the approximate values are still useful. Many papers on privacy-preserving data mining have been published recently (Agrawal & Srikant, 2000).

CONCLUSION

This article has discussed data-mining applications in homeland security. Applications in national security and cyber security both are discussed. We first provided an overview of data mining and security threats and then discussed data-mining applications. We also emphasized a particular data-mining technique, link analysis. Finally, we discussed privacy-preserving data mining.

It is only during the past three years that data mining for security applications has received a lot of attention. Although a lot of progress has been made, there is also a lot of work that needs to be done. First, we need to have a better understanding of the various threats. We need to determine which data-mining techniques are
applicable to which threats. Much research is also needed on link analysis. To develop effective solutions, data-mining specialists have to work with counter-terrorism experts. We also need to motivate the tool vendors to develop tools to handle terrorism.

NOTE

The views and conclusions expressed in this article are those of the author and do not reflect the policies of the MITRE Corporation or of the National Science Foundation.

Thuraisingham, B. (2003). Web data mining technologies and their applications in business intelligence and counter-terrorism. FL: CRC Press.

KEY TERMS

Cyber Security: Techniques used to protect computers and networks from threats.

Data Management: Techniques used to organize, structure, and manage the data, including database management and data administration.
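The "organization above the three agencies" approach described in the link-analysis section of this article can be sketched as merging the agencies' partial graphs and searching the merged graph. The edge lists below, and the resulting G-to-P connection echoing the example in the text, are hypothetical.

```python
from collections import deque

# Hypothetical partial association graphs held by three agencies.
agency_graphs = {
    "A": [("G", "H")],
    "B": [("H", "M")],
    "C": [("M", "P")],
}

def merged_adjacency(graphs):
    """Build one undirected adjacency map from all agencies' edge lists."""
    adj = {}
    for edges in graphs.values():
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

def connected(adj, start, goal):
    """Breadth-first search: is there any chain of associations?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

adj = merged_adjacency(agency_graphs)
print(connected(adj, "G", "P"))   # True, but only in the merged view
```

No single agency's graph contains a G-to-P path; the connection emerges only after the three partial graphs are combined.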
Humanities Data Warehousing
that different database structures may reflect a particular linguistic theory, and also mentioned the trade-off between quality and quantity in terms of coverage.

The choice of data model thus has a profound effect on the problems that can be tackled and the data that can be interrogated. For both historical and linguistic research, relational data modeling using normalization often appears to impose data structures which do not fit naturally with the data and which constrain subsequent analysis. Coping with complicated dating systems can also be very problematic. Surprisingly, similar difficulties have already arisen in the business community, and have been addressed by data warehousing.

MAIN THRUST

Data Warehousing in the Business Context

Data warehouses came into being as a response to the problems caused by large, centralized databases which users found unwieldy to query. Instead, they extracted portions of the databases which they could then control, resulting in the "spider-web" problem where each department produces queries from its own, uncoordinated extract database (Inmon, 2001, pp. 6-14). The need was thus recognized for a single, integrated source of clean data to serve the analytical needs of a company.

A data warehouse can provide answers to a completely different range of queries than those aimed at a traditional database. Using an estate agency as a typical business, the type of question their local databases should be able to answer might be "How many three-bedroomed properties are there in the Botley area up to the value of £150,000?" The type of over-arching question a business analyst (and CEOs) would be interested in might be of the general form "Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain, and how does this correlate to demographic data?" (Begg & Connolly, 2004, p. 1154). To trawl through each local estate agency database and corresponding local county council database, then amalgamate the results into a report, would take a long time and a lot of resources. The data warehouse was created to answer this type of need.

Basic Components of a Data Warehouse

Inmon (2002, p. 31), the father of data warehousing, defined a data warehouse as being subject-oriented, integrated, non-volatile and time-variant. Emphasis is placed on choosing the right subjects to model, as opposed to being constrained to model around applications. Data warehouses do not replace databases as such; they co-exist alongside them in a symbiotic fashion. Databases are needed both to serve the clerical community who answer day-to-day queries such as "What is A.R. Smith's current overdraft?" and also to feed a data warehouse. To do this, snapshots of data are extracted from a database on a regular basis (daily, hourly, and in the case of some mobile phone companies almost real-time). The data is then transformed (cleansed to ensure consistency) and loaded into a data warehouse. In addition, a data warehouse can cope with diverse data sources, including external data in a variety of formats and summarized data from a database. The myriad types of data of different provenance create an exceedingly rich and varied integrated data source, opening up possibilities not available in databases. Thus all the data in a data warehouse is integrated. Crucially, data in a warehouse is not updated; it is only added to, thus making it non-volatile, which has a profound effect on data modeling, as the main function of normalization is to obviate update anomalies. Finally, a data warehouse has a time horizon (that is, contains data over a period) of five to ten years, whereas a database typically holds data that is current for two to three months.

Data Modeling in a Data Warehouse

Dimensional Modeling

There is a fundamental split in the data warehouse community as to whether to construct a data warehouse from scratch, or to build one via data marts. A data mart is essentially a cut-down data warehouse that is restricted to one department or one business process. Inmon (2001, p. 142) recommended building the data warehouse first, then extracting the data from it to fill up several data marts. The data warehouse modeling expert Kimball (2002) advised the incremental building of several data marts that are then carefully integrated into a data warehouse. Whichever way is chosen, the data is normally modeled via dimensional modeling. Dimensional models need to be linked to the company's corporate ERD (Entity Relationship Diagram), as the data is actually taken from this (and other) source(s). Dimensional models are somewhat different from ERDs, the typical star model having a central fact table surrounded by dimension tables. Kimball (2002, pp. 16-18) defined a fact table as "the primary table in a dimensional model where the numerical performance measurements of the business are stored... Since measurement data is overwhelmingly the largest part of any data mart, we avoid duplicating it in multiple places around the enterprise." Thus the fact table contains dynamic numerical data such as sales
quantities and sales and profit figures. It also contains required. It is comparatively easy to extend a data ware-
key data in order to link to the dimension tables. Dimen- house and add material from a new source. The data
sion tables contain the textual descriptors of the busi- cleansing techniques developed for data warehousing
ness process being modeled and their depth and breadth are of interest to researchers, as is the tracking facility
define the analytical usefulness of the data warehouse. afforded by the meta data manager (Begg & Connolly,
As they contain descriptive data, it is assumed they will 2004, p. 1169-1170).
not change at the same rapid rate as the numerical data in In terms of using data warehouses off the shelf,
the fact table that will certainly change every time the some humanities research might fit into the numerical
data warehouse is refreshed. Dimension tables typically have 50-100 attributes (sometimes several hundred), and these are not usually normalized. The data is often hierarchical in the tables and can be an accurate reflection of how data actually appears in its raw state (Kimball, 2002, pp. 19-21). There is no need to normalize, as data is not updated in the data warehouse, although there are variations on the star model, such as the snowflake and starflake models, which allow varying degrees of normalization in some or all of their dimension tables. Coding is disparaged due to the long-term view that definitions may be lost and that the dimension tables should contain the fullest, most comprehensible descriptions possible (Kimball, 2002, p. 49). The restriction of data in the fact table to numerical data has been a hindrance to academic computing. However, Kimball has recently developed factless fact tables (Kimball, 2002), which do not contain measurements, thus opening the door to a much broader spectrum of possible data warehouses.

Applying the Data Warehouse Architecture to Historical and Linguistic Research

…fact topology, but some might not. The factless fact table has been used to create several American university data warehouses, but expertise in this area would not be as widespread as that with normal fact tables. The whole area of data cleansing may perhaps be daunting for humanities researchers (as it is to those in industry). Ensuring vast quantities of data are clean and consistent may be an unattainable goal for humanities researchers without recourse to expensive data cleansing software. The data warehouse technology is far from easy and is based on having existing databases to extract from, hence double the work. It is unlikely that researchers would be taking regular snapshots of their data, as occurs in industry, but they could equate data sets taken at different periods of time to data warehouse snapshots (e.g. the 1841 census, the 1861 census). Whilst many data warehouses use familiar WYSIWYG interfaces and can be queried with SQL-type commands, there is undeniably a huge amount to learn in data warehousing. Nevertheless, there are many areas in both linguistics and historical research where data warehouses may prove attractive.
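To make the factless fact table idea concrete, here is a minimal sketch in Python; the table layout, parish and occupation values are invented for illustration and not drawn from any real census warehouse. Analysis against such a table proceeds by counting rows, since there is no numeric measure column to sum.

```python
# A factless fact table records that an event or state occurred, with no
# numeric measure column: each row is just a bundle of dimension keys.
# All table and attribute names below are hypothetical.

# Dimension tables: surrogate key -> descriptive attributes
place_dim = {1: {"parish": "Winchester"}, 2: {"parish": "Portsmouth"}}
occupation_dim = {1: {"occupation": "weaver"}, 2: {"occupation": "clerk"}}
time_dim = {1: {"census_year": 1841}, 2: {"census_year": 1861}}

# Factless fact table: one row per enumerated person, keys only
census_fact = [
    {"place_key": 1, "occupation_key": 1, "time_key": 1},
    {"place_key": 1, "occupation_key": 2, "time_key": 1},
    {"place_key": 2, "occupation_key": 1, "time_key": 2},
]

def count_people(parish, census_year):
    """With no measures, analysis means counting fact rows after joining
    the dimension tables on their keys."""
    return sum(
        1
        for row in census_fact
        if place_dim[row["place_key"]]["parish"] == parish
        and time_dim[row["time_key"]]["census_year"] == census_year
    )
```

For example, `count_people("Winchester", 1841)` counts every person enumerated in that parish in that census year, which is exactly the kind of query a census data warehouse without numerical facts would need to answer.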
Humanities Data Warehousing
…recent dedicated conference at Portsmouth, November 2003. Teich, Hansen and Fankhauser drew attention to the multi-layered nature of corpora and speculated as to how multi-layer corpora can be maintained, queried and analyzed in an integrated fashion. A data warehouse would be able to cope with this complexity.

Nerbonne (1998) alluded to the importance of co-ordinating the overwhelming amount of work being done and yet to be done. Kretzschmar (2001) delineated the challenge of preservation and display for massive amounts of survey data. There appear to be many linguistics databases containing data from a range of locations/countries: for example, ALAP, the American Linguistic Atlas Project; ANAE, the Atlas of North American English (part of the TELSUR Project); TDS, the Typological Database System containing European data; and AMPER, the Multimedia Atlas of the Romance Languages. Possible research ideas for the future may include a broadening of horizons: instead of the emphasis on individual database projects, there may develop an integrated warehouse approach with the emphasis on larger-scale, collaborative projects. These could compare different languages or contain many different types of linguistic data for a particular language, allowing for analysis at new orders of magnitude.

Data Warehouses and Historical Research

There are inklings of historical research involving data warehousing in Britain and Canada. A data warehouse of current census data is underway at the University of Guelph, Canada, and the Canadian Century Research Infrastructure aims to house census data from the last 100 years in data marts constructed using IBM software at several sites based in universities across the country. At the University of Portsmouth, UK, a historical data warehouse of American mining data is under construction using Oracle Warehouse Builder (Delve, Healey, & Fletcher, 2004). These projects give some idea of the scale of project a data warehouse can cope with, that is, really large country-/state-wide problems. Following these examples, it would be possible to create a data warehouse to analyze all British censuses from 1841 to 1901 (approximately 10^8 bytes of data). Data from a variety of sources over time, such as hearth tax, poor rates, trade directories, census, street directories, wills and inventories, and GIS maps for a city such as Winchester, could go into a city data warehouse. Such a project is under active consideration for Oslo, Norway. Similarly, a Voting data warehouse could contain voting data, poll book data and rate book data up to 1870 for the whole country. A Port data warehouse could contain all data from portbooks for all British ports together with yearly trade figures. Similarly, a Street directories data warehouse would contain data from this rich source for the whole country for the last 100 years. Lastly, a Taxation data warehouse could afford an overview of taxation of different types, areas or periods. 19th-century British census data does not fit into the typical data warehouse model, as it does not have the numerical facts to go into a fact table, but with the advent of factless fact tables a data warehouse could now be made to house this data. The fact that some institutions have Oracle site licenses opens the way for humanities researchers with Oracle databases to use Oracle Warehouse Builder as part of the suite of programs available to them. These are practical project suggestions which would be impossible to construct using relational databases, but which, if achieved, could grant new insights into our history. Comparisons could be made between counties and cities, and much broader analysis would be possible than has previously been the case.

CONCLUSION

The advances made in business data warehousing are directly applicable to many areas of historical and linguistics research. Data warehouse dimensional modeling would allow historians and linguists to model vast amounts of data on a countrywide basis (or larger), incorporating data from existing databases and other external sources. Summary data could also be included, and this would all lead to a data warehouse containing more data than is currently possible; in addition, the data would be richer than in current databases, because normalization is no longer obligatory. Whole data sources could be captured, and more post-hoc analysis would result. Dimension tables particularly lend themselves to hierarchical modeling, so data would not need splitting into many tables, which forces joins while querying. The time dimension particularly lends itself to historical research, where significant difficulties have been encountered in the past. These suggestions for historical and linguistics research will undoubtedly resonate in other areas of humanities research, such as historical geography, and any literary or cultural studies involving textual analysis (for example biographies, literary criticism and dictionary compilation).

REFERENCES

Begg, C., & Connolly, T. (2004). Database systems. Harlow: Addison-Wesley.

Bliss, & Ritter (2001). IRCS (Institute for Research into Cognitive Science) Conference Proceedings. Re-
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Hyperbolic Space for Interactive Visualization
Figure 1. Regular H2 tessellation with equilateral triangles (here 8 triangles meet at each vertex). Three snapshots of a simple focus transfer are visible. Note the circular appearance of lines in the Poincaré disk PD and the fish-eye-lens effect: the triangles in the center appear larger, while those in regions further away take less space.
Solution for P2: Poincaré Disk PD

For practical and technological reasons most available displays are flat. The perfect projection into the flat display area should preserve length, area, and angles (= form). But it lies in the nature of a curved space to resist the attempt to achieve these goals simultaneously. Consequently, several projections or maps of the hyperbolic space were developed; four are especially well examined: (i) the Minkowski, (ii) the upper-half plane, (iii) the Klein-Beltrami, and (iv) the Poincaré or disk mapping. For our purpose the latter is particularly suitable. Its main characteristics are:

Display Compatibility: The infinitely large area of the H2 is mapped entirely into a circle, the Poincaré disk PD.

Circle Rim is Infinity: All remote points are close to the rim, without touching it.

Focus+Context: The focus can be moved to each location in H2, like a fovea. The zooming factor is 0.5 in the center and falls off (exponentially) with distance to the fovea. Therefore, the context appears very natural: the more remote things are, the less spatial representation they are assigned in the current display (compare Figure 1).

Lines Become Circles: All H2-lines appear as circle arc segments or centered straight lines in PD (both belong to the set of so-called generalized circles). Their extensions always cross the PD-rim perpendicularly on both ends.

Conformal Mapping: Angle (and therefore form) relations are preserved in PD; area and length relations obviously are not.

Regular Tessellations: Tessellations with triangles offer richer possibilities than in R2. It turns out that there is an infinite set of choices to tessellate H2: for any integer n ≥ 7, one can construct a regular tessellation in which n triangles meet at each vertex (in contrast to the plane, which allows only n = 3, 4, 6, and the sphere only n = 3, 4, 5). Figure 1 depicts an example for n = 8.

Moving Around and Changing the Focus: For changing the focus point in PD we need a translation operation, which can be bound to mouse click and drag events. In the Poincaré disk model the Möbius transformation T(z) is the appropriate solution. By describing the Poincaré disk PD as the unit circle in the complex plane, the isometric transformations for a point z ∈ PD can be written as

z' = T(z; c, θ) = (θz + c) / (c*θz + 1), with |θ| = 1, |c| < 1.   (2)

Here the complex number θ describes a pure rotation of PD around the origin 0 (the star * denotes complex conjugation). The following translation by c maps the origin to c, and -c becomes the new center 0 (if θ = 1). The Möbius transformations are also called the circle automorphisms of the complex plane, since they describe the transformations from circles to (generalized) circles. Here they serve to translate H2 straight lines to lines, both appearing as generalized circles in the PD projection. For further details see, for example, Lamping & Rao (1999) or Walter (2004).

Three Layout Techniques in H2

Now we turn to the question raised earlier: how to accommodate data in the hyperbolic space. In the following section three known layout techniques for the H2 are shortly reviewed.

Hyperbolic Tree Layout (HTL) for Tree-Like Graph Data

A first solution to this question for the case of acyclic, tree-like graph data in H2 was provided by Lamping & Rao (1994, 1999). Each tree node receives a certain open space (a pie segment), where the node chooses the locations of its siblings. For all its siblings i it calls the layout routine recursively after applying the Möbius transformation in order to center i.

Tamara Munzner (1997) developed another graph layout algorithm for the three-dimensional hyperbolic space. While it gains much more space for the layout, the problem of more complex navigation (and viewport control) in 3-D and, more seriously, the problem of occlusion appear.

The next two layout techniques are freed from the requirement of hierarchical data.

Hyperbolic Self-Organizing Map (HSOM)

The standard Self-Organizing Map (SOM) algorithm is used in many applications for learning and visualization (Kohonen, 2001). Figure 2 illustrates the basic operation. The feature map is built by a lattice of nodes (or formal neurons) a ∈ A, each with a reference vector or prototype vector w_a attached, projecting into the input space X. The response of a SOM to an input vector x is determined by the reference vector w_a of the discrete best-matching unit (BMU) a_BMU, that is, the node which has its prototype vector w_a closest to the given input:

a_BMU = argmin_{a ∈ A} |w_a - x|.

The distribution of the reference vectors w_a is iteratively adapted by a sequence of training vectors x. After finding the a_BMU, all reference vectors are updated towards the stimulus x:

w_a^new := w_a^old + h(d_{a,aBMU}) (x - w_a).

Here h(.) is a bell-shaped Gaussian centered at the BMU and decaying with increasing distance d_{a,aBMU} = |g_a - g_{aBMU}| between the node positions g_a. Thus, each node or neuron in the neighborhood of the a_BMU participates in the current learning step (as indicated by the gray shading in Figure 2).

This neighborhood cooperation in the adaptation algorithm has important advantages: (i) it is able to generate topological order between the w_a, which means that similar inputs are mapped to neighboring nodes; (ii) as a result, the convergence of the algorithm can be sped up by involving a whole group of neighboring neurons in each learning step.

The structure of this neighborhood is essentially governed by the structure of h(a, a_BMU) = h(d_{a,aBMU}), therefore also called the neighborhood function. While most learning and visualization applications choose d_{a,aBMU} as distances in a rectangular (2-D, 3-D) lattice, this can be generalized to the non-Euclidean case, as suggested by Ritter (1999). The core idea of the Hyperbolic Self-Organizing Map (HSOM) is to employ an H2-grid of nodes. A particularly convenient choice is to take the g_a in PD of a finite patch of the triangular tessellation grid as displayed in Figure 2. The internode distance is computed in the appropriate Poincaré metric (see equation 4).

Hyperbolic Multidimensional Scaling (HMDS)

…minimization problem of a cost function which sums over the squares of disparity-distance misfits:

E({x_i}) = Σ_{i=1..N} Σ_{j>i} w_ij (d_ij - D_ij)².   (3)

The factors w_ij are introduced to weight the disparities individually and also to normalize the cost function E(.) to be independent of the absolute scale of the disparities D_ij. The set of x_i is found by a gradient descent procedure, iteratively minimizing the cost or stress function [see Sammon (1969) or Cox (1994) for further details on this and other MDS algorithms].

The recently introduced Hyperbolic Multi-Dimensional Scaling (HMDS) combines the concept of MDS and hyperbolic geometry (Walter & Ritter, 2002). Instead of finding an MDS solution in the low-dimensional Euclidean R^M and transferring it to the H2 (which cannot work well), the MDS formalism operates in the hyperbolic space from the beginning. The key is to replace the Euclidean distance in the target space by the appropriate distance metric for the Poincaré model:

d_ij = 2 arctanh( |x_i - x_j| / |1 - x_i x_j*| ), with x_i, x_j ∈ PD.   (4)

While the gradients ∂d_ij/∂x_{i,q} required for the gradient descent are rather simple to compute for the Euclidean geometry, the case becomes complex for HMDS; see Walter (2002, 2004) for details.

Disparity Preprocessing: Due to the non-linearity of the distance function above, the preprocessing function D(.) has more influence in H2. Consider, for example, linear rescaling of the dissimilarities D_ij = α δ_ij. In the Euclidean case the visual structure is not affected, only magnified by α. In contrast, in H2, α scales the distribution and with it the amount of curvature felt by the data. The optimal α depends on the given task, the dataset, and its dissimilarity structure. One way is to set α manually and choose a compromise between visibility of the entire structure and space for navigation in the detail-rich areas.

Comparison: The following table compares the main properties of the three available layout techniques. All three techniques share the concept of spatial proximity representing similarities in the data. In HTL close relatives are directly connected. Skupin (2002) pointed out that we humans learn the usage and the versatile concepts of maps early, and suggested building displays implementing this underlying map metaphor. The HSOM can handle many objects and can generate topic maps, while the HMDS is ideal for smaller sets, presented on the object level (see Figures 4 & 5 below).

Application Examples

Even though the look and feel of an interactive visualization and navigation is hardly compressible to paper format, we present some application screenshots of visualization experiments. Figure 3 displays an application of browsing image collections with an HMDS. Here direct image features based on color are employed to define the concept of similarity.

Figures 4 and 5 depict the hybrid combination of HSOM and HMDS for an application from the field of text mining of newsfeed articles. The similarity concept is based on semantic information gained via a standard bag-of-words model of the unstructured text, here Reuters news articles [see Walter et al. (2003) for further details].

FUTURE TRENDS

A closer look at the comparative table above suggests a hybrid architecture of techniques. This includes the combination of HSOM and HMDS for a two-stage navigation and retrieval process. This embraces a coarse-grain theme map (e.g., with the HSOM) and a detailed map using
Table 1. Comparison of the main properties of the three layout techniques (HTL, HSOM, and HMDS)
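As an aside, the Möbius translation of equation (2) and the Poincaré distance of equation (4) are easy to experiment with using complex arithmetic. The sketch below (sample points and parameters chosen arbitrarily for illustration) checks that the translation maps the origin to c, maps -c to the new center, and leaves the Poincaré distance unchanged, as an isometry must.

```python
# Sketch of the Möbius translation (2) and the Poincaré distance (4).
# The sample points and the translation parameter are arbitrary values.
import math

def mobius(z, c, theta=1 + 0j):
    """Equation (2): z' = (theta*z + c) / (conj(c)*theta*z + 1),
    with |theta| = 1 and |c| < 1."""
    return (theta * z + c) / (c.conjugate() * theta * z + 1)

def poincare_dist(zi, zj):
    """Equation (4): d_ij = 2 * artanh(|zi - zj| / |1 - zi * conj(zj)|)."""
    return 2 * math.atanh(abs(zi - zj) / abs(1 - zi * zj.conjugate()))

c = 0.4 + 0.2j                      # translation target, |c| < 1
z1, z2 = 0.1 + 0.3j, -0.5 + 0.1j    # two points inside the unit disk

assert abs(mobius(0j, c) - c) < 1e-9    # the origin maps to c
assert abs(mobius(-c, c)) < 1e-9        # -c becomes the new center (theta = 1)

# The translation is an isometry of the Poincaré disk: distances survive
d_before = poincare_dist(z1, z2)
d_after = poincare_dist(mobius(z1, c), mobius(z2, c))
assert abs(d_before - d_after) < 1e-9
```

Binding `mobius` to mouse drag events, with c derived from the drag vector, is exactly the focus transfer described under "Moving Around and Changing the Focus".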
Figure 3. Snapshot of a Hyperbolic Image Viewer using a color distance metric for pairwise dissimilarity definition for 100 pictures. Note that the dynamic zooming of images is dependent on the distance to the current focus point.

Figure 4. A HSOM projection of a large collection of newswire articles (Reuters-21578) forms semantically related category clusters, as seen by the glyphs indicating the dominant out-of-band category label of the news objects gathered by the HSOM in each node. The similarity of objects is here derived from the angle of the feature vectors in the bag-of-words or vector space model of unstructured text [the standard model in information retrieval, after Salton (1988)].
Figure 5. Screenshot of the HMDS visualizing all documents in the node previously selected in the HSOM (direction 5 o'clock). The news headlines were enabled for the article cluster now in focus. They reveal that the object group is semantically close and all related to a strike in the oilseed company Cargill U.K. (Figure legend: H2-MDS; category labels Earn, Acquisition, Money-FX, Crude, Grain, Trade, Interest.)

HMDS, as suggested in the Hyperbolic Hybrid Data Viewer prototype in Walter et al. (2003).

The HMDS allows one to support different notions of similarity and furthermore to dynamically modulate the actual distance metric while observing the change in the spatial arrangement of the objects. This feature is valuable in various multimedia applications with various natural and task-dependent notions of similarity. For example, when browsing a collection of images, the similarity of objects can be based on textual description, metadata, or image features, for example, using color, shape, and texture. Due to the general data
(Risden et al., 2000; Pirolli et al., 2001). By simple mouse interaction the focus can be transferred to any location of interest. The core area close to the center of the Poincaré disk magnifies the data with a maximal zoom factor, which decreases gradually towards the outer area. The object placeholder (text box, image thumbnail, etc.) is scaled in proportion. The fovea is an area with high resolution, while remote areas are gradually compressed but are still visible as context. Interestingly, this situation resembles the log-polar density distribution of neurons in the retina, which governs the natural resolution allocation in our visual perception system.

REFERENCES

Cox, T., & Cox, M. (1994). Multidimensional scaling. In Monographs on statistics and applied probability. Chapman & Hall.

Kohonen, T. (2001). Self-organizing maps. Berlin: Springer.

Lamping, J., & Rao, R. (1994). Laying out and visualizing large trees using a hyperbolic space. In ACM Symposium on User Interface Software and Technology (pp. 13-14).

Lamping, J., & Rao, R. (1999). The hyperbolic browser: A Focus+Context technique for visualizing large hierarchies. In Readings in Information Visualization (pp. 382-408). Morgan Kaufmann.

Lamping, J., Rao, R., & Pirolli, P. (1995). A focus+context technique based on hyperbolic geometry for viewing large hierarchies. In ACM SIGCHI Conference on Human Factors in Computing Systems (pp. 401-408).

Morgan, F. (1993). Riemannian geometry: A beginner's guide. Jones and Bartlett Publishers.

Munzner, T. (1997). H3: Laying out large directed graphs in 3D hyperbolic space. In Proceedings of the IEEE Symposium on Information Visualization (pp. 2-10).

Pirolli, P., Card, S.K., & Van Der Wege (2001). Visual information foraging in a focus + context visualization. In CHI (pp. 506-513).

Risden, K., Czerwinski, M., Munzner, T., & Cook, D. (2000). An initial examination of ease of use for 2D and 3D information visualizations of Web content. International Journal of Human Computer Studies, 53(5), 695-714.

Ritter, H. (1999). Self-organizing maps on non-Euclidean spaces. In Kohonen Maps (pp. 97-110). Elsevier.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.

Sammon, J. (1969). A non-linear mapping for data structure analysis. IEEE Transactions on Computers, 18, 401-409.

Skupin, A. (2002). A cartographic approach to visualizing conference abstracts. IEEE Computer Graphics and Applications (pp. 50-58).

Walter, J. (2004). H-MDS: A new approach for interactive visualization with multidimensional scaling in the hyperbolic space. Information Systems, 29(4), 273-292.

Walter, J., Ontrup, J., Wessling, D., & Ritter, H. (2003). Interactive visualization and navigation in large data collections using the hyperbolic space. In IEEE International Conference on Data Mining (ICDM'03) (pp. 355-362).

Walter, J., & Ritter, H. (2002). On interactive visualization of high-dimensional data using the hyperbolic plane. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 123-131).

KEY TERMS

Focus+Context Technique: Allows one to interactively transfer the focus as desired, while the context of the region in focus remains in view with gradually degrading resolution towards the rim. In contrast, the standard zoom+panning technique allows one to select the magnification factor, trading detail against overview and involving harsh view cut-offs.

H2: Two-dimensional hyperbolic space (plane).

H2DV: The Hybrid Hyperbolic Data Viewer incorporates two-stage browsing and navigation, using the HSOM for a coarse thematic mapping of large object collections and the HMDS for detailed inspection of smaller subsets on the object level.

HMDS: Hyperbolic Multi-Dimensional Scaling for laying out objects in the H2 such that the spatial arrangement resembles the dissimilarity structure of the data as closely as possible.

HSOM: Hyperbolic Self-Organizing Map. Extension of Kohonen's topographic map, which offers exponential growth of the neighborhood for the inner nodes.

Hyperbolic Geometry: Geometry with constant negative curvature, in contrast to the flat Euclidean geometry.
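As an illustration of the update rule reviewed in the HSOM section, the following minimal Python sketch trains a plain SOM on a flat rectangular lattice. The grid size, training data, learning rate, and Gaussian width are invented for illustration; a full HSOM would instead place the node positions g_a on an H2 tessellation and measure internode distances in the Poincaré metric of equation (4).

```python
# Minimal SOM training step following the rule from the HSOM section:
#   w_a := w_a + h(d(a, a_BMU)) * (x - w_a)
# Toy values throughout; the lattice here is Euclidean, not hyperbolic.
import math
import random

random.seed(1)
grid = [(i, j) for i in range(5) for j in range(5)]   # node positions g_a
weights = {g: [random.random(), random.random()] for g in grid}

def train_step(x, sigma=1.0, rate=0.5):
    # Best-matching unit: the node whose prototype is closest to input x
    bmu = min(grid, key=lambda g: sum((w - xi) ** 2
                                      for w, xi in zip(weights[g], x)))
    for g in grid:
        d = math.dist(g, bmu)                    # lattice distance d(a, a_BMU)
        h = math.exp(-d * d / (2 * sigma ** 2))  # Gaussian neighborhood fn
        weights[g] = [w + rate * h * (xi - w)
                      for w, xi in zip(weights[g], x)]
    return bmu

for _ in range(100):
    train_step([random.random(), random.random()])
```

Because every node in the neighborhood of the BMU moves towards the stimulus, the prototypes acquire the topological order described in the article: neighboring nodes end up responding to similar inputs.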
Olga Georgieva
Institute of Control and System Research, Bulgaria
Identifying Single Clusters in Large Data Sets
ally close enough to the good clusters will get correctly classified into a good cluster, while the points that are away from the good clusters will get classified into the noise cluster. The detection of noise and outliers is a serious problem and has been addressed in various approaches (Leski, 2003; Yang, 2004; Zhang, 2003). However, all these methods need complex additional computations.

MAIN THRUST

The purpose of most clustering algorithms is to partition a given data set into clusters. However, in data mining tasks partitioning is not always the main goal. Finding interesting patterns or substructures in a large data set in the form of one or a few clusters that do not cover or partition the data set completely is an important issue in data mining. For this purpose a new clustering algorithm named noise clustering with one good cluster, based on the noise clustering technique and able to detect single clusters step by step in a given data set, has been recently developed (Georgieva, 2004). In addition to identifying clusters step by step, noise data are detected automatically as a side effect.

The algorithm assesses the dynamics of the number of points that are assigned to only one good cluster of the data set by slightly decreasing the noise distance. Starting with some large enough noise distance, it is decreased by a prescribed decrement till the reasonable smallest distance is reached. The number of data points belonging to the good cluster is calculated for every noise distance using the formula for the hard membership values or fuzzy membership values, respectively. Note that in this scheme only one cluster centre has to be computed, which in the case of hard noise clustering is the mean value of the good cluster data points, and in the case of fuzzy noise clustering is the weighted average of all data points. It is obvious that by decreasing the noise distance, a process of losing data, that is, better separating them to the noise cluster, will begin. Continuing to decrease the noise distance, we will start to separate points from the good cluster and add them to the noise cluster. A further reduction of the noise distance will lead to a decreasing amount of data in the good cluster, until the cluster is entirely empty, as all data will be assigned to the noise cluster. The described dynamics can be illustrated in a curve viewing the number of data points assigned to the good cluster over the noise distance. In this curve a plateau will indicate that we are in a phase of assigning proper noise data to the noise cluster, whereas a strong slope means that we actually lose data belonging to the good cluster to the noise cluster.

Generally, a number of clusters with different shapes and densities exist in large data sets, and thus a complicated dynamics of the data assigned to the single cluster will be observed. However, the smooth part of the considered curve corresponds to the situation where a relatively small amount of data is removed, which is usually caused by losing noise data. When we lose data from a good cluster (with higher density than the noise data), a small decrease of the noise distance will lead to a large amount of data lost to the noise cluster, so that we will see a strong slope in our curve instead of a plateau. Thus, a strong slope indicates that at least one cluster has just been removed and separated from the (single) good cluster we try to find. In this way the algorithm determines the number of clusters and detects noise data. It does not depend on the initialisation, so that the danger of converging into local optima is further reduced compared to standard fuzzy clustering (Höppner, 2003).

The described procedure is implemented in a cluster identification algorithm that assesses the dynamics of the quantity of the points assigned to the good cluster, or equivalently assigned to the noise cluster, through the slight decrease of the noise distance. By detecting the strong slopes the algorithm separates one cluster at every algorithm pass. A significant reduction of the noise is achieved even in the first algorithm pass. The clustering procedure is repeated by proceeding with a smaller data set, as the original one is reduced by the identified noise data and the data belonging to the already identified cluster(s).

The curve that is determined by the dynamics of the data assigned to the noise cluster is smoother in the case of fuzzy noise clustering compared to the hard clustering case, due to the fuzzily defined membership values. Also, local minima of this curve could be observed due to the given freedom of the points to belong to both the good and the noise cluster simultaneously. Fuzzy clustering can deal with more complex data sets than hard clustering due to the given relative degree of membership of a point to the good cluster. However, for the same reason the amount of the identified noise points is less than in the hard clustering case.

FUTURE TRENDS

Whereas the standard clustering partitions the whole data set, the main goal of the noise clustering with one good cluster is to identify single clusters even in the case when a large part of the data does not have any kind of group structure at all. This will have a large benefit in some application areas of cluster analysis like, for
instance, gene expression data and astrophysics data analysis, where the ultimate goal is not to partition the data, but to find some well-defined clusters that only cover a small fraction of the data. By removing a large amount of the noise data, the obtained clusters are used to find some interesting substructures in the data set.

One future improvement will lead to an extended procedure that first finds the location of the centres of the interesting clusters using the standard Euclidean distance. Then, the algorithm could be started again with these cluster centres, but using more sophisticated distance measures, such as GK (Gustafson-Kessel) or a volume adaptation strategy, that can better adapt to the shape of the identified cluster. For extremely large data sets the algorithm can be combined with speed-up techniques such as the one proposed by Höppner (2002).

Another further application consists in the incorporation of the proposed algorithm as an initialisation step for other, more sophisticated clustering algorithms, providing the number of clusters and their approximate location.

REFERENCES

Gath, I., & Geva, A.B. (1989). Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7, 773-781.

Georgieva, O., & Klawonn, F. (2004). A clustering algorithm for identification of single clusters in large data sets. In Proceedings of the East West Fuzzy Colloquium, 81 (pp. 118-125), Sept. 8-10, Zittau, Germany.

Guo, P., Chen, C.L.P., & Lyu, M.R. (2002). Cluster number selection for a small set of samples using the Bayesian Ying-Yang model. IEEE Transactions on Neural Networks, 13, 757-763.

Gustafson, D., & Kessel, W. (1979). Fuzzy clustering with a fuzzy covariance matrix. In Advances in fuzzy set theory and applications (pp. 605-620). North-Holland.

Höppner, F. (2002). Speeding up fuzzy c-means: Using a hierarchical data organisation to control the precision of membership calculation. Fuzzy Sets and Systems, 12(3), 365-378.
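The noise-distance sweep described in the main thrust can be sketched as follows. This is an illustrative toy version (one-dimensional data, hard memberships, invented parameters), not the authors' implementation: the noise distance is decreased by a fixed decrement, the good-cluster count is recorded at each step, and large drops between consecutive counts mark where a dense cluster separates.

```python
# Toy sketch of noise clustering with one good cluster: sweep the noise
# distance downwards and watch how many points remain in the good cluster.
# A plateau means only noise is being shed; a steep drop means a dense
# cluster was just cut off. All data and parameters are invented.
import random

random.seed(0)
# One dense cluster around 0 plus uniform background noise
data = [random.gauss(0.0, 0.3) for _ in range(200)] + \
       [random.uniform(-10, 10) for _ in range(40)]

def good_cluster_size(points, noise_dist, iters=20):
    """Hard noise clustering with one good cluster: a point belongs to the
    good cluster if it lies closer to the centre than the noise distance;
    the centre is the mean of the current good-cluster members."""
    centre = sum(points) / len(points)
    members = points
    for _ in range(iters):
        members = [p for p in points if abs(p - centre) < noise_dist]
        if not members:
            return 0, centre
        centre = sum(members) / len(members)
    return len(members), centre

# Decrease the noise distance from 10.0 to 0.5 with a fixed decrement
curve = [(d / 10, good_cluster_size(data, d / 10)[0])
         for d in range(100, 0, -5)]
# Drops between consecutive counts: large values flag a cluster boundary
drops = [(d, prev - n) for (d, n), (_, prev)
         in zip(curve[1:], curve[:-1])]
```

Plotting `curve` reproduces the plateau/slope picture from the article; repeating the sweep on the data minus the identified cluster and noise corresponds to the next algorithm pass.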
Henk Koppelaar
Delft University of Technology, The Netherlands
Ronald Hamers
Erasmus Medical Thorax Center, The Netherlands
Nico Bruining
Erasmus Medical Thorax Center, The Netherlands
Immersive Image Mining in Cardiology
A: IVUS catheter/endosonics and the wire. B: angiography of a contrast-filled section of coronary artery. C: two IVUS cross-sectional images and some quantitative measurements on them. D: Virtual L-View of the vessel reconstruction. E: Virtual 3D-Impression of the vessel reconstruction.
…sition, and spatial orientation of the slices on the coronary vessel. For example, it has been reported that more than 5 mm of longitudinal catheter motion relative to the vessel may occur during one cardiac cycle (Winter et al., 2004), when the catheter was pulled back at 0.5 mm/sec and the non-gated samples were stored on S-VHS videotape at a rate of 25 images/sec. Figure 2 explains the longitudinal displacement caused by cardiac cycles during a camera pullback in a segment of coronary artery. The catheter position equals the sum of the pullback distance and the longitudinal catheter displacement. In F, the absolute catheter positions of the solid dots are in disorder, which will cause a disordered sequence of camera images. The consecutive image samples selected in relation to the positions of the catheter relative to the coronary vessel wall are highlighted in G. In conclusion, these samples used for analysis are anatomically dispersed in space (III, I, V, II, IV, and VI).
The data of IVUS sources are voluminous separated slices with artifacts, and the quantitative measurements and the physicians' interpretations are also essential components to remodel the vessel, but the accompanying mathematical models are poorly characterized. Mining these datasets is therefore difficult, and it inspires interdisciplinary work in medicine, physics and computer science.

MAIN THRUST

Immersive Image Mining Processes in Cardiology IVUS

IVUS image mining includes a series of complicated mining procedures due to the complex immersive data: data reconciliation, image mining, remodeling and VR display, and knowledge discovery. Figure 3 shows the function-driven processes.

[Figure 3. The function-driven processes of IVUS image mining: immersive image acquisition and non-invasive data acquisition feed data reconciliation; cardiac computation yields quantitative measurements (border identification, lumen measurements, calcium measurements), vessel remodeling and visual display (L-Mode, 3D reconstruction, picture in picture) and qualitative assessments (atheroma morphology, unstable lesions, ruptured plaque, complications after intervention); cardiac knowledge and expert knowledge support knowledge discovery.]

The data reconciliation compensates for or decreases artifacts to improve the data, usually fused with non-invasive inspection cardiology data such as angiography and ECG. This data mining extracts features in the individual dataset to use for data reconciliation.

Cardiac computation is performed on the reconciled individual dataset. Quantitative measurement calculates features such as lumen, atheroma, calcium and stent on every slice. Vessel remodeling based on these measurements forms a straight 3D volume, and the fusion with the pullback path determined from angiography yields an accurate spatio-temporal (4D) model of the coronary vasculature. This 4D model is used for further cardiac computation to gain VR display and volume measurements on the vessel. IVUS images are fundamentally different from histology and cannot be used to detect and quantify specific histologic contents directly. But based on quantitative measurements and VR display, combined with cardiac knowledge, qualitative assessments such as atheroma morphology, unstable lesions and complications after intervention may be gained semi-automatically or manually by the physicians.

These quantitative measurements and qualitative assessments may be organized and stored in a database or data warehouse. Statistics-based mining on colony data and mining on individual history data lead to knowledge discovery of heart diseases.

Individual Data Mining to Reconcile the Data

Data reconciliation is the basis of, and the most important step toward, cardiac computation, and more effective methods are expected to attack the data artifacts. Three types of methods are applied: parsimonious data acquisition, data fusion of invasive and non-invasive data, and hybrid motion compensation.

Parsimonious Phase-Dependent Data Acquisition: Cardiac knowledge dictates systolic/diastolic timing features. It has been suggested to select IVUS images recorded in the end-diastolic phase of the cardiac cycle, in which the heart is motionless and
blood flow has ceased, so that their influences on the catheter can be neglected. Online ECG-gated pullback has been used to acquire phase-dependent data, but the technology is expensive and prolongs the acquisition procedure. Instead, a retrospective image-based gating mining method has been studied (Winter et al., 2004; Zhu, Oakeson, & Friedman, 2003). In this method, different features are mined from sampled IVUS images over time by transforming the data with spectral analysis, to discover the most prominent repetition frequencies of appearance of these image features. From this mining, the images near the end-diastolic phases can be inferred. The selection of images is parsimonious: only about 5% of the dataset are selected, wherein about 10% of the selections are mispositioned.

Data Fusion: The motion vessel courses provide helpful information to identify the space and time of the IVUS camera in the coronary of the moving heart. Fusion of complementary information from two or more differing modalities enhances insights into the underlying anatomy and physiology. Combining non-invasive data mining techniques to battle measurement errors is a preferred method. The positioning of the camera could be remedied if the outer form of a vessel is available from angiograms as a path-road for the camera. Fusing the route and IVUS data, a simulator generates a VR reconstruction of the vessel (Wahle, Mitchell, Olszewski, Long, & Sonka, 2000; Sarry & Boire, 2001; Ding & Friedman, 2000; Rotger, Radeva, Mauri, & Fernandez-Nofrerias, 2002; Weichert, Wawro, & Wilke, 2004). This should help to detect the absolute catheter spatial positions and orientations, but usually the routes are static and the data are parsimonious, phase-dependent, or without exhibiting the distortion. There are few papers on the consecutive motion tracking of the coronary tree (Chen & Carroll, 2003; Shechter, Devernay, Coste-Maniere, Quyyumi, & McVeigh, 2003), but they omitted the stretch of the arteries, which is an important property for accurate positioning analyses.

Hybrid Motion Compensation: Physical prior knowledge would predict the space and time of the global position of the IVUS camera, if it were fully modeled. The many mechanical and electrical mechanisms in a heart make a full model intractable, but if most of its mechanisms could be dispensed with, the prediction model would become considerably simplified. A hybrid approach could solve the problem if a full Courant-type model were available, coined by the term Computational Cardiology (Cipra, 1999). An effective and pragmatic model is still futuristic. Timinger reconstructed the catheter position on a 3D roadmap by a motion compensation algorithm based on an affine model for compensating the respiratory motion and an ECG gating method for the catheter positions acquired using a magnetic tracking system (Timinger, Krueger, Borgert, & Grewer, 2004). Fusing the empirical features of the coronary longitudinal movement with a motion compensation model is a novel way to resolve the longitudinal distortion of the IVUS dataset (Liu, Koppelaar, Koffijberg, Bruining, & Hamers, 2004).

Cardiac Computation to Quantitative and Qualitative Analysis

Cardiac computation aims at quantitative measurements, remodeling and qualitative assessments on the vessel and the lesion zones. The technologies of border detection in image processing and pattern recognition of the vessel layers are important in the process of cardiac calculation (Koning, Dijkstra, von Birgelen, Tuinenburg, et al., 2002). For qualitative assessments, the expert knowledge of physicians must be fused into the reasoning to get the assessments.

Quantitative Measurements

Cross-Sectional Slice Calculation: The normal coronary artery consists of the lumen surrounded by the intima, media, and adventitia of the vessel wall (Halligan & Higano, 2003). The innermost layer consists of a complex of three elements: intima, atheroma (diseased arteries), and internal elastic membrane. After gaining the vessel layers and their attributes through edge identification and pattern recognition in image processing, every slice is calculated separately, and measurements such as lumen area, EEM (external elastic membrane) area, and maximum atheroma thickness can be reported in detail.

Vessel Remodeling and Virtual Display: Based on the above calculation results, the vessel layers, including plaques, can be remodeled and visualized in longitude. VR display is an important way to assist in inspecting the artery and the distribution of plaque and lesions, and to help navigate in guided surgery facilities for minimally invasive surgery. For example, the L-Mode displays sets of slices taken from a single cut plane in longitude (Figure 1, panel D); the 3D reconstruction displays a shaded or wire-frame image of the vessel to give an entire view (Figure 1, panel E).
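The retrospective image-based gating idea described under data reconciliation can be sketched as follows. This is a toy version under stated assumptions: the per-frame image feature is a single scalar (real methods mine several features per frame), the heart-rate search band is assumed, and the local minimum of the feature signal is used merely as a stand-in for the end-diastolic phase.

```python
import numpy as np

def dominant_period(feature, fps):
    """Find the most prominent repetition frequency of a per-frame image
    feature via spectral analysis and return the period in frames."""
    x = np.asarray(feature, dtype=float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    # Restrict the peak search to a plausible heart-rate band (40-180 bpm).
    band = (freqs >= 40 / 60.0) & (freqs <= 180 / 60.0)
    f_heart = freqs[band][np.argmax(spectrum[band])]
    return int(round(fps / f_heart))

def gate_frames(feature, fps):
    """Keep roughly one frame per cardiac cycle: the local minimum of the
    feature inside each period-long window (stand-in for end-diastole)."""
    period = dominant_period(feature, fps)
    x = np.asarray(feature, dtype=float)
    return [i + int(np.argmin(x[i:i + period]))
            for i in range(0, len(x) - period + 1, period)]

# Synthetic stream: 25 frames/sec (as in the text) with an assumed
# 75 beats/min cardiac modulation plus a little noise.
fps, f = 25.0, 75 / 60.0
t = np.arange(500) / fps
signal = np.sin(2 * np.pi * f * t) + 0.05 * np.random.default_rng(0).normal(size=t.size)
kept = gate_frames(signal, fps)
# One frame per beat out of a 25 fps stream keeps 1 of every 20 frames,
# i.e. roughly the 5% selection rate noted in the text.
```

The spectral peak recovers the cardiac period without any ECG signal, which is the point of the retrospective method: gating is mined from the images themselves after acquisition.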
Derived Measurements: Calculating on the virtual remodeled vessel, derived measurements can be obtained, such as hemodynamics, length, and volumes of the vessel and of special lesion zones.

Qualitative Assessments: IVUS images are fundamentally different from histology and cannot be used to detect and quantify specific histologic contents directly. However, based on quantitative measurements and a virtual model, as well as combined cardiac knowledge, qualitative assessments such as atheroma morphology, unstable lesions and complications after intervention may be gained semi-automatically or manually to serve physicians.

Knowledge Discovery in Cardiology IVUS

Once a large amount of quantitative and qualitative features are in hand, mining on these data can reveal knowledge about the forming and regression of some heart diseases, and can simulate the different stages of the disease as well as assess surgical treatment procedures and risk levels.

Some papers have reported on medical knowledge discovery in cardiology using data mining. Pressure calculation using fluid dynamic equations on 3D IVUS volumetric measurements predicts the physiologic severity of coronary lesions (Takayama & Hodgson, 2001). Artificial intelligence methods (a structural description, syntactic reasoning and pattern recognition method) are applied to angiography to recognize stenosis of the coronary artery lumen (Ogiela & Tadeusiewicz, 2002). Logical Analysis of Data, a method based on combinatorics, optimization and the theory of Boolean functions, is used on a 21-variable dataset to predict coronary risk, but this dataset does not include any medical image information (Alexe, 2003). Mining in single proton emission computed tomography images, accompanied by clinical information and physician interpretation, uses inductive machine learning and heuristic approaches to mimic cardiologists' diagnoses (Kurgan, Cios, Tadeusiewicz, Ogiela, & Goodenday, 2001).

Immersive data mining, or fusing the mined data with other cardiac data, is a challenge for improving medical knowledge in cardiology. Mining will contribute to some difficult cardiology applications, for example: interventional procedures and healthcare; mining coronary vessel movement anomalies; highlighting local abnormal cellular growth; accurate heart and virtual vessel reconstruction; adjusting ferro-electric materials to monitor heart movement anomalies; prophylactic patient monitoring, etc. Some technicalities should be considered in this mining field.

Volume and complexity in data: First, IVUS data have a large volume in one acquisition procedure for one person. Second, tracing a patient usually needs a long time: more than several years. Third, the data from patients may be incomplete. Finally, immersive data acquisition is a complicated procedure depending on body condition, the immersive mechanical system, and even the operating procedure. The immersive complexity may even bias the historical data from the same heart.

Data standards and semantics: IVUS images, and especially the qualitative assessments, need a consistent standard and semantics to support mining. IVUS documents (Mintz et al., 2001) and DICOM standards may be the base to follow. It is necessary to consider the importance of relative parameters, spatial positions, and multiple interpretations of image quantitative measurements, which also need to be addressed in the IVUS standards.

Data fusion: Since heart diseases are complicated and IVUS is one of the preferred diagnostic tools, mining IVUS needs to consider other clinical information, including physicians' interpretations, at the same time.

FUTURE TRENDS

Immersive image mining in cardiology is a new challenge for medical informatics. Mining within our bodies thus inspires physicians toward interdisciplinary work in medicine, physics and computer science, which will improve the monitoring of heart data toward many deep applications serving clinical needs in diagnostics, therapies, safety levels, cost and risk effectiveness. For instance, in due course of time nanotechnology will mature to the degree of immersive medical mining equipment: physicians will directly control medication by instruction, via mobile communication with a transducer inside the human body.

CONCLUSION

Over the last several years, IVUS has developed into an important clinical tool in the assessment of atherosclerosis. The limitations of data artifacts, and the difficulty of discerning between entities with similar echodensities (Halligan & Higano, 2003), are waiting for better solutions. Hospitals have stored a large amount of immersive data, which may be mined for effective application in cardiac knowledge discovery.

Data mining technologies play crucial roles in the whole procedure of IVUS application: from data acquisi-
tion, data reconciliation, image processing, vessel remodeling and virtual display, to knowledge discovery, disease diagnosis, clinical treatment, etc. Reconciling coronary longitudinal movement with a motion compensation model is a novel way to resolve the longitudinal distortion of the IVUS dataset. This fusion with other cardiac datasets and online processing are very important for future application, which means that more effective and efficient mining methods for the complicated datasets need to be studied.

REFERENCES

Alexe, S. (2003). Coronary risk prediction by logical analysis of data. Annals of Operations Research, 119(1-4), 15-42.

Chen, S.-Y. J., & Carroll, J. D. (2003). Kinematic and deformation analysis of 4-D coronary arterial trees reconstructed from cine angiograms. IEEE Transactions on Medical Imaging, 22(6), 710-720.

Cios, K. J., & Moore, W. (2002). Uniqueness of medical data mining. Artificial Intelligence in Medicine, 26(1-2), 1-24.

Cipra, B. A. (1999). Failure in sight for a mathematical model of the heart. SIAM News, 32(8).

Ding, Z., & Friedman, M. H. (2000). Quantification of 3D coronary arterial motion using clinical biplane cineangiograms. The International Journal of Cardiac Imaging, 16(5), 331-346.

Halligan, S., & Higano, S. T. (2003). Coronary assessment beyond angiography. Applications in Imaging: Cardiac Interventions, 12, 29-35.

Hsu, W., Lee, M. L., & Zhang, J. (2002). Image mining: Trends and developments. Journal of Intelligent Information Systems: Special Issue on Multimedia Data Mining, 19(1), 7-23.

Koning, G., Dijkstra, J., von Birgelen, C., Tuinenburg, J. C., et al. (2002). Advanced contour detection for three-dimensional intracoronary ultrasound: A validation in vitro and in vivo. The International Journal of Cardiovascular Imaging, 18, 235-248.

Kurgan, L. A., Cios, K. J., Tadeusiewicz, R., Ogiela, M., & Goodenday, L. S. (2001). Knowledge discovery approach to automated cardiac SPECT diagnosis. Artificial Intelligence in Medicine, 23(2), 149-169.

Liu, X., Koppelaar, H., Koffijberg, H., Bruining, N., & Hamers, R. (2004, October). Data reconciliation of immersive heart inspection. Paper presented at the IEEE International Conference on Systems, Man and Cybernetics, The Hague, Netherlands.

Mintz, G. S., Nissen, S. E., Anderson, W. D., Bailey, S. R., Erbel, R., Fitzgerald, P. J., Pinto, F. J., Rosenfield, K., Siegel, R. J., Tuzcu, E. M., & Yock, P. G. (2001). ACC clinical expert consensus document on standards for the acquisition, measurement and reporting of intravascular ultrasound studies: A report of the American College of Cardiology task force on clinical expert consensus documents. Journal of the American College of Cardiology, 37, 1478-1492.

Ogiela, M. R., & Tadeusiewicz, R. (2002). Syntactic reasoning and pattern recognition for analysis of coronary artery images. Artificial Intelligence in Medicine, 26, 145-159.

Rotger, B., Radeva, P., Mauri, J., & Fernandez-Nofrerias, E. (2002). Internal and external coronary vessel images registration. In M. T. Escrig, F. Toledo, & E. Golobardes (Eds.), Topics in Artificial Intelligence: 5th Catalonian Conference on AI, Castellón, Spain (pp. 408-418).

Sarry, L., & Boire, J. Y. (2001). Three-dimensional tracking of coronary arteries from biplane angiographic sequences using parametrically deformable models. IEEE Transactions on Medical Imaging, 20(12), 1341-1351.

Shechter, G., Devernay, F., Coste-Maniere, E., Quyyumi, A., & McVeigh, E. R. (2003). Three-dimensional motion tracking of coronary arteries in biplane cineangiograms. IEEE Transactions on Medical Imaging, 2, 1-16.

Takayama, T., & Hodgson, J. M. (2001). Prediction of the physiologic severity of coronary lesions using 3D IVUS: Validation by direct coronary pressure measurements. Catheterization and Cardiovascular Interventions, 53(1), 48-55.

Timinger, H., Krueger, S., Borgert, J., & Grewer, R. (2004). Motion compensation for interventional navigation on 3D static roadmaps based on an affine model and gating. Physics in Medicine and Biology, 49, 719-732.

Wahle, A., Mitchell, S. C., Olszewski, M. E., Long, R. M., & Sonka, M. (2000). Accurate visualization and quantification of coronary vasculature by 3-D/4-D fusion from biplane angiography and intravascular ultrasound. In EBiOS 2000: Biomonitoring and Endoscopy Technologies, Proceedings of SPIE Europto 4158 (pp. 144-155). Amsterdam, NL.

Weichert, F., Wawro, M., & Wilke, C. (2004). A 3D computer graphics approach to brachytherapy planning. The International Journal of Cardiovascular Imaging, 20, 173-182.

Winter, S. A. de, Hamers, R., Degertekin, M., Tanabe, K., Lemos, P. A., Serruys, P. W., Roelandt, J. R. T. C., &
Bruining, N. (2004). Retrospective image-based gating of intracoronary ultrasound images for improved quantitative analysis: The intelligate method. Catheterization and Cardiovascular Interventions, 61(1), 84-94.

Zhu, H., Oakeson, K. D., & Friedman, M. H. (2003). Retrieval of cardiac phase from IVUS sequences. In W. F. Walker & M. F. Insana (Eds.), Medical Imaging 2003: Ultrasonic Imaging and Signal Processing, Proceedings of SPIE 5035 (pp. 135-146).

KEY TERMS

Cardiac Data Fusion: Fusion of complementary information from two or more differing cardiac modalities (IVUS, ECG, CT, physicians' interpretations, etc.) that enhances insights into the underlying anatomy and physiology.

Computational Cardiology: Using mathematical and computer models to simulate the heart motion and its properties as a whole.

Immersive IVUS Images: The real-time cross-sectional images obtained from a pullback IntraVascular UltraSound transducer in human arteries. The dataset is usually a volume with artifacts caused by the complicated immersed environments.

IVUS Data Reconciliation: Mining in the IVUS individual dataset to compensate for or decrease artifacts to get improved data for use in further cardiac calculation and medical knowledge discovery.

IVUS Standards and Semantics: The standard and semantics of IVUS data and their medical quantitative measurements and qualitative assessments. Consistent definition and description improve medical data management and mining.

Medical Image Mining: Extracting the most relevant image features into a form suitable for data mining for medical knowledge discovery, or generating image patterns to improve the accuracy of images retrieved from image databases.

Virtual Reality Vasculature Reconstruction: For effective applications of intravascular analyses and brachytherapy, reconstructing and visualizing the vessel wall's interior structure in a single 3D/4D model by fusing invasive IVUS data and non-invasive angiography.
Imprecise Data and the Data Mining Process
John F. Kros
East Carolina University, USA
Missing or inconsistent data has been a pervasive problem in data analysis since the origin of data collection. The management of missing data in organizations has recently been addressed as more firms implement large-scale enterprise resource planning systems (see Vosburg & Kumar, 2001; Xu et al., 2002). The issue of missing data becomes an even more pervasive dilemma in the knowledge discovery process, in that as more data is collected, the higher the likelihood of missing data becomes.

The objective of this research is to discuss imprecise data and the data mining process. The article begins with a background analysis, including a brief review of both seminal and current literature. The main thrust of the chapter focuses on reasons for data inconsistency along with definitions of various types of missing data. Future trends followed by concluding remarks complete the chapter.

BACKGROUND

The analysis of missing data is a comparatively recent discipline. However, the literature holds a number of works that provide perspective on missing data and data mining. Afifi and Elashoff (1966) provide an early seminal paper reviewing the missing data and data mining literature. Little and Rubin's (1987) milestone work defined three unique types of missing data mechanisms and provided parametric methods for handling these types of missing data. These papers sparked numerous works in the area of missing data. Lee and Siau (2001) present an excellent review of data mining techniques within the knowledge discovery process. The references in this section are given as suggested reading for any analyst beginning research in the area of data mining and missing data.

The article focuses on the reasons for data inconsistency and the types of missing data. In addition, trends regarding missing data and data mining are discussed along with future research opportunities and concluding remarks.

REASONS FOR DATA INCONSISTENCY

Data inconsistency may arise for a number of reasons, including:

Procedural Factors
Refusal of Response
Inapplicable Responses

These three reasons tend to cover the largest areas of missing data in the data mining process.

Procedural Factors

Data entry errors are common, and their impact on the knowledge discovery process and data mining can generate serious problems. Inaccurate classifications, erroneous estimates, predictions, and invalid pattern recognition may also take place. In situations where databases are being refreshed with new data, blank responses from questionnaires further complicate the data mining process. If a large number of similar respondents fail to complete similar questions, the deletion or misclassification of these observations can take the researcher down the wrong path of investigation or lead to inaccurate decision-making by end users.

Refusal of Response

Some respondents may find certain survey questions offensive or they may be personally sensitive to certain
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Imprecise Data and the Data Mining Process
questions. For example, some respondents may have no opinion regarding certain questions, such as political or religious affiliation. In addition, questions that refer to one's education level, income, age or weight may be deemed too private for some respondents to answer. Furthermore, respondents may simply have insufficient knowledge to accurately answer particular questions. Students or inexperienced individuals may have insufficient knowledge to answer certain questions (such as salaries in various regions of the country, retirement options, insurance choices, etc.).

Inapplicable Responses

Sometimes questions are left blank simply because the questions apply to a more general population rather than to an individual respondent. If a subset of questions on a questionnaire does not apply to the individual respondent, data may be missing for a particular expected group within a data set. For example, adults who have never been married or who are widowed or divorced are likely not to answer a question regarding years of marriage.

TYPES OF MISSING DATA

The following is a list of the standard types of missing data:

Data Missing at Random
Data Missing Completely at Random
Non-Ignorable Missing Data
Outliers Treated as Missing Data

It is important for an analyst to understand the different types of missing data before they can address the issue. Each type of missing data is defined next.

[Data] Missing At Random (MAR)

Rubin (1978), in a seminal missing data research paper, defined missing data as MAR when, given the variables X and Y, the probability of response depends on X but not on Y. Cases containing incomplete data must be treated differently than cases with complete data. For example, if the likelihood that a respondent will provide his or her weight depends on the probability that the respondent will not provide his or her age, then the missing data is considered to be Missing At Random (MAR) (Kim, 2001).

[Data] Missing Completely At Random (MCAR)

Kim (2001), based on an earlier work, classified data as MCAR when the probability of response [shows that] independence exists between X and Y. MCAR data exhibit a higher level of randomness than does MAR. In other words, the observed values of Y are truly a random sample of all values of Y, and no other factors included in the study may bias the observed values of Y.

Consider the case of a laboratory providing the results of a chemical compound decomposition test in which a significant level of iron is being sought. If certain levels of iron are met or missing entirely, and no other elements in the compound are identified to correlate, then it can be determined that the identified or missing data for iron is MCAR.

Non-Ignorable Missing Data

In contrast to the MAR situation, where data missingness is explained by other measured variables in a study, non-ignorable missing data arise due to the data missingness pattern being explainable, and only explainable, by the very variable(s) on which the data are missing. For example, given two variables, X and Y, data is deemed non-ignorable when the probability of response depends on variable X and possibly on variable Y. For example, if the likelihood of an individual providing his or her weight varied within various age categories, the missing data is non-ignorable (Kim, 2001). Thus, the pattern of missing data is non-random and possibly predictable from other variables in the database.

In practice, the MCAR assumption is seldom met. Most missing data methods are applied upon the assumption of MAR. In correspondence with Kim (2001), non-ignorable missing data is the hardest condition to deal with but, unfortunately, the most likely to occur as well.

Outliers Treated As Missing Data

Many times it is necessary to classify these outliers as missing data. Pre-testing and calculating threshold boundaries are necessary in the pre-processing of data in order to identify those values which are to be classified as missing. Data whose values fall outside of pre-defined ranges may skew test results. Consider the case of a laboratory providing the results of a chemical compound decomposition test. If it has been predetermined that the maximum amount of iron that can be
contained in a particular compound is 500 parts/million, then the value for the variable iron should never exceed that amount. If, for some reason, the value does exceed 500 parts/million, then some visualization technique should be implemented to identify that value. Those offending cases are then presented to the end users.

COMMONLY USED METHODS OF ADDRESSING MISSING DATA

procedures resulting in the replacement of missing values by attributing them to other available data. This research investigates the most common imputation methods, including:

Case Deletion
Mean Substitution
Cold Deck Imputation
Hot Deck Imputation
Regression Imputation
Depression of observed correlations due to the repetition of a constant value

Obviously, a researcher must weigh the advantages against the disadvantages before implementation.

Cold Deck Imputation

Cold deck imputation methods select values or use relationships obtained from sources other than the current data. With this method, the end user substitutes a constant value derived from external sources or from previous research for the missing values. The end user must ascertain that the replacement value used is more valid than any internally derived value. Unfortunately, feasible values are not always provided using cold deck imputation methods. Many of the same disadvantages that apply to the mean substitution method apply to cold deck imputation. Cold deck imputation methods are rarely used as the sole method of imputation and instead are generally used to provide starting values for hot deck imputation methods.

Hot Deck Imputation

Generally speaking, hot deck imputation replaces missing values with values drawn from the next most similar case. The implementation of this imputation method results in the replacement of a missing value with a value selected from an estimated distribution of similar responding units for each missing value. In most instances, the empirical distribution consists of values from responding units. For example, Table 1 displays a data set containing missing data.

Table 1. Illustration of hot deck imputation: incomplete data set

Case  Item 1  Item 2  Item 3  Item 4
1     10      22      30      25
2     23      20      30      23
3     25      20      30      ???
4     11      25      10      12

From Table 1, it is noted that case three is missing data for item four. In this example, cases one, two, and four are examined. Using hot deck imputation, each of the other cases with complete data is examined, and the value for the most similar case is substituted for the missing data value. Case four is easily eliminated, as it has nothing in common with case three. Cases one and two both have similarities with case three. Case one has one item in common, whereas case two has two items in common. Therefore, case two is the most similar to case three.

Once the most similar case has been identified, hot deck imputation substitutes the most similar complete case's value for the missing value. Since case two contains the value of 23 for item four, a value of 23 replaces the missing data point for case three. The advantages of hot deck imputation include conceptual simplicity, maintenance of the proper measurement level of variables, and the availability of a complete set of data at the end of the imputation process that can be analyzed like any complete set of data. One of hot deck's disadvantages is the difficulty in defining what is "similar"; hence, many different schemes for deciding on what is similar may evolve.

Regression Imputation

Regression analysis is used to predict missing values based on the variable's relationship to other variables in the data set. Single and/or multiple regression can be used to impute missing values. The first step consists of identifying the independent variables and the dependent variable. In turn, the dependent variable is regressed on the independent variables. The resulting regression equation is then used to predict the missing values. Table 2 displays an example of regression imputation.

From the table, twenty cases with three variables (income, age, and years of college education) are listed. Income contains missing data and is identified as the dependent variable, while age and years of college education are identified as the independent variables. The following regression equation is produced for the example:

Table 2. Illustration of regression imputation

Case  Income        Age  Years of College Education  Regression Prediction
1     $95,131.25    26   4                           $96,147.60
2     $108,664.75   45   6                           $104,724.04
3     $98,356.67    28   5                           $98,285.28
4     $94,420.33    28   4                           $96,721.07
5     $104,432.04   46   3                           $100,318.15
6     $97,151.45    38   4                           $99,588.46
7     $98,425.85    35   4                           $98,728.24
8     $109,262.12   50   6                           $106,157.73
9     $95,704.49    45   3                           $100,031.42
10    $99,574.75    52   5                           $105,167.00
11    $96,751.11    30   0                           $91,037.71
12    $111,238.13   50   6                           $106,157.73
13    $102,386.59   46   6                           $105,010.78
14    $109,378.14   48   6                           $105,584.26
15    $98,573.56    50   4                           $103,029.32
16    $94,446.04    31   3                           $96,017.08
17    $101,837.93   50   4                           $103,029.32
18    ???           55   6                           $107,591.43
19    ???           35   4                           $98,728.24
20    ???           39   5                           $101,439.40
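Both imputations walked through above can be sketched in a few lines. The similarity count for hot deck and the least-squares regression follow the text's description; assuming the printed predictions came from an ordinary least-squares fit on the seventeen complete cases, refitting reproduces the imputed incomes for cases 18-20.

```python
import numpy as np

# Hot deck imputation on Table 1: similarity = number of matching item values.
table1 = {1: [10, 22, 30, 25], 2: [23, 20, 30, 23],
          3: [25, 20, 30, None], 4: [11, 25, 10, 12]}

def hot_deck(table, case, item):
    """Donate the value of the most similar complete case."""
    donors = [c for c, row in table.items()
              if c != case and row[item] is not None]
    def similarity(donor):
        # Count matching items, skipping the recipient's missing entries.
        return sum(a == b for a, b in zip(table[donor], table[case])
                   if b is not None)
    return table[max(donors, key=similarity)][item]

table1[3][3] = hot_deck(table1, 3, 3)  # case two is most similar; donates 23

# Regression imputation on Table 2: regress income on age and education
# over the seventeen complete cases, then predict the missing incomes.
age = [26, 45, 28, 28, 46, 38, 35, 50, 45, 52, 30, 50, 46, 48, 50, 31, 50]
edu = [4, 6, 5, 4, 3, 4, 4, 6, 3, 5, 0, 6, 6, 6, 4, 3, 4]
income = [95131.25, 108664.75, 98356.67, 94420.33, 104432.04, 97151.45,
          98425.85, 109262.12, 95704.49, 99574.75, 96751.11, 111238.13,
          102386.59, 109378.14, 98573.56, 94446.04, 101837.93]
X = np.column_stack([np.ones(len(age)), age, edu])
beta, *_ = np.linalg.lstsq(X, np.array(income), rcond=None)

def predict_income(a, e):
    return float(beta @ np.array([1.0, a, e]))

# Cases 18-20 have missing income; the fitted equation fills them in.
imputed = [predict_income(55, 6), predict_income(35, 4), predict_income(39, 5)]
```

Note that cases 7 and 19 share the same age and education and therefore receive the same prediction, which is one visible check that the imputed values fall on the fitted plane.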
Little, R., & Rubin, D. (1987). Statistical analysis with missing data. New York: Wiley.

Rubin, D. (1978). Multiple imputations in sample surveys: A phenomenological Bayesian approach to nonresponse. In Imputation and Editing of Faulty or Missing Survey Data (pp. 1-23). Washington, DC: U.S. Department of Commerce.

Vosburg, J., & Kumar, A. (2001). Managing dirty data in organizations using ERP: Lessons from a case study. Industrial Management & Data Systems, 101(1), 21-31.

Xu, H., Horn Nord, J., Brown, N., & Nord, G.D. (2002). Data quality issues in implementing an ERP. Industrial Management & Data Systems, 102(1), 47-58.

KEY TERMS

Data Missing Completely at Random (MCAR): When the observed values of a variable are truly a random sample of all values of that variable (i.e., the response exhibits independence from any variables).

Data Imputation: The process of estimating missing data of an observation based on the valid values of other variables.

Data Missing at Random (MAR): When, given the variables X and Y, the probability of response depends on X but not on Y.

Inapplicable Responses: Respondents omit an answer due to doubts about its applicability.

Knowledge Discovery Process: The overall process of information discovery in large volumes of warehoused data.

Non-Ignorable Missing Data: Arise when the data missingness pattern is explainable, non-random, and possibly predictable from other variables.

Procedural Factors: Inaccurate classifications of new data, resulting in classification error or omission.

Refusal of Response: Respondents' outward omission of a response due to personal choice, conflict, or inexperience.
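The Data Imputation entry can be illustrated with a minimal regression-imputation sketch. Table 2 of this article regresses income on both age and education; for brevity this illustrative sketch fits income on age alone, so its coefficients and prediction differ from the table's, and the function name is an assumption of this example:

```python
def fit_ols(xs, ys):
    """Ordinary least squares for y = a + b*x, fit on complete cases only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b          # intercept, slope

# Complete cases (age, income) drawn from Table 2; a case with a missing
# income value then receives the regression prediction for its age.
ages = [26, 45, 28, 28, 46, 38, 35, 50]
incomes = [95131.25, 108664.75, 98356.67, 94420.33,
           104432.04, 97151.45, 98425.85, 109262.12]
a, b = fit_ols(ages, incomes)
imputed_income = a + b * 55      # prediction for a case aged 55
```

The key point of the technique survives the simplification: the model is fit only on cases with observed values, and each missing value is replaced by the model's prediction from the case's observed covariates.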
Figure 1. Integrated view of the knowledge discovery process (Adapted from Wickramasinghe et al., 2003)

[Figure: shows the evolution of knowledge through the steps in knowledge discovery, ending with the two types of data mining: exploratory data mining and predictive data mining.]

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Incorporating the People Perspective into Data Mining
standable patterns in data (Spiegler, 2003; Fayyad, Piatetsky-Shapiro, & Smyth, 1996). KDD is primarily used on data sets for creating knowledge through model building or by finding patterns and relationships in data. From an application perspective, data mining and KDD are often used interchangeably. Figure 1 presents a generic representation of a typical knowledge discovery process. This figure not only depicts each stage within the KDD process but also highlights the evolution of knowledge from data through information in this process, as well as the two major types of data mining, namely, exploratory and predictive; the last two steps (i.e., data mining and interpretation/evaluation) in the KDD process are considered predictive data mining. It is important to note in Figure 1 that typically in the KDD process, the knowledge component itself is treated as a homogeneous block. Given the well-established multifaceted nature of the knowledge construct (Boland & Tenkasi, 1995; Malhotra, 2000; Alavi & Leidner, 2001; Schultze & Leidner, 2002; Wickramasinghe et al., 2003), this would appear to be a significant limitation or oversimplification of knowledge creation through data mining as a technique and of the KDD process in general.

The Psychosocial-Driven Perspective to Knowledge Creation

Knowledge can exist in essentially two forms: explicit, or factual, knowledge and tacit, or experiential (i.e., "know how"), knowledge (Polanyi, 1958, 1966). Of equal significance is the fact that organizational knowledge is not static; rather, it changes and evolves during the lifetime of an organization (Becerra-Fernandez & Sabherwal, 2001; Bendoly, 2003; Choi & Lee, 2003). Furthermore, it is possible to change the form of knowledge, that is, transform existing tacit knowledge into new explicit knowledge, and existing explicit knowledge into new tacit knowledge, or to transform the subjective form of knowledge into the objective form of knowledge (Nonaka & Nishiguchi, 2001; Nonaka, 1994). This process of transforming the form of knowledge, and thus increasing the extant knowledge base as well as the amount and utilization of the knowledge within the organization, is known as the knowledge spiral (Nonaka & Nishiguchi, 2001). In each of these instances, the overall extant knowledge base of the organization grows to a new superior knowledge base.

According to Nonaka and Nishiguchi (2001), four things are true: a) Tacit-to-tacit knowledge transformation usually occurs through apprenticeship-type relations, where the teacher or master passes on the skill to the apprentice; b) Explicit-to-explicit knowledge transformation usually occurs via formal learning of facts; c) Tacit-to-explicit knowledge transformation usually occurs when there is an articulation of nuances; for example, as in health care, if a renowned surgeon is questioned as to why he does a particular procedure in a certain manner, by his articulation of the steps, the tacit knowledge becomes explicit; and d) Explicit-to-tacit knowledge transformation usually occurs as new explicit knowledge is internalized; it can then be used to broaden, reframe, and extend one's tacit knowledge. These transformations are often referred to as the modes of socialization, combination, externalization, and internalization, respectively (Nonaka, 1994). Integral to this changing of knowledge through the knowledge spiral is that new knowledge is created (Nonaka & Nishiguchi, 2001), which can bring many benefits to organizations. Specifically, in today's knowledge-centric economy, processes that effect a positive change to the existing knowledge base of the organization and facilitate better use of the organization's intellectual capital, as the knowledge spiral does, are of paramount importance.

Two other primarily people-driven frameworks that focus on knowledge creation as a central theme are Spender's and Blackler's respective frameworks (Newell, Robertson, Scarbrough, & Swan, 2002; Swan, Scarbrough, & Preston, 1999). Spender draws a distinction between individual knowledge and social knowledge, each of which he claims can be implicit or explicit (Newell et al.). From this framework, you can see that Spender's definition of implicit knowledge corresponds to Nonaka's tacit knowledge. However, unlike Spender, Nonaka doesn't differentiate between individual and social dimensions of knowledge; rather, he focuses on the nature and types of the knowledge itself. In contrast, Blackler (Newell et al.) views knowledge creation from an organizational perspective, noting that knowledge can exist as encoded, embedded, embodied, encultured, and/or embrained. In addition, Blackler emphasized that for different organizational types, different types of knowledge predominate, and he highlighted the connection between knowledge and organizational processes (Newell et al.).

Blackler's types of knowledge can be thought of in terms of spanning a continuum of tacit (implicit) through to explicit, with embrained being predominantly tacit (implicit) and encoded being predominantly explicit, while the embedded, embodied, and encultured types of knowledge exhibit varying degrees of a tacit (implicit)/explicit combination. An integrated view of all three frameworks is presented in Figure 2. Specifically, from Figure 2, Spender's and Blackler's perspectives complement Nonaka's conceptualization of knowledge creation and, more importantly, do not contradict his thesis of the knowledge spiral, wherein the extant knowledge base is continually being expanded to a new knowledge base, be it tacit/explicit (in Nonaka's terminology), implicit/explicit (in Spender's terminology), or
Figure 2. People driven knowledge creation map

[Figure: the knowledge continuum runs from explicit, through other knowledge types (embodied, encultured, embedded), to tacit/implicit/embrained; Spender's actors run from individual to social; the knowledge spiral operates across this map.]

embrained/encultured/embodied/embedded/encoded (in Blackler's terminology).

ENRICHING DATA MINING WITH THE KNOWLEDGE SPIRAL

To conceive of knowledge as a collection of information seems to rob the concept of all of its life. . . . Knowledge resides in the user and not in the collection. It is how the user reacts to a collection of information that matters (Churchman, 1971, p. 10).

Churchman is clearly underscoring the importance of people in the process of knowledge creation. However, most formulations of information technology (IT) enabled knowledge management, and data mining in particular, seem to have not only ignored the human element but also taken a very myopic and homogenous perspective on the knowledge construct itself. Recent research that has surveyed the literature on KM indicates the need for more frameworks for knowledge management, particularly a metaframework to facilitate more successful realization of the KM steps (Wickramasinghe & Mills, 2001; Holsapple & Joshi, 2002; Alavi & Leidner, 2001; Schultze & Leidner, 2002). From a macro knowledge management perspective, the knowledge spiral is the cornerstone of knowledge creation. From a micro data-mining perspective, one of the key strengths of data mining as a technique is that it facilitates knowledge creation from data. Therefore, by integrating the algorithmic approach of knowledge creation (in particular data mining) with the psychosocial approach of knowledge creation (i.e., the people-driven frameworks of knowledge creation, in particular the knowledge spiral), it is indeed possible to develop a metaframework for knowledge creation. By so doing, a richer and more complete approach to knowledge creation is realized. Such an approach not only leads to a deeper understanding of the knowledge creation process but also offers a knowledge creation methodology that is more customizable to specific organizational contexts, structures, and cultures. Furthermore, it brings the human factor back into the knowledge creation process and doesn't oversimplify the complex knowledge construct as a homogenous product.

Specifically, in Figure 3, the knowledge product of data mining is broken into its constituent components based on the people-driven perspectives (i.e., Blackler, Spender, and Nonaka, respectively) of knowledge creation. On the other hand, the specific modes of transformation of the knowledge spiral discussed by Nonaka should benefit from the algorithmic, structured nature of both exploratory and predictive data-mining techniques. For example, if you consider socialization, which is described in Nonaka and Nishiguchi (2001) and Nonaka (1994) as the process of creating new tacit knowledge through discussion within groups, more specifically groups of experts, and you then incorporate the results of data-mining techniques into this context, this provides a structured forum and hence a jump start for guiding the dialogue and, consequently, knowledge creation. Note, however, that this only enriches the socialization process without restricting the actual brainstorming activities, and thus does not necessarily lead to the side effect of truncating divergent thoughts. This also holds for Nonaka's other modes of knowledge transformation.

An Example within a Health Care Context

An illustration from health care will serve to illustrate the potential of combining the people perspective with data mining. Health care is a very information-rich industry. The collecting of data and information permeates most if not all areas of this industry. By incorporating a people perspective with data mining, it will be possible to realize the full potential of these data assets.

Table 1 details some specific instances of each of the transformations identified in Figure 3, and Table 2 provides an example of explicit knowledge stored in a medication repository.²

Using the association rules data-mining algorithm, the following patterns can be discovered:

• D1 is administered to 60% of the patients (i.e., 3/5).
• D1 and D2 are administered together to 40% of the patients (i.e., 2/5).
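These percentages are the standard support and confidence measures of association rule mining. A minimal sketch (transaction data as in the drug-administration table; helper names are illustrative) reproduces the example's numbers:

```python
# Transactions: the drugs administered to each of the five patients.
transactions = [
    {"D1", "D2"},
    {"S3", "D4", "D5"},
    {"D3", "D1", "D2"},
    {"D3", "D5", "D1"},
    {"D5", "D2"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"D1"}))                       # 0.6  -> D1 given to 60% (3/5)
print(support({"D1", "D2"}))                 # 0.4  -> D1 and D2 together, 40% (2/5)
print(round(confidence({"D1"}, {"D2"}), 2))  # 0.67 -> 67% of D1 patients also get D2
```

An association rule miner such as Apriori enumerates exactly these statistics over all itemsets that clear a minimum support threshold, which is how patterns like "D1 implies D2" surface for the physicians to interpret.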
[Figure 3: crosses Nonaka's/Spender's/Blackler's knowledge types and Spender's actors (individual through social) with exploratory and predictive data mining; the explicit row contains the quadrants 3. INTERNALIZATION and 4. COMBINATION.

Specific modes of transformations and the role of data mining in realizing them:
1. Socialization: experiential to experiential via practice group interactions on data-mining results
2. Externalization: written extensions to data-mining results
3. Internalization: enhancing experiential skills by learning from data-mining results
4. Combination: gaining insights through applying various data mining visualization techniques]

Table 1. Specific instances of the transformations identified in Figure 3¹

FROM EXPLICIT / TO EXPLICIT: The performance of exploratory data mining, such as summarization and visualization, makes it possible to upgrade, expand, and/or revise current facts and protocols.

FROM EXPLICIT / TO TACIT: By assimilating and internalizing knowledge discovered through data mining, physicians can turn this explicit knowledge into tacit knowledge, which they can apply to treating new patients.

FROM TACIT / TO EXPLICIT: Interpretation of findings from data mining helps reveal tacit knowledge, which can then be articulated and stored as explicit cases in the case repository.

FROM TACIT / TO TACIT: Interpreting treatment patterns (for example, for hip disease) discovered through data mining enables interaction among physicians, hence making it possible to stimulate and grow their own tacit knowledge.
Table 2. Drugs administered to patients

Patient ID  Drug
1   D1, D2
2   S3, D4, D5
3   D3, D1, D2
4   D3, D5, D1
5   D5, D2

• D2 is administered to 67% of the patients who are given drug D1 (i.e., 2/3).

As the physicians try to understand these findings, one physician could explain that D2 has to be given with D1 for patients who had a heart attack at age 40 or less. Thus, from this observation, the following rule can be added to the rule repository: If a patient's age is <= 40 years, the patient has a heart attack, and D1 is administered to the patient, then D2 should also be administered to that patient. This is an example of existing tacit knowledge, because it originated from the physician's head and was transformed into new explicit knowledge that is now recorded for everyone to use (as Figure 3 demonstrates).

As physicians discuss the implications of these findings, tacit knowledge from some of them is transformed into tacit knowledge for other physicians. Thus, during the interaction of physicians from different specialties, an environment of existing-tacit-to-new-tacit knowledge transformations occurs. These knowledge transformations are summarized in Table 3. Figure 3 and Table 3 illustrate how data mining helps to realize the four modes or transformations of the knowledge spiral (socialization, externalization, internalization, and combination).

Table 3. Data mining as an enabler of the knowledge spiral from the example data set

[Table grid: FROM explicit/tacit × TO explicit/tacit; cell label: INTERACTION]

1 Each entry gives an example of how data mining enables the knowledge transfer from the knowledge type in the cell row to the knowledge type in the cell column.

FUTURE TRENDS

The two significant ways to create knowledge are through a) synthesis of new knowledge through socialization with experts (a primarily people-dominated perspective) and b) discovery by finding interesting patterns through observation and combination of explicit data (a primarily technology-driven perspective) (Becerra-Fernandez et al., 2004). In today's knowledge economy, knowledge creation and the maximization of an organization's knowledge and data assets are key strategic necessities. Furthermore, more techniques, such as business intelligence and business analytics, which have their foundations in traditional data mining, are being embraced by organizations in order to try to facilitate the discovery of novel and unique patterns in data that will lead to new knowledge and maximization of an organization's data assets. Full maximization of an organization's data assets, however, will not be realized until the people perspective is incorporated into these data-mining techniques to enable the full potential of knowledge creation to occur. Thus, as organizations strive to survive and thrive in today's competitive business environment, incorporating a people perspective into their data-mining initiatives will increasingly become a competitive necessity.

CONCLUSION

Sustainable competitive advantage is dependent on building and exploiting core competencies (Newell et al., 2002). In order to sustain competitive advantage, resources that are idiosyncratic (and thus scarce) and difficult to transfer or replicate are required (Grant, 1991). A knowledge-based view of the firm identifies knowledge as the organizational asset that enables sustainable competitive advantage, especially in hypercompetitive environments (Wickramasinghe, 2003; Davenport & Prusak, 1998; Zack, 1999). This is attributed to the fact that barriers exist regarding the transfer and replication of knowledge (Wickramasinghe, 2003), thus making knowledge and knowledge management of strategic significance (Kanter, 1999). The key to maximizing the knowledge asset is in finding novel and actionable patterns and in continuously creating new knowledge, thereby increasing the extant knowledge base of the organization. By incorporating a people perspective into data mining, it becomes truly possible to support both major types of knowledge creation scenarios and thereby realize the synergistic effect of the respective strengths of these approaches in enabling superior knowledge creation to ensue.

REFERENCES

Alavi, M., & Leidner, D. (2001). Review: Knowledge management and knowledge management systems: Conceptual foundations and research issues. MIS Quarterly, 25(1), 107-136.

Becerra-Fernandez, I., Gonzalez, A., & Sabherwal, R. (2004). Knowledge management. Upper Saddle River, NJ: Prentice Hall.

Becerra-Fernandez, I., & Sabherwal, R. (2001). Organizational knowledge management: A contingency perspective. Journal of Management Information Systems, 18(1), 23-55.

Bendoly, E. (2003). Theory and support for process frameworks of knowledge discovery and data mining from ERP systems. Information & Management, 40, 639-647.

Boland, R., & Tenkasi, R. (1995). Perspective making perspective taking. Organization Science, 6, 350-372.

Choi, B., & Lee, H. (2003). An empirical investigation of KM styles and their effect on corporate performance. Information & Management, 40, 403-417.

Churchman, C. (1971). The design of inquiring systems: Basic concepts of systems and organizations. New York: Basic Books.

Davenport, T., & Grover, V. (2001). Knowledge management. Journal of Management Information Systems, 18(1), 3-4.

Davenport, T., & Prusak, L. (1998). Working knowledge. Boston: Harvard Business School Press.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery: An overview. In Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy (Eds.), Advances in knowledge discovery and data mining. Menlo Park, CA: AAAI Press/MIT Press.

Grant, R. (1991). The resource-based theory of competitive advantage: Implications for strategy formulation. California Management Review, 33(3), 114-135.

Holsapple, C., & Joshi, K. (2002). Knowledge manipulation activities: Results of a Delphi study. Information & Management, 39, 477-490.

Kanter, J. (1999). Knowledge management practically speaking. Information Systems Management.

Malhotra, Y. (2000). Knowledge management and new organizational form. In Malhotra (Ed.), Knowledge management and virtual organizations. Hershey, PA: Idea Group Publishing.

Newell, S., Robertson, M., Scarbrough, H., & Swan, J. (2002). Managing knowledge work. New York: Palgrave Macmillan.

Nonaka, I. (1994). A dynamic theory of organizational knowledge creation. Organization Science, 5, 14-37.

Nonaka, I., & Nishiguchi, T. (2001). Knowledge emergence. Oxford, UK: Oxford University Press.
Polanyi, M. (1958). Personal knowledge: Towards a postcritical philosophy. Chicago: University of Chicago Press.

Polanyi, M. (1966). The tacit dimension. London: Routledge & Kegan Paul.

Schultze, U., & Leidner, D. (2002). Studying knowledge management in information systems research: Discourses and theoretical assumptions. MIS Quarterly, 26(3), 212-242.

Spiegler, I. (2003). Technology and knowledge: Bridging a generating gap. Information & Management, 40, 533-539.

Swan, J., Scarbrough, H., & Preston, J. (1999). Knowledge management: The next fad to forget people? Proceedings of the Seventh European Conference in Information Systems.

Wickramasinghe, N. (2003). Do we practise what we preach: Are knowledge management systems in practice truly reflective of knowledge management systems in theory? Business Process Management Journal, 9(3), 295-316.

Wickramasinghe, N., Fadlalla, A., Geisler, E., & Schaffer, J. (2003). Knowledge management and data mining: Strategic imperatives for healthcare. Proceedings of the Third Hospital of the Future Conference, Warwick, UK.

Wickramasinghe, N., & Mills, G. (2001). MARS: The electronic medical record system the core of the Kaiser galaxy. International Journal of Healthcare Technology Management, 3(5/6), 406-423.

Zack, M. (1999). Knowledge and strategy. Boston: Butterworth Heinemann.

KEY TERMS

Combination: A knowledge transfer mode that involves new explicit knowledge being derived from existing explicit knowledge.

Explicit Knowledge: Also known as factual knowledge (i.e., "know what") (Cabena et al., 1998); represents knowledge that is well established and documented.

Externalization: A knowledge transfer mode that involves new explicit knowledge being derived from existing tacit knowledge.

Hegelian/Kantian Perspective of Knowledge Management: Refers to the subjective component of knowledge management; can be viewed as an ongoing phenomenon shaped by the social practices of communities and encouraging discourse and divergence of meaning.

Internalization: A knowledge transfer mode that involves new tacit knowledge being derived from existing explicit knowledge.

Knowledge Spiral: The process of transforming the form of knowledge and thus increasing the extant knowledge base as well as the amount and utilization of the knowledge within the organization.

Lockean/Leibnitzian Perspective of Knowledge Management: Refers to the objective aspects of knowledge management, where the need for knowledge is to improve effectiveness and efficiency.

Socialization: A knowledge transfer mode that involves new tacit knowledge being derived from existing tacit knowledge.

Tacit Knowledge: Also known as experiential knowledge (i.e., "know how") (Cabena et al., 1998); represents knowledge that is gained through experience.

ENDNOTES

1. Each entry explains how data mining enables the knowledge transfer from the type of knowledge in the cell row to the type of knowledge in the cell column.
2. The example is kept small and simple for illustrative purposes; naturally, in large medical databases the data would be much larger.
3. Each entry gives an example of how data mining enables the knowledge transfer from the knowledge type in the cell row to the knowledge type in the cell column.
Jongeun Jun
University of Southern California, USA
Dennis McLeod
University of Southern California, USA
Incremental Mining from News Streams
ontologies have a limitation in supporting a topical search. In sum, it is essential to develop incremental text mining methods for intelligent news information presentation.

MAIN THRUST

In the following, we will explore text mining approaches that are relevant for news streams data.

Requirements of Document Clustering in News Streams

The data we are considering are high dimensional, large in size, noisy, and a continuous stream of documents. Many previously proposed document clustering algorithms did not perform well on this dataset for a variety of reasons. In the following, we define the application-dependent (in terms of news streams) constraints that the clustering algorithm must satisfy.

1. Ability to determine input parameters: Many clustering algorithms require a user to provide input parameters (e.g., the number of clusters), which are difficult to determine in advance, in particular when we are dealing with incremental datasets. Thus, we expect the clustering algorithm not to need such knowledge.
2. Scalability with a large number of documents: The number of documents to be processed is extremely large. In general, the problem of clustering n objects into k clusters is NP-hard. Successful clustering algorithms should be scalable with the number of documents.
3. Ability to discover clusters with different shapes and sizes: A document cluster can be of arbitrary shape; hence we cannot assume the shape of a document cluster (e.g., a hyper-sphere in k-means). In addition, the sizes of clusters can vary arbitrarily, so clustering algorithms should identify clusters with wide variance in size.
4. Outlier identification: In news streams, outliers have a significant importance. For instance, a unique document in a news stream may imply a new technology or event that has not been mentioned in previous articles. Thus, forming a singleton cluster for the outlier is important.
5. Efficient incremental clustering: Given different orderings of the same dataset, many incremental clustering algorithms produce different clusters, which is an unreliable phenomenon. Thus, the incremental clustering should be robust to the input sequence. Moreover, due to the frequent document insertion into the database, whenever a new document is inserted it should perform a fast update of the existing cluster structure.
6. Meaningful theme of clusters: We expect each cluster to reflect a meaningful theme. We define a meaningful theme in terms of precision and recall. That is, if a cluster (C) is about the Turkey earthquake, then all documents about the Turkey earthquake should belong to C, and documents that do not talk about the Turkey earthquake should not belong to C.
7. Interpretability of resulting clusters: A clustering structure needs to be tied to a succinct summary of each cluster. Consequently, clustering results should be easily comprehensible by users.

Previous Document Clustering Approaches

The most widely used document clustering algorithms fall into two categories: partition-based clustering and hierarchical clustering. In the following, we provide a concise overview of each, and discuss why these approaches fail to address the requirements discussed above.

Partition-based clustering decomposes a collection of documents into a partition that is optimal with respect to some predefined function (Duda, Hart, & Stork, 2001; Liu, Gong, Xu, & Zhu, 2002). Typical methods in this category include center-based clustering, the Gaussian Mixture Model, and so on. Center-based algorithms identify the clusters by partitioning the entire dataset into a pre-determined number of clusters (e.g., k-means clustering). Although center-based clustering algorithms have been widely used in document clustering, there are at least five serious drawbacks. First, in many center-based clustering algorithms, the number of clusters needs to be determined beforehand. Second, the algorithm is sensitive to the initial seed selection. Third, it can model only a spherical (k-means) or ellipsoidal (k-medoid) shape of clusters. Furthermore, it is sensitive to outliers, since a small number of outliers can substantially influence the mean value. Note that capturing an outlier document and forming a singleton cluster is important. Finally, due to the iterative scheme used to produce clustering results, it is not relevant for incremental datasets.

Hierarchical (agglomerative) clustering (HAC) identifies the clusters by initially assigning each document to its own cluster and then repeatedly merging pairs of clusters until a certain stopping condition is met (Zhao & Karypis, 2002). Consequently, its result is in the form of a tree, which is referred to as a dendrogram. A dendrogram is represented as a tree with numeric levels associated with its branches. The main advantage of HAC lies in its ability
to provide a view of data at multiple levels of abstraction. methodology is required in terms of incremental news
Although HAC can model arbitrary shapes and different article clustering.
sizes of clusters, and can be extended to the robust version
(in outlier handling sense), it is not relevant for news Dynamic Topic Mining
streams application due to the following two reasons.
First, since HAC builds a dendrogram, a user should Dynamic topic mining is a framework that supports the
determine where to cut the dendrogram to produce actual identification of meaningful patterns (e.g., events, top-
clusters. This step is usually done by human visual inspec- ics, and topical relations) from news stream data (Chung
tion, which is a time-consuming and subjective process. & McLeod, 2003). To build a novel paradigm for an
Second, the computational complexity of HAC is expen- intelligent news database management and navigation
sive since pairwise similarities between clusters need to be scheme, it utilizes techniques in information retrieval,
computed. data mining, machine learning, and natural language
processing.
Topic Detection and Tracking In dynamic topic mining, a Web crawler downloads
news articles from a news Web site on a daily basis.
Over the past six years, the information retrieval commu- Retrieved news articles are processed by diverse infor-
nity has developed a new research area, called TDT (Topic mation retrieval and data mining tools to produce useful
Detection and Tracking) (Makkonen, Ahonen-Myka, & higher-level knowledge, which is stored in a content
Incremental Mining from News Streams

Salmenkivi, 2004; Allan, 2002). The main goal of TDT is to detect the occurrence of a novel event in a stream of news stories and to track known events. In particular, there are three major components in TDT.

1. Story segmentation: segments a news stream (e.g., including transcribed speech) into topically cohesive stories. Since online Web news (in HTML format) is supplied in segmented form, this task applies only to audio or TV news.
2. First Story Detection (FSD): identifies whether a new document belongs to an existing topic or a new topic.
3. Topic tracking: tracks events of interest based on sample news stories. It associates incoming news stories with related stories that were already discussed, and it can also be asked to monitor the news stream for further stories on the same topic.

An event is defined as some unique thing that happens at some point in time; hence, an event is different from a topic. For example, "airplane crash" is a topic, while "Chinese airplane crash in Korea in April 2002" is an event. Note that it is important to identify events as well as topics. Although a user may not be interested in the flood topic in general, the user may be interested in a news story on the Texas flood if the user's hometown is in Texas. Thus, a news recommendation system must be able to distinguish different events within the same topic.

Single-pass document clustering (Chung & McLeod, 2003) has been used extensively in TDT research. However, the major drawback of this approach lies in its order-sensitive property. Although the order of documents is already fixed, since documents are inserted into the database in chronological order, the order-sensitive property implies that the resulting clusters are unreliable. Thus, new

description database. Instead of interacting with a Web news service directly, an information delivery agent can exploit the knowledge in the database to present an answer in response to a user request (in terms of topic detection and tracking, keyword-based retrieval, document cluster visualization, etc.). Key contributions of the dynamic topic mining framework are the development of a novel hierarchical incremental document clustering algorithm and a topic ontology learning framework.

Despite the huge body of research efforts on document clustering, previously proposed document clustering algorithms are limited in that they cannot address the special requirements of a news environment. That is, an algorithm must address the seven application-dependent constraints discussed before. Toward this end, the dynamic topic mining framework presents a sophisticated incremental hierarchical document clustering algorithm that utilizes a neighborhood search. The algorithm was tested to demonstrate its effectiveness in terms of the seven constraints. The novelty of the algorithm is its ability to identify meaningful patterns (e.g., news events and news topics) while reducing the amount of computation by maintaining the cluster structure incrementally.

In addition, to overcome the lack of topical relations in conceptual ontologies, a topic ontology learning framework is presented. The proposed topic ontologies provide interpretations of news topics at different levels of abstraction. For example, for a Winona Ryder court trial news topic (T), dynamic topic mining could capture "winona, ryder, actress, shoplift, beverly" as specific terms describing T (i.e., the specific concept for T) and "attorney, court, defense, evidence, jury, kill, law, legal, murder, prosecutor, testify, trial" as general terms representing T (i.e., the general concept for T).

There exists research work on extracting hierarchical relations between terms from a set of documents (Tseng,
2002). However, the dynamic topic mining framework is unique in that the topical relations are dynamically generated based on incremental hierarchical clustering rather than on human-defined topics, such as the Yahoo directory.

FUTURE TRENDS

In order to achieve rich semantic information retrieval within Web news services, an ontology-based approach would be provided. To overcome the problem of concept-based ontologies (i.e., topically related concepts and terms are not explicitly linked), topic ontologies are presented to characterize news topics at multiple levels of abstraction. In sum, by coupling topic ontologies with concept-based ontologies, both topical search and semantic information retrieval can be supported.

CONCLUSION

Incremental text mining from news streams is an emerging technology, as many news organizations are providing newswire services through the Internet. In order to accommodate dynamically changing topics, efficient incremental document clustering algorithms need to be developed. The algorithms must address the special requirements of news clustering, such as a high rate of document update and the ability to identify event-level as well as topic-level clusters.

REFERENCES

Liu, X., Gong, Y., Xu, W., & Zhu, S. (2002, August). Document clustering with cluster refinement and model selection capabilities. In ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'02) (pp. 191-198). Tampere, Finland.

Maedche, A., & Staab, S. (2001). Ontology learning for the semantic Web. IEEE Intelligent Systems, 16(2), 72-79.

Makkonen, J., Ahonen-Myka, H., & Salmenkivi, M. (2004). Simple semantics in topic detection and tracking. Information Retrieval, 7(3-4), 347-368.

Noy, N.F., Sintek, M., Decker, S., Crubezy, M., Fergerson, R.W., & Musen, M.A. (2001). Creating semantic Web contents with Protégé-2000. IEEE Intelligent Systems, 16(2), 60-71.

Peng, F., & Schuurmans, D. (2003, April). Combining naive Bayes and n-gram language models for text classification. European Conference on IR Research (ECIR'03) (pp. 335-350). Pisa, Italy.

Tseng, Y. (2002). Automatic thesaurus generation for Chinese documents. Journal of the American Society for Information Science and Technology, 53(13), 1130-1138.

Zhao, Y., & Karypis, G. (2002, November). Evaluations of hierarchical clustering algorithms for document datasets. In ACM International Conference on Information and Knowledge Management (CIKM'02) (pp. 515-524). McLean, VA.

KEY TERMS

Clustering: An unsupervised process of dividing data into meaningful groups such that each identified cluster can explain the characteristics of the underlying data distribution. Examples include characterization of different customer groups based on their purchasing patterns, categorization of documents on the World Wide Web, and grouping of spatial locations of the earth where neighboring points in each region have similar short-term/long-term climate patterns.

Dynamic Topic Mining: A framework that supports the identification of meaningful patterns (e.g., events, topics, and topical relations) from news stream data.

First Story Detection: A TDT component that identifies whether a new document belongs to an existing topic or a new topic.

Ontology: A collection of concepts and inter-relationships.

Text Mining: A process of identifying patterns or trends in natural language text, including document clustering, document classification, ontology learning, and so on.

Topic Detection and Tracking (TDT): A DARPA-sponsored initiative to investigate the state of the art for news understanding systems. Specifically, TDT is composed of the following three major components: (1) segmenting a news stream (e.g., including transcribed speech) into topically cohesive stories; (2) identifying novel stories that are the first to discuss a new event; and (3) tracking known events given sample stories.

Topic Ontology: A collection of terms that characterize a topic at multiple levels of abstraction.

Topic Tracking: A TDT component that tracks events of interest based on sample news stories. It associates incoming news stories with related stories that were already discussed, or it monitors the news stream for further stories on the same topic.
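The order-sensitive property criticized in this article is easy to demonstrate. The sketch below is an illustration only (not the algorithm of Chung & McLeod, 2003): it greedily assigns each incoming item to the first cluster whose centroid lies within a fixed threshold, using made-up one-dimensional "documents" and a made-up threshold.

```python
def single_pass_cluster(stream, threshold):
    """Greedy single-pass clustering over a stream of 1-D points.

    Each incoming point joins the first existing cluster whose centroid
    lies within `threshold`; otherwise it seeds a new cluster.
    """
    clusters = []
    for x in stream:
        for cluster in clusters:
            centroid = sum(cluster) / len(cluster)
            if abs(x - centroid) <= threshold:
                cluster.append(x)
                break
        else:
            clusters.append([x])
    return clusters

# The same three "documents" in two arrival orders produce different
# partitions: the order-sensitive property discussed above.
a = single_pass_cluster([0.0, 1.5, 3.0], threshold=1.6)
b = single_pass_cluster([3.0, 1.5, 0.0], threshold=1.6)
```

Here `a` groups 0.0 with 1.5 and isolates 3.0, whereas `b` groups 3.0 with 1.5 and isolates 0.0, so the clustering depends on arrival order even though the data are identical.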
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Inexact Field Learning Approach for Data Mining
$$
c_k(x_j)=\begin{cases}
0, & x_j\in\bigcup_{i\neq k}^{s}h_j^{(i)}-h_j^{(k)}\\
1, & x_j\in h_j^{(k)}-\bigcup_{i\neq k}^{s}h_j^{(i)}\\
\dfrac{x_j-b}{a-b}, & x_j\in h_j^{(k)}\cap\left(\bigcup_{i\neq k}^{s}h_j^{(i)}\right)
\end{cases}
\qquad(k=1,2,\ldots,s)\qquad(6)
$$

Step 3: Work Out Contribution Fields by applying the constructed contribution functions to the training data set. Calculate the contribution of each instance:

$$
\Phi(I_i)=\left(\sum_{j=1}^{n}c_k(x_{ij})\right)\Big/\,n\qquad(i=1,2,\ldots,m)\qquad(7)
$$

Work out the contribution field for each class:

$$
h^{+}=\langle h_{l}^{+},\,h_{u}^{+}\rangle
$$

This algorithm was tested on three large real observational weather data sets containing both high-quality and low-quality data. The accuracy rates of the forecasts were 86.4%, 78%, and 76.8%. These are significantly better than the accuracy rates achieved by C4.5 (Quinlan, 1986, 1993), feed-forward neural networks, discrimination analysis, k-nearest neighbor classifiers, and human weather forecasters. The FISH-NET algorithm exhibited significantly less overfitting than the other algorithms, and its training times were shorter, in some cases by orders of magnitude (Dai & Ciesielski, 1994a, 2004; Dai, 1996).

FUTURE TRENDS

The inexact field-learning approach has led to a successful algorithm in a domain where there is a high level of noise. We believe that other algorithms based on fields can also be developed. The b-rules produced by the current FISH-NET algorithm involve linear combinations of attributes; non-linear rules may be even more accurate.
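As a rough illustration of how equations (6) and (7) combine, the sketch below computes per-attribute contributions with a linear ramp across the overlap between the class field and the union of the other classes' fields, then averages them over the attributes of an instance. All interval values are hypothetical, and the sketch assumes for simplicity that the other classes' field lies below the class field on each attribute; the original algorithm's overlap handling may differ.

```python
def contribution(x, own, others):
    """Per-attribute contribution c_k(x) in the spirit of equation (6).

    `own` is the field (lo, hi) of class k for this attribute; `others`
    is the union field of the remaining classes, assumed to lie below
    `own`, so the two intervals overlap in (a, b).
    """
    a, b = own[0], others[1]      # endpoints of the overlap region
    if x < others[0] or x > own[1]:
        return 0.0                # outside every field
    if x <= a:
        return 0.0                # only in the other classes' field
    if x >= b:
        return 1.0                # only in class k's field
    return (x - a) / (b - a)      # linear ramp across the overlap

def instance_contribution(values, own_fields, other_fields):
    """Equation (7): average the attribute contributions of one instance."""
    cs = [contribution(x, o, t)
          for x, o, t in zip(values, own_fields, other_fields)]
    return sum(cs) / len(cs)

# Hypothetical fields for one class over two attributes.
own_fields = [(10.0, 30.0), (0.5, 1.0)]
other_fields = [(0.0, 20.0), (0.0, 0.7)]
phi = instance_contribution([15.0, 0.6], own_fields, other_fields)
```

Both attribute values here fall in the middle of their overlap regions, so each contributes 0.5 and the instance contribution is 0.5.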
While extensive tests have been done on the FISH-NET algorithm with large meteorological databases, nothing in the algorithm is specific to meteorology. It is expected that the algorithm will perform equally well in other domains. In parallel to most existing exact machine-learning methods, the inexact field-learning approaches can be used for large or very large noisy data mining, particularly where data quality is a major problem that may not be dealt with by other data-mining approaches. Various learning algorithms can be created based on the fields derived from a given training data set. There are several new applications of inexact field learning, such as Zhuang and Dai (2004) for Web document clustering, and some other inexact learning approaches (Ishibuchi et al., 2001; Kim et al., 2003). The major trends of this approach are the following:

1. Heavy application to all sorts of data-mining tasks in various domains.
2. Development of new powerful discovery algorithms in conjunction with IFL and traditional learning approaches.
3. Extension of the current IFL approach to deal with high-dimensional, non-linear, and continuous problems.

CONCLUSION

The inexact field-learning algorithm FISH-NET was developed for the purpose of learning rough classification/forecasting rules from large, low-quality numeric databases. It runs highly efficiently and generates robust rules that neither overfit the training data nor result in low prediction accuracy.

The inexact field-learning algorithm FISH-NET is based on fields of the attributes rather than on individual point values. The experimental results indicate that:

1. The FISH-NET algorithm is linear both in the number of instances and in the number of attributes. Further, its CPU time grows much more slowly than that of the other algorithms we investigated.
2. The FISH-NET algorithm achieved the best prediction accuracy on new unseen cases out of all the methods tested (i.e., C4.5, feed-forward neural network algorithms, a k-nearest neighbor method, the discrimination analysis algorithm, and human experts).
3. The FISH-NET algorithm successfully overcame the LPA problem on the two large low-quality data sets examined. Both the absolute LPA error rate and the relative LPA error rate (Dai & Ciesielski, 1994b) of FISH-NET were very low on these data sets. They were significantly lower than those of point-learning approaches, such as C4.5, on all the data sets, and lower than that of the feed-forward neural network. A reasonably low LPA error rate was achieved by the feed-forward neural network, but with the high time cost of error back-propagation. The LPA error rate of the KNN method is comparable to FISH-NET's; this was achieved after a very high-cost genetic algorithm search.
4. The FISH-NET algorithm obviously was not affected by low-quality data. It performed equally well on low-quality data and high-quality data.

REFERENCES

Ciesielski, V., & Dai, H. (1994a). FISHERMAN: A comprehensive discovery, learning and forecasting systems. Proceedings of the 2nd Singapore International Conference on Intelligent Systems, Singapore.

Dai, H. (1994c). Learning of forecasting rules from large noisy meteorological data [doctoral thesis]. RMIT, Melbourne, Victoria, Australia.

Dai, H. (1996a). Field learning. Proceedings of the 19th Australian Computer Science Conference.

Dai, H. (1996b). Machine learning of weather forecasting rules from large meteorological data bases. Advances in Atmospheric Science, 13(4), 471-488.

Dai, H. (1997). A survey of machine learning [technical report]. Monash University, Melbourne, Victoria, Australia.

Dai, H., & Ciesielski, V. (1994a). Learning of inexact rules by the FISH-NET algorithm from low quality data. Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, Brisbane, Australia.

Dai, H., & Ciesielski, V. (1994b). The low prediction accuracy problem in learning. Proceedings of the Second Australian and New Zealand Conference on Intelligent Systems, Armidale, NSW, Australia.

Dai, H., & Ciesielski, V. (1995). Inexact field learning using the FISH-NET algorithm [technical report]. Monash University, Melbourne, Victoria, Australia.

Dai, H., & Ciesielski, V. (2004). Learning of fuzzy classification rules by inexact field learning approach [technical report]. Deakin University, Melbourne, Australia.

Dai, H., & Li, G. (2001). Inexact field learning: An approach to induce high quality rules from low quality data. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM-01), San Jose, California.
Il-Yeol Song
Drexel University, USA
Xiaohua Hu
Drexel University, USA
Hyoil Han
Drexel University, USA
bioinformatics. With the size of a digital library commonly exceeding millions of documents, rapidly increasing, and covering a wide range of topics, efficient

[Figure: biomedical text mining systems. Pipeline: extract entities & relationships; build a knowledge base (KB); evaluate the system.]
Information Extraction in Biomedical Literature
4) From what data sources are the target objects extracted?

Target Objects

In terms of what is to be extracted by the systems, most studies can be broken into the following two major areas: (1) named entity extraction, such as proteins or genes; and (2) relation extraction, such as relationships between proteins. Most of these studies adopt information extraction techniques using a curated lexicon or natural language processing for identifying relevant tokens such as words or phrases in text (Shatkay & Feldman, 2003).

In the area of named entity extraction, Proux et al. (2000) use single-word names only, with a test set of 1,200 sentences selected from FlyBase. Collier et al. (2000) adopt Hidden Markov Models (HMMs) for 10 test classes with small training and test sets. Krauthammer et al. (2000) use the BLAST database with letters encoded as 4-tuples of DNA. Demetriou and Gaizauskas (2002) pipeline the mining processes, including hand-crafted components and machine learning components; for the study, they use large lexicon and morphology components. Narayanaswamy et al. (2003) use a part-of-speech (POS) tagger for tagging the parsed MEDLINE abstracts. Although Narayanaswamy and his colleagues (2003) implement an automatic protein name detection system, the number of words used is 302, and, thus, it is difficult to assess the quality of their system, since the size of the test data is too small. Yamamoto et al. (2003) use morphological analysis techniques for preprocessing protein name tagging and apply a support vector machine (SVM) for extracting protein names. They found that increasing the training data from 390 abstracts to 1,600 abstracts improved F-value performance from 70% to 75%. Lee et al. (2003) combined an SVM and dictionary lookup for named entity recognition. Their approach is based on two phases: the first phase is identification of each entity with an SVM classifier, and the second phase is post-processing to correct the errors made by the SVM with a simple dictionary lookup. Bunescu et al. (2004) studied protein name identification and protein-protein interaction. Among the several approaches used in their study, the two main ones are POS tagging and generalized dictionary-based tagging; their dictionary-based tagging yields a higher F-value. Table 1 summarizes the works in the area of named entity extraction in biomedical literature.

The second target object type of biomedical literature extraction is relation extraction. Leek (1997) applies HMM techniques to identify gene names and chromosomes through heuristics. Blaschke et al. (1999) extract protein-protein interactions based on co-occurrence of the form p1 I1 p2 within a sentence, where p1 and p2 are proteins and I1 is an interaction term. Protein names and interaction terms (e.g., activate, bind, inhibit) are provided as a dictionary. Proux (2000) extracts an interact relation for the gene entity from the FlyBase database. Pustejovsky (2002) extracts an inhibit relation for the gene entity from MEDLINE. Jenssen et al. (2001) extract gene-gene relations based on co-occurrence of the form g1 g2 within MEDLINE abstracts, where g1 and g2 are gene names. Gene names are provided as a dictionary, harvested from HUGO, LocusLink, and other sources. Although their study uses 13,712 named human genes and millions of MEDLINE abstracts, no extensive quantitative results are reported and analyzed. Friedman et al. (2001) extract a pathway relation for various biological entities from a variety of articles. In their work, the precision of the experiments is high (from 79% to 96%); however, the recalls are relatively low (from 21% to 72%). Bunescu et al. (2004) conducted protein-protein interaction identification with several learning methods, such as pattern matching rule induction (RAPIER), boosted wrapper induction (BWI), and extraction using longest common subsequences (ELCS). ELCS automatically learns rules for extracting protein interactions using a bottom-up
approach. They conducted experiments in two ways: one with manually crafted protein names and the other with protein names extracted by their name identification method. In both experiments, Bunescu et al. compared their results with human-written rules and showed that machine learning methods provide higher precision than human-written rules. Table 2 summarizes the works in the area of relation extraction in biomedical literature.

Techniques Used

The most commonly used extraction technique is co-occurrence based. The basic idea of this technique is that entities are extracted based on the frequency of co-occurrence of biomedical named entities, such as proteins or genes, within sentences. This technique was introduced by Blaschke et al. (1999). Their goal was to extract information from scientific text about protein interactions among a predetermined set of related proteins. Since Blaschke and his colleagues' study, numerous other co-occurrence-based systems have been proposed in the literature. All are associated with information extraction of biomedical entities from unstructured text corpora. The common denominator of the co-occurrence-based systems is that they are based on co-occurrences of names or identifiers of entities, typically along with activation/dependency terms. These systems are differentiated from one another by integrating different machine learning techniques, such as syntactical analysis or POS tagging, as well as ontologies and controlled vocabularies (Hahn et al., 2002; Pustejovsky et al., 2002; Yakushiji et al., 2001). Although these techniques are straightforward and easy to develop, from the performance standpoint their recall and precision are much lower than those of other machine-learning techniques (Ray & Craven, 2001).

In parallel with co-occurrence-based systems, researchers began to investigate other machine learning or NLP techniques. One of the earliest studies was done by Leek (1997), who utilized Hidden Markov Models (HMMs) to extract sentences discussing gene locations on chromosomes. HMMs are applied to represent sentence structures for natural language processing, where the states of an HMM correspond to candidate POS tags and the probabilistic transitions among states represent possible parses of the sentence, according to the matches of the terms occurring in it to the POSs. In the context of biomedical literature mining, HMMs are also used to model families of biological sequences, as a set of different utterances of the same word generated by an HMM technique (Baldi et al., 1994).

Ray and Craven (2001) have proposed a more sophisticated HMM-based technique to distinguish fact-bearing sentences from uninteresting sentences. The target biological entities and relations that they intend to extract are protein subcellular localizations and gene-disorder associations. With a predefined lexicon of locations and proteins and several hundred training sentences derived from the Yeast database, they trained and tested the classifiers over a manually labeled corpus of about 3,000 MEDLINE abstracts. There have been several studies applying natural language tagging and parsing techniques to biomedical literature mining. Friedman et al. (2001) propose methods that parse sentences and use thesauri to extract facts about genes and proteins from biomedical documents. They extract interactions among genes and proteins as part of regulatory pathways.

Evaluation

One of the pivotal issues yet to be explored further in biomedical literature mining is how to evaluate the techniques or systems. The focus of the evaluation conducted in the literature is on extraction accuracy. The accuracy measures used in IE are the precision and recall ratios. For a set of N items, where the items are terms, sentences, or documents, the system needs to label each item as positive or negative according to some criterion (positive, if an item belongs to a predefined document category or term class). As discussed earlier, the extraction accuracy is measured by the precision and recall ratios. Although these evaluation techniques are straightforward and well accepted, recall ratios often are criticized in the field of information retrieval when the total number of true positive terms is not clearly defined.
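The co-occurrence idea described above can be sketched in a few lines. The dictionaries and the sentence below are invented for illustration; real systems in the style of Blaschke et al. (1999) use much larger curated dictionaries and additional filtering.

```python
import re

# Hypothetical dictionaries of protein names and interaction terms.
PROTEINS = {"p53", "mdm2", "bax", "bcl-2"}
INTERACTIONS = {"activates", "binds", "inhibits"}

def extract_interactions(sentence):
    """Toy co-occurrence extractor: report (p1, verb, p2) whenever two
    dictionary proteins co-occur in a sentence with an interaction term
    between them."""
    tokens = re.findall(r"[\w-]+", sentence.lower())
    triples = []
    for i, t1 in enumerate(tokens):
        if t1 not in PROTEINS:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j] in PROTEINS:
                verbs = [t for t in tokens[i + 1:j] if t in INTERACTIONS]
                if verbs:
                    triples.append((t1, verbs[0], tokens[j]))
    return triples

found = extract_interactions("MDM2 binds p53 and inhibits its activity")
```

On the sample sentence this yields the single triple ("mdm2", "binds", "p53"); as the article notes, such purely lexical matching is simple but tends to trade precision and recall away compared with syntactic or learned extractors.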
In IE, an evaluation forum similar to TREC in information retrieval (IR) is the Message Understanding Conference (MUC). Participants in MUC tested the ability of their systems to identify entities in text, resolve co-reference, extract and populate attributes of entities, and perform various other extraction tasks from written text. As identified by Shatkay and Feldman (2003), the important challenge in biomedical literature mining is the creation of gold standards and critical evaluation methods for systems developed in this very active field. A framework for evaluating biomedical literature mining systems was recently proposed by Hirschman et al. (2002). According to Hirschman et al. (2002), the following elements are needed for a successful evaluation: (1) a challenging problem; (2) a task definition; (3) training data; (4) test data; (5) an evaluation methodology and implementation; (6) an evaluator; (7) participants; and (8) funding. In addition to these elements, the existing biomedical literature mining systems encounter issues of portability and scalability, and these issues need to be taken into consideration in the framework for evaluation.

Data Sources

In terms of the data sources from which target biomedical objects are extracted, most biomedical data mining systems focus on mining MEDLINE abstracts of the National Library of Medicine. The principal reason for relying on MEDLINE is related to complexity: abstracts are often easier to mine, since many papers contain less precise and less well-supported sections in the text that are difficult for machines to distinguish from more informative sections (Andrade & Bork, 2000). The current version of MEDLINE contains nearly 12 million abstracts stored on approximately 43GB of disk space. A prominent example of methods that target entire papers is still restricted to a small number of journals (Friedman et al., 2000; Krauthammer et al., 2002). The task of unraveling information about function from MEDLINE abstracts can be approached from two different viewpoints. One approach is based on computational techniques for understanding texts written in natural language, with lexical, syntactical, and semantic analysis. In addition to indexing terms in documents, natural language processing (NLP) methods extract and index higher-level semantic structures composed of terms and relationships between terms. However, this approach is confronted with the variability, fuzziness, and complexity of human language (Andrade & Bork, 2000). The GENIES system (Friedman et al., 2000; Krauthammer et al., 2002), for automatically gathering and processing knowledge about molecular pathways, and the Information Finding from Biological Papers (IFBP) transcription factor database are natural language processing based systems.

An alternative approach that may be more relevant in practice is based on the treatment of text with statistical methods. In this approach, the possible relevance of words in a text is deduced from a comparison of the frequency of different words in this text with the frequency of the same words in reference sets of text. Some of the major methods using the statistical approach are AbXtract and the automatic pathway discovery tool of Ng and Wong (1999). There are advantages to each of these approaches (i.e., grammar based or pattern matching). Generally, the less syntax is used, the more domain-specific the system is. This allows the construction of a robust system relatively quickly, but many subtleties may be lost in the interpretation of sentences. Recently, the GENIA corpus has been used for extracting biomedical named entities (Collier et al., 2000; Yamamoto et al., 2003). The reason for the recent surge in the use of the GENIA corpus is that GENIA provides an annotated corpus that can be used for all areas of NLP and IE applied to the biomedical domain that employ supervised learning. With the explosion of results in molecular biology, there is an increased need for IE to extract knowledge to build databases and to search intelligently for information in online journal collections.

FUTURE TRENDS

With the taxonomy proposed here, we now identify the research trends in applying IE to mine biomedical literature:

1. A variety of biomedical objects and relations are to be extracted.
2. Rigorous studies are conducted to apply advanced IE techniques, such as Conditional Random Fields and maximum-entropy-based HMMs, to biomedical data.
3. Collaborative efforts to standardize the evaluation methods and the procedures for biomedical literature mining.
4. Continued broadening of the coverage of curated databases and extension of the size of the biomedical databases.

CONCLUSION

The sheer size of the biomedical literature triggers an intensive pursuit for effective information extraction tools. To cope with such demand, biomedical literature mining emerges as an interdisciplinary field in which information extraction and machine learning are applied to the biomedical text corpus.
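The frequency-comparison idea behind the statistical approach can be illustrated with a small sketch (this is not AbXtract or Ng and Wong's tool). The corpora and the ratio threshold are invented for the example, and add-one smoothing stands in for a proper statistical test.

```python
from collections import Counter

def salient_terms(text, reference, min_ratio=3.0):
    """Rank words whose relative frequency in `text` exceeds their
    relative frequency in a reference corpus by at least `min_ratio`,
    a crude version of the frequency-comparison idea described above."""
    tf = Counter(text.lower().split())
    rf = Counter(reference.lower().split())
    n_t, n_r = sum(tf.values()), sum(rf.values())
    scores = {}
    for word, count in tf.items():
        p_text = count / n_t
        p_ref = (rf[word] + 1) / (n_r + len(rf))  # add-one smoothing
        ratio = p_text / p_ref
        if ratio >= min_ratio:
            scores[word] = ratio
    return sorted(scores, key=scores.get, reverse=True)

abstract = "kinase kinase phosphorylates the receptor kinase pathway"
reference = "the cat sat on the mat and the dog sat too"
terms = salient_terms(abstract, reference, min_ratio=2.5)
```

Domain terms such as "kinase" and "receptor" score far above the reference frequencies, while common words such as "the" are suppressed, which is exactly the behavior the statistical approach relies on.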
In this article, we approached biomedical literature mining from an IE perspective. We attempted to synthesize the research efforts made in this emerging field. In doing so, we showed how current information extraction can be used successfully to extract and organize information from the literature. We surveyed the prominent methods used for information extraction and demonstrated their applications in the context of biomedical literature mining.

The following four aspects were used in classifying the current works done in the field: (1) what to extract; (2) what techniques are used; (3) how to evaluate; and (4) what data sources are used. The taxonomy proposed in this article should help identify the recent trends and issues pertinent to biomedical literature mining.

REFERENCES

Andrade, M.A., & Bork, P. (2000). Automated extraction of information in molecular biology. FEBS Letters, 476, 12-17.

Blaschke, C., Andrade, M.A., Ouzounis, C., & Valencia, A. (1999). Automatic extraction of biological information from scientific text: Protein-protein interactions. Proceedings of the First International Conference on Intelligent Systems for Molecular Biology.

Bunescu, R., et al. (2004). Comparative experiments on learning information extractors for proteins and their interactions [to be published]. Artificial Intelligence in Medicine, special issue on Summarization and Information Extraction from Medical Documents.

Collier, N., Nobata, C., & Tsujii, J. (2000). Extracting the names of genes and gene products with a hidden Markov model. Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000).

De Bruijn, B., & Martin, J. (2002). Getting to the (c)ore of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67, 7-18.

Demetriou, G., & Gaizauskas, R. (2002). Utilizing text mining results: The PASTA Web system. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain.

Friedman, C., Kra, P., Yu, H., Krauthammer, M., & Rzhetsky, A. (2001). GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, S74-S82.

Hahn, U., Romacker, M., & Schulz, S. (2002). Creating knowledge repositories from biomedical reports: The MEDSYNDIKATE text mining system. Proceedings of the Pacific Symposium on Biocomputing.

Hirschman, L., Park, J.C., Tsujii, J., Wong, L., & Wu, C.H. (2002). Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18(12), 1553-1561.

Jenssen, T.K., Laegreid, A., Komorowski, J., & Hovig, E. (2001). A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28(1), 21-28.

Krauthammer, M., Rzhetsky, A., Morozov, P., & Friedman, C. (2000). Using BLAST for identifying gene and protein names in journal articles. Gene, 259(1-2), 245-252.

Lee, K., Hwang, Y., & Rim, H. (2003). Two-phase biomedical NE recognition based on SVMs. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine.

Leek, T.R. (1997). Information extraction using hidden Markov models [master's thesis]. San Diego, CA: Department of Computer Science, University of California.

Narayanaswamy, M., Ravikumar, K.E., & Vijay-Shanker, K. (2003). A biological named entity recognizer. Proceedings of the Pacific Symposium on Biocomputing.

Ng, S.K., & Wong, M. (1999). Toward routine automatic pathway discovery from on-line scientific text abstracts. Proceedings of the Genome Informatics Series: Workshop on Genome Informatics.

Proux, D., Rechenmann, F., & Julliard, L. (2000). A pragmatic information extraction strategy for gathering data on genetic interactions. Proceedings of the International Conference on Intelligent Systems for Molecular Biology.

Pustejovsky, J., Castano, J., Zhang, J., Kotecki, M., & Cochran, B. (2002). Robust relational parsing over biomedical literature: Extracting inhibit relations. Pacific Symposium on Biocomputing (pp. 362-373).

Ray, S., & Craven, M. (2001). Representing sentence structure in hidden Markov models for information extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, Washington.

Shatkay, H., & Feldman, R. (2003). Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 10(6), 821-855.

Yakushiji, A., Tateisi, Y., Miyao, Y., & Tsujii, J. (2001). Event extraction from biomedical papers using a full parser. Proceedings of the Pacific Symposium on Biocomputing.
Yamamoto, K., Kudo, T., Konagaya, A., & Matsumoto, Y. lems inherent in the processing and manipulation of
(2003). Protein name tagging for biomedical annotation in natural language.
text. Proceedings of the ACL 2003 Workshop on Natural
Language Processing in Biomedicine. Part of Speech (POS): A classification of words ac-
cording to how they are used in a sentence and the types
of ideas they convey. Traditionally, the parts of speech
are the noun, pronoun, verb, adjective, adverb, preposi-
KEY TERMS tion, conjunction, and interjection.
Precision: The ratio of the number of correctly filled
F-Value: Combines recall and precision in a single slots to the total number of slots the system filled.
efficiency measure (it is the harmonic mean of preci-
sion and recall): F = 2 * (recall * precision) / (recall + Recall: Denotes the ratio of the number of slots the
precision). system found correctly to the number of slots in the
answer key.
Hidden Markov Model (HMM): A statistical model
where the system being modeled is assumed to be a Support Vector Machine (SVM): A learning ma-
Markov process with unknown parameters, and the chal- chine that can perform binary classification (pattern
lenge is to determine the hidden parameters from the recognition) as well as multi-category classification
observable parameters, based on this assumption. and real valued function approximation (regression es-
timation) tasks.
Natural Language Processing (NLP): A subfield of
artificial intelligence and linguistics. It studies the prob-
620
TEAM LinG
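The slot-counting definitions of precision, recall, and F-value above translate directly into code; a minimal sketch (the counts are invented for illustration):

```python
def precision_recall_f(correct: int, filled: int, key: int) -> tuple:
    """Slot-filling evaluation: `correct` correctly filled slots,
    `filled` slots the system filled, `key` slots in the answer key."""
    p = correct / filled          # precision
    r = correct / key             # recall
    f = 2 * r * p / (r + p)       # harmonic mean of precision and recall
    return p, r, f

# Hypothetical counts: 80 correct out of 100 filled, 160 slots in the key.
p, r, f = precision_recall_f(80, 100, 160)
print(round(p, 2), round(r, 2), round(f, 3))  # 0.8 0.5 0.615
```

Note that the F-value penalizes imbalance: a system that fills few slots very accurately (high precision, low recall) still scores poorly.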
Instance Selection

Huan Liu
Arizona State University, USA

Lei Yu
Arizona State University, USA

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
MAJOR LINES OF RESEARCH AND DEVELOPMENT

A spontaneous response to the challenge of instance selection is, without fail, some form of sampling. Although sampling is an important part of instance selection, other approaches do not rely on sampling but resort to search or take advantage of data-mining algorithms. In this section, we start with sampling methods and proceed to other instance-selection methods associated with data-mining tasks, such as classification and clustering.

Sampling Methods

Sampling methods are useful tools for instance selection (Gu, Hu, & Liu, 2001).

Simple random sampling is a method of selecting n instances out of the N such that every one of the (N choose n) distinct samples has an equal chance of being drawn. If an instance that has been drawn is removed from the data set for all subsequent draws, the method is called random sampling without replacement. Random sampling with replacement is entirely feasible: at any draw, all N instances of the data set have an equal chance of being drawn, no matter how often they have already been drawn.

Stratified random sampling divides the data set of N instances into subsets of N1, N2, ..., Nl instances, respectively. These subsets are nonoverlapping, and together they comprise the whole data set (i.e., N1 + N2 + ... + Nl = N). The subsets are called strata. When the strata have been determined, a sample is drawn from each stratum, the drawings being made independently in different strata. If a simple random sample is taken in each stratum, the whole procedure is described as stratified random sampling. It is often used in applications when one wishes to divide a heterogeneous data set into subsets, each of which is internally homogeneous.

Adaptive sampling refers to a sampling procedure that selects instances depending on results obtained from the sample. The primary purpose of adaptive sampling is to take advantage of data characteristics in order to obtain more precise estimates. It takes advantage of the result of preliminary mining for more effective sampling, and vice versa.

Selective sampling is another way of exploiting data characteristics to obtain more precise estimates in sampling. All instances are first divided into partitions according to some homogeneity criterion, and then random sampling is performed to select instances from each partition. Because instances in each partition are more similar to each other than instances in other partitions, the resulting sample is more representative than a randomly generated one. Recent methods can be found in Liu, Motoda, and Yu (2002), in which samples selected from partitions based on data variance result in better performance than samples selected with random sampling.

Methods for Labeled Data

One key data-mining application is classification: predicting the class of an unseen instance. The data for this type of application is usually labeled with class values. Instance selection in the context of classification has been attempted by researchers according to the classifiers being built. In this section, we include five types of selected instances.

Critical points are the points that matter the most to a classifier. The issue originated from the learning method of Nearest Neighbor (NN) (Cover & Thomas, 1991). NN usually does not learn during the training phase. Only when it is required to classify a new sample does NN search the data to find the nearest neighbor for the new sample, using the class label of the nearest neighbor to predict the class label of the new sample. During this phase, NN can be very slow if the data are large and can be extremely sensitive to noise. Therefore, many suggestions have been made to keep only the critical points, so that noisy ones are removed and the data set is reduced. Examples can be found in Yu, Xu, Ester, and Kriegel (2001) and Zeng, Xing, and Zhou (2003), in which critical data points are selected to improve the performance of collaborative filtering.

Boundary points are the instances that lie on borders between classes. Support vector machines (SVM) provide a principled way of finding these points through minimizing structural risk (Burges, 1998). Using a nonlinear function to map data points to a high-dimensional feature space, a nonlinearly separable data set becomes linearly separable. Data points on the boundaries, which maximize the margin band, are the support vectors. Support vectors are instances in the original data sets and contain all the information a given classifier needs for constructing the decision function. Boundary points and critical points are different in the ways they are found.

Prototypes are representatives of groups of instances via averaging (Chang, 1974). A prototype that represents the typicality of a class is used in characterizing a class rather than describing the differences between classes. Therefore, they are different from critical points or boundary points.

Tree-based sampling is a method involving decision trees (Quinlan, 1993), which are commonly used classification tools in data mining and machine learning. Instance selection can be done via the decision tree built. Breiman and Friedman (1984) propose delegate sampling. The basic idea is to construct a decision tree such that instances at the leaves of the tree are approximately uniformly distributed. Delegate sampling then samples instances from the leaves in inverse proportion to the density at the leaf and assigns weights to the sampled points that are proportional to the leaf density.

In real-world applications, although large amounts of data are potentially available, the majority of data are not labeled. Manually labeling the data is a labor-intensive and costly process. Researchers investigate whether experts can be asked to label only a small portion of the data that is most relevant to the task if labeling all data is too expensive and time-consuming, a process that is called instance labeling. Usually an expert can be engaged to label a small portion of the selected data at various stages. So we wish to select as little data as possible at each stage and use an adaptive algorithm to guess what else should be selected for labeling in the next stage. Instance labeling is closely associated with adaptive sampling, clustering, and active learning.

Methods for Unlabeled Data

When data are unlabeled, methods for labeled data cannot be directly applied to instance selection. The widespread use of computers results in huge amounts of data stored without labels, for example, Web pages, transaction data, newspaper articles, and e-mail messages (Baeza-Yates & Ribeiro-Neto, 1999). Clustering is one approach to finding regularities from unlabeled data. We discuss three types of selected instances here.

Prototypes are pseudo data points generated from the formed clusters. The idea is that after the clusters are formed, one may just keep the prototypes of the clusters and discard the rest of the data points. The k-means clustering algorithm is a good example of this sort. Given a data set and a constant k, the k-means clustering algorithm partitions the data into k subsets such that instances in each subset are similar under some measure. The k means are iteratively updated until a stopping criterion is satisfied. The prototypes in this case are the k means.

Bradley, Fayyad, and Reina (1998) extend the k-means algorithm to perform clustering in one scan of the data. By keeping some points that defy compression plus some sufficient statistics, they demonstrate a scalable k-means algorithm. From the viewpoint of instance selection, prototypes plus sufficient statistics is a method of representing a cluster by using both defiant points and pseudo points that can be reconstructed from sufficient statistics rather than keeping only the k means.

Squashed data are some pseudo data points generated from the original data. In this aspect, they are similar to prototypes, as both may or may not be in the original data set. Squashed data points are different from prototypes in that each pseudo data point has a weight, and the sum of the weights is equal to the number of instances in the original data set. Presently, two ways of obtaining squashed data are (a) model free (DuMouchel, Volinsky, Johnson, Cortes, & Pregibon, 1999) and (b) model dependent, or likelihood based (Madigan, Raghavan, DuMouchel, Nason, Posse, & Ridgeway, 2002).

FUTURE TRENDS

As shown in this article, instance selection has been studied and employed in various tasks, such as sampling, classification, and clustering. Each task is very unique, as each has different information available and different requirements. Clearly, a universal model of instance selection is out of the question. This short article provides some starting points that can hopefully lead to more concerted study and development of new methods for instance selection. Instance selection deals with scaling down data. When we better understand instance selection, we will naturally investigate whether this work can be combined with other lines of research, such as algorithm scaling-up, feature selection, and construction, to overcome the problem of huge amounts of data. Integrating these different techniques to achieve the common goal of effective and efficient data mining is a big challenge.

CONCLUSION

With the constraints imposed by computer memory and mining algorithms, we experience selection pressures more than ever. The central point of instance selection is approximation. Our task is to achieve mining results that are as good as possible by approximating the whole data with the selected instances and, hopefully, to do better in data mining with instance selection, as it is possible to remove noisy and irrelevant data in the process. In this short article, we have presented an initial attempt to review and categorize the methods of instance selection in terms of sampling, classification, and clustering.
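The sampling schemes surveyed above can be stated concretely; a minimal sketch of simple random sampling (with and without replacement) and stratified random sampling using Python's standard library (the toy data set and the two strata are invented for the example):

```python
import random

data = list(range(100))  # toy data set of N = 100 instances

# Simple random sampling: n out of N, each distinct sample equally likely.
without_replacement = random.sample(data, 10)   # an instance is drawn at most once
with_replacement = random.choices(data, k=10)   # every draw sees all N instances

# Stratified random sampling: partition into nonoverlapping strata,
# then draw a simple random sample independently within each stratum.
strata = {"low": data[:50], "high": data[50:]}
stratified = [x for stratum in strata.values()
              for x in random.sample(stratum, 5)]

print(len(without_replacement), len(with_replacement), len(stratified))
```

In the stratified draw, each stratum is guaranteed representation, which is the point of the method when strata are internally homogeneous.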
REFERENCES

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). From data mining to knowledge discovery. Advances in Knowledge Discovery and Data Mining.

Gu, B., Hu, F., & Liu, H. (2001). Sampling: Knowing whole from its part. In H. Liu & H. Motoda (Eds.), Instance selection and construction for data mining. Boston: Kluwer Academic.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Boston: Kluwer Academic.

Liu, H., & Motoda, H. (Eds.). (2001). Instance selection and construction for data mining. Boston: Kluwer Academic.

Liu, H., Motoda, H., & Yu, L. (2002). Feature selection with selective sampling. Proceedings of the 19th International Conference on Machine Learning (pp. 395-402).

KEY TERMS

Clustering: A process of grouping instances into clusters so that instances are similar to one another within a cluster but dissimilar to instances in other clusters.

Data Mining: The application of analytical methods and tools to data for the purpose of discovering patterns, statistical or predictive models, and relationships among massive data.

Data Reduction: A process of removing irrelevant information from data by reducing the number of features, instances, or values of the data.

Instance: A vector of attribute values in a multidimensional space defined by the attributes; also called a record, tuple, or data point.

Instance Selection: A process of choosing a subset of data to achieve the original purpose of a data-mining application as if the whole data is used.

Sampling: A procedure that draws a sample, Si, by a random process in which each Si receives its appropriate probability, Pi, of being selected.
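The critical-point idea described under Methods for Labeled Data (keep only the instances a nearest-neighbor classifier actually needs) can be sketched as a single-pass simplification of condensed nearest-neighbor selection; the toy one-dimensional data and labels are invented, and the cited methods are considerably more elaborate:

```python
def nn_label(point, refs):
    """1-NN: label of the reference instance nearest to `point` (1-D)."""
    return min(refs, key=lambda r: abs(r[0] - point))[1]

# Toy labeled data: (value, class). Two classes separated near 5.
data = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (8.5, "b"), (9.0, "b")]

# Condensation: keep an instance only if the instances kept so far would
# misclassify it -- interior (non-critical) points are discarded.
kept = [data[0]]
for inst in data[1:]:
    if nn_label(inst[0], kept) != inst[1]:
        kept.append(inst)

print(kept)  # [(1.0, 'a'), (8.0, 'b')]
```

Two instances suffice here to reproduce the 1-NN decisions of all six, which is exactly the data reduction the article describes.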
Integration of Data Sources through Data Mining

BACKGROUND

Many fields of business and research show a tremendous need to integrate data from different sources. The process of data source integration has two major components.

Schema matching refers to the task of identifying related fields across two or more databases (Rahm & Bernstein, 2001). Complications arise at several levels, for example:

• Source databases can be organized by using several different models, such as the relational model, the object-oriented model, or semistructured models (e.g., XML).
• Information stored in a single table in one relational database can be stored in two or more tables in another. This problem is common when source databases show different levels of normalization and also occurs in nonrelational sources.
• A single field in one database, such as Name, could correspond to multiple fields, such as First Name and Last Name, in another.

Data transformation (sometimes called instance matching) is a second step in which data in matching fields must be translated into a common format. Frequent reasons for mismatched data include data format (such as 1.6.2004 vs. 6/1/2004), numeric precision (3.5kg vs. 3.51kg), abbreviations (Corp. vs. Corporation), or linguistic differences (e.g., using different synonyms for the same concept across databases). Today's databases are large both in the number of records stored and in the number of fields (dimensions).

MAIN THRUST

In this article, I explore the application of data-mining methods to the integration of data sources. Although data transformation tasks can sometimes be performed through data mining, such techniques are most useful in the context of schema matching. Therefore, the following discussion focuses on the use of data mining in schema matching, mentioning data transformation where appropriate.

Schema-Matching Approaches

Two classes of schema-matching solutions exist: schema-only-based matching and instance-based matching (Rahm & Bernstein, 2001).

Schema-only-based matching identifies related database fields by taking only the schema of input databases into account. The matching occurs through linguistic means or through constraint matching. Linguistic matching compares field names, finds similarities in field descriptions (if available), and attempts to match field names to names in a given hierarchy of terms (ontology). Constraint matching matches fields based on their domains (data types) or their key properties (primary key, foreign key). In both approaches, the data in the sources are ignored in making decisions on matching. Important projects implementing this approach include ARTEMIS (Castano, de Antonellis, & de Capitani di Vemercati, 2001) and Microsoft's CUPID (Madhavan, Bernstein, & Rahm, 2001).

Instance-based matching takes properties of the data into account as well. A very simple approach is to conclude that two fields are related if their minimum
and maximum values and/or their average values are equal or similar. More sophisticated approaches consider the distribution of values in fields. A strong indicator of a relation between fields is a complete inclusion of the data of one field in another. I take a closer look at this pattern in the following section. Important instance-based matching projects are SemInt (Li & Clifton, 2000) and LSD (Doan, Domingos, & Halevy, 2001).

Some projects explore a combined approach, in which both schema-level and instance-level matching is performed. Halevy and Madhavan (2003) present a corpus-based schema matcher. It attempts to perform schema matching by incorporating known schemas and previous matching results and to improve the matching result by taking such historical information into account.

Data-mining approaches are most useful in the context of instance-based matching. However, some mining-related techniques, such as graph matching, are employed in schema-only-based matching as well.

Instance-Based Matching through Inclusion Dependency Mining

An inclusion dependency is a pattern between two databases, stating that the values in a field (or set of fields) in one database form a subset of the values in some field (or set of fields) in another database. Such subsets are relevant to data integration for two reasons. First, fields that stand in an inclusion dependency to one another might represent related data. Second, knowledge of foreign keys is essential in successful schema matching. Because a foreign key is necessarily a subset of the corresponding key in another table, foreign keys can be discovered through inclusion dependency discovery.

The discovery of inclusion dependencies is a very complex process. In fact, the problem is in general NP-hard as a function of the number of fields in the largest inclusion dependency between two tables. However, a number of practical algorithms have been published.

De Marchi, Lopes, and Petit (2002) present an algorithm that adopts the idea of levelwise discovery used in the famous Apriori algorithm for association rule mining. Inclusion dependencies are discovered by first comparing single fields with one another and then combining matches into pairs of fields, continuing the process through triples, then 4-sets of fields, and so on. However, due to the exponential growth in the number of inclusion dependencies in larger tables, this approach does not scale beyond inclusion dependencies with a size of about eight fields.

A more recent algorithm (Koeller & Rundensteiner, 2003) takes a graph-theoretic approach. It avoids enumerating all inclusion dependencies between two tables and finds candidates for only the largest inclusion dependencies by mapping the discovery problem to a problem of discovering patterns (specifically cliques) in graphs. This approach is able to discover inclusion dependencies with several dozens of attributes in tables with tens of thousands of rows. Both algorithms rely on the antimonotonic property of the inclusion dependency discovery problem. This property is also used in association rule mining and states that patterns of size k can only exist in the solution of the problem if certain patterns of sizes smaller than k exist as well. Therefore, it is meaningful to first discover small patterns (e.g., single-attribute inclusion dependencies) and use this information to restrict the search space for larger patterns.

Instance-Based Matching in the Presence of Data Mismatches

Inclusion dependency discovery captures only part of the problem of schema matching, because only exact matches are found. If attributes across two relations are not exact subsets of each other (e.g., due to entry errors, data mismatches requiring data transformation, or partially overlapping data sets), it becomes more difficult to perform data-driven mining-based discovery. Both false negatives and false positives are possible. For example, matching fields might not be discovered due to different encoding schemes (e.g., use of a numeric identifier in one table, where text is used to denote the same values in another table). On the other hand, purely data-driven discovery relies on the assumption that semantically related values are also syntactically equal. Consequently, fields that are discovered by a mining algorithm to be matching might not be semantically related.

Data Mining by Using Database Statistics

The problem of false negatives in mining for schema matching can be addressed by more sophisticated mining approaches. If it is known which attributes across two relations relate to one another, data transformation solutions can be used. However, automatic discovery of matching attributes is also possible, usually through the evaluation of statistical patterns in the data sources. In the classification of Kang and Naughton (2003), interpreted matching uses artificial intelligence techniques, such as Bayesian classification or neural networks, to establish hypotheses about related attributes. In the uninterpreted matching approach, statistical features, such as the unique value count of an attribute or its frequency distribution, are taken into consideration. The underlying assumption is that two
business tasks such as extraction, transformation, and loading (ETL) and data integration and migration in general become more feasible when automatic methods are used.

Although the underlying algorithmic problems are difficult and often show exponential complexity, several interesting solutions to the schema-matching and data transformation problems in integration have been proposed. This is an active area of research, and more comprehensive and beneficial applications of data mining to integration are likely to emerge in the near future.

REFERENCES

Castano, S., de Antonellis, V., & de Capitani di Vemercati, S. (2001). Global viewing of heterogeneous data sources. IEEE Transactions on Knowledge and Data Engineering, 13(2), 277-297.

Commonwealth Scientific and Industrial Research Organisation. (2003, April). Record linkage: Current practice and future directions (CMIS Tech. Rep. No. 03/83). Canberra, Australia: L. Gu, R. Baxter, D. Vickers, & C. Rainsford. Retrieved July 22, 2004, from http://www.act.cmis.csiro.au/rohanb/PAPERS/record_linkage.pdf

Dasu, T., Johnson, T., Muthukrishnan, S., & Shkapenyuk, V. (2002). Mining database structure; or, how to build a data quality browser. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, USA (pp. 240-251).

de Marchi, F., Lopes, S., & Petit, J.-M. (2002). Efficient algorithms for mining inclusion dependencies. Proceedings of the Eighth International Conference on Extending Database Technology, Prague, Czech Republic, 2287 (pp. 464-476).

Doan, A. H., Domingos, P., & Halevy, A. Y. (2001). Reconciling schemas of disparate data sources: A machine-learning approach. Proceedings of the ACM SIGMOD International Conference on Management of Data, USA (pp. 509-520).

Halevy, A. Y., & Madhavan, J. (2003). Corpus-based knowledge representation. Proceedings of the 18th International Joint Conference on Artificial Intelligence, Mexico (pp. 1567-1572).

Hernández, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 2(1), 9-37.

Kang, J., & Naughton, J. F. (2003). On schema matching with opaque column names and data values. Proceedings of the ACM SIGMOD International Conference on Management of Data, USA (pp. 205-216).

Koeller, A., & Rundensteiner, E. A. (2003). Discovery of high-dimensional inclusion dependencies. Proceedings of the 19th IEEE International Conference on Data Engineering, India (pp. 683-685).

Li, W., & Clifton, C. (2000). SemInt: A tool for identifying attribute correspondences in heterogeneous databases using neural network. Journal of Data and Knowledge Engineering, 33(1), 49-84.

Lübbers, D., Grimmer, U., & Jarke, M. (2003). Systematic development of data mining-based data quality tools. Proceedings of the 29th International Conference on Very Large Databases, Germany (pp. 548-559).

Madhavan, J., Bernstein, P. A., & Rahm, E. (2001). Generic schema matching with CUPID. Proceedings of the 27th International Conference on Very Large Databases, Italy (pp. 49-58).

Massachusetts Institute of Technology, Sloan School of Management. (2002, May). Data integration using Web services (Working Paper 4406-02). Cambridge, MA: M. Hansen, S. Madnick, & M. Siegel. Retrieved July 22, 2004, from http://hdl.handle.net/1721.1/1822

Pervasive Software, Inc. (2003). ETL: The secret weapon in data warehousing and business intelligence [Whitepaper]. Austin, TX: Pervasive Software.

Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. VLDB Journal, 10(4), 334-350.

KEY TERMS

Antimonotonic: A property of some pattern-finding problems stating that patterns of size k can only exist if certain patterns with sizes smaller than k exist in the same dataset. This property is used in levelwise algorithms, such as the Apriori algorithm used for association rule mining or some algorithms for inclusion dependency mining.

Database Schema: A set of names and conditions that describe the structure of a database. For example, in a relational database, the schema includes elements such as table names, field names, field data types, primary key constraints, or foreign key constraints.
Domain: The set of permitted values for a field in a database, defined during database design. The actual data in a field are a subset of the field's domain.

Extraction, Transformation, and Loading (ETL): Describes the three essential steps in the process of data source integration: extracting data and schema from the sources, transforming it into a common format, and loading the data into an integration database.

Foreign Key: A key is a field or set of fields in a relational database table that has unique values, that is, no duplicates. A field or set of fields whose values form a subset of the values in the key of another table is called a foreign key. Foreign keys express relationships between fields of different tables.

Inclusion Dependency: A pattern between two databases, stating that the values in a field (or set of fields) in one database form a subset of the values in some field (or set of fields) in another database.

Levelwise Discovery: A class of data-mining algorithms that discovers patterns of a certain size by first discovering patterns of size 1, then using information from that step to discover patterns of size 2, and so on. A well-known example of a levelwise algorithm is the Apriori algorithm used to mine association rules.

Merge/Purge: The process of identifying duplicate records during the integration of data sources. Related data sources often contain overlapping information extents, which have to be reconciled to improve the quality of an integrated database.

Relational Database: A database that stores data in tables, which are sets of tuples (rows). A set of corresponding values across all rows of a table is called an attribute, field, or column.

Schema Matching: The process of identifying an appropriate mapping from the schema of an input data source to the schema of an integrated database.
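The Inclusion Dependency and Levelwise Discovery entries above combine naturally: test all single-attribute inclusions first, then combine only attributes that already qualify, as the antimonotonic property licenses. A minimal sketch on invented toy relations (the cited algorithms handle far larger search spaces):

```python
from itertools import permutations

# Toy relations: column name -> list of values (row-aligned).
r1 = {"id": [1, 2, 3], "city": ["NY", "LA", "NY"]}
r2 = {"key": [1, 2, 3, 4], "town": ["NY", "LA", "NY", "SF"]}

def included(cols1, cols2):
    """True if the projection of r1 on cols1 is a subset of r2 on cols2."""
    rows1 = set(zip(*(r1[c] for c in cols1)))
    rows2 = set(zip(*(r2[c] for c in cols2)))
    return rows1 <= rows2

# Level 1: single-attribute inclusion dependencies.
unary = [(a, b) for a in r1 for b in r2 if included((a,), (b,))]

# Level 2: combine only unary results; antimonotonicity prunes the rest.
binary = [((a1, a2), (b1, b2))
          for (a1, b1), (a2, b2) in permutations(unary, 2)
          if a1 != a2 and b1 != b2 and included((a1, a2), (b1, b2))]

print(unary)   # [('id', 'key'), ('city', 'town')]
print(binary)  # both attribute pairings of the two-attribute dependency
```

Without the level-1 pruning, every pair of attribute combinations would have to be tested; this is the exponential growth that limits levelwise algorithms to small dependency sizes.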
Intelligence Density

David Sundaram
The University of Auckland, New Zealand

Victor Portougal
The University of Auckland, New Zealand
Figure 2. ERP and DSS support for increasing intelligence density (adapted from Shafiei and Sundaram, 2004). [The figure shows a pyramid rising from Data through Access, Scrub, Integrate, Transform, and Discover to Learn and Knowledge at the apex, with the ERP system supporting the lower layers and DSS the upper layers.]

intelligence density are Data Warehousing (DW), Online Analytical Processing (OLAP), and Data Mining (DM). These technologies have had a significant impact on the design and implementation of DSS. A generic decision support architecture that incorporates these technologies is illustrated in Figure 3. This architecture highlights the complementary nature of data warehousing, OLAP, and data mining. The data warehouse and its related components support the lower end of the intelligence density pyramid by providing tools and technologies that allow one to extract, load, cleanse, convert, and transform the raw data available in an organisation into a form that then allows the decision maker to apply OLAP and data mining tools with ease. The OLAP and data mining tools in turn support the middle and upper levels of the intelligence density pyramid. In the following paragraphs we look at each of these technologies with a particular focus on their ability to increase the intelligence density of data.

Figure 3. DSS architecture incorporating data warehouses, OLAP, and data mining (adapted from Srinivasan et al., 2000).

Data Warehousing

homonyms and synonyms. The key steps that need to be undertaken to transform raw data to a form that can be stored in a Data Warehouse for analysis are:

• The extraction and loading of the data into the Data Warehouse environment from a number of systems on a periodic basis
• Conversion of the data into a format that is appropriate to the Data Warehouse
• Cleansing of the data to remove inconsistencies, inappropriate values, errors, etc.
• Integration of the different data sets into a form that matches the data model of the Data Warehouse
• Transformation of the data through operations such as summarisation, aggregation, and creation of derived attributes

Once all these steps have been completed, the data is ready for further processing. While one could use different programs/packages to accomplish the various steps listed above, they could also be conducted within a single environment. For example, Microsoft SQL Server (2004) provides the Data Transformation Services, by which raw data from organisational data stores can be loaded, cleansed, converted, integrated, aggregated, summarized, and transformed in a variety of ways.
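The five key steps listed above can be sketched as a toy in-memory pipeline; the source rows, field names, and cleansing rule are invented for illustration (a real environment such as SQL Server's Data Transformation Services operates on actual data stores):

```python
# Toy extracts from two source systems, slightly inconsistent.
system_a = [{"cust": "Ann", "amount": "12.5"}, {"cust": "Bob", "amount": "-1"}]
system_b = [{"cust": "Cat", "amount": "7.25"}]

# Extraction and loading: pull rows from each source system.
staged = system_a + system_b

# Conversion: cast amounts to the warehouse's numeric format.
converted = [{"cust": r["cust"], "amount": float(r["amount"])} for r in staged]

# Cleansing: drop rows with inappropriate values (negative amounts here).
cleansed = [r for r in converted if r["amount"] >= 0]

# Integration and transformation: aggregate into a summary per customer.
summary = {}
for r in cleansed:
    summary[r["cust"]] = summary.get(r["cust"], 0.0) + r["amount"]

print(summary)  # {'Ann': 12.5, 'Cat': 7.25}
```

Each stage consumes the previous stage's output, mirroring the staged flow from raw operational data to analysis-ready warehouse data.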
Intelligence Density
OLAP can be defined as the creation, analysis, ad hoc querying, and management of multidimensional data (Thomsen, 2002). Predominantly, the focus of most OLAP systems is on the analysis and ad hoc querying of multidimensional data. Data warehousing systems are usually responsible for the creation and management of the multidimensional data. A superficial understanding might suggest that there is not much of a difference between data warehouses and OLAP. This is due to the fact that both are complementary technologies with the aim of increasing the intelligence density of data. OLAP is a logical extension to the data warehouse. OLAP and related technologies focus on providing support for the analytical, modelling, and computational requirements of decision makers. While OLAP systems provide a medium level of analysis capabilities, most of the current crop of OLAP systems do not provide the sophisticated modeling or analysis functionalities of data mining, mathematical programming, or simulation systems.

Data Mining

Data mining can be defined as the process of identifying valid, novel, useful, and understandable patterns in data through automatic or semiautomatic means (Berry & Linoff, 1997). Data mining borrows techniques that originated from diverse fields such as computer science, statistics, and artificial intelligence. Data mining is now being used in a range of industries and for a range of tasks in a variety of contexts (Wang, 2003). The complexity of the field of data mining makes it worthwhile to structure it into goals, tasks, methods, algorithms, and algorithm implementations. The goals of data mining drive the tasks that need to be undertaken, and the tasks drive the methods that will be applied. The methods that will be applied drive the selection of algorithms, followed by the choice of algorithm implementations.

The goals of data mining are description, prediction, and/or verification. Description-oriented tasks include clustering, summarisation, deviation detection, and visualization. Prediction-oriented tasks include classification and regression. Statistical analysis techniques are predominantly used for verification. Methods or techniques to carry out these tasks are many; chief among them are neural networks, rule induction, market basket analysis, cluster detection, link analysis, and statistical analysis. Each method may have several supporting algorithms, and in turn each algorithm may be implemented in a different manner. Data mining tools such as Clementine (SPSS, 2004) not only support the discovery of nuggets but also support the entire intelligence density pyramid by providing a sophisticated visual interactive environment.

There are two key trends that are evident in the commercial as well as the research realm. The first is the complementary use of various decision support tools such as data warehousing, OLAP, and data mining in a synergistic fashion, leading to information of high intelligence density. Another subtle but vital trend is the ubiquitous inclusion of data warehousing, OLAP, and data mining in most information technology architectural landscapes. This is especially true of DSS architectures.

CONCLUSION

In this chapter we first defined intelligence density and the need for decision support tools that would provide intelligence of a high density. We then introduced three emergent technologies integral to the design and implementation of DSS architectures whose prime purpose is to increase the intelligence density of data. We introduced and described data warehousing, OLAP, and data mining briefly from the perspective of their ability to increase the intelligence density of data. We also proposed a generic decision support architecture that complementarily uses data warehousing, OLAP, and data mining.

REFERENCES

Berry, M.J.A., & Linoff, G. (1997). Data mining techniques: For marketing, sales, and customer support. John Wiley & Sons.

Berson, A., & Smith, S.J. (1997). Data warehousing, data mining, & OLAP. McGraw-Hill.

Dhar, V., & Stein, R. (1997). Intelligent decision support methods: The science of knowledge work. Prentice Hall.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Inmon, W.H. (2002). Building the data warehouse. John Wiley & Sons.

Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modeling. John Wiley & Sons.

Lapin, L., & Whisler, W.D. (2002). Quantitative decision making with spreadsheet applications. Belmont, CA: Duxbury/Thomson Learning.
Microsoft. (2004). Microsoft SQL Server. Retrieved from http://www.microsoft.com/

Shafiei, F., & Sundaram, D. (2004, January 5-8). Multi-enterprise collaborative enterprise resource planning and decision support systems. Thirty-Seventh Hawaii International Conference on System Sciences (CD/ROM).

SPSS. (2004). Clementine. Retrieved from http://www.spss.com

Srinivasan, A., Sundaram, D., & Davis, J. (2000). Implementing decision support systems. McGraw Hill.

Thomsen, E. (2002). OLAP solutions: Building multidimensional information systems (2nd ed.). New York; Chichester, UK: Wiley.

Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. John Wiley & Sons.

KEY TERMS

Data Warehouses: Provide data that are integrated, subject-oriented, time-variant, and non-volatile, thereby increasing the intelligence density of the raw input data.

Decision Support Systems/Tools: In a wider sense, can be defined as systems/tools that affect the way people make decisions; in our present context, systems that increase the intelligence density of data.

Enterprise Resource Planning/Enterprise Systems: Integrated information systems that support most of the business processes and information system requirements in an organization.

Intelligence Density: The useful decision support information that a decision maker gets from using a system for a certain amount of time, or alternately the amount of time taken to get the essence of the underlying data from the output.

Online Analytical Processing (OLAP): Enables the creation, management, analysis, and ad hoc querying of multidimensional data, thereby increasing the intelligence density of the data already available in data warehouses.
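The kind of multidimensional aggregation described under the OLAP term above can be sketched in a few lines of Python. This is only an illustration: the fact table, dimension names, and figures below are invented, and a real OLAP system would operate on a cube managed by a data warehouse rather than an in-memory list.

```python
from collections import defaultdict

# Invented fact table: (region, product, quarter, sales).
facts = [
    ("East", "Laptop", "Q1", 100),
    ("East", "Laptop", "Q2", 120),
    ("East", "Phone",  "Q1",  80),
    ("West", "Laptop", "Q1",  90),
    ("West", "Phone",  "Q2", 150),
]

def roll_up(facts, dims):
    """Sum the sales measure over the chosen dimension indices
    (0 = region, 1 = product, 2 = quarter), aggregating out the rest."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row[3]
    return dict(totals)

# Two ad hoc views of the same multidimensional data, in the
# spirit of an OLAP roll-up.
print(roll_up(facts, [0]))     # {('East',): 300, ('West',): 240}
print(roll_up(facts, [1, 2]))  # sales by (product, quarter)
```

Each call answers a different ad hoc question against the same multidimensional data, which is the essence of the analysis and querying capability the definition describes.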
BACKGROUND

The job of a data analyst typically involves problem formulation, advice on data collection (though it is not uncommon for the analyst to be asked to analyze data that have already been collected), effective data analysis, and interpretation and reporting of the findings. Data analysis is about the extraction of useful information from data and is often performed by an iterative process in which exploratory analysis and confirmatory analysis are the two principal components.

Exploratory data analysis, or data exploration, resembles the job of a detective: understanding the evidence collected, looking for clues, applying relevant background knowledge, and pursuing and checking the possibilities that clues suggest.

Data exploration is not only useful for data understanding but also helpful in generating possibly interesting hypotheses for a later study, normally a more formal or confirmatory procedure for analyzing data. Such procedures often assume a potential model structure for the data and may involve estimating the model parameters and testing hypotheses about the model.

Over the last 15 years, we have witnessed two phenomena that have affected the work of modern data analysts more than any others. First, the size and variety of machine-readable data sets have increased dramatically, and the problem of data explosion has become apparent. Second, recent developments in computing have provided the basic infrastructure for fast data access as well as many advanced computational methods for extracting information from large quantities of data. These developments have created a new range of problems and challenges.

MAIN THRUST

In this paper, we will explore the main disciplines and associated techniques as well as applications to help clarify the meaning of intelligent data analysis, followed by a discussion of several key issues.

Statistics and Computing: Key Disciplines

IDA has its origins in many disciplines, principally statistics and computing. For many years, statisticians have studied the science of data analysis and have laid many of the important foundations. Many of the analysis methods and principles were established long before computers were born. Given that statistics is often regarded as a branch of mathematics, there has been an emphasis on mathematical rigor, a desire to establish that something is sensible on theoretical grounds before trying it out on practical problems (Berthold & Hand, 2003). On the other hand, the computing community, particularly in machine learning (Mitchell, 1997) and data mining (Wang, 2003), is much more willing to try something out (e.g., designing new algorithms) to see how it performs on real-world datasets, without worrying too much about the theory behind it.

Statistics is probably the oldest ancestor of IDA, but what kind of contributions has computing made to the subject? These may be classified into three categories. First, the basic computing infrastructure has been put in place during the last decade or so, which enables large-scale data analysis (e.g., advances in data warehousing
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Intelligent Data Analysis
and online analytic processing, computer networks, and desktop technologies have made it possible to easily organize and move the data around for analysis purposes). Modern computing processing power also has made it possible to efficiently implement some of the very computationally intensive analysis methods, such as statistical resampling, visualization, large-scale simulation and neural networks, and stochastic search and optimization methods.

Second, there has been much work on extending traditional statistical and operational research methods to handle challenging problems arising from modern data sets. For example, in Bayesian networks (Ramoni et al., 2002), where the work is based on Bayesian statistics, one tries to make the ideas work on large-scale practical problems by making appropriate assumptions and developing computationally efficient algorithms; in support vector machines (Cristianini & Shawe-Taylor, 2000), one tries to see how statistical learning theory (Vapnik, 1998) can be utilized to handle very high-dimensional datasets in linear feature spaces; and in evolutionary computation (Eiben & Michalewicz, 1999), one tries to extend the traditional operational research search and optimization methods.

Third, new kinds of IDA algorithms have been proposed in response to new challenges. Here are several examples of novel methods with distinctive computing characteristics: powerful three-dimensional virtual reality visualization systems that allow gigabytes of data to be visualized interactively by teams of scientists in different parts of the world (Cruz-Neira, 2003); parallel and distributed algorithms for different data analysis tasks (Zaki & Pan, 2002); so-called any-time analysis algorithms designed for real-time tasks, where the system, if stopped at any time after its starting point, is able to give some satisfactory (though not optimal) solution (of course, the more time it has, the better the solution will be); inductive logic programming, which extends the deductive power of classic logic programming methods to induce structures from data (Mooney, 2004); association rule learning algorithms, motivated by needs in the retail industry, where customers tend to buy related items (Nijssen & Kok, 2001); and inductive databases, which attempt to supply users with queries involving inductive capabilities (De Raedt, 2002). Of course, this list is not meant to be exhaustive, but it gives some idea of the kind of IDA work going on within the computing community.

IDA Applications

Data analysis is performed for a variety of reasons by scientists, engineers, business communities, medical and government researchers, and so forth. The increasing size and variety of data as well as new exciting applications such as bioinformatics and e-science have called for new ways of analyzing the data. Therefore, it is a very difficult task to give a sensible summary of the types of IDA applications that are possible. The following is a partial list.

Bioinformatics: A huge amount of data has been generated by genome-sequencing projects and other experimental efforts to determine the structures and functions of biological molecules and to understand the evolution of life (Orengo et al., 2003). One of the most significant developments in bioinformatics is the use of high-throughput devices such as DNA microarray technology to study the activities of thousands of genes in a single experiment and to provide a global view of the underlying biological process by revealing, for example, which genes are responsible for a disease process, how they interact and are regulated, and which genes are being co-expressed and participate in common biological pathways. Major IDA challenges in this area include the analysis of very high-dimensional but small-sample microarray data, the integration of a variety of data for constructing biological networks and pathways, and the handling of very noisy microarray image data.

Medicine and Healthcare: With the increasing development of electronic patient records and medical information systems, a large amount of clinical data is available online. Regularities, trends, and surprising events extracted from these data by IDA methods are important in assisting clinicians to make informed decisions, thereby improving health services (Bellazzi et al., 2001). Examples of such applications include the development of novel methods to analyze time-stamped data in order to assess the progression of disease, autonomous agents for monitoring and diagnosing intensive care patients, and intelligent systems for screening early signs of glaucoma. It is worth noting that research in bioinformatics can have a significant impact on the understanding of disease and consequently on better therapeutics and treatments. For example, it has been found using DNA microarray technology that the current taxonomy of cancer in certain cases appears to group together molecularly distinct diseases with distinct clinical phenotypes, suggesting the discovery of subgroups of cancer (Alizadeh et al., 2000).

Science and Engineering: Enormous amounts of data have been generated in science and engineering (Cartwright, 2000) (e.g., in cosmology, chemical engineering, or molecular biology, as discussed previously). In cosmology, advanced computational tools are needed to help astronomers understand the origin of large-scale cosmological structures
as well as the formation and evolution of their astrophysical components (i.e., galaxies, quasars, and clusters). In chemical engineering, mathematical models have been used to describe interactions among various chemical processes occurring inside a plant. These models are typically very large systems of nonlinear algebraic or differential equations. Challenges for IDA in this area include the development of scalable, approximate, parallel, or distributed algorithms for large-scale applications.

Business and Finance: There is a wide range of successful business applications reported, although the retrieval of technical details is not always easy, perhaps for obvious reasons. These applications include fraud detection, customer retention, cross selling, marketing, and insurance. Fraud is costing industries billions of pounds, so it is not surprising to see that systems have been developed to combat fraudulent activities in such areas as credit cards, health care, stock market dealing, or finance in general. Interesting challenges for IDA include the timely integration of information from different resources and the analysis of local patterns that represent deviations from a background model (Hand et al., 2002).

IDA Key Issues

In responding to the challenges of analyzing complex data from a variety of applications, particularly emerging ones such as bioinformatics and e-science, the following issues are receiving increasing attention, in addition to the development of novel algorithms to solve new emerging problems.

Strategies: There is a strategic aspect to data analysis beyond the tactical choice of this or that test, visualization, or variable. Analysts often bring exogenous knowledge about data to bear when they decide how to analyze it. The question of how data analysis may be carried out effectively should lead us to take a close look not only at the individual components in the data analysis process but also at the process as a whole, asking what would constitute a sensible analysis strategy. The strategy should describe the steps, decisions, and actions that are taken during the process of analyzing data to build a model or answer a question.

Data Quality: Real-world data contain errors and are incomplete and inconsistent. It is commonly accepted that data cleaning is one of the most difficult and most costly tasks in large-scale data analysis and often consumes most of a project's resources. Research on data quality has attracted a significant amount of attention from different communities, including statistics, computing, and information systems. Important progress has been made, but further work is urgently needed to come up with practical and effective methods for managing different kinds of data quality problems in large databases.

Scalability: Currently, technical reports analyzing really big data are still sketchy. Analysis of big, opportunistic data (i.e., data collected for an unrelated purpose) is beset with many statistical pitfalls. Much research has been done to develop efficient, heuristic, parallel, and distributed algorithms that are able to scale well. We will be eager to see more practical experience shared from analyzing large, complex, real-world datasets in order to obtain a deep understanding of the IDA process.

Mapping Methods to Applications: Given that there are so many methods developed in different communities for essentially the same task (e.g., classification), what are the important factors in choosing the most appropriate method(s) for a given application? The most commonly used criterion is prediction accuracy. However, it is not always the only, or even the most important, criterion for evaluating competing methods. Credit scoring is one of the most quoted applications where misclassification cost is more important than predictive accuracy. Other important factors in deciding that one method is preferable to another include computational efficiency and the interpretability of methods.

Human-Computer Collaboration: Data analysis is often an iterative, complex process in which both analyst and computer play an important part. An interesting issue is how one can have an effective analysis environment where the computer performs complex and laborious operations and provides essential assistance, while the analyst is allowed to focus on the more creative part of the data analysis using knowledge and experience.

FUTURE TRENDS

There is strong evidence that IDA will continue to generate a lot of interest in both academic and industrial communities, given the number of related conferences, journals, working groups, books, and successful case studies already in existence. It is almost inconceivable that this topic will fade in the foreseeable future, since there are so many important and challenging real-world problems that demand solutions from this area, and there are still so many unanswered questions. The debate on what constitutes intelligent or unintelligent data analysis will carry on for a while.
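The credit-scoring remark above, that misclassification cost can matter more than predictive accuracy, is easy to make concrete. The sketch below uses invented confusion-matrix counts and an invented cost matrix purely for illustration:

```python
def accuracy(confusion):
    """Fraction of correct predictions; confusion maps (actual, predicted) to a count."""
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / sum(confusion.values())

def expected_cost(confusion, cost):
    """Average misclassification cost per case, given a cost per error type."""
    total = sum(cost.get((a, p), 0) * n for (a, p), n in confusion.items())
    return total / sum(confusion.values())

# Assumption: accepting a bad credit risk is ten times as costly
# as rejecting a good applicant.
cost = {("bad", "good"): 10, ("good", "bad"): 1}

# Two hypothetical classifiers evaluated on 1,000 applicants.
m1 = {("good", "good"): 900, ("good", "bad"): 20,
      ("bad", "good"): 60,  ("bad", "bad"): 20}
m2 = {("good", "good"): 850, ("good", "bad"): 70,
      ("bad", "good"): 20,  ("bad", "bad"): 60}

print(accuracy(m1), expected_cost(m1, cost))  # 0.92 0.62
print(accuracy(m2), expected_cost(m2, cost))  # 0.91 0.27
```

The first classifier wins on accuracy, yet the second, more cautious one is far cheaper under the assumed costs, which is why accuracy alone can be a misleading criterion for choosing a method.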
More analysis tools and methods inevitably will appear, but help in their proper use will not arrive fast enough. A tool can be used without an essential understanding of what it can offer and how the results should be interpreted, despite the best intentions of the user. Research will be directed toward the development of more helpful middle-ground tools: those that are less generic than current data analysis software tools but more general than specialized data analysis applications.

Much of the current work in the area is empirical in nature, and we are still in the process of accumulating more experience in analyzing large, complex data. A lot of heuristics and trial and error have been used in exploring and analyzing these data, especially data collected opportunistically. As time goes by, we will see more theoretical work that attempts to establish a sounder foundation for the analysts of the future.

CONCLUSION

Statistical methods have been the primary analysis tool, but many new computing developments have been applied to the analysis of large and challenging real-world datasets. Intelligent data analysis requires careful thinking at every stage of an analysis process, assessment and selection of the most appropriate approaches for the analysis tasks at hand, and intelligent application of relevant domain knowledge. This is an area with enormous potential, as it seeks to answer the following key questions: How can one perform data analysis most effectively (intelligently) to gain new scientific insights, to capture bigger portions of the market, to improve the quality of life, and so forth? What are the guiding principles that enable one to do so? How can one reduce the chance of performing unintelligent data analysis? Modern datasets are getting larger and more complex, but the number of trained data analysts is certainly not keeping up at any rate. This poses a significant challenge for the IDA and other related communities such as statistics, data mining, machine learning, and pattern recognition. The quest for bridging this gap and for crucial insights into the process of intelligent data analysis will require an interdisciplinary effort from all these disciplines.

REFERENCES

Alizadeh, A.A. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.

Bellazzi, R., Zupan, B., & Liu, X. (Eds.). (2001). Intelligent data analysis in medicine and pharmacology. London.

Berthold, M., & Hand, D.J. (Eds.). (2003). Intelligent data analysis: An introduction. Springer-Verlag.

Cartwright, H. (Ed.). (2000). Intelligent data analysis in science. Oxford University Press.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge University Press.

Cruz-Neira, C. (2003). Computational humanities: The new challenge for virtual reality. IEEE Computer Graphics and Applications, 23(3), 10-13.

De Raedt, L. (2002). A perspective on inductive databases. ACM SIGKDD Explorations Newsletter, 4(2), 69-77.

Eiben, A.E., & Michalewicz, Z. (Eds.). (2003). Evolutionary computation. IOS Press.

Hand, D.J., Adams, N., & Bolton, R. (2002). Pattern detection and discovery. Lecture Notes in Artificial Intelligence, 2447.

Liu, X. (1999). Progress in intelligent data analysis. International Journal of Applied Intelligence, 11(3), 235-240.

Mitchell, T. (1997). Machine learning. McGraw Hill.

Mooney, R. et al. (2004). Relational data mining with inductive logic programming for link discovery. In H. Kargupta et al. (Eds.), Data mining: Next generation challenges and future directions. AAAI Press.

Nijssen, S., & Kok, J. (2001). Faster association rules for multiple relations. Proceedings of the International Joint Conference on Artificial Intelligence.

Orengo, C., Jones, D., & Thornton, J. (Eds.). (2003). Bioinformatics: Genes, proteins & computers. BIOS Scientific Publishers.

Ramoni, M., Sebastiani, P., & Cohen, P. (2002). Bayesian clustering by dynamics. Machine Learning, 47(1), 91-121.

Vapnik, V.N. (1998). Statistical learning theory. Wiley.

Wang, J. (Ed.). (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

Zaki, M., & Pan, Y. (2002). Recent developments in parallel and distributed data mining. Distributed and Parallel Databases: An International Journal, 11(2), 123-127.

KEY TERMS

Bioinformatics: The development and application of computational and mathematical methods for organizing, analyzing, and interpreting biological data.
E-Science: The large-scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.

Intelligent Data Analysis: An interdisciplinary study concerned with the effective analysis of data, which draws techniques from diverse fields including AI, databases, high-performance computing, pattern recognition, and statistics.

Machine Learning: The study of how computers can be used to automatically acquire new knowledge from past cases or experience or from the computer's own experiences.

Noisy Data: Real-world data often contain errors due to the nature of data collection, measurement, or sensing procedures. They can be incomplete, inaccurate, out-of-date, or inconsistent.

Support Vector Machines (SVM): Learning machines that can perform difficult classification and regression estimation tasks. SVMs non-linearly map their n-dimensional input space into a high-dimensional feature space in which a linear classifier is constructed.

Visualization: Tools that graphically display data in order to facilitate better understanding of their meaning. Graphical capabilities range from simple scatter plots to three-dimensional virtual reality systems.
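The data-quality problems named in this article (noisy, incomplete, inaccurate, and inconsistent data) lend themselves to simple automated audits. A minimal sketch, with invented patient records and invented validity rules:

```python
from datetime import date

# Invented records; None marks a missing (incomplete) value.
records = [
    {"id": 1, "age": 34,   "visit": date(2004, 5, 1), "discharge": date(2004, 5, 3)},
    {"id": 2, "age": -7,   "visit": date(2004, 6, 2), "discharge": date(2004, 6, 1)},
    {"id": 3, "age": None, "visit": date(2004, 7, 9), "discharge": date(2004, 7, 12)},
]

def audit(record):
    """Return the list of data-quality problems found in one record."""
    problems = []
    if record["age"] is None:
        problems.append("incomplete: age missing")
    elif not 0 <= record["age"] <= 130:
        problems.append("inaccurate: age out of range")
    if record["discharge"] < record["visit"]:
        problems.append("inconsistent: discharge before visit")
    return problems

for r in records:
    print(r["id"], audit(r))
```

Real data cleaning goes far beyond such range and consistency checks, but even this sketch suggests why the task consumes so much of a project's resources: every attribute needs its own rules, and the rules themselves encode domain knowledge.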
Agnieszka Dardzinska
Bialystok Technical University, Poland
Intelligent Query Answering
the time progresses, more and more rules can be added to the local knowledge base, which means that some attribute values (decision parts of rules) foreign to the client are also added to its local alphabet. The choice of which site should be contacted first, in the search for definitions of foreign attribute values, is mainly based on the number of attribute values common to the client and server sites. The solution to this problem is given in Ras (2002).

MAIN THRUST

The technology dimension will be explored to help clarify the meaning of intelligent query answering based on knowledge discovery and chase.

Intelligent Query Answering for Standalone Information System

QAS for an information system is concerned with identifying all objects in the system satisfying a given description. For example, an information system might contain information about students in a class and classify them using four attributes: hair color, eye color, gender, and size. A simple query might be to find all students with brown hair and blue eyes. When an information system is incomplete, students having brown hair and unknown eye color can be handled by either including or excluding them from the answer to the query. In the first case we talk about an optimistic approach to query evaluation, while in the second case we talk about a pessimistic approach. Another option to handle such a query would be to discover rules for eye color in terms of the attributes hair color, gender, and size. These rules could then be applied to students with unknown eye color to generate values that could be used in answering the query. Consider that in our example one of the generated rules said:

(hair, brown) ∧ (size, medium) → (eye, brown).

Thus, if one of the students having brown hair and medium size has no value for eye color, then the query answering system should not include this student in the list of students with brown hair and blue eyes. Attributes hair color and size are classification attributes, and eye color is the decision attribute.

We are also interested in how to use this strategy to build intelligent QAS for incomplete information systems. If a query is submitted to information system S, the first step of QAS is to make S as complete as possible. The approach proposed in Dardzinska & Ras (2003b) is to use not only functional dependencies to chase S (Atzeni & DeAntonellis, 1992) but also rules discovered from a complete subsystem of S to do the chasing.

In the first step, intelligent QAS identifies all incomplete attributes used in a query. An attribute is incomplete in S if there is an object in S with incomplete information on this attribute. The values of all incomplete attributes are treated as concepts to be learned (in the form of rules) from S.

Incomplete information in S is then replaced by new data provided by the Chase algorithm based on these rules. When the process of removing incomplete values in the local information system is completed, QAS finds the answer to the query in the usual way.

Intelligent Query Answering for Distributed Autonomous Information Systems

Semantic inconsistencies are due to different interpretations of attributes and their values among sites (for instance, one site can interpret the concept "young" differently than other sites). Different interpretations are also due to the way each site is handling null values. Null value replacement by values suggested either by statistical or knowledge discovery methods is quite common before a user query is processed by QAS.

Ontology (Guarino, 1998; Sowa, 1999, 2000; Van Heijst et al., 1997) is a set of terms of a particular information domain and the relationships among them. Currently, there is a great deal of interest in the development of ontologies to facilitate knowledge sharing among information systems.

Ontologies and the inter-ontology relationships between them are created by experts in the corresponding domain, but they can also represent a particular point of view of the global information system by describing customized domains. To allow intelligent query processing, it is often assumed that an information system is coupled with some ontology. Inter-ontology relationships can be seen as semantical bridges between the ontologies built for each of the autonomous information systems so that they can collaborate and understand each other.

In Ras and Dardzinska (2004), the notion of optimal rough semantics and the method of its construction have been proposed. Rough semantics can be used to model semantic inconsistencies among sites due to different interpretations of incomplete values of attributes. Distributed chase (Ras & Dardzinska, 2004) is a chase-type algorithm, driven by a client site of a distributed information system (DIS), which is similar to the chase algorithms based on knowledge discovery presented in
Dardzinska & Ras (2003a, 2003b). Distributed chase has one extra feature in comparison to other chase-type algorithms: the dynamic creation of knowledge bases at all sites of DIS involved in the process of solving a query submitted to the client site of DIS.

The knowledge base at the client site may contain rules extracted from the client information system and also rules extracted from information systems at remote sites in DIS. These rules are dynamically updated through the incomplete values replacement process (Ras & Dardzinska, 2004).

Although the names of attributes are often the same among sites, their semantics and granularity levels may differ from site to site. As a result of these differences, the knowledge bases at the client site and at remote sites have to satisfy certain properties in order to be applicable in a distributed chase.

So, assume that system S = (X, A, V), which is a part of DIS, is queried by a user. The Chase algorithm, to be applicable to S, has to be based on rules from the knowledge base D associated with S, which satisfies the following conditions:

1. An attribute value used in the decision part of a rule from D has a granularity level either equal to or finer than the granularity level of the corresponding attribute in S.
2. The granularity level of any attribute used in the classification part of a rule from D is either equal to or softer than the granularity level of the corresponding attribute in S.
3. An attribute used in the decision part of a rule from D either does not belong to A or is incomplete in S.

Assume again that S = (X, A, V) is an information system (Pawlak, 1991; Ras & Dardzinska, 2004), where X is a set of objects, A is a set of attributes (seen as partial functions from X into 2^(V × [0,1])), and V is a set of values of attributes from A. By [0,1] we mean the set of real numbers from 0 to 1. Let L(D) = {[t → v_c] ∈ D : c ∈ In(A)} be the set of all rules (called a knowledge base) extracted initially from the information system S by ERID (Dardzinska & Ras, 2003c), where In(A) is the set of incomplete attributes in S.

Assume now that query q(B) is submitted to system S = (X, A, V), where B is the set of all attributes used in

…different interpretations of incomplete attribute values among sites have been considered.

In Ras (2002), it was shown that query q(B) can be processed at site S by discovering definitions of the values of attributes from B - [A ∩ B] at the remote sites for S and then using them to answer q(B).

Foreign attributes for S in B can also be seen as attributes entirely incomplete in S, which means values (either exact or partially incomplete) of such attributes should be ascribed by chase to all objects in S before query q(B) is answered. The question remains: are the values discovered by chase really correct?

The classical approach to this kind of problem is to build a simple DIS environment (mainly to avoid difficulties related to different granularity and different semantics of attributes at different sites). As the testing data set, Ras and Dardzinska (2005) took 10,000 tuples randomly selected from a database of an insurance company. This sample table, containing 100 attributes, was randomly partitioned into four subtables of equal size containing 2,500 tuples each. Next, from each of these subtables 40 attributes (columns) were randomly removed, leaving four data tables of size 2,500 × 60 each. One of these tables was called a client, and the remaining three were called servers. Now, for all objects at the client site, the values of one of the attributes, chosen randomly, were hidden. This attribute is denoted by d. At each server site, if attribute d was listed in its domain schema, descriptions of d were learned using See5 software (the data are complete, so it was not necessary to use ERID). All these descriptions, in the form of rules, were stored in the knowledge base of the client. Distributed Chase was applied to predict the real value of the hidden attribute for each object x at the client site. The threshold value λ = 0.125 was used to rule out all values predicted by Distributed Chase with confidence below that threshold. Almost all hidden values (2,476 out of 2,500) were discovered correctly (assuming λ = 0.125).

Distributed Chase and Security Problem of Hidden Attributes

Assume now that an information system S = (X, A, V) is a part of DIS and attribute b ∈ A has to be hidden. For
q(B) and that A B . All attributes in B - [A B] are that purpose, we construct Sb=(X,A,V) to replace S,
called foreign for S. If S is a part of a distributed infor- where:
mation system, definitions of foreign attributes for S can
be extracted at its remote sites (Ras, 2002). Clearly, all 1. aS(x) = aSb(x), for any a A-{b}, x X,
semantic inconsistencies and differences in granularity 2. bSb(x) is undefined, for any x X,
of attribute values among sites have to be resolved first. 3. bS(x) Vb.
In Ras & Dardzinska (2004) only different granularity of
attribute values and different semantics related to differ-
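The value-imputation step of the experiment above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the rule encoding as (condition, predicted value, confidence) triples, the dictionary-based objects, and the attribute name `d` are assumptions for the example; the See5/ERID rule learning itself is not reproduced.

```python
def chase_hidden_attribute(objects, rules, threshold=0.125):
    """One Distributed Chase pass over the client table: for every
    object whose attribute 'd' is hidden (None), collect the values
    predicted by rules learned at the server sites, and keep the
    best-supported value if its confidence clears the threshold."""
    for obj in objects:
        if obj.get("d") is not None:
            continue
        candidates = {}  # predicted value -> highest confidence seen
        for condition, value, confidence in rules:
            if all(obj.get(a) == v for a, v in condition.items()):
                candidates[value] = max(candidates.get(value, 0.0),
                                        confidence)
        if candidates:
            value, conf = max(candidates.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                obj["d"] = value
    return objects
```

Iterating such a pass until no further value changes corresponds to chasing the client table with the rules stored in its knowledge base.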
Intelligent Query Answering
Users are allowed to submit queries to Sb and not to S. What about the information system Chase(Sb)? How does it differ from S?

If bS(x) = bChase(Sb)(x), where x ∈ X, then the values of additional attributes for object x have to be hidden in Sb to guarantee that the value bS(x) cannot be reconstructed by Chase. In Ras and Dardzinska (2005) it was shown how to identify the minimal number of such values.

FUTURE TRENDS

One of the main problems related to the semantics of an incomplete information system S is the freedom in how new values are constructed to replace incomplete values in S before any rule-extraction process begins. This replacement of incomplete attribute values in some of the slots in S can be done either by chase and/or by a number of available statistical methods (Giudici, 2003). This implies that the semantics of queries submitted to S and driven (defined) by a query answering system QAS based on chase may often differ. Although rough semantics can be used by QAS to handle this problem, we still have to look for new alternate methods.

Assuming different semantics of attributes among sites in DIS, the use of a global ontology, or of local ontologies built jointly with inter-ontology relationships among them, seems to be necessary for solving queries in DIS using knowledge discovery and chase. A lot of research still has to be done in this area.

CONCLUSION

Assume that the client site in DIS is represented by a partially incomplete information system S. When a query is submitted to S, its query answering system QAS will replace S by Chase(S) and next will solve the query using, for instance, the strategy proposed in Ras and Joshi (1997). Rules used by Chase can be extracted from S or from its remote sites in DIS, assuming that all differences in semantics of attributes and differences in granularity levels of attributes are resolved first. One may ask why the resulting information system obtained by Chase cannot be stored aside and reused when a new query is submitted to S. If system S is not frequently updated, we can do that by keeping a copy of Chase(S) and reusing that copy when a new query is submitted to S. But the original information system S still has to be kept, so that when a user wants to enter new data into S, they can be stored in the original system. System Chase(S), if stored aside, cannot be reused by QAS when the number of updates in the original S exceeds a given threshold value. This means that the new, updated information system S has to be chased again before any query is answered by QAS.

REFERENCES

Atzeni, P., & DeAntonellis, V. (1992). Relational database theory. The Benjamin Cummings Publishing Company.

Cuppens, F., & Demolombe, R. (1988). Cooperative answering: A methodology to provide intelligent access to databases. In Proceedings of the Second International Conference on Expert Database Systems (pp. 333-353).

Dardzinska, A., & Ras, Z.W. (2003a). Rule-based Chase algorithm for partially incomplete information systems. In Proceedings of the Second International Workshop on Active Mining, Maebashi City, Japan (pp. 42-51).

Dardzinska, A., & Ras, Z.W. (2003b). Chasing unknown values in incomplete information systems. In Proceedings of the ICDM'03 Workshop on Foundations and New Directions of Data Mining, Melbourne, Florida (pp. 24-30). IEEE Computer Society.

Dardzinska, A., & Ras, Z.W. (2003c). On rule discovery from incomplete information systems. In Proceedings of the ICDM'03 Workshop on Foundations and New Directions of Data Mining, Melbourne, Florida (pp. 31-35). IEEE Computer Society.

Gaasterland, T., Godfrey, P., & Minker, J. (1992). Relaxation as a platform for cooperative answering. Journal of Intelligent Information Systems, 1(3), 293-321.

Gal, A., & Minker, J. (1988). Informative and cooperative answers in databases using integrity constraints. In Natural language understanding and logic programming (pp. 288-300). North Holland.

Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. West Sussex, UK: Wiley.

Guarino, N. (Ed.). (1998). Formal ontology in information systems. Amsterdam: IOS Press.

Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer.

Ras, Z. (2002). Reducts-driven query answering for distributed knowledge systems. International Journal of Intelligent Systems, 17(2), 113-124.
Ras, Z., & Dardzinska, A. (2004). Ontology-based distributed autonomous knowledge systems. Information Systems International Journal, 29(1), 47-58.

Ras, Z., & Dardzinska, A. (2005). Data security and null value imputation in distributed information systems. In Advances in Soft Computing, Proceedings of the MSRAS'04 Symposium (pp. 133-146). Poland: Springer-Verlag.

Ras, Z., & Joshi, S. (1997). Query approximate answering system for an incomplete DKBS. Fundamenta Informaticae, 30(3), 313-324.

Sowa, J.F. (1999). Ontological categories. In L. Albertazzi (Ed.), Shapes of forms: From Gestalt psychology and phenomenology to ontology and mathematics (pp. 307-340). Kluwer.

Sowa, J.F. (2000). Knowledge representation: Logical, philosophical, and computational foundations. Pacific Grove, CA: Brooks/Cole Publishing.

Van Heijst, G., Schreiber, A., & Wielinga, B. (1997). Using explicit ontologies in KBS development. International Journal of Human and Computer Studies, 46(2/3), 183-292.

KEY TERMS

Autonomous Information System: An information system existing as an independent entity.

Chase: A kind of recursive strategy applied to a database V, based on functional dependencies or rules extracted from V, by which a null value or an incomplete value in V is replaced by a new, more complete value.

Distributed Chase: A kind of recursive strategy applied to a database V, based on functional dependencies or rules extracted both from V and from other autonomous databases, by which a null value or an incomplete value in V is replaced by a new, more complete value. Any differences in semantics among attributes in the involved databases have to be resolved first.

Intelligent Query Answering: Enhancements of query answering systems into a sort of intelligent systems (capable of being adapted or molded). Such systems should be able to interpret incorrectly posed questions and compose an answer not necessarily reflecting precisely what is directly referred to by the question, but rather reflecting what the intermediary understands to be the intention linked with the question.

Knowledge Base: A collection of rules defined as expressions written in predicate calculus. These rules have the form of associations between conjuncts of values of attributes.

Ontology: An explicit formal specification of how to represent the objects, concepts, and other entities that are assumed to exist in some area of interest, and the relationships holding among them. Systems that share the same ontology are able to communicate about a domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with the definitions in the ontology.

Query Semantics: The meaning of a query with an information system as its domain of interpretation. The application of knowledge discovery and Chase in query evaluation makes semantics operational.

Semantics: The meaning of expressions written in some language, as opposed to their syntax, which describes how symbols may be combined independently of their meaning.
Hai Wang
Saint Mary's University, Canada
INTRODUCTION

In the data mining field, people have no doubt that high-level information (or knowledge) can be extracted from the database through the use of algorithms. However, a one-shot knowledge deduction is based on the assumption that the model developer knows the structure of the knowledge to be deduced. This assumption may not be valid in general. Hence, a general proposition for data mining is that, without human-computer interaction, any knowledge discovery algorithm (or program) will fail to meet the needs of a data miner who has a novel goal (Wang, S. & Wang, H., 2002). Recently, interactive visual data mining techniques have opened new avenues in the data mining field (Chen, Zhu, & Chen, 2001; de Oliveira & Levkowitz, 2003; Han, Hu, & Cercone, 2003; Shneiderman, 2002; Yang, 2003).

Interactive visual data mining differs from traditional data mining, standalone knowledge deduction algorithms, and one-way data visualization in many ways. Briefly, interactive visual data mining is human-centered and is implemented through knowledge discovery loops coupled with human-computer interaction and visual representations. Interactive visual data mining attempts to extract unsuspected and potentially useful patterns from the data for data miners with novel goals, rather than to use the data to derive certain information based on an a priori human knowledge structure.

BACKGROUND

[...] pattern depends on the data miner and does not solely depend on the statistical strength of the pattern. Second, heuristic search in combinatorial spaces built on computer and human interaction is useful for effective knowledge discovery. One strategy for effective knowledge discovery is the use of human-computer collaboration.

One technique used for human-computer collaboration in the business information systems field is data visualization (Bell, 1991; Montazami & Wang, 1988), which is particularly relevant to data mining (Keim & Kriegel, 1996; Wang, 2002). From the human side of data visualization, graphics cognition and problem solving are the two major concepts of data visualization. It is a commonly accepted principle that visual perception is compounded out of processes in a way that is adaptive to the visual presentation and the particular problem to be solved (Kosslyn, 1980; Newell & Simon, 1972).

MAIN THRUST

The major components of interactive visual data mining, and their functions that make data mining more effective, are the current research theme in this field. Wang, S. and Wang, H. (2002) have developed a model of interactive visual data mining for human-computer collaboration in knowledge discovery. According to this model, an interactive visual data mining system has three components on the computer side, besides the database: the data visualization instrument, the data and model assembly, and the human-computer interface.
Interactive Visual Data Mining
[...] features in the high-dimensional input vectors using a low-dimensional space for representation. These low-dimensional presentations can be viewed and interpreted by humans in discovering knowledge (Wang, 2000).

Data and Model Assembly

The data and model assembly is a set of query functions that assemble the data and the data visualization instruments for data mining. Query tools are characterized by the structured query language (SQL), the standard query language for relational database systems. To support human-computer collaboration effectively, query processing is necessary in data mining. As the ultimate objective of data retrieval and presentation is the formulation of knowledge, it is difficult to create a single standard query language for all purposes of data mining. Nevertheless, the following functionalities can be implemented through the design of queries that support the examination of the relevancy, usefulness, interestingness, and novelty of extracted knowledge.

1. Schematics Examination: Through this query function, the data miner is allowed to set different values for the parameters of the data visualization instrument to perceive various schematic visual presentations.

2. Consistency Examination: To cross-check the data mining results, the data miner may choose different sets of data from the database to check whether the conclusion from one set of data is consistent with the others. This query function allows the data miner to make such a consistency examination.

3. Relevancy Examination: It is a fundamental law that, to validate a data mining result, one must use external data, which are not used in generating this result but are relevant to the problem being investigated. For instance, the data of customer attributes can be used for clustering to identify significant market segments for the company. However, to determine whether the market segments are relevant to a particular product, one must use separate product survey data. This query function allows the data miner to use various external data to examine the data mining results.

4. Dependability Examination: The concept of dependability examination in interactive visual data mining is similar to that of factor analysis in traditional statistical analysis, but the dependability examination query function is more comprehensive in determining whether a variable contributes to the data mining results in a certain way.

5. Homogeneousness Examination: Knowledge formulation often needs to identify the ranges of values of a determinant variable so that observations with values in a certain range of this variable have a homogeneous behavior. This query function provides an interactive mechanism for the data miner to decompose variables for homogeneousness examination.

Human-Computer Interface

The human-computer interface allows the data miner to dialog with the computer. It integrates the database, the data visualization instruments, and the data and model assembly into a single computing environment. Through the human-computer interface, the data miner is able to access the data visualization instruments, select data sets, invoke the query process, organize the screen, set colors and animation speed, and manage the intermediate data mining results.

FUTURE TRENDS

Interactive visual data mining techniques will become key components of data mining instruments. More theories and techniques of interactive visual data mining will be developed in the near future, followed by comprehensive comparisons of these theories and techniques. Query systems along with data visualization functions on large-scale database systems for data mining will become available for data mining practitioners.

CONCLUSION

Given the fact that a one-shot knowledge deduction may not provide an alternative result if it fails, we must provide an integrated computing environment for the data miner through interactive visual data mining. An interactive visual data mining system consists of three intertwined components, besides the database: the data visualization instrument, the data and model assembly instrument, and the human-computer interface. In interactive visual data mining, the human-computer interaction and effective visual presentations of multivariate data allow the data miner to interpret the data mining results based on the particular problem domain, his/her perception, specialty, and creativity. The ultimate objective of interactive visual data mining is to allow the data miner to conduct the experimental process and examination simultaneously through human-computer collaboration in order to obtain a satisfactory result.
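The consistency-examination query function described above can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the range-based partitioning, and the use of a single summary statistic as the "conclusion" are hypothetical, not part of the model in the article.

```python
def consistency_examination(dataset, partitions, mine):
    """Run the same mining routine `mine` on different chosen subsets
    of the data, so the data miner can check whether the conclusion
    drawn from one subset is consistent with the others."""
    results = [mine(dataset[lo:hi]) for lo, hi in partitions]
    spread = max(results) - min(results)  # crude inconsistency signal
    return results, spread
```

A large spread would prompt the data miner to inspect the diverging subsets visually rather than accept the pattern.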
Giorgio Terracina
Università della Calabria, Italy
Domenico Ursino
Università Mediterranea di Reggio Calabria, Italy
Interscheme Properties Role in Data Warehouses
into account how much they are related. The formulas for computing this distance are quite complex; due to space limitations, we cannot show them here. However, the interested reader can refer to De Meo, Quattrone, Terracina, and Ursino (2003) for a detailed illustration of them.

We can now define the neighborhood of an x-component. In particular, given an x-component xS of an XML schema S, the neighborhood of level j of xS consists of all x-components of S whose semantic distance from xS is less than or equal to j.

In order to verify whether two x-components x1j, belonging to an XML schema S1, and x2k, belonging to an XML schema S2, are synonymous, it is necessary to examine their neighborhoods. More specifically, it is first necessary to verify whether their nearest neighborhoods (i.e., the neighborhoods of level 0) are similar. This decision is made by computing a suitable objective function associated with the maximum weight matching on a bipartite graph constructed from the x-components of the neighborhoods under consideration and the lexical synonymies stored in a thesaurus (e.g., WordNet)3. If these two neighborhoods are similar, then x1j and x2k are assumed to be synonymous.

However, observe that the neighborhoods of level 0 of x1j and x2k provide quite a limited vision of their contexts. If a higher certainty on the synonymy between x1j and x2k is required, it is necessary to verify the similarity not only of their neighborhoods of level 0 but also of the other neighborhoods. As a consequence, it is possible to introduce a severity level u against which interschema properties can be determined, and to say that x1j and x2k are synonymous with a severity level u if all neighborhoods of x1j and x2k of a level lesser than or equal to u are similar.

After all synonymies of S1 and S2 have been extracted, homonymies can be derived. In particular, there exists a homonymy between two x-components x1j and x2k with a severity level u if: (1) x1j and x2k have the same name; (2) both of them are elements or both of them are attributes; and (3) they are not synonymous with a severity level u. In other words, a homonymy indicates that two concepts having the same name represent different meanings.

Due to space constraints, we cannot describe in this article the derivation of all the other interschema properties mentioned; however, it follows the same philosophy as the detection of synonymies and homonymies. The interested reader can find a detailed description of it in Ursino (2002).

Construction of a Uniform Representation

Detected interschema properties can be exploited for constructing a global representation of the involved information sources; this becomes the core of the reconciled data level in a three-level DW.

Generally, in classical approaches, this global representation is obtained by integrating all involved data sources into a unique one. However, when the involved sources are numerous and large, a unique global schema presumably encodes an enormous number and variety of objects and becomes far too complex to be used effectively.

In order to overcome the drawbacks mentioned previously, our approach does not directly integrate the involved source schemas to construct a global flat schema. Rather, it first groups them into homogeneous clusters and then integrates schemas on a cluster-by-cluster basis. Each integrated schema thus obtained is then abstracted to construct a global schema representing the cluster. The aforementioned process is iterated over the set of obtained cluster schemas until one schema is left. In this way, a hierarchical structure is obtained, which is called a Data Repository (DR).

Each cluster of a DR represents a group of homogeneous schemas and is, in turn, represented by a schema (hereafter called a C-schema). Clusters of level n of the hierarchy are obtained by grouping some C-schemas of level n-1; clusters of level 0 are obtained by grouping the input source schemas. Therefore, each cluster Cl is characterized by (1) its identifier C-id; (2) its C-schema; (3) the group of identifiers of the clusters whose C-schemas originated the C-schema of Cl (hereafter called O-identifiers); (4) the set of interschema properties involving objects belonging to the C-schemas that originated the C-schema of Cl; and (5) a level index.

It is clear from this reasoning that the three fundamental operations for obtaining a DR are (1) schema clustering (Han & Kamber, 2001), which takes a set of schemas as input and groups them into semantically homogeneous clusters; (2) schema integration, which produces a global schema from a set of heterogeneous input schemas; and (3) schema abstraction, which groups the concepts of a schema into homogeneous clusters and, in the abstracted schema, represents each cluster with only one concept.

Exploitation of the Uniform Representation for Constructing a DW

The Data Repository can be exploited as the core structure of the reconciled level of a new three-level DW architecture. Indeed, differently from classical three-level architectures, in order to reconcile data, we do not directly integrate the involved schemas to construct a flat global schema. Rather, we first collect subsets of the involved schemas into homogeneous clusters and construct a DR that is used as the core of the reconciled data level.
In order to pinpoint the differences between classical three-level DW architectures and ours, the following observations can be drawn: [...] possibly relative to different, yet related and complementary, application contexts.
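The cluster-integrate-abstract loop that produces a Data Repository can be sketched as follows. The three operations are passed in as parameters because the article delegates their internals to the cited works; the toy implementations in the test below are purely illustrative, not the actual algorithms.

```python
def build_data_repository(schemas, cluster, integrate, abstract):
    """Iterate schema clustering, cluster-by-cluster integration,
    and abstraction over the current level of schemas until one
    schema is left; returns all levels (level 0 = input sources).
    Assumes `cluster` reduces the number of schemas at each step."""
    levels = [list(schemas)]
    while len(levels[-1]) > 1:
        levels.append([abstract(integrate(group))
                       for group in cluster(levels[-1])])
    return levels
```

The returned list of levels is exactly the hierarchical structure of the DR: each entry of a level is a C-schema standing for one cluster of the level below.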
Gal, A., Anaby-Tavor, A., Trombetta, A., & Montesi, D. (2004). A framework for modeling and evaluating automatic semantic reconciliation. The International Journal on Very Large Databases [forthcoming].

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.

Hunt, E., Atkinson, M.P., & Irving, R.W. (2002). Database indexing for large DNA and protein sequence collections. The International Journal on Very Large Databases, 11(3), 256-271.

Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic schema matching with Cupid. Proceedings of the International Conference on Very Large Data Bases (VLDB 2001), Rome, Italy.

McBrien, P., & Poulovassilis, A. (2003). Data integration by bi-directional schema transformation rules. Proceedings of the International Conference on Data Engineering (ICDE 2003), Bangalore, India.

Melnik, S., Garcia-Molina, H., & Rahm, E. (2002). Similarity flooding: A versatile graph matching algorithm and its application to schema matching. Proceedings of the International Conference on Data Engineering (ICDE 2002), San José, California.

Palopoli, L., Pontieri, L., Terracina, G., & Ursino, D. (2000). Intensional and extensional integration and abstraction of heterogeneous databases. Data & Knowledge Engineering, 35(3), 201-237.

Palopoli, L., Saccà, D., Terracina, G., & Ursino, D. (2003). Uniform techniques for deriving similarities of objects and subschemas in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 15(2), 271-294.

Palopoli, L., Terracina, G., & Ursino, D. (2001). A graph-based approach for extracting terminological properties of elements of XML documents. Proceedings of the International Conference on Data Engineering (ICDE 2001), Heidelberg, Germany.

Rahm, E., & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The International Journal on Very Large Databases, 10(4), 334-350.

Ursino, D. (2002). Extraction and exploitation of intensional knowledge from heterogeneous information sources: Semi-automatic approaches and tools. Springer.

KEY TERMS

Assertion Between Knowledge Patterns: A particular interschema property. It indicates either a subsumption or an equivalence between knowledge patterns. Roughly speaking, knowledge patterns can be seen as views on the involved information sources.

Data Repository: A complex catalogue of a set of sources organizing both their description and all associated information at various abstraction levels.

Homonymy: A particular interschema property. A homonymy between two concepts A and B indicates that they have the same name but different meanings.

Hyponymy/Hypernymy: A particular interschema property. A concept A is said to be a hyponym of a concept B (which, in turn, is a hypernym of A) if A has a more specific meaning than B.

Interschema Properties: Terminological and structural relationships involving concepts belonging to different sources.

Overlapping: A particular interschema property. An overlapping exists between two concepts A and B if they are neither synonyms nor hyponyms of each other but share a significant set of properties; more formally, there exists an overlapping between A and B if there exist non-empty sets of properties {pA1, pA2, …, pAn} of A and {pB1, pB2, …, pBn} of B such that, for 1 ≤ i ≤ n, pAi is a synonym of pBi.

Schema Abstraction: The activity that clusters the objects belonging to a schema into homogeneous groups and produces an abstracted schema obtained by substituting each group with one single object representing it.

Schema Integration: The activity by which different input source schemas are merged into a global structure representing all of them.

Subschema Similarity: A particular interschema property. It represents a similitude between fragments of different schemas.

Synonymy: A particular interschema property. A synonymy between two concepts A and B indicates that they have the same meaning.

Type Conflict: A particular interschema property. It indicates that the same concept is represented by different constructs (e.g., an element and an attribute in an XML source) in different schemas.
ENDNOTES

1. Here and in the following, we shall consider a three-level data warehouse architecture.
2. Semantic distance is often called connection cost in the literature.
3. Clearly, if necessary, a more specific thesaurus, possibly constructed with the support of a human expert, might be used.
Tharam Dillon
University of Technology Sydney, Australia
Inter-Transactional Association Analysis for Prediction
a number of attributes that describe the context within Normalized Extended Item (Transaction)
which the transaction happens. We call them dimen- Sets
sional attributes, because, together, these attributes
constitute a multi-dimensional space, and each transac- We call an extended itemset a normalized extended itemset,
tion can be mapped to a certain point in this space. if all its extended items are positioned with respect to the
Basically, dimensional attributes can be of any kind, as smallest reference point of the set. In other words, the
long as they are meaningful to applications. Time, dis- extended items in the set have the minimal relative dis-
tance, temperature, latitude, and so forth are typical di- tance 0 for each dimension. Formally, let Ie = {(d1,1, d1,2,
mensional attributes. (i ), (d2,1, d2,2, , d2,m)(i 2), , (dk,1, dk,2, , dk,m)(ik)} be an
,~d1,m) 1
extended itemset. Ie is a normalized extended itemset, if
Multidimensional Contexts and only if for j (1 j k) i (1 i m), min (dj, i) = 0.
The normalization concept can be applied to an ex-
An m-dimensional mining context can be defined through tended transaction set as well. We call an extended trans-
m dimensional attributes a1, a2, ..., am, each of which represents a dimension. When m=1, we have a single-dimensional mining context. Let ni = (ni.a1, ni.a2, ..., ni.am) and nj = (nj.a1, nj.a2, ..., nj.am) be two points in an m-dimensional space, whose values on the m dimensions are represented as ni.a1, ni.a2, ..., ni.am and nj.a1, nj.a2, ..., nj.am, respectively. Two points ni and nj are equal if and only if, for all k (1 ≤ k ≤ m), ni.ak = nj.ak. A relative distance between ni and nj is defined as Δ⟨ni, nj⟩ = (nj.a1−ni.a1, nj.a2−ni.a2, ..., nj.am−ni.am). We also use the notation (d1, d2, ..., dm), where dk = nj.ak−ni.ak (1 ≤ k ≤ m), to represent the relative distance between two points ni and nj in the m-dimensional space.

Besides the absolute representation (ni.a1, ni.a2, ..., ni.am) for point ni, we also can represent it by indicating its relative distance Δ⟨n0, ni⟩ from a certain reference point n0 (i.e., ni = n0+Δ⟨n0, ni⟩). Note that ni, Δ⟨n0, ni⟩, and (ni.a1−n0.a1, ni.a2−n0.a2, ..., ni.am−n0.am) can be used interchangeably, since each of them refers to the same point ni in the space. Let N = {n1, n2, ..., nu} be a set of points in an m-dimensional space. We construct the smallest reference point of N, n*, where for all k (1 ≤ k ≤ m), n*.ak = min(n1.ak, n2.ak, ..., nu.ak).

Extended Items (Transactions)

The traditional concepts regarding item and transaction can be extended accordingly under an m-dimensional context. We call an item ik ∈ I happening at the point (d1, d2, ..., dm) (i.e., at the point (n0.a1+d1, n0.a2+d2, ..., n0.am+dm)) an extended item and denote it as Δ(d1, d2, ..., dm)(ik). In a similar fashion, we call a transaction tk ∈ T happening at the point (d1, d2, ..., dm) an extended transaction and denote it as Δ(d1, d2, ..., dm)(tk). The set of all possible extended items, IE, is defined as the set of Δ(d1, d2, ..., dm)(ik) for any ik ∈ I at all possible points (d1, d2, ..., dm) in the m-dimensional space. TE is the set of all extended transactions, each of which contains a set of extended items, in the mining context.

We call an extended transaction set a normalized extended transaction set if all its extended transactions are positioned with respect to the smallest reference point of the set. Any non-normalized extended item (transaction) set can be transformed into a normalized one through a normalization process, where the intention is to reposition all the involved extended items (transactions) based on the smallest reference point of this set. We use INE and TNE to denote the set of all possible normalized extended itemsets and normalized extended transaction sets, respectively. According to the above definitions, any superset of a normalized extended item (transaction) set is also a normalized extended item (transaction) set.

Multidimensional Intertransactional Association Rule Framework

With the above extensions, we are now in a position to formally define intertransactional association rules and related measurements.

Definition 1

A multidimensional intertransactional association rule is an implication of the form X → Y, where

(1) X ⊆ INE and Y ⊆ IE;
(2) the extended items in X and Y are positioned with respect to the same reference point;
(3) for all Δ(x1, x2, ..., xm)(ix) ∈ X and all Δ(y1, y2, ..., ym)(iy) ∈ Y, xj ≤ yj (1 ≤ j ≤ m);
(4) X ∩ Y = ∅.

Different from classical intratransactional association rules, intertransactional association rules capture the occurrence contexts of associated items.
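The normalization process described above, which repositions all extended items with respect to the smallest reference point of the set, can be sketched as follows. This is a minimal illustration; the tuple-based representation and the function name are my own, not the article's.

```python
def normalize(extended_items):
    """Reposition extended items with respect to the smallest reference
    point of the set (the component-wise minimum of all offsets)."""
    # Each extended item is (offsets, item); ((1, 2), 'a') stands for Delta(1,2)(a).
    dims = len(extended_items[0][0])
    ref = tuple(min(off[k] for off, _ in extended_items) for k in range(dims))
    normalized = [(tuple(o - r for o, r in zip(off, ref)), item)
                  for off, item in extended_items]
    return ref, normalized

# two extended items in a 2-dimensional mining context
ref, norm = normalize([((3, 5), 'a'), ((1, 2), 'b')])
# ref is (1, 2); the items become ((2, 3), 'a') and ((0, 0), 'b')
```

Running the sketch on offsets (3, 5) and (1, 2) yields the reference point (1, 2) and the repositioned offsets (2, 3) and (0, 0).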
Inter-Transactional Association Analysis for Prediction
two steps: frequent extended itemset discovery and association rule generation.

1. Frequent Extended Itemset Discovery

In this phase, we find the set of all frequent extended itemsets. For simplicity, in the following, we use itemset and extended itemset, transaction and extended transaction, interchangeably. Let Lk represent the set of frequent k-itemsets and Ck the set of candidate k-itemsets. The algorithm makes multiple passes over the database. Each pass consists of two phases. First, the set of all frequent (k-1)-itemsets, Lk-1, found in the (k-1)th pass is used to generate the candidate itemset Ck. The candidate generation procedure ensures that Ck is a superset of the set of all frequent k-itemsets. The algorithm then scans the database. For each list of consecutive transactions, it determines which candidates in Ck are contained and increments their counts. At the end of the pass, Ck is examined to check which of the candidates actually are frequent, yielding Lk. The algorithm terminates when Lk becomes empty. In the following, we detail the procedures for candidate generation and support counting.

Candidate Generation

In the join phase, for p, q ∈ Lk-1, we have

insert into Ck
select p.Δu1(i1), ..., p.Δuk-1(ik-1), q.Δvk-1(jk-1)
from p in Lk-1, q in Lk-1
where (i1=j1 ∧ u1=v1) ∧ ... ∧ (ik-2=jk-2 ∧ uk-2=vk-2) ∧ (uk-1<vk-1 ∨ (uk-1=vk-1 ∧ ik-1<jk-1))

Next, in the prune phase, we delete all those extended itemsets in Ck that have some (k-1)-subsets with supports less than the support threshold.

Support Counting

To facilitate an efficient support counting process, a candidate set Ck of k-itemsets is divided into k groups, with each group Go containing the itemsets that have o items whose intervals are 0 (1 ≤ o ≤ k). For example, a candidate set of 3-itemsets

C3 = { {Δ0(a), Δ1(a), Δ2(b)}, {Δ0(c), Δ0(d), Δ2(d)}, {Δ0(a), Δ0(b), Δ3(h)}, {Δ0(l), Δ0(m), Δ0(n)}, {Δ0(p), Δ0(q), Δ0(r)} }

is divided into three groups:

G1 = { {Δ0(a), Δ1(a), Δ2(b)} }, G2 = { {Δ0(c), Δ0(d), Δ2(d)}, {Δ0(a), Δ0(b), Δ3(h)} }, G3 = { {Δ0(l), Δ0(m), Δ0(n)}, {Δ0(p), Δ0(q), Δ0(r)} }

Each group is stored in a modified hash-tree. Only those items with interval 0 participate in the construction of this hash tree (e.g., in G2, only {Δ0(c), Δ0(d)} and {Δ0(a), Δ0(b)} enter the hash-tree). The construction process is similar to that of Apriori. The rest of the items, Δ2(d) and Δ3(h), are simply attached to the corresponding itemsets, {Δ0(c), Δ0(d)} and {Δ0(a), Δ0(b)}, respectively, in the leaves of the tree.

Upon reading one transaction of the database, every hash tree is tested. If one itemset is contained, its attached items whose intervals are larger than 0 will be checked against the consecutive transactions. In the previous example, if {Δ0(a), Δ0(b)} exists in the current extended transaction at the point s, then the extended transaction at the point s+3 will be scanned to see whether it contains item h. If so, the support of the 3-itemset {Δ0(a), Δ0(b), Δ3(h)} will increase by 1.

Association Rule Generation

Using the sets of frequent itemsets, we can find the desired intertransactional association rules. The generation of intertransactional association rules is similar to the generation of the classical association rules.

Application of Intertransactional Association Rules to Weather Prediction

We apply the previous algorithm to studying Hong Kong meteorological data. The database records wind direction, wind speed, temperature, relative humidity, rainfall, mean sea level pressure, and so forth every six hours each day. In some data records, certain atmospheric observations, such as relative humidity, are missing. Since the context constituted by the time dimension is valid for the whole set of meteorological records (transactions), we fill in these empty fields by averaging their nearby values. In this way, the data to be mined contains no missing fields; in addition, no database holes (i.e., meaningless contexts) exist in the mining space. Essentially, there is one dimension in this case; namely, time. After preprocessing the data set, we discover intertransactional association rules from the 1995 meteorological records and then examine their prediction accuracy using the 1996 meteorological data from the same area in Hong Kong. Considering seasonal changes of weather, we extract records from May to October for our experiments, totaling 736 records (total_days * record_num_per_day = (31+30+31+31+30+31) * 4 = 736) for each year. These raw data sets, containing continuous data, are further converted into appropriate formats with which the algorithms can work. Each record has six meteorological elements (items). The interval of every two consecutive records is six hours. We set maxspan=11 in order to detect the association rules for a three-day horizon (i.e., (11+1) / 4 = 3).

At support=45% and confidence=92%, from the 1995 meteorological data, we found only one classical association rule: if the humidity is medium wet, then there is no rain at the same time (which is quite obvious), but we found 580 intertransactional association rules. Note that the number of intertransactional association rules returned depends on the maxspan parameter setting. Table 1 lists some significant intertransactional association rules found from the single-station meteorological data. We measure their predictive capabilities using the 1996 meteorological data recorded by the same station through Prediction-Rate(X→Y) = sup(X∪Y) / sup(X), which can achieve more than a 90% prediction rate. From the test results, we find that, with intertransactional association rules, more comprehensive and interesting knowledge can be discovered from the databases.
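The join and prune phases of candidate generation described earlier can be sketched as follows for a single-dimensional context. This is a simplified reading of the insert/select statement, with extended items modeled as sorted (interval, item) pairs; the helper is my own, not the published algorithm.

```python
from itertools import combinations

def generate_candidates(frequent):
    """Join frequent (k-1)-itemsets that share their first k-2 extended
    items, then prune candidates with an infrequent (k-1)-subset.
    An extended itemset is a sorted tuple of (interval, item) pairs."""
    prev = set(frequent)
    candidates = set()
    for p in frequent:
        for q in frequent:
            # join: identical prefixes, last extended items in increasing order
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # prune: every (k-1)-subset of a candidate must itself be frequent
    return {c for c in candidates
            if all(s in prev for s in combinations(c, len(c) - 1))}

L2 = [((0, 'a'), (0, 'b')), ((0, 'a'), (3, 'h')), ((0, 'b'), (3, 'h'))]
C3 = generate_candidates(L2)
# yields {((0, 'a'), (0, 'b'), (3, 'h'))}
```

The comparison p[-1] < q[-1] on (interval, item) pairs mirrors the where clause: uk-1 < vk-1, or uk-1 = vk-1 and ik-1 < jk-1.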
Table 1. Some significant intertransactional association rules found from meteorological data
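Support over consecutive transactions and the Prediction-Rate measure can be sketched as follows with a toy single-dimensional database. The brute-force window scan here stands in for the hash-tree procedure described above; the function names and data are illustrative assumptions.

```python
def support(extended_itemset, transactions, maxspan):
    """Count the windows of consecutive transactions containing the
    extended itemset; an item (d, i) must occur d slots after the start."""
    count = 0
    for s in range(len(transactions)):
        if all(d <= maxspan and s + d < len(transactions)
               and i in transactions[s + d]
               for d, i in extended_itemset):
            count += 1
    return count

def prediction_rate(x, y, transactions, maxspan):
    """Prediction-Rate(X -> Y) = sup(X u Y) / sup(X)."""
    return support(x | y, transactions, maxspan) / support(x, transactions, maxspan)

# toy sequence of transactions, one per six-hour slot
db = [{'a', 'b'}, {'c'}, {'c'}, {'h'}, {'a', 'b'}, {'x'}, {'x'}, {'h'}]
x = {(0, 'a'), (0, 'b')}
y = {(3, 'h')}
rate = prediction_rate(x, y, db, maxspan=11)
# both occurrences of {a, b} are followed by h three slots later, so rate is 1.0
```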
Intertransactional Association: Correlations among items not only within the same transactions but also across different transactions.

Normalized Extended Transaction Set: A set of extended transactions whose contextual positions have been positioned with respect to the smallest reference point of the set.
Rui Yan
Saint Mary's University, Canada
Mofreh Hogo
Czech Technical University, Czech Republic
Chad West
IBM Canada Limited, Canada
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Interval Set Representations of Clusters
(Figure labels: Cluster A, Cluster B, Cluster C; an equivalence class; lower approximation; actual set; upper approximation.)
(A(X), Ā(X)) also provides a set-theoretic interval for the set X. Figure 2 illustrates the lower and upper approximation of a set.

MAIN THRUST

Interval Set Clustering

Rough sets were originally used for supervised learning. There are an increasing number of research efforts on clustering in relation to rough-set theory (do Prado, Engel, & Filho, 2002; Hirano & Tsumoto, 2003; Peters, Skowron, Suraj, Rzasa, & Borkowski, 2002). Lingras (2001) developed a rough-set representation of clusters. Figure 3 shows how the 12 objects from Figure 1 could be clustered by using rough sets. Instead of Object 9 belonging to any one cluster, it belongs to the upper bounds of Clusters B and C. Similarly, Object 4 belongs to the upper bounds of Clusters A and B.

Lingras (2001; Lingras & West, 2004; Lingras, Hogo, & Snorek, 2004) proposed three different approaches for unsupervised creation of rough or interval set representations of clusters: evolutionary, statistical, and neural. Lingras (2001) described how a rough-set theoretic clustering scheme could be represented by using a rough-set genome. The rough-set genome consists of one gene per object. The gene for an object is a string of bits that describes which lower and upper approximations the object belongs to. The string for a gene can be partitioned into two parts, lower and upper, as shown in Figure 4 for three clusters. Both lower and upper parts of the string consist of three bits each. The ith bit in the lower/upper string tells whether the object is in the lower/upper approximation of the ith cluster. Figure 4 shows examples of all the valid genes for three clusters. An object represented by g1 belongs to the upper bounds of the first and second clusters. An object represented by g6 belongs to the lower and upper bounds of the second cluster. Any other value not given by g1 to g7 is not valid. The objective of the genetic algorithms (GAs) is to minimize the within-group-error. Lingras provided a formulation of within-group-error for rough-set based clustering. The resulting GAs were used to evolve interval clustering of highway sections. Lingras (2002) applied the unsupervised rough-set clustering based on GAs for grouping Web users. However, the clustering process based on GAs seemed computationally expensive for scaling to larger datasets.

The K-means algorithm is one of the most popular statistical techniques for conventional clustering (Hartigan & Wong, 1979). Lingras and West (2004) provided a theoretical and experimental analysis of a modified K-means clustering based on the properties of rough sets. It was used to create interval set representations
(Figure labels: lower bound of Group A; lower bound of Group B; boundary area of Groups B and C; numbered objects 1-10.)
occur. The algorithm of Rhee and Hwang is based on the fact that when updating the cluster centers, higher membership values should contribute more than memberships with smaller values.

Rough-set theory and fuzzy-set theory complement each other (Shi, Shen, & Liu, 2003). It is possible to create interval clusters based on the fuzzy memberships obtained by using the fuzzy C-means algorithm described in the previous section. Let 1 ≥ α > β ≥ 0. An object v belongs in the lower bound of cluster i if its membership in the cluster is more than α. Similarly, if its membership in cluster i is greater than β, the object belongs in the upper bound of cluster i. Because 1 ≥ α > β ≥ 0, if an object belongs to the lower bound of a cluster, it will also belong to its upper bound. Lingras and Yan (2004) describe further conditions on α and β that ensure satisfaction of other important properties of rough sets.

FUTURE TRENDS

Temporal data mining is an application of data-mining techniques to data that takes the time dimension into account. Temporal data mining is assuming increasing importance. Much of the temporal data-mining tasks are related to the use and analysis of temporal sequences of raw data. There is little work that analyzes the results of data mining over a period of time. Changes in cluster characteristics of objects, such as supermarket customers or Web users, over a period of time can be useful in data mining. Such an analysis can be useful for formulating marketing strategies. Marketing managers may want to focus on specific groups of customers. Therefore, they may need to understand the migrations of the customers from one group to another group. The marketing strategies may depend on the desirability of these cluster migrations. The overlapping clusters created by using interval or fuzzy set clustering can be especially useful in such studies. Overlapping clusters make it possible for an object to transition from the core of a cluster to the core of another cluster through the overlapping region. When an object moves from the core of a cluster to an overlapping region, it may be possible to provide an early warning that can trigger an appropriate marketing campaign.

CONCLUSION

Clusters in data mining tend to have fuzzy and rough boundaries. An object may potentially belong to more than one cluster. Interval or fuzzy set representation enables modeling of such overlapping clusters. This article briefly describes several approaches to creating interval and fuzzy set representations of clusters. The interval set clustering is based on the theory of rough sets. Changes in clusters can provide important clues about the changing nature of the usage of a facility as well as the changing nature of its users. Use of fuzzy and interval set clustering also adds an interesting dimension to cluster migration studies. Due to the rough boundaries of interval clusters, it may be possible to get early warnings of potential significant changes in clustering patterns.

REFERENCES

Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum Press.

Do Prado, H. A., Engel, P. M., & Filho, H. C. (2002). Rough clustering: An alternative to finding meaningful clusters by using the reducts from a dataset. In J. Alpigini, J. F. Peters, A. Skowron, & N. Zhong (Eds.), Proceedings of the Symposium on Rough Sets and Current Trends in Computing: Vol. 2475. Springer-Verlag.

Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, 100-108.

Hirano, S., & Tsumoto, S. (2003). Dealing with relative proximity by rough clustering. Proceedings of the 22nd International Conference of the North American Fuzzy Information Processing Society (pp. 260-265).

Kohonen, T. (1988). Self-organization and associative memory. Berlin: Springer-Verlag.

Krishnapuram, R., Joshi, A., Nasraoui, O., & Yi, L. (2001). Low-complexity fuzzy relational clustering algorithms for Web mining. IEEE Transactions on Fuzzy Systems, 9(4), 595-607.

Lingras, P. (2001). Unsupervised rough set classification using GAs. Journal of Intelligent Information Systems, 16(3), 215-228.

Lingras, P. (2002). Rough set clustering for Web mining. Proceedings of the IEEE International Conference on Fuzzy Systems.

Lingras, P., Hogo, M., & Snorek, M. (2004). Interval set clustering of Web users using modified Kohonen self-organizing maps based on the properties of rough sets. Web Intelligence and Agent Systems: An International Journal.
Lingras, P., & West, C. (2004). Interval set clustering of Web users with rough k-means. Journal of Intelligent Information Systems, 23(1), 5-16.

Lingras, P., & Yan, R. (2004). Interval clustering using fuzzy and rough set theory. Proceedings of the 23rd International Conference of the North American Fuzzy Information Processing Society (pp. 780-784), Canada.

Pawlak, Z. (1992). Rough sets: Theoretical aspects of reasoning about data. New York: Kluwer Academic.

Peters, J. F., Skowron, A., Suraj, Z., Rzasa, W., & Borkowski, M. (2002). Clustering: A rough set approach to constructing information granules, soft computing and distributed processing. Proceedings of the Sixth International Conference (pp. 57-61).

Rhee, F. C. H., & Hwang, C. (2001). A type-2 fuzzy c-means clustering algorithm. Proceedings of the IFSA World Congress and the 20th International Conference of the North American Fuzzy Information Processing Society (Vol. 4, pp. 1926-1929).

Shi, H., Shen, Y., & Liu, Z. (2003). Hyperspectral bands reduction based on rough sets and fuzzy C-means clustering. Proceedings of the 20th IEEE Instrumentation and Measurement Technology Conference (Vol. 2, pp. 1053-1056).

Skowron, A., Stepaniuk, J., & Peters, J. F. (2003). Rough sets and infomorphisms: Towards approximation of relations in distributed environments. Fundamenta Informaticae, 54(2-3), 263-277.

Szmigielski, A., & Polkowski, L. (2003). Computing from words via rough mereology in mobile robot navigation. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (Vol. 4, pp. 3498-3503).

Yao, Y. Y. (2001). Information granulation and rough set approximation. International Journal of Intelligent Systems, 16(1), 87-104.

KEY TERMS

Clustering: A form of unsupervised learning that divides a data set so that records with similar content are in the same group and groups are as different from each other as possible.

Evolutionary Computation: A solution approach guided by biological evolution that begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data.

Fuzzy C-Means Algorithms: Clustering algorithms that assign a fuzzy membership in various clusters to an object instead of assigning the object precisely to a cluster.

Fuzzy Membership: Instead of specifying whether an object precisely belongs to a set, fuzzy membership specifies a degree of membership in [0,1].

Interval Set Clustering Algorithms: Clustering algorithms that assign objects to lower and upper bounds of a cluster, making it possible for an object to belong to more than one cluster.

Interval Sets: If a set cannot be precisely defined, one can describe it in terms of a lower bound and an upper bound. The set will contain its lower bound and will be contained in its upper bound.

Rough Sets: Special types of interval sets created by using equivalence relations.

Self-Organization: A system structure that often appears without explicit pressure or involvement from outside the system.

Supervised Learning: A learning process in which the exemplar set consists of pairs of inputs and desired outputs. The process learns to produce the desired outputs from the given inputs.

Temporal Data Mining: An application of data-mining techniques to data that takes the time dimension into account.

Unsupervised Learning: Learning in the absence of external information on outputs.
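The conversion from fuzzy memberships to interval set clusters described in the main text, using thresholds α and β with 1 ≥ α > β ≥ 0, can be sketched as follows. The dictionary-based data layout is an assumption for illustration, not prescribed by the article.

```python
def interval_clusters(memberships, alpha, beta):
    """Convert fuzzy memberships into interval clusters: membership above
    alpha puts an object in the lower bound of a cluster, membership above
    beta (beta < alpha) puts it in the upper bound."""
    assert 1 >= alpha > beta >= 0
    n_clusters = len(next(iter(memberships.values())))
    lower = {i: set() for i in range(n_clusters)}
    upper = {i: set() for i in range(n_clusters)}
    for obj, degrees in memberships.items():
        for i, m in enumerate(degrees):
            if m > alpha:
                lower[i].add(obj)
            if m > beta:  # alpha > beta, so lower membership implies upper
                upper[i].add(obj)
    return lower, upper

u = {'v1': (0.9, 0.1), 'v2': (0.55, 0.45)}
lower, upper = interval_clusters(u, alpha=0.7, beta=0.4)
# v1 falls in the lower (and upper) bound of cluster 0;
# v2 lies in the boundary: upper bounds of both clusters, lower of neither
```

Because α > β, membership above α implies membership above β, so every lower bound is automatically contained in its upper bound, as the article requires.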
INTRODUCTION

Millions of people are suffering from fatal diseases such as cancer, AIDS, and many other bacterial and viral illnesses. The key issue now is how to design lifesaving and cost-effective drugs so that the diseases can be cured and prevented. It would also enable the provision of medicines in developing countries, where approximately 80% of the world population lives. Drug design is a discipline of extreme importance in chemoinformatics. Structure-activity relationship (SAR) and quantitative SAR (QSAR) are key drug discovery tasks.

During recent years great interest has been shown in kernel methods (KMs), which give state-of-the-art performance. The support vector machine (SVM) (Vapnik, 1995; Cristianini & Shawe-Taylor, 2000) is a well-known example. The building block of these methods is an entity known as the kernel. The nondependence of KMs on the dimensionality of the feature space and the flexibility of using any kernel function make them an optimal choice for different tasks, especially modeling SAR relationships and predicting the biological activity or toxicity of compounds. KMs have been successfully applied for classification and regression in pharmaceutical data analysis and drug design. KMs are not new to such tasks. They have been applied for applications in chemoinformatics since the late 1990s.

KMs and their utility for applications in chemoinformatics are the focus of the research presented in this article. These methods possess special characteristics that make them very attractive for tasks such as the induction of SAR/QSAR. KMs such as SVMs map the data into some higher dimensional feature space and train a linear predictor in this higher dimensional space. The kernel trick offers an effective way to construct such a predictor by providing an efficient method of computing the inner product between mapped instances in the feature space. One does not need to represent the instances explicitly in the feature space. The kernel function computes the inner product by implicitly mapping the instances to the feature space. These methods can handle very high-dimensional noisy data and can avoid overfitting. SVMs suffer from the drawback that models built with nonlinear kernel functions are difficult to interpret.

MAIN THRUST

I now present basic principles for the construction of SVMs and also explore empirical findings.
Kernel Methods in Chemoinformatics
and result in a loss of information. Furthermore, accurate prediction can speed up the drug design process.

In order to apply learning techniques such as KMs to chemometric data, the compounds are transformed into a form amenable to these techniques. Modeling SAR/QSAR analysis can be viewed as comprising two stages. In the first stage, descriptors (features) are extracted or computed, and molecules are transformed into vectors. The dimensionality of the vector space can be very high, whereas, generally, molecular datasets comprise only several tens to hundreds of compounds. In the second stage, an induction algorithm (SVC or SVR) is applied to learn an SAR/QSAR model. The similarity between two compounds is measured by the inner product between two vectors: the smaller the angle between the vectors, the greater the similarity. It is worth noting that kernel methods can be applied not only for the induction of models but also for feature extraction through specialized kernels.

I now describe the application of kernel methods to inducing structure-activity relationship models. I first focus on the classification task for chemometric data. Trotter, Buxton, and Holden (2001) studied the efficacy of an SVC at separating the compounds that will cross the blood/brain barrier from those that will not cross the barrier. SVC substantially improved on other machine-learning techniques, including neural networks and decision trees. In another study, a classification problem was formulated in order to predict the inhibition of dihydrofolate reductase by pyrimidines (Burbidge, Trotter, Holden, & Buxton, 2001). An SVC in conjunction with an RBF kernel was used to conduct experiments. Experimental results show that SVC outperforms the other classification techniques, including artificial neural networks, the C5.0 decision tree, and nearest neighbor.

Structural feature extraction is an important problem in chemoinformatics. Structural feature extraction refers to the problem of computing the features that describe the chemical structure. In order to compute structural features, Kramer, Frank, and Helma (2002) viewed compounds as labeled graphs where vertices depict atoms and edges describe bonds between them. The method performs automated construction of structural features of two-dimensionally represented compounds. Features are constructed by retrieving sequences of linearly connected atoms and bonds on the basis of some statistical criterion. The authors argue that SVMs' ability to handle high-dimensional data makes them attractive in this scenario, as the dimensionality of the feature space may be very large. The authors applied an SVC to predict the mutagenicity and carcinogenicity of compounds with promising results. In another study, an SVC in conjunction with structural features was applied to model mutagenicity structure-activity relationships from noncongeneric datasets (Helma, Cramer, Kramer, & De Raedt, in press). The performance of SVC was compared with the C4.5 decision tree learner and a rule learner. SVC and the rule learner showed excellent results.

In chemoinformatics, extracted features play a key role in SAR/QSAR analysis. The feature extraction module is constructed in such a way that the loss of information is at a minimum. This can make feature extraction tasks as complex and expensive as solving the entire problem. Kernel methods are an effective alternative to explicit feature extraction. Kernels that transform the compounds into feature vectors without explicitly representing them can be constructed. This can be achieved by viewing compounds as graphs (Kashima, Tsuda, & Inokuchi, 2003; Mahe, Ueda, Akutsu, Perret, & Vert, 2004). In this way, kernel methods can be utilized to generate or extract features for SAR/QSAR analysis. Kashima et al. proposed a kernel function that computes the similarity between compounds (labeled graphs). The compounds are implicitly transformed into feature vectors, where each entry of the feature vector represents the number of label paths. Label paths are generated by random walks on graphs. In order to perform binary classification of compounds, a voted kernel perceptron (Freund & Schapire, 1999) is employed, with promising results. Mahe et al. (2004) have improved the graph kernels in terms of efficiency (computational time) and classification accuracy. An SVC in conjunction with graph kernels was used to predict the mutagenicity of aromatic and heteroaromatic nitro compounds. SVC in conjunction with graph kernels validates the efficacy of KMs for feature extraction and the induction of SAR models.

I now focus on the regression task for chemometric data. In the chemical domain, datasets are generally characterized as having high dimensionality and few data points. Partial least squares (PLS), a regression method, is very useful for such scenarios as compared to linear least squares regression. Demiriz, Bennett, Breneman, and Embrechts (2001) applied an SVR for QSAR analysis. The authors performed feature selection by removing high-variance variables and induced a model by using a linear programming SVR. The experimental results show that SVR in conjunction with an RBF kernel outperforms PLS for chemometric datasets. PLS's ability to handle high-dimensional data makes it an optimal choice to combine with KMs. There exist two techniques to kernelize PLS (Bennett & Embrechts, 2003; Rosipal, Trejo, Matthews, & Wheeler, 2003). One technique is based on mapping the data into a higher dimensional space and constructing a linear regression function in that space (Rosipal, Trejo, Matthews, & Wheeler, 2003). Alternatively, PLS is kernelized by obtaining a low-rank approximation of the kernel matrix and computing a regression function based
As an example, I now present the results of a novel approach for modeling structure-activity relationships (Lodhi & Guo, 2002) based on the use of a special kernel, namely the Gram-Schmidt kernel (GSK) (Cristianini, Shawe-Taylor, & Lodhi, 2002). The kernel efficiently performs Gram-Schmidt orthogonalisation in a kernel-induced feature space. It is based on the idea of building a more informative kernel matrix, as compared to the Gram matrices constructed by standard kernels. An SVC in conjunction with GSK is used to perform QSAR analysis. The learning process of an SVC in conjunction with GSK comprises two stages. In the first stage, highly informative features are extracted in a kernel-induced feature space. In the next stage, a soft margin classifier is trained. In order to perform the analysis, the GSK algorithm requires a set of compounds, an underlying kernel function, and a number T, which specifies the dimensionality of the feature space. For the underlying kernel function, an RBF kernel is employed. GSK in conjunction with SVC is applied on a benchmark dataset (King, Muggleton, Lewis, & Sternberg, 1992) to predict the inhibition of dihydrofolate reductase by pyrimidines. The dataset contains 55 compounds that are divided into 5-fold cross-validation series. For each drug there are three positions of possible substitution, and the number of attributes for each substitution is nine. The experimental results show that SVC in conjunction with GSK achieves a lower classification error than the best reported results (Burbidge, Trotter, Holden, & Buxton, 2001). Burbidge et al. performed SAR analysis on this dataset by using a number of learning techniques, with SVC in conjunction with RBF kernels achieving a classification error of 0.1269. An SVC in conjunction with GSK improved these results, achieving a classification error of 0.1120 (Lodhi & Guo, 2002).

Kernel methods have strong theoretical foundations. These methods combine the principles of statistical learning theory and functional analysis. SVMs and other kernel methods have been applied in different domains, including text mining, face recognition, protein homology detection, and the analysis and classification of gene expression data, among many others, with great success. They have shown impressive performance in chemoinformatics. I believe that the growing popularity of machine-learning techniques, especially kernel methods in chemoinformatics, will lead to significant development in both disciplines.

REFERENCES

Bennett, K. P., & Embrechts, M. J. (2003). Advances in learning theory: Methods, models and applications. NATO Science Series III: Computer & Systems Science, 190, 227-250.

Boser, B. E., Guyon, I. M., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory (pp. 144-152).

Burbidge, R., Trotter, M., Holden, S., & Buxton, B. (2001). Drug design by machine learning: Support vector machines for pharmaceutical data. Computers and Chemistry, 26(1), 4-15.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge, UK: Cambridge University Press.

Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2002). Latent semantic kernels. Journal of Intelligent Information Systems, 18(2/3), 127-152.
Demiriz, A., Bennett, K. P., Breneman, C. M., & Embrechts, M. J. (2001). Support vector machine regression in chemometrics. Computing Science and Statistics.

Freund, Y., & Schapire, R. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277-296.

Haussler, D. (1999). Convolution kernels on discrete structures (Tech. Rep. No. UCSC-CRL-99-10). Santa Cruz: University of California, Computer Science Department.

Helma, C., Cramer, T., Kramer, S., & De Raedt, L. (in press). Data mining and machine learning techniques for identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of Chemical Information and Computer Systems.

Kashima, H., Tsuda, K., & Inokuchi, A. (2003). Marginalized kernels between labeled graphs. Proceedings of the 20th International Conference on Machine Learning.

King, R. D., Muggleton, S., Lewis, R. A., & Sternberg, M. J. E. (1992). Drug design by machine learning: The use of inductive logic programming to model the structure activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proceedings of the National Academy of Sciences, USA, 89 (pp. 11322-11326).

Kramer, S., Frank, E., & Helma, C. (2002). Fragment generation and support vector machines for inducing SARs. SAR and QSAR in Environmental Research, 13(5), 509-523.

Lodhi, H., & Guo, Y. (2002). Gram-Schmidt kernels applied to structure activity analysis for drug design. Proceedings of the Second ACM SIGKDD Workshop on Data Mining in Bioinformatics (pp. 37-42).

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419-444.

Mahe, P., Ueda, N., Akutsu, T., Perret, J.-L., & Vert, J.-P. (2004). Extension of marginalized graph kernels. Proceedings of the 21st International Conference on Machine Learning.

Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society London (A), 209, 415-446.

Rosipal, R., Trejo, L., Matthews, B., & Wheeler, K. (2003). Nonlinear kernel-based chemometric tools: A machine learning approach. Proceedings of the Third International Symposium on PLS and Related Methods (pp. 249-260).

Trotter, M., Buxton, B., & Holden, S. (2001). Support vector machines in combinatorial chemistry. Measurement and Control, 34(8), 235-239.

Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag.

Watkins, C. (2000). Dynamic alignment kernels. In P. J. Bartlett, B. Schölkopf, D. Schuurmans, & A. J. Smola (Eds.), Advances in large-margin classifiers (pp. 39-50). Cambridge, MA: MIT Press.

Wu, W., Massart, D. L., & de Jong, S. (1997). The kernel PCA algorithm for wide data. Part I: Theory and algorithms. Chemometrics and Intelligent Laboratory Systems, 36, 165-172.

KEY TERMS

Chemoinformatics: Storage, analysis, and drawing inferences from chemical information (obtained from chemical data) by using computational methods for drug discovery.

Kernel Function: A function that computes the inner product between mapped instances in a feature space. It is a symmetric, positive definite function.

Kernel Matrix: A matrix that contains almost all the information required by kernel methods. It is obtained by computing the inner product between n instances.

Machine Learning: A discipline that comprises the study of how machines learn from experience.

Margin: A real-valued function. The sign and magnitude of the margin give insight into the prediction of an instance. A positive margin indicates a correct prediction, whereas a negative margin indicates an incorrect prediction.

Quantitative Structure-Activity Relationship (QSAR): Illustrates quantitative relationships between chemical structures and the biological and pharmacological activity of chemical compounds.

Support Vector Machines (SVMs): SVMs (implicitly) map input examples into a higher dimensional feature space via a kernel function and construct a linear function in this space.
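The Kernel Function and Kernel Matrix entries above can be made concrete with a short sketch. The choice of an RBF kernel, the toy instances, and the NumPy implementation below are illustrative assumptions of this sketch, not material from the article:

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

def kernel_matrix(X, gamma=0.5):
    """n x n Gram matrix K[i, j] = k(x_i, x_j) over the n instances in X."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = rbf_kernel(X[i], X[j], gamma)
    return K

# Toy instances (one row each), e.g. numeric molecular descriptors.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
K = kernel_matrix(X)

# A valid kernel matrix is symmetric ...
assert np.allclose(K, K.T)
# ... and positive semidefinite (all eigenvalues >= 0, up to rounding).
assert np.linalg.eigvalsh(K).min() >= -1e-9
```

The two checks at the end mirror the defining properties of a kernel function stated above: symmetry and positive definiteness.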
Knowledge Discovery with Artificial Neural Networks

BACKGROUND
neuron in the network, and its connection with the consecutive neurons. Towell and Shavlik (1994) in particular see the connections between neurons as rules, and Andrews and Geva (1994) use networks with functions that allow a clear identification of the dominant inputs. Other approaches are the RULENEG (Rule-Extraction from Neural Networks by Step-wise Negation) (Pop, Hayward, & Diederich, 1994) and TREPAN (Craven, 1996) algorithms. The first approach, however, modifies the training set and therefore loses the generalisation capacity of the ANNs. The TREPAN approach is similar to decision tree algorithms such as CART (Classification and Regression Trees) or C4.5, and turns the ANN into an MofN (M-of-N) decision tree.

The DEDEC (Decision Detection) algorithm (Tickle, Andrews, Golea, & Diederich, 1998) extracts rules by finding minimal information sufficient to distinguish, from the neural network point of view, between a given pattern and all other patterns. The DEDEC algorithm uses the trained ANN to create examples from which rules can be extracted. Unlike other approaches, it also uses the weight vectors of the network to obtain an additional analysis that improves the extraction of rules. This information is then used to direct the strategy for generating a (minimal) set of examples for the learning phase. It also uses an efficient algorithm for the rule extraction phase. Based on these and other already mentioned techniques, Chalup, Hayward and Diederich (1998); Visser, Tickle, Hayward and Andrews (1996); and Tickle, Andrews, Golea and Diederich (1998) also presented their solutions.

The methods for the extraction of logical rules, developed by Duch, Adamczak and Grabczewski (2001), are based on multilayer perceptron networks with the MLP2LN (Multi-Layer Perceptron converted to Logical Network) method and its constructive version C-MLP2LN. MLP2LN consists in taking an already trained multilayer perceptron and simplifying it in order to obtain a network with weights 0, +1 or -1. C-MLP2LN acts in a similar way. After this process, the dominant rules are easily extracted, and the weights of the input layer allow us to deduce which parameters are relevant.

More recently, genetic algorithms (GAs) have been used to discover rules in ANNs. Keedwell, Narayanan and Savic (2000) use a GA in which the chromosomes are rules based on value intervals or ranges applied to the inputs of the ANN. The values are obtained from the training patterns.

The most recent works in rule extraction from ANNs are presented by Rivero, Rabuñal, Dorado, Pazos and Pedreira (2004) and Rabuñal, Dorado, Pazos and Rivero (2003). They extract rules by applying a symbolic regression system, based on Genetic Programming (GP) (Engelbrecht, Rouwhorst & Schoeman, 2001; Koza, Keane, Streeter, Mydlowec, Yu, & Lanza, 2003; Wong & Leung, 2000), to a set of inputs/outputs produced by the ANN. The set of network inputs/produced outputs is dynamically modified, as explained in this article.

ARCHITECTURE

This article presents a two-level architecture for the extraction of knowledge from databases. At the first level, we apply an ANN as a Data Mining technique; at the second level, we apply a knowledge extraction technique to this network.

Data Mining with ANNs

Artificial Neural Networks constitute a Data Mining technique that has been widely used for the extraction of knowledge from databases. Their training process is based on examples, and presents several advantages that other models do not offer:

• A high generalisation level. Once ANNs are trained with a training set, they produce outputs (close to the desired or supposed outputs) for inputs that were never presented to them before.
• A high error tolerance. Since ANNs are based on the successive and parallel interconnection between many processing elements (neurons), the output of the system is not significantly affected if one of them fails.
• A high noise tolerance.

All these advantages turn ANNs into an ideal technique for the extraction of knowledge in almost any domain. They are trained with many different training algorithms. The most famous one is the backpropagation algorithm (Rumelhart, Hinton & Williams, 1986), but many other training algorithms are applied according to the topology of the network and the use that is given to it. In recent years, Evolutionary Computation techniques such as Genetic Algorithms (Holland, 1975; Goldberg, 1989; Rabuñal, Dorado, Pazos, Gestal, Rivero & Pedreira, 2004a; Rabuñal, Dorado, Pazos, Pereira & Rivero, 2004b) have been gaining ground, because they correct the defects of other training algorithms, such as the tendency to fall into local minima or to overtrain the network.

Even so, and in spite of these algorithms that train the network automatically (and even search for the topology of the network), ANNs present a series of defects that make them useless in many application fields. As we already said, their main defect is the fact that, in general, they are not interpretable: once an input is applied to the
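The backpropagation algorithm cited above (Rumelhart, Hinton & Williams, 1986) can be sketched in a few lines of NumPy. The network size, learning rate, number of iterations, and the XOR data below are assumptions of this illustration, not details from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: XOR, a classic non-linearly-separable problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One hidden layer of 4 units; weights initialised at random.
W1 = rng.normal(size=(2, 4))
W2 = rng.normal(size=(4, 1))

losses = []
lr = 1.0
for _ in range(2000):
    # Forward pass.
    H = sigmoid(X @ W1)          # hidden activations
    out = sigmoid(H @ W2)        # network outputs
    err = out - y
    losses.append(float(np.mean(err ** 2)))
    # Backward pass: propagate the error from output to hidden layer.
    d_out = err * out * (1 - out)            # sigmoid derivative at output
    d_hid = (d_out @ W2.T) * H * (1 - H)     # error pushed back through W2
    # Gradient-descent weight updates.
    W2 -= lr * H.T @ d_out
    W1 -= lr * X.T @ d_hid

# Minimising the output error by gradient descent should reduce the loss.
assert losses[-1] < losses[0]
```

This is exactly the comparison-based error minimisation described in the Backpropagation Algorithm key term below: the network's outputs are compared against the desired outputs, and the error is propagated backwards to adjust the weights.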
In this way, we create a closed system in which the patterns set is constantly updated and the search continues for new areas that are not yet covered by the knowledge that we have of the network. A detailed description of the method and of its application to concrete problems can be found in Rabuñal, Dorado, Pazos and Rivero (2003); Rabuñal, Dorado, Pazos, Gestal, Rivero and Pedreira (2004a); and Rabuñal, Dorado, Pazos, Pereira and Rivero (2004b).

FUTURE TRENDS

The proposed architecture is based on the application of a knowledge extraction technique to an ANN, where the technique to be used depends on the type of knowledge that we wish to obtain. Since there is a great variety of techniques and algorithms that generate information of the same kind (IF-THEN rules, trees, etc.), we need to study them and carry out experiments to test their functioning in the proposed system, and in particular their adequacy for the new patterns generation system.

Also, this system for the generation of new patterns involves a large number of parameters, such as the percentage of patterns change, the number of new patterns, or the maximum error with which we consider that a pattern is represented by a rule. Since there are so many parameters, we need to study not only each parameter separately but also the influence of all the parameters on the final result.

CONCLUSION

This article proposes a system architecture that makes good use of the advantages of ANNs as a Data Mining technique and avoids their inconveniences. On a first level, we apply an ANN to extract and model a set of data. The resulting model offers all the advantages of ANNs, such as noise tolerance and generalisation capacity. On a second level, we apply another knowledge extraction technique to the ANN, and thus obtain the knowledge of the ANN, which is the generalisation of the knowledge that was used for its learning; this knowledge is expressed in the shape decided by the user. It is obvious that the union of various techniques in a hybrid system conveys the series of advantages that are associated with them.

REFERENCES

Andrews, R., & Geva, S. (1994). Rule extraction from a constrained error backpropagation MLP. Proceedings of the Australian Conference on Neural Networks, Brisbane, Queensland (pp. 9-12).

Benítez, J.M., Castro, J.L., & Requena, I. (1997). Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8, 1156-1164.

Buckley, J.J., Hayashi, Y., & Czogala, E. (1993). On the equivalence of neural nets and fuzzy expert systems. Fuzzy Sets Systems, 53, 129-134.

Chalup, S., Hayward, R., & Diederich, J. (1998). Rule extraction from artificial neural networks trained on elementary number classification task. Queensland University of Technology, Neurocomputing Research Centre. QUT NRC technical report.

Cios, K., Pedrycz, W., & Swiniarski, R. (1998). Data mining methods for knowledge discovery. Kluwer International Series in Engineering and Computer Science. Boston: Kluwer Academic Publishers.

Craven, M. W. (1996). Extracting comprehensible models from trained neural networks. PhD Thesis, University of Wisconsin, Madison.

Duch, W., Adamczak, R., & Grabczewski, K. (2001). A new methodology of extraction, optimisation and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 12, 277-306.

Engelbrecht, A.P., Rouwhorst, S.E., & Schoeman, L. (2001). A building block approach to genetic programming for rule discovery. In H. Abbass, R. Sarkar, & C. Newton (Eds.), Data mining: A heuristic approach (pp. 175-189). Hershey: Idea Group Publishing.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Reading, MA: Addison-Wesley.

Haykin, S. (1999). Neural networks (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.

Holland, J. H. (1975). Adaptation in natural and artificial systems. University of Michigan Press.

Jang, J., & Sun, C. (1992). Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks, 4, 156-158.

Keedwell, E., Narayanan, A., & Savic, D. (2000). Creating rules from trained neural networks using genetic algorithms. Proceedings of the International Journal of Computers, Systems and Signals (IJCSS) (Vol. 1, pp. 30-42).

Koza, J. R., Keane, M. A., Streeter, M. J., Mydlowec, W., Yu, J., & Lanza, G. (Eds.). (2003). Genetic programming
IV: Routine human-competitive machine intelligence. Dordrecht, The Netherlands: Kluwer Academic Publishers.

McCulloch, W.S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.

Orchard, G. (Ed.). (1993). Neural computing: Research and applications. London: Institute of Physics Publishing.

Pop, E., Hayward, R., & Diederich, J. (1994). RULENEG: Extracting rules from a trained ANN by stepwise negation. Queensland University of Technology, Neurocomputing Research Centre. QUT NRC technical report.

Rabuñal, J.R., Dorado, J., Pazos, A., & Rivero, D. (2003). Rules and generalization capacity extraction from ANN with GP. Lecture Notes in Computer Science, 606-613.

Rabuñal, J.R., Dorado, J., Pazos, A., Gestal, M., Rivero, D., & Pedreira, N. (2004a). Search the optimal RANN architecture, reduce the training set and make the training process by a distribute genetic algorithm. Artificial Intelligence and Applications, 1, 415-420.

Rabuñal, J.R., Dorado, J., Pazos, A., Pereira, J., & Rivero, D. (2004b). A new approach to the extraction of ANN rules and to their generalization capacity through GP. Neural Computation, 16(7), 1483-1523.

Rivero, D., Rabuñal, J.R., Dorado, J., Pazos, A., & Pedreira, N. (2004). Extracting knowledge from databases and ANNs with genetic programming: Iris flower classification problem. Intelligent agents for data mining and information retrieval (pp. 136-152).

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Tickle, A.B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9, 1057-1068.

Towell, G., & Shavlik, J.W. (1994). Knowledge-based artificial neural networks. Artificial Intelligence, 70, 119-165.

Visser, U., Tickle, A., Hayward, R., & Andrews, R. (1996). Rule-extraction from trained neural networks: Different techniques for the determination of herbicides for the plant protection advisory system PRO_PLANT. Proceedings of the Rule Extraction from Trained Artificial Neural Networks Workshop (pp. 133-139), Brighton, UK.

Wong, M.L., & Leung, K.S. (2000). Data mining using grammar based genetic programming and applications. Series in Genetic Programming, 3. Boston: Kluwer Academic Publishers.

KEY TERMS

Area of the Search Space: Set of specific ranges or values of the input variables that constitute a subset of the search space.

Artificial Neural Networks: A network of many simple processors (units or neurons) that imitates a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing or medical diagnosis.

Backpropagation Algorithm: Learning algorithm of ANNs, based on minimising the error obtained from the comparison between the outputs that the network gives after the application of a set of network inputs and the outputs it should give (the desired outputs).

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns, relationships or obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping.

Evolutionary Computation: Solution approach guided by biological evolution, which begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data.

Knowledge Extraction: Explicitation of the internal knowledge of a system or set of data in a way that is easily interpretable by the user.

Rule Induction: Process of learning, from cases or instances, if-then rule relationships that consist of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

Search Space: Set of all possible situations that the problem we want to solve could ever be in.
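As an illustration of the Rule Induction entry, and of interval-based rules such as those used by Keedwell, Narayanan and Savic (2000), the following sketch probes a stand-in decision function over a one-dimensional input grid and merges constant-prediction regions into IF-THEN rules. The model, the grid, and the helper name are hypothetical choices of this sketch:

```python
def extract_interval_rules(predict, grid):
    """Scan a 1-D input grid, querying a black-box model, and merge
    consecutive points with the same predicted class into
    IF input in [lo, hi] THEN class rules."""
    rules = []
    start = grid[0]
    current = predict(grid[0])
    prev = grid[0]
    for x in grid[1:]:
        label = predict(x)
        if label != current:
            rules.append((start, prev, current))  # close the finished interval
            start, current = x, label
        prev = x
    rules.append((start, prev, current))
    return rules

# Stand-in for a trained network's decision function (hypothetical).
predict = lambda x: 1 if x >= 0.5 else 0

grid = [i / 10.0 for i in range(11)]  # 0.0, 0.1, ..., 1.0
rules = extract_interval_rules(predict, grid)
# Two rules: IF x in [0.0, 0.4] THEN 0; IF x in [0.5, 1.0] THEN 1.
```

Probing a trained model to generate examples and then inducing rules from them is the same general strategy attributed above to DEDEC and to the GA- and GP-based approaches.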
Learning Bayesian Networks

Paola Sebastiani
Boston University School of Public Health, USA
Figure 1.

The network represents the notion that obesity and gender affect the heart condition of a patient. The variable obesity can take three values: yes, borderline and no. The variable heart condition has two states: true and false. In this representation, the node heart condition is said to be a child of the nodes gender and obesity, which, in turn, are the parents of heart condition.

The variables used in a Bayesian network are stochastic, meaning that the assignment of a value to a variable is represented by a probability distribution. For instance, if we do not know for sure the gender of a patient, we may want to encode the information so that we have better chances of having a female patient rather than a male one. This guess, for instance, could be based on statistical considerations of a particular population, but this may not be our unique source of information. So, for the sake of this example, let's say that there is an 80% chance of being female and a 20% chance of being male. Similarly, we can encode that the incidence of obesity is 10%, and that 20% are borderline cases. The following set of distributions tries to encode the fact that obesity increases the cardiac risk of a patient, but that this effect is more significant in men than in women. The dependency is modeled by a set of probability distributions, one for each combination of states of the variables gender and obesity, called the parent variables of heart condition.

Figure 2.

LEARNING

Learning a Bayesian network from data consists of the induction of its two different components: (1) the graphical structure of conditional dependencies (model selection) and (2) the conditional distributions quantifying the dependency structure (parameter estimation).

There are two main approaches to learning Bayesian networks from data. The first, known as the constraint-based approach, is based on conditional independence tests: as the network encodes assumptions of conditional independence, under this approach we need to identify conditional independence constraints in the data by testing, and then encode them into a Bayesian network (Glymour, 1987; Pearl, 1988; Whittaker, 1990).

The second approach is Bayesian (Cooper & Herskovitz, 1992; Heckerman et al., 1995) and regards model selection as a hypothesis-testing problem. In this approach, we suppose we have a set M = {M0, M1, ..., Mg} of Bayesian networks for the random variables Y1, ..., Yv, and each Bayesian network represents a hypothesis on the dependency structure relating these variables. We then choose one Bayesian network after observing a sample of data D = {y1k, ..., yvk}, for k = 1, ..., n. If p(Mh) is the prior probability of model Mh, a Bayesian solution to the model selection problem consists of choosing the network with maximum posterior probability:

p(Mh|D) ∝ p(Mh) p(D|Mh).

The quantity p(D|Mh) is the marginal likelihood, and its computation requires the specification of a parameterization of each model Mh and the elicitation of a prior distribution for the model parameters. When all variables are discrete, or when all variables are continuous, follow Gaussian distributions, and have linear dependencies, the marginal likelihood factorizes into the product of the marginal likelihoods of each node and its parents. An important property of this likelihood modularity is that, in the comparison of models that differ only in the parent structure of a variable Yi, only the local marginal likelihood matters. Thus, the comparison of two local network structures that specify different parents for Yi can be done simply by evaluating the product of the local Bayes factor BFh,k = p(D|Mhi) / p(D|Mki) and the prior ratio p(Mhi) / p(Mki), which gives the posterior odds of one model vs. the other, p(Mhi|D) / p(Mki|D).

In this way, we can learn a model locally by maximizing the marginal likelihood node by node. Still, the space of possible sets of parents for each variable grows exponentially with the number of parents involved, but successful heuristic search procedures (both deterministic and stochastic) exist to render the task more amenable (Cooper & Herskovitz, 1992; Larranaga et al., 1996; Singh & Valtorta, 1995).
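The local, decomposable score just described can be sketched for discrete variables. The Dirichlet hyperparameter (alpha = 1, a uniform K2-style prior) and the toy sample below are assumptions of this sketch, not part of the article:

```python
from math import lgamma
from collections import Counter

def log_local_marginal_likelihood(data, child, parents, states, alpha=1.0):
    """Log marginal likelihood of a discrete child node given a candidate
    parent set, under uniform Dirichlet priors; the product of these local
    terms over all nodes gives the network's marginal likelihood."""
    counts = Counter()
    for row in data:
        cfg = tuple(row[p] for p in parents)
        counts[(cfg, row[child])] += 1
    parent_totals = Counter()
    for (cfg, _), n in counts.items():
        parent_totals[cfg] += n
    r = len(states[child])  # number of child states
    score = 0.0
    for cfg, n_j in parent_totals.items():
        score += lgamma(r * alpha) - lgamma(r * alpha + n_j)
        for k in states[child]:
            score += lgamma(alpha + counts[(cfg, k)]) - lgamma(alpha)
    return score

# Hypothetical sample over two binary variables where Y1 tracks Y0.
data = ([{"Y0": 0, "Y1": 0}] * 40 + [{"Y0": 1, "Y1": 1}] * 40 +
        [{"Y0": 0, "Y1": 1}] * 5 + [{"Y0": 1, "Y1": 0}] * 5)
states = {"Y0": [0, 1], "Y1": [0, 1]}

# Compare two local structures for Y1: no parents vs. parent {Y0}.
s_empty = log_local_marginal_likelihood(data, "Y1", (), states)
s_with = log_local_marginal_likelihood(data, "Y1", ("Y0",), states)

# Log Bayes factor; a positive value favours the model with Y0 as a parent.
log_bf = s_with - s_empty
assert log_bf > 0
```

Because the score decomposes node by node, exactly as stated above, each candidate parent set for Y1 can be evaluated without touching the rest of the network.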
Once the structure has been learned from a dataset, we still need to estimate the conditional probability distributions associated with each dependency in order to turn the graphical model into a Bayesian network. This process, called parameter estimation, takes a graphical structure and estimates the conditional probability distributions of each parent-child combination. When all the parent variables are discrete, we need to compute the conditional probability distribution of the child variable, given each combination of states of its parent variables. These conditional distributions can be estimated either as relative frequencies of cases or, in a Bayesian fashion, by using these relative frequencies to update some, possibly uniform, prior distribution. A more detailed description of these estimation procedures for both the discrete and the continuous case is available in Ramoni and Sebastiani (2003).

Prediction and Classification

Once a Bayesian network has been defined, either by hand or by an automated discovery process from data, it can be used to reason about new problems for prediction, diagnosis, and classification. Bayes' theorem is at the heart of the propagation process.

One of the most useful properties of a Bayesian network is the ability to propagate evidence irrespective of the position of a node in the network, contrary to standard classification methods. In a typical classification system, for instance, the variable to predict (i.e., the class) must be chosen in advance, before learning the classifier. Information about single individuals will then be entered, and the classifier will predict the class (and only the class) of these individuals. In a Bayesian network, on the other hand, the information about a single individual will be propagated in any direction in the network, so that the variable(s) to predict need not be chosen in advance.

Although the problem of propagating probabilistic information in Bayesian networks is known to be, in the general case, NP-complete (Cooper, 1990), several scalable algorithms exist to perform this task in networks with hundreds of nodes (Castillo et al., 1996; Cowell et al., 1999; Pearl, 1988). Some of these propagation algorithms have been extended, with some restrictions or approximations, to networks containing continuous variables (Cowell et al., 1999).

FUTURE TRENDS

The technical challenges of current research in Bayesian networks are focused mostly on overcoming their current limitations. Established methods to learn Bayesian networks from data work under the assumption that each variable is either discrete or normally distributed around a mean that linearly depends on its parent variables. The latter networks are termed linear Gaussian networks, which still enjoy the decomposability properties of the marginal likelihood. Imposing the assumption that continuous variables follow linear Gaussian distributions, and that discrete variables can only be parent nodes in the network but cannot be children of any continuous node, leads to a closed-form solution for the computation of the marginal likelihood (Lauritzen, 1992). The second technical challenge is the identification of sound methods to handle incomplete information, either in the form of missing data (Sebastiani & Ramoni, 2001) or of completely unobserved variables (Binder et al., 1997). A third important area of development is the extension of Bayesian networks to represent dynamic processes (Ghahramani, 1998) and to decode control mechanisms.

The most fundamental challenge for Bayesian networks today, however, is the full deployment of their potential in groundbreaking applications and their establishment as a routine analytical technique in science and engineering. Bayesian networks are becoming increasingly popular in various fields of genomic and computational biology, from gene expression analysis (Friedman, 2004) to proteomics (Jansen et al., 2003) and genetic analysis (Lauritzen & Sheehan, 2004), but they are still far from being a received approach in these areas. Still, these areas of application hold the promise of turning Bayesian networks into a common tool of statistical data analysis.

CONCLUSION

Bayesian networks are a representation formalism born at the intersection of statistics and artificial intelligence. Thanks to their solid statistical foundations, they have been turned successfully into a powerful data-mining and knowledge-discovery tool that is able to uncover complex models of interactions from large databases. Their highly symbolic nature makes them easily understandable to human operators. Contrary to standard classification methods, Bayesian networks do not require the preliminary identification of an outcome variable of interest, but are able to draw probabilistic inferences on any variable in the database. Notwithstanding these attractive properties and the continuous interest of the data-mining and knowledge-discovery community, Bayesian networks still are not playing a routine role in the practice of science and engineering.

REFERENCES

Binder, J. et al. (1997). Adaptive probabilistic networks with hidden variables. Mach Learn, 29(2-3), 213-244.
Castillo, E. et al. (1996). Expert systems and probabilistic network models. New York: Springer.

Charniak, E. (1991). Bayesian networks without tears. AI Magazine, 12(4), 50-63.

Cooper, G.F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artif Intell, 42(2-3), 393-405.

Cooper, G.F., & Herskovitz, G.F. (1992). A Bayesian method for the induction of probabilistic networks from data. Mach Learn, 9, 309-347.

Cowell, R.G., et al. (1999). Probabilistic networks and expert systems. New York: Springer.

Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303, 799-805.

Ghahramani, Z. (1998). Learning dynamic Bayesian networks. In C.L. Giles & M. Gori (Eds.), Adaptive processing of sequences and data structures (pp. 168-197). New York: Springer.

Glymour, C., Scheines, R., Spirtes, P., & Kelly, K. (1987). Discovering causal structure: Artificial intelligence, philosophy of science, and statistical modeling. San Diego, CA: Academic Press.

Heckerman, D. (1997). Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1), 79-119.

Heckerman, D. et al. (1995). Learning Bayesian networks: The combinations of knowledge and statistical data. Mach Learn, 20, 197-243.

Jansen, R. et al. (2003). A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302, 449-453.

Larranaga, P., Kuijpers, C., Murga, R., & Yurramendi, Y. (1996). Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE T Syst Man Cyb, 26, 487-493.

Lauritzen, S.L. (1992). Propagation of probabilities, means and variances in mixed graphical association models. J Amer Statist Assoc, 87, 1098-1108.

Lauritzen, S.L. (1996). Graphical models. Oxford: Clarendon Press.

Lauritzen, S.L., & Spiegelhalter, D.J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J Roy Stat Soc B Met, 50, 157-224.

Lauritzen, S.L., & Sheehan, N.A. (2004). Graphical models for genetic analysis. Statist Sci, 18(4), 489-514.

Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artif Intell, 29(3), 241-288.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Ramoni, M., & Sebastiani, P. (2003). Bayesian methods. In M. Berthold & D.J. Hand (Eds.), Intelligent data analysis: An introduction (pp. 128-166). New York: Springer.

Sebastiani, P., & Ramoni, M. (2001). Bayesian selection of decomposable models with incomplete data. J Am Stat Assoc, 96(456), 1375-1386.

Singh, M., & Valtorta, M. (1995). Construction of Bayesian network structures from data: A brief survey and an efficient algorithm. Int J Approx Reason, 12, 111-131.

Whittaker, J. (1990). Graphical models in applied multivariate statistics. New York: John Wiley & Sons.

Wright, S. (1923). The theory of path coefficients: A reply to Niles's criticism. Genetics, 8, 239-255.

Wright, S. (1934). The method of path coefficients. Ann Math Statist, 5, 161-215.

KEY TERMS

Bayes Factor: The probability of the observed data under one hypothesis divided by its probability under an alternative hypothesis.

Conditional Independence: Let X, Y, and Z be three sets of random variables; then X and Y are said to be conditionally independent given Z, if and only if p(x|z,y) = p(x|z) for all possible values x, y, and z of X, Y, and Z.

Directed Acyclic Graph (DAG): A graph with directed arcs containing no cycles; in this type of graph, for any node, there is no directed path returning to it.

Probabilistic Graphical Model: A graph with nodes representing stochastic variables annotated by probability distributions and representing assumptions of conditional independence among its variables.

Statistical Independence: Let X and Y be two disjoint sets of random variables; then X is said to be independent of Y, if and only if p(x) = p(x|y) for all possible values x and y of X and Y.
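The ability to propagate evidence in any direction, discussed in the Prediction and Classification section, can be illustrated by brute-force enumeration of the joint distribution on the example network of Figure 1. The priors below come from the article's example (80% female, obesity 10% yes / 20% borderline); the heart-condition table is hypothetical, since the article's Figure 2 is not reproduced here:

```python
from itertools import product

# Priors from the article's example; the CPT for heart condition below
# is a hypothetical stand-in for Figure 2 (risk higher for obese men).
p_gender = {"female": 0.8, "male": 0.2}
p_obesity = {"yes": 0.1, "borderline": 0.2, "no": 0.7}
p_heart_true = {  # P(heart = true | gender, obesity), assumed values
    ("male", "yes"): 0.40, ("male", "borderline"): 0.25, ("male", "no"): 0.10,
    ("female", "yes"): 0.25, ("female", "borderline"): 0.15, ("female", "no"): 0.05,
}

def joint(g, o, h):
    """Joint probability factorized along the network's parent structure."""
    p_h = p_heart_true[(g, o)]
    return p_gender[g] * p_obesity[o] * (p_h if h == "true" else 1.0 - p_h)

def posterior(query_var, evidence):
    """P(query_var | evidence) by enumerating the full joint: Bayes'
    theorem lets evidence flow against the direction of the arcs."""
    dist = {}
    for g, o, h in product(p_gender, p_obesity, ["true", "false"]):
        world = {"gender": g, "obesity": o, "heart": h}
        if all(world[k] == v for k, v in evidence.items()):
            dist[world[query_var]] = dist.get(world[query_var], 0.0) + joint(g, o, h)
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}

# Diagnostic reasoning: from the child (heart condition) back to a parent.
post = posterior("gender", {"heart": "true"})
# Observing heart disease should raise the probability of "male"
# above its 20% prior, since the assumed CPT gives men higher risk.
assert post["male"] > p_gender["male"]
```

Enumeration is exponential in the number of variables, which is why the scalable propagation algorithms cited above (Pearl, 1988; Cowell et al., 1999) matter in practice; the behavior, however, is the same: any variable can serve as evidence or as query.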
Learning Information Extraction Rules for Web Data Mining

Chun-Nan Hsu
Institute of Information Science, Academia Sinica, Taiwan
tive languages, HTML structure analysis, natural language processing, machine learning, data modeling, and ontology.

MAIN THRUST

We classify previous work in Web IE into three categories. The first category contains the systems that require users to possess programming expertise. This category of wrapper generation systems provides specialized languages or toolkits for wrapper construction, such as W4F (Sahuguet & Azavant, 2001) and XWrap (Liu et al., 2000). Such languages or toolkits were proposed as alternatives to general-purpose languages in order to allow programmers to concentrate on formulating the extraction rules without being concerned about the detailed processing of input strings. To apply these systems, users must learn the language in order to write their extraction rules. Therefore, such systems also feature user-friendly interfaces for easy use of the toolkits. However, writing correct extraction rules requires significant programming expertise. In addition, since the structures of Web pages are not always obvious and change frequently, writing specialized extraction rules can be time-consuming, error-prone, and not scalable to a large number of Web sites. Therefore, there is a need for automatic wrapper induction that can generalize extraction rules for each distinct IE task.

The second category contains the WI systems that require users to label some extraction targets as training examples, from which the WI systems apply a machine-learning algorithm to learn extraction rules. No programming is needed to configure these WI systems. Many IE tasks for Web mining belong to this category; for example, IE for semi-structured text, such as RAPIER (Califf & Mooney, 1999), SRV (Freitag, 2000), and WHISK (Soderland, 1999), and IE for template-based pages, such as WIEN (Kushmerick et al., 2000), SoftMealy (Hsu & Dung, 1998), STALKER (Muslea et al., 2001), and so forth. Compared to the first category, these WI systems are preferable, since general users, instead of only programmers, can be trained to use them for wrapper construction.

However, since the learned rules only apply to Web pages from a particular Web site, labeling training examples can be laborious, especially when we need to extract contents from thousands of data sources. Therefore, researchers have focused on developing tools that can reduce labeling effort. For instance, Muslea et al. (2002) proposed selective sampling, a form of active learning that reduces the number of training examples. Chidlovskii et al. (2000) designed a wrapper generation system that requires a small amount (one training record) of labeling by the user. Earlier annotation-based WI systems place emphasis on the learning techniques in their papers. Recently, several works have been proposed to simplify the annotation process. For example, Lixto (Baumgartner et al., 2001), DEByE (Laender et al., 2002), and OLERA (Chang & Kuo, 2004) are three such systems that stress the importance of how annotations or examples are received from users. Note that OLERA also features the so-called semi-supervised approach, which accepts rough rather than exact and perfect examples from users to reduce labeling effort.

The third category contains the WI systems that do not require any preprocessing of the input documents by the users. We call them annotation-free WI systems. Example systems include IEPAD (Chang & Lui, 2001), RoadRunner (Crescenzi et al., 2001), DeLa (Wang & Lochovsky, 2003), and EXALG (Arasu & Garcia-Molina, 2003). Since no extraction targets are specified, such WI systems make heuristic assumptions about the data to be extracted. For example, the first three systems assume the existence of multiple tuples to be extracted in one page; therefore, the approach is to discover repeated patterns in the input page. With such an assumption, IEPAD and DeLa only apply to Web pages that contain multiple data tuples. RoadRunner and EXALG, on the other hand, try to extract structured data by deducing the template and the schema of the whole page from multiple Web pages. Their assumption is that strings that are stationary across pages are presumably template, while strings that vary are presumably schema and need to be extracted. However, since commercial Web pages often contain multiple topics, with a lot of information embedded in a page for navigation, decoration, and interaction purposes, their systems may extract both useful and useless information from a page. Moreover, the criterion of what is useful is quite subjective and depends on the application. In summary, these approaches are, in fact, not fully automatic. Rather, post-processing is required for users to select useful data and to assign the data to a proper attribute.

Task Difficulties

A critical issue of WI systems is what types of documents and structural variations can be handled. Documents can be classified into structured, semi-structured, and unstructured sets (Hsu & Dung, 1998). Early IE systems like RAPIER, SRV, and WHISK are designed to handle documents that contain semi-structured texts, while recent IE systems are designed mostly to handle documents that contain semi-structured data (Laender et al., 2002). In this survey, we focus on semi-structured data extraction and possible structure variations. These include missing data attributes, multi-valued attributes, attribute permutations, nested data structures, and so forth.
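The stationary-versus-variant heuristic used by the annotation-free systems (stationary strings treated as template, varying strings treated as data) can be illustrated with a deliberately naive sketch. It is not RoadRunner's or EXALG's actual algorithm, and it assumes equal-length token sequences, which the real systems do not require:

```python
def split_template_and_data(pages):
    """Naive sketch: token positions identical across all pages are
    treated as template; differing positions are treated as data.
    Assumes all pages tokenize to the same length (a big simplification
    that real annotation-free systems do not need)."""
    token_lists = [p.split() for p in pages]
    assert len({len(t) for t in token_lists}) == 1, "equal-length pages only"
    template, records = [], [[] for _ in pages]
    for position in zip(*token_lists):
        if len(set(position)) == 1:          # stationary across pages
            template.append(position[0])
        else:                                # variant: presumably data
            for rec, tok in zip(records, position):
                rec.append(tok)
    return template, records

# Two hypothetical pages generated from the same template.
pages = ["<b> Title: </b> Databases <i> Price: </i> 39",
         "<b> Title: </b> Networks <i> Price: </i> 45"]
template, records = split_template_and_data(pages)
print(template)   # the stationary tokens, i.e. the deduced template
print(records)    # the variant tokens, i.e. the extracted data
```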
Table 1 lists these variations and whether a WI system can correctly extract the contents from a document with a layout variation. Most WI systems can handle missing attributes, except for WIEN. Multi-valued attributes can be considered a special case of nested objects. However, it is neither right nor wrong to say that annotation-free WI systems can handle multi-valued attributes, because it depends on how the values are delimited (by HTML tags or by delimiters such as commas, spaces, etc.). Similarly, though RoadRunner and EXALG support the extraction of nested objects in general, whether they can handle a particular set of documents with nested objects depends, in fact, on the quality of the template-based input.

Permutations of attributes refer to multiple attribute orders in different data tuples in the target documents (see the PMI example in Chang & Kuo, 2004). Note that both missing attributes and permutations of attributes can lead to multiple attribute orders. Some approaches (e.g., STALKER and multi-pass SoftMealy) utilize multiple scans to deal with attribute permutations. Some (e.g., IEPAD, DeLa) employ string alignment techniques for this issue. However, the way they handle the extracted data results in different expressive power. EXALG combines two equivalence classes to produce disjunction rules for handling permuted attributes. However, the procedure may fail when all tokens fall in one equivalence class or when no equivalence class is formed.

Another difficult issue is what we call common delimiters (CD) for attributes and record boundaries. An example can be found in an Internet address finder document set from the repository of information sources for information extraction (RISE, http://www.isi.edu/info-agents/RISE/repository.html). In that document set, HTML tags like <TR> and <TD> are used as delimiters for both record boundaries and attributes. Such document sets are especially difficult for annotation-free systems. Even for annotation-based systems, the problem cannot be completely handled with their default extraction mechanisms. Nonexistent delimiters (ND) also cause problems for some WI systems that rely on delimiter-based extraction rules. For example, suppose we want to extract the department code and course number from the string COMP4016. WI systems that depend on recognizing delimiters will fail in this case, because no delimiter exists between the department code COMP and the course number 4016. Note that delimiter-related issues are sometimes caused by the tokenization/encoding method (see the next section) of the WI systems; therefore, they do not necessarily cause problems for all WI systems. Finally, most WI systems assume that the attributes of a data object occur in a contiguous string that does not interleave with other data objects, except for MDR (Liu et al., 2003), which is able to handle non-contiguous (NC) data records.

Encoding Scheme, Scanning Passes, and Other Features

Table 2 compares important features of WI systems. In order to learn the set of extraction rules, WI systems need to know how to segment a string of characters (the input) into tokens. For example, SoftMealy segments a page into tokens, including HTML tags as well as words separated by spaces, and uses a token taxonomy tree for extraction rule generation. On the other hand, IEPAD and RoadRunner regard every text string between two HTML tags as one token, which leads to coarser extraction granularity. Most WI systems have a predefined feature set for rule generalization. Still, some systems, such as OLERA (Chang & Kuo, 2004), explicitly allow user-defined encoding schemes for tokenization. The extraction mechanisms of the various WI systems also play an important role in extraction efficiency. Some WI systems scan the input document once and are referred to as single-pass extractors. Others scan the input document several times to complete the extraction. Generally speaking, single-pass wrappers are more efficient than multi-pass wrappers. However, multi-pass wrappers are more effective at handling data objects with unrestricted attribute permutations or complex object extraction.

Some of the WI systems have special characteristics or requirements that deserve discussion. For example, EXALG limits extraction failure to only part of the records (a few attributes) instead of the entire page. STALKER [...]
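The COMP4016 case above can be handled by character-class tokenization instead of delimiter matching. The following snippet is purely illustrative (it is not taken from any of the surveyed systems) and splits at the letter/digit boundary:

```python
import re

def split_code_number(s):
    """Split a string like 'COMP4016' into (department code, course number).
    A delimiter-based rule fails here because no separator character exists;
    a character-class boundary (letters vs. digits) succeeds."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", s)
    return (m.group(1), m.group(2)) if m else None

print(split_code_number("COMP4016"))  # ('COMP', '4016')
print("COMP4016".split(","))          # ['COMP4016']: a delimiter rule finds nothing to split
```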
REFERENCES

Chang, C.-H., Siek, H., Lu, J.-J., Chiou, J.-J., & Hsu, C.-N. (2003). Reconfigurable Web wrapper agents. IEEE Intelligent Systems, 18(5), 34-40.

Chidlovskii, B., Ragetli, J., & Rijke, M. (2000). Automatic wrapper generation for Web search engines. Proceedings of WAIM (pp. 399-410), Shanghai, China.

Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: Towards automatic data extraction from large Web sites. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB) (pp. 109-118), Roma, Italy.

Freitag, D. (2000). Machine learning for information extraction in informal domains. Machine Learning, 39(2/3), 169-202.

Hsu, C.-N., Chang, C.-H., Hsieh, C.-H., Lu, J.-J., & Chang, C.-C. (2004). Reconfigurable Web wrapper agents for biological information integration. Journal of the American Society for Information Science and Technology (JASIST), 56(5), 505-517.

Hsu, C.-N., & Dung, M.-T. (1998). Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems, 23(8), 521-538.

Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence Journal, 118(1-2), 15-68.

Kushmerick, N., & Thomas, B. (2002). Adaptive information extraction: Core technologies for information agents. Intelligent Information Agents R&D in Europe: An AgentLink Perspective. Lecture Notes in Computer Science, 2586, 79-103. Springer.

Laender, A.H.F., Ribeiro-Neto, B.A., & Da Silva, A.S. (2002). DEByE: Data extraction by example. Data and Knowledge Engineering, 40(2), 121-154.

Laender, A.H.F., Ribeiro-Neto, B.A., Da Silva, A.S., & Teixeira, J.S. (2002). A brief survey of Web data extraction tools. SIGMOD Record, 31(2), 84-93.

Liu, B., Grossman, R., & Zhai, Y. (2003). Mining data records in Web pages. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 601-606), Washington, D.C., USA.

Liu, L., Pu, C., & Han, W. (2000). XWrap: An XML-enabled wrapper construction system for Web information sources. Proceedings of the 16th International Conference on Data Engineering (ICDE) (pp. 611-621), San Diego, California, USA.

Muslea, I. (1999). Extraction patterns for information extraction tasks: A survey. The AAAI-99 Workshop on Machine Learning for Information Extraction (pp. 435-442), Sydney, Australia.

Muslea, I., Minton, S., & Knoblock, C.A. (2001). Hierarchical wrapper induction for semi-structured information sources. Journal of Autonomous Agents and Multi-Agent Systems, 4, 93-114.

Muslea, I., Minton, S., & Knoblock, C. (2002). Active + semi-supervised learning = robust multi-view learning. Proceedings of the 19th International Conference on Machine Learning (ICML), Sydney, Australia.

Sahuguet, A., & Azavant, F. (2001). Building intelligent Web applications using lightweight wrappers. Data and Knowledge Engineering, 36(3), 283-316.

Sarawagi, S. (2002). Automation in information extraction and integration. Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Tutorial, Hong Kong, China.

Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3), 233-272.

Wang, J., & Lochovsky, F.H. (2003). Data extraction and label assignment. Proceedings of the 10th International World Wide Web Conference (pp. 187-196), Budapest, Hungary.

Yang, G., Ramakrishnan, I.V., & Kifer, M. (2003). On the complexity of schema inference from Web pages in the presence of nullable data attributes. Proceedings of the 12th International Conference on Information and Knowledge Management (pp. 224-231), New Orleans, Louisiana, USA.

KEY TERMS

Hidden Markov Model: A variant of a finite state machine having a set of states, an output alphabet, transition probabilities, output probabilities, and initial state probabilities. Only the outcome, not the state, is visible to an external observer; the states are therefore hidden from the outside, hence the name Hidden Markov Model.

Information Extraction: An information extraction task is to extract or pull out user-defined and pertinent information from input documents.

Logic Programming: A declarative, relational style of programming based on first-order logic. The original logic programming language was Prolog. The concept is based on Horn clauses.

Semi-Structured Documents: Semi-structured (or template-based) documents refer to documents whose formatting lies between that of structured and unstructured documents.

Software Agents: An artificial agent that operates in a software environment, such as operating systems, computer applications, databases, networks, and virtual domains.

Transducer: A finite state machine, specifically one with a read-only input and a write-only output. The input and output cannot be reread or changed.

Wrapper: A program that extracts data from input documents and wraps it in a user-desired, structured form.
Locally Adaptive Techniques for Pattern Classification

Dimitrios Gunopulos
University of California, USA
that each instance is described by 20 features, but only three of them are relevant to classifying a given instance. In this case, two points that have identical values for the three relevant features may, nevertheless, be distant from one another in the 20-dimensional input space. As a result, the similarity metric that uses all 20 features will be misleading, since the distance between neighbors will be dominated by the large number of irrelevant features. This shows the effect of the curse of dimensionality phenomenon; that is, in high-dimensional spaces, distances between points within the same class and between different classes may be similar. This fact leads to highly biased estimates. Nearest neighbor approaches (Ho, 1998; Lowe, 1995) are especially sensitive to this problem.

In many practical applications, things are often further complicated. In the previous example, the three features relevant to the classification task at hand may depend on the location of the query point (i.e., the point to be classified) in the feature space. Some features may be relevant within a specific region, while other features may be more relevant in a different region. Figure 1 illustrates a case in point, where class boundaries are parallel to the coordinate axes. For query a, dimension X is more relevant, because a slight move along the X axis may change the class label, while for query b, dimension Y is more relevant. For query c, however, both dimensions are equally relevant.

Figure 1. Feature relevance varies with query locations

These observations have two important implications. Distance computation does not vary with equal strength or in the same proportion in all directions in the feature space emanating from the input query. Moreover, the value of such strength for a specific feature may vary from location to location in the feature space. Capturing such information, therefore, is of great importance to any classification procedure in high-dimensional settings.

MAIN THRUST

Severe bias can be introduced in pattern classification in a high-dimensional input feature space with finite samples. In the following, we introduce adaptive metric techniques for distance computation capable of reducing the bias of the estimation.

Friedman (1994) describes an adaptive approach (the Machete and Scythe algorithms) for classification that combines some of the best features of kNN learning and recursive partitioning. The resulting hybrid method inherits the flexibility of recursive partitioning to adapt the shape of the neighborhood N(x0) of query x0, as well as the ability of nearest neighbor techniques to keep the points within N(x0) close to the point being predicted. The method is capable of producing nearly continuous probability estimates, with the region N(x0) centered at x0 and the shape of the region separately customized for each individual prediction point.

The major limitation of the Machete/Scythe method is that, like recursive partitioning methods, it applies a greedy strategy. Since each split is conditioned on its ancestor split, minor changes in an early split, due to any variability in parameter estimates, can have a significant impact on later splits, thereby producing different terminal regions. This makes the predictions highly sensitive to the sampling fluctuations associated with the random nature of the process that produces the training data and, therefore, may lead to high-variance predictions.

In Hastie and Tibshirani (1996a), the authors propose a discriminant adaptive nearest neighbor classification method (DANN), based on linear discriminant analysis. Earlier related proposals appear in Myles and Hand (1990) and Short and Fukunaga (1981). The method in Hastie and Tibshirani (1996a) computes a local distance metric as a product of weighted within-class and between-class sum-of-squares matrices. The authors also describe a method of performing global dimensionality reduction by pooling the local dimension information over all points in the training set (Hastie & Tibshirani, 1996a, 1996b).

While sound in theory, DANN may be limited in practice. The main concern is that, in high dimensions, one may never have sufficient data to fill in the q×q (within- and between-class sum-of-squares) matrices, where q is the dimensionality of the problem. Also, the fact that the distance metric computed by DANN approximates the weighted Chi-squared distance only when class densities are Gaussian and have the same covariance matrix may cause a performance degradation in situations where the data do not follow Gaussian distributions or are corrupted by noise, which is often the case in practice.

A different adaptive nearest neighbor classification method (ADAMENN) has been introduced to try to minimize bias in high dimensions (Domeniconi, Peng & Gunopulos, 2002) and to overcome the previously mentioned limitations. ADAMENN performs a Chi-squared distance analysis to compute a flexible metric for produc- [...]

[...]tion upon which the ADAMENN algorithm computes a measure of local feature relevance, as shown in the following. The first observation is that Pr(j|x) is a function of x. Therefore, one can compute the conditional expectation of Pr(j|x), denoted by Pr(j|x_i = z), given that x_i [...]

[...] rise to linear and quadratic weightings, respectively, and R_i(x0) = max_{j=1,...,q} {r_j(x0)} - r_i(x0) (i.e., the larger the R_i, the more relevant dimension i). We propose the following exponential weighting scheme [...]
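The dominance of irrelevant features described at the start of this section (20 features, only 3 relevant) can be reproduced in a few lines. The numbers below are illustrative, not from the chapter:

```python
import math
import random

random.seed(0)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two points that agree exactly on the 3 relevant features but carry
# random noise on the 17 irrelevant ones (hypothetical setup).
relevant = [0.1, 0.5, 0.9]
p = relevant + [random.random() for _ in range(17)]
q = relevant + [random.random() for _ in range(17)]

d_all = euclidean(p, q)          # dominated by the 17 irrelevant features
d_rel = euclidean(p[:3], q[:3])  # 0.0: identical on the relevant subspace
print(d_rel, d_all)
```

Although the two points are identical on every relevant feature, the all-feature distance is strictly positive, so a plain similarity metric would not recognize them as neighbors of the same class.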
the technique is query-based, because the weights depend on the query (Aha, 1997; Atkeson, Moore & Schaal, 1997).

An intuitive explanation for (2) and, hence, (3) goes as follows. Suppose that the value of r_i(z) is small, which implies a large weight along dimension i. Consequently, the neighborhood is shrunk along that direction. This, in turn, penalizes points along dimension i that are moving away from z_i. Now, r_i(z) can be small only if the subspace spanned by the other input dimensions at x_i = z_i likely contains samples similar to z in terms of the class conditional probabilities. Then, a large weight assigned to dimension i based on (4) says that moving away from the subspace and, hence, from the data similar to z is not a good thing to do. Similarly, a large value of r_i(z) and, hence, a small weight indicates that in the vicinity of z_i along dimension i, one is unlikely to find samples similar to z. This corresponds to an elongation of the neighborhood along dimension i. Therefore, in this situation, in order to better predict the query, one must look farther away from z_i.

One of the key differences between the relevance measure (3) and Friedman's is the first term in the squared difference. While the class conditional probability is used in (3), its expectation is used in Friedman's. This difference is driven by two different objectives: in the case of Friedman's, the goal is to seek a dimension along which the expected variation of Pr(j|x) is maximized, whereas in (3) a dimension is found that minimizes the difference between the class probability distribution for a given query and its conditional expectation along that dimension (2). Another fundamental difference is that the Machete/Scythe methods, like recursive partitioning, employ a greedy peeling strategy that removes a subset of data points permanently from further consideration. As a result, changes in an early split, due to any variability in parameter estimates, can have a significant impact on later splits, thereby producing different terminal regions. This makes predictions highly sensitive to the sampling fluctuations associated with the random nature of the process that produces the training data, thus leading to high-variance predictions. In contrast, ADAMENN employs a patient averaging strategy that takes into account not only the test point x0, but also its K0 nearest neighbors. As such, the resulting relevance estimates (3) are, in general, more robust and have the potential to reduce the variance of the estimates.

In Hastie and Tibshirani (1996a), the authors show that the resulting metric approximates the weighted Chi-squared distance (1) by a Taylor series expansion, given that class densities are Gaussian and have the same covariance matrix. In contrast, ADAMENN does not make such assumptions, which are unlikely to hold in real-world applications. Instead, it attempts to approximate the weighted Chi-squared distance (1) directly. The main concern with DANN is that, in high dimensions, we may never have sufficient data to fill in the q×q matrices. It is interesting to note that the ADAMENN algorithm can potentially serve as a general framework upon which to develop a unified adaptive metric theory that encompasses both Friedman's work and that of Hastie and Tibshirani.

FUTURE TRENDS

Almost all problems of practical interest are high-dimensional. With the recent technological trends, we can expect an intensification of research efforts in the area of feature relevance estimation and selection. In bioinformatics, the analysis of microarray data poses challenging problems. Here, one has to face the problem of dealing with more dimensions (genes) than data points (samples). Biologists want to find marker genes that are differentially expressed in a particular set of conditions. Thus, methods that simultaneously cluster genes and samples are required to find distinctive checkerboard patterns in matrices of gene expression data. In cancer data, these checkerboards correspond to genes that are up- or down-regulated in patients with particular types of tumors. Increased research efforts in this area are needed and expected.

Clustering is not exempt from the curse of dimensionality. Several clusters may exist in different subspaces, comprised of different combinations of features. Since each dimension could be relevant to at least one of the clusters, global dimensionality reduction techniques are not effective. We envision further investigation of this problem, with the objective of developing techniques that are robust in the presence of noise.

Recent developments in kernel-based methods suggest a framework for making the locally adaptive techniques discussed previously more general. One can perform feature relevance estimation in an induced feature space and then use the resulting kernel metrics to compute distances in the input space. The key observation is that kernel metrics may be non-linear in the input space but are still linear in the induced feature space. Hence, the use of suitable non-linear features allows the computation of locally adaptive neighborhoods with arbitrary orientations and shapes in input space. Thus, more powerful classification techniques can be generated.
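The query-based weighting idea discussed above can be sketched with a toy exponential weighting scheme. The relevance scores and the constant c below are assumed for illustration; the chapter's exact formulas (2)-(4) are not reproduced here:

```python
import math

def exponential_weights(relevance, c=2.0):
    """Turn per-dimension relevance scores R_i into normalized weights
    w_i = exp(c * R_i) / sum_j exp(c * R_j). This is one generic
    exponential weighting scheme, not the chapter's exact formula."""
    exps = [math.exp(c * r) for r in relevance]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_distance(w, a, b):
    """Weighted Euclidean distance: large weights shrink the neighborhood
    along the corresponding (relevant) dimensions."""
    return math.sqrt(sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b)))

# Dimension 0 highly relevant at this query, dimension 1 irrelevant (assumed).
w = exponential_weights([1.0, 0.0])
print(w)                                      # weight on dimension 0 dominates
print(weighted_distance(w, (0, 0), (1, 0)))   # moving along the relevant dimension costs more
print(weighted_distance(w, (0, 0), (0, 1)))   # moving along the irrelevant dimension costs less
```

The effect is an anisotropic neighborhood: it is elongated along irrelevant dimensions and shrunk along relevant ones, exactly the behavior the intuitive explanation above describes.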
CONCLUSION

Pattern classification faces a difficult challenge in finite settings and high-dimensional spaces, due to the curse of dimensionality. In this paper, we have presented and compared techniques that address data exploration tasks such as classification and clustering. All methods design adaptive metrics or parameter estimates that are local in input space in order to dodge the curse of dimensionality phenomenon. Such techniques have been demonstrated to be effective for achieving accurate predictions.

REFERENCES

Aha, D. (1997). Lazy learning. Artificial Intelligence Review, 11, 1-5.

Atkeson, C., Moore, A.W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11, 11-73.

Domeniconi, C., Peng, J., & Gunopulos, D. (2002). Locally adaptive metric nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1281-1285.

Friedman, J.H. (1994). Flexible metric nearest neighbor classification. Technical Report, Stanford University.

Hastie, T., & Tibshirani, R. (1996a). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 607-615.

Hastie, T., & Tibshirani, R. (1996b). Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, 58, 155-176.

Ho, T.K. (1998). Nearest neighbors in random subspaces. Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recognition.

Lowe, D.G. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7(1), 72-85.

Myles, J.P., & Hand, D.J. (1990). The multi-class metric problem in nearest neighbor discrimination rules. Pattern Recognition, 23(11), 1291-1297.

Short, R.D., & Fukunaga, K. (1981). Optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, 27(5), 622-627.

KEY TERMS

Classification: The task of inferring concepts from observations. It is a mapping from a measurement space into the space of possible meanings, viewed as finite and discrete target points (class labels). It makes use of training data.

Clustering: The process of grouping objects into subsets such that those within each cluster are more closely related to one another than objects assigned to different clusters, according to a given similarity measure.

Curse of Dimensionality: The phenomenon that, in high-dimensional spaces, data become extremely sparse and far apart from each other. As a result, the sample size required to perform an accurate prediction in problems with high dimensionality is usually beyond feasibility.

Kernel Methods: Pattern analysis techniques that work by embedding the data into a high-dimensional vector space and detecting linear relations in that space. A kernel function takes care of the embedding.

Local Feature Relevance: The amount of information that a feature carries to predict the class posterior probabilities at a given query.

Nearest Neighbor Methods: A simple approach to the classification problem. It finds the K nearest neighbors of the query in the training set and then predicts the class label of the query as the most frequent label occurring among the K neighbors.

Pattern: A structure that exhibits some form of regularity, able to serve as a model representing a concept of what was observed.

Recursive Partitioning: A learning paradigm that employs local averaging to estimate the class posterior probabilities for a classification problem.

Subspace Clustering: Simultaneous clustering of both row and column sets in a data matrix.

Training Data: A collection of observations (characterized by feature measurements), each paired with the corresponding class label.
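The Nearest Neighbor Methods entry can be made concrete with a minimal K-nearest-neighbor classifier. This is an illustrative implementation of the plain method, not of the adaptive variants discussed in the chapter:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, class_label) pairs.
    Finds the K nearest neighbors of the query in the training set and
    predicts the most frequent class label among them."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical two-class training set in 2-D.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_predict(train, (0.2, 0.1), k=3))  # 'A'
```

As the chapter notes, this classifier treats every dimension with equal strength; the locally adaptive techniques above replace `math.dist` with a query-dependent weighted metric.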
Logical Analysis of Data

Peter L. Hammer
RUTCOR, Rutgers University, USA

Toshihide Ibaraki
Kwansei Gakuin University, Japan
also may be interesting, as it reveals a monotone depen- least one vector in T satisfying this conjunction, but no
dence on the involved variables. vector in F satisfies it, where a literal is either a variable or
its complement. In the example of Tables 1 and 2,
MAIN THRUST

In many applications, the available data is not binary, which necessitates the generation of relevant binary features, to which the three-staged LAD analysis can then be applied. Following are some details about the main stages of LAD, including the generation of binary features, finding support sets, generating patterns, and constructing theories.

It is desirable from the viewpoint of simplicity to build an extension by using as small a set of variables as possible. A subset S of the n original variables is called a support set if the projections of T and F on S still have an extension. The pdBf in Table 1 has the following minimal support sets: S1={5,8}, S2={6,7}, S3={1,2,5}, S4={1,2,6}, S5={2,5,7}, S6={2,6,8}, S7={1,3,5,7}, S8={1,4,5,7}. For example, f1 in Table 2 is constructed from S1. Several methods to find small support sets are discussed and compared in Boros et al. (2003).

Pattern Generation

As a basic tool to construct an extension f, a conjunction of a set of literals is called a pattern of (T, F) if there is at least one data vector in T for which it is true, while it is false for all data vectors in F. For example, x̄5x8, x5x̄8, x1x5, … are patterns. The notion of a pattern is closely related to the association rule, which is commonly used in data mining. Each pattern captures a certain characteristic of (T, F) and forms a part of knowledge about the data set. Several types of patterns (prime, spanned, strong) have been analyzed in the literature (Alexe et al., 2002), and efficient algorithms for enumerating large sets of patterns have been described (Alexe & Hammer, 2004; Eckstein et al., 2004).

In many cases, it is known in advance that the given dataset has certain properties, such as monotone dependence on variables. To utilize such information, extensions by Boolean functions with the corresponding properties are important. For special classes of Boolean functions, such as monotone (or positive), Horn, k-DNF, decomposable, and threshold functions, the algorithms and complexity of finding extensions were investigated (Boros et al., 1995; Boros et al., 1998). If there is no extension in the specified class of Boolean functions, we may still want to find an extension in the class with the minimum number of errors. Such an extension is called the best-fit extension and is studied in Boros et al. (1998).
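The support-set condition above can be checked mechanically: since a pdBf admits an extension exactly when no vector lies in both T and F, a subset S is a support set exactly when the projections of T and F onto S remain disjoint. The following is a minimal sketch under that reading, with an illustrative pdBf rather than the article's Table 1.

```python
# Sketch: test whether a subset S of variable indices is a support set
# of a partially defined Boolean function (T, F). A pdBf has an
# extension iff no vector appears in both T and F, so S is a support
# set iff the projections of T and F onto S remain disjoint.
# The example data below is illustrative, not Table 1 from the text.

def project(vectors, S):
    """Restrict each binary vector to the index set S."""
    return {tuple(v[i] for i in sorted(S)) for v in vectors}

def is_support_set(T, F, S):
    """True if the projections of T and F on S still admit an extension."""
    return project(T, S).isdisjoint(project(F, S))

# Hypothetical pdBf over 4 variables (indices 0..3).
T = [(1, 0, 1, 0), (1, 1, 1, 1)]
F = [(0, 0, 1, 0), (0, 1, 0, 1)]

print(is_support_set(T, F, {0}))     # variable 0 alone separates T from F here
print(is_support_set(T, F, {2, 3}))  # projections collide on (1, 0)
```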
Logical Analysis of Data
KEY TERMS

Binarization: The process of deriving a binary representation for numerical and/or categorical attributes.

Boolean Function: A function from {0,1}^n to {0,1}. A function from a subset of {0,1}^n to {0,1} is called a partially defined Boolean function (pdBf). A pdBf is defined by a pair of datasets (T, F), where T (resp., F) denotes a set of data vectors belonging to the positive (resp., negative) class.

Extension: A Boolean function f that satisfies f(x)=1 for x ∈ T and f(x)=0 for x ∈ F for a given pdBf (T, F).

LAD (Logical Analysis of Data): A methodology that tries to extract and/or discover knowledge from datasets by utilizing the concept of Boolean functions.

Pattern: A conjunction of literals that is true for some data vectors in T but is false for all data vectors in F, where (T, F) is a given pdBf. A co-pattern is similarly defined by exchanging the roles of T and F.

Support Set: A set of variables S for a data set (T, F) such that the projections of T and F on S still have an extension.

Theory: A set of patterns of a pdBf (T, F) such that each data vector in T has a pattern satisfying it.
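The Extension key term translates directly into a check. The sketch below verifies a candidate Boolean function against a toy pdBf; the data and candidate function are illustrative, not taken from the article's tables.

```python
# Toy check of the Extension key term: f is an extension of the pdBf
# (T, F) when f(x) = 1 for every x in T and f(x) = 0 for every x in F.
# The pdBf and candidate function below are illustrative.

def is_extension(f, T, F):
    return all(f(x) == 1 for x in T) and all(f(x) == 0 for x in F)

T = [(1, 0, 1), (1, 1, 0)]
F = [(0, 0, 1), (0, 1, 1)]

f = lambda x: x[0]  # f(x) = x1: 1 on every T vector, 0 on every F vector here
print(is_extension(f, T, F))
```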
Klaus Truemper
University of Texas at Dallas, USA
INTRODUCTION

The method described in this chapter is designed for data mining and learning on logic data. This type of data is composed of records that can be described by the presence or absence of a finite number of properties. Formally, such records can be described by variables that may assume only the values true or false, usually referred to as logic (or Boolean) variables. In real applications, it may also happen that the presence or absence of some property cannot be verified for some record; in such a case we consider that variable to be unknown (the capability to treat data with missing values formally is a feature of logic-based methods). For example, to describe patient records in medical diagnosis applications, one may use the logic variables healthy, old, has_high_temperature, among many others. A very common data mining task is to find, based on training data, the rules that separate two subsets of the available records, or that explain the membership of the data in one subset or the other. For example, one may desire to find a rule that, based on the many variables observed in patient records, is able to distinguish healthy patients from sick ones. Such a rule, if sufficiently precise, may then be used to classify new data and/or to gain information from the available data. This task is often referred to as machine learning or pattern recognition and accounts for a significant portion of the research conducted in the data mining community. When the data considered is in logic form or can be transformed into it by some reasonable process, it is of great interest to determine explanatory rules in the form of combinations of logic variables, or logic formulas. In the example above, a rule derived from data could be:

if (has_high_temperature is true) and (running_nose is true) then (the patient is not healthy).

Clearly such rules convey a lot of information and can be easily understood and interpreted by domain experts. Despite the apparent simplicity of this setting, the problem of determining, if possible, a logic formula that holds true for all the records in one set while it is false for all records in another set can become extremely difficult when the dimension involved is not trivial, and many different techniques and approaches have been proposed in the literature. In this article we describe one of them, the Lsquare System, developed in collaboration between IASI-CNR and UTD and described in detail in Felici and Truemper (2002), Felici, Sun, and Truemper (2004), and Truemper (2004). The system is freely distributed for research and study purposes at www.leibnizsystem.com.

Data mining in logic domains is becoming a very interesting topic for both research and applications. The motivations for the study of such models are frequently found in real-life situations where one wants to extract usable information from data expressed in logic form. Besides medical applications, these types of problems often arise in marketing, production, banking, finance, and credit rating. A quick scan of the updated Irvine Repository (see Murphy & Aha, 1994) is sufficient to show the relevance of logic-based models in data mining and learning applications. The literature describes several methods that address learning in logic domains, for example the very popular decision trees (Breiman et al., 1984), the highly combinatorial approach of Boros et al. (1996), the interior point method of Kamath et al. (1992), or the partial enumeration scheme proposed by Triantaphyllou and Soyster (1996). While the problem formulation adopted by Lsquare is somewhat related to the work in Kamath et al. (1992) and Triantaphyllou et al. (1994), substantial differences are found in the solution method adopted. Most of the methods considered in this area are of an intrinsically deterministic nature, being based on the formal description of a problem in mathematical form and on its solution by a specific algorithm. Nevertheless, some real-life situations present uncertainty and errors in the data that are often successfully dealt with by the use of fuzzy set and fuzzy membership theory. In such cases the proposed system may embed the uncertainty and the fuzziness of the data in a pre-processing step, providing fuzzy functions that determine the value of the Boolean variables.
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
The Lsquare System for Mining Logic Data
We consider {0, +1, −1} vectors of given length n, each of which has an associated outcome with value true or false. We call these vectors records of logic data and view them as an encoding of logic information. A 1 in a record means that a certain Boolean variable has value true, and a −1 that the variable has value false. The value 0 is used for unknown. The outcome is considered to be the value of a Boolean variable t that we want to explain or predict. We collect the records for which the property t is present in a set A, and those for which t is not present in a set B. For ease of recognition, we usually denote a member of A by a, and of B by b. The Lsquare system deduces {0, +1, −1} separating vectors that effectively represent logic formulas and may be used to compute for each record the associated outcome, that is, to separate the records in A from the records in B. A separating set is a collection of separating vectors. The separation of A and B makes sense only when both A and B are non-empty, and when each record of A or B contains at least one ±1 entry. Consider two records

We can express the separation conditions (1) and (2) with the Boolean variables pi and qi. For (1) we have that s must not be nested in any b ∈ B. Defining b+ as the set of indices i for which bi of b is equal to 1, that is, b+ = {i | bi = 1}, and similarly b− = {i | bi = −1} and b0 = {i | bi = 0}, we summarize condition (1) by writing that

(∨ i∈(b+ ∪ b0) qi) ∨ (∨ i∈(b− ∪ b0) pi); ∀ b ∈ B (4)

For condition (2) we have to enforce that, if s separates a from B, then s is nested in a. In order to do so we introduce a new Boolean variable da that determines whether s must separate a from B. That is, da = true means that s need not separate a from B, while da = false requires that separation. For a ∈ A, the separation condition is:

¬qi ∨ da; i ∈ (a+ ∪ a0)
¬pi ∨ da; i ∈ (a− ∪ a0) (5)
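The nestedness relation that conditions (4) and (5) encode can be sketched directly. The snippet below assumes the standard reading implied by the text: a {0, +1, −1} vector s is nested in a record r when every nonzero entry of s agrees with r, and s separates a from B (conditions (1)-(2)) when s is nested in a but in no b of B. The records used are illustrative.

```python
# Sketch of the nestedness relation assumed by conditions (4)-(5):
# a {0, +1, -1} vector s is taken to be nested in a record r when
# every nonzero entry of s agrees with the corresponding entry of r.
# Records and vectors here are illustrative.

def nested(s, r):
    return all(si == 0 or si == ri for si, ri in zip(s, r))

def separates(s, a, B):
    """s separates record a from the set B: s is nested in a
    but in no record b of B, as conditions (1)-(2) are summarized."""
    return nested(s, a) and not any(nested(s, b) for b in B)

a = (1, -1, 0, 1)
B = [(1, 1, 0, 1), (-1, -1, 0, 1)]
s = (1, -1, 0, 0)

print(nested(s, a))        # True: s agrees with a on its nonzero entries
print(separates(s, a, B))  # True: s fails to nest in every b of B
```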
We now formulate the problem of determining a vector s that separates as many a ∈ A from B as possible (this amounts to a satisfying solution for (3)-(5) that assigns value true to as few variables da as possible). For each a ∈ A, define a rational cost ca that is equal to 1 if da is true, and equal to 0 otherwise. Using these costs and (3)-(5), the desired s may be found by solving the following MINSAT problem, with variables da, a ∈ A, and pi, qi, i = 1, 2, …, n.

min Σ a∈A ca
s.t. ¬pi ∨ ¬qi, i = 1, 2, …, n
(∨ i∈(b+ ∪ b0) qi) ∨ (∨ i∈(b− ∪ b0) pi), ∀ b ∈ B (6)
¬qi ∨ da, ∀ a ∈ A, i ∈ (a+ ∪ a0)
¬pi ∨ da, ∀ a ∈ A, i ∈ (a− ∪ a0)

The solution of problem (6) identifies an s and a largest subset A′ = {a ∈ A | da = false} that is separated from B by s. Restricting the problem to A′ and using costs c(pi) and c(qi) associated with the variables pi and qi, we define a second objective function and solve the problem again, obtaining a separating vector whose properties depend on the c(pi) and c(qi). A simple example of the role of the cost values c(pi) and c(qi) is the following. Assume that c(pi) and c(qi) assign a cost of 1 when pi and qi are true and a cost of 0 when they are false. The separating vector will then use the minimum number of logic variables, that is, the minimum amount of information contained in the data, to separate the sets. On the contrary, if we assign cost 0 for true and 1 for false, it will use the maximum amount of information to define the separating sets. If one separating vector is not sufficient to separate all of set A, we set A ← A − A′ and iterate. The disjunction of all separating vectors constitutes the final logic formula that separates A from B.

The Leibniz System

The MINSAT problems considered belong to the class of NP-hard problems, which means, shortly, that no algorithm is known whose solution time is bounded by a polynomial function in the size of the input, as explained by the modern theory of computational complexity (Garey & Johnson, 1979). We solve the MINSAT instances with the Leibniz System, a logic programming solver developed at the University of Texas at Dallas. The solver is based on decomposition and combinatorial optimization results described in Truemper (1998).

Error Control and Voting System

In learning problems one views the sets A and B as training sets, and considers them to be subsets of sets 𝒜 and ℬ, where 𝒜 consists of all {0, +1, −1} records of length n with property t, and ℬ consists of all such records without property t. One then determines a set S that separates B from A, and uses that set to guess whether a new vector r is in A or B. That is, we guess r to be in A if at least one s ∈ S is nested in r, and to be in B otherwise. Of course, the classification of r based on S is correct if r is in A or B, but otherwise need not be correct. Specifically, we may guess a record of 𝒜 − A to be in B, and a record of ℬ − B to be in A. Usually such errors are referred to as type I and type II errors, respectively. The utility of S depends on which type of error is made how many times. In some settings, an error of one of the two types is bad, but an error of the other type is worse. For example, a non-invasive diagnostic system for cancer that claims a case to be benign when a malignancy is present has failed badly. On the other hand, prediction of a malignancy for an actually benign case triggers additional tests, and thus is annoying but not nearly as objectionable as an error of the first type. We can influence the extent of type I and type II errors by an appropriate choice of the objective function c(pi) and c(qi). As anticipated, when c(pi) = c(qi) = 1 for all i = 1, …, n, the formulas determined by the solution of the sequence of MINSAT problems will have minimal support, that is, will try to use the minimum number of variables to separate A from B. On the other hand, setting c(pi) = c(qi) = −1 for all i = 1, …, n, we will obtain the opposite effect, that is, a formula with maximum support. If we use a single vector s to classify a vector r, then we guess r to be in A if s is nested in r. The latter condition tends to become less stringent when the number of nonzero entries in s is reduced. Hence, we heuristically guess that a solution vector s with minimum support tends to avoid type I errors. Conversely, an s with maximum support tends to avoid type II errors. We apply this heuristic argument to the separating set S produced under one of the two choices of objective functions, and thus expect that a set S with minimum (resp. maximum) support tends to avoid type I (resp. II) errors. The above considerations are combined with a sophisticated voting scheme embedded in Lsquare, partly inspired by the notion of stacked generalization originally described in Wolpert (1992) (see also Breiman, 1996). This scheme refines the separating procedure described in the previous section by a resampling technique over the logic records available for training. For each subsample we determine 4 separating formulas, by switching the roles of A and B in the MINSAT formulation and by switching the signs of the cost coefficients. Then, each formula so determined is used to produce a vote for each element of A and B, the vote being +1 (resp. −1) if the element is recognised as belonging to A (resp. B). The sum of all the votes so obtained is the vote total V. If the vote V is positive (resp. negative), then the record belongs to A (resp. to B). If A and B are representative subsets of 𝒜 and ℬ, then the
Marketing Data Mining
sampling), and creating a cell design structure (testing various offers and also by age, income, or other variables). I focus on the latter here.

Problem 1

Classical designs often test one variable at a time. For example, in a cell phone direct mail campaign, you may test a few price levels of the phone. After launching the campaign and uncovering the price level that led to the highest revenue, another campaign is launched to test a monthly fee, a third campaign tests the direct mail message, and so forth. A more efficient way is to structure the cell design such that all these variables are testable in one campaign. Consider an example: A credit card company would like to determine the best combination of treatments for each prospect; the treatment attributes and attribute levels are summarized in Table 1. The number of all possible combinations is 4⁴ × 2² = 1,024 cells, which is not practical to test.

Solution 1

To reduce the number of cells, a fractional factorial design can be applied (full factorial refers to the design that includes all possible combinations); see Montgomery (1991) and Almquist and Wyner (2001). Two types of fractional factorials are (a) an orthogonal design, where all attributes are made orthogonal (uncorrelated) with each other, and (b) an optimal design, where a certain criterion related to the variance-covariance matrix of parameter estimates is optimized; see Kuhfeld (1997, 2004) for the applications of SAS PROC FACTEX and PROC OPTEX in market research. (Kuhfeld's market research applications are also applicable to database marketing.)

For the preceding credit card problem, an orthogonal fractional factorial design using PROC FACTEX in SAS with estimable main effects, all two-way interaction effects, and quadratic effects on quantitative variables generates a design of 256 cells, which may still be considered large. An optimal design using PROC OPTEX with the same estimable effects generates a design of only 37 cells (see Table 2; refer to Table 1 for attribute level definitions).

Fractional factorial design has been used in credit card acquisition but is not widely used in other industries for marketing. Two reasons are (a) a lack of experimental design knowledge and experience and (b) a business process requirement: tight coordination with list selection and creative design professionals is required.

Opportunity 1

Mayer and Sarkissien (2003) proposed using individual characteristics as attributes in the optimal design, where individuals are chosen optimally. Using both individual characteristics and treatment attributes as design attributes is theoretically interesting. In practice, we should compare this optimal selection of individuals with stratified random sampling. Simulation and theoretical and empirical studies are required to evaluate this idea. Additionally, if many individual variables (say, hundreds) are used in the design, then constructing an optimal design may be very computationally intensive due to the large design matrix, and thus the design may require a unique optimization technique to solve.

RESPONSE MODELING

Problem 2

As stated in the introduction, response modeling uses data from a previous marketing campaign to identify
ΔPi ≡ Pi|treatment − Pi|control
= exp(α + γ + β′Xi + δ′Xi) / (1 + exp(α + γ + β′Xi + δ′Xi)) − exp(α + β′Xi) / (1 + exp(α + β′Xi)) (3)

An extension of Equation (2) is to incorporate both treatment attributes and individual characteristics in the same lift-based response model:

Pi = exp(α + β′Xi + γTi + δ′XiTi + θ′Zi + λ′ZiXi) / (1 + exp(α + β′Xi + γTi + δ′XiTi + θ′Zi + λ′ZiXi)) (4)

where Zi is the vector of treatment attributes, and θ and λ are additional parameters (Zi = 0 if Ti = 0).

max Σi Σj πij xij
subject to:
Σi xij ≤ max. # of individuals receiving treatment combination j,
Σi Σj cij xij ≤ expense budget, plus other relevant constraints,
xij = 0 or 1,
where πij = inc. value received by sending treatment comb. j to individual i,
cij = cost of sending treatment comb. j to individual i. (5)
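The lift in Equation (3) can be evaluated directly once the logistic coefficients have been estimated. The sketch below uses made-up parameter values; the names alpha, beta, gamma, and delta are placeholders rather than the article's notation.

```python
# Sketch of the lift in Equation (3): the difference between treated and
# untreated response probabilities under a logistic model
# P = exp(eta) / (1 + exp(eta)). All coefficient values are made up for
# illustration; alpha/beta/gamma/delta are placeholder names.
import math

def logistic(eta):
    return math.exp(eta) / (1.0 + math.exp(eta))

def lift(x, alpha, beta, gamma, delta):
    """P_i|treatment - P_i|control for an individual with covariates x."""
    eta_treated = (alpha + gamma
                   + sum(b * xi for b, xi in zip(beta, x))
                   + sum(d * xi for d, xi in zip(delta, x)))
    eta_control = alpha + sum(b * xi for b, xi in zip(beta, x))
    return logistic(eta_treated) - logistic(eta_control)

x = [1.0, 0.5]  # individual characteristics X_i
print(lift(x, alpha=-1.0, beta=[0.8, 0.2], gamma=0.7, delta=[0.3, -0.1]))
```

Individuals can then be ranked by this incremental probability, which is the quantity the optimization problem (5) weighs against treatment cost.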
Material Acquisitions Using Discovery Informatics Approach

Tzai-Zang Lee
National Cheng Kung University, Taiwan, ROC

BACKGROUND

Material searchers regularly spend large amounts of time acquiring resources for enormous numbers of library users. Therefore, something significant should be relied on to produce the acquisition recommendation list for the limited budget (Whitmire, 2002). Major resources for material acquisitions are, in general, the personal collections of the librarians and recommendations by users, departments, and vendors (Stevens, 1999). The collections provided by these collectors are usually determined by their individual preferences, rather than by a global view, and thus may not be adequate for the material acquisitions to rely on. Information in the usage data may show something different from the collectors' recommendations (Hamaker, 1995). For example, knowing which materials were most utilized by the patrons would be highly useful for material acquisitions.

First, circulation statistics is one of the most significant references for library material acquisition decisions (Budd & Adams, 1989; Tuten & Lones, 1995; Pu, Lin, Chien, & Juan, 1999). It is a reliable factor by which to evaluate the success of material utilization (Wise & Perushek, 2000). Second, the data-mining technique with a capability of description and prediction

MAIN THRUST

Utilization discovery as a base of material acquisitions, comprising a combination of association utilization and statistics utilization, is discussed in this article. The association utilization is derived by a data-mining technique. Systemically, when data mining is applied in the field of material acquisitions, it follows six stages: collecting datasets, preprocessing collected datasets, mining preprocessed datasets, gathering discovery informatics, interpreting and implementing discovered informatics, and evaluating discovered informatics. The statistics utilization is simply the sum of numeric values of strength for all different types of categories in preprocessed circulation data tables (Kao et al., 2003). These stages need both domain experts and data miners to accomplish the tasks successfully.

Collecting Datasets

Most libraries have employed computer information systems to collect circulation data that mainly includes user's identifier, name, address, and department for a user; identifier, material category code, name, author, publisher, and publication date for a material; and user's
identifier, material identifier, material category code, date borrowed, and date returned for a transaction. In order to consider the importance of a material category, a data table must be created to define the degree of importance that a material presents to a department (or a group of users). For example, five scales of degree can be absolutely matching, highly matching, matching, likely matching, and absolutely not matching, and their importance strengths can be defined as 0.4, 0.3, 0.2, 0.1, and 0.0, respectively (Kao et al., 2003).

Preprocessing Data

Preprocessing data may involve operations of refinement and reconstruction of data tables, consistency of multityped data tables, elimination of redundant (or unnecessary) attributes, combination of highly correlated attributes, and discretization of continuous attributes. Two operations in this stage for material acquisitions are the elimination of unnecessary attributes and the reconstruction of data tables. For the elimination of unnecessary attributes, four data tables are preprocessed to derive the material utilization. They are the users table (two attributes: department identifier and user identifier), the category table (two attributes: material identifier and material category code), the circulation table (three attributes: user identifier, material category code, and date borrowed), and the importance table (three attributes: department identifier, material category identifier, and importance). For the reconstruction of data tables, a new table can be generated that contains attributes of department identifier, user identifier, material category code, strength, and date borrowed.

Mining Data

Mining mechanisms can perform knowledge discovery in the form of association, classification, regression, clustering, and summarization/generalization (Hirota & Pedrycz, 1999). Association, with a form of If Condition Then Conclusion, captures relationships between variables. Classification is to categorize a set of data based on the values of the defined attributes. Regression is to derive a prediction model by altering the independent variables for the dependent one(s) in a defined database. Clustering is to put together physical or abstract objects into a class based on similar characteristics. Summarization/generalization is to abridge the general characteristics over a set of defined attributes in a database.

The association informatics can be employed in material acquisitions. Like a rule, it takes the form of P=>Q (α, β), where P and Q are material categories, and α and β are support and confidence, respectively (Meo, Psaila, & Ceri, 1998). P is regarded as the condition, and Q as the conclusion, meaning that P can produce Q implicitly. For example, an association rule Systems => Organizations & Management (0.25, 0.33) means, "If materials in the category of Systems were borrowed in a transaction, materials in Organizations & Management were also borrowed in the same transaction, with a support of 0.25 and a confidence of 0.33." Support is defined as the ratio of the number of transactions observed to the total number of transactions, whereas confidence is the ratio of the number of transactions to the number of conditions. Although association rules having the form of P=>Q (α, β) can be generated in a transaction, the inverse association rules and single material categories in a transaction also need to be considered.

When two categories (C1 and C2) are utilized in a transaction, it is difficult to determine the association among C1=>C2, C2=>C1, and both. A suggestion from librarians is to take the third one (both) as the decision for this problem (Wu, Lee, & Kao, 2004). This is also supported by the study of Meo et al. (1998), which deals with association rule generation in customer purchasing transactions. The support and confidence of C1=>C2 may be different from those of C2=>C1. As a result, the inverse rules are considered as an extension for the transactions that contain two or more categories to determine the number of association rules. The number of rules can be determined via 2*[n*(n-1)/2], where n is the number of categories in a transaction. For example, if {C1, C2, C3} are the categories of a transaction, then 6 association rules are produced: {C1=>C2, C1=>C3, C2=>C3, C2=>C1, C3=>C1, C3=>C2}. Unreliable association rules may occur because their supports and confidences are too small. Normally, there is a predefined threshold on the values of support and confidence to filter out the unreliable association rules. Only when the support and confidence of a rule satisfy the defined threshold is the rule regarded as a reliable rule. However, no reliable evidence exists so far for determining the threshold. It mostly depends on how reliable the management would like the discovered rules to be. For a single category in a transaction, only the condition part, without support and confidence, is considered, because of the computation of support and confidence for other transactions.

Another problem is redundant rules in a transaction. It is realized that an association rule is to reveal the company of a certain kind of material category, independent of the number of its occurrences. Therefore, all redundant rules are eliminated. In other words, there is only one rule for a particular condition and only one conclusion in a transaction. Also, the importance of a material to a
department is omitted. However, the final material utilization will take this concern into account when the combination with statistics utilization is performed.

Gathering Discovery Informatics

The final material utilization as the discovery informatics contains two parts. One is statistics utilization, and the other is association utilization (Wu et al., 2004). It is expressed as Formula 1 for a material category C:

MatU(C) = nC + Σk nk * (α * support + β * confidence) (1)

where

MatU(C): material utilization for category C
nC: statistics utilization
nk: statistics utilization of the kth category that can produce C
α: intensity of support
support: number of support
β: intensity of confidence
confidence: number of confidence

Interpretation of discovery informatics can be performed by any visualization technique, such as a table, figure, graph, animation, diagram, and so forth. The main discovery informatics for material acquisition comprise three tables indicating statistics utilization, association rules, and material utilization. The statistics utilization table lists each material category and its utilization. The association rule table has four attributes: condition, conclusion, support, and confidence. Each tuple in this table represents an association rule. The association utilization is computed according to this table. The material utilization table has five attributes: material category code, statistics utilization, association utilization, material utilization, and percentage. In this table, the value of the material utilization is the sum of statistics utilization and association utilization. For each material category, the percentage is the ratio of its utilization to the total utilization. Implementation deals with how to utilize the discovered informatics. The material utilization can be used as a base of material acquisitions by which informed decisions about allocating the budget are made.

Evaluating Discovered Informatics

Performance of the discovered informatics needs to be tested. Criteria used can be validity, significance/uniqueness, effectiveness, simplicity, and generality (Hirota & Pedrycz, 1999). Validity looks at whether the discovered informatics is practically applicable. Uniqueness/significance deals with how different the discovered informatics are from the knowledge that library management already has. Effectiveness is to see the impact the discovered informatics has on the decision that has been made and implemented. Simplicity looks at the degree of understandability, while generality looks at the degree of scalability. The criteria used to evaluate the discovered material utilization for material acquisitions are in particular uniqueness/significance and effectiveness. The uniqueness/significance can show that material utilization is based not only on statistics utilization, but also on association utilization. The solution of effectiveness evaluation can be found by answering the questions "Do the discovered informatics significantly help reflect the information categories and subject areas of materials requested by users?" and "Do the discovered informatics significantly help enhance material utilizations for next year?"

Digital libraries using innovative Internet technology promise a new information service model, where library materials are digitized for users to access anytime from anywhere. In fact, it is almost impossible for a library to provide patrons with all the materials available because of budget limitations. Having collections that closely match the patrons' needs is a primary goal for material acquisitions. Libraries must be centered on users and based on contents while building a global digital library (Kranich, 1999). This results in the increased necessity of discovery informatics technology. Advanced research tends toward integrated studies that may have requests for information on different subjects (material categories). The material associations discovered in circulation databases may reflect these requests.

Library management has paid increased attention to easing access, filtering and retrieving knowledge sources, and bringing new services onto the Web, and users are industriously looking for their needs and figuring out what is really good for them. Personalized information service becomes urgent. The availability
The availability of accessing materials via the Internet is rapidly changing libraries' strategy from print to digital forms. For example, what can be relied on when deciding which electronic journals or e-books are required for a library? How do libraries deal with the number of login names and the number of users entering when arrival analysis is concerned? And how do libraries create personalized virtual shelves for patrons by analyzing their transaction profiles? Furthermore, data collection via daily circulation operations may be greatly impacted by the way a user makes use of online materials, which, as a consequence, makes the material acquisitions operation even more difficult. Discovery informatics technology can help find solutions for these issues.

CONCLUSION

Material acquisition is an important library operation that involves both technology and management. Circulation data are more than records of material usage. Discovery informatics is an active domain connected to data processing, machine learning, information representation, and management, and it has shown substantial value as an aid in decision making. Data mining is an application-dependent issue, and applications in a domain need adequate techniques to deal with them. Although discovery informatics depends highly on the technologies used, its use with respect to applications in a domain still needs more effort to concurrently benefit management capability.

REFERENCES

Bloss, A. (1995). The value-added acquisitions librarian: Defining our role in a time of change. Library Acquisitions: Practice & Theory, 19(3), 321-330.

Budd, J. M., & Adams, K. (1989). Allocation formulas in practice. Library Acquisitions: Practice & Theory, 13, 381-390.

Hamaker, C. (1995). Time series circulation data for collection development; Or, you can't intuit that. Library Acquisitions: Practice & Theory, 19(2), 191-195.

Hirota, K., & Pedrycz, W. (1999). Fuzzy computing for data mining. Proceedings of the IEEE, 87(9), 1575-1600.

Kao, S. C., Chang, H. C., & Lin, C. H. (2003). Decision support for the academic library acquisition budget allocation via circulation database mining. Information Processing & Management, 39(1), 133-147.

Kranich, N. (1999). Building a global digital library. In C.-C. Chen (Ed.), IT and global digital library development (pp. 251-256). West Newton, MA: MicroUse Information.

Lu, H., Feng, L., & Han, J. (2000). Beyond intratransaction association analysis: Mining multidimensional intertransaction association rules. ACM Transactions on Information Systems, 18(4), 423-454.

Meo, R., Psaila, G., & Ceri, S. (1998). An extension to SQL for mining association rules. Data Mining & Knowledge Discovery, 2, 195-224.

Pu, H. T., Lin, S. C., Chien, L. F., & Juan, Y. F. (1999). Exploration of practical approaches to personalized library and networked information services. In C.-C. Chen (Ed.), IT and global digital library development (pp. 333-343). West Newton, MA: MicroUse Information.

Stevens, P. H. (1999). Who's number one? Evaluating acquisitions departments. Library Collections, Acquisitions, & Technical Services, 23, 79-85.

Tuten, J. H., & Jones, B. (1995). Allocation formulas in academic libraries. Chicago, IL: Association of College and Research Libraries.

Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group.

Whitmire, E. (2002). Academic library performance measures and undergraduates' library use and educational outcomes. Library & Information Science Research, 24, 107-128.

Wise, K., & Perushek, D. E. (2000). Goal programming as a solution technique for the acquisition allocation problem. Library & Information Science Research, 22(2), 165-183.

Wu, C. H. (2003). Data mining applied to material acquisition budget allocation for libraries: Design and development. Expert Systems with Applications, 25(3), 401-411.

Wu, C. H., Lee, T. Z., & Kao, S. C. (2004). Knowledge discovery applied to material acquisitions for libraries. Information Processing & Management, 40(4), 709-725.

KEY TERMS

Association Rule: The implication of connections between variables explored in databases, having the form A → B, where A and B are disjoint subsets of a dataset of binary attributes.
Circulation Database: The information on material usage stored in a database, including user identifier, material identifier, dates the material is borrowed and returned, and so forth.

Digital Library: A library that provides the resources to select, structure, offer, access, distribute, preserve, and maintain the integrity of collections of digital works.

Discovery Informatics: Knowledge explored in databases in the form of association, classification, regression, summarization/generalization, and clustering.

Material Acquisition: A process of collecting material information from recommendations of users, vendors, colleges, and so forth. Information explored in databases can also be used. The collected information is, in general, used in purchasing materials.

Material Category: A set of library materials with similar subjects.
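As a small illustration of the Association Rule entry (notation A → B), the sketch below computes support and confidence for one candidate rule over borrowing transactions. The transactions, subject names, and the candidate rule are invented for illustration and do not come from the article.

```python
# Hedged illustration of an association rule A -> B over hypothetical
# borrowing transactions (each transaction is a set of subjects).
def support_confidence(transactions, antecedent, consequent):
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)     # transactions containing A
    n_ab = sum(1 for t in transactions if ab <= t)   # transactions containing A and B
    support = n_ab / len(transactions)
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

borrows = [
    {"databases", "data mining"},
    {"databases", "data mining", "statistics"},
    {"databases", "networks"},
    {"statistics"},
]
# Rule {databases} -> {data mining}
s, c = support_confidence(borrows, {"databases"}, {"data mining"})
```
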
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Materialized Hypertext View Maintenance
These techniques, like many others defined previously, are now being applied to XML data by various researchers (Braganholo et al., 2003; Chen et al., 2002; Alon et al., 2003; Zhang et al., 2003; Shanmugasundaram et al., 2001). The most challenging issue for long-term research is probably that of extending hypertext incremental maintenance to the case where data come from many heterogeneous, autonomous, and distributed databases.

REFERENCES

Alon, N. et al. (2003). Typechecking XML views of relational databases. ACM Transactions on Computational Logic, 4(3), 315-354.

Blakeley, J. et al. (1986). Efficiently updating materialized views. In ACM SIGMOD International Conference on Management of Data (SIGMOD '86) (pp. 61-71).

Braganholo, V. P. et al. (2003). On the updatability of XML views over relational databases. In WebDB (pp. 31-36).

Bunker, C. J. et al. (2001). Aggregate maintenance for data warehousing in Informix Red Brick Vista. In VLDB 2001 (pp. 659-662).

Chen, Y. B. et al. (2000). Designing valid XML views. In Entity Relationship Conference (pp. 463-478).

Fernandez, M. F. et al. (2000). Declarative specification of Web sites with Strudel. The VLDB Journal, 9(1), 38-55.

Gupta, A. et al. (2001). Adapting materialized views after redefinitions: Techniques and a performance study. Information Systems, 26(5), 323-362.

Labrinidis, A., & Roussopoulos, N. (2000). WebView materialization. In SIGMOD '00 (pp. 367-378).

Labrinidis, A., & Roussopoulos, N. (2001). Update propagation strategies for improving the quality of data on the Web. In VLDB (pp. 391-400).

Paraboschi, S. et al. (2003). Materialized views in multidimensional databases. In Multidimensional databases (pp. 222-251). Hershey, PA: Idea Group Publishing.

Shanmugasundaram, J. et al. (2001). Querying XML views of relational data. In VLDB (pp. 261-270).

Sindoni, G. (1998). Incremental maintenance of hypertext views. In Proceedings of the Workshop on the Web and Databases (WebDB '98) (in conjunction with EDBT '98). LNCS 1590 (pp. 98-117). Berlin: Springer-Verlag.

Vista, D. (1998). Integration of incremental view maintenance into query optimizers. In EDBT (pp. 374-388).

Zhang, X. et al. (2003). Rainbow: Multi-XQuery optimization using materialized XML views. In SIGMOD Conference (p. 671).

Zhuge, Y. et al. (1995). View maintenance in a warehousing environment. In SIGMOD Conference (pp. 316-327).

KEY TERMS

Database Status: The structure and content of a database at a given time stamp. It comprises the database object classes, their relationships, and their object instances.

Deferred Maintenance: The policy of not performing database maintenance operations when their need becomes evident, but postponing them to a later moment.

Dynamic Web Pages: Virtual pages dynamically constructed after a client request. The request is usually managed by a specific program or described using a specific query language whose statements are embedded into pages.

Immediate Maintenance: The policy of performing database maintenance operations as soon as their need becomes evident.

Link Consistency: The ability of hypertext network links to always point to an existing and semantically coherent target.

Materialized Hypertext: A hypertext dynamically generated from an underlying database and physically stored as a marked-up text file.

Semistructured Data: Data with a structure not as rigid, regular, or complete as that required by traditional database management systems.

ENDNOTES

1. For more details, see the article "Materialized Hypertext Views."

2. A page scheme is essentially the abstract representation of pages with the same structure, and a page scheme instance is a page with the structure described by the page scheme. For the definition of page scheme and instance, see the article "Materialized Hypertext Views."
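The Immediate and Deferred Maintenance entries can be made concrete with a minimal sketch. All class and method names below are invented for illustration: immediate maintenance regenerates the materialized page on every source change, while deferred maintenance queues changes and applies them later in one batch.

```python
# Hedged sketch contrasting immediate vs. deferred maintenance of a
# materialized hypertext page. All names are illustrative inventions.
class MaterializedPage:
    def __init__(self, rows, immediate=True):
        self.rows = list(rows)
        self.immediate = immediate
        self.pending = []            # changes awaiting deferred maintenance
        self.html = self._render()

    def _render(self):
        items = "".join(f"<li>{r}</li>" for r in self.rows)
        return f"<ul>{items}</ul>"

    def insert(self, row):
        if self.immediate:
            self.rows.append(row)
            self.html = self._render()   # maintain the view right away
        else:
            self.pending.append(row)     # postpone to a later moment

    def refresh(self):
        """Apply all deferred changes in one batch."""
        self.rows.extend(self.pending)
        self.pending.clear()
        self.html = self._render()

page = MaterializedPage(["a"], immediate=False)
page.insert("b")    # the stored page is now stale
page.refresh()      # deferred maintenance brings it up to date
```

The trade-off mirrors the one in the article: immediate maintenance keeps links and content consistent at the cost of work on every update, while deferred maintenance batches work but serves stale pages in between.
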
Materialized Hypertext Views
according to a specific model. Pages may be mapped on the database and automatically generated using a programming language. To allow external applications to access these metadata, a materialized approach to page generation can be adopted.

The massive diffusion of XML as a preferred means for describing a Web page's structure and publishing it on the Internet is facilitating integrated access to heterogeneous, distributed data sources: the Web is rapidly becoming a repository of global knowledge. The research challenge for the 21st century will probably be to provide global users with applications to efficiently and effectively find the required information. This could be achieved by utilizing models, methods, and tools which have already been developed for knowledge discovery and data warehousing in more controlled and local environments.

REFERENCES

Agosti, M. et al. (1995). Automatic authoring and construction of hypertext for information retrieval. Multimedia Systems, 3, 15-24.

Aguilera, V. et al. (2002). Views in a large-scale XML repository. The VLDB Journal, 11(3), 238-255.

Balasubramanian, V. et al. (2001). A case study in systematic hypertext design. Information Systems, 26(4), 295-320.

Baresi, L. et al. (2000). From Web sites to Web applications: New issues for conceptual modeling. In Entity Relationship (Workshops) (pp. 89-100).

Beeri, C. et al. (1998). WebSuite: A tools suite for harnessing Web data. In Proceedings of the Workshop on the Web and Databases (WebDB '98) (in conjunction with EDBT '98). Lecture Notes in Computer Science (Vol. 1590) (pp. 152-171).

Crestani, F., & Melucci, M. (2003). Automatic construction of hypertexts for self-referencing: The hyper-text book project. Information Systems, 28(7), 769-790.

Fernandez, M. F. et al. (2000). Declarative specification of Web sites with Strudel. The VLDB Journal, 9(1), 38-55.

Fraternali, P., & Paolini, P. (1998). A conceptual model and a tool environment for developing more scalable, dynamic, and customizable Web applications. In VI International Conference on Extending Database Technology (EDBT '98) (pp. 421-435).

Mecca, G. et al. (1999). Araneus in the era of XML. IEEE Data Engineering Bulletin, 22(3), 19-26.

Merialdo, P. et al. (2003). Design and development of data-intensive Web sites: The Araneus approach. ACM Transactions on Internet Technology, 3(1), 49-92.

Rossi, G., & Schwabe, D. (2002). Object-oriented design structures in Web application models. Annals of Software Engineering, 13(1-4), 97-110.

Simeon, G., & Cluet, S. (1998). Using YAT to build a Web server. In Proceedings of the Workshop on the Web and Databases (WebDB '98) (in conjunction with EDBT '98). Lecture Notes in Computer Science (Vol. 1590) (pp. 118-135).

Sindoni, G. (1999). Maintenance of data and metadata in Web-based information systems. PhD thesis, Università degli studi di Roma "La Sapienza".

World Wide Web Consortium. (2004). XML Query (XQuery). Retrieved August 23, 2004, from http://www.w3.org/XML/Query

KEY TERMS

Dynamic Web Pages: Virtual pages dynamically constructed after a client request. Usually, the request is managed by a specific program or is described using a specific query language whose statements are embedded into pages.

HTML: The Hypertext Markup Language. A language based on labels to describe the structure and layout of a hypertext.

HTTP: The HyperText Transfer Protocol. An Internet protocol used to implement communication between a Web client, which requests a file, and a Web server, which delivers it.

Knowledge Management: The practice of transforming the intellectual assets of an organization into business value.

Materialized Hypertext View: A hypertext containing data coming from a database and whose pages are stored in files.

Metadata: Data about data. Structured information describing the nature and meaning of a set of data.

XML: The eXtensible Markup Language. An evolution of HTML, aimed at separating the description of the hypertext structure from that of its layout.
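The Materialized Hypertext View entry (pages generated from a database and stored in files) can be sketched as follows. The table schema, row contents, and file-naming scheme are invented for illustration: each row becomes one page-scheme instance, physically stored as a marked-up text file.

```python
# Hedged sketch of materializing a hypertext view: query a database and
# store each generated page as a marked-up file. Schema, data, and file
# layout are invented for illustration.
import pathlib
import sqlite3
import tempfile

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT)")
db.executemany("INSERT INTO paper VALUES (?, ?)",
               [(1, "Views in XML repositories"), (2, "Web site design")])

out = pathlib.Path(tempfile.mkdtemp())
for pid, title in db.execute("SELECT id, title FROM paper ORDER BY id"):
    # One page per row, physically stored as a marked-up text file.
    (out / f"paper_{pid}.html").write_text(
        f"<html><body><h1>{title}</h1></body></html>")
```

When a source row changes, the corresponding file must be regenerated, which is exactly the maintenance problem discussed in the companion article "Materialized Hypertext View Maintenance."
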
Materialized View Selection for Data Warehouse Design

Alkis Simitsis
National Technical University of Athens, Greece
BACKGROUND

Figure 1 shows a simplified DW architecture. The DW contains a set of materialized views. The users address their queries to the DW. The materialized views are used partially or completely for the evaluation of the user queries. This is achieved through partial or complete rewritings of the queries using the materialized views.

Figure 1. A simplified DW architecture
[Figure image omitted: users send queries to the DW and receive answers evaluated over the materialized views.]

MAIN THRUST

When selecting views to materialize in a DW, one attempts to satisfy one or more design goals. A design goal is either the minimization of a cost function or a constraint. A constraint can be classified as user-oriented or system-oriented. Attempting to satisfy the constraints can result in no feasible solution to the view selection problem. The design goals determine the design of the algorithms that select views to materialize from the space of alternative view sets.

Minimization of Cost Functions
(1998), Liang, et al. (1999), and Theodoratos (2000) aim at making the DW self-maintainable.

Answering the Input Queries Using Exclusively the Materialized Views: This constraint requires the existence of a complete rewriting of the input queries, initially defined over the source relations, over the materialized views. Clearly, if this constraint is satisfied, the remote data sources need not be contacted for evaluating queries. This way, expensive data transmissions from the DW to the sources, and conversely, are avoided. Some approaches assume a centralized DW environment, where the source relations are present at the DW site. In this case, the answerability of the queries from the materialized views is trivially guaranteed by the presence of the source relations. The answerability of the queries can also be trivially guaranteed by appropriately defining select-project views on the source relations and replicating them at the DW. This approach also assures the self-maintainability of the materialized views. Theodoratos and Sellis (1999) do not assume a centralized DW environment or replication of part of the source relations at the DW, and explicitly impose this constraint in selecting views for materialization.

User-Oriented Constraints

User-oriented constraints express requirements of the users.

Answer Data Currency Constraints: An answer data currency constraint sets an upper bound on the time elapsed between the point in time the answer to a query is returned to the user and the point in time the most recent changes of a source relation that are taken into account in the computation of this answer are read (this time reflects the currency of answer data). Currency constraints are associated with every source relation in the definition of every input query. The upper bound in an answer data currency constraint (minimal currency required) is set by the users according to their needs. This formalization of data currency constraints allows stating currency constraints at the query level and not at the materialized view level, as is the case in some approaches. Therefore, currency constraints can be exploited by DW view selection algorithms, where the queries are the input, while the materialized views are the output (and, therefore, are not available). Furthermore, it allows stating different currency constraints for different relations in the same query.

Query Response Time Constraints: A query response time constraint states that the time needed to evaluate an input query using the views materialized at the DW should not exceed a given bound. The bound for each query is given by the users and reflects their needs for fast answers. For some queries, fast answers may be required, while for others, the response time may not be predominant.

Search Space and Algorithms

Solving the problem of selecting views for materialization involves addressing two main tasks: (a) generating a search space of alternative view sets for materialization, and (b) designing optimization algorithms that select an optimal or near-optimal view set from the search space.

A DW is usually organized according to a star schema, where a fact table is surrounded by a number of dimension tables. The dimension tables define hierarchies of aggregation levels. Typical OLAP queries involve star joins (key/foreign key joins between the fact table and the dimension tables) and grouping and aggregation at different levels of granularity. For queries of this type, the search space can be formed in an elegant way as a multidimensional lattice (Baralis et al., 1997; Harinarayan et al., 1996).

Gupta (1997) states that the view selection problem is NP-hard. Most of the approaches to the view selection problem avoid exhaustive algorithms. The adopted algorithms fall into two categories: deterministic and randomized. In the first category belong greedy algorithms with performance guarantee (Gupta, 1997; Harinarayan et al., 1996), 0-1 integer programming algorithms (Yang et al., 1997), A* algorithms (Gupta & Mumick, 1999), and various other heuristic algorithms (Baralis et al., 1997; Ross et al., 1996; Shukla et al., 1998; Theodoratos & Sellis, 1999). In the second category belong simulated annealing algorithms (Kalnis et al., 2002; Theodoratos et al., 2001), iterative improvement algorithms (Kalnis et al., 2002), and genetic algorithms (Lee & Hammer, 2001). Both categories of algorithms exploit the particularities of the specific view selection problem and the restrictions of the class of queries considered.

FUTURE TRENDS

The view selection problem has been addressed for different types of queries. Research has focused mainly on queries over star schemas. Newer applications (e.g., XML or Web-based applications) require different types of queries. This topic has only been partially investigated (Golfarelli et al., 2001; Labrinidis & Roussopoulos, 2000).

A relevant issue that needs further investigation is the construction of the search space of alternative view sets for materialization.
Even though the construction of such a search space for grouping and aggregation queries is straightforward (Harinarayan et al., 1996), it becomes an intricate problem for general queries (Golfarelli & Rizzi, 2001).

Indexes can be seen as special types of views. Gupta et al. (1997) show that a two-step process that divides the space available for materialization and picks views first and then indexes can perform very poorly. More work needs to be done on the problem of automating the selection of views and indexes together.

DWs are dynamic entities that evolve continuously over time. As time passes, new queries need to be satisfied. A dynamic version of the view selection problem chooses additional views for materialization and avoids the design of the DW from scratch (Theodoratos & Sellis, 2000). A system that dynamically materializes views in the DW at multiple levels of granularity in order to match the workload (Kotidis & Roussopoulos, 2001) is a current trend in the design of a DW.

REFERENCES

Baralis, E., Paraboschi, S., & Teniente, E. (1997). Materialized views selection in a multidimensional database. International Conference on Very Large Data Bases (VLDB), Athens, Greece.

Golfarelli, M., & Rizzi, S. (2000). View materialization for nested GPSJ queries. International Workshop on Design and Management of Data Warehouses (DMDW), Stockholm, Sweden.

Golfarelli, M., Rizzi, S., & Vrdoljak, B. (2001). Data warehouse design from XML sources. ACM International Workshop on Data Warehousing and OLAP (DOLAP), Atlanta, Georgia.

Gupta, H. (1997). Selection of views to materialize in a data warehouse. International Conference on Database Theory (ICDT), Delphi, Greece.

Gupta, H., Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1997). Index selection for OLAP. IEEE International Conference on Data Engineering, Birmingham, UK.
Ross, K., Srivastava, D., & Sudarshan, S. (1996). Materialized view maintenance and integrity constraint checking: Trading space for time. ACM SIGMOD International Conference on Management of Data (SIGMOD), Montreal, Canada.

Shukla, A., Deshpande, P., & Naughton, J. (1998). Materialized view selection for multidimensional datasets. International Conference on Very Large Data Bases (VLDB), New York.

Theodoratos, D. (2000). Complex view selection for data warehouse self-maintainability. International Conference on Cooperative Information Systems (CoopIS), Eilat, Israel.

Theodoratos, D., Dalamagas, T., Simitsis, A., & Stavropoulos, M. (2001). A randomized approach for the incremental design of an evolving data warehouse. International Conference on Conceptual Modeling (ER), Yokohama, Japan.

Theodoratos, D., & Sellis, T. (1999). Designing data warehouses. Data & Knowledge Engineering, 31(3), 279-301.

Theodoratos, D., & Sellis, T. (2000). Incremental design of a data warehouse. Journal of Intelligent Information Systems (JIIS), 15(1), 7-27.

Yang, J., Karlapalem, K., & Li, Q. (1997). Algorithms for materialized view design in data warehousing environment. International Conference on Very Large Data Bases (VLDB), Athens, Greece.

KEY TERMS

Auxiliary View: A view materialized in the DW exclusively for reducing the view maintenance cost.

Materialized View: A view whose answer is stored in the DW.

Operational Cost: A linear combination of the query evaluation and view maintenance cost.

Query Evaluation Cost: The sum of the cost of evaluating each input query rewritten over the materialized views.

Self-Maintainable View: A materialized view that can be maintained, for any instance of the source relations, and for all source relation changes, using only these changes, the view definition, and the view materialization.

View: A named query.

View Maintenance Cost: The sum of the cost of propagating each source relation change to the materialized views.
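The greedy algorithms cited in the article (in the spirit of Harinarayan et al., 1996) can be sketched over a toy aggregation lattice. Everything below is a simplified stand-in: the lattice, the view sizes, and the benefit definition (total reduction in query-evaluation cost, rather than the paper's benefit per unit of space) are invented for illustration.

```python
# Hedged sketch of greedy view selection over a small aggregation lattice.
# ancestors[v] = views (including v itself) from which v can be answered.
# Sizes and lattice structure are invented for illustration.
sizes = {"top": 100, "ab": 50, "ac": 75, "a": 20, "none": 1}
ancestors = {
    "top": {"top"},
    "ab": {"ab", "top"},
    "ac": {"ac", "top"},
    "a": {"a", "ab", "ac", "top"},
    "none": {"none", "a", "ab", "ac", "top"},
}

def total_cost(materialized):
    # Each query (= view) is answered from its cheapest materialized ancestor.
    return sum(min(sizes[m] for m in ancestors[v] & materialized)
               for v in sizes)

def greedy_select(k):
    """Greedily add the k views with the largest cost reduction."""
    chosen = {"top"}   # the top (base) view must always be materialized
    for _ in range(k):
        best = max(set(sizes) - chosen,
                   key=lambda v: total_cost(chosen) - total_cost(chosen | {v}))
        chosen.add(best)
    return chosen

selected = greedy_select(2)
```

With only the top view materialized, every query costs 100; the greedy step repeatedly picks the view whose materialization most reduces the summed query-evaluation cost, which is the deterministic, performance-guaranteed strategy the article contrasts with randomized approaches.
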
Methods for Choosing Clusters in Phylogenetic Trees
the sum of the diagonal elements. The GTR model is fully specified by five relative rate parameters (a, b, c, d, e, with the sixth rate f serving as the reference) and three relative frequency parameters (π_A, π_C, and π_G, with π_T determined via π_A + π_C + π_G + π_T = 1) in the rate matrix Q defined as

          ⎛  ·      aπ_C    bπ_G    cπ_T ⎞
    Q = μ ⎜ aπ_A     ·      dπ_G    eπ_T ⎟
          ⎜ bπ_A    dπ_C     ·      fπ_T ⎟
          ⎝ cπ_A    eπ_C    fπ_G     ·   ⎠ ,

where μ is the overall substitution rate and each diagonal entry (·) is set so that its row sums to zero. The rate matrix Q is related to the substitution probability matrix P via P(t) = e^{Qt}, where P_ij(t) is the probability of a change from nucleotide i to j in time t, and P_ij(t) satisfies the time-reversibility and stationarity criteria: π_i P_ij = π_j P_ji. Commonly used models such as Jukes-Cantor (Swofford et al., 1996) assume that a = b = c = d = e = 1 and π_A = π_C = π_G = π_T = 0.25. For the Jukes-Cantor model, it follows that P_ij(t) = 0.25 + 0.75e^{-μt} and that the distance between taxa x and y is -(3/4) log(1 - (4/3)D), where D is the percentage of sites where x and y differ (regardless of what kind of difference, because all relative substitution rates and base frequencies are assumed to be equal). Important generalizations include allowing unequal relative frequencies and/or rate parameters, and allowing the rate to vary across DNA sites. Allowing μ to vary across sites via a gamma-distributed rate parameter is one way to model the fact that sites often have different observed rates. If the rate is assumed to follow a gamma distribution with shape parameter α, then these gamma distances can be obtained from the original distances by replacing the function log(x) with α(1 - x^{-1/α}) in the d_xy = -trace{Π log(Π^{-1}F_xy)} formula (Swofford et al., 1996). Generally, this rate heterogeneity, and the fact that multiple substitutions at the same site tend to saturate any distance measure, make it a practical challenge to find a metric such that the distance between any two taxa increases linearly with time.
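The Jukes-Cantor and gamma-corrected distances quoted above can be computed directly. This sketch assumes only the formulas in the text (D is the fraction of differing sites, α the gamma shape parameter); the numeric inputs are invented for illustration.

```python
# Hedged sketch of the distance formulas quoted in the text.
import math

def jc_distance(D):
    """Jukes-Cantor distance: -(3/4) log(1 - (4/3) D)."""
    return -0.75 * math.log(1 - (4.0 / 3.0) * D)

def jc_gamma_distance(D, alpha):
    """Gamma-rate variant: log(x) replaced by alpha * (1 - x**(-1/alpha))."""
    x = 1 - (4.0 / 3.0) * D
    return -0.75 * alpha * (1 - x ** (-1.0 / alpha))

# For small D, the corrected distance is close to, but larger than, D itself,
# reflecting unobserved multiple substitutions at the same site.
d = jc_distance(0.10)
```

As α grows, α(1 − x^(−1/α)) approaches log(x), so the gamma-corrected distance converges to the plain Jukes-Cantor distance, matching the text's description of the substitution.
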
Figure 1. HIV data (env region). (Top) Hierarchical clustering; (Middle) Principal coordinate plot; (Bottom) Results of model-based clustering under six different assumptions regarding volume (V), shape (S), and orientation (O). E denotes equal among clusters and V denotes varying among clusters, for V, S, and O respectively. For example, case 6 has varying V, equal S, and varying O among clusters. Models 1 and 2 each assume a spherical shape (I denotes the identity matrix, so S and O are equal among clusters, while V is equal for case 1 and varying for case 2). Note that the B and D subtypes tend to be merged.
[Figure 1 image omitted: the top panel shows the dendrogram with subtype labels A-G, the middle panel plots the first two principal coordinates (x1, x2), and the bottom panel plots BIC against the number of clusters (1-20) for the six models (EI, VI, EEE, VVV, EEV, VEV).]
k. Banfield and Raftery (1993) developed a model-based framework by parameterizing the covariance matrix in terms of its eigenvalue decomposition in the form Σ_k = λ_k D_k A_k D_k^T, where D_k is the orthonormal matrix of eigenvectors, A_k is a diagonal matrix with elements proportional to the eigenvalues of Σ_k, and λ_k is a scalar, which under one convention is the largest eigenvalue of Σ_k. The orientation of cluster k is determined by D_k, A_k determines the shape, while λ_k specifies the volume. Each of the volume, shape, and orientation (VSO) can be variable among groups, or fixed at one value for all groups. One advantage of the mixture-model approach is that it allows the use of approximate Bayes factors to compare models, giving a means of selecting the model parameterization (which of V, S, and O are variable among groups) and the number of clusters (Figure 1, bottom). The Bayes factor is the posterior odds for one model against another model, assuming that neither model is favored a priori (uniform prior). When the EM algorithm (expectation-maximization; Dempster, Laird, & Rubin, 1977) is used to find the maximum mixture likelihood, the most reliable approximation to twice the log Bayes factor (called the Bayesian Information Criterion, BIC) is BIC = 2 l_M(x, θ̂) − m_M log(n), where l_M(x, θ̂) is the maximized mixture loglikelihood for the model and m_M is the number of independent parameters to be estimated in the model. A convention for calibrating BIC differences is that differences less than 2 correspond to weak evidence, differences between 2 and 6 are positive evidence, differences between 6 and 10 are strong evidence, and differences of more than 10 are very strong evidence.

The two clustering methods ML + bootstrap and model-based clustering have been compared on the same data sets (Burr, Gattiker, & LaBerge, 2002b), and the differences were small. For example, ML + bootstrap suggested 7 clusters for 95 HIV env sequences and 6 clusters in HIV gag (p17) sequences, while model-based clustering suggests 6 for env (it tends to merge the so-called B and D subtypes; see Figure 1) and 6 for gag. Note from Figure 1 (top) that only 7 of the 10 recognized subtypes were included among the 95 sequences. However, it is likely that case-specific features will determine the extent of difference between the methods. Also, model-based clustering provides a more natural and automatic way to identify candidate groups. Once these candidate groups have been identified, then either method is reasonable for assigning confidence measures to the resulting cluster assignments. The two clustering methods ML + bootstrap and an MCMC-based estimate of the posterior probability on the space of branching orders have also been compared on the same data sets (Burr, Doak, Gattiker, & Stanbro, 2002a) with respect to the confidence that each method assigns to particular groups that were chosen in advance of the analysis. The plot of the MCMC-based estimate versus the ML + bootstrap based estimate was consistent with the hypothesis that both methods assign (on average) the same probability to a chosen group, unless the number of DNA sites was very small, in which case there can be a non-negligible bias in the ML method, resulting in bias in the ML + bootstrap results. Because the group was chosen in advance, the method for choosing groups was not fully tested, so there is a need for additional research.

FUTURE TRENDS

The recognized subtypes of HIV-1 were identified using informal observations of tree shapes followed by ML + bootstrap (Korber & Myers, 1992). Although identifying such groups is common in phylogenetic trees, there have been only a few attempts to formally evaluate clustering methods for the underlying genetic data. ML + bootstrap remains the standard way to assign confidence to hypothesized groups, and the group structure is usually hypothesized either by using auxiliary information (geographic, temporal, or other) or by visual inspection of trees (which often display distinct groups). A more thorough evaluation could be performed using realistic simulated data with known branching orders. Having known branching order is almost the same as having known groups; however, choosing the number of groups is likely to involve arbitrary decisions even when the true branching order is known.

CONCLUSION

One reason to cluster taxa is that evolutionary processes can sometimes be revealed once the group structure is recognized. Another reason is that phylogenetic trees are complicated objects that can often be effectively summarized by identifying the major groups together with a description of the typical between- and within-group variation. Also, if we correctly choose the number of clusters present in the tree for a large number of taxa (100 or more), then we can use these groups to rapidly construct a good approximation to the true tree. One strategy for doing this is to repeatedly apply model-based clustering to relatively small numbers of taxa (100 or fewer) and check for consistent indications of the number of groups. We described two other clustering strategies (ML + bootstrap, and MCMC-based) and note that two studies have made limited comparisons of these three methods on the same genetic data.
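The BIC approximation and calibration convention described earlier can be sketched directly. The log-likelihood values and parameter counts below are invented placeholders, not results from the HIV analysis; only the formula BIC = 2 l_M − m_M log(n) and the evidence thresholds come from the text.

```python
# Hedged sketch of the BIC formula and calibration convention quoted above.
import math

def bic(loglik, n_params, n_obs):
    """BIC = 2 * l_M(x, theta) - m_M * log(n): an approximation to
    twice the log Bayes factor (larger is better under this convention)."""
    return 2.0 * loglik - n_params * math.log(n_obs)

def evidence(bic_a, bic_b):
    """Calibrate the BIC difference in favor of model A over model B."""
    diff = bic_a - bic_b
    if diff < 2:
        return "weak"
    if diff <= 6:
        return "positive"
    if diff <= 10:
        return "strong"
    return "very strong"

# Invented example: two candidate mixture models fit to n = 95 sequences.
b5 = bic(loglik=-610.0, n_params=40, n_obs=95)
b6 = bic(loglik=-580.0, n_params=48, n_obs=95)
verdict = evidence(b6, b5)
```

The extra eight parameters of the second model cost 8 × log(95) ≈ 36.4 BIC units, so its higher likelihood must outweigh that penalty, which is how the approach trades fit against model complexity when choosing the number of clusters.
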
Phylogenetic Tree: A representation of the branching grated (for interval-valued random variables) to give the
order and branch lengths of a collection of taxa, which, in probability of observing values in a specified set. M
its most common display form, looks like the branches of
a tree. Substitution Probability Matrix: A matrix whose i, j
entry is the probability of substituting DNA character j (C,
Probability Density Function: A function that can be G, T or A) for character i over a specified time period.
summed (for discrete-valued random variables) or inte-
Microarray Data Mining

Based on the concept of simultaneously studying the expression of a large number of genes, a DNA microarray is a chip on which numerous probes are placed for hybridization with a tissue sample. Biological complexity encoded by a deluge of microarray data is being translated into all sorts of computational, statistical, or mathematical problems bearing on biological issues ranging from genetic control to signal transduction to metabolism. Microarray data mining aims to identify biologically significant genes and find patterns that reveal molecular network dynamics for the reconstruction of genetic regulatory networks and pertinent metabolic pathways.

BACKGROUND

The idea of microarray-based assays emerged as early as the 1980s (Ekins & Chu, 1999). In that period, a computer-based scanning and image-processing system was developed to quantify the expression level in tissue samples of cloned complementary DNA sequences spotted in a two-dimensional array on strips of nitrocellulose, which could be considered the first prototype of the DNA microarray. Microarray-based gene expression technology was actively pursued in the mid-1990s (Schena, Heller, & Theriault, 1998) and has seen rapid growth since then.

Microarray technology has catalyzed the development of the field known as functional genomics by offering high-throughput analysis of the functions of genes on a genomic scale (Schena et al., 1998). There are many important applications of this technology, including elucidation of the genetic basis for health and disease, discovery of biomarkers of therapeutic response, and identification and validation of new molecular targets and modes of action. The decoding of the human genome sequence, together with recent advances in biochip technology, has ushered in genomics-based medical therapeutics, diagnostics, and prognostics.

The laboratory information management system (LIMS) keeps track of and manages data produced from each step in a microarray experiment, such as hybridization, scanning, and image processing. As microarray experiments generate a vast amount of data, the efficient storage and use of the data require a database management system. Although some databases are designed to be data archives only, other databases such as ArrayDB (Ermolaeva, Rastogi, & Pruitt, 1998) and Argus (Comander, Weber, Gimbrone, & Garcia-Cardena, 2001) allow information storage, query, and retrieval, as well as data processing, analysis, and visualization. These databases also provide a means to link microarray data to other bioinformatics databases (e.g., NCBI Entrez systems, UniGene, KEGG, and OMIM). The integration with external information is instrumental to the interpretation of patterns recognized in the gene-expression data. To facilitate the development of microarray databases and analysis tools, there is a need to establish a standard for recording and reporting microarray gene expression data. The MIAME (Minimum Information About a Microarray Experiment) standard includes a description of experimental design, array design, samples, hybridization, measurements, and normalization controls (Brazma, Hingamp, & Quackenbush, 2001).

Data Mining Objectives

Data mining addresses the question of how to discover a gold mine from historical or experimental data, particularly in a large database. The goal of data mining and knowledge discovery algorithms is to extract implicit, previously unknown, nontrivial patterns, regularities, or knowledge from large data sets that can be used to improve strategic planning and decision making. The discovered knowledge capturing the relations among the variables of interest can be formulated as a function for making predictions and classifications or as a model for understanding the problem in a given domain. In the context of microarray data, the objectives are identifying significant genes and finding gene expression patterns associated with known or unknown categories. Microarray data mining is an important topic in bioinformatics, dealing with information processing on biological data, particularly genomic data.

Practical Factors Prior to Data Mining

Some practical factors should be taken into account prior to microarray data mining. First of all, microarray data produced by different platforms vary in their formats and may need to be processed differently. For example, one type of microarray with cDNA as probes produces ratio data from two channel outputs, whereas another type of microarray using oligonucleotide probes generates nonratio data from a single channel. Not only may different platforms pick up gene expression activity with different levels of sensitivity and specificity, but different data processing techniques may also be required for different data formats.

Normalizing data to allow direct array-to-array comparison is a critical issue in array data analysis, because several variables in microarray experiments can affect measured mRNA levels (Schadt, Li, Ellis, & Wong, 2001; Yang, Dudoit, & Luu, 2002). Variations may occur during sample handling, slide preparation, hybridization, or image analysis. Normalization is essential for correct microarray data interpretation. In simple approaches, data can be normalized by dividing or subtracting expression values by a representative value (e.g., the mean or median in an array) or by taking a linear transformation to zero mean and unit variance. As an example, data normalization in the case of cDNA arrays may proceed as follows: the local background intensity is subtracted from the value of each spot on the array; the two channels are normalized against the median values on that array; and the Cy5/Cy3 fluorescence ratios and log10-transformed ratios are calculated from the normalized values. In addition, genes that do not change significantly can be removed through a filter in a process called data filtration.

Differential Gene Expression

Identifying genes differentially expressed across two conditions is one of the most important issues in microarray data mining. In cancer research, for example, we wish to understand what genes are abnormally expressed in a certain type of cancer, so we conduct a microarray experiment and collect the gene expression profiles of normal and cancer tissues, respectively, as the control and test samples. The information regarding differential expression is derived from comparing the test against the control sample.

To determine which genes are differentially expressed, a common approach is based on fold change: we simply decide a fold-change threshold (e.g., 2×) and select genes associated with changes greater than that threshold. If a cDNA microarray is used, the ratio of the test over control expression in a single array can be converted easily to fold change in both cases of up-regulation (induction) and down-regulation (suppression). For oligonucleotide chips, fold change is computed from two arrays, one for the test and the other for the control sample. In this case, if multiple samples in each condition are available, the statistical t-test or Wilcoxon tests can be applied, but the catch is that the Bonferroni adjustment to the level of significance on hypothesis testing would be necessary to account for the presence of multiple genes. The t-test determines the difference in mean expression values between two conditions and identifies genes with significant differences. The nonparametric Wilcoxon test is a good alternative in the case of non-Gaussian data distributions. SAM (Significance Analysis of Microarrays) (Tusher, Tibshirani, & Chu, 2001) is a state-of-the-art technique based on balanced perturbation of repeated measurements and minimization of the false discovery rate.

Coordinated Gene Expression

Identifying genes that are co-expressed across multiple conditions is an issue with significant implications in microarray data mining. For example, given gene expression profiles measured over time, we are interested in knowing what genes are functionally related. The answer to this question also leads us to deduce the functions of unknown genes from their correlation with genes of known functions. Equally important is the problem of organizing samples based on their gene expression profiles so that distinct phenotypes or disease processes may be recognized or discovered.

The solutions to both problems are based on so-called cluster analysis, which is meant to group objects into clusters according to their similarity. For example, genes are clustered by their expression values across multiple conditions; samples are clustered by their expression values across genes. The central issue is how to measure the similarity between objects. Two popular measures are the Euclidean distance and Pearson's correlation coefficient. Clustering algorithms can be divided into hierarchical and nonhierarchical (partitional). Hierarchical clustering is either agglomerative (starting with singletons and progressively merging) or divisive (starting with a single cluster and progressively breaking). Hierarchical agglomerative clustering is most commonly used in the
cluster analysis of microarray data. In this method, the two most similar clusters are merged at each stage until all the objects are included in a single cluster. The result is a dendrogram (a hierarchical tree) that encodes the relationships among objects by showing how clusters merge at each stage. Partitional clustering algorithms are best exemplified by k-means and self-organization maps (SOMs).

Gene Selection for Discriminant Analysis

Taking an action based on the category of the pattern recognized in microarray gene expression data is an increasingly important approach to medical diagnosis and management (Furey, Cristianini, & Duffy, 2000; Golub, Slonim, & Tamayo, 1999; Khan, Wei, & Ringner, 2001). A class predictor derived on this basis can automatically discover the distinction between different classes of samples, independent of previous biological knowledge (Golub et al., 1999). Gene expression information appears to be a more reliable indicator than phenotypic information for categorizing the underlying causes of diseases. The microarray approach has offered hope for clinicians to arrive at more objective and accurate cancer diagnoses and hence choose more appropriate forms of treatment (Tibshirani, Hastie, Narasimhan, & Chu, 2002).

The central question is how to construct a reliable classifier that predicts the class of a sample on the basis of its gene expression profile. This is a pattern recognition problem, and the type of analysis involved is referred to as discriminant analysis. In practice, given a limited number of samples, correct discriminant analysis must rely on the use of an effective gene selection technique to reduce the gene number and, hence, the data dimensionality. The objective of gene selection is to select genes that most contribute to classification as well as provide biological insight. Approaches to gene selection range from statistical analysis (Golub et al., 1999) and a Bayesian model (Lee, Sha, Dougherty, Vannucci, & Mallick, 2003) to Fisher's linear discriminant analysis (Xiong, Li, Zhao, Jin, & Boerwinkle, 2001) and support vector machines (SVMs) (Guyon, Weston, Barnhill, & Vapnik, 2002). This is one of the most challenging areas in microarray data mining. Despite good progress, the reliability of selected genes should be further improved. Table 1 summarizes some of the most important microarray data-mining problems and their solutions.

Microarray Data-Mining Applications

Microarray technology permits large-scale analysis of gene functions from a genomic perspective and has brought about important changes in how we conduct basic research and practice clinical medicine. There has been an increasing number of applications of this technology. Here, the role of data mining in discovering biological and clinical knowledge from microarray data is examined.

Consider that only a minority of all the yeast (Saccharomyces cerevisiae) open reading frames in the genome sequence could be functionally annotated on the
Table 1. Important microarray data-mining problems and their solutions

Problem 1: To identify genes differentially expressed across two conditions.
Solutions:
- Fold change
- t-test or Wilcoxon rank sum test (with Bonferroni's correction)
- Significance analysis of microarrays

Problem 2: To identify genes that are co-expressed across multiple conditions.
Solutions:
- Hierarchical clustering
- Self-organization maps
- k-means clustering

Problem 3: To select genes for discriminant analysis, given microarray gene expression data of two or more classes.
Solutions:
- Neighborhood analysis
- Support vector machines
- Principal component analysis
- Bayesian analysis
- Fisher's linear discriminant analysis
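Two of the Problem 1 solutions above (fold change, and a per-gene t-test with Bonferroni's correction) can be sketched in a few lines. This is an illustrative sketch only: the expression matrix, the spiked genes, and the thresholds are synthetic, and SAM is omitted.

```python
import numpy as np
from scipy import stats

def differential_genes(control, test, fold_threshold=2.0, alpha=0.05):
    """Flag differentially expressed genes by fold change and by t-test
    with Bonferroni correction.  `control` and `test` are arrays of shape
    (n_genes, n_replicates) holding expression intensities."""
    fold = test.mean(axis=1) / control.mean(axis=1)
    # Fold-change rule: up- or down-regulated beyond the threshold.
    by_fold = (fold >= fold_threshold) | (fold <= 1.0 / fold_threshold)
    # Per-gene two-sample t-test, Bonferroni-adjusted for the number of genes.
    t, p = stats.ttest_ind(test, control, axis=1)
    by_ttest = p < alpha / len(p)
    return by_fold, by_ttest

rng = np.random.default_rng(0)
n_genes, n_reps = 1000, 6
control = rng.normal(100, 5, (n_genes, n_reps))
test = control + rng.normal(0, 5, (n_genes, n_reps))
test[:10] *= 3.0  # spike in ten genuinely up-regulated genes
by_fold, by_ttest = differential_genes(control, test)
print(by_fold[:10].all(), by_ttest[:10].all())
```

The Bonferroni division by the number of genes is exactly the multiple-testing adjustment the text calls for; without it, a 0.05 cutoff over a thousand genes would flag dozens of false positives by chance alone.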
basis of sequence information alone (Zweiger, 1999), although microarray results showed that nearly 90% of all yeast mRNAs (messenger RNAs) are observed to be present (Wodicka, Dong, Mittmann, Ho, & Lockhart, 1997). Functional annotation of a newly discovered gene based on sequence comparison with other known gene sequences is sometimes misleading. Microarray-based genome-wide gene expression analysis has made it possible to deduce the functions of novel or poorly characterized genes from co-expression with already known genes (Eisen, Spellman, Brown, & Botstein, 1998). Microarray technology is a valuable tool for measuring whole-genome mRNA and enables system-level exploration of transcriptional regulatory networks (Cho, Campbell, & Winzeler, 1998; DeRisi, Iyer, & Brown, 1997; Laub, McAdams, Feldblyum, Fraser, & Shapiro, 2000; Tavazoie, Hughes, Campbell, Cho, & Church, 1999). Hierarchical clustering can help us recognize genes whose cis-regulatory elements are bound by the same proteins (transcription factors) in vivo. Such a set of coregulated genes is known as a regulon. Statistical characterization of known regulons is used to derive criteria for inferring new regulatory elements. Identifying regulatory elements and associated transcription factors is fundamental to building a global gene regulatory network, which is essential for understanding genetic control and biology in living cells. Thus, determining gene functions and gene networks from microarray data is an important application of data mining.

The limitation of the morphology-based approach to cancer classification has led to molecular classification. Techniques such as immunohistochemistry and RT-PCR are used to detect cancer-specific molecular markers, but pathognomonic molecular markers are unfortunately unavailable for most solid tumors (Ramaswamy, Tamayo, & Rifkin, 2001). Furthermore, molecular markers do not guarantee a definitive diagnosis, owing to possible failure of detection or the presence of marker variants. The approach of constructing a classifier based on gene expression profiles has gained increasing interest, following the success in demonstrating that microarray data differentiated between two types of leukemia (Golub et al., 1999). In this application, the two data-mining problems are to identify gene expression patterns or signatures associated with each type of leukemia and to discover subtypes within each. The first problem is dealt with by gene selection, and the second by cluster analysis. Table 2 illustrates some applications of microarray data mining.

Table 2. Applications of microarray data mining

- Identified functionally related genes and their genetic control upon the metabolic shift from fermentation to respiration (DeRisi et al., 1997).
- Explored co-expressed or coregulated gene families by cluster analysis (Eisen et al., 1998).
- Determined genetic network architecture based on coordinated gene expression analysis and promoter motif analysis (Tavazoie et al., 1999).
- Differentiated acute myeloid leukemia from acute lymphoblastic leukemia by selecting genes and constructing a classifier for discriminant analysis (Golub et al., 1999).
- Selected genes differentially expressed in response to ionizing radiation based on significance analysis (Tusher et al., 2001).

Recent Work:
- Analyzed gene expression in the Arabidopsis genome (Yamada, Lim, & Dale, 2003).
- Discovered conserved genetic modules (Stuart, Segal, Koller, & Kim, 2003).
- Elucidated functional properties of genetic networks and identified regulatory genes and their target genes (Gardner, di Bernardo, Lorenz, & Collins, 2003).
- Identified genes associated with Alzheimer's disease (Roy Walker, Smith, & Liu, 2004).

FUTURE TRENDS

The future challenge is to realize biological networks that provide qualitative and quantitative understanding of molecular logic and dynamics. To meet this challenge, recent research has begun to focus on leveraging prior biological knowledge and integrating it with biological analysis in quest of biological truth. In addition, there is increasing interest in applying statistical bootstrapping and data permutation techniques to mining microarray data for appraising the reliability of learned patterns.

CONCLUSION

Microarray technology has rapidly emerged as a powerful tool for biological research and clinical investigation. However, the large quantity and complex nature of data produced in microarray experiments often plague researchers who are interested in using this technology. Microarray data mining uses specific data processing and normalization strategies and has its own objectives, requiring effective computational algorithms and statistical techniques to arrive at valid results. Microarray technology has been perceived as a revolutionary technology in biomedicine, but the hardware does not pay off unless backed up with sound data-mining software.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation under Grant IIS-0221954.

REFERENCES

Brazma, A., Hingamp, P., & Quackenbush, J. (2001). Minimum information about a microarray experiment (MIAME) toward standards for microarray data. Nat Genet, 29(4), 365-371.

Cho, R. J., Campbell, M. J., & Winzeler, E. A. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 2(1), 65-73.

Comander, J., Weber, G. M., Gimbrone, M. A., Jr., & Garcia-Cardena (2001). Argus: A new database system for Web-based analysis of multiple microarray data sets. Genome Res, 11(9), 1603-1610.

DeRisi, J. L., Iyer, V. R., & Brown, P. O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278(5338), 680-686.

Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Acad Sci, USA, 95(25), 14863-14868.

Ekins, R., & Chu, F. W. (1999). Microarrays: Their origins and applications. Trends Biotechnol, 17(6), 217-218.

Ermolaeva, O., Rastogi, M., & Pruitt, K. D. (1998). Data management and analysis for gene expression arrays. Nat Genet, 20(1), 19-23.

Furey, T. S., Cristianini, N., & Duffy, N. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), 906-914.

Gardner, T. S., di Bernardo, D., Lorenz, D., & Collins (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301(5629), 102-105.

Golub, T. R., Slonim, D. K., & Tamayo, P. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.

Guyon, I., Weston, J., Barnhill, S., & Vapnik (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1/3), 389-422.

Khan, J., Wei, J. S., & Ringner, M. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, 7(6), 673-679.

Laub, M. T., McAdams, H. H., Feldblyum, T., Fraser & Shapiro (2000). Global analysis of the genetic network controlling a bacterial cell cycle. Science, 290(5499), 2144-2148.

Lee, K. E., Sha, N., Dougherty, E. R., Vannucci & Mallick (2003). Gene selection: A Bayesian variable selection approach. Bioinformatics, 19(1), 90-97.

Ramaswamy, S., Tamayo, P., & Rifkin, R. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Acad Sci, USA, 98(26), 15149-15154.

Roy Walker, P., Smith, B., & Liu, Q. Y. (2004). Data mining of gene expression changes in Alzheimer brain. Artif Intell Med, 31(2), 137-154.

Schadt, E. E., Li, C., Ellis, B., & Wong (2001). Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. Journal of Cellular Biochemistry, (Suppl. 37), 120-125.

Schena, M., Heller, R. A., & Theriault, T. P. (1998). Microarrays: Biotechnology's discovery platform for functional genomics. Trends Biotechnol, 16(7), 301-306.

Stuart, J. M., Segal, E., Koller, D., & Kim (2003). A gene-coexpression network for global discovery of conserved genetic modules. Science, 302(5643), 249-255.

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho & Church (1999). Systematic determination of genetic network architecture. Nat Genet, 22(3), 281-285.

Tibshirani, R., Hastie, T., Narasimhan, B., & Chu (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Acad Sci, USA, 99(10), 6567-6572.

Tusher, V. G., Tibshirani, R., & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Acad Sci, USA, 98(9), 5116-5121.

Wodicka, L., Dong, H., Mittmann, M., Ho & Lockhart (1997). Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat Biotechnol, 15(13), 1359-1367.
Xiong, M., Li, W., Zhao, J., Jin & Boerwinkle (2001). Feature (gene) selection in gene expression-based tumor classification. Mol Genet Metab, 73(3), 239-247.

Yamada, K., Lim, J., & Dale, J. M. (2003). Empirical analysis of transcriptional activity in the Arabidopsis genome. Science, 302(5646), 842-846.

Yang, Y. H., Dudoit, S., & Luu, P. (2002). Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res, 30(4), e15.

Zweiger, G. (1999). Knowledge discovery in gene-expression-microarray data: Mining the information output of the genome. Trends Biotechnol, 17(11), 429-436.

KEY TERMS

Bioinformatics: All aspects of information processing on biological data, in particular genomic data. The rise of bioinformatics is driven by the genomic projects.

Cis-Regulatory Element: The genetic region that affects the activity of a gene on the same DNA molecule.

Clustering: The process of grouping objects according to their similarity. This is an important approach to microarray data mining.

Functional Genomics: The study of gene functions on a genomic scale, especially based on microarrays.

Gene Expression: Production of mRNA from DNA (a process known as transcription) and production of protein from mRNA (a process known as translation). Microarrays are used to measure the level of gene expression in a tissue or cell.

Genomic Medicine: Integration of genomic and clinical data for medical decision making.

Microarray: A chip on which numerous probes are placed for hybridization with a tissue sample to analyze its gene expression.

Postgenome Era: The time after the complete human genome sequence was decoded.

Transcription Factor: A protein that binds to the cis-element of a gene and affects its expression.
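To make the Clustering key term concrete, the article's preferred method (hierarchical agglomerative clustering of expression profiles, with one minus Pearson's correlation as the distance) can be sketched with SciPy on a synthetic expression matrix; the data and group structure here are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy expression matrix: rows are genes, columns are conditions.
rng = np.random.default_rng(0)
base_a = rng.normal(0, 1, 8)  # shared profile for group A
base_b = rng.normal(0, 1, 8)  # shared profile for group B
genes = np.vstack([base_a + rng.normal(0, 0.1, 8) for _ in range(5)] +
                  [base_b + rng.normal(0, 0.1, 8) for _ in range(5)])

# Distance = 1 - Pearson correlation between expression profiles.
dist = pdist(genes, metric="correlation")
# Average-linkage agglomerative clustering; Z encodes the dendrogram.
Z = linkage(dist, method="average")
# Cut the dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The `Z` matrix is exactly the merge history that a dendrogram plot displays, and `fcluster` recovers co-expressed groups: the five genes built from each shared profile end up in the same cluster.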
Microarray Databases for Biotechnology

INTRODUCTION

Microarray informatics is a rapidly expanding discipline in which large amounts of multi-dimensional data are compressed into small storage units. Data mining of microarrays can be performed using techniques such as drill-down analysis rather than classical data analysis on a record-by-record basis. Both data and metadata can be captured in microarray experiments. The latter may be constructed by obtaining data samples from an experiment. Extractions can be made from these samples and formed into the homogeneous arrays that are needed for higher-level analysis and mining.

Biologists and geneticists find microarray analysis both a practical and appropriate method of storing images, together with pixel or spot intensities and identifiers, and other information about the experiment.

BACKGROUND

A microarray has been defined by Schena (2003) as an ordered array of microscopic elements in a planar substrate that allows the specific binding of genes or gene products. Schena (2003) describes microarray databases as a widely recognized next revolution in molecular biology that enables scientists to analyze genes, proteins, and other biological molecules on a genomic scale.

According to an article (2004) on the National Center for Biotechnology Information (NCBI) Web site, because microarrays can be used to examine the expression of hundreds or thousands of genes at once, the technology promises to revolutionize the way scientists examine gene expression, and this technology is still consid-

[Figure 1. Overview of the microarray process (Kennedy, 2003). The figure shows GENAdb, the Genomics Array Database.]

that are then organized into a results set as a text file that can then be subjected to analyses such as data mining.

Jagannathan (2002) of the Swiss Institute of Bioinformatics (SIB) described databases for microarrays, including their construction from microarray experiments such as gathering data from cells subjected to more than one condition. The latter are hybridized to a microarray that is stored after the experiment by methods such as scanned images. Hence, data must be stored both before and after the experiments, and the software used must be capable of dealing with large volumes of both numeric and image data. Jagannathan (2002) also discussed some of the most promising existing non-commercial microarray databases: ArrayExpress, which is a public microarray gene expression repository; the Gene Expression Omnibus (GEO), which is a gene expression database hosted at the National Library of Medicine; and GeneX, which is an open source database and integrated tool set released by the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico.

Grant (2001) wrote an entire thesis on microarray databases, setting the scene for their application to genetics and the human genome with its three-billion-letter sequence of genes. Kim (2002) presented improved analytical methods for microarray-based genome composition analysis by selecting a signal value that is used as a cutoff to discriminate present
and divergent genes. Do et al. (2003) provided a comparative evaluation of microarray-based gene expression databases by analyzing the requirements for microarray data management, and Sherlock (2003) discussed storage and retrieval of microarray data for molecular biology.

Kemmeren (2001) described a bioinformatics pipeline for supporting microarray analysis, with the example of production and analysis of DNA (deoxyribonucleic acid) microarrays that require informatics support. Gonclaves & Marks (2002) discussed roles and requirements for a research microarray database.

An XML description language called MAML (Microarray Annotation Markup Language) has been developed to allow communication with other databases worldwide (Cover Pages, 2002). Liu (2004) discusses microarray databases and MIAME (Minimal Information about a Microarray Experiment), which defines what information, at a minimum, should be stored. For example, the MIAME for array design would be the definite structure and definition of each array used and their elements. The Microarray Gene Expression Database Group (MGED) composed and developed the recommendations for microarray data annotations for both MIAME and MAML in 2000 and 2001, respectively, in Cambridge, United Kingdom.

Jonassen (2002) presents a microarray informatics resource Web page that includes surveys and introductory papers on informatics aspects, as well as database and software links. Another resourceful Web site is that from the Lawrence Livermore National Labs (2003) entitled Microarray Links, which provides an extensive list of active Web links for the categories of databases, microarray labs, and software and tools, including data mining tools.

University-wide database systems have been established, such as the Yale Microarray Database (YMD), which supports large-scale integrated analysis of the large amounts of gene expression data produced by a wide variety of microarray experiments for different organisms, as described by Cheung (2004), and similarly at Stanford with the Stanford Microarray Database (SMD), as described by both Sherlock (2001) and Selis (2003).

Microarray image analysis is currently included in university curricula, such as in Rouchka's (2003) Introduction to Bioinformatics graduate course at the University of Louisville.

In relation to the State of Arkansas, the medical school is situated in Little Rock and is known as the University of Arkansas for Medical Sciences (UAMS). A Bioinformatics Center housed within UAMS is involved with the management of microarray data. The software utilized at UAMS for microarray analysis includes BASE (BioArray Software Environment) and AMAD, which is a Web-driven database system written entirely in PERL and JavaScript (UAMS Bioinformatics Center, 2004).

MAIN THRUST

The purpose of this article is to help clarify the meaning of microarray informatics. The latter is addressed by summarizing some illustrations of applications of data mining to microarray databases specifically for biotechnology.

First, it needs to be stated which data mining tools are useful in data mining of microarrays. SAS Enterprise Miner, which was used in Segall et al. (2003, 2004a, 2004b) as discussed below, contains the major data mining tools of decision trees, regression, neural networks, and clustering, as well as other data mining tools such as association rules, variable selection, and link analysis. All of these are useful data mining tools for microarray databases, regardless of whether SAS Enterprise Miner is used. In fact, an entire text has been written by Draghici (2003) on data analysis tools for DNA microarrays that includes these data mining tools as well as numerous other tools, such as analysis of functional categories and statistical procedures for corrections for multiple comparisons.

Scientific and Statistical Data Mining and Visual Data Mining for Genomes

Data mining of microarray databases has been discussed by Deyholos (2002) for bioinformatics by methods that include correlation of patterns and identifying the significance analysis of microarrays (SAM) for genes within DNA. Visual data mining was utilized to distinguish the intensity of data filtering and the effect of normalization of the data using regression plots.

Tong (2002) discusses supporting microarray studies for toxicogenomic databases through data integration with public data and applying visual data mining such as a ScatterPlot viewer.

Chen et al. (2003) presented a statistical approach using a Gene Expression Analysis Refining System (GEARS).

Piatetsky-Shapiro and Tamayo (2003) discussed the main types of challenges for microarray data mining as including gene selection, classification, and clustering. According to Piatetsky-Shapiro and Tamayo (2003), one of the important challenges for data mining of microarrays is that the difficulty of collecting microarray samples causes the number of samples to remain small, and while
Microarray Databases for Biotechnology
the number of fields, corresponding to the number of genes, is typically in the thousands; this creates a high likelihood of finding false positives.

Piatetsky-Shapiro and Tamayo (2003) identify areas in which microarrays and data mining tools can be improved, including better accuracy, more robust models and estimators, and more appropriate biological interpretation of the computational or statistical results for those microarrays constructed from biomedical or DNA data.

Piatetsky-Shapiro and Tamayo (2003) summarize the areas in which microarray and microarray data mining tools can be improved by stating:

Typically a computational researcher will apply his or her favorite algorithm to some microarray dataset and quickly obtain a voluminous set of results. These results are likely to be useful but only if they can be put in context and followed up with more detailed studies, for example by a biologist or a clinical researcher. Often this follow up and interpretation is not done carefully enough because of the additional significant research involvement, the lack of domain expertise or proper collaborators, or due to the limitations of the computational analysis itself.

Draghici (2003) discussed in depth other challenges in using microarrays, specifically for gene expression studies, such as their being very noisy or prone to error after the scanning and image processing steps, the lack of consensus as to how to perform normalization, and the fact that microarrays are not necessarily able to substitute completely for other biological factors or tools in the realm of the molecular biologist.

Mamitsuka et al. (2003) mined biologically active patterns in metabolic pathways using microarray expression profiles. Mamitsuka (2003) utilized microarray data sets of gene expressions on yeast proteins.

Curran et al. (2003) performed statistical methods for joint data mining of gene expression and DNA sequence databases. The statistical methods used include the linear mixed-effect model, cluster analysis, and logistic regression.

Zaki et al. (2003) reported an overview of the papers on data mining in bioinformatics presented at the International Conference on Knowledge Discovery and Data Mining held in Washington, DC, in August 2003. Some of the novel data mining techniques discussed in papers at this conference included gene expression analysis, protein/RNA (ribonucleic acid) structure prediction, and gene finding.

Scientific and Statistical Data Mining and Visual Data Mining for Plants

Segall et al. (2003, 2004a, 2004b) performed data mining for assessing the impact of environmental stresses on plant genomics, and specifically for plant data from the Osmotic Stress Microarray Information Database (OSMID). The latter database is considered to be representative of those that could be used for biotech applications such as the manufacture of plant-made pharmaceuticals (PMP) and genetically modified (GM) foods.

The Osmotic Stress Microarray Information Database (OSMID), which was used in the data mining in Segall et al. (2003, 2004a, 2004b), contains the results of approximately 100 microarray experiments performed at the University of Arizona as part of a National Science Foundation (NSF) funded project named The Functional Genomics of Plant Stress, whose data constitutes a data warehouse.

The OSMID microarray database is available for public access on the Web, hosted by Universite Montpellier II (2003) in France, and OSMID contains information about the more than 20,000 ESTs (expressed sequence tags) that were used to produce these arrays. These 20,000 ESTs could be considered components of a data warehouse of plant microarray databases that was subjected to data mining in Segall et al. (2003, 2004a, 2004b). The data mining was performed using SAS Enterprise Miner and its cluster analysis module, which yielded both scientific and statistical data mining as well as visual data mining. The conclusions of Segall et al. (2003, 2004a, 2004b) included findings about the twenty-five different variations, or levels, of the environmental factor of salinity on corn plants, as also evidenced by the visualization of the clusters formed as a result of the data mining.

Other Useful Sources of Tools and Projects for Microarray Informatics

• A bibliography on microarray data analysis, created and made available on the Web by Li (2004), includes books and reprints from the last ten years.
• The Rosalind Franklin Centre for Genomics Research (RFCGR) of the Medical Research Council (MRC) (2004) in the UK provides a Web site with links for data mining tools and descriptions of their specific applications to gene expressions and microarray databases for genomics and genetics.
• Reviews of data mining software as applied to genetic microarray databases are included in an
annotated list of references for microarray software reviews compiled by Leung et al. (2002).
• Web links for the statistical analysis of microarray data are provided by van Helden (2004).
• Reid (2004) provides Web links to software tools for microarray data analysis, including image analysis.
• The Bio-IT World Journal Web site has a Microarray Resource Center that includes a link to extensive resources for microarray informatics at the European Bioinformatics Institute (EBI).

FUTURE TRENDS

The wealth of resources available on the Web for microarray informatics supports the premise that microarray informatics is a rapidly expanding field. This growth is in both software and methods of analysis, including techniques of data mining.

Future research opportunities in microarray informatics include the biotech applications for manufacture of plant-made pharmaceuticals (PMP) and genetically modified (GM) foods.

CONCLUSION

Because data within genome databases is composed of micro-level components such as DNA, microarray databases are a critical tool for analysis in biotechnology. Data mining of microarray databases opens up the field of microarray informatics as a multi-faceted tool for knowledge discovery.

ACKNOWLEDGMENT

The author wishes to acknowledge the funding provided by a block grant from the Arkansas Biosciences Institute (ABI), as administered by Arkansas State University (ASU) to encourage development of a focus area in Biosciences Institute Social and Economic and Regulatory Studies (BISERS), for which he served as Co-Investigator (Co-I) in 2003, and with which funding the analyses of the Osmotic Stress Microarray Information Database (OSMID) discussed within this article were performed.

The author also wishes to acknowledge a three-year software grant from SAS Incorporated to the College of Business at Arkansas State University for SAS Enterprise Miner, which was used in the data mining of the OSMID microarrays discussed within this article.

Finally, the author also wishes to acknowledge the useful reviews of the three anonymous referees of the earlier version of this article, without whose constructive comments the final form of this article would not have been possible.

REFERENCES

Bio-IT World Inc. (2004). Microarray resources and articles. Retrieved from http://www.bio-itworld.com/resources/microarray/

Chen, C.H. et al. (2003). Gene expression analysis refining system (GEARS) via statistical approach: A preliminary report. Genome Informatics, 14, 316-317.

Cheung, K.H. et al. (2004). Yale Microarray Database System. Retrieved from http://crcjs.med.utah.edu/bioinfo/abstracts/Cheung,%20Kei.doc

Curran, M.D., Liu, H., Long, F., & Ge, N. (2003, December). Machine learning in low-level microarray analysis. SIGKDD Explorations, 5(2), 122-129.

Deyholos, M. (2002). An introduction to exploring genomes and mining microarrays. In O'Reilly Bioinformatics Technology Conference, January 28-31, 2002, Tucson, AZ. Retrieved from http://conferences.oreillynet.com/cs/bio2002/view/e_sess/1962

Do, H., Toralf, K., & Rahm, E. (2003). Comparative evaluation of microarray-based gene expression databases. Retrieved from http://www.btw2003.de/proceedings/paper/96.pdf

Draghici, S. (2003). Data analysis tools for DNA microarrays. Boca Raton, FL: Chapman & Hall/CRC.

Goncalves, J., & Marks, W.L. (2002). Roles and requirements for a research microarray database. IEEE Engineering in Medicine and Biology Magazine, 21(6), 154-157.

Grant, E. (2001, September). A microarray database. Thesis for Master of Science in Information Technology, The University of Glasgow.

Jagannathan, V. (2002). Databases for microarrays. Presentation at the Swiss Institute of Bioinformatics (SIB), University of Lausanne, Switzerland. Retrieved from http://www.ch.embnet.org/CoursEMBnet/CHIP02/ppt/Vidhya.ppt

Jonassen, I. (2002). Microarray informatics resource page. Retrieved from http://www.ii.uib.no/~inge/micro

Kemmeren, P.C., & Holstege, F.C. (2001). A bioinformatics pipeline for supporting microarray analysis. Retrieved
Van Helden, J. (2004). Statistical analysis of microarray data: Links. Retrieved from http://www.scmbb.ulb.ac.be/~jvanheld/web_course_microarrays/links.html

Zaki, M.J., Wang, H.T., & Toivonen, H.T. (2003, December). Data mining in bioinformatics. SIGKDD Explorations, 5(2), 198-199.

KEY TERMS

Data Warehouses: A huge collection of consistent data that is both subject-oriented and time-variant, and used in support of decision making.

Genomic Databases: Organized collections of data pertaining to the genetic material of an organism.

Metadata: Data about data; for example, data that describes the properties or characteristics of other data.

MIAME (Minimum Information About a Microarray Experiment): Defines what information, at a minimum, should be stored.

Microarray Databases: Store large amounts of complex data as generated by microarray experiments (e.g., DNA).

Microarray Informatics: The study of the use of microarray databases to obtain information about experimental data.

Microarray Markup Language (MAML): An XML (Extensible Markup Language)-based format for communicating information about data from microarray experiments.

Scientific and Statistical Data Mining: The use of data and image analyses to investigate knowledge discovery of patterns in the data.

Visual Data Mining: The use of computer-generated graphics in both 2-D and 3-D for knowledge discovery of patterns in data.
Mine Rule

Rosa Meo
Università degli Studi di Torino, Italy

Giuseppe Psaila
Università degli Studi di Bergamo, Italy
INTRODUCTION

Mining of association rules is one of the most widely adopted techniques for data mining in the most widespread application domains. A great deal of work has been carried out in recent years on the development of efficient algorithms for association rule extraction. Indeed, this problem is a computationally difficult task, known to be NP-hard (Calders, 2004), aggravated by the fact that association rules normally are extracted from very large databases. Moreover, in order to increase the relevance and interestingness of the obtained results and to reduce the volume of the overall result, constraints on association rules are introduced and must be evaluated (Ng et al., 1998; Srikant et al., 1997). However, in this contribution, we do not focus on the problem of developing efficient algorithms but on the semantic problem behind the extraction of association rules (see Tsur et al. [1998] for an interesting generalization of this problem).

We want to put in evidence the semantic dimensions that characterize the extraction of association rules; that is, we describe in a more general way the classes of problems that association rules solve. In order to accomplish this, we adopt a general-purpose query language designed for the extraction of association rules from relational databases. The operator of this language, MINE RULE, allows the expression of constraints, constituted by standard SQL predicates, that make it suitable to be employed with success in many diverse application problems. For a comparison between this query language and other state-of-the-art languages for data mining, see Imielinski et al. (1996); Han et al. (1996); Netz et al. (2001); Botta et al. (2004).

In Imielinski et al. (1996), a new approach to data mining is proposed, which is constituted by a new generation of databases called Inductive Databases (IDBs). With an IDB, the user/analyst can use advanced query languages for data mining in order to interact with the knowledge discovery (KDD) system, extract data mining descriptive and predictive patterns from the database, and store them in the database. Boulicaut et al. (1998) and Baralis et al. (1999) discuss the usage of MINE RULE in this context.

We want to show that, thanks to a highly expressive query language, it is possible to exploit all the semantic possibilities of association rules and to solve very different problems with a unique language, whose statements are instantiated along the different semantic dimensions of the same application domain. We discuss examples of statements solving problems in different application domains that nowadays are of great importance. The first application is the analysis of retail data, whose aim is market basket analysis (Agrawal et al., 1993) and the discovery of user profiles for customer relationship management (CRM). The second application is the analysis of data registered in a Web server on the accesses to Web sites by users. Cooley et al. (2000) present a study on the same application domain. The last domain is the analysis of genomic databases containing data on micro-array experiments (Fayyad, 2003). We show many practical examples of MINE RULE statements and discuss the application problems that can be solved by analyzing the association rules that result from those statements.

BACKGROUND

An association rule has the form B → H, where B and H are sets of items, respectively called body (the antecedent) and head (the consequent). An association rule (also denoted, for short, with rule) intuitively means that items in B and H often are associated within the observed data. Two numerical parameters denote the validity of the rule: support is the fraction of source data for which the rule holds; confidence is the conditional probability that H holds, provided that B holds. Two minimum thresholds for support and confidence are specified before rules are extracted, so that only significant rules are extracted.

This very general definition, however, is incomplete and very ambiguous. For example, what is the meaning of "fraction of source data for which the rule holds"? Or what are the items associated by a rule? If we do not answer these basic questions, an association rule does
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
not have a precise meaning. Consider, for instance, the original problem for which association rules were initially proposed in Agrawal et al. (1993): the market baskets analysis. If we have a database collecting single purchase transactions (i.e., transactions performed by customers in a retail store), we might wish to extract association rules that associate items sold within the same transactions. Intuitively, we are defining the semantics of our problem: items are associated by a rule if they appear together in the same transaction. Support denotes the fraction of the total transactions that contain all the items in the rule (both B and H), while confidence denotes the conditional probability that, having found B in a transaction, H is also found in the same transaction. Thus a rule

{pants, shirt} → {socks, shoes} support=0.02 confidence=0.23

means that the items pants, shirt, socks, and shoes appear together in 2% of the transactions, while, having found items pants and shirt in a transaction, the probability that the same transaction also contains socks and shoes is 23%.

Semantic Dimensions

MINE RULE puts in evidence the semantic dimensions that characterize the extraction of association rules from within relational databases and forces users (typically analysts) to understand these semantic dimensions. Indeed, extracted association rules describe the most recurrent values of certain attributes that occur in the data (in the previous example, the names of the purchased products). This is the first semantic dimension that characterizes the problem. These recurrent values are observed within sets of data grouped by some common features (i.e., the transaction identifier in the previous example but, in general, the date, the customer identifier, etc.). This constitutes the second semantic dimension of the association rule problem. Therefore, extracted association rules describe the observed values of the first dimension, which are recurrent in entities identified by the second dimension.

When values belonging to the first dimension are associated, it is possible that not every association is suitable, but only a subset of them should be selected, based on a coupling condition on attributes of the analyzed data (e.g., a temporal sequence between events described in B and H). This is the third semantic dimension of the problem; the coupling condition is called the mining condition.

It is clear that MINE RULE is not tied to any particular application domain, since the semantic dimensions allow the discovery of significant and unexpected information in very different application domains.

The main features and clauses of MINE RULE are as follows (see Meo et al. [1998] for a detailed description):

• Selection of the relevant set of data for a data mining process: This feature is specified by the FROM clause.
• Selection of the grouping features w.r.t. which data are observed: These features are expressed by the GROUP BY clause.
• Definition of the structure of rules and cardinality constraints on body and head, specified in the SELECT clause: Elements in rules can be single values or tuples.
• Definition of coupling constraints: These are constraints applied at the rule level (the mining condition, instantiated by a WHERE clause associated to SELECT) for coupling values.
• Definition of rule evaluation measures and minimum thresholds: These are support and confidence (even if, theoretically, other statistical measures also would be possible). Support of a rule is computed on the total number of groups in which it occurs and satisfies the given constraints. Confidence is the ratio between the rule support and the support of the body satisfying the given constraints. Thresholds are specified by the clause EXTRACTING RULES WITH.

MAIN THRUST

In this section, we introduce MINE RULE in the context of the three application domains. We describe many examples of queries that can be conceived as a sort of template, because they are instantiated along the relevant dimensions of an application domain and solve some frequent, similar, and critical situations for users of different applications.

First Application: Retail Data Analysis

We consider a typical data warehouse gathering information on customers' purchases in a retail store:

FactTable (TransId, CustId, TimeId, ItemId, Num, Discount)
Customer (CustId, Profession, Age, Sex)

Rows in FactTable describe sales. The dimensions of data are the customer (CustId), the time (TimeId), and the purchased item (ItemId); each sale is characterized by the
number of sold pieces (Num) and the discount (Discount); the transaction identifier (TransId) is reported as well. We also report the table Customer.

Example 1: We want to extract a set of association rules, named FrequentItemSets, that finds the associations between sets of items (first dimension of the problem) purchased together in a sufficient number of dates (second dimension), with no specific coupling condition (third dimension). These associations provide the business-relevant sets of items, because they are the most frequent in time. The MINE RULE statement is now reported.

MINE RULE FrequentItemSets AS
SELECT DISTINCT 1..n ItemId AS BODY,
       1..n ItemId AS HEAD, SUPPORT, CONFIDENCE
FROM FactTable
GROUP BY TimeId
EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4

The first dimension of the problem is specified in the SELECT clause, which specifies the schema of each element in association rules, the cardinality of body and head (in terms of lower and upper bound), and the statistical measures for the evaluation of association rules (support and confidence); in the example, body and head are non-empty sets of items, and their upper bound is unlimited (denoted as 1..n).

The GROUP BY clause provides the second dimension of the problem: since attribute TimeId is specified, rules denote that associated items have been sold on the same date (intuitively, rows are grouped by values of TimeId, and rules associate values of attribute ItemId appearing in the same group).

Support of an association rule is computed in terms of the number of groups in which the elements of the rule co-occur; confidence is computed analogously. In this example, support is computed over the different instants of time, since grouping is made according to the time identifier. Support and confidence of rules must not be lower than the values in the EXTRACTING clause (respectively, 0.2 and 0.4).

Example 2: Customer profiling is a key problem in CRM applications. Association rules allow one to obtain a description of customers (e.g., w.r.t. age and profession) in terms of frequently purchased products. To do that, values coming from two distinct dimensions of data must be associated.

MINE RULE CustomerProfiles AS
SELECT DISTINCT 1..1 Profession, Age AS BODY,
       1..n Item AS HEAD, SUPPORT, CONFIDENCE
FROM FactTable JOIN Customer
  ON FactTable.CustId = Customer.CustId
GROUP BY CustId
EXTRACTING RULES WITH SUPPORT:0.6, CONFIDENCE:0.9

The observed entity is the customer (first dimension of data), described by a single pair in the body (cardinality constraint 1..1); the head associates products frequently purchased by customers (second dimension of data) with the profile reported in the body (see the SELECT clause). Thus a rule

{(employee, 35)} → {socks, shoes} support=0.7 confidence=0.96

means that customers who are employees and 35 years old often (in 96% of cases) buy socks and shoes. Support tells about the absolute frequency of the profile in the customer base (GROUP BY clause). This solution can be generalized easily for any profiling problem.

Second Application: Web Log Analysis

Typically, Web servers store information concerning accesses to Web sites in a standard log file. This is a relational table (WebLogTable) that typically contains at least the following attributes:

• RequestID: identifier of the request;
• IPcaller: IP address from which the request originated;
• Date: date of the request;
• TS: time stamp;
• Operation: kind of operation (for instance, get or put);
• Page URL: URL of the requested page;
• Protocol: transfer protocol (such as TCP/IP);
• Return Code: code returned by the Web server;
• Dimension: dimension of the page (in bytes).

Example 1: To discover Web communities of users on the basis of the pages they visited frequently, we might find associations between sets of users (first dimension) that have all visited a certain number of pages (second dimension); no coupling conditions are necessary (third dimension). Users are observed by means of their IP address, IPcaller, whose values are associated by rules (see SELECT). In this case, support and confidence of association rules are computed based on the number of pages visited by the users in rules (see GROUP BY). Thus, rule

{Ip1, Ip2} → {Ip3, Ip4} support=0.4 confidence=0.45

means that users operating from Ip1, Ip2, Ip3, and Ip4 visited the same set of pages, which constitute 40% of the total pages in the site.

MINE RULE UsersSamePages AS
SELECT DISTINCT 1..n IPcaller AS BODY,
       1..n IPcaller AS HEAD, SUPPORT, CONFIDENCE
FROM WebLogTable
GROUP BY PageUrl
EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4

Example 2: In Web log analysis, it is interesting to discover the most frequent crawling paths.

MINE RULE FreqSeqPages AS
SELECT DISTINCT 1..n PageUrl AS BODY,
       1..n PageUrl AS HEAD, SUPPORT, CONFIDENCE
WHERE BODY.Date < HEAD.Date
FROM WebLogTable
GROUP BY IPcaller
EXTRACTING RULES WITH SUPPORT:0.3, CONFIDENCE:0.4

Rows are grouped by user (IPcaller), and sets of pages frequently visited by a sufficient number of users are associated. Furthermore, pages are associated only if they denote a sequential pattern (third dimension); in fact, the mining condition WHERE BODY.Date < HEAD.Date constrains the temporal ordering between pages in the antecedent and consequent of rules. Consequently, rule

{P1, P2} → {P3, P4, P5} support=0.5 confidence=0.6

means that 50% of users visit pages P3, P4, and P5 after pages P1 and P2. This solution can be generalized easily for any problem requiring the search for sequential patterns.

Many other examples are possible, such as rules that associate users to frequently visited Web pages (highlighting the fidelity of the users to the service provided by a Web site) or frequent requests of a page by a browser that cause an error in the Web server (interesting because it constitutes a favorable situation for hackers' attacks).

Third Application: Genes Classification by Micro-Array Experiments

We consider information on a single micro-array experiment containing data on several samples of biological tissue tied to corresponding probes on a silicon chip. Each sample is treated (or hybridized) in various ways and under different experimental conditions; these can determine the over-expression of a set of genes. This means that the sets of genes are active in the experimental conditions (or inactive if, on the contrary, they are under-expressed). Biologists are interested in discovering which sets of genes are expressed similarly and under what conditions.

A micro-array typically contains hundreds of samples, and for each sample, several thousands of genes are measured. Thus, the input relation, called MicroArrayTable, contains the following information:

• SampleID: identifier of the sample of biological tissue tied to a probe on the microchip;
• GeneId: identifier of the gene measured in the sample;
• TreatmentConditionId: identifier of the experimental conditions under which the sample has been treated;
• LevelOfExpression: measured value; if higher than a threshold T2, the genes are over-expressed; if lower than another threshold T1, genes are under-expressed.

Example: This analysis discovers sets of genes (first dimension of the problem) that, in the same experimental conditions (second dimension), are expressed similarly (third dimension).

MINE RULE SimilarlyCorrelatedGenes AS
SELECT DISTINCT 1..n GeneId AS BODY,
       1..n GeneId AS HEAD, SUPPORT, CONFIDENCE
WHERE (BODY.LevelOfExpression < T1 AND
       HEAD.LevelOfExpression < T1) OR
      (BODY.LevelOfExpression > T2 AND
       HEAD.LevelOfExpression > T2)
FROM MicroArrayTable
GROUP BY SampleId, TreatmentConditionId
EXTRACTING RULES WITH SUPPORT:0.95, CONFIDENCE:0.8

The mining condition introduced by WHERE constrains both of the sets of genes to be similarly expressed in the same experimental conditions (i.e., samples of tissue treated in the same conditions). Support thresh-
Meo, R., Psaila, G., & Ceri, S. (1998). An extension to SQL for mining association rules. Journal of Data Mining and Knowledge Discovery, 2(2), 195-224.

Netz, A., Chaudhuri, S., Fayyad, U.M., & Bernhardt, J. (2001). Integrating data mining with SQL databases: OLE DB for data mining. Proceedings of the International Conference on Data Engineering, Heidelberg, Germany.

Ng, R.T., Lakshmanan, V.S., Han, J., & Pang, A. (1998). Exploratory mining and pruning optimizations of constrained association rules. Proceedings of the International Conference on Management of Data, Seattle, Washington.

Srikant, R., Vu, Q., & Agrawal, R. (1997). Mining association rules with item constraints. Proceedings of the International Conference on Knowledge Discovery from Databases, Newport Beach, California.

Tsur, D. et al. (1998). Query flocks: A generalization of association-rule mining. Proceedings of the International Conference on Management of Data, Seattle, Washington.

KEY TERMS

Association Rule: An association between two sets of items co-occurring frequently in groups of data.

Constraint-Based Mining: Data mining obtained by means of evaluation of queries in a query language allowing predicates.

CRM: Management, understanding, and control of data on the customers of a company for the purposes of enhancing business and minimizing customer churn.

Inductive Database: Database system integrating in the database both the source data and the data mining patterns defined as the result of data mining queries on the source data.

KDD: Knowledge Discovery in Databases; the process performing tasks of data pre-processing, transformation and selection, extraction of data mining patterns, and their post-processing and interpretation.

Semantic Dimension: Concept or entity of the studied domain that is being observed in terms of other concepts or entities.

Web Log: File stored by the Web server containing data on users' accesses to a Web site.
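The group-based support and confidence that all of the MINE RULE examples above rely on can be pinned down with a small sketch. The following Python fragment is illustrative only: the data is made up, and rule_stats is a hypothetical helper, not part of MINE RULE. It counts groups the way the Background section describes: support is the fraction of all groups containing both body and head, and confidence divides that count by the number of groups containing the body.

```python
def rule_stats(groups, body, head):
    """Support and confidence of the rule body -> head over grouped items.

    groups: dict mapping a group id (e.g., a transaction id) to a set of items.
    Support    = (#groups containing body and head) / (#groups)
    Confidence = (#groups containing body and head) / (#groups containing body)
    """
    n_body = sum(1 for items in groups.values() if body <= items)
    n_both = sum(1 for items in groups.values() if (body | head) <= items)
    support = n_both / len(groups) if groups else 0.0
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence

# Toy market-basket data (hypothetical): transaction id -> purchased items.
baskets = {
    1: {"pants", "shirt", "socks", "shoes"},
    2: {"pants", "shirt", "socks"},
    3: {"pants", "shirt"},
    4: {"hat"},
}

s, c = rule_stats(baskets, body={"pants", "shirt"}, head={"socks"})
# s == 0.5 (2 of 4 groups), c == 2/3 (2 of the 3 groups containing the body)
```

A MINE RULE engine enumerates all rules above the given thresholds rather than scoring one fixed rule; the fragment is only meant to make the counting semantics concrete.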
Mining Association Rules on a NCR Teradata System

Murali Mangamuri
Wright State University, USA
parallel database system that can execute SQL queries in parallel on different processing nodes. By processing the relations directly, we can easily relate the mined association rules to other information in the same database, such as the customer information.

In this paper, we propose a new algorithm named Enhanced SETM (ESETM), which is an enhanced version of the SETM algorithm. We implemented both ESETM and SETM on a parallel NCR Teradata database system and evaluated and compared their performance for various cases. It has been shown that ESETM is considerably faster than SETM.

MAIN THRUST

NCR Teradata Database System

The algorithms are implemented on an NCR Teradata database system. It has two nodes, where each node consists of 4 Intel 700MHz Xeon processors, 2GB of shared memory, and 36GB of disk space. The nodes are interconnected by a dual BYNET interconnection network supporting 960Mbps of data bandwidth for each node. Moreover, the nodes are connected to an external disk storage subsystem configured as a level-5 RAID (Redundant Array of Inexpensive Disks) with 288GB of disk space.

The relational DBMS used here is Teradata RDBMS (version 2.4.1), which is designed specifically to function in the parallel environment. The hardware that supports the Teradata RDBMS software is based on off-the-shelf Symmetric Multiprocessing (SMP) technology. The hardware is combined with a communication network (BYNET) that connects the SMP systems to form Massively Parallel Processing (MPP) systems, as shown in Figure 1 (NCR Teradata Division, 2002).

The versatility of the Teradata RDBMS is based on virtual processors (vprocs) that eliminate the dependency on specialized physical processors. Vprocs are a set of software processes that run on a node within the multitasking environment of the operating system. Each vproc is a separate, independent copy of the processor software, isolated from other vprocs but sharing some of the physical resources of the node, such as memory and CPUs (NCR Teradata Division, 2002).

Vprocs and the tasks running under them communicate using unique-address messaging, as if they were physically isolated from one another. The Parsing Engine (PE) and the Access Module Processor (AMP) are the two types of vprocs. Each PE executes the database software that manages sessions, decomposes SQL statements into steps, possibly parallel, and returns the answer rows to the requesting client. The AMP is the heart of the Teradata RDBMS: a vproc that performs many database and file-management tasks. The AMPs control the management of the Teradata RDBMS and the disk subsystem. Each AMP manages a portion of the physical disk space and stores its portion of each database table within that disk space, as shown in Figure 2 (NCR Teradata Division, 2002).

SETM Algorithm

The SETM algorithm proposed in (Houtsma & Swami, 1995) for finding frequent itemsets, and the corresponding SQL queries used, are as follows:

// SALES = <trans_id, item>
k := 1;
sort SALES on item;
F_1 := set of frequent 1-itemsets and their counts;
R_1 := filter SALES to retain supported items;
repeat
    k := k + 1;
    sort R_(k-1) on trans_id, item_1, ..., item_(k-1);
    R_k := merge-scan R_(k-1), R_1;
    sort R_k on item_1, ..., item_k;
    F_k := generate frequent k-itemsets from the sorted R_k;
    R_k := filter R_k to retain supported k-itemsets;
until R_k = {}

In this algorithm, initially, all frequent 1-itemsets and their respective counts (F_1 = <item, count>) are generated by a simple sequential scan over the SALES table. After creating F_1, R_1 is created by filtering SALES using F_1. A merge-scan is performed for creating the R_k table using the R_(k-1)
Figure 1. Teradata system architecture Figure 2. Query processing in the Teradata system
747
TEAM LinG
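As a concrete illustration of this first pass, the two initialization steps (F1 and R1) can be run on any SQL engine. The sketch below uses Python's sqlite3; the toy SALES data and the minimum support of 2 are illustrative assumptions, not values from the chapter.

```python
import sqlite3

# Illustrative sketch of SETM's first pass on SQLite; the toy SALES data
# and the minimum support of 2 are assumptions, not from the chapter.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE SALES (trans_id INTEGER, item INTEGER)")
cur.executemany("INSERT INTO SALES VALUES (?, ?)",
                [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 2), (3, 4)])

# F1 := frequent 1-itemsets and their counts (sequential scan over SALES)
cur.execute("""CREATE TABLE F1 AS
               SELECT item, COUNT(*) AS cnt FROM SALES
               GROUP BY item HAVING COUNT(*) >= 2""")

# R1 := SALES filtered to retain only the supported items
cur.execute("""CREATE TABLE R1 AS
               SELECT s.trans_id, s.item FROM SALES s, F1 f
               WHERE s.item = f.item""")

print(sorted(cur.execute("SELECT item, cnt FROM F1")))  # [(2, 3), (3, 2)]
print(len(cur.execute("SELECT * FROM R1").fetchall()))  # 5
```

Items 1 and 4 each appear in only one transaction, so they are dropped from F1 and filtered out of R1.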
Mining Association Rules on a NCR Teradata System
A merge-scan is performed for creating the Rk table using the Rk-1 and R1 tables. The Rk table can be viewed as the set of candidate k-itemsets coupled with their transaction identifiers.

SQL query for generating Rk:

    INSERT INTO Rk
    SELECT p.trans_id, p.item1, ..., p.itemk-1, q.item
    FROM Rk-1 p, R1 q
    WHERE q.trans_id = p.trans_id AND q.item > p.itemk-1

Frequent k-itemsets are generated by a sequential scan over Rk, selecting only those itemsets that meet the minimum support constraint.

SQL query for generating Fk:

    INSERT INTO Fk
    SELECT p.item1, ..., p.itemk, COUNT(*)
    FROM Rk p
    GROUP BY p.item1, ..., p.itemk
    HAVING COUNT(*) >= :minimum_support

The R'k table is created by filtering the Rk table using Fk. The R'k table can be viewed as a set of frequent k-itemsets coupled with their transaction identifiers. This step is performed to ensure that only the candidate k-itemsets (Rk) relative to frequent k-itemsets are used to generate the candidate (k+1)-itemsets.

SQL query for generating R'k:

    INSERT INTO R'k
    SELECT p.trans_id, p.item1, ..., p.itemk
    FROM Rk p, Fk q
    WHERE p.item1 = q.item1 AND
          ...
          p.itemk-1 = q.itemk-1 AND
          p.itemk = q.itemk
    ORDER BY p.trans_id, p.item1, ..., p.itemk

A loop is used to implement the procedure described above, and the number of iterations depends on the size of the largest frequent itemset, as the procedure is repeated until Fk is empty.

Enhanced SETM (ESETM)

The Enhanced SETM (ESETM) algorithm has three modifications to the original SETM algorithm:

1. Create frequent 2-itemsets without materializing R1 and R2.
2. Create candidate (k+1)-itemsets in Rk+1 by joining R'k with itself.
3. Use a subquery to generate R'k rather than materializing it, thereby generating Rk+1 directly from Rk.

The number of candidate 2-itemsets can be very large, so it is inefficient to materialize the R2 table. Instead of creating the R2 table, ESETM creates a view or a subquery to generate candidate 2-itemsets and directly generates frequent 2-itemsets. This view or subquery is also used to create candidate 3-itemsets.

    CREATE VIEW R2 (trans_id, item1, item2) AS
    SELECT P1.trans_id, P1.item, P2.item
    FROM (SELECT p.trans_id, p.item FROM SALES p, F1 q
          WHERE p.item = q.item) AS P1,
         (SELECT p.trans_id, p.item FROM SALES p, F1 q
          WHERE p.item = q.item) AS P2
    WHERE P1.trans_id = P2.trans_id AND
          P1.item < P2.item

Note that R1 is not created, since it will not be used for the generation of Rk. The set of frequent 2-itemsets, F2, can be generated directly by using this R2 view.

    INSERT INTO F2
    SELECT item1, item2, COUNT(*)
    FROM R2
    GROUP BY item1, item2
    HAVING COUNT(*) >= :minimum_support

The second modification is to generate Rk+1 using the join of R'k with itself, instead of the merge-scan of R'k with R1.

SQL query for generating Rk+1:

    INSERT INTO Rk+1
    SELECT p.trans_id, p.item1, ..., p.itemk, q.itemk
    FROM R'k p, R'k q
    WHERE p.trans_id = q.trans_id AND
          p.item1 = q.item1 AND
          ...
          p.itemk-1 = q.itemk-1 AND
          p.itemk < q.itemk

This modification reduces the number of candidate (k+1)-itemsets generated compared to the original SETM algorithm. The performance of the algorithm can be improved further if candidate (k+1)-itemsets are generated directly from candidate k-itemsets using a subquery as follows:

SQL query for Rk+1 using Rk:

    INSERT INTO Rk+1
    SELECT P1.trans_id, P1.item1, ..., P1.itemk, P2.itemk
    FROM (SELECT p.* FROM Rk p, Fk q
          WHERE p.item1 = q.item1 AND ... AND p.itemk = q.itemk) AS P1,
         (SELECT p.* FROM Rk p, Fk q
          WHERE p.item1 = q.item1 AND ... AND p.itemk = q.itemk) AS P2
    WHERE P1.trans_id = P2.trans_id AND
          P1.item1 = P2.item1 AND
          ...
          P1.itemk-1 = P2.itemk-1 AND
          P1.itemk < P2.itemk

R'k is generated as a derived table using a subquery, thereby saving the cost of materializing the R'k table.
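The first modification can be checked end to end on any SQL engine. The sketch below rewrites the chapter's view and F2 query for Python's sqlite3; the toy SALES data and the minimum support of 2 are assumptions for illustration only.

```python
import sqlite3

# Illustrative sketch only: the chapter's R2 view and F2 query on SQLite.
# The toy SALES data and the minimum support of 2 are assumptions.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE SALES (trans_id INTEGER, item INTEGER)")
cur.executemany("INSERT INTO SALES VALUES (?, ?)",
                [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)])

# Pass 1: frequent 1-itemsets and their counts, by a scan over SALES
cur.execute("""CREATE TABLE F1 AS
               SELECT item, COUNT(*) AS cnt FROM SALES
               GROUP BY item HAVING COUNT(*) >= 2""")

# ESETM's first modification: candidate 2-itemsets live in a view,
# so neither R1 nor R2 is ever materialized
cur.execute("""CREATE VIEW R2 (trans_id, item1, item2) AS
               SELECT P1.trans_id, P1.item, P2.item
               FROM (SELECT s.trans_id, s.item FROM SALES s, F1 f
                     WHERE s.item = f.item) AS P1,
                    (SELECT s.trans_id, s.item FROM SALES s, F1 f
                     WHERE s.item = f.item) AS P2
               WHERE P1.trans_id = P2.trans_id AND P1.item < P2.item""")

# F2 comes directly from the view, as in the chapter's INSERT INTO F2
f2 = cur.execute("""SELECT item1, item2, COUNT(*) FROM R2
                    GROUP BY item1, item2
                    HAVING COUNT(*) >= 2""").fetchall()
print(sorted(f2))  # [(1, 2, 2), (1, 3, 2), (2, 3, 3)]
```

With transactions {1,2,3}, {2,3}, and {1,2,3}, all three 2-itemsets over items 1, 2, 3 reach the support threshold, and no pair table was stored.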
ESETM with Pruning (PSETM)

In the ESETM algorithm, candidate (k+1)-itemsets in Rk+1 are generated by joining R'k with itself on the first k-1 items, as described previously. For example, a 4-itemset {1, 2, 3, 9} becomes a candidate 4-itemset only if {1, 2, 3} and {1, 2, 9} are frequent 3-itemsets. This is different from the subset-infrequency-based pruning of the candidates used in the Apriori algorithm, where a (k+1)-itemset becomes a candidate (k+1)-itemset only if all of its k-subsets are frequent. Under that condition, {2, 3, 9} and {1, 3, 9} also should be frequent for {1, 2, 3, 9} to be a candidate 4-itemset. The above SQL query for generating Rk+1 can be modified such that all the k-subsets of each candidate (k+1)-itemset can be checked. To simplify the presentation, we divided the query into subqueries. Candidate (k+1)-itemsets are generated by the Subquery Q1 using Fk.

Subquery Q0:

    SELECT item1, item2, ..., itemk FROM Fk

Subquery Q1:

    SELECT p.item1, p.item2, ..., p.itemk, q.itemk
    FROM Fk p, Fk q
    WHERE p.item1 = q.item1 AND
          ...
          p.itemk-1 = q.itemk-1 AND
          p.itemk < q.itemk AND
          (p.item2, ..., p.itemk, q.itemk) IN (Subquery Q0) AND
          ...
          (p.item1, ..., p.itemj-1, p.itemj+1, ..., p.itemk, q.itemk) IN (Subquery Q0) AND
          ...
          (p.item1, ..., p.itemk-2, p.itemk, q.itemk) IN (Subquery Q0)

Subquery Q2:

    SELECT p.* FROM Rk p, Fk q
    WHERE p.item1 = q.item1 AND ... AND p.itemk = q.itemk

    INSERT INTO Rk+1
    SELECT p.trans_id, p.item1, ..., p.itemk, q.itemk
    FROM (Subquery Q2) p, (Subquery Q2) q
    WHERE p.trans_id = q.trans_id AND
          p.item1 = q.item1 AND
          ...
          p.itemk-1 = q.itemk-1 AND
          p.itemk < q.itemk AND
          (p.item1, ..., p.itemk, q.itemk) IN (Subquery Q1)

Q2 derives R'k, and Rk+1 is generated as Rk+1 = (R'k JOIN R'k). However, the overhead of checking the subset conditions in the Subquery Q1 is too high when there are not many candidates to be pruned. In our implementation, the pruning is performed until the number of rows in Fk becomes less than 1,000, or up to five passes. The difference between the total execution times with and without pruning was very small for most of the databases we tested.

Performance Analysis

In this section, the performance of the Enhanced SETM (ESETM), ESETM with pruning (PSETM), and SETM is evaluated and compared. We used synthetic transaction databases generated according to the procedure described in (Agrawal & Srikant, 1994).
The total execution times of ESETM, PSETM and SETM are shown in Figure 3 for the database T10.I4.D100K, where Txx.Iyy.DzzzK indicates that the average number of items in a transaction is xx, the average size of a maximal potential frequent itemset is yy, and the number of transactions in the database is zzz in thousands.

Figure 3. Total execution times (for T10.I4.D100K)

ESETM is more than three times faster than SETM for all minimum support levels, and the performance gain increases as the minimum support level decreases. ESETM and PSETM have almost the same total execution time, because the effect of the reduced number of candidates in PSETM is offset by the extra time required for the pruning.
The time taken for each pass by the algorithms for the T10.I4.D100K database with the minimum support of 0.25% is shown in Figure 4. The second pass execution time of ESETM is much smaller than that of SETM, because the R2 table (containing candidate 2-itemsets together with the transaction identifiers) and the R'2 table (containing frequent 2-itemsets together with the transaction identifiers) are not materialized. In the later passes, the performance of ESETM is much better than that of SETM, because ESETM has far fewer candidate itemsets generated and does not materialize the R'k tables, for k > 2.
In Figure 5, the size of the Rk table containing candidate k-itemsets is shown for each pass when the T10.I4.D100K database is used with the minimum support of 0.25%.
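The all-k-subsets condition that the IN (Subquery Q0) predicates express can be restated in a few lines of plain Python. This is a sketch of the pruning rule itself, not of the SQL implementation; the candidate and frequent sets below are the chapter's {1, 2, 3, 9} example plus assumed frequent 3-itemsets.

```python
from itertools import combinations

def prune(candidates, frequent_k):
    """Keep a (k+1)-candidate only if every one of its k-subsets is frequent."""
    frequent_k = set(frequent_k)
    return [c for c in candidates
            if all(s in frequent_k for s in combinations(c, len(c) - 1))]

f3 = [(1, 2, 3), (1, 2, 9), (1, 3, 9), (2, 3, 9)]
print(prune([(1, 2, 3, 9)], f3))      # kept: all four 3-subsets are frequent
print(prune([(1, 2, 3, 9)], f3[:2]))  # pruned: (1, 3, 9) is not frequent
```

The self-join alone only guarantees the first two 3-subsets; the extra check removes candidates like {1, 2, 3, 9} when {1, 3, 9} or {2, 3, 9} is infrequent.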
From the third pass, the size of the Rk table for ESETM is much smaller than that of SETM because of the reduced number of candidate itemsets. PSETM performs additional pruning of candidate itemsets, but the difference in the number of candidates is very small in this case.

Figure 4. Per pass execution times (for T10.I4.D100K)
Figure 5. Size of R'k (for T10.I4.D100K)

The scalability of the algorithms is evaluated by increasing the number of transactions and the average size of transactions. Figure 6 shows how the three algorithms scale up as the number of transactions increases. The database used here is T10.I4, and the minimum support is 0.5%. The number of transactions ranges from 100,000 to 400,000. SETM performs poorly as the number of transactions increases, because it generates many more candidate itemsets than the others.

Figure 6. Effect of the number of transactions

The effect of the transaction size on the performance is shown in Figure 7. In this case, the size of the database was not changed: the product of the average transaction size and the number of transactions was kept constant. The number of transactions was 20,000 for the average transaction size of 50, and 100,000 for the average transaction size of 10. We used the fixed minimum support count of 250 transactions, regardless of the number of transactions. The performance of SETM deteriorates as the transaction size increases, because the number of candidate itemsets generated is very large. On the other hand, the total execution times of ESETM and PSETM are stable, because the number of candidate itemsets generated in the later passes is small.

Figure 7. Effect of the transaction size

FUTURE TRENDS

Relational database systems are used widely, and the size of existing relational databases grows quite rapidly. Thus, mining the relations directly, without transforming them into certain file structures, is very useful. However, due to the high operational complexity of the mining processes, parallel data mining is essential for very large databases. Currently, we are developing an algorithm for mining association rules across multiple relations using our parallel NCR Teradata database system.

CONCLUSION

In this paper, we proposed a new algorithm named Enhanced SETM (ESETM) for mining association rules from relations. ESETM is an enhanced version of the SETM algorithm (Houtsma & Swami, 1995).
Its performance is much better than that of SETM, because it generates far fewer candidate itemsets to count. ESETM and SETM are implemented on a parallel NCR database system, and we evaluated their performance in various cases. ESETM is at least three times faster than SETM in most of our test cases, and its performance is quite scalable.

ACKNOWLEDGMENTS

This research was supported in part by NCR, LexisNexis, the Ohio Board of Regents (OBR), and the AFRL/Wright Brothers Institute (WBI).

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C., USA.

Agrawal, R., & Shim, K. (1996). Developing tightly-coupled data mining applications on a relational database system. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the VLDB Conference.

Agarwal, R.C., Aggarwal, C.C., & Prasad, V.V.V. (2000). Depth first generation of long patterns. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA.

Bayardo, R.J. (1998). Efficiently mining long patterns from databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA.

Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. Proceedings of the International Conference on Data Engineering, Heidelberg, Germany.

Gouda, K., & Zaki, M.J. (2001). Efficiently mining maximal frequent itemsets. Proceedings of the 1st IEEE International Conference on Data Mining, San Jose, CA, USA.

Holt, J.D., & Chung, S.M. (2001). Multipass algorithms for mining association rules in text databases. Knowledge and Information Systems, 3(2), 168-183.

Holt, J.D., & Chung, S.M. (2002). Mining association rules using inverted hashing and pruning. Information Processing Letters, 83(4), 211-220.

Houtsma, M., & Swami, A. (1995). Set-oriented mining for association rules in relational databases. Proceedings of the International Conference on Data Engineering, Taipei, Taiwan.

NCR Teradata Division (2002). Introduction to Teradata RDBMS.

Park, J.S., Chen, M.S., & Yu, P.S. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Trans. on Knowledge and Data Engineering, 9(5), 813-825.

Sarawagi, S., Thomas, S., & Agrawal, R. (1998). Integrating association rule mining with relational database systems: Alternatives and implications. Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA.

Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. Proceedings of the VLDB Conference, Zurich, Switzerland.

Zaki, M.J. (2000). Scalable algorithms for association mining. IEEE Trans. on Knowledge and Data Engineering, 12(3), 372-390.

KEY TERMS

Association Rule: An implication of the form X => Y, meaning that database tuples satisfying the conditions of X are also likely to satisfy the conditions of Y.

Data Mining: The process of finding useful data patterns hidden in large data sets.

Parallel Database System: A database system supporting the parallel execution of individual basic database operations, such as relational algebra operations and aggregate operations.
INTRODUCTION

In the domain of knowledge discovery in databases and its computational part, called data mining, many works have addressed the problem of association rule extraction, which aims at discovering relationships between sets of items (binary attributes). An example association rule fitting the context of market basket data analysis is cereal ∧ sugar → milk (support 10%, confidence 60%). This rule states that 60% of customers who buy cereals and sugar also buy milk, and that 10% of all customers buy all three items. When an association rule's support and confidence exceed some user-defined thresholds, the rule is considered relevant to support decision making. Association rule extraction has proved useful for analyzing large databases in a wide range of domains, such as marketing decision support; diagnosis and medical research support; telecommunication process improvement; Web site management and profiling; spatial, geographical, and statistical data analysis; and so forth.
The first phase of association rule extraction is the data selection from data sources and the generation of the data mining context, which is a triplet D = (O, I, R), where O and I are finite sets of objects and items respectively, and R ⊆ O × I is a binary relation. An item is most often an attribute value or an interval of attribute values. Each couple (o, i) ∈ R denotes the fact that the object o ∈ O is related to the item i ∈ I. If an object o is in relation with all items of an itemset I (a set of items), we say that o contains I.
This phase helps to improve the extraction efficiency and enables the treatment of all kinds of data, often mixed in operational databases, with the same algorithm. Data-mining contexts are large relations that do not fit in main memory and must be stored in secondary memory. Consequently, each context scan is very time consuming.

Table 1. Example context

    OID  Items
    1    A C D
    2    B C E
    3    A B C E
    4    B E
    5    A B C E
    6    B C E

BACKGROUND

The support of an itemset I is the proportion of objects containing I in the context. An itemset is frequent if its support is greater than or equal to the minimal support threshold defined by the user. An association rule r is an implication of the form r: I1 → I2 − I1, where I1 and I2 are frequent itemsets such that I1 ⊂ I2. The confidence of r is the number of objects containing I2 divided by the number of objects containing I1. An association rule is generated if its support and confidence are at least equal to the minsupport and minconfidence thresholds. Association rules with 100% confidence are called exact association rules; the others are called approximate association rules. The natural decomposition of the association rule-mining problem is:

1. Extracting frequent itemsets and their support from the context.
2. Generating all valid association rules from frequent itemsets and their support.

The first phase is the most computationally expensive part of the process, since the number of potential frequent itemsets, 2^|I|, is exponential in the size of the set of items, and context scans are required. A trivial approach would consider all potential frequent itemsets at the same time, but this approach cannot be used for large databases, where I is large. Then, the set of potential frequent itemsets, which constitutes a lattice called the itemset lattice, must be decomposed into several subsets considered one at a time.

Level-Wise Algorithms for Extracting Frequent Itemsets

These algorithms consider all itemsets of a given size (i.e., all itemsets of a level in the itemset lattice) at a time. They are based on the properties that all supersets of an infrequent itemset are infrequent and all subsets of a frequent itemset are frequent (Agrawal et al., 1995). Using this property, the candidate k-itemsets (itemsets of size k) of the kth iteration are generated by joining two frequent (k-1)-itemsets discovered during the preceding
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Mining Association Rules Using Frequent Closed Itemsets
also can be used instead of the confidence to determine the rule precision (Silverstein, Brin & Motwani, 1998).
Several methods to prune similar rules by analyzing their structures also have been proposed. This allows, for instance, the extraction of only the rules with maximal antecedents among those with the same support and the same consequent (Bayardo & Agrawal, 1999).

MAIN THRUST

The search space in the first phase is reduced to the closed itemset lattice, which is a sublattice of the itemset lattice.

Figure 2. Closed itemset lattice

The first algorithms proposed that are based on this approach are CLOSE (Pasquier et al., 1999a) and A-CLOSE (Pasquier et al., 1999b). To improve the extraction efficiency, both perform a level-wise search for generators of frequent closed itemsets. The generators of a closed itemset C are the minimal itemsets whose closure is C; an itemset G is a generator of C if there is no other itemset G' ⊂ G whose closure is C.
During an iteration k, CLOSE considers a set of candidate k-generators. One context scan is performed to com-

Comparing Execution Times

Experiments conducted on both synthetic and operational datasets showed that (maximal) frequent itemsets-based approaches are more efficient than closed itemsets-based approaches on weakly correlated data, such as market-basket data. In such data, nearly all frequent itemsets also are frequent closed itemsets (i.e., the closed itemset lattice and the itemset lattice are nearly identical), and closure computations add execution time.
Correlated data constitute a challenge for efficiently extracting association rules, since the number of frequent itemsets is most often very important, even for
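On the example context of Table 1, the closure of an itemset is simply the intersection of all objects that contain it, which makes closed itemsets and generators easy to illustrate. The following is a small sketch of these definitions, not of the level-wise CLOSE implementation itself.

```python
# Example context of Table 1, as object -> set of items
context = {1: {"A", "C", "D"}, 2: {"B", "C", "E"}, 3: {"A", "B", "C", "E"},
           4: {"B", "E"}, 5: {"A", "B", "C", "E"}, 6: {"B", "C", "E"}}

def closure(itemset):
    # intersection of all objects containing the itemset
    covering = [items for items in context.values() if itemset <= items]
    return set.intersection(*covering) if covering else set()

print(sorted(closure({"A"})))       # ['A', 'C'] -> {A} is a generator of AC
print(sorted(closure({"B", "E"})))  # ['B', 'E'] -> BE equals its closure: closed
```

Here {A} is not closed (every object containing A also contains C), so it generates the closed itemset AC, while BE is closed because it equals its own closure.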
high minsupport values. On these data, few frequent itemsets are also frequent closed itemsets. Thus, the closure helps to reduce the search space; fewer itemsets are tested, and the number of context scans is reduced. On such data, maximal frequent itemsets-based approaches suffer from the time needed to compute frequent itemset supports, which requires accessing the dataset. With the closure, these supports are derived from the supports of frequent closed itemsets without accessing the dataset.

Extracting Bases for Association Rules

Bases are minimal sets, with respect to some criteria, from which all rules can be deduced with support and confidence. The Duquenne-Guigues and the Luxenburger bases for global and partial implications were adapted to the association rule framework in Pasquier et al. (1999c) and Zaki (2000). These bases are minimal regarding the number of rules; no smaller set allows the deduction of all rules with support and confidence. However, they do not contain the minimal non-redundant rules.
An association rule is redundant if it brings the same information, or less general information, than that conveyed by another rule with identical support and confidence. Then, an association rule r is a minimal non-redundant association rule if there is no association rule r' with the same support and confidence whose antecedent is a subset of the antecedent of r and whose consequent is a superset of the consequent of r. An inference system based on this definition was proposed in Cristofor and Simovici (2002).
The Min-Max basis for exact association rules contains all rules G → γ(G) − G between a generator G and its closure γ(G) such that γ(G) ≠ G. The Min-Max basis for approximate association rules contains all rules G → C − G between a generator itemset G and a frequent closed itemset C that is a strict superset of its closure: γ(G) ⊂ C. These bases, also called informative bases, contain, respectively, the minimal non-redundant exact and approximate association rules. Their union constitutes a basis for all association rules: They all can be deduced with their support and confidence (Bastide et al., 2000). The objective is to capture the essential knowledge in a minimal number of rules without information loss.
Algorithms for determining generators, frequent closed itemsets, and the min-max bases from frequent itemsets and their supports are presented in Pasquier et al. (2004).

Comparing Sizes of Association Rule Sets

Results of experiments conducted on both synthetic and operational datasets show that the generation of the bases can substantially reduce the number of rules.
For weakly correlated data, very few exact rules are extracted, and the reduction for approximate rules is in the order of five for both the min-max and the Luxenburger bases.
For correlated data, the Duquenne-Guigues basis reduces the exact rules to a few tens; for the min-max exact basis, the reduction factor is about some tens. For approximate association rules, both the Luxenburger and the min-max bases reduce the number of rules by a factor of some hundreds.
Even if the number of rules can be reduced from several million to a few hundred or a few thousand, visualization tools such as templates and/or generalization tools such as taxonomies are still required to explore so many rules.

FUTURE TRENDS

Most recent research on association rule extraction concerns applications to natural phenomena modeling, gene expression analysis (Creighton & Hanash, 2003), biomedical engineering (Gao, Cong et al., 2003), and geospatial, telecommunications, Web and semi-structured data analysis (Han et al., 2002). These applications most often require extending existing methods: for instance, to extract only rules with low support and high confidence in semi-structured (Cohen et al., 2001) or medical data (Ordonez et al., 2001), to extract temporal association rules in Web data (Yang & Parthasarathy, 2002), or to extract adaptive sequential association rules in long-term medical observation data (Brisson et al., 2004). Frequent closed itemsets extraction also is applied as a conceptual analysis technique to explore biological (Pfaltz & Taylor, 2002) and medical data (Cremilleux, Soulet & Rioult, 2003).
These domains are promising fields of application for association rules and frequent closed itemsets-based techniques, particularly in combination with other data mining techniques, such as clustering and classification.

CONCLUSION

Next-generation data-mining systems should answer the analysts' requirements for high-level, ready-to-use knowledge that will be easier to exploit. This implies the integration of data-mining techniques in DBMS and domain-specific applications (Ansari et al., 2001). This integration should incorporate the use of knowledge visualization and exploration techniques, knowledge consolidation by cross-analysis of the results of different techniques, and the incorporation of background knowledge, such as taxonomies or gene annotations for gene expression data, for example, in the process.
REFERENCES

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A.I. (1995). Fast discovery of association rules. Advances in knowledge discovery and data mining. AAAI/MIT Press.

Ansari, S., Kohavi, R., Mason, L., & Zheng, Z. (2001). Integrating e-commerce and data mining: Architecture and challenges. Proceedings of the ICDM Conference.

Baralis, E., & Psaila, G. (1997). Designing templates for mining association rules. Journal of Intelligent Information Systems, 9(1), 7-32.

Bastide, Y., Pasquier, N., Taouil, R., Lakhal, L., & Stumme, G. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. Proceedings of the DOOD Conference.

Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., & Lakhal, L. (2000). Mining frequent closed itemsets with counting inference. SIGKDD Explorations, 2(2), 66-75.

Bayardo, R.J. (1998). Efficiently mining long patterns from databases. Proceedings of the SIGMOD Conference.

Bayardo, R.J., & Agrawal, R. (1999). Mining the most interesting rules. Proceedings of the KDD Conference.

Bayardo, R.J., Agrawal, R., & Gunopulos, D. (2000). Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3), 217-240.

Brisson, L., Pasquier, N., Hebert, C., & Collard, M. (2004). HASAR: Mining sequential association rules for atherosclerosis risk factor analysis. Proceedings of the PKDD Discovery Challenge.

Cohen, E. et al. (2001). Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 13(1), 64-78.

Creighton, C., & Hanash, S. (2003). Mining gene expression databases for association rules. Bioinformatics, 19(1), 79-86.

Cremilleux, B., Soulet, A., & Rioult, F. (2003). Mining the strongest emerging patterns characterizing patients affected by diseases due to atherosclerosis. Proceedings of the PKDD Discovery Challenge.

Cristofor, L., & Simovici, D.A. (2002). Generating an informative cover for association rules. Proceedings of the ICDM Conference.

El-Hajj, M., & Zaïane, O.R. (2004). COFI approach for mining frequent itemsets revisited. Proceedings of the SIGMOD/DMKD Workshop.

Gao Cong, F.P., Tung, A., Yang, J., & Zaki, M.J. (2003). CARPENTER: Finding closed patterns in long biological datasets. Proceedings of the KDD Conference.

Han, J., & Fu, Y. (1999). Mining multiple-level association rules in large databases. IEEE Transactions on Knowledge and Data Engineering, 11(5), 798-804.

Han, J., Pei, J., Yin, Y., & Mao, R. (2004). Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1), 53-87.

Han, J., Russ, B., Kumar, V., Mannila, H., & Pregibon, D. (2002). Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54-58.

Lin, D., & Kedem, Z.M. (1998). PINCER-SEARCH: A new algorithm for discovering the maximum frequent set. Proceedings of the EDBT Conference.

Ordonez, C. et al. (2001). Mining constrained association rules to predict heart disease. Proceedings of the ICDM Conference.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1998). Pruning closed itemset lattices for association rules. Proceedings of the BDA Conference.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999a). Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1), 25-46.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999b). Discovering frequent closed itemsets for association rules. Proceedings of the ICDT Conference.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999c). Closed set based discovery of small covers for association rules. Proceedings of the BDA Conference.

Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., & Lakhal, L. (2004). Generating a condensed representation for association rules. Journal of Intelligent Information Systems.

Pfaltz, J., & Taylor, C. (2002, July). Closed set mining of biological data. Proceedings of the KDD/BioKDD Conference.

Silverstein, C., Brin, S., & Motwani, R. (1998). Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1), 39-68.

Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., & Lakhal, L. (2002). Computing iceberg concept lattices with TITANIC. Data and Knowledge Engineering, 42(2), 189-222.
Wang, J., & Han, J. (2004). BIDE: Efficient mining of KEY TERMS
frequent closed sequences. Proceedings of the ICDE M
Conference. Association Rules: An implication rule between two
Wang, J., Han, J., & Pei, J. (2003). CLOSET+: Searching for itemsets with statistical measures of range (support) and
the best strategies for mining frequent closed itemsets. precision (confidence).
Proceedings of the KDD Conference. Basis for Association Rules: A set of association
Yang, H., & Parthasarathy, S. (2002). On the use of con- rules that is minimal with respect to some criteria and from
strained associations for Web log mining. Proceedings of which all association rules can be deduced with support
the KDD/WebKDD Conference. and confidence.
Zaki, M.J. (2000). Generating non-redundant association Closed Itemset: An itemset that is a maximal set of
rules. Proceedings of the KDD Conference. items common to a set of objects. An itemset is closed if
it is equal to the intersection of all objects containing it.
Zaki, M.J., & Hsiao, C.-J. (2002). CHARM: An efficient
algorithm for closed itemset mining. Proceedings of the Frequent Itemset: An itemset contained in a number
SIAM International Conference on Data Mining. of objects at least equal to some user-defined threshold.
Zaki, M.J., & Ogihara, M. (1998). Theoretical foundations Itemset: A set of binary attributes, each correspond-
of association rules. Proceedings of the SIGMOD/DMKD ing to an attribute value or an interval of attribute values.
Workshop.
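The Closed Itemset and Frequent Itemset definitions above can be made concrete with a small sketch (the transaction data and helper names below are invented for illustration): an itemset is reported as closed exactly when it equals the intersection of all objects that contain it.

```python
from itertools import chain, combinations

# Toy transaction database: each object is a set of items (invented data)
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
]

def support(itemset):
    """Number of objects containing the itemset (Frequent Itemset entry)."""
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset):
    """Closed Itemset entry: an itemset is closed iff it equals the
    intersection of all objects that contain it."""
    covering = [t for t in transactions if itemset <= t]
    return bool(covering) and itemset == set.intersection(*covering)

items = sorted(set(chain.from_iterable(transactions)))
# Frequent itemsets for a user-defined threshold of 2 objects
frequent = [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r) if support(set(c)) >= 2]
closed = [s for s in frequent if is_closed(s)]
# Here: frequent = {a},{b},{c},{a,b},{a,c}; closed = {a},{a,b},{a,c}
```

Note how {b} is frequent but not closed: every object containing b also contains a, so its closure is {a, b}; this is why mining only the closed itemsets loses no information.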
Mining Chat Discussions

Daniel Lichtnow
Catholic University of Pelotas, Brazil
Thyago Borges
Catholic University of Pelotas, Brazil
Tiago Primo
Catholic University of Pelotas, Brazil
Gabriel Simões
Catholic University of Pelotas, Brazil
Gustavo Piltcher
Catholic University of Pelotas, Brazil
Ramiro Saldaña
Catholic University of Pelotas, Brazil

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
BACKGROUND

Some works have investigated the analysis of online discussions. Brutlag and Meek (2000) studied the identification of themes in e-mails. The work compares identification based only on the subject lines of the e-mails against analysis of the message bodies. One conclusion is that e-mail headers perform as well as message bodies, with the additional advantage of reducing the number of features to be analyzed.

Busemann, Schmeier and Arens (2000) investigated the special case of messages registered in call centers. The work showed that it is possible to identify themes in this kind of message despite the informality of the language used. This informality causes mistakes due to jargon, misspellings and grammatical inaccuracy.

The work of Durbin, Richter, and Warner (2003) has shown that it is possible to identify affective opinions about products and services in e-mails sent by customers, in order to alert the responsible people or to evaluate the organization and customer satisfaction. Furthermore, the work identifies the intensity of the rating, allowing the separation of moderate and intensive opinions.

Tong (2001) investigated the analysis of online discussions about movies, in which messages represent comments about movies. This work showed that it is feasible to find positive and negative opinions by analyzing key or cue words. Furthermore, the work also extracts information about the movies, such as directors and actors, and then examines opinions about these particular characteristics.

The only work found in the scientific literature that analyzes chat messages is the one from Khan, Fisher, Shuler, Wu, and Pottenger (2002). They apply mining techniques over chat messages in order to find social interactions among people. The goal is to find who is related to whom inside a specific area, by analyzing the exchange of messages in a chat and the subject of the discussion.

MAIN THRUST

In the following, the chapter explains how messages can be mined, how recommendations can be made, and how the whole discussion (an entire chat session) can be analyzed.

Identifying Themes in Chat Messages

To provide people with useful information during a collaboration session, the system has to identify what is being discussed. Textual messages sent by the users in the chat can be analyzed for this purpose. Texts can lead to the identification of the subject discussed because the words and the grammar present in the texts represent knowledge from people, expressed in written form (Sowa, 2000).

An ontology or thesaurus can be used to help identify cue words for each subject. The ontology or thesaurus holds the concepts of a domain or knowledge area, including relations between concepts and the terms used in written language to express these concepts (Gilchrist, 2003). The ontology can be created by machine learning methods (supervised learning), where human experts select training cases for each subject (e.g., positive and negative example texts) and an intelligent software system identifies the keywords that define each subject. The TFIDF method of Salton and McGill (1983) is the most widely used in this kind of task.

If the terms that compose the messages are treated as a bag of words (with no difference in importance), probabilistic techniques can be used to identify the subject. On the other hand, natural language processing techniques can identify syntactic elements and relations, thus supporting more precise subject identification.

The identification of themes should consider the context of the messages to determine whether the concept identified is really present in the discussion. A group of messages is better for inferring the subject than a single message, since it avoids misunderstandings due to word ambiguity and the use of synonyms.

Making Recommendations in a Chat Discussion

A recommender system is software whose main goal is to aid the social collaborative process of giving or receiving indications (Resnick & Varian, 1997). Recommender systems are broadly used in electronic commerce for suggesting products or providing information about products and services, helping people decide in the shopping process (Lawrence et al., 2001; Schafer et al., 2001). The gain offered is that people do not need to request a recommendation or to perform a query over an information base; the system decides what and when to suggest. The recommendation is usually based on user profiles and the reuse of solutions.

When a subject is identified in a message, the recommender searches for items classified under this subject. Items can come from different databases. For example, a Digital Library may provide electronic documents, links to Web pages and bibliographic references. A profile database may contain information about people, including the interest areas of each person, as well as an associated degree indicating the user's knowledge level on the subject, that is, his/her competence in the area (his/her expertise). This can be used to
indicate the most active user in the area or who is the authority on the subject.

A database of past discussions records everything that occurs in the chat during every discussion session. Discussions may be stored by session, identified by date and themes discussed, and can include who participated in the session, all the messages exchanged (with a label indicating who sent each one), the concept identified in each message, the recommendations made during the session for each user, and the documents downloaded or read during the session. Past discussions may be recommended during a chat session, reminding the participants that similar discussions have already happened. This database also allows users to review the whole discussion after the session. The great benefit is that users do not re-discuss the same question.

Mining a Chat Session

Analyzing the themes discussed in a chat session can provide an important overview of the discussion and also of the subject. Statistical tools applied over the messages sent and the subjects identified in each message can help users understand which themes were discussed most. By counting the messages associated with each subject, it is possible to infer the central point of the discussion and the peripheral themes.

The list of subjects identified during the chat session composes an interesting ordering, allowing users to analyze the path followed by the participants during the discussion. For example, it is possible to observe what the central point of the discussion was, whether the discussion deviated from the main subject, and whether the subjects present at the beginning of the discussion were also present at the end. The coverage of the discussion may be identified by the number of different themes discussed. Furthermore, this analysis allows identifying the depth of the discussion, that is, whether more specific themes were discussed or whether the discussion occurred superficially at a higher conceptual level.

Analyzing the messages sent by every participant allows determining the degree of participation of each person in the discussion: who participated more and who participated less. Furthermore, it is possible to observe the areas of interest of each person and, in some way, to determine the expertise of the group and of the participants (the areas in which the group is most competent).

Association techniques can be used to identify correlations between themes or between themes and persons. For example, it is possible to find that some theme is always present when another theme is present, or that every discussion in which some person participated had a certain theme as the principal one.

FUTURE TRENDS

Recommender systems are still an emerging area, and there are some doubts and open issues. For example, it is not clear whether it is good or bad to recommend items already suggested in past discussions (re-recommending, as if reminding the person). Besides that, it is important to analyze the level of the participants in order to recommend only basic or only advanced items.

Collaborative filtering techniques can be used to recommend items already seen by other users (Resnick et al., 1994; Terveen & Hill, 2001). Grouping people with similar characteristics allows the crossing of recommended items, for example, offering documents read by one person to others.

In the same way, software systems can capture relevance feedback from users to narrow the list of recommendations. Users read some items on the list and rate them, so that the system can use this information to eliminate items from the list or to reorder the items in a new ranking.

The context of the messages needs to be studied further. To infer the subject being discussed, the system can analyze a group of messages, but it is necessary to determine how many (a fixed number, or all messages sent in the past N minutes?).

A spelling corrector is necessary to clean the messages posted to the chat. Many linguistic mistakes are expected, since people use chats in a hurry, with little attention to the language, without revision and in an informal way. Furthermore, the text mining tools must handle special signs such as novel abbreviations, emoticons and slang expressions. Special words may be added to the domain ontology in order to accommodate these differences in the language.

CONCLUSION

An example of the kind of system discussed in this chapter is available at http://gpsi.ucpel.tche.br/sisrec. Currently, the system uses a domain ontology for computer science, but others can be used. Similarly, the current digital library only has items related to Computer Science.

The recommendation system facilitates organizational learning because people receive suggestions of information sources during online discussions. The main advantage of the system is to free the user from the burden of searching for information sources during the online discussion. Users do not have to choose attributes or requirements from a menu of options in order to retrieve items from a database; the system decides when and what information to recommend to the user. This proactive
approach is useful for non-experienced users, who receive hints about what to read on a specific subject. Users' information needs are discovered naturally during the conversation.

Furthermore, when the system indicates people who are authorities on each subject, naïve users can meet these authorities to gain more knowledge.

Another advantage of the system is that part of the knowledge shared in the discussion can be made explicit through the record of the discussion for future retrieval. Besides that, the system allows the posterior analysis of each discussion, presenting the subjects discussed, the messages exchanged, the items recommended and the order in which the subjects were discussed.

An important feature is the statistical analysis of the discussion, allowing an understanding of the central point, the peripheral themes, the order of the discussion, and its coverage and depth.

The benefit of mining chat sessions is of special interest for Knowledge Management efforts. Organizations can store tacit knowledge formatted as discussions. The discussions can be retrieved, so that knowledge can be reused. In the same way, the contents of a Digital Library (or Organizational Memory) can be better used through recommendations. People do not have to search for contents nor to remember items in order to suggest them to others. Recommendations play this role in a proactive way, examining what people are discussing and users' profiles, and selecting interesting new contents.

In particular, such systems (that mine chat sessions) can be used in e-learning environments, supporting the construction of knowledge by individuals or groups. Recommendations help the learning process by suggesting complementary contents (documents and sites stored in the Digital Library). Recommendations also include authorities on the topics being discussed, that is, people with a high degree of knowledge.

ACKNOWLEDGMENTS

This research group is partially supported by CNPq, an entity of the Brazilian government for scientific and technological development.

REFERENCES

Brutlag, J.D., & Meek, C. (2000). Challenges of the email domain for text classification. In Proceedings of the 7th International Conference on Machine Learning (ICML 2000) (pp. 103-110), Stanford University, Stanford, CA, USA.

Busemann, S., Schmeier, S., & Arens, R.G. (2000). Message classification in the call center. In Proceedings of the Applied Natural Language Processing Conference ANLP2000 (pp. 159-165), Seattle, WA.

Durbin, S.D., Richter, J.N., & Warner, D. (2003). A system for affective rating of texts. In Proceedings of the 3rd Workshop on Operational Text Classification, 9th ACM International Conference on Knowledge Discovery and Data Mining (KDD-2003), Washington, DC.

Gilchrist, A. (2003). Thesauri, taxonomies and ontologies: An etymological note. Journal of Documentation, 59(1), 7-18.

Khan, F.M., Fisher, T.A., Shuler, L., Wu, T., & Pottenger, W.M. (2002). Mining chat-room conversations for social and semantic interactions. Technical Report LU-CSE-02-011, Lehigh University, Bethlehem, Pennsylvania, USA.

Lawrence, R.D. et al. (2001). Personalization of supermarket product recommendations. Journal of Data Mining and Knowledge Discovery, 5(1/2), 11-32.

Nonaka, I., & Takeuchi, T. (1995). The knowledge-creating company: How Japanese companies create the dynamics of innovation. Cambridge: Oxford University Press.

Resnick, P. et al. (1994). GroupLens: An open architecture for collaborative filtering of Netnews. In Proceedings of the Conference on Computer Supported Cooperative Work (pp. 175-186).

Resnick, P., & Varian, H. (1997). Recommender systems. Communications of the ACM, 40(3), 56-58.

Salton, G., & McGill, M.J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.

Schafer, J.B. et al. (2001). E-commerce recommendation applications. Journal of Data Mining and Knowledge Discovery, 5(1/2), 115-153.

Senge, P.M. (2001). The fifth discipline: The art and practice of the learning organization (9th ed.). São Paulo: Best Seller (in Portuguese).

Sowa, J.F. (2000). Knowledge representation: Logical, philosophical, and computational foundations. Pacific Grove, CA: Brooks/Cole Publishing Co.

Terveen, L., & Hill, W. (2001). Human-computer collaboration in recommender systems. In J. Carroll (Ed.), Human computer interaction in the new millennium. Boston: Addison-Wesley.

Tong, R. (2001). Detecting and tracking opinions in online discussions. In Proceedings of the Workshop on Operational Text Classification, SIGIR, New Orleans, Louisiana, USA.
Mining Data with Group Theoretical Means
The resulting set of rules may serve as a basis for maximum entropy inference. Therefore, the method described in this article addresses minimality aspects, as in Padmanabhan and Tuzhilin (2000), and makes use of inference mechanisms, as in Cristofor and Simovici (2002). Different from most approaches, however, it exploits the inferential power of the maximum entropy methods in full consequence and in a structural, non-heuristic way.

Modelling Conditional Knowledge by Maximum Entropy (ME)

Suppose a set R* = {(B1|A1)[x1], ..., (Bn|An)[xn]} of probabilistic conditionals is given. For instance, R* may describe the knowledge available to a physician when he has to make a diagnosis. Or R* may express common sense knowledge, such as "Students are young with a probability of (about) 80%" and "Singles (i.e., unmarried people) are young with a probability of (about) 70%", the latter knowledge being formally expressed by R* = {(young|student)[0.8], (young|single)[0.7]}.

Usually, these rule bases represent incomplete knowledge, in that many probability distributions are apt to represent them. So learning, or inductively representing the rules, means taking them as a set of conditional constraints and selecting a unique probability distribution as the best model that can be used for queries and further inferences. Paris (1994) investigates several inductive representation techniques in a probabilistic framework and proves that the principle of maximum entropy (ME-principle) yields the only method to represent incomplete knowledge in an unbiased way, satisfying a set of postulates describing sound common sense reasoning. The entropy H(P) of a probability distribution P is defined as

H(P) = - Σ_w P(w) log P(w),

where the sum is taken over all possible worlds w, and measures the amount of indeterminateness inherent to P. Applying the principle of maximum entropy, then, means selecting the unique distribution P* = ME(R*) that maximizes H(P) among all distributions P that satisfy the rules in R*. In this way, the ME-method ensures that no further information is added, so the knowledge R* is represented most faithfully.

Indeed, the ME-principle provides a most convenient and well-founded method to represent incomplete probabilistic knowledge (efficient implementations of ME-systems are described in Roedder & Kern-Isberner, 2003). In an ME-environment, the expert has to list only whatever relevant conditional probabilities he or she is aware of. Furthermore, ME-modelling preserves the generic nature of conditionals by minimizing the amount of information being added, as shown in Kern-Isberner (2001).

Nevertheless, modelling ME-rule bases has to be done carefully so as to ensure that all relevant dependencies are taken into account. This task can be difficult and troublesome. Usually, the modelling rules are based on statistical data. So, a method to compute rule sets appropriate for ME-modelling from statistical data is urgently needed.

Structures of Knowledge

The most typical approach to discovering interesting rules from data is to look for rules with a significantly high (conditional) probability and a concise antecedent (Bayardo & Agrawal, 1999; Agarwal, Aggarwal, & Prasad, 2000; Fayyad & Uthurusamy, 2002; Coenen, Goulbourne, & Leng, 2001). Basing relevance on frequencies, however, is sometimes unsatisfactory and inadequate, particularly in complex domains such as medicine. Further criteria to measure the interestingness of rules or to exclude redundant rules have also been brought forth (Jaroszewicz & Simovici, 2001; Bastide, Pasquier, Taouil, Stumme, & Lakhal, 2000; Zaki, 2000). Some of these algorithms also make use of optimization criteria based on entropy (Jaroszewicz & Simovici, 2002).

Mostly, the rules are considered as isolated pieces of knowledge; no interaction between rules can be taken into account. In order to obtain more structured information, one often searches for causal relationships by investigating conditional independencies and thus noninteractivity between sets of variables (Spirtes et al., 1993).

Although causality is undoubtedly most important for human understanding, the concept seems to be too rigid to represent human knowledge in an exhaustive way. For instance, a person suffering from a flu is certainly sick (P(sick | flu) = 1), and he or she will often complain about headaches (P(headache | flu) = 0.9). Then you have P(headache | flu) = P(headache | flu & sick), but you would surely expect that P(headache | not flu) is different from P(headache | not flu & sick)! Although the first equality suggests a conditional independence between sick and headache, due to the causal dependency between headache and flu, the second inequality shows this to be (of course) false. Furthermore, a physician might also state some conditional probability involving sickness and headache, so you obtain a complex network of rules. Each of these rules will be considered relevant by the expert, but none will be found when searching for conditional independencies! So what, exactly, are the structures of knowledge by which conditional dependencies (not independencies! See also Simovici, Cristofor, & Cristofor, 2000) manifest themselves in data?

To answer this question, the theory of conditional structures has been presented in Kern-Isberner (2000). Conditional structures are an algebraic means to make the effects of conditionals on possible worlds (i.e., possible combinations or situations) transparent, in that they reflect whether the corresponding world verifies the conditional or falsifies it, or whether the conditional cannot be applied to the world because the if-condition is not satisfied. Consider, for instance, the conditional "If you step in a puddle, then your feet might get wet." In a particular situation, the conditional is applicable (you actually step into a puddle) or not (you simply walk around it), and it can be found verified (you step in a puddle and, indeed, your feet get wet) or falsified (you step in a puddle, but your feet remain dry because you are wearing rain boots).

This intuitive idea of considering a conditional as a three-valued event is generalized in Kern-Isberner (2000) to handle the simultaneous impact of a set of conditionals by using algebraic symbols for positive and negative impact, respectively. Then, for each world, a word of these symbols can be computed, which shows immediately how the conditionals interact on this world. The proper mathematical structures for building words are (semi)groups, and indeed, group theory provides the basis for connecting numerical to structural information in an elegant way. In short, a probability (or frequency) distribution is called (conditionally) indifferent with respect to a set of conditionals R* iff its numerical information matches the structural information provided by conditional structures. In particular, each ME-distribution turns out to be indifferent with respect to a generating set of conditionals.

Data Mining and Group Theory: A Strange Connection?

The concept of conditional structures, however, is not only an algebraic means to judge well-behavedness with respect to conditional information. The link between numerical and structural information, which is provided by the concept of conditional indifference, can also be used in the other direction, that is, to derive structural information about the underlying conditional relationships from numerical information. More precisely, finding a set of rules with the ability to represent a given probability distribution P via ME-methods can be done by elaborating numerical relationships in P and interpreting them as manifestations of underlying conditional dependencies. The procedure to discover appropriate sets of rules is sketched in the following steps:

• Start with a set B of simple rules, the length of which is considered to be large enough to capture all relevant dependencies.
• Search for numerical relationships in P by investigating which products of probabilities match.
• Compute the corresponding conditional structures with respect to B, yielding equations of group elements.
• Solve these equations by forming appropriate factor groups.

Building these factor groups corresponds to eliminating and joining the basic conditionals in B to make their information more concise, in accordance with the numerical structure of P. Actually, the antecedents of the conditionals in B are shortened so as to comply with the numerical relationships in P.

So the basic idea of this algorithm is to start with long rules and to shorten them in accordance with the probabilistic information provided by P without losing information.

Group theory actually provides an elegant framework, on the one hand, to disentangle highly complex conditional interactions in a systematic way and, on the other hand, to make operations on the conditionals computable, which is necessary to make information more concise.

How to Handle Sparse Knowledge

The frequency distributions calculated from data are mostly not positive; quite to the contrary, they tend to be sparse, full of zeros, with only scattered clusters of nonzero probabilities. This overload of zeros is also a problem with respect to knowledge representation, because a zero in such a frequency distribution often merely means that such a combination has not been recorded. The strict probabilistic interpretation of zero probabilities, however, is that such a combination does not exist, which does not seem to be adequate.

The method sketched in the preceding section is also able to deal with this problem in a particularly adequate way: The zero values in frequency distributions are taken to be unknown but equal probabilities, and this fact can be exploited by the algorithm. So they actually help to start with a tractable set B of rules right from the beginning (see also Kern-Isberner & Fisseler, 2004).

In summary, zeros occurring in the frequency distribution computed from data are considered as missing information, and in my algorithm, they are treated as non-knowledge without structure.
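As a toy illustration of the second step above (searching for matching products of probabilities), the following sketch, with invented numbers and helper names, checks the cross-product relationship whose validity signals that two binary attributes do not interact, which is what licenses shortening antecedents that mention both:

```python
from itertools import product

# Toy joint distribution P over two binary attributes A and B,
# built here as a product of marginals so that A and B do not
# interact (the numbers are invented for illustration)
pa, pb = 0.3, 0.6
P = {(a, b): (pa if a else 1 - pa) * (pb if b else 1 - pb)
     for a, b in product((0, 1), repeat=2)}

def products_match(P, eps=1e-9):
    """Check whether products of probabilities match.  For two binary
    attributes, the relevant relationship is the cross-product equation
    P(1,1)·P(0,0) = P(1,0)·P(0,1); if it holds, no conditional linking
    A and B is needed, so antecedents mentioning both can be shortened."""
    return abs(P[1, 1] * P[0, 0] - P[1, 0] * P[0, 1]) < eps

shorten = products_match(P)   # True for this product distribution
```

If the products do not match, the numerical structure of P forces a conditional linking A and B to be kept in the rule base.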
FUTURE TRENDS

Although, by and large, the domain of knowledge discovery and data mining is dominated by statistical techniques and the problem of how to manage vast amounts of data, the increasing need for and popularity of human-machine interactions will make it necessary to search for more structural knowledge in data that can be used to support (humanlike) reasoning processes. The method described in this article offers an approach to realizing this aim. The conditional relationships that my algorithm reveals can be considered as a kind of cognitive links of an ideal agent, and the ME-technology takes on the task of inductive reasoning to make use of this knowledge. Combined with clustering techniques in large databases, for example, it may turn out to be a useful method to discover relationships that go far beyond the results provided by other, more standard data-mining techniques.

CONCLUSION

In this article, I have developed a new method for discovering conditional dependencies from data. This method is based on information-theoretical concepts and group-theoretical techniques, considering knowledge discovery as an operation inverse to inductive knowledge representation. By investigating relationships between the numerical values of a probability distribution P, the effects of conditionals are analyzed and isolated, and conditionals are joined suitably so as to fit the knowledge structures inherent to P.

REFERENCES

Agarwal, R.C., Aggarwal, C.C., & Prasad, V.V.V. (2000). Depth first generation of long patterns. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 108-118).

Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., & Lakhal, L. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. Proceedings of the First International Conference on Computational Logic (pp. 972-986).

Bayardo, R.J., & Agrawal, R. (1999). Mining the most interesting rules. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Coenen, F., Goulbourne, G., & Leng, P.H. (2001). Computing association rules using partial totals. Proceedings of the Fifth European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 54-66).

Cristofor, L., & Simovici, D. (2002). Generating an informative cover for association rules. Proceedings of the IEEE International Conference on Data Mining (pp. 597-600).

Fayyad, U., & Uthurusamy, R. (2002). Evolving data mining into solutions for insights. Communications of the ACM, 45(8), 28-61.

Jaroszewicz, S., & Simovici, D.A. (2001). A general measure of rule interestingness. Proceedings of the Fifth European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 253-265).

Jaroszewicz, S., & Simovici, D.A. (2002). Pruning redundant association rules using maximum entropy principle. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining.

Kern-Isberner, G. (2000). Solving the inverse representation problem. Proceedings of the 14th European Conference on Artificial Intelligence (pp. 581-585).

Kern-Isberner, G. (2001). Conditionals in nonmonotonic reasoning and belief revision. Lecture Notes in Artificial Intelligence.

Kern-Isberner, G., & Fisseler, J. (2004). Knowledge discovery by reversing inductive knowledge representation. Proceedings of the Ninth International Conference on the Principles of Knowledge Representation and Reasoning.

Padmanabhan, B., & Tuzhilin, A. (2000). Small is beautiful: Discovering the minimal set of unexpected patterns. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 54-63).

Paris, J.B. (1994). The uncertain reasoner's companion: A mathematical perspective. Cambridge University Press.

Roedder, W., & Kern-Isberner, G. (2003). From information to probability: An axiomatic approach. International Journal of Intelligent Systems, 18(4), 383-403.

Simovici, D.A., Cristofor, D., & Cristofor, L. (2000). Mining for purity dependencies in databases (Tech. Rep. No. 00-2). Boston: University of Massachusetts.

Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction and search. Lecture Notes in Statistics, 81.
KEY TERMS

Conditional: The formal algebraic term for a rule that need not be strict, but can also be based on plausibility, probability, and so forth.

Conditional Independence: A generalization of plain statistical independence that allows you to take a context into account. Conditional independence is often associated with causal effects.

Conditional Structure: An algebraic expression that makes the effects of conditionals on possible worlds transparent and computable.

Entropy: Measures the indeterminateness inherent to a probability distribution and is dual to information.

Possible World: Corresponds to the statistical notion of an elementary event. Probabilities over possible worlds, however, have a more epistemic, subjective meaning, in that they are assumed to reflect an agent's knowledge.

Principle of Maximum Entropy: A method to complete incomplete probabilistic knowledge by minimizing the amount of information added.

Probabilistic Conditional: A conditional that is assigned a probability. To match the notation of conditional probabilities, a probabilistic conditional is written as (B|A)[x], with the meaning "If A holds, then B holds with probability x."
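The Entropy and Principle of Maximum Entropy entries can be illustrated numerically. The sketch below is our own illustration, not from the article: it shows that the uniform distribution attains the maximal entropy log2(n), which is why completing probabilistic knowledge under maximum entropy adds as little information as possible.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i * log2(p_i) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Over n outcomes, the uniform distribution has maximal entropy log2(n);
# any more committed (skewed) distribution carries more information.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))  # 2.0 bits
print(entropy(skewed))   # less than 2.0 bits
```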
Tobias Scheffer
Humboldt-Universität zu Berlin, Germany
Mining E-Mail Data
contain other spam recipients' addresses, and most senders' addresses are used only once.

Additionally, the semantic e-mail approach (McDowell, Etzioni, Halevy, & Levy, 2004) aims at supporting communication by allowing automatic e-mail processing and facilitating e-mail mining; it is the equivalent of the semantic web for e-mail. The goal is to make e-mails human- and machine-understandable with a standardized set of e-mail processes. Each e-mail has to follow a standardized process definition that includes specific process-relevant information. An example of a semantic e-mail process is meeting coordination. Here, the individual process tasks (corresponding to single e-mails) are issuing invitations and collecting responses. In order to work, semantic e-mail would require a global agreement on standardized semantic processes, special e-mail clients, and training for all users. Additional mining tasks for the support of communication are automatic e-mail answering and sentence completion. They are described in the section Main Thrust.

Discovering Hidden Properties of Communication Networks

E-mail communication patterns reveal much information about hidden social relationships within organizations. Conclusions about informal communities and informal leadership can be drawn from e-mail graphs. Differences between informal and formal structures in business organizations can provide clues for the improvement of formal structures, which may lead to enhanced productivity. In the case of terrorist networks, the identification of communities and potential leaders is obviously helpful as well. Additional potential applications lie in marketing, where companies, especially communication providers, can target communities as a whole.

In social science, it is common practice for studies on electronic communication within organizations to derive the network structure by means of personal interviews or surveys (Garton, Haythornthwaite, & Wellman, 1997; Hinds & Kiesler, 1995). For large organizations, this is not feasible. Building communication graphs from e-mail logs is a very simple and accurate alternative, provided that the data is available. Tyler, Wilkinson, and Huberman (2004) derive a network structure from e-mail logs and apply a divisive clustering algorithm that decomposes the graph into communities. Tyler, Wilkinson, and Huberman verify the resulting communities by interviewing the communication participants; they find that the derived communities correspond to informal communities. Tyler et al. also apply a force-directed spring algorithm (Fruchterman & Rheingold, 1991) to identify leadership hierarchies. They find that with increasing distance of vertices from the spring (center) there is a tendency of decreasing real hierarchy depth.

E-mail graphs can also be used for controlling virus attacks. Ebel, Mielsch, and Bornholdt (2002) show that vertex degrees of e-mail graphs are governed by power laws. By equipping the small number of highly connected nodes with anti-virus software, the spreading of viruses can be prevented easily.

MAIN THRUST

In the last section, we categorized e-mail mining tasks regarding their objective and gave a short explanation of the single tasks. We will now focus on the ones that we consider to be most interesting and potentially most beneficial for users, and describe them in greater detail. These tasks aim at supporting the message creation process. Many e-mail management systems allow the definition of message templates that simplify message creation for recurring topics. This is a first step towards supporting the message creation process, but past e-mails that are available for mining are disregarded. We describe two approaches for supporting the message creation process by mining historic data: mining question-answer pairs and mining sentences.

Mining Question-Answer Pairs

We consider the problem of learning to answer incoming e-mails from records of past communication. We focus on environments in which large amounts of similar answers to frequently asked questions are sent, such as call centers or customer support departments. In these environments, it is possible to manually identify equivalence classes of answers in the records of outbound communication. Each class then corresponds to a set of semantically equivalent answers sent in the past; it depends strongly on the application context which fraction of the outbound communication falls into such classes. Mapping inbound messages to one of the equivalence classes of answers is now a multi-class text classification problem that can be solved with text classifiers.

This procedure requires a user to manually group previously sent answers into equivalence classes, which can then serve as class labels for training a classifier. This substantial manual labeling effort reduces the benefit of the approach. Even though it can be reduced by employing semi-supervised learning (Nigam, McCallum, Thrun, & Mitchell, 2000; Scheffer, 2004), it would still be much preferable to learn from only the available data: stored inbound and outbound messages. Bickel and Scheffer (2004) discuss an algorithm that learns to answer ques-
tions from only the available data and does not require additional manual labeling. The key idea is to replace the manual assignment of outbound messages to equivalence classes by a clustering step.

The algorithms for training (learning from message pairs) and for answering a new question are shown in Table 1. In the training phase, a clustering algorithm identifies groups of similar outbound messages. Each cluster then serves as a class label; the corresponding questions that have been answered by a member of the cluster are used as training examples for a multi-class text classifier. The medoid of each cluster (the outbound message closest to the center) is used as an answer template. The classifier maps a newly incoming question to one of the clusters; this cluster's medoid is then proposed as the answer to the question. Depending on the user interface, high-confidence messages might be answered automatically, or an answer is proposed which the user may then accept, modify, or reject (Scheffer, 2004).

The approach can be extended in many ways. Multiple topics in a question can be identified to mix different corresponding answer templates and generate a multi-topic answer. Question-specific information can be extracted in an additional information extraction step and automatically inserted into answer templates. In this extraction step, customer identifications can also be extracted and used for a database lookup that provides customer- and order-specific information for generating more customized answers.

Bickel and Scheffer (2004) analyze the relationship of answer classes regarding the separability of the corresponding questions, using e-mails sent by the service department of an online shop. By analyzing this relationship, one can draw conclusions about the amount of additional information that is needed for answering specific types of questions. This information can be visualized in an inseparability graph, where each class of equivalent answers is represented by a vertex, and an edge is drawn when a classifier that discriminates between these classes achieves only a low AUC performance (the AUC performance is the probability that, when a positive and a negative example are drawn at random, a discriminator assigns a higher value to the positive than to the negative one). Typical examples of inseparable answers are "your order has been shipped this morning" and "your order will be shipped tomorrow." Intuitively, it is not possible to predict which of these answers a service employee will send based on only the question "when will I receive my shipment?"

Mining Sentences

The message creation process can also be supported on a sentence level. Given an incomplete sentence, the task of sentence completion is to propose parts of, or the total rest of, the current sentence, based on an application-specific document collection. A sentence completion user interface can, for instance, display a proposed completion in a micro window and insert the proposed text when the user presses the tab key.

The sentence completion problem poses new challenges for data mining and information retrieval, including the problem of finding sentences whose initial fragment is similar to a given fragment in a very large text corpus. To this end, Grabski and Scheffer (2004) provide a retrieval algorithm that uses a special inverted indexing structure to find the sentence whose initial fragment is most similar to a given fragment, where similarity is
Table 1. Algorithms for learning from message pairs and answering new questions
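The train-and-answer procedure can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: a Jaccard bag-of-words similarity stands in for a real clustering algorithm and text classifier, and all data and function names are invented.

```python
def bag(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_answers(answers, threshold=0.5):
    """Greedy clustering: put each answer into the first cluster whose
    representative is similar enough; otherwise start a new cluster."""
    clusters = []
    for ans in answers:
        for c in clusters:
            if jaccard(bag(ans), bag(c[0])) >= threshold:
                c.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

def medoid(cluster):
    """Member with the greatest total similarity to the rest of the cluster."""
    return max(cluster, key=lambda a: sum(jaccard(bag(a), bag(b)) for b in cluster))

def train(qa_pairs, threshold=0.5):
    """Cluster past answers; keep each cluster's medoid as the answer
    template, paired with the questions it answered."""
    clusters = cluster_answers([a for _, a in qa_pairs], threshold)
    return [(medoid(c), [q for q, a in qa_pairs if a in c]) for c in clusters]

def answer(model, question):
    """Map a new question to the cluster with the most similar past
    question and propose that cluster's medoid as the answer template."""
    best = max(model, key=lambda m: max(jaccard(bag(question), bag(q)) for q in m[1]))
    return best[0]

qa = [
    ("when will my order arrive", "your order has been shipped"),
    ("how long until my shipment arrives", "your order has been shipped today"),
    ("how do i reset my password", "use the password reset link on the login page"),
]
print(answer(train(qa), "when does my shipment arrive"))  # -> your order has been shipped
```

In a realistic setting, the proposed template would be shown to the user for acceptance, modification, or rejection, as described above.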
defined in terms of the greatest cosine similarity of the TFIDF vectors. In addition, they study an approach that compresses the data further by identifying clusters of the most frequently used similar sets of sentences. In order to evaluate the accuracy of sentence completion algorithms, Grabski and Scheffer (2004) measure how frequently the algorithm, when given a sentence fragment drawn from a corpus, provides a prediction with confidence above a given threshold, and how frequently this prediction is semantically equivalent to the actual sentence in the corpus. They find that for the sentence mining problem, higher precision and recall values can be obtained than for the problem of mining question-answer pairs; depending on the threshold and the fragment length, precision values of between 80% and 100% and recall values of about 40% can be observed.

FUTURE TRENDS

Spam filtering and e-mail filing based on message text can be reduced to the well-studied problem of text classification. The challenges that e-mail classification faces today concern technical aspects, the extraction of spam-specific features from e-mails, and an arms race between spam filters and spam senders adapting to known filters. By comparison, research in the area of automatic e-mail answering and sentence completion is in an earlier stage; we see a substantial potential for algorithmic improvements to the existing methods. The technical integration of these approaches into existing e-mail clients or call-center automation software provides an additional challenge. Some of these technical challenges have to be addressed before mining algorithms that aim at supporting communication can be evaluated under realistic conditions.

Construction of social network graphs from e-mail logs is much easier than by surveys, and there is a huge interest in mining social networks; see, for instance, the DARPA program on Evidence Extraction and Link Discovery (EELD). While social networks have been studied intensely in the social sciences and in physics, we see a considerable potential for new and better mining algorithms for social networks that computer scientists can contribute.

CONCLUSION

Some methods that can form the basis for effective spam filtering have reached maturity (text classification); additional foundations are being worked on (social network analysis). Today, technical challenges dominate the development of spam filters. The development of methods that support and automate communication processes is a research topic, and first solutions to some of the problems involved have been studied. Mining social networks from e-mail logs is a new challenge; research on this topic in computer science is in an early stage.

ACKNOWLEDGMENT

The authors are supported by the German Science Foundation DFG under grant SCHE540/10-1. We would like to thank the anonymous reviewers.

REFERENCES

Bickel, S., & Scheffer, T. (2004). Learning from message pairs for automatic email answering. Proceedings of the European Conference on Machine Learning.

Boykin, P., & Roychowdhury, V. (2004). Personal e-mail networks: An effective anti-spam tool. Preprint, arXiv id 0402143.

Cohen, W. (1996). Learning rules that classify e-mail. Proceedings of the AAAI Spring Symposium on Machine Learning for Information Access, Palo Alto, California, USA.

Drucker, H., Wu, D., & Vapnik, V. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), 1048-1055.

Ebel, H., Mielsch, L., & Bornholdt, S. (2002). Scale-free topology of e-mail networks. Physical Review E, 66.

Fruchterman, T. M., & Rheingold, E. M. (1991). Force-directed placement. Software: Practice and Experience, 21(11).

Garton, L., Haythornthwaite, C., & Wellman, B. (1997). Studying online social networks. Journal of Computer-Mediated Communication, 3(1).

Grabski, K., & Scheffer, T. (2004). Sentence completion. Proceedings of the SIGIR International Conference on Information Retrieval, Sheffield, UK.

Graham, P. (2003). Better Bayesian filtering. Proceedings of the First Annual Spam Conference, MIT. Retrieved from http://www.paulgraham.com/better.html

Green, C., & Edwards, P. (1996). Using machine learning to enhance software tools for internet information management. Proceedings of the AAAI Workshop on Internet Information Management.

Hinds, P., & Kiesler, S. (1995). Communication across boundaries: Work, structure, and use of communication
Mining for Image Classification Based on Feature Elements
these color components play a great role in perception and represent useful visual meanings of images. The pixels belonging to these visual components can be taken to form perceptual primitive units, by which human beings could identify the content of images (Xu, 2001).

The feature elements are defined on the basis of these primitive units. They are discrete quantities, relatively independent of each other, and have obvious intuitive visual senses. In addition, they can be considered as sets of items. Based on feature elements, image classification becomes a process of counting the existence of representative components in images. For this purpose, it is required to find association rules between the feature elements and the class attributes of images.

Association Rules and Rule Mining

An association rule can be represented by an expression X → Y, where X and Y can be any discrete entities. As we discuss image databases, X and Y can be feature elements extracted from images. The meaning of X → Y is: given an image database D, for each image I ∈ D, X → Y expresses that whenever an image I contains X, then I probably also contains Y. The support of an association rule is defined as the probability p(X ⊆ I, Y ⊆ I), and the confidence is defined as the conditional probability p(Y ⊆ I | X ⊆ I). A rule with support bigger than a specified minimum support and confidence bigger than a specified minimum confidence is considered a significant association rule.

Since the introduction of association rule mining by Agrawal (1993), much research has been conducted to enhance its performance. Most works can be grouped into the following categories:

1. Works on mining different kinds of rules, such as multi-dimensional rules (Yang, 2001).
2. Works taking advantage of particular techniques, such as tree projection (Guralnik, 2004), multiple minimum supports (Tseng, 2001), constraint-based clustering (Tung, 2001), and association (Cohen, 2001).
3. Works developing fast algorithms, such as an algorithm based on anti-skew partitioning (Lin, 1998).
4. Works on discovering temporal databases, such as discovering temporal association rules (Guimaraes, 2000; Li, 2003).

Currently, association rule mining (Lee, 2003; Harms, 2004) is one of the most popular pattern discovery methods in knowledge discovery and data mining. In contrast to classification rule mining (Pal, 2003), the purpose of association rule mining is to find all significant rules in the database that satisfy some minimum support and minimum confidence constraints (Hipp, 2000). It is known that rule-based classification models often have difficulty dealing with continuous variables. However, as a feature element is just a discrete entity, association rules can easily be used for treating images represented and described by feature elements. In fact, a decision about whether an image I contains feature element X and/or feature element Y can be properly defined and detected.

Classification Based on Association

Classification based on associations (CBA) is an algorithm for integrating classification and association rule mining (Liu, 1998). Assume that the data set is a normal relational table that consists of N cases described by distinct attributes and classified into several known classes. All the attributes are treated uniformly. For a categorical attribute, all the possible values are mapped to a set of consecutive positive integers. With these mappings, a data case can be treated as a set of (attribute, integer value) pairs plus a class label. Each (attribute, integer value) pair is called an item. Let D be the data set, I the set of all items in D, and Y the set of class labels. A class association rule (CAR) is an implication of the form X → y, where X ⊆ I and y ∈ Y. A data case d ∈ D contains a subset of items; X ⊆ d means d contains all items of X, with X ⊆ I. A rule X → y holds in D with confidence C if C percent of the cases in D that contain X are labeled with class y. The rule X → y has support S in D if S percent of the cases in D contain X and are labeled with class y.

The objective of CBA is to generate the complete set of CARs that satisfy the specified minimum support and minimum confidence constraints, and to build a classifier from the CARs. It is easy to see that if the right-hand side of the association rules is restricted to the (classification) class attributes, then such rules can be regarded as classification rules with which to build classifiers.

MAIN THRUST

Extracting Various Types of Feature Elements

Various types of feature elements that put emphasis on different properties will be employed in different applications. The extraction of feature elements can be carried out first by locating the perceptual elements and then by determining their main properties and giving them suitable descriptions. Three typical examples are described in the following.
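The support and confidence definitions above can be illustrated on a toy database in which each image is a set of discrete feature elements. All element names and threshold values below are invented for illustration.

```python
# Each image is described as a set of discrete feature elements.
images = [
    {"red_blob", "round_shape", "sky"},
    {"red_blob", "round_shape"},
    {"red_blob", "sky"},
    {"green_blob", "sky"},
]

def support(X, Y, db):
    """p(X ⊆ I, Y ⊆ I): fraction of images containing both itemsets."""
    return sum(1 for i in db if X <= i and Y <= i) / len(db)

def confidence(X, Y, db):
    """p(Y ⊆ I | X ⊆ I): of the images containing X, the fraction also containing Y."""
    nx = sum(1 for i in db if X <= i)
    return support(X, Y, db) * len(db) / nx if nx else 0.0

X, Y = {"red_blob"}, {"round_shape"}
print(support(X, Y, images))     # 0.5
print(confidence(X, Y, images))  # about 0.667

# The rule X -> Y is significant if both values exceed the chosen minima.
min_support, min_confidence = 0.4, 0.6
significant = (support(X, Y, images) >= min_support
               and confidence(X, Y, images) >= min_confidence)
print(significant)               # True
```

Restricting Y to a class label turns the same computation into mining class association rules, as in CBA.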
One process for obtaining feature elements primarily based on color properties can be described by the following steps (Xu, 2001):

1. Images are divided into several clusters with a perceptual grouping based on the hue histogram.
2. For each cluster, the central hue value is taken as its color cardinality, named Androutsos-cardinality (AC). In addition, the color-coherence-vector (CCV) and color-auto-correlogram (CAC) are also calculated.
3. Additional attributes, such as the center coordinates and area of each cluster, are recorded to represent the position and size information of clusters.

One type of feature element highlighting the form property of clusters is obtained with the help of Zernike moments (Xu, 2003). They are invariant to similarity transformations, such as translation, rotation, and scaling of the planar shape (Wee, 2003). Based on the Zernike moments of clusters, different descriptors expressing circularity, directionality, eccentricity, roundness, symmetry, and so forth, can be directly obtained, which provides useful semantic meanings of clusters with respect to human perception.

The wavelet feature element is based on wavelet modulus maxima and invariant moments (Zhang, 2003). Wavelet modulus maxima can indicate the location of edges in images. A set of seven invariant moments (Gonzalez, 2002) is used to represent the multi-scale edges in wavelet-transformed images. Three steps are taken first:

1. Images are decomposed, using the dyadic wavelet, into a multi-scale modulus image.
2. Pixels in the wavelet domain whose moduli are locally maximal are used to form multi-scale edges.
3. The seven invariant moments at each scale are computed and combined to form the feature vector of images.

Then, a process of discretization follows (Li, 2002). Suppose the wavelet decomposition is performed in six levels; for each level, seven moments are computed. This gives a 42-D vector. It can be split into six groups, each of them being a 7-D vector that represents the seven moments on one level. On the other side, the whole vector can be split into seven groups, each of them a 6-D vector that represents one moment on all six levels. This process can be described with the help of Figure 1.

Figure 1. Splitting and grouping feature vectors to construct feature elements (a 6 × 7 array of moments m11 through m67: row i holds the seven moments of level i; column j holds moment j across all six levels)

In all these examples, the feature elements have properties represented by numeric values. As not all of the feature elements have the same status in the visual sense, an evaluation of feature elements is required to select suitable feature elements according to the subjective perception of human beings (Xu, 2002).

Feature Element Based Image Classification

Feature Element Based Image Classification (FEBIC) uses CBA to find association rules between feature elements and class attributes of the images, while the class attributes of unlabeled images can be predicted with such rules. In case an unlabeled image satisfies several rules, which might make this image be classified into different classes, the support values and confidence values can be used to make the final decision.

In accordance with the assumption in CBA, each image is considered as a data case, which is described by a number of attributes. The components of the feature element are taken as attributes. The labeled image set can be considered as a normal relational table that is used to mine association rules for classification. In the same way, feature elements from unlabeled images are extracted and form another relational table without class attributes, on which the classification rules to predict the class attributes of each unlabeled image will be applied.

The whole procedure can be summarized as follows:

1. Extract feature elements from images.
2. Form a relational table for mining association rules.
3. Use the mined rules to predict the class attributes of unlabeled images.
4. Classify images using the association of feature elements.

Database Used in Test

The image database for testing consists of 2,558 real-color images that can be grouped into five different classes: (1) 485 images with (big) flowers; (2) 565 images with person pictures; (3) 505 images with autos; (4) 500 images with different sceneries (e.g., sunset, sunrise, beach, mountain, forest, etc.); and (5) 503 images with flower clusters. Among these classes, the first three have prominent objects, while the other two normally have no dominant items. Two typical examples from each class are shown in Figure 2.

Among these images, one-third have been used in the test set and the rest in the training set. The images in the training set are labeled manually and then used in the mining of association rules, while the images in the testing set will be labeled automatically by these mined rules.

Classification experiments using two methods with the previously mentioned database are carried out. The proposed method, FEBIC, is compared to another state-of-the-art method, nearest feature line (NFL) (Li, 2000). NFL is a classification method based on feature vectors. In the comparison, the color features (i.e., AC, CCV, CAC) and the wavelet feature based on wavelet modulus maxima and invariant moments are used.

Two tests are performed. For each test, both methods use the same training set and testing set. The results of these experiments are summarized in Table 1, where the classification error rates for each class and for the average over the five classes are listed.

The results in Table 1 show that the classification error rate of NFL is about 34.5%, while the classification error rate of FEBIC is about 25%. The difference is evident.

Table 1. Comparison of classification errors

Error rate        Test set 1          Test set 2
                  FEBIC     NFL       FEBIC     NFL
Flower            32.1%     48.8%     36.4%     46.9%
Person            22.9%     25.6%     20.7%     26.1%
Auto              21.3%     23.1%     18.3%     23.1%
Scenery           30.7%     38.0%     32.5%     34.3%
Flower cluster    26.8%     45.8%     20.2%     37.0%
Average           26.6%     35.8%     25.4%     33.2%

Apart from the classification error, the time complexity is another important factor to be considered in Web applications, as the number of images on the WWW is huge. The computation times of the two methods are compared during the test experiments. The time needed for FEBIC is only about 1/100 of the time needed for NFL. Since NFL requires many arithmetic operations to compute distance functions, while FEBIC needs only a few operations for judging the existence of feature elements, such a big difference in computation is well expected.

FUTURE TRENDS

The detection and description of feature elements play an important role in providing suitable information and a basis for association rule mining. How to adaptively design feature elements that can capture the users' intention based on perception and interpretation needs further research.

The proposed techniques can also be extended to the content-based retrieval of images over the Internet. As feature elements are discrete entities, the similarity between images described by feature elements can be computed according to the number of common elements.

CONCLUSION

A new approach for image classification that uses feature elements and employs association rule mining is proposed. It provides lower classification error and higher computation efficiency. These advantages make it quite suitable to be included in a Web search engine for images over the Internet.
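The element-overlap similarity suggested for content-based retrieval can be sketched as follows. The set contents are invented, and normalizing the common-element count by the union size (Jaccard) is one possible choice among several.

```python
def element_similarity(a, b):
    """Similarity of two images described by feature-element sets,
    here normalized as |common elements| / |union| (Jaccard)."""
    return len(a & b) / len(a | b) if a | b else 1.0

query = {"red_blob", "round_shape", "green_leaf"}
flower = {"red_blob", "round_shape", "green_leaf", "sky"}
vehicle = {"metal_gray", "wheel_shape", "road"}

print(element_similarity(query, flower))   # 0.75
print(element_similarity(query, vehicle))  # 0.0
```

Because feature elements are discrete, such similarities need only set operations, which is consistent with the low computation cost reported for FEBIC above.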
Cohen, E. et al. (2001). Finding interesting associations without support pruning. IEEE Trans. Knowledge and Data Engineering, 13(1), 64-78.

Gonzalez, R.C., & Woods, R.E. (2002). Digital image processing. Prentice Hall.

Guimaraes, G. (2000). Temporal knowledge discovery for multivariate time series with enhanced self-organizing maps. Proceedings of the International Joint Conference on Neural Networks.

Guralnik, V., & Karypis, G. (2004). Parallel tree-projection-based sequence mining algorithms. Parallel Computing, 30(4), 443-472.

Harms, S.K., & Deogun, J.S. (2004). Sequential association rule mining with time lags. Journal of Intelligent Information Systems, 22(1), 7-22.

Hipp, J., Guntzer, U., & Nakhaeizadeh, G. (2000). Algorithms for association rule mining: A general survey and comparison. ACM SIGKDD Explorations, 2(1), 58-64.

Hirata, K. et al. (2000). Integration of image matching and classification for multimedia navigation. Multimedia Tools and Applications, 11, 295-309.

Lee, C.H., Chen, M.S., & Lin, C.R. (2003). Progressive partition miner: An efficient algorithm for mining general temporal association rules. IEEE Trans. Knowledge and Data Engineering, 15(4), 1004-1017.

Li, Q., Zhang, Y.J., & Dai, S.Y. (2002). Image search engine with selective filtering and feature element based classification. Proceedings of the SPIE of Internet Imaging III.

Li, S.Z., Chan, K.L., & Wang, C.L. (2000). Performance evaluation of the nearest feature line method in image classification and retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11), 1335-1339.

Pal, S.K. (2003). Soft computing pattern recognition, case generation and data mining. Proceedings of the International Conference on Active Media Technology.

Renato, C. (2002). A theoretical framework for data mining: The informational paradigm. Computational Statistics and Data Analysis, 38(4), 501-515.

Tseng, M.C., Lin, W., & Chien, B.C. (2001). Maintenance of generalized association rules with multiple minimum supports. Proceedings of the Annual Conference of the North American Fuzzy Information Processing Society.

Tung, A.K.H. et al. (2001). Constraint-based clustering in large databases. Proceedings of International Conference on Database Theory.

Wee, C.Y. (2003). New computational methods for full and subset Zernike moments. Information Sciences, 159(3-4), 203-220.

Xu, Y., & Zhang, Y.J. (2001). Image retrieval framework driven by association feedback with feature element evaluation built in. Proceedings of the SPIE Storage and Retrieval for Media Databases.

Xu, Y., & Zhang, Y.J. (2002). Feature element theory for image recognition and retrieval. Proceedings of the SPIE Storage and Retrieval for Media Databases.

Xu, Y., & Zhang, Y.J. (2003). Semantic retrieval based on feature element constructional model and bias competition mechanism. Proceedings of the SPIE Storage and Retrieval for Media Databases.

Yang, C., Fayyad, U., & Bradley, P.S. (2001). Efficient discovery of error-tolerant frequent itemsets in high dimensions. Proceedings of the International Conference on Knowledge Discovery and Data Mining.

Zhang, Y.J. (2003). Content-based visual information retrieval. Science Publisher.
Multi-Resolution Analysis: A process to treat a function (i.e., an image) at various levels of resolution and/or approximation. In such a way, a complicated function can be divided into several simpler ones that can be studied separately.

Pattern Detection: Concerned with locating patterns in the database to maximize/minimize a response variable.

Web Image Search Engine: A kind of search engine that starts from several initially given URLs and extends from complex hyperlinks to collect images on the WWW. A Web search engine is also known as a Web crawler.

Web Mining: Concerned with the mechanism for discovering the correlations among the references to various files that are available on the server by a given client visit to the server.
Wen-Chi Hou
Southern Illinois University, USA
Zhong Chen
Shanghai JiaoTong University, PR China
Mining for Profitable Patterns in the Stock Market
Figure 1. Ideal stock price movement curve under the market efficiency theory

The flat periods indicate that there were no events occurring during those periods, while the sharp edges indicate sudden stock price movements in response to event announcements. However, in reality, most stocks' daily prices resemble the curve shown in Figure 2. As the figure shows, there is no obvious flat period for a stock, and the stock price seems to keep on changing. In some cases, the stock price continuously moves down or up for a relatively long period, for example, the period of May 17, 2002 to July 2, 2002, and the period of October 16, 2002 to November 6, 2002. This could be because either there were negative (or positive) events for the company every day for a long period of time, or the stock price adjustment to events actually spans a period of time rather than occurring instantly. The latter means that the stock price adjustment to event announcements is not efficient and the semi-strong form of the market efficiency theory does not hold. Furthermore, we think the first few days' price adjustments of the stock are crucial, and the price movements in these early days might contain enough information to predict whether the rest of the price adjustment in the near future is upwards or downwards.

Knowledge Representation

Knowledge representation holds the key to the success of data mining. A good knowledge representation should be able to include all possible phenomena of a problem domain without complicating it (Liu, 1998). Here, we use K-Lines, a widely used representation method for daily stock prices in Asian stock markets, to describe the daily price change of a stock. Figure 3 shows examples of K-Lines.

Figure 3(a) is a price-up K-Line, denoted by an empty rectangle, indicating that the closing price is higher than the opening price. Figure 3(b) is a price-down K-Line, denoted by a solid rectangle, indicating that the closing price is lower than the opening price. Figures 3(c) and 3(d) are 3-day K-Lines. Figure 3(c) shows that the price was up for two consecutive days and the second day's opening price continued from the first day's closing price. This indicates that the news was very positive. The price came down a little on the third day, which might be due to a correction of the over-valuation of the good news in the prior two days. Actually, the long price shadow above the closing price of the second day already shows some degree of price correction. Figure 3(d) is the opposite of Figure 3(c). When an event about a stock happens, such as rumors of a merger/acquisition or a change of dividend policy, the price adjustments might last for several days until the price finally settles down. As a result, the stock's price might keep rising, keep falling, or stay the same during the price adjustment period.

A stock has a K-Line for every trading day, but not every K-Line is of interest to us. Our goal is to identify a stock's K-Line patterns that reflect investors' reactions to market events such as the release of good or bad corporate news, a major stock analyst's upgrade of the stock, and so on. Such market events usually can cause the stock's price to oscillate for a period of time. Certainly, a stock's price sometimes might change with large magnitude for just a day or two due to transient market rumors. These types of price oscillations are regarded as market noise and are therefore ignored.

Whether a stock's daily price oscillates is determined by examining whether the price change on that day is greater than the average price change of the year. If a stock's price oscillates for at least three consecutive days, we regard it as a signal of the occurrence of a market event. The market's response to the event is recorded in a 3-day K-Line pattern. Then, we examine whether this pattern is followed by an up or down trend in the stock's price a few days later.
Figure 2. The daily stock price curve of Intel Corporation (NasdaqNM Symbol: INTC)
The relative positions of the K-Lines, such as one day's opening/closing prices relative to the prior day's closing/opening prices, the length of the price body, etc., reveal market reactions to the events. The following bit-representation method, called Relative Price Movement (RPM) for simplicity, is used to describe the positional relationship of the K-Lines over three days.

Day 1

bit 0: 1 if the day's price is up, 0 otherwise
bit 1: 1 if the up shadow is longer than the price body, 0 otherwise
bit 2: 1 if the down shadow is longer than the price body, 0 otherwise

Day 2

bits 0-2: The same as Day 1's representation
bits 3-5:
001, if the price body covers the day 1's price body
010, if the price body is covered by the day 1's price body
011, if the whole price body is higher than the day 1's price body
100, if the whole price body is lower than the day 1's price body
101, if the price body is partially higher than the day 1's price body
110, if the price body is partially lower than the day 1's price body

Day 3

bits 0-2: The same as Day 1's representation
bits 3-7:
00001 - 00111, reserved
01000, if the price body covers the day 1's and day 2's price bodies
01001, if the price body covers the day 1's price body only
01010, if the price body covers the day 2's price body only
01011, if the price body is covered by the day 1's and day 2's price bodies
01100, if the price body is covered by the day 2's price body only
01101, if the price body is covered by the day 1's price body only
01110, if the whole price body is higher than the day 1's and day 2's price bodies
01111, if the whole price body is higher than the day 1's price body only
10000, if the whole price body is higher than the day 2's price body only
10001, if the whole price body is lower than the day 1's and day 2's price bodies
10010, if the whole price body is lower than the day 1's price body only
10011, if the whole price body is lower than the day 2's price body only
10100, if the price body is partially higher than the day 1's and day 2's price bodies
10101, if the price body is partially lower than the day 1's and day 2's price bodies
10110, if the price body is partially higher than the day 2's price body only
10111, if the price body is partially higher than the day 1's price body only
11000, if the price body is partially lower than the day 2's price body only
11001, if the price body is partially lower than the day 1's price body only
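The Day 1 bits and the Day 2 relation code above can be computed directly from a day's open/high/low/close and from the two price bodies. A minimal sketch (function and parameter names are ours, and the handling of exact ties at body edges is our assumption):

```python
def day1_bits(open_p, high, low, close):
    """RPM Day 1 bits (sketch).
    bit 0: day's price is up; bit 1: upper shadow longer than the body;
    bit 2: lower shadow longer than the body."""
    body = abs(close - open_p)
    up_shadow = high - max(open_p, close)    # wick above the price body
    down_shadow = min(open_p, close) - low   # wick below the price body
    b0 = 1 if close > open_p else 0
    b1 = 1 if up_shadow > body else 0
    b2 = 1 if down_shadow > body else 0
    return b0 | (b1 << 1) | (b2 << 2)

def day2_relation(prev, cur):
    """RPM Day 2 bits 3-5 (sketch): position of today's price body `cur`
    relative to yesterday's body `prev`; bodies are (low, high) pairs.
    Tie-breaking at equal endpoints is our assumption."""
    plo, phi = prev
    clo, chi = cur
    if clo <= plo and chi >= phi:
        return 0b001  # covers day 1's price body
    if clo >= plo and chi <= phi:
        return 0b010  # covered by day 1's price body
    if clo >= phi:
        return 0b011  # whole body higher
    if chi <= plo:
        return 0b100  # whole body lower
    if chi > phi:
        return 0b101  # partially higher
    return 0b110      # partially lower

# A price-up day with short shadows encodes Day 1 as 0b001:
print(day1_bits(10.0, 11.2, 9.9, 11.0))          # 1
print(day2_relation((10.0, 11.0), (9.5, 11.5)))  # 1 (covers)
```

Packing the three days' codes side by side then yields the [c1][c2][c3] indices used by the search algorithm below.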
Mining for Rules

The rules we mine for are similar to those by Liu (1998), Silberschatz & Tuzhilin (1996), and Zaki, Parthasarathy, Ogihara, & Li (1997). They have the following format:

Rule type (1): a 3-day K-Line pattern → the stock's price rises 10% in 10 days
Rule type (2): a 3-day K-Line pattern → the stock's price falls 10% in 10 days

The search algorithm for finding 3-day K-Line patterns that lead to a stock price rise or fall is as follows:

1. For every 3-day K-Line pattern in the database:
2. Encode it by using the RPM method to get every day's bit representation, c1, c2, c3;
3. Increase pattern_occurrence[c1][c2][c3] by 1;
4. base_price = the 3rd day's closing price;
5. if the stock's price rises 10% or more, as compared to the base_price, in 10 days after the occurrence of this pattern, increase Pup[c1][c2][c3] by 1;
6. if the stock's price falls 10% or more, as compared to the base_price, in 10 days after the occurrence of this pattern, increase Pdown[c1][c2][c3] by 1;

We used the daily trading data from January 1, 1994, through December 31, 1998, of the 82 stocks shown in Table 1 as the base data set to mine for the price-up and price-down patterns. After applying the above search algorithm on the base data set, the Pup and Pdown arrays contained the counts of all the patterns that led the price to rise or fall by 10% in 10 days. In total, the up-patterns occurred 1,377 times, among which there were 870 different types of up-patterns; and the down-patterns occurred 1,001 times, among which there were 698 different types of down-patterns.

A heuristic, stated below, was applied to all found patterns to reduce the ambiguity of the patterns. Using the price-up pattern as an example, for a pattern to be labeled as a price-up pattern, we think the number of times it appeared in Pup should be at least twice the number of times it appeared in Pdown. All the patterns labeled as price-up patterns were then sorted by the ratio of the square root of a pattern's total occurrences plus its occurrences as a price-up pattern over its occurrences as a price-down pattern.

For a price-up pattern:

    Preference = (Pup/Pdown + √PO/Pdown) · Pup,    if Pup/Pdown > 2
    Preference = −(Pup/Pdown + √PO/Pdown) · Pup,   if Pup/Pdown ≤ 2

For a price-down pattern:

    Preference = (Pdown/Pup + √PO/Pup) · Pdown,    if Pdown/Pup > 2
    Preference = −(Pdown/Pup + √PO/Pup) · Pdown,   if Pdown/Pup ≤ 2

where PO denotes the pattern's total number of occurrences.

The final winning patterns with a positive Preference score are listed in Table 2.

Table 1. 82 selected stocks

ADBE BA CDN F KO MWY S WAG
ADSK BAANF CEA FON LGTO NETG SAPE WCOM
ADVS BEAS CHKP GATE LU NKE SCOC WMT
AGE BEL CLGY GE MACR NOVL SNPS XOM
AIT BTY CNET GM MERQ ORCL SUNW YHOO
AMZN BVEW CSCO HYSL MO PRGN SYBS
AOL CA DD IBM MOB PSDI SYMC
ARDT CAL DELL IDXC MOT PSFT T
AVNT CBS DIS IFMX MRK RATL TSFW
AVTC CBTSY EIDSY INTU MSFT RMDY TWX
AWRE CCRD ERTS ITWO MUSE RNWK VRSN

Table 2. Final winning patterns sorted by preference

Pattern Code      PO  Pup  Pdown  Preference
Up[00][20][91]    46  15   4      81.68
Up[01][28][68]    17  7    1      77.86
Up[07][08][88]    11  7    1      72.22
Up[00][24][88]    10  7    1      71.14
Up[00][30][8E]    9   7    1      70.00
Up[01][19][50]    28  12   3      69.17
Up[00][30][90]    39  21   9      63.57
Up[00][31][81]    26  8    2      52.40
Up[00][20][51]    18  8    2      48.97
Up[01][19][60]    24  9    3      41.70
Down[01][1D][71]  10  0    6      66.00
Down[00][11][71]  17  1    6      60.74
Down[01][19][79]  35  3    10     53.05
Down[00][20][67]  18  2    5      23.11

Performance Evaluation

To evaluate the performance of the found winning patterns listed in Table 2, we applied them to the prices of the same 82 stocks for the period from January 1, 1999,
through December 31, 1999. A stop-loss of 5% was set to reduce the risk imposed by a wrong signal. This is a common practice in the investment industry. If a buy signal is generated, we buy that stock and hold it. The stock is sold when it reaches the 10% profit target, when the 10-day holding period ends, or when its price goes down 5%. The same rules were applied to the sell signals, but in the opposite way. Table 3 shows the number of buy and short-sell signals generated by these patterns.

As seen from Table 3, the price-up winning patterns worked very well. 42.86% of the predictions were perfectly correct. Also, 20 of the 84 buy signals secured a 6.7% gain after the signals. If we regard a 5% increase also as making money, then in total we had a 70.24% chance of winning money and an 85.71% chance of not losing money. The price-down patterns did not work as well as the price-up patterns. This was probably because there were not as many down trends as up trends in the U.S. stock market in 1999. Still, by following the sell signals, there was a 43% chance of gaining money and an 87.5% chance of not losing money in 1999. The final return for the year 1999 was 153.8%, which was superior compared to the 84% return of the Nasdaq Composite and the 25% return of the Dow Industrial Average.

Table 3. Buy and short-sell signals generated by the winning patterns

                                                       Times  Accumulated Percentage
Total Buy Signals                                      84
Price is up at least 10% after the signal              36     42.86%
Price is up 2/3 of 10%, i.e. 6.7%, after the signal    20     66.67%
Price is up 1/2 of 10%, i.e. 5%, after the signal      3      70.24%
Price is up only 1/10 of 10% after the signal          13     85.71%
Price drops after the signal                           12     100.00%

Total Sell Signals                                     16
Price is down at least 10% after the signal            4      25.00%
Price is down 2/3 of 10%, i.e. 6.7%, after the signal  2      37.50%
Price is down 1/2 of 10%, i.e. 5%, after the signal    1      43.75%
Price is down only 1/10 of 10% after the signal        7      87.50%
Price rises after the signal                           2      100.00%

FUTURE TRENDS

Being able to identify price rise or drop patterns can be exciting for frequent stock traders. By following the buy or sell signals generated by these patterns, frequent stock traders can earn excess returns over the simple buy-and-hold strategy (Allen & Karjalainen, 1999; Lo & MacKinlay, 1999). Data mining techniques combined with financial theories can be a powerful approach for discovering price movement patterns in the financial market. Unfortunately, researchers in the data mining field often focus exclusively on the computational part of market analysis, not paying attention to the theories of the target area. In addition, the knowledge representation methods and variables chosen are often based on common sense rather than theories. This article borrows the market efficiency theory to model the problem, and the out-of-sample performance was quite pleasing. We believe there will be more studies integrating theories from multiple disciplines to achieve better results in the near future.

CONCLUSION

This paper combines a knowledge discovery technique with a financial theory, the market efficiency theory, to solve a classic problem in stock market analysis, that is, finding stock trading patterns that lead to superior financial gains. This study is one of a few efforts that go across multiple disciplines to study the stock market, and the results were quite good.

There are also some future research opportunities in this direction. For example, trading volume is not considered in this research, although it is an important factor in the stock market, and we believe it is worth further investigation. Using four-or-more-day K-Line patterns, instead of just 3-day K-Line patterns, is also worth exploring.
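The Preference heuristic used to rank the patterns can be reproduced directly from the counts reported in Table 2. A minimal sketch (the non-positive score for patterns failing the two-to-one test is our assumption):

```python
import math

def preference(po, n_with, n_against):
    """Preference score of a candidate pattern (sketch). For a price-up
    pattern, n_with = Pup and n_against = Pdown; for a price-down pattern
    the two roles are swapped. PO is the pattern's total occurrences."""
    score = (n_with + math.sqrt(po)) / n_against * n_with
    # Patterns failing the two-to-one test get a negative score
    # (our assumption), so they never appear among the winners.
    return score if n_with / n_against > 2 else -score

# Reproduces Table 2: Up[00][20][91] has PO=46, Pup=15, Pdown=4.
print(round(preference(46, 15, 4), 2))  # 81.68
```

The same call with the roles of Pup and Pdown swapped reproduces the Down rows, e.g. Down[00][11][71] with PO=17, Pdown=6, Pup=1 scores 60.74.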
Mining for Web-Enabled E-Business Applications

INTRODUCTION

A small shop owner builds a relationship with his customers by observing their needs, preferences, and buying behaviour. A Web-enabled e-business would like to accomplish something similar. It is an easy job for the small shop owner to serve his customers better in the future by learning from past interactions. But this may not be easy for Web-enabled e-businesses, where most customers may never interact personally and the number of customers is much higher than for the small shop owner.

Data mining techniques can be applied to understand and analyse e-business data and turn it into actionable information that can support a Web-enabled e-business in improving its marketing, sales, and customer support operations. This is all the more appealing when data is produced and stored with advanced electronic data interchange methods, the computing power is affordable, the competitive pressure among businesses is strong, and efficient commercial data mining tools are available for data analysis.

BACKGROUND

Data mining is the process of searching for trends, clusters, valuable links, and anomalies in the entire data. The process benefits from the availability of large amounts of data with rich descriptions. Rich descriptions of data, such as wide customer records with many potentially useful fields, allow data mining algorithms to search beyond obvious correlations. Examples of data mining in Web-enabled e-business applications are the generation of user profiles, enabling customer relationship management, and targeting Web advertising based on user access patterns extracted from the Web data. With the use of data mining techniques, e-business companies can improve the sales and quality of their products by anticipating problems before they occur.

When dealing with Web-enabled e-business data, a data mining task is decomposed into many sub-tasks (Figure 1). The discovered knowledge is presented to the user in an understandable and usable form. The analysis may reveal how a Web site is useful in making decisions for a user, resulting in improvements to the Web site. The analysis may also lead to business strategies for acquiring new customers and retaining existing ones.

Figure 1. A data mining task decomposed into sub-tasks: Data Gathering → Data Processing → Data Modelling → Information Extraction → Information Analysis & Knowledge Assimilation

DATA MINING OPPORTUNITIES

Data obtained from Web-enabled e-business transactions can be categorised into (1) primary data, which includes the actual Web contents, and (2) secondary data, which includes Web server access logs, proxy server logs, browser logs, registration data if any, user sessions and queries, cookies, etc. (Cooley, 2003; Kosala & Blockeel, 2000).

The goal of mining the primary Web data is to effectively interpret the searched Web documents. Web search engines discover resources on the Web but have many problems, such as (1) the abundance problem, where hundreds of irrelevant documents are returned in response to a search query; (2) the limited coverage problem, where only a few sites are searched for the query instead of searching the entire Web; (3) the limited query interface, where the user can only interact by providing a few keywords; and (4) limited customization to individual users (Garofalakis, Rastogi, Seshadri, & Hyuseok, 1999). Mining of Web contents can assist e-businesses in improving the organization of retrieved results and increasing the precision of information retrieval. Some of the data mining applications appropriate for this type of data are:

• Trend prediction within the retrieved information to indicate future values. For example, an e-auction company provides information about items to auction, previous auction details, etc. Predictive modelling can analyse the existing information and, as a result, estimate the values of auctioned items or the number of people participating in future auctions.

• Text clustering within the retrieved information. For example, structured relations can be extracted from unstructured text collections by finding the structure of Web documents, and a hierarchical structure can be presented to represent the relations among text data in Web documents (Wong & Fu, 2000).

• Monitoring a competitor's Web site to find unexpected information, e.g., unexpected services and products being offered. Because of the large number of competitors' Web sites and the huge amount of information in them, automatic discovery is required. For instance, association rule mining can discover frequent word combinations in a page that will lead a company to learn about competitors (Liu, Ma, & Yu, 2001).

• Categorization of Web pages by discovering similarities and relationships among various Web sites using clustering or classification techniques. This will lead to effectively searching the Web for the requested Web documents within the categories rather than the entire Web. Cluster hierarchies of hypertext documents can be created by analysing the semantic information embedded in link structures and document contents (Kosala & Blockeel, 2000). Documents can also be given classification codes according to the keywords present in them.

• Providing a higher level of organization for semi-structured or unstructured data available on the Web. Users do not scan the entire Web site to find the required information; instead, they use Web query languages to search within the document or to obtain structural information about Web documents. A Web query language restructures extracted information from Web information sources that are heterogeneous and semi-structured (Abiteboul, Buneman, & Suciu, 2000). An agent-based approach involving artificial intelligence systems can also organize Web-based information (Dignum & Cortes, 2001).

The goal of mining the secondary Web data is to capture the buying and traversing habits of customers in an e-business environment. Secondary Web data includes Web transaction data extracted from Web logs. Some of the data mining applications appropriate for this type of data are:

• Promoting cross-marketing strategies across products. Data mining techniques can analyse logs of different sales indicating customers' buying patterns (Cooley, 2003). Classification and clustering of Web access logs can help a company target its marketing (advertising) strategies to a certain group of customers. For example, classification rule mining is able to discover that a certain age group of people from a certain locality is likely to buy a certain group of products. A Web-enabled e-business can also benefit from link analysis for repeat-buying recommendations. Schulz, Hahsler, & Jahn (1999) applied link analysis in traditional retail chains and found that a 70% cross-selling potential exists. Association rule mining can find frequent products bought together. For example, association rule mining can discover rules such as "75% of customers who place an order for product1 from the /company/product1/ page also place an order for product2 from the /company/product2/ page."

• Maintaining or restructuring Web sites to better serve the needs of customers. Data mining techniques can assist in Web navigation by discovering authority sites of a user's interest and overview sites for those authority sites. For instance, association rule mining can discover correlations between documents in a Web site and thus estimate the probability of documents being requested together (Lan, Bressan, & Ooi, 1999). An example association rule resulting from the analysis of a travel e-business company's Web data is: "79% of visitors who browsed pages about Hotel also browsed pages on visitor information: places to visit." This rule can be used in redesigning the Web site by directly linking the authority and overview Web sites.

• Personalization of Web sites according to each individual's taste. Data mining techniques can assist in facilitating the development and execution of marketing strategies such as dynamically changing a particular Web site for a visitor (Mobasher, Cooley, & Srivastava, 1999). This is achieved by building a model representing the correlation of Web pages and users. The goal is to find groups of users performing similar activities. The built model is capable of categorizing Web pages and users, and matching between and across Web pages and/or users (Mobasher et al., 1999). According to the clusters of user profiles, recommendations can be made to a visitor on a return visit or to new visitors (Spiliopoulou,
Pohle, & Faulstich, 1999). For example, people accessing educational products on a company Web site between 6-8 p.m. on Friday can be considered to be academics, and marketing can be focused accordingly.

DIFFICULTIES IN APPLYING DATA MINING

The idea of discovering knowledge in large amounts of data with rich descriptions is both appealing and intuitive, but technically it is challenging. Strategies should be implemented for better analysis of the data collected from Web-enabled e-business sources.

• Data Format: Data collected from Web-enabled e-business sources is semi-structured and hierarchical. The data has no absolute schema fixed in advance, and the extracted structure may be irregular or incomplete. This type of data requires additional processing before being fed to traditional mining algorithms, whose source is mostly confined to structured data. This pre-processing includes transforming the unstructured data to a format suitable for traditional mining methods. Web query languages can be used to obtain structural information from semi-structured data; based on this structural information, data appropriate to the mining techniques is generated. Web query languages that combine path expressions with an SQL-style syntax, such as Lorel or UnQL (Abiteboul et al., 2000), are a good choice for extracting structural information.

• Data Volume: Collected e-business data sets are large in volume, and the mining techniques should be able to handle such large data sets. Enumeration of all patterns may be expensive and unnecessary. Instead, selecting representative patterns that capture the essence of the entire data set and using them for mining may prove a more effective approach, but then the selection of such a data set becomes a problem. A more efficient approach is to use an iterative and interactive technique that takes real-time responses and feedback into account in the calculation. An interactive process involves a human analyst, so instant feedback can be included in the process. An iterative process first considers a selected number of attributes chosen by the user for analysis and then keeps adding other attributes for analysis until the user is satisfied. This iterative method reduces the search space significantly.

• Data Quality: Web server logs may not contain all the data needed. Also, noisy and corrupt data can hide patterns and make predictions harder (Kohavi, 2001). Nevertheless, the quality of data is increased with the use of electronic interchange: there is less noise present in the data due to electronic storage and processing in comparison to manual processing of data.
Data warehousing provides a capability for good-quality data storage. A warehouse integrates data from operational systems, e-business applications, and demographic data providers, and handles issues such as data inconsistency and missing values. A Web warehouse may be used as the data source. There have been some initiatives to warehouse the Web data generated from e-business applications, but there is still a long way to go in terms of data mining (Bhowmick, Madria, & Ng, 2003).
Another solution for collecting good-quality Web data is the use of (1) a dedicated server recording all activities of each user individually, or (2) cookies or scripts in the absence of such a server (Chan, 1999; Kohavi, 2001). Agent-based approaches that involve artificial intelligence systems can also be used to discover such Web-based information.

• Data Adaptability: Data on the Web is ever-changing. Data mining models and algorithms should be adapted to deal with real-time data such that new data is incorporated for analysis. The constructed data model should be updated as new data arrives. User-interface agents can be used to maximize the productivity of current users' interactions with the system by adapting behaviours. Other solutions are to dynamically modify the mined information as the database changes (Cheung & Lee, 2000) or to incorporate user feedback to modify the actions performed by the system.

• XML Data: It is assumed that within a few years XML will be the most widely used language on the Internet for exchanging information. Assuming the metadata is stored in XML, the integration of two disparate data sources becomes much more transparent: field names are matched more easily and semantic conflicts are described explicitly (Abiteboul et al., 2000). As a result, the types of data input to and output from the learned models and the detailed form of the models are determined. XML documents may not be in completely the same format, thus resulting in missing values when integrated. Various techniques, e.g., tag recognition, can be used to fill in missing information created by mismatches in attributes or tags (Abiteboul et al., 2000). Moreover, many query languages such as XML-QL, XSL and XML-GL
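Association rules of the kind quoted in the cross-marketing discussion above ("75% of customers who order product1 also order product2") boil down to support and confidence counts over transactions. A minimal sketch (the transaction data and all names are ours):

```python
def rule_stats(transactions, lhs, rhs):
    """Support and confidence of the rule lhs -> rhs over a list of
    transactions (sets of items). A sketch, not a full Apriori miner."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence

# Hypothetical order logs: 3 of the 4 product1 buyers also bought product2.
logs = [{"product1", "product2"}, {"product1", "product2"},
        {"product1", "product2"}, {"product1"}, {"product2"}]
print(rule_stats(logs, ["product1"], ["product2"]))  # (0.6, 0.75)
```

The confidence of 0.75 corresponds to the "75% of customers" phrasing in the rule; the support measures how often the whole basket occurs overall.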
It is easy to collect data from Web-enabled e-business sources, as visitors to a Web site leave a trail that is automatically stored in log files by the Web server. Data mining tools can process and analyse such Web server log files or Web contents to discover meaningful information. This analysis reveals to the companies the previously unknown buying habits of their online customers. More importantly, the fast feedback the companies obtain using data mining is very helpful in increasing the company's benefit.

REFERENCES

Liu, B., Ma, Y., & Yu, P.H. (2001, August). Discovering unexpected information from your competitors' Web sites. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), San Francisco, USA.

Masand, B., & Spiliopoulou, M. (1999, August). KDD'99 workshop on Web usage analysis and user profiling (WEBKDD'99), San Diego, CA. ACM.
Mobasher, B., Cooley, R., & Srivastava, J. (1999). Automatic personalization based on Web usage mining. In Masand & Spiliopoulou (Eds.), WEBKDD'99.

Piatetsky-Shapiro, G. (2000, January). Knowledge discovery in databases: 10 years after. SIGKDD Explorations, 1(2), 59-61, ACM SIGKDD.

Schulz, A.G., Hahsler, M., & Jahn, M. (1999). A customer purchase incidence model applied to recommendation service. In Masand & Spiliopoulou (Eds.), WEBKDD'99.

Spiliopoulou, M., Pohle, C., & Faulstich, L.C. (1999). Improving the effectiveness of a Web site with Web usage mining. In Masand & Spiliopoulou (Eds.), WEBKDD'99.

Wong, W.C., & Fu, A.W. (2000, July). Finding structure and characteristics of Web documents for classification. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM.

Wu, J. (2000, August). Business intelligence: What is data mining? In Data Mining Review Online.

KEY TERMS

Clustering Data Mining Task: To identify items with similar characteristics, thus creating a hierarchy of classes from the existing set of events. A data set is partitioned into segments of (homogeneous) elements that share a number of properties.

Data Mining (DM) or Knowledge Discovery in Databases: The extraction of interesting, meaningful, implicit, previously unknown, valid, and actionable information from a pool of data sources.

Link Analysis Data Mining Task: Establishes internal relationships to reveal hidden affinities among items in a given data set. Link analysis exposes samples and trends by predicting correlations among items that are otherwise not obvious.

Mining of Primary Web Data: Assists in effectively interpreting the searched Web documents. The output of this mining process can help e-business customers improve the organization of retrieved results and increase the precision of information retrieval.

Mining of Secondary Web Data: Assists in capturing the buying and traversing habits of customers in an e-business environment. The output of this mining process can help an e-business predict future customer behaviour, personalize Web sites, and promote campaigns through cross-marketing strategies across products.

Predictive Modelling Data Mining Task: Makes predictions based on essential characteristics of the data. The classification task of data mining builds a model to map (or classify) a data item into one of several predefined classes. The regression task of data mining builds a model to map a data item to a real-valued prediction variable.

Primary Web Data: Includes the actual Web contents.

Secondary Web Data: Includes Web transaction data extracted from Web logs. Examples are Web server access logs, proxy server logs, browser logs, registration data if any, user sessions, user queries, cookies, product correlations, and feedback from the customer companies.

Web-Enabled E-Business: A business transaction or interaction in which the participants operate or transact business or conduct their trade electronically on the Web.
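As a toy illustration of the clustering task defined above (partitioning a data set into homogeneous segments), a minimal two-segment sketch over one-dimensional data might look like this (the seeding strategy and all names are ours):

```python
def two_means_1d(values, iters=10):
    """Partition 1-D values into two homogeneous segments by repeatedly
    reassigning points to the nearer of two centroids (toy sketch)."""
    lo, hi = min(values), max(values)  # seed the centroids at the extremes
    for _ in range(iters):
        a = [v for v in values if abs(v - lo) <= abs(v - hi)]
        b = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(a) / len(a), sum(b) / len(b)
    return sorted(a), sorted(b)

print(two_means_1d([1, 2, 11, 12]))  # ([1, 2], [11, 12])
```

Each segment's elements are closer to their own centroid than to the other's, which is the "share a number of properties" criterion in its simplest numeric form.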
Mining Frequent Patterns via Pattern Decomposition

Wesley Chu
University of California - Los Angeles, USA
partition of S if and only if S = S1 ∪ S2 and S1 ∩ S2 = ∅. The relationship is denoted by S = S1 + S2, or S1 = S − S2, or S2 = S − S1. We say S is partitioned into S1 and S2. Similarly, a set {S1, S2, …, Sk} is a partition of S if and only if S = S1 ∪ S2 ∪ … ∪ Sk and Si ∩ Sj = ∅ for i, j ∈ [1..k] and i ≠ j. We denote it as S = S1 + S2 + … + Sk.

Let a be an item, where aX is the itemset formed by concatenating a with X.

Theorem 1

For a ∉ X, Y, the search space {X:aY} can be partitioned into {Xa:Y} and {X:Y} by item a (i.e., {X:aY} = {Xa:Y} + {X:Y}).

Proof

It follows from the fact that each itemset of {X:aY} either contains a (i.e., {Xa:Y}) or does not contain a (i.e., {X:Y}). For example, we have {b:cd} = {bc:d} + {b:d}.

{X:Y} − {:Z} = Σ_{i=1}^{k} {Xa_i : a_{i+1}…a_k(Y ∩ Z)},  where a_1a_2…a_k = Y − Z.

Proof

If Z does not contain X, no itemset in {X:Y} is subsumed by Z. Therefore, knowing that Z is frequent, we cannot prune any part of the search space {X:Y}. Otherwise, when X is a subset of Z, we have {X:Y} = Σ_{i=1}^{k} {Xa_i : a_{i+1}…a_kV} + {X:V}, where V = Y ∩ Z. The head in the first part is Xa_i, where a_i is a member of Y − Z. Since Z does not contain a_i, the first part cannot be pruned by Z. For the second part, we have {X:V} − {:Z} = {X:V} − {X:(Z−X)}. Since X ∩ Y = ∅, we have V ⊆ Z − X. Therefore, {X:V} can be pruned away entirely.

For example, we have {:bcde} − {:abcd} = {:bcde} − {:bcd} = {e:bcd}. Here, a is irrelevant and is removed in the first step. Another example is {e:bcd} − {:abe} = {e:bcd} − {:be} = {e:bcd} − {e:b} = {ec:bd} + {ed:b}.
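Theorem 1's partition, e.g. {b:cd} = {bc:d} + {b:d}, can be verified by brute-force enumeration, reading {X:Y} as all itemsets consisting of the head X plus any subset of the tail Y (the helper below is ours):

```python
from itertools import combinations

def space(head, tail):
    """Enumerate {head:tail}: the head plus every subset of the tail."""
    return {"".join(sorted(head + "".join(c)))
            for r in range(len(tail) + 1)
            for c in combinations(tail, r)}

# {b:cd} = {bc:d} + {b:d}: same itemsets, and the two parts are disjoint.
whole, p1, p2 = space("b", "cd"), space("bc", "d"), space("b", "d")
print(whole == p1 | p2, p1 & p2 == set())  # True True
```

The two parts form a partition exactly as the proof argues: every itemset in {b:cd} either contains c (landing in {bc:d}) or does not (landing in {b:d}).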
For infrequent 3-itemset ~abc, PD(d:abcef|~abc) = d:bcef + da:cef + dab:ef by excluding abc.

By decomposing a transaction t, we reduce the number of items in its tails and thus reduce its search space. For example, the search space of a:bcd contains the following eight itemsets: {a, ab, ac, ad, abc, abd, acd, abcd}. Its decomposition result, abc:d, contains only two itemsets, {abc, abcd}, which is only 25% of the original search space.

When using pattern decomposition, we find frequent patterns in a step-wise fashion, starting at step 1 with 1-item itemsets. At step k, we first count the support of every possible k-item itemset contained in the dataset Dk to find the frequent k-item itemsets Lk and the infrequent k-item itemsets ~Lk. Then, using Lk and ~Lk, Dk can be decomposed into Dk+1, which has a smaller search space than Dk. These steps continue until the search space Dk becomes empty.

An Application

The motivation of our work originates from the problem of finding multi-word combinations in a group of medical report documents, where sentences can be viewed as transactions and words can be viewed as items. The problem is to find all multi-word combinations that occur in at least two sentences of a document. As a simple example, consider the following text:

Aspirin greatly underused in people with heart disease. DALLAS (AP) Too few heart patients are taking aspirin, despite its widely known ability to prevent heart attacks, according to a study released Monday. The study, published in the American Heart Association's journal Circulation, found that only 26% of patients who had heart disease and could have benefited from aspirin took the pain reliever. This suggests that there's a substantial number of patients who are at higher risk of more problems because they're not taking aspirin, said Dr. Randall Stafford, an internist at Harvard's Massachusetts General Hospital, who led the study. As we all know, this is a very inexpensive medication, very affordable. The regular use of aspirin has been shown to reduce the risk of blood clots that can block an artery and trigger a heart attack. Experts say aspirin also can reduce the risk of a stroke and angina, or severe chest pain. Because regular aspirin use can cause some side effects, such as stomach ulcers, internal bleeding, and allergic reactions, doctors too often are reluctant to prescribe it for heart patients, Stafford said. There's a bias in medicine toward treatment, and within that bias, we tend to underutilize preventative services, even if they've been clearly proven, said Marty Sullivan, a professor of cardiology at Duke University in Durham, North Carolina. Stafford's findings were based on 1996 data from 10,942 doctor visits by people with heart disease. The study may underestimate aspirin use; some doctors may not have reported instances in which they recommended patients take over-the-counter medications, he said. He called the data a wake-up call to doctors who focus too much on acute medical problems and ignore general prevention.

We can find frequent one-word, two-word, three-word, four-word, and five-word combinations. For instance, we found 14 four-word combinations:

heart aspirin use regul, aspirin they take not, aspirin patient take not, patient doct use some, aspirin patient study take, patient they take not, aspirin patient use some, aspirin doct use some, aspirin patient they not, aspirin patient they take, aspirin patient doct some, heart aspirin patient too, aspirin patient doct use, heart aspirin patient study.

Multi-word combinations are effective for document indexing and summarization. The work in Johnson, et al. (2002) shows that multi-word combinations can index documents more accurately than single-word indexing terms. Multi-word combinations can also delineate the concepts or content of a domain-specific document collection more precisely than single words. For example, from the frequent one-word table, we may infer that heart, aspirin, and patient are the most important concepts in the text, since they occur more often than the others. From the frequent two-word table, we see a large number of two-word combinations with aspirin (i.e., aspirin patient, heart aspirin, aspirin use, aspirin take, etc.). This suggests that the document emphasizes aspirin and aspirin-related topics more than any other words.

FUTURE TRENDS

There is a growing need for mining frequent sequence patterns from human genome datasets. There are 23 pairs of human chromosomes, approximately 30,000 genes, and more than 1,000,000 proteins. The previously discussed pattern decomposition method can be used to capture sequential patterns with some small modifications.

When the frequent patterns are long, mining frequent itemsets (FI) is infeasible because of the exponential number of frequent itemsets. Thus, algorithms mining frequent closed itemsets (FCI) (Pasquier, Bastide, Taouil & Lakhal, 1999; Pei, Han & Mao, 2000; Zaki & Hsiao, 1999) have been proposed, since FCI is enough to generate association rules. However, FCI also could be as exponentially large as the FI. As a result, many algorithms for mining maximal frequent itemsets (MFI) have been proposed, such as Mafia (Burdick, Calimlim & Gehrke, 2001), GenMax (Gouda & Zaki, 2001), and SmartMiner (Zou, Chu & Lu, 2002).

The main idea of pattern decomposition also is used in SmartMiner, except that SmartMiner uses tail information (frequent itemsets) to decompose the search space of a dataset rather than the dataset itself. While pattern decomposition avoids candidate set generation, SmartMiner avoids superset checking, which is a time-consuming process.

CONCLUSION

We propose to use pattern decomposition to find frequent patterns in large datasets. The PD algorithm shrinks the dataset in each pass so that the search space of the dataset is reduced. Pattern decomposition avoids the costly candidate set generation procedure, and using reduced datasets greatly decreases the time for support counting.

ACKNOWLEDGMENT

This research is supported by NSF IIS ITR Grant # 6300555.

REFERENCES

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 1994 International Conference on Very Large Data Bases.

Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. Proceedings of the International Conference on Data Engineering.

Gouda, K., & Zaki, M.J. (2001). Efficiently mining maximal frequent itemsets. Proceedings of the IEEE International Conference on Data Mining, San Jose, California.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. Proceedings of the 2000 ACM International Conference on Management of Data, Dallas, Texas.

Johnson, D., Zou, Q., Dionisio, J.D., Liu, Z., & Chu, W.W. (2002). Modeling medical content for automated summarization. Annals of the New York Academy of Sciences.

Mannila, H., Toivonen, H., & Verkamo, A.I. (1994). Efficient algorithms for discovering association rules. Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, Seattle, Washington.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999). Discovering frequent closed itemsets for association rules. Proceedings of the 7th International Conference on Database Theory.

Pei, J., Han, J., & Mao, R. (2000). Closet: An efficient algorithm for mining frequent closed itemsets. Proceedings of the SIGMOD International Workshop on Data Mining and Knowledge Discovery.

Toivonen, H. (1996). Sampling large databases for association rules. Proceedings of the 22nd International Conference on Very Large Data Bases, Bombay, India.

Zaki, M.J., & Hsiao, C. (1999). Charm: An efficient algorithm for closed association rule mining. Technical Report 99-10. Rensselaer Polytechnic Institute.

Zaki, M.J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. Proceedings of the Third International Conference on Knowledge Discovery in Databases and Data Mining.

Zou, Q., Chu, W., Johnson, D., & Chiu, H. (2002). Using pattern decomposition (PD) methods for finding all frequent patterns in large datasets. Journal of Knowledge and Information Systems (KAIS).

Zou, Q., Chu, W., & Lu, B. (2002). SmartMiner: A depth-first algorithm guided by tail information for mining maximal frequent itemsets. Proceedings of the IEEE International Conference on Data Mining, Japan.
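Returning to the application described in this article (sentences as transactions, words as items), a naive levelwise search for multi-word combinations could be sketched as follows. This is only an illustration of the task, not the PD algorithm itself, and the function name is hypothetical:

```python
def multiword_combinations(sentences, min_support=2, max_k=5):
    """Find all word combinations occurring in at least
    `min_support` sentences, level by level up to `max_k` words."""
    transactions = [frozenset(s.lower().split()) for s in sentences]
    result = {}
    candidates = {frozenset([w]) for t in transactions for w in t}
    k = 1
    while candidates and k <= max_k:
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support}
        result.update({c: counts[c] for c in frequent})
        # Join step: combine frequent k-sets into (k+1)-set candidates.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        k += 1
    return result
```

Each level counts support by direct subset tests, so unlike pattern decomposition it does not shrink the dataset between steps; it is meant only to make the sentences-as-transactions framing concrete.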
Search Space of a Transaction N=X:Y: The set of unknown frequent itemsets contained by N. Its size is determined by the number of items in the tail of N, i.e., Y.

Support of an Itemset x: The number of transactions that contain x.

Transaction: An instance that usually contains a set of items. In this article, we extend a transaction to a composition of a head and a tail (i.e., N=X:Y), where the head represents a known frequent itemset, and the tail is the set of items for extending the head to new frequent patterns.
Mining Group Differences

Geoffrey I. Webb
Monash University, Australia
BACKGROUND

There have been two main approaches to the group discovery problem from two different schools of thought. The first, Emerging Patterns, evolved as a classification method, while the second, Contrast Sets, grew as an exploratory method. The algorithms of both approaches are based on the Max-Miner rule discovery system (Bayardo Jr., 1998). Therefore, we will briefly describe rule discovery.

confidence(A → C) = support(A ∪ C) / frequency(A)

The association rules discovered through this process then are sorted according to some user-specified interestingness measure before they are displayed to the user.
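The confidence computation can be sketched directly from the formula (the function name and the transactions-as-sets representation are illustrative):

```python
def confidence(transactions, antecedent, consequent):
    """confidence(A -> C) = support(A ∪ C) / frequency(A),
    counting over a list of transactions (sets of items)."""
    a = set(antecedent)
    ac = a | set(consequent)
    freq_a = sum(1 for t in transactions if a <= t)    # frequency(A)
    supp_ac = sum(1 for t in transactions if ac <= t)  # support(A ∪ C)
    return supp_ac / freq_a if freq_a else 0.0
```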
Another type of rule discovery is k-most interesting rule discovery (Webb, 2000). In contrast to the support-confidence framework, there is no minimum support or confidence requirement. Instead, k-most interesting rule discovery focuses on the discovery of up to k rules that maximize some user-specified interestingness measure.

MAIN THRUST

Emerging Patterns

Emerging Pattern analysis is applied to two or more datasets, where each dataset contains data relating to a different group. An Emerging Pattern is defined as an itemset whose support increases significantly from one group to another (Dong & Li, 1999). This support increase is represented by the growth rate: the ratio of the support of an itemset in group 1 over that of group 2. The support of an itemset X in a group G is given by:

supp_G(X) = count_G(X) / |G|

Contrast Sets

In contrast, Contrast Sets were developed as an exploratory method for finding differences between one group and another that the user can utilize, rather than as a classification system focusing on prediction accuracy. To this end, they present filtering and pruning methods to ensure only the most interesting and an optimal number of rules are shown to the user, from what is potentially a large space of possible rules.

Contrast Sets are discovered using STUCCO, an algorithm that is based on the Max-Miner search algorithm (Bayardo Jr., 1998). Initially, only Contrast Sets are sought that have supports that are both significant and whose difference is large (i.e., the difference is greater than a user-defined parameter, mindev). Significant Contrast Sets (cset), therefore, are defined as those that meet the criteria:

P(cset | Gi) ≠ P(cset | Gj)

Large Contrast Sets are those for which:

|support(cset, Gi) − support(cset, Gj)| ≥ mindev
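The support, growth-rate, and "large Contrast Set" conditions above can be sketched as follows (groups are modeled as lists of transaction sets; function names are illustrative):

```python
def support(group, itemset):
    """supp_G(X) = count_G(X) / |G|."""
    return sum(1 for t in group if itemset <= t) / len(group)

def growth_rate(g1, g2, itemset):
    """Growth rate of an Emerging Pattern: support in group 1
    over support in group 2 (infinite if absent from group 2)."""
    s2 = support(g2, itemset)
    return support(g1, itemset) / s2 if s2 else float("inf")

def is_large_contrast_set(groups, cset, mindev):
    """STUCCO's 'large' condition: the support difference between
    some pair of groups is at least the user-defined mindev."""
    supports = [support(g, cset) for g in groups]
    return max(supports) - min(supports) >= mindev
```

The significance condition (a statistical test that P(cset | Gi) differs across groups) is deliberately omitted here; STUCCO uses a chi-square test for that step, as discussed below.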
While STUCCO and Magnum Opus specify different support conditions in the discovery phase, their conditions were proven to be equivalent (Webb et al., 2003). Further investigation found that the key difference between the two techniques was the filtering technique. Magnum Opus uses a binomial sign test to filter spurious rules, while STUCCO uses a chi-square test. STUCCO attempts to control the risk of type-1 error by applying a correction for multiple comparisons. However, such a correction, when given a large number of tests, will reduce the critical value to an extremely low number, meaning that the risk of type-2 error (i.e., the risk of not accepting a non-spurious rule) is substantially increased. Magnum Opus does not apply such corrections, so as not to increase the risk of type-2 error.

While a chi-square approach is likely to be better suited to Contrast Set discovery, the correction for multiple comparisons, combined with STUCCO's minimum difference, is a much stricter filter than that employed by Magnum Opus. As a result of Magnum Opus' much more lenient filter mechanisms, many more rules are presented to the end user. After finding that the main difference between the systems was their control of type-1 and type-2 errors via differing statistical test methods, Webb, et al. (2003) concluded that Contrast Set mining is, in fact, a special case of the rule discovery task.

Experience has shown that filters are important for removing spurious rules, but it is not obvious which of the filtering methods used by systems like Magnum Opus and STUCCO is better suited to the group discovery task. Given the apparent tradeoff between type-1 and type-2 error in these data-mining systems, recent developments (Webb, 2003) have focused on a new filter method that avoids introducing type-1 and type-2 errors. This approach divides the dataset into exploratory and holdout sets. Like the training and test set method of statistically evaluating a model within the classification framework, one set is used for learning (the exploratory set) and the other is used for evaluating the models (the holdout set). A statistical test then is used to filter spurious rules, and it is statistically sound, since the tests are applied using a different set from the one used for discovery. A key difference between the traditional training and test set methodology of the classification framework and the new holdout technique is that many models are evaluated in the exploratory framework, rather than only one model as in the classification framework. We envisage the holdout technique will be one area of future research, as it is adopted by exploratory data-mining techniques as a statistically sound filter method.

Case Study

In order to evaluate STUCCO and the more lenient Magnum Opus filter mechanisms, Webb, Butler, and Newlands (2003) conducted a study with a retailer to find interesting patterns between transactions from two different days. This data was traditional market-basket transactional data, containing the purchasing behaviors of customers across the many departments. Magnum Opus was used with the group encoded as a variable and the consequent restricted to that variable only.

In this experiment, Magnum Opus discovered all of the Contrast Sets that STUCCO found, and more. This is indicative of the more lenient filtering method of Magnum Opus. It was also interesting that, while all of the Contrast Sets discovered by STUCCO were only of size 1, Magnum Opus discovered conjunctions of sizes up to three department codes.

This information was presented to the retail marketing manager in the form of a survey. For each rule, the manager was asked if the rule was surprising and if it was potentially useful to the organization. For ease of understanding, the information was transformed into a plain text statement.

The domain expert judged a greater percentage of the Magnum Opus rules surprising than the STUCCO contrasts; however, the result was not statistically significant. The percentage of rules found to be potentially useful was similar for both systems. In this case, Magnum Opus probably found some rules that were spurious, and STUCCO probably failed to discover some rules that were potentially interesting.

FUTURE TRENDS

Mining differences among groups will continue to grow as an important research area. One area likely to be of future interest is improving filter mechanisms. Experience has shown that the use of filters is important, as it reduces the number of rules, thus avoiding overwhelming the user. There is a need to develop alternative filters as well as to determine which filters are best suited to different types of problems.

An interestingness measure is a user-generated specification of what makes a rule potentially interesting. Interestingness measures are another important issue, because they attempt to reflect the user's interest in a model during the discovery phase. Therefore, the development of new interestingness measures and the determination of their appropriateness for different tasks are both expected to be areas of future study.

Finally, while the methods discussed in this article focus on discrete attribute-value data, it is likely that there will be future research on how group mining can utilize quantitative, structural, and sequence data.
For example, group mining of sequence data could be used to investigate what is different about the sequence of events between fraudulent and non-fraudulent credit card transactions.

CONCLUSION

We have presented an overview of techniques for mining differences among groups, discussing Emerging Pattern discovery, Contrast Set discovery, and association rule discovery approaches. Emerging Patterns are useful in a classification system where prediction accuracy is the focus, but they are not designed for presenting group differences to the user and thus don't have any filters. Exploratory data mining can result in a large number of rules. Contrast Set discovery is an exploratory technique that includes mechanisms to filter spurious rules, thus reducing the number of rules presented to the user. By forcing the consequent to be the group variable during rule discovery, generic rule discovery software like Magnum Opus can be used to discover group differences. The numbers of differences reported to the user by STUCCO and Magnum Opus are related to their different filter mechanisms for controlling the output of potentially spurious rules. Magnum Opus uses a more lenient filter than STUCCO and thus presents more rules to the user. A new method, the holdout technique, will be an improvement over other filter methods, since the technique is statistically sound.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., USA.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile.

Bay, S.D., & Pazzani, M.J. (1999). Detecting change in categorical data: Mining contrast sets. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA.

Bay, S.D., & Pazzani, M.J. (2001). Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5(3), 213-246.

Bayardo Jr., R.J. (1998). Efficiently mining long patterns from databases. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 85-93, Seattle, Washington, USA.

Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, San Diego, California, USA.

Dong, G., Zhang, X., Wong, L., & Li, J. (1999). CAEP: Classification by aggregating emerging patterns. Proceedings of the Second International Conference on Discovery Science, Tokyo, Japan.

Fan, H., & Ramamohanarao, K. (2003). A Bayesian approach to use emerging patterns for classification. Proceedings of the 14th Australasian Database Conference, Adelaide, Australia.

Li, J., Dong, G., & Ramamohanarao, K. (2001). Making use of the most expressive jumping emerging patterns for classification. Knowledge and Information Systems, 3(2), 131-145.

Li, J., Dong, G., Ramamohanarao, K., & Wong, L. (2004). DeEPs: A new instance-based lazy discovery and classification system. Machine Learning, 54(2), 99-124.

Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, New York.

Liu, B., Ma, Y., & Wong, C.K. (2001). Classification using association rules: Weaknesses and enhancements. In V. Kumar et al. (Eds.), Data mining for scientific and engineering applications (pp. 506-605). Boston: Kluwer Academic Publishing.

Webb, G.I. (1995). An efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research, 3, 431-465.

Webb, G.I. (2000). Efficient search for association rules. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA.

Webb, G.I. (2003). Preliminary investigations into statistically valid exploratory rule discovery. Proceedings of the Australasian Data Mining Workshop, Canberra, Australia.

Webb, G.I., Butler, S.M., & Newlands, D. (2003). On detecting differences between groups. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., USA.
KEY TERMS

Association Rule: A rule relating two itemsets: the antecedent and the consequent. The rule indicates that the presence of the antecedent implies that the consequent is more probable in the data. Written as A → C.

Contrast Set: Similar to an Emerging Pattern, it is also an itemset whose support differs across groups. The main difference is the method's application as an exploratory technique rather than as a classification one.

Emerging Pattern: An itemset that occurs significantly more frequently in one group than another. Utilized as a classification method by several algorithms.

Filter Technique: Any technique for reducing the number of models with the aim of avoiding overwhelming the user.

Growth Rate: The ratio of the proportion of data covered by the Emerging Pattern in one group over the proportion of the data it covers in another group.

Holdout Technique: A filter technique that splits the data into exploratory and holdout sets. Rules discovered from the exploratory set then can be evaluated against the holdout set using statistical tests.

Itemset: A conjunction of items (attribute-value pairs) (e.g., age = teen ∧ hair = brown).

k-Most Interesting Rule Discovery: The process of finding k rules that optimize some interestingness measure. Minimum support and/or confidence constraints are not used.

Market Basket: An itemset; this term is sometimes used in the retail data-mining context, where the itemsets are collections of products that are purchased in a single transaction.

Rule Discovery: The process of finding rules that then can be used to predict some outcome (e.g., IF 13 <= age <= 19 THEN teenager).
Mining Historical XML
INTRODUCTION

Nowadays, the Web is the largest data repository ever available in the history of humankind (Reis et al., 2004). However, the availability of a huge amount of Web data does not imply that users can get whatever they want more easily. On the contrary, the massive amount of data on the Web has overwhelmed their abilities to find the desired information. It has been claimed that 99% of the data reachable on the Web is useless to 99% of the users (Han & Kamber, 2000, p. 436). That is, an individual may be interested in only a tiny fragment of the Web data. However, the huge and diverse properties of Web data do imply that the Web provides a rich and unprecedented data mining source. Web mining was introduced to discover hidden knowledge from Web data and services automatically (Etzioni, 1996). According to the type of Web data, Web mining can be classified into three categories: Web content mining, Web structure mining, and Web usage mining (Madria et al., 1999). Web content mining extracts patterns from online information such as HTML files, e-mails, or images (Dumais & Chen, 2000; Ester et al., 2002). Web structure mining analyzes the link structures of Web data, which can be inter-links among different Web documents (Kleinberg, 1998) or intra-links within an individual Web document (Arasu & Hector, 2003; Lerman et al., 2004). Web usage mining is defined as discovering interesting usage patterns from the secondary data derived from the interactions of users while surfing the Web (Srivastava et al., 2000; Cooley, 2003).

Recently, XML has become widely used as a standard for data exchange on the Internet. Existing work on XML data mining includes frequent substructure mining (Inokuchi et al., 2000; Kuramochi & Karypis, 2001; Zaki, 2002; Yan & Han, 2003; Huan et al., 2003), classification (Zaki & Aggarwal, 2003; Huan et al., 2004), and association rule mining (Braga et al., 2002). As data in different domains can be represented as XML documents, XML data mining can be useful in many applications such as bioinformatics, chemistry, and network analysis (Deshpande et al., 2003; Huan et al., 2004).

BACKGROUND

The historical XML mining research is largely inspired by two research communities: XML data mining and XML data change detection. The XML data mining community has looked at developing novel algorithms to mine snapshots of XML data. The database community has focused on detecting, representing, and querying changes to XML data.

Some of the initial work on XML data mining is based on the use of the XPath language as the main component to query XML documents (Braga et al., 2002; Braga et al., 2003). In Braga et al. (2002, 2003), the authors presented the XMINE operator, a tool developed to extract XML association rules from XML documents. The operator is based on XPath and inspired by the syntax of XQuery. It allows us to express complex mining tasks compactly and intuitively. XMINE can be used to specify mining tasks indifferently (and simultaneously) both on the content and on the structure of the data, since the distinction in XML is slight.

Other work on XML data mining focuses on extracting frequent tree patterns from the structure of XML data, such as TreeFinder (Termier et al., 2002) and TreeMiner (Zaki, 2002). TreeFinder uses an Inductive Logic Programming approach. Notice that TreeFinder cannot produce complete results: it may miss many frequent subtrees, especially when the support threshold is small or trees in the database have common node labels. TreeMiner can produce complete results by using a novel vertical representation for fast subtree support counting.

Different from the above techniques, which focus on designing ad-hoc algorithms to extract structures that occur frequently in snapshot data collections, historical XML mining focuses on the sequence of changes among XML versions.

Considering the dynamic nature of XML data, many efforts have been directed into the research of change detection for XML data.
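Hash-based change detection of the kind these systems build on can be sketched with the standard library. This is a minimal illustration only, not any tool's actual algorithm: it fingerprints each subtree, compares children by position, and ignores attributes:

```python
import hashlib
import xml.etree.ElementTree as ET

def subtree_hash(elem):
    """Fingerprint an element from its tag, text, and child hashes
    (attributes are ignored for brevity)."""
    h = hashlib.sha1()
    h.update(elem.tag.encode())
    h.update((elem.text or "").strip().encode())
    for child in elem:
        h.update(subtree_hash(child).encode())
    return h.hexdigest()

def changed_paths(old_xml, new_xml):
    """List paths whose subtrees differ between two versions,
    walking both trees in parallel by child position."""
    def walk(a, b, path):
        if subtree_hash(a) == subtree_hash(b):
            return []  # identical subtrees need no further inspection
        changed = [path]
        for ca, cb in zip(a, b):
            changed += walk(ca, cb, path + "/" + ca.tag)
        return changed
    old, new = ET.fromstring(old_xml), ET.fromstring(new_xml)
    return walk(old, new, "/" + old.tag)
```

Equal hashes let whole subtrees be skipped, which is the efficiency idea; the real systems discussed below additionally handle insertions, deletions, reordering, and (in XyDiff) moves.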
XML TreeDiff (Curbera & Epstein, 1999) computes the difference between two XML documents using hash values and a simple tree comparison algorithm. XyDiff (Cobena et al., 2002) was proposed to detect changes to ordered XML documents; besides insertion, deletion, and updating, XyDiff also supports a move operation. X-Diff (Wang et al., 2003) is designed to detect changes to unordered XML documents. In our historical XML mining, we extend the XML change detection techniques to discover hidden knowledge from the history of changes to XML data with data mining techniques.

MAIN THRUST

Overview

Considering the dynamic property of XML data and existing XML data mining research, it can be observed that the dynamic nature of XML leads to two challenging problems in XML data mining. First is the maintenance of the mining results of existing mining techniques: as the data source changes, new knowledge may be found, and some old knowledge may no longer be valid. Second is the discovery of novel knowledge hidden behind the historical changes, some of which is difficult or impossible to discover from snapshot data. In this paper, we focus on the second issue; that is, discovering novel hidden knowledge from the historical changes to XML data. Suppose there is a sequence of XML documents, which are different versions of the same document. Then the following novel knowledge can be discovered. Note that by no means do we claim that this list is exhaustive; we use these as representatives of the various types of knowledge behind the history of changes.

Frequently changing/frozen structures/contents: Some parts of the structure or content change more frequently and significantly compared to other structures. Such structures and contents reflect the relatively more dynamic parts of the XML document. Frozen structures/contents represent the most stable parts of the XML document. Identifying such structures is useful for various applications such as trend monitoring and change detection of very large XML documents.

Association rules: Some structures/contents are associated in terms of their changes. The association rules imply the concurrence of changes among different parts of the XML document. Such knowledge can be used for XML change prediction, XML index maintenance, and XML-based multimedia annotation.

Change patterns: From the historical changes, one may observe that more and more nodes are inserted under certain substructures, while nodes are inserted and deleted frequently under others. Such change patterns can be critical for monitoring and predicting trends in e-commerce Web sites. They may imply certain underlying semantic meanings and can be exploited by strategy makers.

Applications

Such novel knowledge can be useful in different applications, such as intelligent change detection for very large XML documents, Web usage mining, dynamic XML indexing, association rule mining, evolutionary pattern based classification, and so on. We only elaborate on the first two applications due to the limitation of space.

Suppose one can discover substructures that change frequently and those that do not (frozen structures); then he/she can use this knowledge to detect changes to relevant portions of the documents at different frequencies based on their change patterns. That is, one can detect changes to frequently changing content and structure at a different frequency compared to structures that do not change frequently. Moreover, one may ignore frozen substructures during change detection, as most likely they are not going to change. One of the major limitations of existing XML change detection systems (Cobena et al., 2002; Wang et al., 2003) is that they are not scalable for very large XML documents. Knowledge extracted from historical changes can be used to improve the scalability of XML change detection systems.

Recently, a lot of work has been done in Web usage mining. However, most of the existing work focuses on snapshot Web usage data, while usage data is dynamic in real life. Knowledge hidden behind historical changes of Web usage data, which reflects how Web access patterns (WAP) change, is critical to the adaptive Web, Web site maintenance, business intelligence, and so on. The Web usage data can be considered as a set of trees, which have structures similar to XML documents. By partitioning Web usage data according to a user-defined calendar pattern, we can obtain a sequence of changes from the historical Web access patterns. From these changes, useful knowledge, such as how certain Web access patterns changed and which parts change more frequently and which do not, can be extracted. Some preliminary results of mining the changes to historical Web access patterns have been shown in Zhao and Bhowmick (2004).

Research Issues

To the best of our knowledge, existing state-of-the-art XML (structure related) data mining techniques (Yan & Han, 2002; Yan & Han, 2003) cannot extract such novel knowledge.
801
TEAM LinG
Mining Historical XML
to a sequence of snapshots of XML structure data, they cannot discover such knowledge efficiently and completely. This is because historical XML mining research focuses on the structural changes between versions of XML documents, which are generated by XML data change detection tools (Zhao & Bhowmick, 2004).

Given a sequence of XML documents (which are versions of the same XML document), the objective of historical XML data mining is to extract hidden and useful knowledge from the sequence of changes to the XML versions. In this article, we focus on the structural changes of XML documents. We propose three major issues for historical XML mining: identifying interesting structures, mining XML delta association rules, and classifying/clustering evolutionary XML data. Based on the historical change behaviors of the XML data, the first issue is to discover the interesting structures. We elaborate on some of the representatives.

Frozen/Frequently changing structure: A frozen structure is one that does not change frequently or significantly in the history. There are different possible reasons that a structure does not change in the sequence of XML document versions. First, the structure may be so well designed that it does not change even if the content has changed. Second, some parts of the XML document may be ignored or redundant, so that they never change. Also, some data in the XML document may never change by nature.

Frequently changing structure: Refers to substructures in the XML document that change more frequently and significantly compared to others. They may reflect the relatively more dynamic parts of the XML document.

Periodic dynamic structure: Among the histories of changes for some structures, there may be certain fixed patterns. Such change patterns may occur repeatedly in the history. Structures for which certain patterns exist in their change histories are called periodic dynamic structures.

Increasing/Decreasing dynamic structure: Among the frequently changing structures, some change according to certain patterns. For instance, the changes of some structures may become more significant and more frequent; such structures are defined as increasing dynamic structures. Similarly, decreasing dynamic structures denote structures whose frequency and significance of changes are decreasing.

Outlier structures: From the historical change behavior of the structures, one may observe that some of them do not comply with their change patterns as predicted from the history. For instance, some of the frozen structures that are not supposed to change may change, while some of the frequently changing structures that are supposed to change may not. Such changes may happen for various reasons: some structures may be modified by mistake; some may be the result of intrusion or fraud; and others may be caused by unusual occurrences of underlying events.

Besides the interesting structures, association rules between structures can be extracted as well. There are two types of association rules: structural delta association rules and semantic delta association rules.

Structural delta association rule: Among the interesting structures, we may observe that some structures change together frequently, with a certain confidence. For example, whenever structure A changes, structure B also changes with a probability of 80%. Structural delta association rules are used to represent such knowledge. They can be used for different applications, such as version and concurrency control systems.

Semantic delta association rule: By incorporating some metadata, such as the types of changes, ontology, and content summaries of leaf nodes, semantic delta association rules can be extracted. For example, with the history of changes in an e-commerce Web site, we may discover that when product A becomes more popular, product B becomes less popular. Such semantic delta association rules can be useful for competitor monitoring and strategy analysis in e-commerce.
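The "structure A changes, so structure B changes with 80% probability" idea can be made concrete as a support/confidence computation over change transactions, one transaction per consecutive version pair. The sketch below is only illustrative (the structure paths, thresholds, and brute-force pair enumeration are invented for the example, not the authors' algorithm):

```python
from itertools import combinations

def delta_association_rules(deltas, min_support=0.5, min_confidence=0.8):
    """Mine rules 'when structure x changes, structure y changes too'
    from change transactions: each transaction is the set of structure
    paths that changed between two consecutive versions."""
    n = len(deltas)
    rules = []
    structures = set().union(*deltas)
    for a, b in combinations(sorted(structures), 2):
        both = sum(1 for d in deltas if a in d and b in d)
        for x, y in ((a, b), (b, a)):
            nx = sum(1 for d in deltas if x in d)  # how often x changed
            if nx and both / n >= min_support and both / nx >= min_confidence:
                rules.append((x, y, both / nx))  # x changed => y changed
    return rules

# Hypothetical change history over four version pairs.
deltas = [
    {"/site/products", "/site/news"},
    {"/site/products", "/site/news"},
    {"/site/products", "/site/news", "/site/jobs"},
    {"/site/products"},
]
rules = delta_association_rules(deltas)
```

Here the only rule surviving both thresholds is "/site/news changed => /site/products changed" with confidence 1.0; the reverse direction has confidence 0.75 and is pruned.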
FUTURE TRENDS

Considering the different types of mining we propose and the types of data we are going to mine, there are many challenges ahead. We elaborate on some of the representatives here.

Real Data Collection

Collecting real data of historical XML structural versions is a challenge. To get the structural data from XML documents, a parser is needed. Currently, there are many XML parsers available. However, it has been verified that the parsing process is the most expensive
process of XML data management (Nicola & John, 2003). Moreover, to get the historical structural information, every time the XML document changes, the entire document has to be parsed again. Consequently, extracting historical structural information is a very expensive task. Especially for very large XML documents, usually only some parts of the document change frequently. In order to get knowledge that is more useful, a longer sequence of historical structural data and larger datasets are desirable. However, to get longer and larger historical XML structural datasets, a longer time is needed to collect the corresponding XML documents.

With the dynamic nature of XML data, the process of gathering all versions of the historical XML structural data becomes a challenging task. The most naive method is to keep checking the corresponding XML documents continuously, but this approach is very expensive and may overwhelm the network bandwidth. An alternative method is to determine the frequency of checking by analyzing the historical changes of similar XML documents. However, this method does not guarantee that all versions of the historical data can be detected. In our research, we propose to extract an appropriate frequency for data gathering based on the changes of the historical data.
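The alternative method, deriving a checking frequency from an observed change history, might look like the following heuristic (an invented illustration, not the extraction method proposed in the article; the timestamps are hypothetical):

```python
def polling_interval(change_times, safety=0.5):
    """Suggest how often to re-fetch a document: a fraction of the
    smallest observed gap between consecutive changes, so that a new
    version is unlikely to appear and be overwritten between checks."""
    gaps = [b - a for a, b in zip(change_times, change_times[1:])]
    return safety * min(gaps)

# Hypothetical change timestamps (in hours) for one document.
times = [0, 24, 30, 54, 60]
interval = polling_interval(times)  # gaps 24, 6, 24, 6 -> 0.5 * 6 = 3.0
```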
Incremental Mining

The high cost of some data mining processes and the dynamic nature of XML data make it desirable to mine the data incrementally, based on the part of the data that changed, rather than mining the entire data again from scratch. Such algorithms are based on the previous mining results. Recently, incremental and interactive mining of sequential patterns in large databases has been studied (Parthasarathy et al., 1999). Integrated with a change detection system, our incremental mining of historical XML structural data can keep the discovered knowledge up to date and valid. However, the challenge is that existing incremental mining techniques are designed for relational and transactional data, while XML structural data is semistructured. How to modify those approaches so that they can be used for incremental mining of historical XML will be one of our research issues.
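As a toy illustration of the incremental idea, per-structure change counts can be kept up to date from each new delta alone instead of re-mining the whole version history (the class and structure names here are hypothetical, not the authors' system):

```python
from collections import Counter

class ChangeFrequencyMiner:
    """Maintains per-structure change counts from deltas only.
    Processing a new delta is O(|delta|), not O(|history|)."""
    def __init__(self):
        self.versions = 0
        self.counts = Counter()

    def add_delta(self, changed_structures):
        self.versions += 1
        self.counts.update(changed_structures)

    def frequently_changing(self, min_ratio=0.5):
        return {s for s, c in self.counts.items()
                if c / self.versions >= min_ratio}

miner = ChangeFrequencyMiner()
for delta in [{"/a", "/b"}, {"/a"}, {"/a", "/c"}]:
    miner.add_delta(delta)
hot = miner.frequently_changing()  # "/a" changed in all three deltas
```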
CONCLUSION

In this article, we present a novel research direction: mining historical XML documents. Different from existing research on XML data mining and XML change detection, the temporal and dynamic properties of XML data are incorporated with its semistructured property in historical XML mining. Examples show that the results of historical XML mining can be useful in a variety of applications, such as intelligent change detection for very large XML documents, Web usage mining, dynamic XML indexing, association rule mining, evolutionary-pattern-based classification, and so on. By exploring the characteristics of XML changes, we present a framework of historical XML mining with a list of major research issues and challenges.

REFERENCES

Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from Web pages. In Proceedings of ACM SIGMOD (pp. 337-348).

Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi, P.L. (2002). A tool for extracting XML association rules. In Proceedings of IEEE ICTAI (pp. 57-65).

Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi, P.L. (2003). Discovering interesting information in XML data with association rules. In Proceedings of ACM SAC (pp. 450-454).

Cobena, G., Abiteboul, S., & Marian, A. (2002). Detecting changes in XML documents. In Proceedings of IEEE ICDE (pp. 41-52).

Cooley, R. (2003). The use of Web structure and content to identify subjectively interesting Web usage patterns. ACM Transactions on Internet Technology (pp. 93-116).

Curbera, F., & Epstein, D.A. (1999). Fast difference and update of XML documents. In Proceedings of XTech.

Deshpande, M., Kuramochi, M., & Karypis, G. (2003). Frequent sub-structure-based approaches for classifying chemical compounds. In Proceedings of IEEE ICDM (pp. 35-42).

Dumais, S.T., & Chen, H. (2000). Hierarchical classification of Web content. In Proceedings of the Annual International ACM SIGIR (pp. 256-263).

Ester, M., Kriegel, H.-P., & Schubert, M. (2002). Web site mining: A new way to spot competitors, customers and suppliers in the World Wide Web. In Proceedings of the Eighth ACM SIGKDD (pp. 249-258).

Etzioni, O. (1996). The World-Wide Web: Quagmire or gold mine? Communications of the ACM, 39(11), 65-68.
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann.

Huan, J., Wang, W., & Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. In Proceedings of IEEE ICDM (pp. 549-552).

Huan, J., Wang, W., Washington, A., Prins, J., Shah, R., & Tropsha, A. (2004). Accurate classification of protein structural families using coherent subgraph analysis. In Proceedings of PSB (pp. 411-422).

Kleinberg, J.M. (1998). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.

Lerman, K., Getoor, L., Minton, S., & Knoblock, C. (2004). Using the structure of Web sites for automatic segmentation of tables. In Proceedings of ACM SIGMOD (pp. 119-130).

Madria, S.K., Bhowmick, S.S., Ng, W.K., & Lim, E.-P. (1999). Research issues in Web data mining. In DMKD (pp. 303-312).

Nicola, M., & John, J. (2003). XML parsing: A threat to database performance. In Proceedings of CIKM (pp. 175-178).

McHugh, J., Abiteboul, S., Goldman, R., Quass, D., & Widom, J. (1997). Lore: A database management system for semistructured data. ACM SIGMOD Record, 26(3), 54-66.

Reis, D. de C., Golgher, P.B., da Silva, A.S., & Laender, A.H.F. (2004). Automatic Web news extraction using tree edit distance. In Proceedings of WWW (pp. 502-511).

Parthasarathy, S., Zaki, M.J., Ogihara, M., & Dwarkadas, S. (1999). Incremental and interactive sequence mining. In Proceedings of CIKM (pp. 251-258).

Shearer, K., Dorai, C., & Venkatesh, S. (2000). Incorporating domain knowledge with video and voice data analysis in news broadcasts. In Proceedings of ACM SIGKDD (pp. 46-53).

Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. ACM SIGKDD Explorations, 1(2), 12-23.

Wang, Y., DeWitt, D.J., & Cai, J.-Y. (2003). X-Diff: An effective change detection algorithm for XML documents. In Proceedings of IEEE ICDE (pp. 519-530).

Yang, L.H., Lee, M.L., & Hsu, W. (2003). Efficient mining of XML query patterns for caching. In Proceedings of VLDB (pp. 69-80).

Zaki, M.J. (2002). Efficiently mining frequent trees in a forest. In Proceedings of ACM SIGKDD (pp. 71-80).

Zaki, M.J., & Aggarwal, C.C. (2003). XRules: An effective structural classifier for XML data. In Proceedings of ACM SIGKDD (pp. 316-325).

Zhao, Q., & Bhowmick, S.S. (2004). Mining history of changes to Web access patterns. In Proceedings of PKDD.

KEY TERMS

Changes to XML: Given two XML documents, the set of edit operations that transforms one document into the other is called the changes to XML.

Historical XML: A sequence of XML documents that are different versions of the same XML document. It records the change history of the XML document.

Mining Historical XML: The process of knowledge discovery from the historical changes to versions of XML documents. It is the integration of XML change detection systems and XML data mining techniques.

Semi-Structured Data Mining: A sub-field of data mining in which the data collections are semi-structured, such as Web data, chemical data, biological data, network data, and so on.

Web Mining: The use of data mining techniques to automatically discover and extract information from Web data and services.

XML: Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML.

XML Structural Changes: Among the changes to XML, not all changes alter the structure of the XML document when it is represented as a tree. Only insertions and deletions are called XML structural changes.
Mining Images for Structure
Figure 1. The hierarchical hidden Markov random field (HHMRF) model for image understanding. Here, the hidden state variables, X, at each scale are evidenced by observations, Y, at the same scale and by the state dependencies within and between levels of the hierarchy. The HHMRF is defined over pixels and/or feature graphs.

forest) dependent on a set of regions labeled "tree" at the next scale, which, in turn, are dependent on trunk, branches, and leaves labels at the next scale, and so forth. These labels have specific positional relations over scales, and the observations for each label are supported by their compatibilities and observations.

Consequently, the SBR query for finding forestry images is translated into finding the maximum a posteriori (MAP) labeling of the image regions, given the forestry model. Using Bayes' rule, this reduces to the following optimization problem:

s*_l(x) = argmax_S { p(s_l(x) | o_l(x)) Π_{u,v} p(s_l(x) | s_{l±v}(x_u)) }

where l ± v corresponds to the states above and below level l of the hierarchy. In other words, the derived MAP labeling probability is equivalent to a probabilistic answer to the query of whether this image is a forestry scene. There are many approaches to approximate solutions to this problem, including relaxation labeling, expectation maximization (EM), loopy belief propagation, and the junction tree algorithm (see definitions in Key Terms). All of these methods are concerned with the optimal propagation of evidence over the different layers of the graphical representation of the image, given the model and the observations, and all have their limitations. When the HHMRF model is approximated by a triangulated image state model, the junction tree algorithm is optimal. However, triangulating such hierarchical meshes is computationally expensive. On the other hand, the other approaches mentioned previously are not optimal, converging to local minima (Caetano & Caelli, 2004).

case, the HMRF is defined over graphs that depict features and their relations. That is, consider two attributed graphs, G_s and G_x, representing the image and the query, respectively. We want to determine just how, if at all, the query (graph) structure is embedded somewhere in the image (graph). We define the HMRF over the query graph, G_x. A single node in G_x is denoted by x_i, and a node in the graph G_s by s. Each node in each graph has vertex and edge attributes, and the query corresponds to solving a subgraph isomorphism problem that involves the assignment of each x_i to a unique s, assuming that there is only one instance of the query structure embedded in the image, although this can be generalized. In this formulation, the HMRF model considers each node x_i in G_x as a random variable that can assume any of S possible values corresponding to the nodes of G_s.

The Observation Component: Using HMRF formalities, the similarity (distance: dist) between the vertex attributes of both graphs is consequently defined as the observation matrix model

B_i = p(y_{x_i} | x_i = s) = dist(y_{x_i}, y_s).

The Markov Component: Here, we use the binary (relational) attributes to construct the compatibility functions between the states of neighboring nodes. Assume that x_i and x_j are neighbors in the HMRF (being connected in G_x). Similar to the previously described unary attributes, we have

A_{ji} = p(x_j = s' | x_i = s) = dist(y_{x_i x_j}, y_{s s'}).

Optimization Problem and Solutions: Given this general HMRF formulation for graph matching, the optimal solution reduces to deriving a state vector s* = (s_1, ..., s_T), where s_i ∈ G_s for each vertex x_i ∈ G_x, such that the MAP criterion is satisfied, given the model λ = (A, B) and data:

s* = argmax_{s_1, ..., s_T} p(x_1 = s_1, ..., x_T = s_T | λ).
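For very small graphs, the MAP criterion can be evaluated by exhaustive search over assignments. The sketch below is for intuition only: the score matrices are invented, it adds scores (as one would with log-probabilities) rather than multiplying them, and real systems use the approximate methods discussed above:

```python
from itertools import permutations

def map_assignment(B, A_query, A_image):
    """Brute-force MAP for matching a tiny query graph into an image
    graph. B[i][s] scores assigning query node i to image node s (the
    observation term); the Markov term rewards assignments that map
    query edges onto image edges. Scores are additive (log-domain)."""
    T, S = len(B), len(B[0])
    best, best_score = None, float("-inf")
    for assign in permutations(range(S), T):  # one distinct image node per query node
        score = sum(B[i][assign[i]] for i in range(T))
        for i in range(T):
            for j in range(T):
                if A_query[i][j]:  # a query edge should land on an image edge
                    score += A_image[assign[i]][assign[j]]
        if score > best_score:
            best, best_score = assign, score
    return best

# Invented scores: a 2-node query graph against a 3-node image graph.
B = [[0.9, 0.1, 0.2],   # query node 0 resembles image node 0
     [0.1, 0.3, 0.8]]   # query node 1 resembles image node 2
A_query = [[0, 1], [1, 0]]                   # one query edge: 0-1
A_image = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # image edges: 0-1 and 0-2
match = map_assignment(B, A_query, A_image)  # -> (0, 2)
```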
First, probabilistic relaxation labeling (PRL) is a parallel iterative update method for deriving the most consistent labels, having the form

p^{t+1}(d_i) ∝ p(y_i | d_i) Π_{j=1,..,n} p(d_i | d_j) p^t(d_j).

adjacency matrix where relations are defined in terms of the matrix off-diagonal elements. These matrices are decomposed into their eigenvalues and eigenvectors, as G = PΛP'.
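A minimal numerical illustration of one PRL update on two nodes and two labels (all probabilities are invented; each label's evidence is multiplied by its compatibility-weighted neighbor support, then the result is renormalized):

```python
def prl_step(p, likelihood, compat):
    """One probabilistic relaxation labeling update.
    p[i][d]          current probability of label d at node i
    likelihood[i][d] observation term p(y_i | d)
    compat[d][e]     compatibility of label d at i with label e at a neighbor"""
    n, labels = len(p), len(p[0])
    new = []
    for i in range(n):
        row = []
        for d in range(labels):
            support = 1.0
            for j in range(n):
                if j != i:  # product over neighboring nodes
                    support *= sum(compat[d][e] * p[j][e] for e in range(labels))
            row.append(likelihood[i][d] * support)
        z = sum(row)  # renormalize so node i's label probabilities sum to 1
        new.append([v / z for v in row])
    return new

# Two nodes, two labels; node 1 is confident, node 0 is not.
p0 = [[0.5, 0.5], [0.9, 0.1]]
like = [[0.6, 0.4], [0.9, 0.1]]
compat = [[0.9, 0.1], [0.1, 0.9]]  # neighbors prefer the same label
p1 = prl_step(p0, like, compat)
```

After one step, the uncertain node 0 is pulled toward label 0, consistent with its confident neighbor.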
Table 1. Cluster decomposition obtained from correspondence clustering of the two shock graphs for shapes BOXER-10 and BOXER-24. Each cluster enumerates the vertices from the two graphs that are grouped together and lists in brackets the part of the body from which the vertex approximately comes. The percentage of variance explained by the first three dimensions for shapes Boxer-10 and Boxer-24 was 32% and 33%, respectively (Caelli & Kosinov, 2004).

and robust way. Machine learning, Bayesian networks, and even new uses of matrix algebra all open up new possibilities. Unless computers can store, evaluate and interpret images the way we do, image data warehousing and mining will remain a doubtful and certainly labor-intensive technology with limited use without significant human intervention.
REFERENCES
Scott, G., & Longuet-Higgins, H. (1991). An algorithm for associating the features of two patterns. Proceedings of the Royal Society of London.

Shapiro, L., & Brady, J. (1992). Feature-based correspondence: An eigenvector approach. Image and Vision Computing, 10, 268-281.

Siddiqi, K., Shokoufandeh, A., Dickinson, S., & Zucker, S. (1999). Shock graphs and shape matching. International Journal of Computer Vision, 30, 1-24.

KEY TERMS

Attributed Graph: A graph whose vertices and edges have attributes, typically defined by vectors and matrices, respectively.

Bayesian Networks: Graphical models defining the dependencies between random variables.

Dynamic Programming: A method for deriving the optimal path through a mesh. For hidden Markov models (HMMs), it is also termed the Viterbi algorithm and involves a method for deriving the optimal state sequence, given a model and an observation sequence.

Expectation Maximization: The process of updating model parameters from new data, where the new parameter values constitute maximum posterior probability estimates. Typically used for mixture models.

Grammars: A set of formal rules that define how to perform inference over a dictionary of terms.

Graph: A set of vertices (nodes) connected in various ways by edges.

Graph Spectra: The plot of the eigenvalues of the graph adjacency matrix.

Hidden Markov Random Field (HMRF): A Markov random field with additional observation variables at each node, whose values are dependent on the node states. Additional pyramids of MRFs defined over the HMRF give rise to hierarchical HMRFs (HHMRFs).

Image Features: Discrete properties of images that can be local or global. Examples of local features include edges, contours, textures, and regions. Examples of global features include color histograms and Fourier components.

Image Understanding: The process of interpreting images in terms of what is being sensed.

Junction Tree Algorithm: A two-pass method for updating probabilities in Bayesian networks. For triangulated networks, the inference procedure is optimal.

Loopy Belief Propagation: A parallel method for updating beliefs or probabilities of the states of random variables in a Bayesian network. It is a second-order extension of probabilistic relaxation labeling.

Markov Random Field (MRF): A set of random variables defined over a graph, where dependencies between variables (nodes) are defined by local cliques.

Photogrammetry: The science or art of obtaining reliable measurements or information from images.

Relaxation Labeling: A parallel method for updating beliefs or probabilities of the states of random variables in a Bayesian network. Node probabilities are updated in terms of their consistencies with neighboring nodes and the current evidence.

Shape-From-X: The process of inferring surface depth information from image features such as stereo, motion, shading, and perspective.
Mining Microarray Data

Li Liu
Aventis, USA
microarray analysis suite (MAS) provides the quantification software.

Data Normalization: Normalization is a step necessary to remove systematic array-to-array variations. Different normalization methods have been proposed. The cyclic loess method (Dudoit et al., 2002), the quantile normalization method, and the contrast-based method normalize the probe-level data; the scaling method and the non-linear method normalize expression-intensity-level data. Bolstad, Irizarry, Astrand, and Speed (2003) provide a comparison of all these methods and suggest that simple quantile normalization performs relatively stably.
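Quantile normalization is simple enough to sketch directly: force every array to share the same intensity distribution by averaging across arrays at each rank. This is an illustrative implementation on toy values; a real analysis would use the published implementations compared by Bolstad et al. (2003):

```python
def quantile_normalize(arrays):
    """Sort each array, average the k-th smallest values across arrays,
    then put the averaged values back in each array's original order."""
    n = len(arrays[0])
    order = [sorted(range(n), key=a.__getitem__) for a in arrays]
    # Mean of the k-th smallest value across arrays, for each rank k.
    rank_means = [sum(a[o[k]] for a, o in zip(arrays, order)) / len(arrays)
                  for k in range(n)]
    out = []
    for o in order:
        normed = [0.0] * n
        for k, idx in enumerate(o):
            normed[idx] = rank_means[k]
        out.append(normed)
    return out

# Two hypothetical arrays of probe intensities.
a1, a2 = [5.0, 2.0, 3.0], [4.0, 1.0, 2.0]
n1, n2 = quantile_normalize([a1, a2])
```

Afterwards both arrays contain exactly the same set of values (the rank means), each in its own original rank order.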
Estimation of Expression Intensities: Different methods to summarize expression intensity from probe-level data have appeared in the literature. Among them are Li and Wong's (2001) model-based expression index (MBEI), Irizarry's (2003) robust multi-array (RMA) method, and the method provided in the Affymetrix MAS5 software.

Data Transformation: Data transformation is a very important step, and it enables the data to fit many of the assumptions behind statistical methods. The log transformation and the glog transformation (Durbin, Hardin, Hawkins, & Rocke, 2002) are the two commonly used methods.

Mining Microarray Data

In principle, the current data mining activities in microarray data can be grouped into two types of studies: unsupervised and supervised. Unsupervised analysis has been used widely for mining microarray experiments. Cluster analysis has been the dominant method for unsupervised mining.

Examples of unsupervised data mining:

Eisen, Spellman, Brown, and Botstein (1998) studied the gene expression of the budding yeast Saccharomyces cerevisiae spotted on cDNA microarrays during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks. Hierarchical clustering was applied to this gene expression data, and the result was represented by a tree whose branch lengths reflected the degree of similarity between genes, as assessed by a pair-wise similarity function. The computed tree was then used to order the genes in the original data table, so that genes with similar expression patterns were grouped together. The ordered gene expression table can be displayed graphically in a colored image, where cells with log ratios of 0 are colored black, cells with positive log ratios are colored red, cells with negative log ratios are colored green, and the intensities of the reds and greens are proportional to the absolute values of the log ratios. The clustering analysis efficiently grouped genes with similar functions together, and the colored image provided an overall pattern in the data. Clustering analysis can also help us understand novel genes if they are co-expressed with genes with known functions.

Standard clustering analyses, such as hierarchical clustering, k-means clustering, and self-organizing maps, are very useful in mining microarray data. However, these data tables are often corrupted with extreme values (outliers), missing values, and non-normal distributions that preclude standard analysis. Liu, Hawkins, Ghosh, and Young (2003) proposed a robust analysis method, called rSVD (robust singular value decomposition), to address these problems. The method applies a combination of mathematical and statistical methods to progressively take the data set apart so that different aspects can be examined for both general patterns and very specific effects. The benefits of this robust analysis are both the understanding of large-scale shifts in gene effects and the isolation of particular sample-by-gene effects that might be either unusual interactions or the result of experimental flaws. The method requires a single pass and does not resort to complex cleaning or imputation of the data table before analysis. The rSVD method was applied to a microarray data set, revealed different aspects of the data, and gave some interesting findings.

Examples of supervised data mining:

Golub et al. (1999) studied the gene expression of two types of acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), and demonstrated the feasibility of cancer prediction based on gene expression data. The data come from Affymetrix arrays with 6,817 genes and consist of 47 cases of ALL and 25 cases of AML. Thirty-eight samples (27 ALL, 11 AML) were used as training data. A set of 50 genes with the highest correlations with an idealized expression pattern vector, in which the expression level is uniformly high for AML and uniformly low for ALL, was selected. The prediction of a new sample was based on weighted votes of these 50 genes. The method made strong predictions for 29 of the 34 test samples, and the accuracy was 100%. Golub's method is actually a minor variant of maximum likelihood linear discriminant analysis for two
classes. Instead of using variances in computing the weights, Golub's method uses standard deviations.
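The weighted-vote scheme summarized above might be sketched as follows. This is a simplified illustration with invented expression values; the weight is the class-mean separation over the summed standard deviations, matching the remark that Golub's variant uses standard deviations rather than variances:

```python
from statistics import mean, stdev

def gene_weight(expr_aml, expr_all):
    """Signal-to-noise style weight: separation of class means
    relative to within-class spread (standard deviations)."""
    return (mean(expr_aml) - mean(expr_all)) / (stdev(expr_aml) + stdev(expr_all))

def weighted_vote(sample, genes):
    """genes: list of (weight, decision_boundary) per selected gene.
    Positive total vote -> AML, negative -> ALL."""
    return sum(w * (x - b) for x, (w, b) in zip(sample, genes))

# Two hypothetical marker genes from training data.
g1_aml, g1_all = [5.0, 6.0, 7.0], [1.0, 2.0, 3.0]   # high in AML
g2_aml, g2_all = [1.0, 2.0, 3.0], [5.0, 6.0, 7.0]   # high in ALL
genes = [(gene_weight(g1_aml, g1_all), (mean(g1_aml) + mean(g1_all)) / 2),
         (gene_weight(g2_aml, g2_all), (mean(g2_aml) + mean(g2_all)) / 2)]
vote = weighted_vote([6.5, 1.5], genes)  # new sample: high g1, low g2
```

Both genes vote in the same direction here, so the total vote is strongly positive and the sample is called AML.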
Dudoit et al. (2002) compared the prediction performance of a variety of classification methods, such as linear/quadratic discriminant analysis, the nearest neighbor method, classification trees, aggregated classifiers, bagging, and boosting, on three microarray datasets: lymphoma data (Alizadeh et al.), leukemia data (Golub et al.), and NCI60 data (Ross et al.). Based on their comparisons, the rankings of the classifiers were similar across datasets, and the main conclusion, for the three datasets, is that simple classifiers such as diagonal linear discriminant analysis and nearest neighbors perform remarkably well compared to more sophisticated methods such as aggregated classification trees.

Dimension reduction techniques, such as principal component analysis (PCA) and partial least squares (PLS), can be used to reduce the dimension of the microarray data before a certain classifier is used. For example, Nguyen et al. (2002) proposed an analysis procedure to predict tumor samples. The procedure first reduces the dimension of the microarray data using PLS and then applies logistic discrimination (LD) or quadratic discriminant analysis (QDA). See also West et al. (2001) and Huang et al. (2003).
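The reduce-then-classify pattern can be illustrated end to end with a tiny PCA (via power iteration) followed by a nearest-centroid rule standing in for the LD/QDA step. The data, dimensions, and the substitution of PCA for PLS are all simplifications for the example, not the cited procedure:

```python
def first_pc(X, iters=100):
    """Leading principal component of row-samples X via power iteration."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    C = [[row[j] - means[j] for j in range(p)] for row in X]  # centered data
    v = [1.0] * p
    for _ in range(iters):
        # w = C^T C v (covariance direction, up to scaling), then normalize
        s = [sum(r[j] * v[j] for j in range(p)) for r in C]
        w = [sum(s[i] * C[i][j] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v, means

def project(x, v, means):
    return sum((xj - m) * vj for xj, m, vj in zip(x, means, v))

# Tiny invented "expression" matrix: 4 samples x 3 genes, two classes.
X = [[1.0, 1.0, 0.0], [1.2, 0.9, 0.1],   # class 0
     [4.0, 4.1, 0.0], [3.8, 4.0, 0.1]]   # class 1
y = [0, 0, 1, 1]
v, means = first_pc(X)
scores = [project(row, v, means) for row in X]
# Class centroids in the 1-D reduced space (each class has 2 samples).
centroids = {c: sum(s for s, yi in zip(scores, y) if yi == c) / 2 for c in (0, 1)}
new = [3.9, 4.0, 0.05]  # unseen sample resembling class 1
label = min(centroids, key=lambda c: abs(project(new, v, means) - centroids[c]))
```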
Joint Data Mining of Microarray, Gene Ontology and TFBS Data

Combining microarray data with other available data, such as DNA sequence data and gene ontology data, has attracted lots of attention in recent years. From a systems biology view, combining different data types provides added power compared to looking at microarray data alone. This is still a very active research area for data mining, and we review some of the current work in two aspects: combining with gene ontology databases and combining with TFBS databases.

Combining microarray data with gene ontology data is important in interpreting the microarray data. The Expression Analysis Systematic Explorer (EASE, http://david.niaid.nih.gov/david/ease.htm), developed by Hosack, Dennis, Sherman, Lane, and Lempicki (2003), is a customizable, standalone software application that facilitates the biological interpretation of gene lists derived from the results of microarray, proteomics, and SAGE experiments. EASE can generate the annotations of a list of genes in one shot, automatically link to online analysis tools, and provide statistical methods for discovering enriched biological themes within gene lists. Blalock et al. (2004) studied the gene expression of Alzheimer's disease (AD) and, using EASE, found some interesting over-represented gene categories among the regulated genes. The findings suggest a new model of AD pathogenesis in which a genomically orchestrated up-regulation of tumor suppressor-mediated differentiation and involution processes induces the spread of pathology along myelinated axons.

Transcription factor binding sites are essential elements in gene regulatory networks. Direct experimental identification of the corresponding transcription factors is not practical or efficient in many situations. Conlon, Liu, Lieb, and Liu (2003) provided a regression-based method to integrate microarray data and transcription factor binding site (TFBS) patterns. Curran, Liu, Long, and Ge (2004) provided a logistic regression approach to jointly mining an internal microarray database and a corresponding TFBS database (http://transfac.gbf.de/).

FUTURE TRENDS

As more and more new technology platforms are introduced for medical research, it is imaginable that data will continue to grow. For example, Affymetrix recently introduced its SNP chip, which contains 100,000 human SNPs. If such a technology were applied to a clinical trial with 10,000 subjects, the SNP data alone would be a 10,000-by-100,000 table, in addition to data from other potential technologies such as proteomics, metabonomics, and bio-imaging.

Associated with the growth of data will be the increasing need for effective data management and data integration. More efficient data retrieval systems will be needed, as well as systems that can accommodate large-scale and diversified data.

Besides the growth of data management systems, it is foreseeable that integrated data analysis will become more and more routine. At this point, most data analysis software is only capable of analyzing small-scale, isolated data sets. So, the challenges to the informatics and statistics fields will continue to grow dramatically.

CONCLUSION

Microarray technology has generated a vast amount of data for very interesting data mining research. However, as the central dogma indicates, microarray only
provides one snapshot of the biological system. Sequencing, proteomics technology, and metabolite profiling provide different views of the biological system, and it will remain a challenge to deal with the joint mining of data from such diverse technology platforms.

There are significant challenges with respect to the development of data mining technology to deal with data generated from different technology platforms. The challenges of dealing with the combined data will be even more difficult. Besides the methodological challenges, how to organize the data, how to develop standards for data generated from different platforms, and how to develop common terminologies all remain to be answered.

REFERENCES

Alizadeh, A.A. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503-511.

Blalock, E.M., Geddes, J.W., Chen, K.C., Porter, N.M., Markesbery, W.R., & Landfield, P.W. (2004). Incipient Alzheimer's disease: Microarray correlation analyses

Durbin, B.P., Hardin, J.S., Hawkins, D.M., & Rocke, D.M. (2002). A variance-stabilizing transformation of gene expression microarray data. Bioinformatics, 18(Supplement 1), S105-S110.

Eisen, M.B., Spellman, P.T., Brown, P.O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. PNAS, 95, 14863-14868.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.

Hosack, D.A., Dennis, G. Jr., Sherman, B.T., Lane, H.C., & Lempicki, R.A. (2003). Identifying biological themes within lists of genes with EASE. Genome Biology, 4(9), R60.

Huang, E., Cheng, S.H., Dressman, H., Pittman, J., Tsou, M.H., Horng, C.F., et al. (2003). Gene expression predictors of breast cancer outcomes. Lancet, 361, 1590-1596.

Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., et al. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2), 249-264.
reveal major transcriptional and tumor suppressor re-
sponses. PNAS, 101(7), 2173-8. Lee, H.K., Hsu, A.K., Sajdak, J., Qin, J., & Pavlidis, P. (2004).
Coexpression analysis of human genes across many microarray
Bolstad, B.M., Irizarry, R.A., Astrand, M., & Speed, T.P. data sets. Genome Res, 14(6), 1085-1094.
(2003). A comparison of normalization methods for
high density oligonucleotide array data based on bias Li, C., & Wong, W.H. (2001). Model-based analysis of
and variance. Bioinformatics, 19(2), 185-193. oligonucleotide arrays: Expression index computation
and outlier detection. PNAS, 98(1), 31-36.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D.,
Brown, P.O. et al. (1998). The transcriptional program of Liu, L., Hawkins, D.M., Ghosh, S., & Young, S.S. (2003).
sporulation in budding yeast. Science, 282(5389), 699- Robust singular value decomposition analysis of
705. microarray data. PNAS, 100(23), 13167-13172.
Conlon, E.M., Liu, X.S., Lieb, J.D., & Liu, J.S. (2003). The International Human Genome Mapping Consortium
Integrating regulatory motif discovery and genome- (IHGMC). (2001). A physical map of the human ge-
wide expression analysis. PNAS, 100(6), 3339-3344. nome. Nature, 409(6822), 934-941.
Curran, M., Liu, H., Long, F., & Ge, N. (2003). Statisti- Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees,
cal methods for joint data mining of gene expression and C., Spellman, P., et al. (2000). Systematic variation in
DNA sequence database. SIGKDD Explorations Spe- gene expression patterns in human cancer cell lines.
cial Issue on Microarray Data Mining, 5, 122-129. Nature Genetics, 24, 227-234.
Dudoit, S., Fridlyand, J., & Speed, T.P. (2002). Com- Venter, J. et al. (2001). The human genome. Science,
parison of discrimination methods for the classifica- 291(5507), 1304-51.
tion of tumors using gene expression data. Journal of West, M., Blanchette, C., Dressman, H., Huang, E.,
the American Statistical Association, 97(457), 77-87. Ishida, S., Spang, R., et al. (2001). Predicting the clini-
Dudoit, S., Yang, Y.H., Callow, M.J., & Speed, T.P. cal status of human breast cancer by using gene expres-
(2002). Statistical methods for identifying genes with sion profiles. PNAS, 98, 11462-11467.
differential expression in replicated cDNA microarray
experiments. Statistical Sinica, 12(1), 111-139.
Mining Quantitative and Fuzzy Association Rules

Susumu Horiguchi
Tohoku University, Japan
Mining Quantitative Association Rules

One proposed method is an adaptive discretization of numerical values that considers both the value density and the value distance of numerical attributes and produces better-quality intervals than the two classical methods (Li et al., 1999). This method repeatedly selects and merges the pair of adjacent intervals with the minimum difference until a given criterion is met. It requires quadratic time in the number of attribute values, because each interval initially contains only a single value, and in the worst case all the intervals may be merged into one large interval containing all the values. A linear-scan merging method for quantizing numeric attribute values, implementable in linear time sequentially and at linear cost in parallel, was also proposed (Shen, 2001). This method takes the maximal intrainterval distance and load balancing into consideration simultaneously to improve the quality of the merging. In comparison with existing results in the same category, the algorithm achieves a linear-time speedup.

Suppose that a numerical attribute has m distinct values, I = {x_0, x_1, ..., x_{m-1}}, where attribute value x_i has n_i occurrences in the database (its weight). Let N = \sum_{i=0}^{m-1} n_i be the total number of attribute value occurrences, called instances. Without loss of generality, we further assume that x_i < x_{i+1} for all 0 <= i <= m-2 (otherwise, we can simply sort these values). Define P to be a set of maximal disjoint intervals on I, where interval I_u in P contains the consecutive values x_u, ..., x_{v-1}, v is the index of the first value of the next interval after I_u in P, and 0 <= u < v <= m. We also assume that I_u has a representative center, c_u. Initially, I_u contains only x_u, which is also its representative center.

We define the maximal intrainterval distance of I_u, denoted D*(I_u; c_u), as follows:

D*(I_u; c_u) = \max_{x_i \in I_u} |x_i - c_u|

Assume that two adjacent intervals, I_u = {x_u, ..., x_{v-1}} and I_v = {x_v, ..., x_{w-1}}, contain N_u = \sum_{i=u}^{v-1} n_i and N_v = \sum_{i=v}^{w-1} n_i attribute value occurrences, have representative centers c_u = \sum_{i=u}^{v-1} x_i n_i / N_u and c_v = \sum_{i=v}^{w-1} x_i n_i / N_v, respectively, and 0 <= u < v < w <= m. Their union, I_u = I_u ∪ I_v, contains (v-u) + (w-v) = w-u attribute values and N_u + N_v = \sum_{i=u}^{w-1} n_i instances, and thus has a representative center given by the weighted average of (c_u, N_u) and (c_v, N_v):

c_u = (\sum_{i=u}^{v-1} x_i n_i + \sum_{i=v}^{w-1} x_i n_i) / \sum_{i=u}^{w-1} n_i = (c_u N_u + c_v N_v) / (N_u + N_v)

An optimal interval merge scheme produces a minimum number of intervals whose maximal intrainterval distances are each within a given threshold and whose populations are as equal as possible. Assume that the threshold on the maximal intrainterval distance is d, which can be the average of the interinterval differences of all adjacent interval pairs or can be given by the system. For k intervals, let the average population (support) of each interval be \bar{N}_k = N/k, and let the population deviation of interval I_u be δ_u = |N_u - \bar{N}_k|, where N_u is the actual population of I_u. Initially, I_u = {x_u} for 0 <= u <= m-1. Our strategy leads to the following algorithm for interval merging:

1. Partition {I_0, I_1, ..., I_{m-1}} into a minimum number of intervals such that each interval has a maximal intrainterval distance not greater than d.
2. Assume that Step 1 produces k intervals {I_{u_0}, I_{u_1}, ..., I_{u_{k-1}}}, with 0 = u_0 < u_1 < ... < u_{k-1} < m-1. For I_{u_j} = [X_{u_j}; c_{u_j}], where X_{u_j} = {x_{u_j}, x_{u_j + 1}, ..., x_{u_{j+1} - 1}} and c_{u_j} is the representative center of I_{u_j}, check whether moving boundary instances to the next interval would improve the load balance while preserving the maximal intrainterval distance property, and do so if it will.

Noticing that x_0 < x_1 < ... < x_{m-1}, we can implement Step 1 simply by using a linear scan that forms the appropriate segments of intervals in a single pass. Starting from I_0, merge I_u with I_{u+j} for j = 1, 2, ..., until the next merge would make I_u's maximal intrainterval distance greater than the threshold; continue this process until no interval remains to be merged. This process requires O(m) time.

Step 2 examines every adjacent pair of intervals after the merge, requiring at most m-1 steps. Each step checks the change in population deviation caused by moving the n_{u_{j+1} - 1} instances of the boundary value x_{u_{j+1} - 1} from I_{u_j} to I_{u_{j+1}}; that is, it considers whether the following condition holds:

δ'_{u_j} + δ'_{u_{j+1}} < δ_{u_j} + δ_{u_{j+1}},

where δ'_{u_j} = |N_{u_j} - n_{u_{j+1} - 1} - \bar{N}_k| and δ'_{u_{j+1}} = |N_{u_{j+1}} + n_{u_{j+1} - 1} - \bar{N}_k| are the deviations after the move.
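As an illustration, the single-pass scan of Step 1 can be sketched as follows. This is our own reading of the linear scan described above, not the original implementation of Shen (2001); it applies the weighted-center merge formula and exploits the fact that, for sorted values, the maximal intrainterval distance is reached at one of the two extremes:

```python
def merge_intervals(values, counts, d):
    """Step 1 of the linear-scan merge: partition sorted distinct values
    (with occurrence counts) into intervals whose maximal intrainterval
    distance max|x_i - c| stays within the threshold d.
    Returns (start, end, center) triples with an exclusive end index."""
    intervals = []
    start, n_sum, c = 0, counts[0], float(values[0])
    for i in range(1, len(values)):
        # Representative center of the tentative merge, by the
        # weighted-average formula c' = (c*N + x_i*n_i) / (N + n_i).
        new_sum = n_sum + counts[i]
        new_c = (c * n_sum + values[i] * counts[i]) / new_sum
        # Values are sorted, so the merged interval's maximal
        # intrainterval distance is attained at an extreme value.
        if max(new_c - values[start], values[i] - new_c) > d:
            intervals.append((start, i, c))  # close the current interval
            start, n_sum, c = i, counts[i], float(values[i])
        else:
            n_sum, c = new_sum, new_c
    intervals.append((start, len(values), c))
    return intervals

print(merge_intervals([1, 2, 3, 10, 11], [1, 1, 1, 1, 1], d=2))
# -> [(0, 3, 2.0), (3, 5, 10.5)]
```

Because the center shifts as values are absorbed, this greedy scan only approximates the optimal scheme; the balancing pass of Step 2 is omitted for brevity.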
Do the move only when the condition holds. Clearly, this step requires O(m) time as well. It was shown that the above linear-scan implementation indeed produces the minimum number of intervals satisfying the intrainterval distance property.

Mining Fuzzy Association Rules

Let D be a relational database with I = {i_1, ..., i_n} as the set of n attributes and T = {t_1, ..., t_m} as the set of m records. Using a fuzzification method, we associate each attribute i_u with a set of fuzzy sets F_{i_u} = {f_{i_u}^1, ..., f_{i_u}^k}. A fuzzy association rule (Gyenesei, 2001; Kuok et al., 1998) is an implication of the form

X is A => Y is B,

where X, Y ⊆ I are disjoint frequent itemsets, and A and B are sets of disjoint fuzzy sets corresponding to the attributes in X and Y: A = {f_{x_1}, ..., f_{x_p}}, B = {f_{y_1}, ..., f_{y_q}}, with f_{x_i} in F_{x_i} and f_{y_j} in F_{y_j}.

A fuzzy itemset is now defined as a pair <X, A> of an itemset X and a set A of fuzzy sets associated with the attributes in X. Its support factor, denoted fs(<X, A>), is determined by the formula

fs(<X, A>) = \sum_{v=1}^{m} {α_{x_1}(t_v[x_1]) ⊗ ... ⊗ α_{x_p}(t_v[x_p])} / |T|,

where X = {x_1, ..., x_p}, t_v is the v-th record in T, ⊗ is the T-norm operator of fuzzy logic theory, and α_{x_u}(t_v[x_u]) is calculated as

α_{x_u}(t_v[x_u]) = m_{x_u}(t_v[x_u]) if m_{x_u}(t_v[x_u]) >= w_{x_u}, and 0 otherwise,

where m_{x_u} is the membership function of fuzzy set f_{x_u}, and w_{x_u} is a user-specified threshold on the membership function m_{x_u}.

Each fuzzy attribute is a pair of an original attribute name and a fuzzy set name. We require that no fuzzy association rule contain two fuzzy attributes sharing a common original attribute in I. For example, the rule Age_Old ∧ Cholesterol_High ∧ Age_Young => HeartDisease_Yes is invalid because it contains Age_Old and Age_Young, both of which derive from the common original attribute Age.

There are two main reasons for this restriction. First, fuzzy attributes sharing a common original attribute are usually mutually exclusive in meaning, so they would largely reduce the support of rules in which they appear together. Second, such a rule would not be worthwhile and would carry little meaning. Hence, we can treat all fuzzy attributes in the same rule as independent, in the sense that no pair of fuzzy attributes with identical original attributes exists. This observation is the foundation of our new parallel algorithm.

The idea of the partitioning algorithm is to divide the original set of fuzzy attributes into separate parts (one per processor), so that every part retains at least one fuzzy attribute for each original attribute. To do so, we divide according to several original attributes, so that the number of fuzzy attributes removed at each processor is maximized. After dividing the set of all fuzzy attributes among the parallel processors, we can use any traditional algorithm, such as Apriori or CHARM, to mine the local association rules. Finally, the local results at the processors are gathered to constitute the overall result. Space limitations prevent a detailed discussion of the partitioning algorithm (FDivision) and the parallel algorithm (PFAR) (Phan & Horiguchi, 2004b).

The PFAR algorithm was implemented using the MPI standard on an SP2 parallel system with a total of 64 nodes. Each node consists of four 322MHz PowerPC 604e processors, 512 MB of local memory, and 9.1 GB of local disk. In the experiments, we used only 24 processors (PEs) on 24 nodes. The test data included synthetic data and real-world heart disease databases (created by George John, 1994; statlog-adm@ncc.up.pt, bob@stams.strathclyde.ac.uk). The experimental results showed that the performance of PFAR is satisfactory (Phan & Horiguchi, 2004a).

FUTURE TRENDS

Quantitative and fuzzy association rules are two important types of association rules that arise in many real-life applications. Future research trends include mining them in data with structural properties such as sequences, time series, and spatiality. Mining these types of rules in multidimensional databases is also a challenging problem. These complex data mining tasks require techniques not only for data mining but also for data representation, reduction, transformation, and visualization. Completing these tasks depends on the successful integration of all the relevant techniques and on their effective application.
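As a small illustration of the fuzzy support factor fs(<X, A>) defined earlier: the sketch below uses min as the T-norm ⊗ and thresholds each membership value as in the definition of α. The membership functions, thresholds, and records are invented for the example; none of the names come from the original algorithms.

```python
def alpha(m, w, value):
    """Thresholded membership: m(value) if it reaches the threshold w, else 0."""
    mv = m(value)
    return mv if mv >= w else 0.0

def fuzzy_support(records, items):
    """Support factor of a fuzzy itemset <X, A>.

    records: list of dicts, the transactions t_v in T
    items:   list of (attribute, membership_function, threshold) triples
    Uses min as the T-norm operator."""
    total = sum(min(alpha(m, w, t[attr]) for attr, m, w in items)
                for t in records)
    return total / len(records)

# Invented example: fuzzy sets Age_Old and Cholesterol_High as linear ramps.
age_old = lambda a: min(1.0, max(0.0, (a - 50) / 20))     # 0 below 50, 1 above 70
chol_high = lambda c: min(1.0, max(0.0, (c - 200) / 40))  # 0 below 200, 1 above 240
records = [{"Age": 70, "Chol": 240}, {"Age": 55, "Chol": 210}]
support = fuzzy_support(records, [("Age", age_old, 0.1), ("Chol", chol_high, 0.1)])
# support == (min(1.0, 1.0) + min(0.25, 0.25)) / 2 == 0.625
```

Any other T-norm (e.g., the product) could be substituted for min without changing the structure of the computation.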
REFERENCES

Fukuda, T., Morimoto, Y., Morishita, S., & Tokuyama, T. (1996). Mining optimized association rules for numeric attributes. Proceedings of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 182-191).

Zaki, M.J., Parthasarathy, S., & Ogihara, M. (2001). Parallel data mining for association rules on shared-memory systems. Knowledge and Information Systems, 3(1), 1-29.
Model Identification through Data Mining
It directly provides a logically understandable expression (Muselli & Liberati, 2002), which is the final synthesized function, directly expressed as the OR of ANDs of the salient variables, possibly negated.

When the variables are selected, a mathematical model of the underlying generating framework still must be produced. At this point, a first hypothesis of linearity may be investigated (usually being only a very rough approximation where the values of the variables are not close to the functioning point around which the linear approximation is computed).

Building a non-linear model is far from easy; the structure of the non-linearity needs to be known a priori, which is not usually the case. A typical approach consists of exploiting a priori knowledge, when available, to define a tentative structure, then refining and modifying it on the training subset of the data, and finally retaining the structure that best fits a cross-validation on the testing subset of the data. The problem is even more complex when the collected data exhibit hybrid dynamics (i.e., their evolution in time is a sequence of smooth behaviors and abrupt changes).

An alternative approach is to infer the model directly from the data, without a priori knowledge, via an identification algorithm capable of reconstructing a very general class of piece-wise affine models (Ferrari-Trecate et al., 2003). This method can also be exploited for the data-driven modeling of hybrid dynamic systems, where logic phenomena interact with the evolution of continuous-valued variables. Such an approach will be described concisely later, after a more detailed outline of the rules-oriented mining procedure, and some applications will be discussed briefly.

Binary Rule Generation and Variable Selection While Mining Data

The approach followed by Hamming clustering in mining the available data to select the salient variables and to build the desired set of rules consists of the three steps in Table 1.

Table 1. The three steps executed by Hamming clustering to build the set of rules embedded in the mined data

1. The input variables are converted into binary strings via a coding designed to preserve distance and, if relevant, ordering.
2. The OR-of-ANDs expression of a logical function is derived from the training examples coded in the binary form of step 1.
3. In the final OR expression, each logical AND provides intelligible conjunctions or disjunctions of the involved variables, ruling the analyzed problem.

Step 1: A critical issue is the partition of a possibly continuous range into intervals, whose number and limits may affect the final result. The thermometer code may then be used to preserve ordering and distance (for nominal input variables, for which a natural ordering cannot be defined, the only-one code is adopted instead). The simple metric used is the Hamming distance, computed as the number of differing bits between binary strings. In this way, the training process does not require floating-point computation but only basic logic operations, which is one reason for the algorithm's speed and for its insensitivity to precision.

Step 2: Classical techniques of logical synthesis are specifically designed to obtain the simplest AND-OR expression able to satisfy all the available input-output pairs, without an explicit attitude toward generalization. To generalize and infer the underlying rules, at every iteration Hamming clustering groups together, in a competitive way, binary strings that have the same output and are close to each other. A final pruning phase simplifies the resulting expression, further improving its generalization ability. Moreover, the minimization of the involved variables intrinsically excludes the redundant ones, thus highlighting the truly salient variables for the investigated problem. The low (quadratic) computational cost allows quite large datasets to be managed.

Step 3: Each logical product directly provides an intelligible rule, synthesizing a relevant aspect of the underlying system that is believed to generate the available samples.

Identification of Piece-wise Affine Systems Through a Clustering Technique

Once the salient variables have been selected, it may be of interest to capture a model of their dynamical interaction. Piece-wise affine identification exploits K-means clustering to associate data points in a multivariable space in such a way as to jointly determine a sequence of linear submodels and their respective regions of operation, without even imposing continuity at each change in the derivative. In order to obtain such a result, the five steps reported in Table 2 are executed.

Step 1: The model is locally linear; small sets of data points close to each other likely belong to the same submodel. For each data point, a local set is built,
collecting the selected point together with a given number of its neighbors (whose cardinality is one of the parameters of the algorithm). Each local set will be pure if made of points really belonging to the same single linear subsystem; otherwise, it is mixed.

Step 2: For each local dataset, a linear model is identified through a usual least-squares procedure. Pure sets belonging to the same submodel give similar parameter sets, while mixed sets yield isolated vectors of coefficients that look like outliers in the parameter space. If the signal-to-noise ratio is good enough, and if there are not too many mixed sets (i.e., the number of data points is more than the number of submodels to be identified, and the sampling is fair in every region), then the vectors will cluster in the parameter space around the values pertaining to each submodel, apart from a few outliers.

Step 3: A modified version of the classical K-means, whose convergence is guaranteed in a finite number of steps (Ferrari-Trecate et al., 2003), takes into account the confidence in pure and mixed local sets in order to cluster the parameter vectors.

Step 4: Data points are then classified: each local dataset is one-to-one related to its generating data point, which is classified according to the cluster to which its parameter vector belongs.

Step 5: Both the linear submodels and their regions are estimated from the data in each subset. The coefficients are estimated via weighted least squares, taking the confidence measures into account. The shape of the polyhedral region characterizing the domain of each model may be obtained via linear support vector machines (Vapnik, 1998), easily solved via linear/quadratic programming.

A Few Applications

The field of application of the proposed approach is intrinsically wide (both tools are most general and quite powerful, especially if combined). Here, only a few suggestions will be drawn, with reference to already obtained results or to some applications under consideration for ongoing research projects. The field of life science plays a central role because of its relevance to science and to society.

A topic of growing international relevance, to which the described approaches are being used to provide a contribution, is the so-called field of systems biology: a feedback model of how proteins interact with each other and with nucleic acids within the cell, needed to better understand the control mechanisms of the cellular cycle, especially with respect to duplication (as in cancer, when such mechanisms get out of control). Such an understanding will hopefully encourage a drive toward personalized therapy, in which each subject's gene expression will be correlated to the amount of the corresponding proteins involved in the cellular cycle. Moreover, a new computational paradigm could arise by exploiting biological components like cells instead of the usual silicon hardware, thus overcoming some technological issues and possibly facilitating neuroinformatics.

The study of systems biology, a leading edge of the large field of bioinformatics, begins by analyzing data from so-called micro-arrays. These are small standard chips from which thousands of gene expressions may be obtained from the same cell material, thus providing a large amount of data that cannot conceivably be handled with the usual deterministic approaches, whose fault is their inability to obtain significant synthetic information. Thus, matrices of as many subjects as available, possibly grouped in homogeneous categories for supervised training, each one carrying thousands of gene expressions, are the natural input to algorithms like Hamming clustering. The desired output is a set of rules able to classify, for instance, patients affected by different tumors versus healthy subjects, on the basis of a few identified genes, whose set is the candidate basis for the piece-wise linear model describing their complex interaction in such a particular class of subjects.

Even without deeper insight into the cell, the identification of prognostic factors in oncology is already occurring with Hamming clustering (Paoli et al., 2000), which also provides a hint about their interaction, something not explicit in the outcome of a simple neural network (Drago et al., 2002).
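As a rough illustration (not the authors' implementation), the five steps summarized in Table 2 can be sketched for a one-dimensional piece-wise affine function. A plain, deterministically initialized K-means stands in for the modified, confidence-weighted version of Ferrari-Trecate et al. (2003), region estimation via support vector machines is omitted, and all names are ours:

```python
import numpy as np

def piecewise_affine_fit(x, y, n_neighbors=6, n_models=2, iters=20):
    """Sketch of piece-wise affine identification on scalar data."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    # Steps 1-2: build a local dataset around each point and fit a local
    # affine model y ~ a*x + b by least squares.
    params = np.empty((len(x), 2))
    for i in range(len(x)):
        idx = np.argsort(np.abs(x - x[i]))[:n_neighbors]
        A = np.column_stack([x[idx], np.ones(len(idx))])
        params[i] = np.linalg.lstsq(A, y[idx], rcond=None)[0]
    # Step 3: cluster the parameter vectors (plain K-means; deterministic
    # initialization spread over the samples, for reproducibility).
    centers = params[np.linspace(0, len(x) - 1, n_models).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((params[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for k in range(n_models):
            if np.any(labels == k):
                centers[k] = params[labels == k].mean(axis=0)
    # Steps 4-5: classify each data point via its parameter vector and
    # re-estimate one affine submodel per cluster from its own points.
    models = []
    for k in range(n_models):
        idx = np.where(labels == k)[0]
        A = np.column_stack([x[idx], np.ones(len(idx))])
        models.append(tuple(np.linalg.lstsq(A, y[idx], rcond=None)[0]))
    return labels, models

# Two well-separated affine regimes: y = 2x on [0, 1], y = -3x + 20 on [5, 6].
x = np.concatenate([np.linspace(0, 1, 10), np.linspace(5, 6, 10)])
y = np.where(x < 3, 2 * x, -3 * x + 20)
labels, models = piecewise_affine_fit(x, y)
```

On such well-separated data every local set is pure, so the recovered slopes and intercepts match the generating submodels; near a switching point, mixed local sets produce the outlying parameter vectors that the confidence-weighted clustering of the original method is designed to handle.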
Table 2. The five steps for piece-wise affine identification

1. The local datasets neighboring each sample are built.
2. The local linear models are identified through least squares.
3. The parameter vectors are clustered through a modified K-means.
4. The data points are classified according to the clustered parameter vectors.
5. The submodels are estimated together with their domains.

FUTURE TRENDS

Besides improvements in the algorithmic approaches (including the possibility of taking potential a priori knowledge into account), a wide range of applications will benefit from the proposed ideas in the near future; some of them are outlined here and partially recalled in Table 3.
Drug design would benefit from the a priori forecasting, provided by Hamming clustering, of the hydrophilic behavior of a not-yet-tested pharmacological molecule, on the basis of the known properties of some possible radicals, in the track of Manga et al. (2003) and within the frame of computational methods for the prediction of drug-likeness (Clark & Pickett, 2000). In principle, such in silico predictions, not so different from the systems biology expectation, are of paramount relevance to pharmaceutical companies seeking to save money in designing minimally invasive drugs controlling kidney metabolism, which have less of an effect on the liver the more hydrophilic they are.

The compartmental behavior of drugs may then be analysed via piece-wise identification by collecting in vivo data samples and clustering them within the more active compartment at each stage, instead of via classical linear system identification, which requires non-linear algorithms. The same is true for metabolism, such as glucose in diabetes or urea in renal failure. Dialysis can be modeled as a compartmental process in a sufficiently accurate way. A piece-wise linear approach is able to simplify the identification even for a single patient, when population data are not available to allow a simple linear deconvolution approach (Liberati & Turkheimer, 1999), whose result is only an average description of the overall process, without taking the specific subject under analysis into special account. Moreover, compartmental models are pervasive, as in ecology, wildlife, and population studies; potential applications in that direction are almost never-ending.

Many physiological processes switch, naturally or intentionally, between an active and a quiescent state, like hormone pulses, whose identification (Sartorio et al., 2002) is important in growth and fertility diseases as well as in doping assessment. In that respect, the most fascinating human organ is probably the brain, whose study may be undertaken today either in the sophisticated frame of functional nuclear magnetic imaging or in the simpler setting of EEG recording. Multidimensional (three space variables plus time) images or multivariate time series provide an abundance of raw data to mine in order to understand which kind of activation is produced in correspondence with an event (Maieron et al., 2002) or a decision. A brain-computer interfacing device may then be built that is able to reproduce one's intention to perform a move (not directly possible for the subject in some impaired physical conditions) and command a proper actuator. Both Hamming clustering and piece-wise affine identification would improve the only partially successful approach based on artificial neural networks (Babiloni et al., 2000). A simple drowsiness detector based on the EEG may be designed, as well as a flexible anaesthesia/hypotension level detector, without needing a time-varying, more precise, but more costly identification. A psychological stress indicator may be inferred, outperforming (Pagani et al., 1991) the multivariate analysis (Liberati et al., 1997); the latter, possibly taking into account the input stimulation so useful in approaching difficult neurological tasks, such as modeling electroencephalographic coherence in Alzheimer's disease patients (Locatelli et al., 1998) or non-linear effects in muscle contraction (Orizio et al., 1996), would be outperformed by piece-wise linear identification, even in time-varying contexts like epilepsy.

Industrial applications, of course, are not excluded from the field of possible interest. In Ferrari-Trecate et al. (2003), for instance, the classification and identification of the dynamics of an industrial transformer are performed via the piece-wise approach, at a much lower cost and with no really significant reduction in performance with respect to the non-linear modeling described in Bittanti et al. (2001).
Table 3. Some fields of application

Systems Biology and Bioinformatics: To identify the interaction of the main proteins involved in the cell cycle.
Drug Design: To forecast desired and undesired behavior of the final molecule from its components.
Compartmental Modeling: To identify the number, dimensions, and exchange rates of communicating reservoirs.
Hormone Pulse Detection: To detect true pulses among the stochastic oscillations of the baseline.
Sleep Detector: To forecast drowsiness as well as to identify sleep stages and transitions.
Stress Detector: To identify the psychological state of a subject from multivariate analysis of biological signals.
Prognostic Factors: To identify the interaction between selected features (e.g., in oncology).
Pharmacokinetics and Metabolism: To identify diffusion and metabolic time constants from time series sampled in blood.
Switching Processes: To identify abrupt or possibly smoother commutations within the process duty cycle.
Brain-Computer Interfacing: To identify a decision taken within the brain and propagate it directly from brain waves.
Anesthesia Detector: To monitor the level of anesthesia and provide closed-loop control.
Industrial Applications: A wide spectrum of logical or dynamic-logical hybrid problems may be faced (e.g., the tracking of a transformer).
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

KEY TERMS

Bioinformatics: The processing of the huge amount of information pertaining to biology.

Hamming Clustering: A fast binary rule generator and variable selector able to build understandable logical expressions by analyzing the Hamming distance between samples.

Hybrid Systems: Systems whose evolution in time is composed of both smooth dynamics and sudden jumps.

Micro-Arrays: Chips where thousands of gene expressions may be obtained from the same biological cell material.

Model Identification: Definition of the structure, and computation of the parameters, of the model best suited to mathematically describe the process underlying the data.

Rule Generation: The extraction from the data of the embedded synthetic logical description of their relationships.

Salient Variables: The real players among the many apparently involved in the true core of a complex business.

Systems Biology: The quest for a mathematical model of the feedback regulation of proteins and nucleic acids within the cell.
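As a small illustration of the thermometer and only-one codes and the Hamming metric used in Step 1 of the rule-generation procedure (function names are ours):

```python
# Thermometer coding of an ordered, discretized variable: interval index k
# out of n maps to k ones followed by n-k zeros, so the Hamming distance
# between two codes equals the distance between their interval indices.
# For nominal variables, the only-one (one-hot) code is used instead.

def thermometer(k, n):
    return [1] * k + [0] * (n - k)

def only_one(k, n):
    return [int(i == k) for i in range(n)]

def hamming(a, b):
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

codes = [thermometer(k, 4) for k in range(5)]
assert hamming(codes[1], codes[3]) == 2          # ordering and distance preserved
assert hamming(only_one(0, 4), only_one(3, 4)) == 2  # nominal: all pairs equidistant
```

Counting differing bits needs only logical operations, which is the source of the speed and precision-insensitivity noted above.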
Modeling Web-Based Data in a Data Warehouse

Charles Greenidge
University of the West Indies, Barbados

INTRODUCTION

BACKGROUND
Details of data warehousing/warehouse issues are provided in Berson and Smith (1997), Bischoff and Yevich (1996), Devlin (1998), Hackathorn (1995, 1998), Inmon (2002), Mattison (1996), Rob and Coronel (2002), Becker (2002), Kimball and Ross (2002), and Lujan-Mora and Trujillo (2003).

...acted on by the search engine. Results coming from the search engine can also be processed prior to being relayed to the warehouse. This intermediate, independent meta-data bridge is an important concept in this model.
[Figure: data flow of the search engine module. Recoverable labels: submit query to major engines; download results; parse HTML files; store links; follow seed links; invoke meta-data engine on results; user filter; final download of query results.]
[Figure: the meta-data engine. Recoverable labels: analyze results from the search engine (S.E.); format and handle results; provide results to the data warehouse (D.W.).]
ing: Does every page containing the word Java specifically address Java programming issues or tools?

The Need for Data Synchronization

Data synchronization is an important issue in data warehousing. Data warehouses are designed to provide a consistent view of data emanating from several sources. The data from these sources must be combined, in a staging area outside of the warehouse, through a process of transformation and integration. To achieve this goal requires that the data drawn from the multiple sources be synchronized. Data synchronization can only be guaranteed by safeguarding the integrity of the data and ensuring that the timeliness of the data is preserved.

We propose that the warehouse transmit a request for external data to the search engine. Once the eventual results are obtained, they are left for the warehouse to gather as it completes a new refresh cycle. This process is not intended to provide an immediate response (as required by a Web surfer) but to lodge a request that will be answered at some convenient time in the future.

At their simplest, data originating externally may be synchronized by their date of retrieval as well as by system statistics such as last time of update, date of posting, and page expiration date. More sophisticated schemes could include verification of dates of origin, use of Internet archives, and automated analysis of content to establish dates. For example, online newspaper articles often carry the date of publication close to the headlines. An analysis of such a document's content would reveal a date field in a position adjacent to the heading for the story.

Analysis of the Model

The major strengths of our model are its:

• Independence;
• Value-added approach; and
• Flexibility and security.

This model exhibits (a) logical independence (ensures that each component of the model retains its integrity); (b) physical independence (allows summarized results of external data to be transferred via network connections to any remote physical machines); and (c) administrative independence (administration of each of the three components in the model is a specialized activity; hence, the need for three separate administrators).

In terms of its value-added approach, the model maximizes the utility of acquired data by adding value to this data long after their day-to-day usefulness has expired. The ability to add pertinent external data to volumes of aging internal data extends the usefulness of the internal data. With careful selection of external data, organizations with data warehouses can reap disproportionate benefits over competitors who do not use their historical or external data to maximum effect.

Our model also allows for flexibility. By allowing development to take place in three languages (the query language SQL for the data warehouse, Perl for the meta-data engine, and Java for the search engine), the strengths of each can be utilized effectively. By allowing flexibility in the actual implementation of the model, we can increase efficiency and decrease development time.

Our model allows for security and integrity to be maintained. The warehouse need not expose itself to dangerous integrity questions or to increasingly malicious threats on the Internet. By isolating the module that interfaces with the Internet from the module that carries vital internal data, the prospects for good security are improved.

Our model also relieves information overload. By using the meta-data engine to manipulate external data, we are indirectly removing time-consuming filtering activities from users of the data warehouse.

FUTURE TRENDS

It is our intention to incorporate portals (Hall, 1999; Raab, 2000) and vortals (Chakrabarti et al., 2001; Rupley, 2000) in our continued research in data warehousing. A goal of our model is to provide a subject-oriented bias of the warehouse when satisfying queries involving the identification and extraction of external data from the Internet.

The following three potential problems are still to be addressed in our model:

1. It relies on fast-changing search engine technologies;
2. It introduces more system overhead and the need for more administrators;
3. It results in the storage of large volumes of irrelevant/worthless information.

Other issues to be addressed by our model are:

1. The need to investigate how the search engine component can make smarter choices of which links to follow;
2. The need to test the full prototype in a live data warehouse environment;
3. Providing a more detailed comparison between our model and the traditional external data approaches;
4. Determining how the traditional Online Transaction Processing (OLTP) systems will funnel data into this model.
We are confident that further research with our model will provide the techniques for handling the aforementioned issues. One possible approach is to incorporate languages such as WebSQL (Mihaila, 1996) and Squeal (Spertus, 1999) in our model.

CONCLUSION

The model presented in this paper is an attempt to bring the data warehouse and search engine into cooperation to satisfy the external data requirements for the warehouse. However, the new model anticipates the difficulty of simply merging two conflicting architectures. The approach we have taken is the introduction of a third intermediate layer called the meta-data engine. The role of this engine is to coordinate data flows to and from the respective environments. The ability of this layer to filter extraneous data retrieved from the search engine, as well as compose domain-specific queries, will determine its success.

In our model, the data warehouse is seen as the agent that initiates the search for external data. Once a search has been started, the meta-data engine component takes over and eventually returns results. The model provides for querying of general-purpose search engines, but a simple configuration allows for subject-specific engines (vortals) to be used in the future. When specific engines designed for specific niches become widely available, the model stands ready to accommodate this change.

We believe that WebSQL and Squeal can be used in our model to perform the structured queries that will facilitate speedy identification of relevant documents. In so doing, we think that these tools will allow for faster development and testing of a prototype and, ultimately, a full-fledged system.

REFERENCES

Barquin, R., & Edelstein, H. (Eds.). (1997). Planning and designing the data warehouse. Upper Saddle River, NJ: Prentice-Hall.

Bauer, A., Hümmer, W., Lehner, W., & Schlesinger, L. (2002). A decathlon in multidimensional modeling: Open issues and some solutions. DaWaK, 274-285.

Becker, S.A. (Ed.). (2002). Data warehousing and Web engineering. Hershey, PA: IRM Press.

Berson, A., & Smith, S.J. (1997). Data warehousing, data mining and OLAP. New York: McGraw-Hill.

Bischoff, J., & Yevich, R. (1996). The superstore: Building more than a data warehouse. Database Programming and Design, 9(9), 220-229.

Chakrabarti, S. (2002). Mining the Web: Analysis of hypertext and semi-structured data. New York: Morgan Kaufmann.

Chakrabarti, S., van den Berg, M., & Dom, B. (2001). Focused crawling: A new approach to topic-specific Web resource discovery. Retrieved January 2004, from http://www8.org/w8-papers/5a-search-query/crawling/

Day, A. (2004). Data warehouses. American City and County, 119(1), 18.

Devlin, B. (1998). Meta-data: The warehouse atlas. DB2 Magazine, 3(1), 8-9.

Goodman, A. (2000). Searching for a better way. Retrieved July 25, 2002, from http://www.traffick.com

Hackathorn, R. (1995). Data warehousing energizes your enterprise. Datamation, 38-45.

Hackathorn, R. (1998). Web farming for the data warehouse. New York: Morgan Kaufmann.

Hall, C. (1999). Enterprise information portals: Hot air or hot technology [Electronic version]. Business Intelligence Advisor, 111(11).

Imhoff, C., Galemmo, N., & Geiger, J.G. (2003). Mastering data warehouse design: Relational and dimensional techniques. New York: John Wiley and Sons.

Inmon, W.H. (2002). Building the data warehouse. New York: John Wiley and Sons.

Inmon, W.H., Zachman, J.A., & Geiger, J.G. (1997). Data stores, data warehousing, and the Zachman framework: Managing enterprise knowledge. New York: McGraw-Hill.

Kimball, R. (1996). Dangerous preconceptions. Retrieved June 13, 2002, from http://pwp.starnetinc.com/larryg/index.html

Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modeling. New York: John Wiley and Sons.

Lujan-Mora, S., & Trujillo, J. (2003). A comprehensive method for data warehouse design. DMDW.

Marakas, G.M. (2003). Modern data warehousing, mining, and visualization: Core concepts. Upper Saddle River, NJ: Prentice-Hall College Division.
Mattison, R. (1996). Data warehousing: Strategies, technologies and techniques. New York: McGraw-Hill.

McElreath, J. (1995). Data warehouses: An architectural perspective. CA: Computer Sciences Corp.

Mihaila, G.A. (1996). WebSQL: An SQL-like query language for the World Wide Web [master's thesis]. University of Toronto.

Parsaye, K. (1996). Data mines for data warehouses. Database Programming and Design, 9(9).

Peralta, V., & Ruggia, R. (2003). Using design guidelines to improve data warehouse logical design. DMDW.

Pfaffenberger, B. (1996). Web search strategies. MIS Press.

Raab, D.M. (1999). Enterprise information portals [Electronic version]. Relationship Marketing Report.

Ray, E.J., Ray, D.S., & Seltzer, R. (1998). The Alta Vista search revolution. CA: Osborne/McGraw-Hill.

Rob, P., & Coronel, C. (2002). Database systems: Design, implementation, and management. New York: Thomson Learning.

Rupley, S. (2000). From portals to vortals. PC Magazine.

Sander-Beuermann, F., & Schomburg, M. (1998). Internet information retrieval: The further development of meta-search engine technology. Proceedings of the Internet Summit, Internet Society.

Schneider, M. (2003). Well-formed data warehouse structures, design and management of data warehouses 2003. Proceedings of the 5th International Workshop DMDW2003, Berlin, Germany.

Soudah, T. (2000). Search, and you shall find. Retrieved July 17, 2002, from http://searchenginesguides.com

Spertus, E., & Stein, L.A. (1999). Squeal: A structured query language for the Web. Retrieved August 10, 2002, from http://www9.org/w9cdrom/222/222.html

Strehlo, K. (1996). Data warehousing: Avoid planned obsolescence. Datamation, 38-45.

Sullivan, D. (2000). Search engines review chart. Retrieved June 10, 2002, from http://searchenginewatch.com

Zghal, H.B., Faiz, S., & Ghezala, H.B. (2003). Casme: A case tool for spatial data marts design and generation, design and management of data warehouses 2003. Proceedings of the 5th International Workshop DMDW2003, Berlin, Germany.

WEBSITES OF INTEREST

www.perl.com
www.cpan.com
www.xml.com
http://java.sun.com
http://semanticweb.org

KEY TERMS

Decision Support System (DSS): An interactive arrangement of computerized tools tailored to retrieve and display data regarding business problems and queries.

External Data: Data originating from other than the operational systems of a corporation.

Granular Data: Data representing the lowest level of detail that resides in the data warehouse.

Metadata: Data about data; in the data warehouse, it describes the contents of the data warehouse.

Operational Data: Data used to support the daily processing a company does.

Refresh Cycle: The frequency with which the data warehouse is updated (e.g., once a week).

Transformation: The conversion of incoming data into the desired form.
Moral Foundations of Data Mining
vacy. Or it might publicly (though not necessarily individually) disclose the collection. Such disclosures are often key components of privacy policies. Privacy policies sometimes seek permission in advance to obtain and archive personal data or, more frequently, disclose that data are being collected and then provide a mechanism for individuals to opt out. The question whether such opt-out policies are adequate to give individuals opportunities to control use of their data is subject to widespread debate. In another context, a government will collect data for vital statistics or public health databases. Such uses, at least in democratic societies, may be justified on grounds of the implied consent of those to whom the information applies and who would benefit from its collection.

It is not clear how much or what kind of consent would be necessary to provide ethical warrant for data mining of personal information. The problem of adequate consent is complicated by what may be hypothesized to be widespread ignorance about data mining and its capabilities. As elsewhere, some solutions to this ethical problem might be identified or clarified by empirical research related to public understanding of data-mining technology, individuals' preference for (levels of) control over use of their information, and similar considerations. The U.S. Department of Health and Human Services has, for instance, supported research on the process of informed consent for biomedical research. Data mining warrants a kindred research program. Among the issues to be clarified by such research are the following:

• To what extent do individuals want to control access to their information?
• What are the differences between consent to acquire data for one purpose and consent for secondary uses of that data?
• How do individual preferences or inclinations to consent vary along with the data miner? (That is, it may be hypothesized that some or many individuals will be sanguine about data mining by trusted public health authorities, but opposed to data mining by [certain] business entities or governments.)

It should be noted that many information exchanges (especially including those generally involving the most sensitive or personal data) are at least partly governed by professionalism standards. Thus, doctor-patient and lawyer- or accountant-client relationships traditionally, if not legally, impose high standards for the protection of information acquired during the course of a professional relationship.

Appropriate Uses and Users of Data Mining Technology

It should be uncontroversial to point out that not all data mining or knowledge discovery is done by appropriate users, and not all uses enjoy equal moral warrant. A data-mining police state may not be said to operate with the same moral traction as a government public health service in a democracy. (We may one day need to inquire whether use of data-mining technology by a government is itself grounds for identifying it as repressive.) Similarly, given two businesses (insurance companies, say), it is straightforward to report that the one using data-mining technology to identify trends in accidents to better offer advice about preventing accidents is on firm moral footing, as opposed to one that identifies trends in accidents to discriminate against minorities.

One way to carve the world of data mining is at the public/private joint. Public uses are generally by governments or their proxies, which can include universities and corporate contractors, and can employ data from private sources (such as credit card information). Public data mining can, at least in principle, claim to be in the service of some collective good. The validity of such a claim must be assessed and then weighed against damage or threats to other public goods or values.

In the United States, the General Accounting Office, a research and investigative branch of Congress, identified 199 federal data-mining projects and found that of these, 54 mined private sector data, with 36 involving personal information. There were 77 projects using data from other federal agencies, and, of these, 46 involve personal information from the private sector. The personal information, apparently used in these projects without explicit consent, is said to include student loan application data, bank account numbers, credit card information, and taxpayer identification numbers. The projects served a number of purposes, the top six of which are given as improving service or performance (65 projects), detecting fraud, waste and abuse (24), analyzing scientific and research information (23), managing human resources (17), detecting criminal activities or patterns (15), and analyzing intelligence and detecting terrorist activities (14) (General Accounting Office, 2004).

Is losing confidentiality in credit card transactions a fair exchange for improved government service? Research? National security? These questions are the focus of sustained debate.

In the private sphere, data miners enjoy fewer opportunities to claim that their work will result in collective benefit. The strongest warrant for private or for-profit
data mining is that corporations have fiduciary duties to shareholders and investors, and these duties can only or are best discharged by using data-mining technology. In many jurisdictions, the use of personal information, including health and credit data, is governed by laws that include requirements for privacy policies and notices. Because these often include opt-out provisions, it is clear that individuals must, as a matter of strategy if not moral self-defense, take some responsibility for safeguarding their own information.

A further consideration is that for the foreseeable future, most people will continue to have no idea of what data mining is or what it is capable of. For this reason, the standard bar for consent to use data must be set very high indeed. It ought to include definitions of data mining and must explicitly disclose that the technology is able to infer or discover things about individuals that are not obvious from the kinds of data that are willingly shared: that databases can be linked; that data can be inaccurate; that decisions based on this technology can affect credit eligibility and employment; and so on (O'Leary, 1995). Indeed, the very point of data mining is the discovery of trends, patterns and (putative) facts that are not obvious or knowable on the surface. This must become a central feature of privacy notice disclosure.

Further, all data-mining entities ought to develop or adopt and then adhere to policies and guidelines governing their practice (O'Leary, 1995).

Error and Uncertainty

Data analysis is probabilistic. Very large datasets are buggy and contain inaccuracies. It follows that patterns, trends, and other inferences based on algorithmic analysis of databases are themselves open to challenge on grounds of accuracy and reliability. Moreover, when databases are proprietary, there are disincentives to operate with standards for openness and transparency that are morally required and increasingly demanded in government, business, and science. What emerges is an imperative to identify and follow best practices for data acquisition, transmission, and analysis. Although some errors are forgivable, those caused by sloppiness, low standards, haste, or inattention can be blameworthy. When the stakes are high, errors can cause extensive harm. It would indeed be a shame to solve problems of privacy, consent, and appropriate use only to fail to do the job one set in the first place.

Perhaps the most interesting kind of data-mining error is that which identifies or relies on illusory patterns or flawed inferences. A business data-mining operation that errs in predicting the blue frock market may disappoint, but an error by data miners for a manned space program, biomedical research project, public health surveillance system, or national security agency could be catastrophic. This means that good ethics requires good science. That is, it is not adequate to collect and analyze data for worthy purposes; one must also seek and stand by standards for doing so well.

One feature of data mining that makes this exceptionally difficult is its very status as a science. It is customary in most empirical inquiries to frame and then test hypotheses. Although philosophers of science disagree about the best or most productive methods for conducting such inquiries, there is broad agreement that experiments and tests generally offer more rigor and produce more reliable results and insights than (mere) observation of patterns between or among variables. This is a well-known problem in domains in which experiments are impossible or unethical. In epidemiology and public health, for instance, it is often not possible to test a hypothesis by using the experimental tools of control groups, double blinding, and placebos (one could not intentionally attempt to give people a disease by different means in order to identify the most dangerous routes of transmission). For this reason, epidemiologists and public health scientists are usually mindful of the fact that their studies might fail to reveal a causal relation, but rather, no more than a statistical correlation.

Put differently, "when there is no hypothesis to test, it is not possible to know what will be found until it is discovered" (Fule & Roddick, 2004, p. 160).

There is a precedent for addressing this kind of challenge. In meta-analysis, or the concatenation and reanalysis of others' results, statisticians endured withering criticism of questionable methods, bias introduced by initial data selection, and over-broad conclusions. They have responded by refining their techniques and improving their processes. Meta-analysis remains imperfect, but scientists in numerous disciplines (perhaps most noteworthily in the biomedical sciences, where in some instances meta-analysis has altered standards of patient care) have come to rely on it (Goodman, 1998b).

FUTURE TRENDS

Data mining has, in a short time, become ubiquitous. Like many hybrid sciences, it was never clear who owned it and therefore who should assume responsibility for its practice. Attention to data mining has increased in part because of concern about its use by governments and the extent to which such use will infringe on civil liberties. This has, in turn, led to renewed scrutiny, a positive development. It must, however, be emphasized that while issues related to privacy
and confidentiality are of fundamental concern, they are by no means the only ethical issues raised by data mining and, indeed, attention to them at the expense of the other issues identified here would be a dangerous oversight.

The task of identifying ethical issues and proposing solutions, approaches, and best (ethical) practices itself requires additional research. It is even possible to evaluate data mining's ethical sensitivity, that is, the extent to which rule generation itself takes ethical issues into account (Fule & Roddick, 2004).

A broad-based research program will use the content, user, purpose triad to frame and test hypotheses about the best ways to protect widely shared values; it might be domain specific, emphasizing those areas that raise the most interesting issues: business and economics, public health (including bioterrorism preparedness), scientific (including biomedical) research, government operations (including national security), and so on. Some issues/domains raise keenly important issues and concerns. These include genetics and national security surveillance.

Genetics and Genomics

The completion of the Human Genome Project in 2001 is correctly seen as a scientific watershed. Biology and medicine will never be the same. Although it has been recognized for some time that the genome sciences are dependent on information technology, the ethical issues and challenges raised by this symbiosis have only been sketched (Goodman, 1996; Goodman, 1999). The fact that vast amounts of human and other genetic information can be digitized and analyzed for clinical, research, and public health purposes has led to the creation of a variety of databases and, consequently, establishment of the era of data mining in genomics. It could not be otherwise: "[T]he amount of new information is so great that data mining techniques are essential in order to obtain knowledge from the experimental data" (Larrañaga, Menasalvas, & Peña, 2004).

The ethical issues that arise at the intersection of genetics and data mining should be anticipated: privacy and confidentiality, appropriate uses and users, and accuracy and uncertainty. Each of these embodies special features. Confidentiality of genetic information applies not only to the source of genetic material but also, in one degree or another, to his or her relatives. The concepts of appropriate use and user are strained given extensive collaborations between and among corporate, academic, and government entities. And challenges of accuracy and uncertainty are magnified in case the results of data mining are applied in medical practice.

National Security Surveillance

Although it would be blameworthy for vulnerable societies to fail to use all appropriate tools at their disposal to prevent terrorist attacks, data-mining technology poses new and interesting challenges to the concept of "appropriate tool." There are concerns at the outset that data acquisition, for whatever purpose, may be inappropriate. This is true whether police, military, or other government officials are collecting information on notecards or in huge, machine-tractable databases. Knowledge discovery ups the ante considerably.

The privacy rights of citizens in a democracy are not customarily bartered for other benefits, such as security. The rights are what constitute the democracy in the first place. On the other hand, rights are not absolute, and there might be occasions on which morality permits or even requires infringements. Moreover, when a database is created, appropriate use and error/uncertainty become linked: appropriate use includes (at least tacitly) the notion that such use will achieve its specified ends. To the degree that error propagation or uncertainty can impeach data-mining results, the use itself is correspondingly impeached.

CONCLUSION

The intersection of computing and ethics has been a fertile ground for scholars and policymakers. The rapid growth and extraordinary power of data mining and knowledge discovery in databases bids fair to be a source of new and interesting ethical challenges. Precedents in kindred domains, such as health informatics, offer a platform on which to begin to prepare the kind of thoughtful and useful conceptual tools that will be required to meet those challenges. By identifying the key nodes of content, user, and purpose, and the leading ethical issues of privacy/confidentiality, appropriate use(r), and error/uncertainty, we can enjoy some optimism that the future of data mining will be guided by thoughtful rules and policies and not the narrower (but not invalid) interests of entrepreneurs, police, or scientists who hesitate or fail to weigh the goals of their inquiries against the needs and expectations of the societies that support their work.

REFERENCES

American Statistical Association. (1999). Ethical guidelines for statistical practice. Retrieved from http://
Mosaic-Based Relevance Feedback for Image Retrieval

Ingo la Tendresse
Technical University of Clausthal, Germany
INTRODUCTION

A standard approach for content-based image retrieval (CBIR) is based on the extraction and comparison of features usually related to dominant colours, shapes, textures and layout (Del Bimbo, 1999). These features are a-priori defined and extracted when the image is inserted into the database. At query time the user submits a similar sample image (query-by-sample-image) or draws a sketch (query-by-sketch) of the sought archived image. The similarity degree of the current query image and the target images is determined by calculation of a multidimensional distance between the corresponding features. The computed similarity values allow the creation of an image ranking, where the first k, usually k=32 or k=64, images are considered retrieval hits. These are chained in a list called ranking and then presented to the user. Each of these images can be used as a starting point for a refined search in order to improve the obtained results.

The assessment of the retrieval result is based on a subjective evaluation of whole images and their position in the ranking. An important disadvantage of retrieval with content-based features and the presentation of the resulting images as a ranking is that the user is usually not aware why certain images are shown on the top positions and why certain images are ranked low or not presented at all. Furthermore, users are also interested in which sketch properties are decisive for the consideration and rejection of the images, respectively. In case of primitive features like colour, these questions can often be answered intuitively. Retrieval with complex features considering, for example, texture and layout creates rankings where the similarity between the query and the target images is not always obvious. Thus, the user is not satisfied with the displayed results and would like to improve the query, but it is not clear to him/her which parts of the querying sketch or the sample image should be modified and improved according to the desired targets. Therefore, a suitable feedback mechanism is necessary.

BACKGROUND

Relevance feedback techniques are often used in image databases in order to gain additional information about the sought image set (Rui, Huang, & Mehrotra, 1998). The user evaluates the retrieval results and selects, for example, positive and negative instances; thus in the subsequent retrieval steps the search parameters are optimised and the corresponding images/features are supplied with higher weights (Müller & Pun, 2000; Rui & Huang, 2000). Moreover, additional query images can be considered and allow a more detailed specification of the target image (Baudisch, 2001). However, the more complex the learning model, the more difficult the analysis and the evaluation of the retrieval results. Users (in particular those with limited expertise in image processing and retrieval) are not able to detect misleading areas in the image/sketch with respect to the applied retrieval algorithms and to modify the sample image/sketch appropriately. Consequently, an iterative search is often reduced to a random process of parameter optimisation.

The consideration of the existing user knowledge is the main objective of many feedback techniques. In case of user profiling (Cox, Miller, Minka, Papathomas, & Yianilos, 2000) the previously completed retrieval sessions are analysed in order to obtain additional information about users' preferences. Furthermore, the selection actions during the current session are monitored and images similar to those are given higher weights. A hierarchical approach named multi-feedback ranking separates the querying image in several regions and allows a more detailed search (Mirmehd & Perissamy, 2001).

Techniques for permanent feedback guide the user through the entire retrieval process. An example for this approach is implemented by the system Image Retro: based on a selected sample image a number of images are removed from the possible image set. By analysing these images, the user develops an intuition about promising starting points (Vendrig, Worring, & Smeulders, 2001). In case of the fast feedback, the user receives the
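The retrieval scheme sketched in the Introduction (a weighted multidimensional distance between feature vectors, with the first k images in the ranking treated as hits) and a simple variance-based reweighting step in the spirit of the relevance-feedback methods cited above can be illustrated as follows. This is only a toy sketch, not the cited algorithms: feature extraction is assumed to have already produced fixed-length vectors, and both function names are invented for this example:

```python
import math

def rank(query_vec, database, weights, k=32):
    """Rank archived images by weighted Euclidean distance between the
    query features and the stored features; the first k images in the
    ranking are the retrieval hits."""
    def dist(vec):
        return math.sqrt(sum(w * (q - v) ** 2
                             for w, q, v in zip(weights, query_vec, vec)))
    return sorted(database, key=lambda item: dist(item["features"]))[:k]

def reweight(positives, eps=1e-6):
    """Crude relevance-feedback step: a feature dimension on which the
    user's positive examples agree (low variance) is considered
    discriminative and receives a higher weight."""
    dims = len(positives[0]["features"])
    raw = []
    for i in range(dims):
        vals = [p["features"][i] for p in positives]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        raw.append(1.0 / (var + eps))
    total = sum(raw)
    return [w / total for w in raw]    # keep the weights normalised
```

After each feedback round the new weights would simply be passed back into rank for the next retrieval step; the mosaic technique discussed in this article addresses the complementary question of showing the user *which image regions* drive such a ranking.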
created a test image database with pictures of various categories and selected a wavelet-based feature for the retrieval (Kao & Joubert, 2001). Wavelet-based features have in common that the semantic interpretation of the extracted coefficients, and thus of the produced rankings, is rather difficult, in particular for users without significant background in image processing. Nevertheless, such features perform well on general image sets and are applied in a variety of systems (Jacobs, Finkelstein, & Salesin, 1995). The application of the mosaic-based feedback could improve the usability of these powerful features by helping the user to find and to modify the decisive image regions.

The experiments considered fine-grained mosaics with a 16x16 grid, a 32x32 grid, and adaptive grids. Simultaneously, a traditional ranking for the current query was created, which contained the best k images for the given sketch and the selected similarity criterion. Both rankings were subsequently distributed to 16 users, who analysed them and, based on the derived knowledge, improved the sketch appropriately. The obtained results were evaluated manually in order to resolve the following basic questions related to possible applications of the mosaic feedback:

[…] sections per mosaic and needed on average 63 seconds of study time.

In order to evaluate the mosaic as a feedback technique, we selected 26 sketches which did not lead to the desired images during the traditional and mosaic-based retrieval, and asked 16 test persons to improve the quality of the sketches. Eight test persons received the target image, the old sketch, as well as the corresponding mosaic, and had 300 seconds to correct the colours, shapes, layout, etc. The other eight persons had to improve the sketch without the mosaic feedback. All modified sketches were submitted to the CBIR system and produced the following results: 46.59% of all sketches changed according to the mosaic feedback led to a successful traditional retrieval, i.e., the sought image was returned within the 16 most similar images. In contrast, only 34.24% of the sketches modified without mosaic feedback enabled a successful retrieval. If a mosaic-based ranking was presented, then the recall rate increased up to 68.75% (feedback allowed) and up to 54.35% otherwise. Moreover, the desired image was placed on average two positions above its ranking position prior to modification.
can be combined in a so-called convergence map, which is subsequently processed by a number of rules. Based on the output of the analysis process, several actions can be proposed. For example, if the given sketch is completely on the wrong route, the user is asked to re-draw the sketch entirely. If only parts of the sketch are misleading, the user is asked to pay more attention to those. Finally, some of the necessary modifications can also be performed or proposed automatically, for example if intuitive features such as colours are used for the retrieval. In this case, the user is advised to use more red or green colour in some sections or to change the average illumination of the sketch.

An additional visual help for the user during creation of the query sketch is given by a permanent evaluation of the archived images and presentation of the most promising images in a three-dimensional space. The current sketch builds the centre of this 3D space, and the archived images are arranged around the sketch according to their similarity distance. The nearest images are the most similar ones, so the user can copy some aspects of those images in order to move the retrieval in the desired direction. After each new line, the distance to all images is re-computed and the images in the 3D space are re-arranged. Thus, the user can see the effects of the performed modification immediately and proceed in the same direction, or remove the latest modifications and try another path.

CONCLUSIONS

This article described a mosaic-based feedback method which helps image database users to recognise and modify misleading areas in the querying sketch and to improve the retrieval results. Performance measurements showed that the proposed method eases the querying process and allows users to modify their sketches in a suitable manner.

REFERENCES

Baudisch, P. (2001). Dynamic information filtering. PhD Thesis, GMD Research Series 2001, No. 16. GMD Forschungszentrum.

Cinque, L., Levialdi, S., Malizia, A., & Olsen, K.A. (1998). A multi-dimensional image browser. Journal of Visual Languages and Computing, 9(1), 103-117.

Cox, I.J., Miller, M.L., Minka, T.P., Papathomas, T., & Yianilos, P.N. (2000). The Bayesian image retrieval system, PicHunter: Theory, implementation and psychophysical experiments. IEEE Transactions on Image Processing, 9(1), 20-37.

Del Bimbo, A. (1999). Visual information retrieval. Morgan Kaufmann Publishers.

Heesch, D., & Rüger, S.M. (2003). Relevance feedback for content-based image retrieval: What can three mouse clicks achieve? In Advances in Information Retrieval, LNCS 2633 (pp. 363-376).

Jacobs, C.E., Finkelstein, A., & Salesin, D.H. (1995). Fast multiresolution image querying. In Proceedings of ACM SIGGRAPH (pp. 277-286).

Kao, O., & Joubert, G.R. (2001). Efficient dynamic image retrieval using the à trous wavelet transformation. In Advances in Multimedia Information Processing, LNCS 2195 (pp. 343-350).

La Tendresse, I., & Kao, O. (2003). Mosaic-based sketching interface for image databases. Journal of Visual Languages and Computing, 14(3), 275-293.

La Tendresse, I., Kao, O., & Skubowius, M. (2002). Mosaic feedback for sketch training and retrieval improvement. In Proceedings of the IEEE Conference on Multimedia and Expo (ICME 2002) (pp. 437-440).

Mirmehdi, M., & Perissamy, R. (2001). CBIR with perceptual region features. In Proceedings of the 12th British Machine Vision Conference (pp. 51-520).

Müller, H., Müller, W., Squire, D., Marchand-Maillet, S., & Pun, T. (2000). Learning feature weights from user behavior in content-based image retrieval. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Workshop on Multimedia Data Mining MDM/KDD2000).

Rui, Y., & Huang, T.S. (2000). Optimizing learning in image retrieval. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (pp. 236-245).

Rui, Y., Huang, T.S., & Mehrotra, S. (1998). Relevance feedback techniques in interactive content-based image retrieval. In Proceedings of SPIE 3312 (pp. 25-36).

Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.

Santini, S., & Jain, R. (2000). Integrated browsing and querying for image databases. IEEE MultiMedia, 7(3), 26-39.
Veltkamp, R.C., Tanase, M., & Sent, D. (2001). Features in content-based image retrieval systems: A survey. Kluwer Academic Publishers.

Vendrig, J., Worring, M., & Smeulders, A.W.M. (2001). Filter image browsing: Interactive image retrieval by using database overviews. Multimedia Tools and Applications, 15(1), 83-103.

KEY TERMS

Content-Based Image Retrieval: Search for suitable images in a database by comparing extracted features related to colour, shape, layout and other specific image characteristics.

Feature Vector: Data that describes the content of the corresponding image. The elements of the feature vector represent the extracted descriptive information with respect to the utilised analysis.

Query By Sketch, Query By Painting: A widespread query type where the user is not able to present a similar sample image. Therefore, the user sketches the looked-for image with a few drawing tools. It is not necessary to do this correctly in all aspects.

Query-By-Pictorial-Example: The query is formulated by using a user-provided example image for the desired retrieval. Both the query and the stored images are analysed in the same way.

Ranking: List of the most similar images extracted from the database according to the querying sketch/image or other user-defined criteria. The ranking displays the retrieval results to the user.

Relevance Feedback: The user evaluates the quality of the individual items in the ranking based on the subjective expectations/specification of the sought item. Subsequently, the user supplies the system with the evaluation results (positive/negative instances, weights) and re-starts the query. The system considers this user knowledge while computing the similarity.

Similarity: Correspondence of two images. The similarity is determined by comparing the extracted feature vectors, for example by a metric or distance function.
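The Ranking and Similarity terms can be illustrated with a small sketch. The function names and toy feature vectors here are our own, not from the article: a distance function compares extracted feature vectors, and the ranking lists the most similar images first.

```python
# Illustrative sketch only; "rank", "euclidean", and the toy data are
# our own names, not from the article. Similarity between two images is
# computed as a distance over their extracted feature vectors.
import math

def euclidean(u, v):
    """Distance between two feature vectors (smaller = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank(query, database, k=16):
    """Return the ids of the k most similar images in the database."""
    scored = sorted(database.items(), key=lambda item: euclidean(query, item[1]))
    return [image_id for image_id, _ in scored[:k]]
```

A sought image then counts as successfully retrieved when it appears within the first k entries of this ranking (the experiments above use k = 16).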
Multimodal Analysis in Multimedia Using Symbolic Kernels

Chitra Dorai
IBM T. J. Watson Research Center, USA
analyses help capture information from symbolic feature spaces, visualize symbolic data, and aid tasks such as classification and clustering, and therefore are eminently useful in multimodal analysis of multimedia.

BACKGROUND

Early approaches to multimedia content analysis dealt with multimodal feature data in two primary ways. Either a learning technique such as Neural Nets was used to find patterns in the multimodal data after mapping symbolic values into integers, or the multimodal features were segregated into different groups according to their modes of origin (e.g., into audio and video features), processed separately, and the results from the separate processes were merged by using some probabilistic mechanism or evidence combination method. The first set of methods implicitly assumed the Euclidean distance as an underlying metric between feature vectors. Although this may be appropriate for real-valued data, it imposes a neighborhood property on symbolic data that is artificial and often inappropriate. The second set of methods essentially dealt with each category of multimodal data separately and fused the results. They were thus incapable of leading to novel patterns that can arise if the data were treated together as a whole. As audiovisual collections of today provide multimodal information, they need to be examined and interpreted together, not separately, to make sense of the composite message (Bradley, Fayyad, & Mangasarian, 1998).

Recent advances in machine-learning and data analysis techniques, however, have enabled more sophisticated means of data analysis. Several researchers have attempted to generalize the existing PCA-based framework. For instance, Tipping (1999) presented a probabilistic latent-variable framework for data visualization of binary and discrete data types. Collins and co-workers (Collins, Dasgupta, & Schapire, 2001) generalized the basic PCA framework, which inherently assumes Gaussian features and noise, to other members of the exponential family of functions. In addition to these research efforts, Kernel PCA (KPCA) has emerged as a new data representation and analysis method that extends the capabilities of classical PCA, which is traditionally restricted to linear feature spaces, to feature spaces that may be nonlinearly correlated (Scholkopf, Smola, & Muller, 1999). In this method, the input vectors are implicitly projected onto a high-dimensional space by using a nonlinear mapping. Standard PCA is then applied in this high-dimensional space. KPCA avoids explicit calculation of the high-dimensional projections with the use of kernel functions, such as radial basis functions (RBF), high-degree polynomials, or the sigmoid function. KPCA has been successfully used to capture important information from large, nonlinear feature spaces into a smaller set of principal components (Scholkopf et al., 1999). Operations such as clustering or classification can then be carried out in this reduced dimensional space. Because noise is eliminated as projections on eigenvectors with low eigenvalues, the final reduced space of larger principal components contains less noise and yields better results with further data analysis tasks such as classification.

Although many conventional methods have been previously developed for extraction of principal components from nonlinearly correlated data, none allowed for generalization of the concepts to dimensionality reduction of symbolic spaces. The kernel-space representation of KPCA presents such an opportunity. However, since its inception, applications of KPCA have been primarily limited to domains with real-valued, nonlinearly correlated features, despite the recent literature on defining kernels over discrete objects such as sequences, trees, and graphs, as well as many other types of objects. Moreover, recent techniques like the Fisher kernel approach by Jaakkola and Haussler (1999) can be used to systematically derive kernels from generative models, which have been demonstrated quite successfully in the rich symbolic feature domain of bioinformatics. Against the backdrop of these emerging collections of research, the work presented in this paper uses the ideas of Kernel PCA and symbolic kernel functions to investigate the yet unexplored problem of symbolic-domain principal component extraction in the context of multimedia. The kernels used here are designed based on well-known distance metrics, namely the Hamming distance, the Cityblock distance, and the Edit distance metric, which have been previously used for string comparisons in several domains, including gene sequencing.

With these and other symbolic kernels, multimodal data from multimedia analysis containing real and symbolic values can be handled in a uniform fashion by using, say, an SVM classifier employing a kernel function that is a combination of Euclidean, Hamming, and Edit distance kernels. Applications of the proposed kernel functions to temporal analysis of videotext data demonstrate the utility of this approach.

MAIN THRUST

Distance Kernels for Multimodal Data

Kernel-based classifiers such as SVMs and Neural Networks use linear, Radial Basis Function (RBF), or polynomial functions as kernels that first (implicitly) transform input data into a higher dimensional feature space and then process them in this space. Many of the common kernels assume a Euclidean distance to compare feature vectors. Symbolic kernels, in contrast, have been less commonly used in the published literature. We use the following distance-based kernel functions in our analysis.

1. Linear (Euclidean) Kernel Function for Real-valued Features: This is the most commonly used kernel function for linear SVMs and other kernel-based algorithms. Let x and z be two feature vectors. Then the function […]

[…] number of change, delete, and insert operations required to convert one string into another, and len(x) is the length of string x. In theory, Edit distance does not obey Mercer validity, as has been recently proved by Cortes and coworkers (Cortes, Haffner, & Mohri, 2002, 2003). However, empirically, the kernel matrices generated by the edit kernel are often positive definite, justifying the practical use of Edit distance-based kernels.
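The exact kernel definitions are not reproduced in this text, so the following is only a sketch under stated assumptions: a Levenshtein edit distance turned into a similarity via exp(-gamma * d) (the Gaussian-of-distance form and the name `edit_kernel` are our own choices, not necessarily the paper's), together with Kernel PCA computed directly from the resulting kernel matrix.

```python
# Sketch, not the authors' implementation. Assumptions: the edit kernel
# has the form K(x, z) = exp(-gamma * edit_distance(x, z)); gamma = 0.1
# is an arbitrary illustrative value.
import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Minimum number of change, delete, and insert operations
    required to convert one string into another (Levenshtein)."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # delete
                          d[i, j - 1] + 1,          # insert
                          d[i - 1, j - 1] + cost)   # change
    return int(d[m, n])

def edit_kernel(strings, gamma=0.1):
    """Pairwise kernel matrix; not Mercer-valid in theory, but often
    positive definite in practice, as noted above."""
    n = len(strings)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = np.exp(-gamma * edit_distance(strings[i], strings[j]))
    return K

def kernel_pca(K, n_components=2):
    """Coordinates of the training points along the top principal
    components, computed from the kernel matrix alone."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # double-centering
    w, v = np.linalg.eigh(Kc)                    # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n_components]     # take the largest ones
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

Applied to strings recognized from consecutive video frames, the first components of such a projection group noisy variants of the same videotext string together, which is the effect exploited in the clustering and tracking experiments below.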
search, event monitoring, and video program categorization. Therefore, a videotext-based Multimedia Description Scheme has recently been adopted into the MPEG-7 ISO/IEC standard to facilitate media content description (Dimitrova, Agnihotri, Dorai, & Bolle, 2000). To this end, videotext extraction and recognition is an important task of an automated video content analysis system. Figure 1 shows three illustrative frames containing videotext taken from MPEG videos of different genres.

However, unlike scanned paper documents, videotext is superimposed on often changing backgrounds comprising moving objects with a rich variety of color and texture. In addition, videotext is often of low resolution and suffers from compression artifacts. Due to these difficulties, existing OCR algorithms result in low accuracy when applied to the problem of videotext extraction and recognition. On the other hand, we observed that temporal videotext often persists on the screen over a span of time (approximately 30 seconds) to ensure readability, resulting in many available samples of the same body of text over multiple frames of video. This redundancy can be exploited to improve recognition accuracy, although erroneous text extraction and/or incorrect character recognition by a classifier may make the strings dissimilar from frame to frame. Often no single instance of the text may lead to perfect recognition, underscoring the need for intelligent postprocessing.

In the existing literature, unfortunately, temporal contiguity analysis of videotext is implemented by using ad-hoc thresholds and heuristics for the following reasons. First of all, due to missed or merged characters, the same string may be perceived to be of different lengths on different frames. We thus have feature vectors of varying lengths. Secondly, the exact duration of the persistence of videotext is unknown a priori. Two consecutive frames can have completely different strings. Thirdly, videotext can be in scrolling motion. Because multiple moving text blocks can be present in the same video frame, it is nontrivial to recognize which videotext objects from consecutive frames are instances of the same text.

In light of these difficulties, we present brief illustrative examples of dimensionality reduction and visualization of feature vectors comprising strings of recognized videotext. Experiments with our Edit distance kernel investigated the use of KPCA for analyzing the temporal contiguity of videotext using these feature vectors.

Videotext Clustering and Change Detection: Figure 2 shows the first two components obtained by applying KPCA with the Edit distance kernel to a set of strings recognized from 20 consecutive frames. These frames contain instances of two distinct strings. Without any assumed knowledge, KPCA's use of the Edit distance kernel clearly shows two distinct clusters corresponding to these two strings.

Videotext Tracking and Outlier Detection: Figure 3 shows the first three principal components obtained by applying KPCA with our Edit distance kernel to a set of strings recognized from 20 consecutive frames. These frames contain instances of videotext scrolling across the screen. In this three-dimensional plot, we can see a visual representation of the changing content of videotext as a trajectory in the principal component space, and locating outliers from the trajectory indicates the appearance of other strings in the video frames.

These results show that the symbolic kernels can assist significantly in automated agglomeration and tracking of recognized text as well as effective data visualization. In addition, multimodal feature vectors constructed from recognized text strings and frame motion estimates can now be analyzed jointly by using our hybrid kernel for media content characterization.

FUTURE TRENDS

One of the big hurdles facing media management systems is the semantic gap between the high-level meaning sought by user queries in search for media and the low-level features that we actually compute today for media indexing and description. Computational Media Aesthetics, a promising approach to bridging the gap and building
high-level semantic descriptions for media search and navigation services, is founded upon an understanding of media elements and their individual and joint roles in synthesizing meaning and manipulating perceptions, with a systematic study of media productions (Dorai & Venkatesh, 2001). The core trait of this approach is that, in order to create effective tools for automatically understanding video, we need to be able to interpret the data with its maker's eye. In order to realize the potential of this approach, it becomes imperative that all sources of descriptive information, audio, video, text, and so forth, be considered as a whole and analyzed together to derive inferences with a certain level of integrity. With the ability to treat multimodal features as an integrated feature set to describe media content during classification and visualization, new higher level semantic mappings from low-level features can be achieved to describe media content. The symbolic kernels are a promising initial step in that direction to facilitate rigorous joint feature analysis in various media domains.

CONCLUSION

Traditional integer representation of symbolic multimedia feature data for classification and other data-mining tasks is artificial, as the symbolic space may not reflect the continuity and neighborhood relations imposed by integer representations. In this paper, we use distance-based kernels in conjunction with kernel space methods such as KPCA to handle multimodal data, including symbolic features. These symbolic kernels, as shown in this paper, help apply traditionally numeric methods to symbolic spaces without any forced integer mapping for important tasks such as data visualization, principal component extraction, and clustering in multimedia and other domains.

REFERENCES

Aradhye, H., & Dorai, C. (2002). New kernels for analyzing multimodal data in multimedia using kernel machines. Proceedings of the IEEE International Conference on Multimedia and Expo, Switzerland, 2 (pp. 37-40).

Bradley, P.S., Fayyad, U.M., & Mangasarian, O. (1998). Data mining: Overview and optimization opportunities (Tech. Rep. No. 98-01). Madison: University of Wisconsin, Computer Sciences Department.

Collins, M., Dasgupta, S., & Schapire, R. (2001). A generalization of principal component analysis to the exponential family. In T.G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems 14 (pp. 617-624). Cambridge, MA: MIT Press.

Cortes, C., Haffner, P., & Mohri, M. (2002). Rational kernels. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems 15 (pp. 41-56). Cambridge, MA: MIT Press.

Cortes, C., Haffner, P., & Mohri, M. (2003). Positive definite rational kernels. Proceedings of the 16th Annual Conference on Computational Learning Theory (pp. 41-56), USA.
Dimitrova, N., Agnihotri, L., Dorai, C., & Bolle, R. (2000, October). MPEG-7 videotext descriptor for superimposed text in images and video. Signal Processing: Image Communication, 16, 137-155.

Dorai, C., & Venkatesh, S. (2001, October). Computational media aesthetics: Finding meaning beautiful. IEEE Multimedia, 8(4), 10-12.

Jaakkola, T.S., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In M.S. Kearns, S.A. Solla, & D.A. Cohn (Eds.), Advances in neural information processing systems 11 (pp. 487-493). Cambridge, MA: MIT Press.

Scholkopf, B., Smola, A., & Muller, K.R. (1999). Kernel principal component analysis. In B. Scholkopf, C.J.C. Burges, & A.J. Smola (Eds.), Advances in kernel methods: SV learning (pp. 327-352). Cambridge, MA: MIT Press.

Tipping, M.E. (1999). Probabilistic visualisation of high-dimensional binary data. In M.S. Kearns, S.A. Solla, & D.A. Cohn (Eds.), Advances in neural information processing systems 11 (pp. 592-598). Cambridge, MA: MIT Press.

KEY TERMS

Mercer's Condition: A kernel function is said to obey Mercer's condition for kernel validity iff the kernel matrix comprising pairwise kernel evaluations over any given subset of the feature space is guaranteed to be positive semidefinite.

MPEG Compression: Video/audio compression standard established by the Motion Picture Experts Group. MPEG compression algorithms use psychoacoustic modeling of audio and motion analysis as well as DCT of video data for efficient multimedia compression.

Multimodality of Feature Data: Feature data is said to be multimodal if the features can be characterized as a mixture of real-valued, discrete, ordinal, or nominal values.

Principal Component Analysis (PCA): One of the oldest modeling and dimensionality reduction techniques. PCA models observed feature data as a linear combination of a few uncorrelated, Gaussian principal components and additive Gaussian noise.

Videotext: Text graphically superimposed on video imagery, such as caption text, headline news, speaker identity, location, and so on.
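Mercer's condition, and the earlier remark that edit kernel matrices are often positive definite in practice, can be checked empirically. The helper below is our own illustration, not from the article: it tests positive semidefiniteness via the eigenvalues of a given kernel matrix.

```python
# Our own illustrative helper, not from the article. A symmetric kernel
# matrix satisfies Mercer's condition on the sampled points iff it is
# positive semidefinite, i.e. all eigenvalues are (numerically) >= 0.
import numpy as np

def satisfies_mercer(K, tol=1e-10):
    """Empirical PSD check for a square kernel matrix K."""
    K = np.asarray(K, dtype=float)
    eigenvalues = np.linalg.eigvalsh((K + K.T) / 2.0)  # symmetrize first
    return bool(eigenvalues.min() >= -tol)
```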
Multiple Hypothesis Testing for Data Mining
Table 1. Summary table for multiple testing, following Benjamini and Hochberg (1995). [Table not reproduced.]

These P-values are called nominal P-values and represent per-test error probabilities. The procedure by which a P-value corresponding to an observed test statistic is computed is not described here, but can be found in any textbook of statistics, such as those mentioned previously. Essentially, the null cumulative distribution function of the test statistic is used to compute the probability of Type I error at threshold Tj.

(3) Arrange the P-values obtained in the previous step in ascending order (smaller P-values correspond to objects more likely to be relevant). Let the ordered P-values be denoted by:

P(1) <= P(2) <= ... <= P(m)

Thus, each object has a corresponding P-value. In order to select a subset of objects, it will therefore be sufficient to determine a threshold in terms of nominal P-value.

[…] committing at least one Type I error, and then calculate a corresponding threshold in terms of nominal P-value.

Advantages

- The basic Bonferroni procedure is extremely simple to understand and use.

Disadvantages

- The Bonferroni procedure sets the threshold to meet a specified probability of making at least one Type I error. This notion of error is called the Family Wise Error Rate, or FWER, in the statistical literature. FWER is a very strict control of error, and is far too conservative for most practical applications. Using FWER on datasets with large numbers of variables often results in the selection of a very small number of objects, or even none at all.

- The assumption of statistical independence between tests is a strong one, and almost never holds in practice.

The FDR Method of Benjamini and Hochberg
the average proportion of false positives among the objects selected:

FDR = E[S0 / S]

where E[·] denotes expectation, S0 is the number of false positives, and S is the number of objects selected. The Benjamini and Hochberg method allows us to compute a threshold corresponding to a specified FDR.

Procedure

(1) Specify an acceptable FDR q*.
(2) Find the largest i which satisfies the following inequality: […]

Robust FDR Methods

The FDR procedure outlined above, while simple and computationally efficient, makes several strong assumptions, and while better than Bonferroni it is still often too conservative for practical problems (Storey & Tibshirani, 2003). The recently introduced q-value (Storey, 2003) is a more sophisticated approach to FDR correction and provides a very robust methodology for multiple testing. The q-value method makes use of the fact that P-values are uniformly distributed under the null hypothesis to accurately estimate the FDR associated with a particular threshold. The estimated FDR is then used to set an appropriate threshold. The q-value approach is an excellent choice in many multi-variable settings.
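The inequality itself is lost in this reproduction; assuming it is the standard Benjamini-Hochberg step-up rule P(i) <= (i/m) * q*, the procedure can be sketched as follows (the function name is our own):

```python
# Sketch of the Benjamini-Hochberg step-up procedure. Assumption: the
# elided inequality is the standard P_(i) <= (i/m) * q* rule from
# Benjamini and Hochberg (1995).
def benjamini_hochberg(p_values, q_star):
    """Return the indices of the objects selected at FDR level q_star."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])  # ascending P-values
    k = 0  # largest rank i (1-based) satisfying P_(i) <= (i / m) * q*
    for i, j in enumerate(order, start=1):
        if p_values[j] <= (i / m) * q_star:
            k = i
    # select every object whose P-value rank is <= k
    return sorted(order[:k])
```

All objects up to the last rank that satisfies the inequality are selected, even if some intermediate ranks fail it; this step-up behavior is what distinguishes the procedure from a fixed per-test threshold.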
as relevant. More generally, it is the false positive rate corresponding to an observed test statistic.

Random Variable: A variable characterized by random behavior in assuming its different possible values.

Sampling Distribution: The distribution of values obtained by applying a function to random data.

Test Statistic: A relevance scoring function used in a hypothesis test. Classical test statistics (such as the t-statistic) have null sampling distributions which are known a priori. However, under certain circumstances null sampling distributions for arbitrary functions can be obtained by computational means.
Music Information Retrieval
processing techniques are broadly applied here. One of the main topics of computational auditory scene analysis is the automatic separation of individual sound sources from a mixture. This is difficult with mixtures of harmonic instrument sounds, where spectra overlap. However, assuming time-frequency smoothness of the signal, sound separation can be performed, and when sound changes in time are observed, onset, offset, amplitude, and frequency modulation have similar shapes for all frequencies in the spectrum; thus, a demixing matrix can be estimated for them (Virtanen, 2003; Viste & Evangelista, 2003). Audio source separation techniques can also be used for source localization in auditory scene analysis. These techniques, like independent component analysis, originate from speech recognition in the cocktail-party environment, where many sound sources are present. Independent component analysis is used for finding underlying components from multidimensional statistical data, and it looks for components that are statistically independent (Vincent et al., 2003). Computational auditory scene recognition aims at classifying auditory scenes into predefined classes, using audio information only. Examples of auditory scenes are various outside and inside environments, like streets, restaurants, offices, homes, cars, and so forth. Statistical and nearest neighbor algorithms can be applied for this purpose. In the nearest neighbor algorithm, the class (the type of auditory scene, in this case) is assigned on the basis of the distance of the investigated sample to the nearest sample for which the class membership is known. Various acoustic features, based on Fourier spectral analysis (i.e., a mathematical transform decomposing the signal into frequency components), can be applied to parameterize the auditory scene for classification purposes. The effectiveness of this research approaches 70% correctness for about 20 auditory scenes (Peltonen et al., 2002).

Query-by-humming systems search melodic databases using sung queries (Adams et al., 2003). This topic represents audio retrieval by contents. Melody usually is quantized coarsely with respect to pitch and duration, assuming moderate singing abilities of users. The music retrieval system takes such an aural query (i.e., a motif or a theme) as input and searches the database for the piece from which this query comes. Markov models, based on Markov chains, can be used for modeling musical performances. A Markov chain is a stochastic process for which the parameter is discrete time values. In a Markov sequence of events, the probability of future states depends on the present state; in this case, states represent pitch (or a set of pitches) and duration (Birmingham et al., 2001). Query-by-humming is one of the more popular topics within the music information retrieval domain.

Audio retrieval-by-example for orchestral music aims at searching for acoustic similarity in an audio collection, based on analysis of the audio signal. Given an example audio document, other documents in a collection can be ranked by similarity on the basis of long-term structure; specifically, the variation of soft and louder passages, determined from the envelope of audio energy vs. time in one or more frequency bands (Foote, 2000). This research is a branch of audio retrieval by content. Audio query-by-example search also can be performed within a single document when searching for sounds similar to a selected sound event. Such a system for content-based audio retrieval can be based on a self-organizing feature map (i.e., a special kind of neural network designed by analogy with a simplified model of the neural connections in the brain and trained to find relationships in the data). Perceptual similarity can be assessed on the basis of spectral evolution in order to find sounds of similar timbre (Spevak & Polfreman, 2001). Neural networks are also used in other forms in audio information retrieval systems. For instance, time-delayed neural networks (i.e., neural nets with time-delay inputs) are applied, since they perform well in speech recognition applications (Meier et al., 2000). One application of audio retrieval-by-example is searching for a piece in a huge database of music pieces with the use of so-called audio fingerprinting technology that allows piece identification. Given a short passage transmitted, for instance, via car phone, the piece is extracted and, most important, information is also extracted on the performer and title linked to this piece in the database. In this way, the user may identify the piece of music with very high accuracy (95%) only on the basis of a small recorded (possibly noisy) passage.

Transcription of music is defined as writing down the musical notation for the sounds that constitute the investigated piece of music. Onset detection based on incoming energy in frequency bands and multi-pitch estimation based on spectral analysis may be used as the main elements of an automatic music transcription system. The errors in such a system may contain additional inserted notes, omissions, or erroneous transcriptions (Klapuri et al., 2001). Pitch tracking (i.e., estimation of the pitch of note events in a melody or a piece of music) is often performed in many music information retrieval systems. For polyphonic music,
855
TEAM LinG
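The Markov-chain idea behind query-by-humming can be sketched in a few lines of Python. The sketch below is purely illustrative and not taken from any cited system; the function names and the use of MIDI note numbers as pitch states are our own assumptions. It estimates first-order transition probabilities from a known melody and then scores how well a quantized sung query fits them.

```python
from collections import defaultdict

def transition_model(pitches):
    """Estimate first-order Markov transition probabilities from a pitch sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(pitches, pitches[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def score(query, model, floor=1e-6):
    """Probability of a quantized query melody under the transition model."""
    p = 1.0
    for a, b in zip(query, query[1:]):
        p *= model.get(a, {}).get(b, floor)  # unseen transitions get a small floor
    return p
```

A retrieval system of this kind would build one model per database piece and return the piece whose model gives the sung query the highest score.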
Music Information Retrieval
polyphonic pitch-tracking and timbre separation in digital audio are performed, with such applications as score-following and denoising of old analog recordings, which is also a topic of interest within music information retrieval. Wavelet analysis can be applied for this purpose, since it decomposes the signal in time-frequency space, and then musical notes can be extracted from the result of this decomposition (Popovic et al., 1995). Simultaneous polyphonic pitch and tempo tracking, aiming at automatically inferring a musical notation that lists the pitch and the time limits of each note, is a basis of automatic music transcription. A musical performance can be modeled for these purposes using dynamic Bayesian networks (i.e., directed graphical models of stochastic processes) (Cemgil et al., 2003). It is assumed that the observations may be generated by a hidden process that cannot be directly experimentally observed, and dynamic Bayesian networks represent the hidden and observed states in terms of state variables, which can have complex interdependencies. Dynamic Bayesian networks generalize hidden Markov models, which have one hidden node and one observed node per observation time.
•	Automatic characterizing of the rhythm and tempo of music and audio, revealing the tempo and the relative strength of particular beats, is a branch of research on automatic music transcription. Since highly structured or repetitive music has strong beat spectrum peaks at the repetition times, this allows tempo estimation and distinguishing between different kinds of rhythms at the same tempo. The tempo can be estimated using the beat spectral peak criterion (the lag of the highest peak exceeding an assumed time threshold) accurately to within 1% in the analysis window (Foote & Uchihashi, 2001).
•	Automatic classification of musical instrument sounds, aiming at accurate identification of the musical instruments playing in a given recording, based on various sound analysis and data mining techniques (Herrera et al., 2000; Wieczorkowska, 2001). This research is focused mainly on monophonic sounds, and sound mixes are usually addressed in the research on separation of sound sources. In most cases, sounds of instruments of definite pitch have been investigated, but recently, research on percussion also has been undertaken. Various analysis methods are used to parameterize sounds for instrument classification purposes, including time-domain description, Fourier, and wavelet analysis. Classifiers range from statistical and probabilistic methods, through learning by example, to artificial intelligence methods. Effectiveness of this research ranges from about 70% accuracy for instrument identification to more than 90% for instrument family (i.e., strings, winds, etc.), approaching 100% for discriminating impulsive and sustained sounds, thus even exceeding human performance. Such instrument sound classification can be included in automatic music transcription systems.
•	Sonification, in which utilities for intuitive auditory display (i.e., in audible form) are provided through a graphical user interface (Ben-Tal et al., 2002).
•	Generating human-like expressive musical performances with appropriately adjusted dynamics (i.e., loudness), rubato (variation in time limits of notes), and vibrato (changes of pitch) (Mantaras & Arcos, 2002), and identification of musical pieces representing different types of emotions (tenderness, sadness, joy, calmness, etc.) that music evokes. Emotions in music are gaining the interest of researchers recently, also including recognition of emotions in recordings.

The topics mentioned previously interrelate and sometimes partially overlap. For instance, both auditory scene analysis and recognition may take into account a very broad range of recordings containing numerous acoustic elements to identify and analyze. Query by humming requires automatic transcription of music, since the input audio samples first must be transformed into a form based on musical notation, describing basic melodic features of the query. Audio retrieval by example and automatic classification of musical instrument sounds are both branches of retrieval by content. Transcription of music requires not only pitch tracking, but also automatic characterizing of the rhythm and tempo, so these topics overlap. Pitch tracking is needed in many research topics, including music transcription, query by humming, and even automatic classification of musical instrument sounds, since pitch is one of the features characterizing instrumental sound. Sonification and generating human-like expressive musical performances both are related to sound synthesis, which is needed to create auditory display or emotional performance. All these topics are focused on a broad domain of music and its various aspects. Results of this research are not always easily measurable, especially in the case of synthesis-based topics, since they usually are validated via subjective tests. Other topics, like transcription of music, may produce errors of various importance (i.e., wrong pitch, length, omission, etc.), and comparison of the obtained transcript with the original score can be measured in many ways, depending on the considered criteria. The easiest estimation and comparison of results can be performed in
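The beat spectrum of Foote and Uchihashi is computed from audio self-similarity; as a rough stand-in for the peak-picking idea described above (our own simplification, not the published algorithm), the sketch below picks the strongest autocorrelation lag of an onset-strength envelope and converts it to a tempo. The tempo bounds and function name are illustrative assumptions.

```python
def tempo_from_envelope(envelope, frame_rate):
    """Estimate tempo (BPM) as the lag of the strongest autocorrelation peak."""
    n = len(envelope)
    mean = sum(envelope) / n
    e = [v - mean for v in envelope]
    min_lag = int(frame_rate * 60 / 240)  # ignore tempi above 240 BPM (assumed bound)
    max_lag = int(frame_rate * 60 / 40)   # ignore tempi below 40 BPM (assumed bound)
    best_lag, best = min_lag, float("-inf")
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        acf = sum(e[i] * e[i + lag] for i in range(n - lag))
        if acf > best:
            best_lag, best = lag, acf
    return 60.0 * frame_rate / best_lag  # lag in frames -> beats per minute
```

For an envelope with a clear periodic pulse, the strongest lag corresponds to the beat period, so a pulse every half second yields 120 BPM.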
case of query by example, where a very high recognition rate has been reached already (95% correct piece identification via audio fingerprinting), reaching commercial level.
The research on music information retrieval is gaining increasing interest from the scientific community, and investigation of further issues in this domain can be expected.

FUTURE TRENDS

Multimedia databases and library collections, expanding tremendously nowadays, need efficient tools for content-based search. Therefore, we can expect an intensification of research effort on music information retrieval, which may aid searching music data. Especially, tools for query-by-example and query-by-humming are needed, as well as tools for automatic music transcription, so these areas should be investigated broadly in the near future.

CONCLUSION

Music information retrieval is a broad range of research, focusing on various aspects of possible applications. The main domains include audio retrieval by content, automatic music transcription, denoising of old recordings, generating human-like performances, and so forth. The results of this research help users find the audio data they need, even if the users are not experienced musicians. Constantly growing audio resources evoke a demand for efficient tools to deal with this enormous amount of data; therefore, music information retrieval is becoming a dynamically developing field of research.

REFERENCES

Adams, N.H., Bartsch, M.A., & Wakefield, G.H. (2003). Coding of sung queries for music information retrieval. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA'03, New Paltz, New York.

Ben-Tal, O., Berger, J., Cook, B., Daniels, M., Scavone, G., & Cook, P. (2002). SONART: The sonification application research toolbox. Proceedings of the 2002 International Conference on Auditory Display, Kyoto, Japan.

Birmingham, W.P., Dannenberg, R.D., Wakefield, G.H., Bartsch, M.A., Bykowski, D., Mazzoni, D., Meek, C., Mellody, M., & Rand, B. (2001). MUSART: Music retrieval via aural queries. Proceedings of the 2nd Annual International Symposium on Music Information Retrieval ISMIR 2001, Bloomington, Indiana.

Bregman, A.S. (1990). Auditory scene analysis: The perceptual organization of sound. Cambridge, MA: MIT Press.

Cemgil, A.T., Kappen, B., & Barber, D. (2003). Generative model based polyphonic music transcription. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA'03, New Paltz, New York.

de Mantaras, R.L., & Arcos, J.L. (2002, Fall). AI and music: From composition to expressive performance. AI Magazine, 43-58.

Downie, J.S. (2001). Whither music information retrieval: Ten suggestions to strengthen the MIR research community. Proceedings of the Second Annual International Symposium on Music Information Retrieval: ISMIR 2001, Bloomington, Indiana.

Fingerhut, M. (1997). Le multimédia dans la bibliothèque. Culture et recherche, 61. Retrieved 2004 from http://catalogue.ircam.fr/articles/textes/Fingerhut97a/

Foote, J. (1999). An overview of audio information retrieval. Multimedia Systems, 7(1), 2-11.

Foote, J. (2000). ARTHUR: Retrieving orchestral music by long-term structure. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.

Foote, J., & Uchihashi, S. (2001). The beat spectrum: A new approach to rhythm analysis. Proceedings of the International Conference on Multimedia and Expo ICME 2001, Tokyo, Japan.

Herrera, P., Amatriain, X., Batlle, E., & Serra, X. (2000). Towards instrument segmentation for music content description: A critical review of instrument classification techniques. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.

International Organization for Standardization ISO/IEC JTC1/SC29/WG11. (2003). MPEG-7 overview. Retrieved 2004 from http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm

Klapuri, A., Virtanen, T., Eronen, A., & Seppänen, J. (2001). Automatic transcription of musical recordings. Proceedings of the Consistent & Reliable Acoustic Cues for Sound Analysis CRAC Workshop, Aalborg, Denmark.
Meier, U., Stiefelhagen, R., Yang, J., & Waibel, A. (2000). Towards unrestricted lip reading. International Journal of Pattern Recognition and Artificial Intelligence, 14(5), 571-586.

Peltonen, V., Tuomi, J., Klapuri, A., Huopaniemi, J., & Sorsa, T. (2002). Computational auditory scene recognition. Proceedings of the International Conference on Acoustics, Speech and Signal Processing ICASSP, Orlando, Florida.

Popovic, I., Coifman, R., & Berger, J. (1995). Aspects of pitch-tracking and timbre separation: Feature detection in digital audio using adapted local trigonometric bases and wavelet packets. Center for Studies in Music Technology. Retrieved 2004 from http://www-ccrma.stanford.edu/~brg/research/pc/pitchtrack.html

Rosenthal, D., & Okuno, H.G. (Eds.). (1998). Computational auditory scene analysis. Proceedings of the IJCAI-95 Workshop, Mahwah, New Jersey.

Spevak, C., & Polfreman, R. (2001). Sound spotting: A frame-based approach. Proceedings of the Second Annual International Symposium on Music Information Retrieval: ISMIR 2001, Bloomington, Indiana.

Vincent, E., Rodet, X., Röbel, A., Févotte, C., & Carpentier, É.L. (2003). A tentative typology of audio source separation tasks. Proceedings of the 4th Symposium on Independent Component Analysis and Blind Source Separation, Nara, Japan.

Virtanen, T. (2003). Algorithm for the separation of harmonic sounds with time-frequency smoothness constraint. Proceedings of the 6th International Conference on Digital Audio Effects DAFX-03, London, UK.

Viste, H., & Evangelista, G. (2003). Separation of harmonic instruments with overlapping partials in multi-channel mixtures. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA'03, New Paltz, New York.

Wieczorkowska, A. (2001). Musical sound classification based on wavelet analysis. Fundamenta Informaticae Journal, 47(1/2), 175-188.

Wieczorkowska, A., & Ras, Z. (2001). Audio content description in sound databases. Proceedings of the First Asia-Pacific Conference on Web Intelligence, WI 2001, Maebashi City, Japan.

KEY TERMS

Digital Audio: Digital representation of the sound waveform, recorded as a sequence of discrete samples representing the intensity of the sound pressure wave at a given time instant. Sampling frequency describes the number of samples recorded in each second, and bit resolution describes the number of bits used to represent the quantized (i.e., integer) value of each sample.

Fourier Analysis: Mathematical procedure for spectral analysis, based on the Fourier transform, that decomposes a signal into sine waves representing the frequencies present in the spectrum.

Information Retrieval: The actions, methods, and procedures for recovering stored data to provide information on a given subject.

Metadata: Data about data (i.e., information about the data).

MIDI: Musical Instrument Digital Interface. MIDI is a common set of hardware connectors and digital codes used to interface electronic musical instruments and other electronic devices. MIDI controls actions such as note events, pitch bends, and the like, while the sound is generated by the instrument itself.

Music Information Retrieval: Multi-disciplinary research on retrieving information from music.

Pitch Tracking: Estimation of the pitch of note events in a melody or a piece of music.

Sound: A physical disturbance in the medium through which it is propagated. The fluctuation may change routinely, and such a periodic sound is perceived as having pitch. The audible frequency range is from about 20 Hz (hertz, or cycles per second) to about 20 kHz. A harmonic sound wave consists of frequencies that are integer multiples of the first component (fundamental frequency), corresponding to the pitch. The distribution of frequency components is called the spectrum. The spectrum and its changes in time can be analyzed using mathematical transforms, such as the Fourier or wavelet transform.

Wavelet Analysis: Mathematical procedure for time-frequency analysis, based on the wavelet transform, that decomposes a signal into shifted and scaled versions of the original function, called a wavelet.
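The Digital Audio definition above (sampling frequency plus bit resolution) can be illustrated with a short sketch; the helper name is our own, and the signed-integer quantization range is an assumption matching common PCM audio.

```python
import math

def sample_sine(freq_hz, duration_s, sampling_rate, bit_resolution):
    """Sample a sine wave and quantize each sample to a signed integer of the given bit depth."""
    amplitude = 2 ** (bit_resolution - 1) - 1   # e.g., 32767 for 16-bit audio
    n = int(duration_s * sampling_rate)         # sampling frequency = samples per second
    return [round(amplitude * math.sin(2 * math.pi * freq_hz * i / sampling_rate))
            for i in range(n)]
```

For example, 10 ms of a 440 Hz tone at 44,100 Hz and 16 bits yields 441 integer samples in the range −32767 to 32767.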
Negative Association Rules in Data Mining

David Taniar
Monash University, Australia

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
implies wind, or ~OilStock → ~PetrolStock, which says that if the price of oil shares does not go up, petrol shares wouldn't go up either.
In negative association rules mining, the number of possible generated negative rules is vast in comparison to positive association rules mining. It is a critical issue to distinguish only the most interesting ones among the candidates.
A simple and straightforward method to search for negative associations would be to add the negations of absent items to each database transaction and treat them as additional items. Then, the traditional positive association mining algorithms could be run on the extended database. This approach works in domains with a very limited number of database attributes. A good example is weather prediction. The database only contains 10-15 different factors like fog, rain, sunshine, and so forth. A database record would look like sunshine, ~fog, ~rain, windy. Thus, the number of database items will be doubled, and the number of itemsets will increase significantly. Besides, the support of negations of items can be calculated based on the support of positive items. Then, specially created algorithms are required.
The approach works in tasks like weather prediction or limited stock market analyses but is hardly applicable to domains with thousands of different database items (sales, healthcare, marketing). Numerous and senseless negative rules could be generated then. That is why special approaches are required to distinguish interesting negative association rules.
There are two categories in mining negative association rules. In category I, a special measure for negative association rules is created. In category II, the data or first-generated rules are analyzed to further produce only the most interesting rules with no computational costs.
First, we explain category I. A measure to distinguish the most valuable negative rules will be employed. The interest measure may be applied to negative rules (Wu, Zhang & Zhang, 2002). The interest measure was first offered for positive rules. The measure is defined by the formula: Interest(A→B) = support(A∪B) − support(A) × support(B). The negative rules are generated from infrequent itemsets. The mining process for negative rules starts with 1-frequent itemsets.
A more complicated way to search for negative association rules is to employ statistical theories (Brin, Motwani & Silverstein, 1997). Correlations among database items are calculated. If the correlation is positive, a positive association rule will be obtained. If the correlation is negative, a negative association rule will be obtained. The stronger the correlation, the stronger the generated rule. In this approach, the chi-squared statistic is employed to calculate the correlations among database items. Each different database transaction (market basket) denotes a cell in a contingency table. The chi-squared statistic is calculated, which is, in short, a normalized deviation from expectation for each cell. From the obtained value, the strength of the correlation between database items is estimated.
Now, we explain category II. The data or negative rules generated in the initial step are analyzed first, and then the obtained information is employed in the next steps of the generation process.
Mining frequent negative itemsets starts with itemsets that contain n items and only one negation of an item; for example, {ABC~D, AB~CD}, where ~C, ~D means absence of the item C or D. Then, the produced rules with n items and one negation are analyzed, and some information is obtained that allows the next step, when producing rules with n items and two negations (A~BC~D), to disregard some itemsets without even considering them (Fortes, Balcázar & Morales, 2001). Here n takes on values from 1 to the overall number of different items in the database. The negative rules mining stops when the maximum number of negations m in rules has been reached, m provided by the user, or when all negative rules have been generated, if m has not been provided.
Another approach in category II is to consider the hierarchy of items. An example of a database hierarchy is shown in Figure 2. It is supposed that database items are organized into an ancestor-to-descendant hierarchy. First, the negative rules on the very top level are generated, and then those on the lower levels. When proceeding to a lower level of the hierarchy, the information obtained from rules on the higher level is utilized (Daly & Taniar, 2003, 2004). The hierarchy approach makes the rules more general, which is crucial for negative association. Any further exploration can be ceased when no additional knowledge will be extracted on the lower levels.
For instance, if a rule {Beef → Not Juice} is generated, it is not interesting what kind of juice (because Not Juice is the negation of any kind of juice). In contrast, for the positive rule {Beef → Juice}, it is required to discover a specific kind of juice and go deeper into the levels.

Figure 2. Hierarchy of items

Items
├── Stationary
│   ├── Writing Instruments: Pen, Pencil
│   ├── Paper
│   └── Ink
└── Food & Drinks
    ├── Drinks
    │   ├── Juice: Apple Juice, Orange Juice
    │   └── Coke
    └── Food
        ├── Fish: Trout
        ├── Meat: Beef
        ├── Pastry: Bread
        └── Dairy: Milk, Butter
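The interest measure defined above, Interest(A→B) = support(A∪B) − support(A) × support(B), translates directly into code. The sketch below is illustrative only; the "~" string convention for negated items and the function names are our own assumptions, not from the cited work.

```python
def support(itemset, transactions):
    """Fraction of transactions in which every item holds; '~X' denotes the absence of X."""
    def holds(item, t):
        return item[1:] not in t if item.startswith("~") else item in t
    return sum(all(holds(i, t) for i in itemset) for t in transactions) / len(transactions)

def interest(A, B, transactions):
    """Interest(A -> B) = support(A u B) - support(A) * support(B)."""
    return support(A | B, transactions) - support(A, transactions) * support(B, transactions)
```

A positive interest value means A and B occur together more often than independence would predict; values near zero mark rules that carry little information.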
The hierarchy approach, though, requires additional information about the database, namely the way items are organized into a hierarchy, and an expert's opinion may be needed.
After the negative association rules have been generated, some reduction techniques may be applied to refine the rules. For instance, the reduction techniques could verify the true support of items (not ancestors) or generalize the sets of negative rules.

FUTURE TRENDS

There has been some research in the area of negative association mining. Negative association has obviously provoked more complicated issues than positive association mining. In order to overcome the difficulties of negative association mining, the research community endeavors to invent new methods and algorithms to achieve the desirable efficiency. Following are the limitations and trends in negative association mining.

Issues/Open Research Areas

•	A vast number of candidate rules for negative association rules in comparison with positive association rules.

In comparison with positive association mining, in negative association rules mining a greater number of candidate itemsets is obtained, which would make it technically impossible to evaluate them all and distinguish the rules with high support/confidence/interest. Besides, the generated rules may not provide efficient information. Special approaches, measures, and algorithms should be developed to solve the issue.

•	Hardware limitations.

Hardware limitations are crucial for negative association rules mining for the following reasons: a huge number of calculations and assessments is required from the CPU; storage for vast frequent itemset trees is needed in main memory; and rapid data interchange with databases is required.

•	Vast data sets.

With the development of hardware components, vast databases could be handled in sufficient time for the companies.

•	A need for advanced research in negative association mining.

Some research has been done in negative association rules mining, but there are many unexplored subareas, and advanced research is essential for the area's development.

•	A need for special interest measures and algorithms for negative association vs. positive association.

New interest measures should be invented and new approaches for negative rules discovered, ones that would take advantage of the special properties of the negative rules.

•	A need for parallel algorithms for negative association rules mining.

Parallel algorithms have taken data mining to a new extent of mining abilities. As negative association is a relatively new research area, parallel data mining algorithms for it have not been developed yet, and they are absolutely necessary.

•	A need for new sophisticated reduction techniques.

After the set of negative rules has been generated, one may apply some reduction and refining techniques to the set to make sure the rules are not redundant.

These issues require additional research in the area to be conducted and more scientists to contribute their knowledge, time, and inspiration.

Current Trends

•	Use of various approaches in negative association mining.

Researchers are currently trying to invent distinctive approaches to generate the negative rules and to make sure they provide valuable information. A straightforward approach is not acceptable for the negative rules; support/confidence measures are not enough to distinguish the negative rules. Some novel or adapted measures, particularly interest measures, will need to be utilized. Data structure analysis also forms an important part of the process of negative rule generation.

•	Specialized mining algorithms development.

The algorithms for positive association rules mining are not suitable for negative rules, so new and specialized algorithms often are developed by the researchers currently working in negative rules mining.
•	Rules reduction development.

Rules reduction techniques have been developed to refine the final set of negative rules. Reduction techniques verify the rules' quality or generalize the rules.

The researchers are attempting to overcome the vast search space in negative association mining; they look at the issue from different points of view to discover what makes the negative rules more interesting and what measures could distinguish the highly valuable rules from the rest.

CONCLUSION

This article is a review of the research that has been done on negative association rules. Negative association rules have brought interesting challenges and research opportunities. This is an open area for future research.
This article describes the main issues and approaches in negative association rules mining derived from the literature. One of the main issues is the vast number of candidate rules for negative association rules in comparison with positive association rules.
There are two categories in mining negative association rules. In category I, a special measure for negative association rules is created. In category II, the data or first-generated rules are analyzed to further produce only the most interesting rules with no computational costs.

REFERENCES

Agrawal, R. et al. (1996). Fast discovery of association rules. In U. Fayyad et al. (Eds.), Advances in knowledge discovery and data mining. American Association for Artificial Intelligence Press.

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the Association for Computing Machinery, Special Interest Group on Management of Data, International Conference on Management of Data.

Chen, M., Han, J., & Yu, P. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866-883.

Daly, O., & Taniar, D. (2003). Mining multiple-level negative association rules. Proceedings of the International Conference on Intelligent Technologies (InTech'03).

Daly, O., & Taniar, D. (2004). Exception rules mining based on negative association rules. Computational Science and Its Applications, 3046, 543-552.

Fayyad, U. et al. (1996). Advances in knowledge discovery and data mining. American Association for Artificial Intelligence Press.

Fortes, I., Balcázar, J., & Morales, R. (2001). Bounding negative information in frequent sets algorithms. Proceedings of the 4th International Conference on Discovery Science.

Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1991). Knowledge discovery in databases: An overview. American Association for Artificial Intelligence Press.

Mannila, H., Toivonen, H., & Verkamo, A. (1994). Efficient algorithms for discovering association rules. Proceedings of the American Association for Artificial Intelligence Workshop on Knowledge Discovery in Databases.

Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. American Association for Artificial Intelligence Press.

Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. Proceedings of the 21st International Conference on Very Large Data Bases.

Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. Proceedings of the 21st Very Large Data Bases Conference.

Wu, X., Zhang, C., & Zhang, S. (2002). Mining both positive and negative association rules. Proceedings of the 19th International Conference on Machine Learning.
KEY TERMS

Confidence: The rule A→B has confidence c if c% of the transactions that contain A also contain B.

Database Item: An item/entity occurring in the database.

Frequent Itemsets: Itemsets that have support at least equal to minsup.

Itemset: A set of database items.

Negative Association Rules: Rules of the kind A→B, where A and B are frequent negative itemsets.

Negative Itemsets: Itemsets that contain both items and their negations.

Support: The rule A→B has support s if s% of all transactions contain both A and B.
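The Support and Confidence definitions above translate directly into code; a minimal sketch using Python sets (the function names are illustrative):

```python
def support(itemset, transactions):
    """Fraction of all transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    """Among transactions containing A, the fraction that also contain B."""
    containing_a = [t for t in transactions if A <= t]
    return sum(B <= t for t in containing_a) / len(containing_a)
```

Here `itemset <= t` is Python's subset test, so a transaction counts toward the support of an itemset exactly when it contains all of its items.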
There are many different neural network models that have been developed over the last fifty years or so to achieve these tasks of prediction, classification, and clustering. Broadly speaking, these models can be grouped according to supervised learning algorithms (for prediction and classification) and unsupervised learning algorithms (for clustering). This paper focuses on the former paradigm. We refer the interested reader to Haykin (1994) for a detailed account of many neural network models.
According to a recent study (Wong, Jiang, & Lam, 2000), over fifty percent of reported neural network business application studies utilise multilayered feedforward neural networks (MFNNs) with the backpropagation learning rule (Werbos, 1974; Rumelhart & McClelland, 1986). This type of neural network is popular because of its broad applicability to many problem domains of relevance to business and industry: principally prediction, classification, and modeling. MFNNs are appropriate for solving

MAIN THRUST

A number of significant issues will be discussed, and some guidelines for successful training of neural networks will be presented in this section.

Figure 1. Architecture of MFNN (note: not all weights are shown). The figure depicts inputs x1, x2, x3, ..., a hidden layer of J neurons with outputs y1, ..., yJ and input-to-hidden weights w, and an output layer of K neurons with outputs z1, ..., zK and hidden-to-output weights v.
Neural Networks for Prediction and Classification
STEP 1: Randomly select an input pattern x to present to the MFNN through the input layer.

STEP 2: Calculate the net inputs and outputs of the hidden layer neurons:
$\mathrm{net}_j^h = \sum_{i=1}^{N+1} w_{ji} x_i, \qquad y_j = f(\mathrm{net}_j^h)$

STEP 3: Calculate the net inputs and outputs of the K output layer neurons:
$\mathrm{net}_k^o = \sum_{j=1}^{J+1} v_{kj} y_j, \qquad z_k = f(\mathrm{net}_k^o)$

STEP 4: Update the weights in the output layer (for all k, j pairs):
$v_{kj} \leftarrow v_{kj} + c\,(d_k - z_k)\,z_k(1 - z_k)\,y_j$

STEP 5: Update the weights in the hidden layer (for all i, j pairs):
$w_{ji} \leftarrow w_{ji} + c_2\,y_j(1 - y_j)\,x_i \sum_{k=1}^{K} (d_k - z_k)\,z_k(1 - z_k)\,v_{kj}$

and repeat from STEP 1 until all input patterns have been presented (one epoch).
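The five steps above can be sketched directly in plain Python. This is an illustrative implementation under our own assumptions: the logistic sigmoid for the activation f (whose derivative gives the z(1−z) and y(1−y) factors in steps 4 and 5), a single learning constant c for both layers, weights as nested lists, and the bias handled as an extra input fixed at 1.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W, V):
    """STEPS 2-3: propagate one pattern; W is J x (N+1), V is K x (J+1)."""
    xb = list(x) + [1.0]                                   # append bias input
    y = [sigmoid(sum(w * xi for w, xi in zip(Wj, xb))) for Wj in W]
    yb = y + [1.0]                                         # append bias for output layer
    z = [sigmoid(sum(v * yj for v, yj in zip(Vk, yb))) for Vk in V]
    return xb, y, yb, z

def train_epoch(samples, W, V, c=0.5):
    """One epoch: present every (input, desired-output) pattern once (STEPS 1-5)."""
    for x, d in samples:
        xb, y, yb, z = forward(x, W, V)
        delta = [(dk - zk) * zk * (1 - zk) for dk, zk in zip(d, z)]
        for k in range(len(V)):                            # STEP 4: output-layer weights
            for j in range(len(yb)):
                V[k][j] += c * delta[k] * yb[j]
        for j in range(len(W)):                            # STEP 5: hidden-layer weights
            back = sum(delta[k] * V[k][j] for k in range(len(V)))
            g = y[j] * (1 - y[j]) * back
            for i in range(len(xb)):
                W[j][i] += c * g * xb[i]
```

One call to `train_epoch` presents every pattern once, matching the definition of an epoch above; training repeats epochs until the error on the training set is acceptably small.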
Critical Issues for Neural Networks

Neural networks have not been without criticism, and there are a number of important limitations that need careful attention. At this stage it is still a trial and error process to obtain the optimal model to represent the relationships in a dataset. This requires appropriate selection of a number of parameters including learning rate, momentum rate (if an additional momentum term is added to steps 4 and 5 of the algorithm), choice of activation function, as well as the optimal neural architecture. If the architecture is too large, with too many hidden neurons, the MFNN will find it very easy to memorize the training data, but will not have generalized its knowledge to other datasets (out-of-sample or test sets). This problem is known as overtraining. Another limitation is the gradient descent nature of the backpropagation learning algorithm, which causes the network weights to become trapped in local minima of the error function being minimized. Finally, neural networks are considered inappropriate for certain problem domains where insight and explanation of the model is required. Some work has been done on extracting rules from trained neural networks (Andrews, Diederich, & Tickle, 1995), but other data mining techniques like rule induction may be better suited to such situations.

Guidelines for Successful Training

Successful prediction and classification with MFNNs requires careful attention to two main stages:

1. Learning: The developed model must represent an adequate fit of the training data. The data itself must contain the relationships that the neural network is trying to learn, and the neural network model must be able to derive the appropriate weights to represent these relationships.
2. Generalization: The developed model must also perform well when tested on new data to ensure that it has not simply memorized the training data characteristics. It is very easy for a neural network model to overfit the training data, especially for small data sets. The architecture needs to be kept small, and key validation procedures need to be adopted to ensure that the learning can be generalized.

Within each of these stages there are a number of important guidelines that can be adopted to ensure effective learning and successful generalization for prediction and classification problems (Remus & O'Connor, 2001). To ensure successful learning:

• Prepare the Data Prior to Learning the Neural Network Model: A number of pre-processing steps may be necessary, including cleansing the data; removing outliers; determining the correct level of summarisation; and converting non-numeric data (Pyle, 1999).
• Normalise, Scale, Deseasonalise and Detrend the Data Prior to Learning: Time series data often needs to be deseasonalised and detrended to enable the neural network to learn the true patterns in the data (Zhang, Patuwo, & Hu, 1998).
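As an illustration of the detrending and scaling guideline, one common approach (an assumption here, not prescribed by the article) is to first-difference a series to remove its trend, and to scale values into a narrow range suited to sigmoid output units:

```python
def detrend(series):
    """Remove trend by first-differencing: d_t = x_t - x_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]

def minmax_scale(series, lo=0.1, hi=0.9):
    """Scale values linearly into [lo, hi]; a range inside (0, 1)
    avoids saturating sigmoid units at their asymptotes."""
    mn, mx = min(series), max(series)
    return [lo + (hi - lo) * (v - mn) / (mx - mn) for v in series]
```

Deseasonalising (e.g. subtracting per-season averages) would be applied in the same spirit before the network is trained.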
• Ensure that the MFNN Architecture is Appropriate to Learn the Data: If there are not enough hidden neurons, the MFNN will be unable to represent the relationships in the data. Most commercial software packages extend the standard 3-layer MFNN architecture shown in Figure 1 to consider additional layers of neurons, and sometimes include feedback loops to provide enhanced learning capabilities. The number of input dimensions can also be reduced to improve the efficiency of the architecture if inclusion of some variables does not improve the learning.
• Experiment with the Learning Parameters to Ensure that the Best Learning is Produced: There are several parameters in the backpropagation learning equations that require selection and experimentation. These include the learning rate c, the equation of the function f() and its gradient, and the values of the initial weights.
• Consider Alternative Learning Algorithms to Backpropagation: Backpropagation is a gradient descent technique that guarantees convergence only to a local minimum of the error function. Other local optimization techniques, such as the Levenberg-Marquardt algorithm, have gained popularity due to their increased speed and reduced memory requirements (Kinsella, 1992). Recently, researchers have used more sophisticated search strategies such as genetic algorithms and simulated annealing in an effort to find globally optimal weight values (Sexton & Dorsey, 2000).

To ensure successful generalization:

• Extract a Test Set from the Training Data: Commonly 20% of the training data is reserved as a test set. The neural network is only trained on 80% of the data, and the degree to which it has learnt or memorized the data is gauged by the measured performance on the test set. When ample additional data is available, a third group of data known as the validation set is used to evaluate the generalization capabilities of the learnt model. For time series prediction problems, the validation set is usually taken as the most recently available data, and provides the best indication of how the developed model will perform on future data. When there is insufficient data to extract a test set and leave enough training data for learning, cross-validation sets are used. This involves randomly extracting a test set, developing a neural network model based on the remaining training data, and repeating the process with several random divisions of the data. The reported results are based on the average performance of all randomly extracted test sets. The most popular method is known as ten-fold cross-validation, and involves repeating the approach for ten distinct subsets of the data.
• Avoid Unnecessarily Large and Complex Architectures: An architecture containing a large number of hidden layers and hidden neurons results in more weights than a smaller architecture. Since the weights correspond to the degrees of freedom or number of parameters the model has to fit the data, it is very easy for such large architectures to overfit the training data. For the sake of future generalization of the model, the architecture should therefore be only as large as is required to learn the data and achieve an acceptable performance on all data sets (training, test, and validation where available).

When these guidelines are observed, the chances of developing a MFNN model that learns the training data effectively and generalizes its learning on new data are greatly improved. Most commercially available neural network software packages include features to facilitate adherence to these guidelines.

FUTURE TRENDS

Despite the successful application of neural networks to a wide range of application areas, there is still much research that continues to improve their functionality. Specifically, research continues in the development of hardware models (chips or specialized analog devices) that enable neural networks to be implemented rapidly in industrial contexts. Other research attempts to connect neural networks back to their roots in neurophysiology, and seeks to improve the biological plausibility of the models. On-line learning of neural network models that are more effective in situations when the data is dynamically changing will also become increasingly important. A useful discussion of the future trends of neural networks can be found at a virtual workshop discussion: http://www.ai.univie.ac.at/neuronet/workshop/.

CONCLUSION

Over the last decade or so, we have witnessed neural networks come of age. The idea of learning to solve complex pattern recognition problems using an intelligent data-driven approach is no longer simply an interesting challenge for academic researchers. Neural networks have proven themselves to be a valuable tool across a wide range of application areas. As a critical
component of most data mining systems, they are also changing the way organizations view the relationship between their data and their business strategy.
The multilayered feedforward neural network (MFNN) has been presented as the most common neural network employing supervised learning to model the relationships between inputs and outputs. This dominant neural network model finds application across a broad range of prediction and classification problems. A series of critical guidelines have been provided to facilitate the successful application of these neural network models.

Smith, K. A., & Gupta, J. N. D. (Eds.). (2002). Neural networks in business: Techniques and applications. Hershey, Pennsylvania: Idea Group Publishing.

Weinstein, J. N., Myers, T., Casciari, J. J., Buolamwini, J., & Raghavan, K. (1994). Neural networks in the biomedical sciences: A survey of 386 publications since the beginning of 1991. Proceedings of the World Congress on Neural Networks, 1 (pp. 121-126).

Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University, Cambridge, MA.
Learning Algorithm: The method used to change the weights so that the error is minimized. Training data is repeatedly presented to the MFNN through the input layer, and the output of the MFNN is calculated and compared to the desired output. Error information is used to determine which weights need to be modified, and by how much. There are several parameters involved in the learning algorithm, including learning rate, momentum factor, initial weights, etc.

Neural Model: Specifying a neural network model involves declaring the architecture and activation function types.

Overtraining: When the MFNN performs significantly better on the training data than on an out-of-sample test data set, it is considered to have memorized the training data and to be overtrained. This can be avoided by following the guidelines presented above.
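The hold-out procedure that underlies these terms can be sketched as follows; the 20% test fraction follows the guideline given in the article, while the function names and the tolerance used to flag overtraining are illustrative assumptions:

```python
import random

def train_test_split(data, test_frac=0.2, seed=42):
    """Reserve a random test set (commonly 20%) from the available data."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def overtrained(train_error, test_error, tolerance=0.05):
    """Flag overtraining when the test error exceeds the training error by
    more than the tolerance -- the symptom described under Overtraining."""
    return (test_error - train_error) > tolerance
```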
Nilesh Mishra
Indian Institute of Technology, India
Mayank Vatsa
Indian Institute of Technology, India
Richa Singh
Indian Institute of Technology, India
P. Gupta
Indian Institute of Technology, India
INTRODUCTION

The most commonly used protection mechanisms today are based on either what a person possesses (e.g. an ID card) or what the person remembers (like passwords and PIN numbers). However, there is always a risk of passwords being cracked by unauthenticated users and ID cards being stolen, in addition to shortcomings like forgotten passwords and lost ID cards (Huang & Yan, 1997). To avoid such inconveniences, one may opt for the new methodology of biometrics, which, though expensive, will be almost infallible, as it uses some unique physiological and/or behavioral (Huang & Yan, 1997) characteristics possessed by an individual for identity verification. Examples include signature, iris, face, and fingerprint recognition based systems.
The most widespread and legally accepted biometric among the ones mentioned, especially in monetary transaction related identity verification areas, is carried out through handwritten signatures, which belong to behavioral biometrics (Huang & Yan, 1997). This technique, referred to as signature verification, can be classified into two broad categories - online and off-line. While online deals with both static (for example: number of black pixels, length and height of the signature) and dynamic features (such as acceleration and velocity of signing, pen tilt, pressure applied) for verification, the latter extracts and utilizes only the static features (Ramesh & Murty, 1999). Consequently, online is much more efficient in terms of accuracy of detection as well as time than off-line. But, since online methods are quite expensive to implement, and also because many other applications still require the use of off-line verification methods, the latter, though less effective, is still used in many institutions.

BACKGROUND

Starting from banks, signature verification is used in many other financial exchanges, where an organization's main concern is not only to give quality services to its customers, but also to protect their accounts from being illegally manipulated by forgers.
Forgeries can be classified into four types: random, simple, skilled and traced (Ammar, Fukumura & Yoshida, 1988; Drouhard, Sabourin, & Godbout, 1996). Generally, online signature verification methods display a higher accuracy rate (closer to 99%) than off-line methods (90-95%) in case of all the forgeries. This is because, in off-line verification methods, the forger has to copy only the shape (Jain & Griess, 2000) of the signature. On the other hand, in case of online verification methods, since the hardware used captures the dynamic features of the signature as well, the forger has to not only copy the shape of the signature, but also the temporal characteristics (pen tilt, pressure applied, velocity of signing, etc.) of the person whose signature is to be forged. In addition, he has to simultaneously hide his own inherent style of writing the signature, thus making it extremely difficult to deceive the device in case of online signature verification.
Despite greater accuracy, online signature recognition is not generally encountered in many parts of the world compared to off-line signature recognition, because it cannot be used everywhere, especially where signatures have to be written in ink, e.g. on cheques, where only off-line methods will work. Moreover, it requires some extra and special hardware (e.g. pressure sensitive signature pads in online methods vs. optical scanners in off-line methods), which is not only expensive but also has a fixed and short life span.
Off-Line Signature Recognition
Figure 2. Noise removal using median filter: (a) gray scale, (b) noise free

Figure 3. Converting grayscale image into binary image: (a) gray scale, (b) binary
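The two preprocessing operations shown in Figures 2 and 3 can be sketched as follows. This is an illustrative implementation, not the authors' code: a 3x3 median window is assumed for noise removal, and binarization assumes dark ink (intensities below a threshold map to 1):

```python
def median_filter(img):
    """3x3 median filter for noise removal; border pixels are left unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[j][i]
                            for j in range(y - 1, y + 2)
                            for i in range(x - 1, x + 2))
            out[y][x] = window[4]      # median of the 9 neighbouring intensities
    return out

def binarize(img, threshold=128):
    """Convert a grayscale image into binary: 1 for ink (dark), 0 for background."""
    return [[1 if p < threshold else 0 for p in row] for row in img]
```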
where S(x,y) is the binary image and R(x,y) is the grayscale image.

• Smoothening: The noise-removed image may have small connected components or small gaps, which may need to be filled up. It is done using a binary mask obtained by thresholding, followed by morphological operations (which include both erosion and dilation) (Huang & Yan, 1997; Ismail & Gad, 2000; Ramesh & Murty, 1999).
• Extracting the High Pressure Region Image: Ammar et al. (Ammar, Fukumura, & Yoshida, 1986) have used the high pressure region as one of the prominent features for detecting skilled forgeries. It is the area of the image where the writer gives special emphasis, reflected in terms of higher ink density (more specifically, higher gray level intensities than the threshold chosen). The threshold is obtained as follows:

$Th_{HPR} = I_{min} + 0.75\,(I_{max} - I_{min})$,

where $I_{min}$ and $I_{max}$ are the minimum and maximum grayscale intensity values.
• Thinning or Skeletonization: This process is carried out to obtain a single pixel thick image from the binary signature image. Different researchers have used various algorithms for this purpose (Ammar, Fukumura, & Yoshida, 1988; Baltzakis & Papamarkos, 2001; Huang & Yan, 1997; Ismail & Gad, 2000).

Feature Extraction

Most of the features can be classified in two categories - global and local features.

• Global Features: Ismail and Gad (2000) have described global features as characteristics which identify or describe the signature as a whole. They are less responsive to small distortions and hence are less sensitive to noise as well. Examples include width and height of individual signature components, width to height ratio, total area of black pixels in the binary and high pressure region (HPR) images, horizontal and vertical projections of signature images, baseline, baseline shift, relative position of global baseline and centre of gravity with respect to width of the signature, number of cross and edge points, circularity, central line, corner curve and corner line features, slant, run lengths of each scan of the components of the signature, kurtosis (horizontal and vertical), skewness, relative kurtosis, relative skewness, relative horizontal and vertical projection measures, envelopes, and individual stroke segments (Ammar, Fukumura, & Yoshida, 1990; Bajaj & Chaudhury, 1997; Baltzakis & Papamarkos, 2001; Fang et al., 2003; Huang & Yan, 1997; Ismail & Gad, 2000; Qi & Hunt, 1994; Ramesh & Murty, 1999; Yacoubi, Bortolozzi, Justino, & Sabourin, 2000; Xiao & Leedham, 2002).
• Local Features: Local features are confined to a limited portion of the signature, which is obtained by dividing the signature image into grids or treating each individual signature component as a separate entity (Ismail & Gad, 2000). In contrast to global features, they are responsive to small distortions like dirt, but are not influenced by other regions of the signature. Hence, though extraction of local features requires more computations, they are much more precise. Many of the global features have their local counterparts as well. Examples of local features include width and height of individual signature components, local gradients such as area of black pixels in high pressure region and binary images, horizontal and vertical projections, number of cross and edge points, slant, relative position of baseline, pixel distribution, envelopes, etc. of individual grids or components (Ammar, Fukumura, & Yoshida, 1990; Bajaj & Chaudhury, 1997; Baltzakis & Papamarkos, 2001; Huang & Yan, 1997; Ismail & Gad, 2000; Qi & Hunt, 1994; Ramesh & Murty, 1999; Yacoubi, Bortolozzi, Justino, & Sabourin, 2000).

Figure 4. Extracting the high pressure region image
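The high pressure region extraction can be illustrated directly from the threshold equation Th_HPR = I_min + 0.75(I_max - I_min). In this sketch (function names are illustrative) the image is a list of rows of grayscale intensities, and pixels at or above the threshold form the HPR mask, per the article's convention that high pressure shows as higher gray levels:

```python
def hpr_threshold(image):
    """Compute Th_HPR = I_min + 0.75 * (I_max - I_min)."""
    pixels = [p for row in image for p in row]
    i_min, i_max = min(pixels), max(pixels)
    return i_min + 0.75 * (i_max - i_min)

def high_pressure_region(image):
    """Binary mask marking pixels whose intensity reaches the HPR threshold."""
    th = hpr_threshold(image)
    return [[1 if p >= th else 0 for p in row] for row in image]
```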
Table 1. Off-line signature verification approaches

Ammar, Fukumura and Yoshida (1988)
- Mode of verification: Statistical method -- Euclidean distance and threshold
- Database: 10 genuine signatures per person from 20 people
- Feature extraction: Global baseline, upper and lower extensions, slant features, local features (e.g. local slants) and pressure feature
- Results: 90% verification rate

Huang and Yan (1997)
- Mode of verification: Neural network based -- multilayer perceptron based neural networks trained and used
- Database: A total of 3528 signatures (including genuine and forged)
- Feature extraction: Signature outline, core feature, ink distribution, high pressure region, directional frontiers feature; area of feature pixels in core, outline, high pressure region, directional frontiers, coarse, and fine ink
- Results: 99.5% for random forgery and 90% for targeted forgeries

Ramesh and Murty (1999)
- Mode of verification: Genetic algorithms used for obtaining genetically optimized weights for weighted feature vector
- Database: 650 signatures; 20 genuine and 23 forged signatures from 15 people
- Feature extraction: Global geometric features (aspect ratio, width without blanks, slant angle, vertical centre of gravity (COG), etc.); moment based features (horizontal and vertical projection images, kurtosis measures (horizontal and vertical)); envelope based (extracting the lower and upper envelopes); and wavelet based features
- Results: 90% under genuine, 98% under random forgery and 70-80% under skilled forgery case

Ismail and Gad (2000)
- Mode of verification: Fuzzy concepts
- Database: 220 genuine and 110 forged samples
- Feature extraction: Central line features, corner line features, central circle features, corner curve features, critical points features
- Results: 95% recognition rate and 98% verification rate

Fang, Leung, Tang, Tse, Kwok, and Wong (2003)
- Mode of verification: Nonlinear dynamic time warping applied to horizontal and vertical projections; elastic bunch graph matching applied to individual stroke segments
- Database: 1320 genuine signatures from 55 authors and 1320 forgeries from 12 authors
- Feature extraction: Horizontal and vertical projections for the nonlinear dynamic time warping method; skeletonized image with approximation of the skeleton by short lines for the elastic bunch graph matching algorithm
- Results: Best result of 18.1% average error rate (average of FAR and FRR) for the first method; average error rate was 23.4% in case of the second method
used. These signature instances include either the genuine or both genuine and forged signatures, depending on the method. All the methods extract features in a manner such that the signature cannot be constructed back from them in the reverse order, but have sufficient data to capture the features required for verification. These features are stored in various formats depending upon the system. It could be either in the form of feature values, weights of the neural network, conditional probability values of a belief (Bayesian) network, or as a covariance matrix (Fang et al., 2003). Later, when a sample signature is to be tested, its feature matrix/vector is calculated and is passed into the verification sub-module of the system, which identifies the signature to be either authentic or unauthentic. So, unless the system is trained properly, chances, though small, are that it may recognize an authentic signature to be unauthentic (false rejection), and in certain other cases, recognize the unauthentic signature to be authentic (false acceptance). So, one has to be extremely cautious at this stage of the system. There are various methods for off-line verification systems that are in use today. Some of them include:
• Statistical methods (Ammar, Fukumura, & Yoshida, 1988; Ismail & Gad, 2000)
• Neural network based approaches (Bajaj & Chaudhury, 1997; Baltzakis & Papamarkos, 2001; Drouhard, Sabourin, & Godbout, 1996; Huang & Yan, 1997)
• Genetic algorithms for calculating weights (Ramesh & Murty, 1999)
• Hidden Markov Model (HMM) based methods (Yacoubi, Bortolozzi, Justino, & Sabourin, 2000)
• Bayesian networks (Xiao & Leedham, 2002)
• Nonlinear dynamic time warping (in the spatial domain) and elastic bunch graph matching (Fang et al., 2003)

The problem of signature verification is that of dividing a space into two different sets of genuine and forged signatures. Both online and off-line approaches use features to do this, but the problem with this approach is that even two signatures by the same person may not be the same. The feature set must thus have sufficient interpersonal variability so that we can classify the input signature as genuine or forgery. In addition, it must also have a low intrapersonal variability so that an authentic signature is accepted. Solving this problem using fuzzy sets and neural networks has also been tried. Another problem is that an increase in the dimensionality, i.e. using more features, does not necessarily minimize the error rate. Thus, one has to be cautious while choosing the appropriate/optimal feature set.

FUTURE TRENDS

Most of the presently used off-line verification methods claim a success rate of more than 95% for random forgeries and above 90% in case of skilled forgeries. Although a 95% verification rate seems high enough, it can be noticed that, even if the accuracy rate is as high as 99%, when we scale it to the size of a million, even a 1% error rate turns out to be a significant number. It is therefore necessary to increase this accuracy rate as much as possible.

CONCLUSION

Performance of signature verification systems is measured from their false rejection rate (FRR or type I error) (Huang & Yan, 1997) and false acceptance rate (FAR or type II error) (Huang & Yan, 1997) curves. Average error rate, which is the mean of the FRR and FAR values, is also used at times. Instead of using the FAR and FRR values, many researchers quote the 100 - average error rate values as the performance result. Values for various approaches have been mentioned in the table above. It is very difficult to compare the values of these error rates, as there is no standard database either for off-line or for online signature verification methods.
Although online verification methods are gaining popularity day by day because of higher accuracy rates, off-line signature verification methods are still considered to be indispensable, since they are easy to use and have a wide range of applicability. Efforts must thus be made to improve their efficiency to as close as that of online verification methods.

REFERENCES

Ammar, M., Fukumura, T., & Yoshida, Y. (1986). A new effective approach for off-line verification of signature by using pressure features. Proceedings 8th International Conference on Pattern Recognition, ICPR'86 (pp. 566-569), Paris.

Ammar, M., Fukumura, T., & Yoshida, Y. (1988). Off-line preprocessing and verification of signatures. International Journal of Pattern Recognition and Artificial Intelligence, 2(4), 589-602.

Ammar, M., Fukumura, T., & Yoshida, Y. (1990). Structural description and classification of signature images. Pattern Recognition, 23(7), 697-710.

Bajaj, R., & Chaudhury, S. (1997). Signature verification using multiple neural classifiers. Pattern Recognition, 30(1), 1-7.

Baltzakis, H., & Papamarkos, N. (2001). A new signature verification technique based on a two-stage neural network classifier. Engineering Applications of Artificial Intelligence, 14, 95-103.

Drouhard, J. P., Sabourin, R., & Godbout, M. (1996). A neural network approach to off-line signature verification using directional pdf. Pattern Recognition, 29(3), 415-424.

Fang, B., Leung, C. H., Tang, Y. Y., Tse, K. W., Kwok, P. C. K., & Wong, Y. K. (2003). Off-line signature verification by tracking of feature and stroke positions. Pattern Recognition, 36, 91-101.

Huang, K., & Yan, H. (1997). Off-line signature verification based on geometric feature extraction and neural network classification. Pattern Recognition, 30(1), 9-17.

Ismail, M. A., & Gad, S. (2000). Off-line Arabic signature recognition and verification. Pattern Recognition, 33, 1727-1740.
KEY TERMS

Area of Black Pixels: It is the total number of black pixels in the binary image.

Biometric Authentication: The identification of individuals using their physiological and behavioral characteristics.

Centre of Gravity: The centre of gravity of the image is calculated as per the following equation:

$\bar{x} = \frac{\sum_{x=1}^{m} \sum_{y=1}^{n} x \cdot S(x,y)}{\sum_{x=1}^{m} \sum_{y=1}^{n} S(x,y)}, \qquad \bar{y} = \frac{\sum_{x=1}^{m} \sum_{y=1}^{n} y \cdot S(x,y)}{\sum_{x=1}^{m} \sum_{y=1}^{n} S(x,y)}$

where m = width of the image and n = height of the image.

Moment Measures: Skewness and kurtosis are moment measures calculated using the horizontal and vertical projections and co-ordinates of centre of gravity of the signature.

Slant: Either it is defined as the angle at which the image has maximum horizontal projection value on rotation, or it is calculated using the total number of positive, negative, horizontally or vertically slanted pixels.
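The Centre of Gravity computation can be sketched in a few lines; this illustration uses 0-based rather than 1-based pixel indices, and the helper name is an assumption, not from the article:

```python
def centre_of_gravity(S):
    """Centre of gravity of a binary image S (1 = signature pixel),
    indexed S[y][x]: the black-pixel-weighted mean of the coordinates."""
    total = sum(sum(row) for row in S)        # total number of black pixels
    cx = sum(x * S[y][x]
             for y in range(len(S)) for x in range(len(S[0]))) / total
    cy = sum(y * S[y][x]
             for y in range(len(S)) for x in range(len(S[0]))) / total
    return cx, cy
```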
Since its origins in the 1970s, research and development into database systems has evolved from simple file storage and processing systems to complex relational database systems, which have provided a remarkable contribution to current trends and environments. Databases are now such an integral part of day-to-day life that often people are unaware of their use. For example, purchasing goods from the local supermarket is likely to involve access to a database. In order to retrieve the price of an item, the application program will access the product database. A database is a collection of related data, and the database management system (DBMS) is software that manages and controls access to the database (Elmasri & Navathe, 2004).

BACKGROUND

Data Warehouse

A data warehouse is a specialized type of database. More specifically, a data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site (Silberschatz, Korth, & Sudarshan, 2002, p. 843). Chaudhuri and Dayal (1997) consider that a data warehouse should be maintained separately from the organization's operational database, since the functional and performance requirements of online analytical processing (OLAP) supported by data warehouses are quite different from those of the online transaction processing (OLTP) traditionally supported by the operational database.
Two reasons why traditional OLTP is not suitable for data warehousing are presented: (a) Given that operational databases are finely tuned to support known OLTP workloads, trying to execute complex OLAP queries against the operational databases would result in unacceptable performance. Furthermore, decision support requires data that might be missing from the operational databases; for instance, understanding trends or making predictions requires historical data, whereas operational databases store only current data. (b) Decision support usually requires consolidating data from many heterogeneous sources: these might include external sources such as stock market feeds, in addition to several operational databases. The different sources might contain data of varying quality, or use inconsistent representations, codes and formats, which have to be reconciled.

Traditional Online Transaction Processing (OLTP)

Traditional relational databases have been used primarily to support OLTP systems. The transactions in an OLTP system usually retrieve and update a small number of records, accessed typically on their primary keys. Operational databases tend to be hundreds of megabytes to gigabytes in size and store only current data (Ramakrishnan & Gehrke, 2003).
Figure 1 shows a simple overview of the OLTP system. The operational database is managed by a conventional relational DBMS. OLTP is designed for day-to-day operations. It provides a real-time response. Examples include Internet banking and online shopping.

Figure 1. The Online Transaction Processing (OLTP) system
Online Analytical Processing Systems
Online Analytical Processing (OLAP)

OLAP is a term that describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis (Ramakrishnan & Gehrke, 2003).
OLAP supports queries and data analysis on aggregated databases built in data warehouses. It is a system for collecting, managing, processing and presenting multidimensional data for analysis and management purposes (Figure 2). There are two main implementation methods to support OLAP applications: relational OLAP (ROLAP) and multidimensional OLAP (MOLAP).

ROLAP

Relational online analytical processing (ROLAP) provides OLAP functionality by using relational databases and familiar relational query tools to store and analyse multidimensional data (Ramakrishnan & Gehrke, 2003). Entity Relationship diagrams and normalization techniques are popularly used for database design in OLTP environments. However, the database designs recommended by ER diagrams are inappropriate for decision support systems, where efficiency in querying and in loading data (including incremental loads) is crucial. A special schema known as a star schema is used in an OLAP environment for performance reasons (Martyn, 2004). This star schema usually consists of a single fact table and a dimension table for each dimension (Figure 3).

MOLAP

Multidimensional online analytical processing (MOLAP) extends OLAP functionality to multidimensional database management systems (MDBMSs). A MDBMS uses special proprietary techniques to store data in matrix-like n-dimensional arrays (Ramakrishnan & Gehrke, 2003).
The multi-dimensional data cube is implemented by the arrays, with the dimensions forming the axes of the cube (Sarawagi, 1997). Therefore, only the data value corresponding to a data cell is stored, as a direct mapping. MOLAP servers have excellent indexing properties, since looking for a cell involves simple array lookups rather than associative lookups in tables. But unfortunately it provides poor storage utilization, especially when the data set is sparse.
In a multi-dimensional data model, the focal point is a collection of numeric measures. Each measure depends on a set of dimensions. For instance, the measure attribute is amt as shown in Figure 4. Sales information is arranged in a three-dimensional array of amt. Figure 4 shows only the values for a single L# value, where L# = L001, which is presented as a slice orthogonal to the L# axis.

Cube-By Operator

In decision support database systems, aggregation is a commonly used operation. As previously mentioned, current SQL can be very inefficient. Thus, to effectively support decision support queries in an OLAP environment, a new operator, Cube-by, was proposed by Gray, Bosworth, Layman, and Pirahesh (1996). It is an extension of the relational operator Group-by. The Cube-by operator computes the Group-bys corresponding to all possible combinations of attributes in the Cube-by clause.
In order to see how a data cube is formed, an example is provided in Figure 5. It shows an example of data cube formation through executing the cube statement at the top left of the figure. Figure 5 presents two ways of presenting the aggregated data: (a) a data cube, and (b) a 3D data cube in table form.
Figure 2. Online Analytical Processing (OLAP) system
Online Analytical Processing Systems
Figure 3. Star schema showing that location, product, date, and sales are represented as relations
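A star schema like the one in Figure 3 can be sketched with Python's built-in sqlite3 module. The table and column names below are illustrative assumptions (only the amt measure and L#-style location keys appear in the text), not details taken from the figure:

```python
import sqlite3

# Illustrative star schema: one fact table (sales) with foreign keys
# into three dimension tables. All names here are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE location (loc_id TEXT PRIMARY KEY, city TEXT);
CREATE TABLE product  (prod_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE date_dim (date_id TEXT PRIMARY KEY, year INTEGER);
CREATE TABLE sales    (loc_id TEXT, prod_id TEXT, date_id TEXT, amt REAL);
""")
con.executemany("INSERT INTO location VALUES (?, ?)",
                [("L001", "Melbourne"), ("L002", "Sydney")])
con.executemany("INSERT INTO product VALUES (?, ?)",
                [("P1", "widget"), ("P2", "gadget")])
con.executemany("INSERT INTO date_dim VALUES (?, ?)",
                [("D1", 2004), ("D2", 2005)])
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [("L001", "P1", "D1", 10.0), ("L001", "P2", "D1", 5.0),
                 ("L002", "P1", "D2", 7.5)])

# A typical ROLAP-style query: total sales amount per city.
rows = con.execute("""
    SELECT l.city, SUM(s.amt)
    FROM sales s JOIN location l ON s.loc_id = l.loc_id
    GROUP BY l.city ORDER BY l.city
""").fetchall()
print(rows)  # [('Melbourne', 15.0), ('Sydney', 7.5)]
```

Because all descriptive attributes live in small dimension tables, analytical queries against the single large fact table reduce to joins plus GROUP BY, which is the query shape ROLAP tools generate.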
Figure 5. An example of data cube formation through executing the cube statement at the top left of the figure
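The Cube-by computation that Figure 5 illustrates can be sketched in plain Python. The relation, attribute names, and the "ALL" placeholder below are illustrative assumptions, with SUM as the aggregate:

```python
from itertools import combinations

# Toy relation with three dimension attributes and one measure (amt).
relation = [
    {"part": "p1", "supplier": "s1", "customer": "c1", "amt": 10},
    {"part": "p1", "supplier": "s2", "customer": "c1", "amt": 20},
    {"part": "p2", "supplier": "s1", "customer": "c2", "amt": 5},
]
dims = ("part", "supplier", "customer")

def cube(relation, dims):
    """Compute SUM(amt) for every Group-by in the Cube-by lattice.

    'ALL' marks a dimension that has been aggregated away; the cells
    come from all 2 ** len(dims) group-bys (here 8)."""
    cells = {}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            for row in relation:
                key = tuple(row[d] if d in subset else "ALL" for d in dims)
                cells[key] = cells.get(key, 0) + row["amt"]
    return cells

c = cube(relation, dims)
print(c[("ALL", "ALL", "ALL")])  # grand total: 35
print(c[("p1", "ALL", "ALL")])   # all sales of part p1: 30
print(c[("p1", "s2", "c1")])     # finest-grained group-by: 20
```

With three Cube-by attributes the operator evaluates 2^3 = 8 group-bys, which is why its cost grows quickly with the number of attributes.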
stored in the memory whenever the user requests it.
The difference between the two processing levels is that the front-end has the pre-computed aggregated data in memory, ready for the user to use or analyze at any given time. Back-end processing, on the other hand, computes the raw data directly whenever there is a request from the user. This is why front-end processing is considered to present the aggregated data faster, or more efficiently, than back-end processing. However, it is important to note that back-end processing and front-end processing are basically one whole process; the breakdown into two processing levels is made because the problems can then be seen clearly in each. In the next section, the problems associated with each area, and the related work on improving them, will be considered.

Back-End Processing

Back-end processing basically involves dealing with the raw data, which is stored in either tables (ROLAP) or arrays (MOLAP), as shown in Figure 7. The user queries the raw data for decision-making purposes. The raw data is then processed and computed into aggregated data, which is presented to the user for analysis. Generally, the basic stage, extracting, is followed by two sub-stages, indexing and partitioning, in back-end processing.

Figure 7. Stages in back-end processing

Extracting is the process of querying the raw data either from tables (ROLAP) or arrays (MOLAP) and computing it. The process of extracting is usually time-consuming: firstly, the database (data warehouse) size is extremely large, and secondly, the computation time is equally high. However, the user is only interested in a fast response time for delivering the resulting data. Another consideration is that the analyst or manager using the data warehouse may have time constraints. The two sub-stages, or fundamental methods of handling the data, (a) indexing and (b) partitioning, serve to provide the user with the resulting data in a reasonable time frame.
Indexing has existed in databases for many decades. Its access structures have provided faster access to the base data. For retrieval efficiency, index structures would typically be defined especially in data warehouses or ROLAP, where the fact table is very large (Datta, VanderMeer, & Ramamritham, 2002). O'Neil and Quass (1997) have suggested a number of important indexing schemes for data warehousing, including the bitmap index, value-list index, projection index, and data index. The data index is similar to the projection index, but it exploits a positional indexing strategy (Datta, VanderMeer, & Ramamritham, 2002). Interestingly, MOLAP servers have better indexing properties than ROLAP servers, since they look up a cell using simple array lookups rather than associative lookups in tables.
Partitioning of raw data is more complex and challenging in data warehousing than in relational and object databases. This is due to the several possible ways of partitioning a star schema (Datta, VanderMeer, & Ramamritham, 2002). The data fragmentation concept in the context of distributed databases aims to reduce query execution time and facilitate the parallel execution of queries (Bellatreche, Karlapalem, Mohania, & Schneider, 2000). In a data warehouse or ROLAP, either the dimension tables or the fact table, or even both, can be fragmented. Bellatreche et al. (2000) have proposed a methodology for applying fragmentation techniques in a data warehouse star schema to reduce the total query execution cost.
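The bitmap index mentioned above can be illustrated with a toy sketch. The column values are invented for illustration, and production systems store compressed bit vectors rather than plain integers:

```python
# Toy bitmap index over fact-table columns. Each distinct value gets a
# bit vector -- here a Python int -- with bit i set when row i holds
# that value, so selections become bitwise AND/OR operations instead of
# table scans.
region = ["east", "west", "east", "north", "west", "east"]
status = ["paid", "paid", "open", "paid", "open", "paid"]

def bitmap_index(column):
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index

by_region = bitmap_index(region)
by_status = bitmap_index(status)

# Row ids where region = 'east' AND status = 'paid':
hits = by_region["east"] & by_status["paid"]
matches = [i for i in range(len(region)) if (hits >> i) & 1]
print(matches)  # [0, 5]
```

A query restricting several dimension columns simply intersects the corresponding bit vectors, touching the large fact table only for the surviving row ids.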
Front-End Processing
processing, as both of them basically query the raw data. However, in this case, extracting concentrates on how the fundamental methods can help in handling the raw data in order to provide efficient retrieval. Constructing in the front-end processing concentrates on the Cube-by operator, which involves the computation of raw data.
Storing is the process of putting the aggregated data into either the n-dimensional table of rows or the n-dimensional arrays, or data cube. There are two parts within the storing process: one part is to store temporary raw data in memory for execution purposes, and the other is to store the pre-computed aggregated data. Hence, there are two problems related to the storage: (a) insufficient memory space, due to the loaded raw data in addition to the incremental loads; and (b) poor storage utilization, as the array may not fit into memory, especially when the data set is sparse.
Querying is the process of extracting useful information from the pre-computed data cube or n-dimensional table for decision makers. It is important to note that querying is also part of the constructing process, as querying also makes use of the Cube-by operator, which involves the computation of raw data. ROLAP is able to support ad hoc requests and allows unlimited access to dimensions, unlike MOLAP, which only allows limited access to predefined dimensions. Despite this, certain queries might be difficult to fulfill for decision makers, and the query execution time must be reduced when there is an n-dimensional query, as the time factor is important to decision makers.
To conclude, the three stages and their associated problems in the OLAP environment have been outlined. First, Cube-by is an expensive approach, especially when the number of Cube-by attributes and the database size are large. Second, storage has insufficient memory space, due to the loaded data in addition to the incremental loads. Third, storage is not properly utilized: the array may not fit into memory, especially when the data set is sparse. Fourth, certain queries might be difficult to fulfill for decision makers. Fifth, the query execution time has to be reduced when there is an n-dimensional query, as the time factor is important to decision makers. However, it is important to consider that there is scope for other possible problems to be identified.

FUTURE TRENDS

Problems have been identified in each of the three stages, which have generated considerable attention from researchers to find solutions. Several researchers have proposed a number of algorithms to solve these Cube-by problems. Examples include:

Constructing

The Cube-by operator is an expensive approach, especially when the number of Cube-by attributes and the database size are large.

Fast Computation Algorithms

There are algorithms aimed at fast computation of large sparse data cubes (Ross & Srivastava, 1997; Beyer & Ramakrishnan, 1999). Ross and Srivastava (1997) have taken into consideration the fact that real data is frequently sparse; they partitioned large relations into small fragments so that there was always enough memory to fit the fragments of a large relation. Beyer and Ramakrishnan (1999), in contrast, proposed the bottom-up method to help reduce the penalty associated with the sorting of many large views.

Parallel Processing System

Condensed Data Cube

Wang, Feng, Lu, and Yu (2002) have proposed a new concept called a condensed data cube. This new approach reduces the size of the data cube and hence its computation time. They make use of single base tuple compression to generate a condensed cube, so it is smaller in size.
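A back-of-the-envelope sketch of the sparsity problem these algorithms address, with all numbers invented for illustration:

```python
# A dense MOLAP-style array reserves a slot for every cell of the cube,
# while a dictionary materializes only the cells that actually occur.
DIM = 100                  # assume 100 distinct values per dimension
dense_cells = DIM ** 3     # slots a dense 3-d array would reserve
sparse_cube = {            # only the non-empty aggregate cells
    (3, 17, 42): 10.0,
    (3, 17, 43): 2.5,
    (99, 0, 1): 7.0,
}
print(dense_cells)         # 1000000
print(len(sparse_cube))    # 3
# Lookup still works; absent cells simply aggregate to zero.
print(sparse_cube.get((0, 0, 0), 0.0))  # 0.0
```

Even this toy cube is 99.9997% empty, which is why fast sparse-cube algorithms avoid materializing the full array.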
CONCLUSION

In this overview, OLAP systems in general have been considered, followed by the processing levels in OLAP systems and, lastly, related and future work on OLAP systems. A number of problems and solutions in the OLAP environment have been presented. However, consideration needs to be given to the possibility that other problems may be identified, which in turn will present new challenges for researchers to address.

REFERENCES

Bellatreche, L., Karlapalem, K., Mohania, M., & Schneider, M. (2000, September). What can partitioning do for your data warehouses and data marts? International IDEAS Conference (pp. 437-445), Yokohama, Japan.

Beyer, K.S., & Ramakrishnan, R. (1999, June). Bottom-up computation of sparse and iceberg cubes. International ACM SIGMOD Conference (pp. 359-370), Philadelphia, PA.

Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26, 65-74.

Datta, A., VanderMeer, D., & Ramamritham, K. (2002). Parallel star join + DataIndexes: Efficient query processing in data warehouses and OLAP. IEEE Transactions on Knowledge & Data Engineering, 14(6), 1299-1316.

Dehne, F., Eavis, T., Hambrusch, S., & Rau-Chaplin, A. (2002). Parallelizing the data cube. International Journal of Distributed & Parallel Databases, 11, 181-201.

Ng, R.T., Wagner, A., & Yin, Y. (2001, May). Iceberg-cube computation with PC clusters. International ACM SIGMOD Conference (pp. 25-36), Santa Barbara, California.

O'Neil, P., & Graefe, G. (1995). Multi-table joins through bit-mapped join indices. SIGMOD Record, 24(3), 8-11.

Ramakrishnan, R., & Gehrke, J. (2003). Database management systems. NY: McGraw-Hill.

Ross, K.A., & Srivastava, D. (1997, August). Fast computation of sparse datacubes. International VLDB Conference (pp. 116-125), Athens, Greece.

Sarawagi, S. (1997). Indexing OLAP data. IEEE Data Engineering Bulletin, 20(1), 36-43.

Silberschatz, A., Korth, H., & Sudarshan, S. (2002). Database system concepts. NY: McGraw-Hill.

Tan, R.B.N., Taniar, D., & Lu, G.J. (2004). A taxonomy for data cube queries. International Journal of Computers and Their Applications, 11(3), 171-185.

Taniar, D., & Tan, R.B.N. (2002, May). Parallel processing of multi-join expansion-aggregate data cube query in high performance database systems. International I-SPAN Conference (pp. 51-58), Manila, Philippines.

Wang, W., Feng, J.L., Lu, H.J., & Yu, J.X. (2002, February). Condensed cube: An effective approach to reducing data cube size. International Data Engineering Conference (pp. 155-165), San Jose, California.

Yu, J.X., & Lu, H.J. (2001, April). Multi-cube computation. International DASFAA Conference (pp. 126-133), Hong Kong, China.
KEY TERMS

Multidimensional OLAP (MOLAP): Extends OLAP functionality to multidimensional database management systems (MDBMSs).

Online Analytical Processing (OLAP): A term used to describe the analysis of complex data from a data warehouse (Elmasri & Navathe, 2004, p. 900).

Relational OLAP (ROLAP): Provides OLAP functionality by using relational databases and familiar relational query tools to store and analyse multidimensional data.
Nilesh Mishra
Indian Institute of Technology, India
Mayank Vatsa
Indian Institute of Technology, India
Richa Singh
Indian Institute of Technology, India
P. Gupta
Indian Institute of Technology, India
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Online Signature Recognition
case of off-line, the forger has to copy only the shape of the signature (Jain & Griess, 2000). In the case of online, on the other hand, the hardware used captures the dynamic features of the signature as well. It is extremely difficult to deceive the device in the case of dynamic features, since the forger not only has to copy the characteristics of the person whose signature is to be forged, but at the same time has to hide his own inherent style of writing the signature. There are four types of forgeries: random, simple, skilled and traced forgeries (Ammar, Fukumura, & Yoshida, 1988; Drouhard, Sabourin, & Godbout, 1996). In the case of online signatures, the system shows almost 100% accuracy for the first two classes of forgeries and 99% for the latter. But, again, a forger can also use a compromised signature-capturing device to repeat a previously recorded signature signal. In such extreme cases, even online verification methods may suffer from repetition attacks when the signature-capturing device is not physically secure.

Data Acquisition

Data acquisition (of the dynamic features) in online verification methods is generally carried out using special devices called transducers or digitizers (Tappert, Suen, & Wakahara, 1990; Wessels & Omlin, 2000), in contrast to the use of high resolution scanners in the case of off-line. The commonly used instruments include electronic tablets (which consist of a grid to capture the x and y coordinates of the pen tip movements), pressure sensitive tablets, and digitizers involving technologies such as acoustic sensing in air medium, surface acoustic waves, triangularization of reflected laser beams, and optical sensing of a light pen, used to extract information about the number of strokes, velocity of signing, direction of writing, pen tilt, the pressure with which the signature is written, etc.
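From the (x, y, t) samples such digitizers deliver, dynamic features like signing speed and acceleration can be approximated by finite differences; the sample points and time units below are invented for illustration:

```python
import math

# Toy sketch: deriving speed and acceleration features from digitizer
# samples. Each sample is (x, y, t); values are made up.
samples = [(0.0, 0.0, 0.0), (3.0, 4.0, 1.0), (9.0, 12.0, 2.0)]

def speeds(samples):
    """Pen-tip speed over each consecutive pair of samples."""
    out = []
    for (x0, y0, t0), (x1, y1, t1) in zip(samples, samples[1:]):
        out.append(math.hypot(x1 - x0, y1 - y0) / (t1 - t0))
    return out

v = speeds(samples)
# Acceleration as the change in speed per unit time between segments.
a = [(v1 - v0) / (samples[i + 2][2] - samples[i + 1][2])
     for i, (v0, v1) in enumerate(zip(v, v[1:]))]
print(v)  # [5.0, 10.0]
print(a)  # [5.0]
```

Real systems would also record pen tilt and pressure channels and smooth the differences, but the principle — dynamic features computed from sampled trajectories — is the same.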
Preprocessing
Absolute and relative speed between two critical points (Jain, Griess, & Connell, 2002)
Acceleration (ua(t), uax(t), uay(t)): Acceleration can be derived from velocity or position coordinates. It can also be computed using an accelerometric pen (Jain & Griess, 2000; Plamondon & Lorette, 1989).
Parameters: A number of parameters, such as the number of peaks, starting direction of the signature, number of pen lifts, means and standard deviations, number of maxima and minima for each segment, proportions, signature path length, path tangent angles, etc., are also calculated, apart from the above-mentioned functions, to increase the dimensionality (Gupta, 1997; Plamondon & Lorette, 1989; Wessels & Omlin, 2000). Moreover, all these features can be both global and local in nature.

Verification and Learning

For online, examples of comparison methods include the use of:

Corner Point and Point to Point Matching algorithms (Zhang, Pratikakis, Cornelis, & Nyssen, 2000)
Similarity measurement on the logarithmic spectrum (Lee, Wu, & Jou, 1998)
Extreme Points Warping (EPW) (Feng & Wah, 2003)
String Matching and Common threshold (Jain, Griess, & Connell, 2002)
Split and Merging (Lee, Wu, & Jou, 1997)
Histogram classifier with global and local likeliness coefficients (Plamondon & Lorette, 1989)
Clustering analysis (Lorette, 1984)
Dynamic programming based methods; matching with Mahalanobis pseudo-distance (Sato & Kogure, 1982)
Hidden Markov Model based methods (McCabe, 2000; Kosmala & Rigoll, 1998)

Table 1 gives a summary of some prominent works in the online signature verification field.

Performance Evaluation

Performance evaluation of the output (which is to accept or reject the signature) is done using the false rejection rate (FRR, or type I error) and the false acceptance rate (FAR, or type II error) (Huang & Yan, 1997). The error rates of different approaches have been included in the comparison table (Table 1). The equal error rate (EER), which is calculated using FRR and FAR, is also used for measuring the accuracy of the systems.

GENERAL PROBLEMS

The feature set in online has to be chosen very carefully, since it must have sufficient interpersonal variability so that the input signature can be classified as genuine or forgery. In addition, it must also have a low intrapersonal variability so that an authentic signature is accepted. Therefore, one has to be extremely cautious while choosing the feature set, as an increase in dimensionality does not necessarily mean an increase in the efficiency of a system.

FUTURE TRENDS

Biometrics is gradually replacing conventional password and ID based devices, since it is both more convenient and safer than the earlier methods. Today, it is not difficult at all to come across a fingerprint scanner or an online signature pad. Nevertheless, a lot of research still needs to be done to make the system infallible, because even an accuracy rate of 99% can cause failure of the system when scaled to the size of a million.
So, it will not be strange if, in the near future, we have ATMs granting access only after face recognition, a fingerprint scan and verification of the signature via embedded devices, since a multimodal system will have a lesser chance of failure than a system using a single biometric or a password/ID based device. Currently, online and off-line signature verification systems are two disjoint approaches. Efforts must also be made to integrate the two approaches, enabling us to exploit the higher accuracy of the online verification method and the greater applicability of the off-line method.

CONCLUSION

Most of the online methods claim about 99% accuracy for signature verification. These systems are gaining popularity day by day, and a number of products are currently available in the market. However, online verification is facing stiff competition from fingerprint verification systems, which are both more portable and more accurate. In addition, a balance between FAR and FRR has to be maintained. Theoretically, FAR and FRR are inversely related to each other; that is, if we keep tighter thresholds to
Table 1. Summary of some prominent online papers

Author: Wu, Lee and Jou (1997)
Mode of verification: Split and merging
Database: 200 genuine and 246 forged signatures from 27 people
Feature extraction: Coordinates to represent the signature, and the velocity
Results (error rates): 86.5% accuracy rate for genuine and 97.2% for forged

Author: Wu, Lee and Jou (1998)
Mode of verification: Similarity measurement on the logarithmic spectrum
Database: 10 signs each; 560 genuine and 650 forged for testing
Feature extraction: Features based on coefficients of the logarithmic spectrum
Results (error rates): FRR = 1.4% and FAR = 2.8%

Author: Zhang, Pratikakis, Cornelis and Nyssen (2000)
Mode of verification: Corner point matching algorithm and point to point matching algorithm
Database: 188 signatures from 19 people
Feature extraction: Corner points extracted based on velocity information
Results (error rates): 0.1% mismatch in segments for the corner point algorithm and 0.4% mismatch for the point to point matching algorithm

Author: Jain, Griess and Connell (2002)
Mode of verification: String matching and common threshold
Database: 1232 signatures of 102 individuals
Feature extraction: Number of strokes, co-ordinate distance between two points, angle with respect to x and y axis, curvature, distance from centre of gravity, grey value in 9x9 neighborhood, and velocity features
Results (error rates): Type I: 2.8%; Type II: 1.6%

Author: Feng and Wah (2003)
Mode of verification: Extreme Points Warping (both Euclidean distance and correlation coefficients used)
Database: 25 users contributed 30 genuine and 10 forged signatures
Feature extraction: x and y trajectories used, apart from torque and center of mass
Results (error rates): EER (Euclidean) = 25.4%; EER (correlation) = 27.7%

Gupta, J., & McCabe, A. (1997). A review of dynamic handwritten signature verification. Technical article, James Cook University, Australia.

Huang, K., & Yan, H. (1997). Off-line signature verification based on geometric feature extraction and neural network classification. Pattern Recognition, 30(1), 9-17.

Ismail, M.A., & Gad, S. (2000). Off-line Arabic signature recognition and verification. Pattern Recognition, 33, 1727-1740.

Jain, A.K., & Griess, F.D. (2000). Online signature verification. Project report, Department of Computer Science and Engineering, Michigan State University, USA.

Jain, A.K., Griess, F.D., & Connell, S.D. (2002). Online signature verification. Pattern Recognition, 35(12), 2963-2972.

Kosmala, A., & Rigoll, G. (1998). A systematic comparison between online and off-line methods for signature verification using hidden Markov models. 14th International Conference on Pattern Recognition (pp. 1755-1757).
decrease FAR, we inevitably increase FRR by rejecting some genuine signatures. Further research is thus still required to overcome this barrier.

REFERENCES

Ammar, M., Fukumura, T., & Yoshida, Y. (1988). Off-line preprocessing and verification of signatures. International Journal of Pattern Recognition and Artificial Intelligence, 2(4), 589-602.

Ammar, M., Fukumura, T., & Yoshida, Y. (1990). Structural description and classification of signature images. Pattern Recognition, 23(7), 697-710.

Bajaj, R., & Chaudhury, S. (1997). Signature verification using multiple neural classifiers. Pattern Recognition, 30(1), 1-7.

Baltzakis, H., & Papamarkos, N. (2001). A new signature verification technique based on a two-stage neural network classifier. Engineering Applications of Artificial Intelligence, 14, 95-103.

Drouhard, J.P., Sabourin, R., & Godbout, M. (1996). A neural network approach to off-line signature verification using directional PDF. Pattern Recognition, 29(3), 415-424.

Feng, H., & Wah, C.C. (2003). Online signature verification using a new extreme points warping technique. Pattern Recognition Letters, 24(16), 2943-2951.

Lee, S.Y., Wu, Q.Z., & Jou, I.C. (1997). Online signature verification based on split and merge matching mechanism. Pattern Recognition Letters, 18, 665-673.

Lee, S.Y., Wu, Q.Z., & Jou, I.C. (1998). Online signature verification based on logarithmic spectrum. Pattern Recognition, 31(12), 1865-1871.

Lorette, G. (1984). Online handwritten signature recognition based on data analysis and clustering. Proceedings of the 7th International Conference on Pattern Recognition, Vol. 2 (pp. 1284-1287).

McCabe, A. (2000). Hidden Markov modeling with simple directional features for effective and efficient handwriting verification. Proceedings of the Sixth Pacific Rim International Conference on Artificial Intelligence.

Plamondon, R., & Lorette, G. (1989). Automatic signature verification and writer identification: The state of the art. Pattern Recognition, 22(2), 107-131.

Ramesh, V.E., & Murty, M.N. (1999). Off-line signature verification using genetically optimized weighted features. Pattern Recognition, 32(7), 217-233.

Sato, Y., & Kogure, K. (1982). Online signature verification based on shape, motion and handwriting pressure. Proceedings of the 6th International Conference on Pattern Recognition, Vol. 2 (pp. 823-826).

Tappert, C.C., Suen, C.Y., & Wakahara, T. (1990). The state of the art in on-line handwriting recognition. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 12(8).

Wessels, T., & Omlin, C.W. (2000). A hybrid approach for signature verification. International Joint Conference on Neural Networks.

Xiao, X., & Leedham, G. (2002). Signature verification using a modified Bayesian network. Pattern Recognition, 35, 983-995.

Zhang, K., Pratikakis, I., Cornelis, J., & Nyssen, E. (2000). Using landmarks in establishing a point to point correspondence between signatures. Pattern Analysis and Applications, 3, 69-75.

KEY TERMS

Equal Error Rate: The error rate when the proportions of FAR and FRR are equal. The accuracy of the biometric system is inversely proportional to the value of the EER.

False Acceptance Rate: The rate of acceptance of a forged signature as a genuine signature by a handwritten signature verification system.

False Rejection Rate: The rate of rejection of a genuine signature as a forged signature by a handwritten signature verification system.

Global Features: Features extracted using the complete signature image or signal as a single entity.

Local Features: The geometric information of the signature, extracted in terms of features after dividing the signature image or signal into grids and sections.

Online Signature Recognition: The signature is captured through a digitizer or an instrumented pen, and both geometric and temporal information are recorded and later used in the recognition process.

Random Forgery: A forgery in which the forged signature has a totally different semantic meaning and overall shape in comparison to the genuine signature.

Simple Forgery: A forgery in which the semantics of the signature are the same as those of the genuine signature, but the overall shape differs to a great extent, since the forger has no idea of how the signature is done.

Skilled Forgery: In skilled forgery, the forger has prior knowledge of how the signature is written and practices it well before the final attempt at duplicating it.

Traced Forgery: For traced forgery, a signature instance or its photocopy is used as a reference and traced to produce the forgery.
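The error rates defined above can be illustrated with a toy set of matcher scores (all values hypothetical):

```python
# Hypothetical similarity scores (higher means more likely genuine);
# a signature is accepted when its score reaches the decision threshold.
genuine_scores = [0.9, 0.8, 0.75, 0.6]   # genuine signatures
forgery_scores = [0.7, 0.4, 0.3, 0.2]    # forgeries

def far_frr(threshold):
    """FAR = fraction of forgeries accepted; FRR = fraction of genuine
    signatures rejected, at the given threshold."""
    far = sum(s >= threshold for s in forgery_scores) / len(forgery_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr

# Tightening the threshold lowers FAR but raises FRR (the trade-off noted
# in the Conclusion); the EER is read off where the two rates cross.
print(far_frr(0.5))    # (0.25, 0.0)
print(far_frr(0.65))   # (0.25, 0.25)  <- equal error rate here
print(far_frr(0.85))   # (0.0, 0.75)
```

Sweeping the threshold over the score range traces out the full FAR/FRR curve, and the crossing point gives the EER used to compare systems.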
Christopher D. Barko
The University of North Carolina at Greensboro, USA
Organizational Data Mining
Relationship Management (CRM), electronic CRM (e-CRM), Executive Information Systems (EIS), digital dashboards, and enterprise information portals. ODM enables organizations to answer questions about the past (what has happened?), the present (what is happening?), and the future (what might happen?). Armed with this capability, organizations can generate valuable knowledge from their data, which, in turn, enhances enterprise decisions. This decision-enhancing technology offers many advantages in operations (faster product development, optimal supply chain management), marketing (higher profitability and increased customer loyalty through more effective marketing campaigns), finance (optimal portfolio management, financial analytics), and strategy implementation (Business Performance Management [BPM] and the Balanced Scorecard).
Over the last three decades, the organizational role of information technology has evolved from efficiently processing large amounts of batch transactions to providing information in support of tactical and strategic decision-making activities. This evolution, from automating expensive manual systems to providing strategic organizational value, led to the birth of Decision Support Systems (DSS), such as data warehousing and data mining. The organizational need to combine data from multiple stand-alone systems (e.g., financial, manufacturing, and distribution) grew as corporations began to acknowledge the power of combining these data sources for reporting. This spurred the growth of data warehousing, where multiple data sources were stored in a format that supported advanced data analysis.
The slowness in adoption of ODM techniques in the 1990s was partly due to organizational and cultural resistance. Business management has always been reluctant to trust something it does not fully understand. Until recently, most businesses were managed by instinct, intuition, and gut feeling. The transition over the past 20 years to a method of managing by the numbers is the result both of technology advances and of a generational shift in the business world, as younger managers arrive with information technology training and experience.

ODM Research

Given the scarcity of past research in ODM, along with its growing acceptance and importance in organizations, we conducted empirical research during the past several years that explored the utilization of ODM in organizations, along with the project implementation factors critical for success. We surveyed ODM professionals from multiple industries in both domestic and international organizations. Our initial research examined the ODM industry status and best practices, identified both technical and business issues related to ODM projects, and elaborated on how organizations are benefiting through enhanced enterprise decision making (Nemati & Barko, 2001). The results of our research suggest that ODM can improve the quality and accuracy of decisions for any organization that is willing to make the investment.
After exploring the status and utilization of ODM in organizations, we decided to focus subsequent research on how organizations implement ODM projects and on the factors critical to their success. To that end, we developed a new ODM Implementation Framework based on data, technology, organizations, and the Iron Triangle (Nemati & Barko, 2003). Our research demonstrated that selected organizational data mining project factors, when modeled under this new framework, have a significant influence on the successful implementation of ODM projects.
Given the promise of strengthening customer relationships and enhancing profits, CRM technology and associated research are gaining greater acceptance within organizations. However, findings from recent studies suggest that organizations generally fail to support their CRM efforts with complete data (Brohman et al., 2003). As further investigation, our latest research has focused on a specific ODM technology known as Electronic Customer Relationship Management (e-CRM) and its data integration role within organizations. Consequently, we developed a new e-CRM Value Framework to better examine the significance of integrating data from all customer touch-points, with the goal of improving customer relationships and creating additional value for the firm. Our research findings suggest that, despite the cost and complexity, data integration for e-CRM projects contributes to a better understanding of the customer and leads to a higher return on investment (ROI), a greater number of benefits, improved user satisfaction, and a higher probability of attaining a competitive advantage (Nemati, Barko & Moosa, 2003).

MAIN THRUST

Data mining is the process of discovering and interpreting previously unknown patterns in databases. It is a powerful technology that converts data into information and potentially actionable knowledge. However, there are many obstacles to the broad inclusion of data mining in organizations. Obtaining new knowledge in an organizational vacuum does not facilitate optimal decision making in a business setting. Simply incorporating data mining into the enterprise mix without considering non-technical issues is usually a recipe for failure. Businesses must give careful thought to weaving data mining into their organization's fabric. The unique organizational challenge of understanding and leveraging ODM to engineer actionable knowledge requires assimilating insights from a variety of organizational and technical fields and developing a comprehensive framework that supports an organization's quest for a sustainable competitive advantage. These multi-disciplinary fields include data mining, business strategy, organizational learning and behavior, organizational culture, organizational politics, business ethics and privacy, knowledge management, information sciences, and decision support systems. These fundamental elements of ODM can be categorized into three main groups: Artificial Intelligence (AI), Information Technology (IT), and Organizational Theory (OT). Our research and industry experience suggest that successfully leveraging ODM requires integrating insights from all three categories in an organizational setting typically characterized by complexity and uncertainty. This is the essence and uniqueness of ODM. Obtaining maximum value from ODM involves a cross-department team effort that includes statisticians/data miners, software engineers, business analysts, line-of-business managers, subject-matter experts, and upper management support.

Organizational Theory and ODM

ond, organizations create, organize, and process data to generate new knowledge through organizational learning. This knowledge creation activity enables the organization to develop new capabilities, design new products and services, enhance existing offerings, and improve organizational processes. Third, organizations search for and evaluate data in order to make decisions. This data is critical, since all organizational actions are initiated by decisions, and all decisions are commitments to actions, the consequences of which will, in turn, lead to the creation of new data. Adopting an OT methodology enables an enterprise to enhance the knowledge engineering and management process.
In another OT study, researchers and academic scholars have observed that there is no direct correlation between information technology (IT) investments and organizational performance. Research has confirmed that identical IT investments in two different companies may give a competitive advantage to one company but not the other. Therefore, a key factor for competitive advantage in an organization is not the IT investment but the effective utilization of information as it relates to organizational performance (Brynjolfsson & Hitt, 1996). This finding emphasizes the necessity of integrating OT practices with robust
information technology and artificial intelligence tech-
Organizations are concerned primarily with studying niques in successfully leveraging ODM.
how operating efficiencies and profitability can be
achieved through the effective management of custom- ODM Practices at Leading Companies
ers, suppliers, partners, and employees. To achieve these
goals, research in Organizational Theory (OT) suggests A 2002 Strategic Decision-Making study conducted by
that organizations use data in three vital knowledge cre- Hackett Best Practices determined that world-class
ation activities. This organizational knowledge creation companies have adopted ODM technologies at more
and management is a learned ability that only can be than twice the rate of average companies (Hoblitzell,
achieved via an organized and deliberate methodology. 2002). ODM technologies provide these world-class
This methodology is a foundation for successfully lever- organizations greater opportunities to understand their
aging ODM within the organization. The three knowl- business and make informed decisions. ODM also en-
edge creation activities (Choo, 1997) are: ables world-class organizations to leverage their inter-
nal resources more efficiently and more effectively
Sense Making: The ability to interpret and under- than their average counterparts, who have not fully
stand information about the environment and events embraced ODM.
happening both inside and outside the organization. Many of todays leading organizations credit their
Knowledge Making: The ability to create new success to the development of an integrated, enter-
knowledge by combining the expertise of members prise-level ODM system. As part of an effective CRM
to learn and innovate. strategy, customer retention is now widely viewed by
Decision Making: The ability to process and ana- organizations as a significant marketing strategy in
lyze information and knowledge in order to select creating a competitive advantage. Research suggests
and implement the appropriate course of action. that as little as a 5% increase in retention can mean as
much as a 95% boost in profit, and repeat customers
First, organizations use data to make sense of changes generate over twice as much gross income as new
and developments in the external environmentsa pro- customers (Winer, 2001). In addition, many business
cess called sense making. This is a vital activity wherein executives today have replaced their cost reduction
managers discern the most significant changes, interpret strategies with a customer retention strategyit costs
their meaning, and develop appropriate responses. Sec- approximately five to 10 times more to acquire new
customers than to retain established customers (Pan & Lee, 2003).

An excellent example of a successful CRM strategy is Harrah's Entertainment, which has saved over $20 million per year since implementing its Total Rewards CRM program. This ODM system has given Harrah's a better understanding of its customers and has enabled the company to create targeted marketing campaigns that almost doubled the profit per customer and delivered same-store sales growth of 14% after only the first year. In another notable case, Travelocity.com, an Internet-based travel agency, implemented an ODM system and improved total bookings and earnings by 100% in 2000. Gross profit margins improved 150%, and booker conversion rates rose 8.9%, the highest in the online travel services industry.

In another significant study, executives from 24 leading companies in customer-knowledge management, including FedEx, Frito-Lay, Harley-Davidson, Procter & Gamble, and 3M, all realized that in order to succeed, they must go beyond simply collecting customer data and must translate it into meaningful knowledge about existing and potential customers (Davenport, Harris & Kohli, 2001). This study revealed that several objectives were common to all of the leading companies, and these objectives can be facilitated by ODM. A few of these objectives are segmenting the customer base, prioritizing customers, understanding online customer behavior, engendering customer loyalty, and increasing cross-selling opportunities.

FUTURE TRENDS

The number of ODM projects is projected to grow more than 300% in the next decade (Linden, 1999). As the collection, organization, and storage of data rapidly increase, ODM will be the only means of extracting timely and relevant knowledge from large corporate databases. The growing mountains of business data, coupled with recent advances in Organizational Theory and technological innovations, provide organizations with a framework to effectively use their data to gain a competitive advantage. An organization's future success will depend largely on whether or not it adopts and leverages this ODM framework. ODM will continue to expand and mature as the corporate demand for one-to-one marketing, CRM, e-CRM, Web personalization, and related interactive media increases.

We believe that organizations are slowly moving from the Information Age to the Knowledge Age, where decision makers will leverage ODM for optimal business performance. Organizations are complex entities comprised of employees, managers, politics, culture, hierarchies, teams, processes, customers, partners, suppliers, and shareholders. The never-ending challenge is to successfully integrate data-mining technologies within organizations in order to enhance decision making with the objective of optimally allocating scarce enterprise resources. This is not an easy task, as many information technology professionals, consultants, and managers can attest. The media can oversimplify the effort, but successfully implementing ODM is not accomplished without political battles, project management struggles, cultural shocks, business process reengineering, personnel changes, short-term financial and budgetary shortages, and overall disarray.

Recent ODM research has revealed a number of industry predictions that are expected to be key ODM issues in the coming years. Nemati and Barko (2001) found that almost 80% of survey respondents expect Web farming/mining and consumer privacy to be significant issues. We also foresee the development of widely accepted standards for ODM processes and techniques to be an influential factor for knowledge seekers in the 21st century. One attempt at ODM standardization is the Cross Industry Standard Process for Data Mining (CRISP-DM) project, which developed an industry- and tool-neutral data-mining process model for solving business problems. Another attempt at industry standardization is the work of the Data Mining Group in developing and advocating the Predictive Model Markup Language (PMML), an XML-based language that provides a quick and easy way for companies to define predictive models and share models between compliant vendors' applications. Last, Microsoft's OLE DB for Data Mining is a further attempt at industry standardization and integration. This specification offers a common interface for data mining that will enable developers to embed data-mining capabilities into their existing applications. One only has to consider Microsoft's industry-wide dominance of the office productivity (Microsoft Office), software development (Visual Basic and .NET), and database (SQL Server) markets to envision the potential impact this could have on the ODM market and its future direction.

CONCLUSION

Although many improvements have materialized over the last decade, the knowledge gap in many organizations is still prevalent. Industry professionals have suggested that many corporations could maintain current revenues at half the current costs if they optimized their use of corporate data (Saarenvirta, 2001). Whether this finding is true or not, it sheds light on an important issue. Leading corporations in the next decade will adopt and weave these ODM technologies into the fabric of their organizations at the strategic, tactical, and
operational levels. Those enterprises that see the strategic value of evolving into knowledge organizations by leveraging ODM will benefit directly in the form of improved profitability, increased efficiency, and sustainable competitive advantage. Once the first organization within an industry realizes a competitive advantage through ODM, it is only a matter of time before one of three events transpires: its industry competitors adopt ODM, change industries, or vanish. By adopting ODM, an organization's managers and employees are able to act sooner rather than later, anticipate rather than react, know rather than guess, and, ultimately, succeed rather than fail.

REFERENCES

Pan, S. L., & Lee, J.-N. (2003). Using e-CRM for a unified view of the customer. Communications of the ACM, 46(4), 95-99.

Reddy, R. (2004). The reality of real time. Intelligent Enterprise, 7(10), 40-41.

Saarenvirta, G. (2001). Operation data mining. DB2 Magazine. Retrieved from http://www.db3mag.com/db_area/archives/2001/q2/saarenvirta.html

The slow progress of fast wires. (2001). The Economist, 358(8209), 57-59.

Winer, R. S. (2001). A framework for customer relationship management. California Management Review, 43(4), 89-106.
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Path Mining in Web Processes Using Profiles
A Web process is composed of Web services and transitions. Web services are represented by circles, and transitions are represented by arrows. Transitions express dependencies between Web services. A Web service with more than one outgoing transition can be classified as an and-split or an xor-split. And-split Web services enable all their outgoing transitions after completing their execution. Xor-split Web services enable only one outgoing transition after completing their execution. And-split and xor-split Web services are each represented with a distinct graphical symbol. A Web service with more than one incoming transition can be classified as an and-join or an xor-join. And-join Web services start their execution when all their incoming transitions are enabled. Xor-join Web services are executed as soon as one of their incoming transitions is enabled. As with and-split and xor-split Web services, and-join and xor-join Web services are represented with their respective symbols.

The Web process of this scenario is composed of 14 Web services. The Fill Loan Request Web service allows clients to request a loan from the bank. In this step, the client is asked to fill out an electronic form with personal information and data describing the conditions of the loan being requested.

The second Web service, Check Loan Type, determines the type of loan a client has requested and, based on the type, forwards the request to one of three Web services: Check Home Loan, Check Educational Loan, or Check Car Loan.

Educational loans are not handled and managed automatically. After an educational loan application is submitted and checked, a notification is immediately sent informing the client that he or she has to contact the bank personally.

A loan request can be either accepted (Approve Home Loan and Approve Car Loan) or rejected (Reject Home Loan and Reject Car Loan). In the case of a home loan, however, the loan can also be approved conditionally. The Web service Approve Home Loan Conditionally, as the name suggests, approves a home loan under a set of conditions.

Figure 1. The loan process

The following formula is used to determine whether a loan is approved or rejected:

MP = (L * R * (1 + R/12)^(12*NY)) / (-12 + 12 * (1 + R/12)^(12*NY))    (1)

where MP = monthly payment, L = loan amount, R = interest rate, and NY = number of years.

When the result of a loan application is known, it is e-mailed to the client. Three Web services are responsible for notifying the client: Notify Home Loan Client, Notify Education Loan Client, and Notify Car Loan Client. Finally, the Archive Application Web service creates a report and stores the loan application data in a database record.

Web Process Log

During the execution of Web processes (such as the one presented in Figure 1), events and messages generated by the enactment system are stored in a Web process log. These data stores provide an adequate format on which path mining can be performed. The data include real-time information describing the execution and behavior of Web processes, Web services, instances, transitions, and other elements such as runtime QoS metrics. Table 1 illustrates an example of a modern Web process log.

To perform path mining, current Web process logs need to be extended to store information indicating the values and the types of the input parameters passed to Web services and of the output parameters received from Web services. Table 2 shows an extended Web process log that accommodates input/output values of Web service parameters generated at run time. Each parameter/value entry has a type, a parameter name, and a value (e.g., string loan-type=car-loan).

Additionally, the Web process log needs to include path information describing the Web services that have been executed during the enactment of a Web process. This information can easily be stored in the log. For example, an extra field can be added to the log system to contain the information indicating the path followed. The path needs only to be associated with the entry corresponding to the last service of a process to be executed. For example, in the Web process log illustrated in Table 2, the service NotifyUser is the last service of a Web process. The log has been extended in such a way that the NotifyUser record contains information about the path that was followed during the Web process execution.
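Formula (1) can be checked numerically. The following is a minimal sketch in Python; the function and variable names are illustrative, not from the article:

```python
def monthly_payment(loan_amount, annual_rate, num_years):
    """Formula (1): MP = (L*R*(1+R/12)^(12*NY)) / (-12 + 12*(1+R/12)^(12*NY)).

    Algebraically this equals the standard amortization payment
    L * (R/12) * g / (g - 1), where g = (1 + R/12)^(12*NY).
    """
    g = (1 + annual_rate / 12) ** (12 * num_years)
    return loan_amount * annual_rate * g / (-12 + 12 * g)

# Example: a 100,000 loan at 6% annual interest over 30 years
# costs roughly 599.55 per month.
mp = monthly_payment(100_000, 0.06, 30)
```

A loan-approval service could then compare this monthly payment against the applicant's income to decide whether the request is approved or rejected.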
Web Process Profile

When beginning work on path mining, it is necessary to elaborate a profile for each Web process. A profile provides the input to machine learning and is characterized by its values on a fixed, predefined set of attributes. The attributes correspond to the Web service input/output parameters that have been stored previously in the Web process log. Path mining will be performed on these attributes.

A profile contains two types of attributes: numeric and nominal. Numeric attributes measure numbers, either real- or integer-valued. For example, Web service input or output parameters of type byte, decimal, int, short, or double will be placed in the profile and classified as numeric. In Table 2, the parameters LoanNum, income, BudgetCode, and tel will be classified as numeric in the profile.

Nominal attributes take on values within a finite set of possibilities; nominal quantities have values that are distinct symbols. For example, the parameter loan-type from the loan application, present in Table 2, is nominal because it can take values from the finite set home-loan, education-loan, and car-loan. In my approach, string and Boolean data types manipulated by Web services are considered to be nominal attributes.

Profile Classification

The attributes present in a profile trigger the execution of a specific set of Web services. Therefore, with each profile previously constructed, I associate an additional attribute, the path attribute, indicating the path followed when the attributes of the profile have been assigned specific values. The path attribute is a target class. Classification algorithms classify samples or instances into target classes.

After the profiles and a path attribute value for each profile have been determined, I can use data-mining methods to establish a relationship between the profiles and the paths followed at run time. One method appropriate to this problem is classification.

In classification, a learning scheme takes a set of classified profiles, from which it is expected to learn a way of classifying unseen profiles. Because the path of each training profile is provided, my methodology uses supervised learning.

EXPERIMENTS

In this section, I present the results of applying my algorithm to a synthetic loan dataset. To generate a synthetic dataset, I start with the process presented in the introductory scenario and, using it as a process model graph, log a set of process instance executions. The data are lists of event records stored in a Web process log consisting of process names, instance identification, Web service names, variable names, and so forth. Table 3 shows the additional data that have been stored in the Web process log. The information includes the Web service variable values that are logged by the system and the path that has been followed during the execution of instances. Each entry corresponds to an instance execution.
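The profile construction and path labeling described above can be sketched in plain Python. The helper `build_profile` and the parameter values are hypothetical illustrations, not the article's implementation: numeric parameter types (byte, decimal, int, short, double) map to numeric attributes, strings and Booleans to nominal ones, and the path followed at run time is attached as the target class:

```python
def attribute_kind(value):
    """Map a logged parameter value to its profile attribute type."""
    if isinstance(value, (bool, str)):
        return "nominal"   # string and Boolean types are treated as nominal
    if isinstance(value, (int, float)):
        return "numeric"   # stands in for byte, decimal, int, short, double
    return "nominal"

def build_profile(parameters, path):
    """One training profile: typed attributes plus the 'path' target class."""
    profile = {name: (attribute_kind(v), v) for name, v in parameters.items()}
    profile["path"] = ("class", path)
    return profile

# One loan-process instance (values are illustrative):
profile = build_profile(
    {"income": 52000, "loan_amount": 9000, "loan_years": 5, "loan_type": "car-loan"},
    path="FillLoanRequest>CheckLoanType>CheckCarLoan>ApproveCarLoan",
)
```

A collection of such profiles, each labeled with its path, is exactly the kind of classified training set that a supervised learner such as a decision tree expects.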
Web process profiles provide the input to machine learning and are characterized by a set of six attributes: income, loan_type, loan_amount, loan_years, name, and SSN. The profiles for the loan process contain two types of attributes: numeric and nominal. The attributes income, loan_amount, loan_years, and SSN are numeric, whereas the attributes loan_type and name are nominal. As an example of a nominal attribute, loan_type can take values from the finite set home-loan, education-loan, and car-loan. These attributes correspond to the Web service input/output parameters that have been stored previously in the Web process log presented in Table 3.

Each profile is associated with a class indicating the path that has been followed during the execution of a process when the attributes of the profile have been assigned specific values. The last column of Table 3 shows the class, named path. The profiles and path attributes will be used to establish a relationship between the profiles and the paths followed at runtime. The profiles and the class path have been extracted from the Web process log.

After profiles are constructed and associated with paths, these data are combined and formatted to be analyzed using Weka (2004), a software suite for machine learning and data mining. The data are automatically formatted using the ARFF format. I have used the J4.8 algorithm, Weka's implementation of the C4.5 decision tree learner (Hand, Mannila, & Smyth, 2001), to classify profiles. C4.5 is one of the best-known decision tree algorithms in the data-mining community, and the Weka system and its data format (ARFF) are likewise among the best known in academia.

Each experiment has involved data from 1,000 Web process executions and a variable number of attributes (ranging from two to six). I have conducted 34 experiments, analyzing a total of 34,000 records containing data from Web process instance executions. Figure 2 shows the results that I have obtained.

Figure 2. Experimental results (% of correctly predicted paths versus number of attributes involved in the prediction)

The path-mining technique developed has achieved encouraging results. When three or more attributes are involved in the prediction, the system is able to correctly predict the path followed for more than 75% of the process instances. The accuracy improves when four attributes are involved in the prediction; in this case, more than 82% of the paths are correctly predicted. When five attributes are involved, the level of prediction reaches a high of 93.4%. Involving all six attributes in the prediction gives excellent results: 88.9% of the paths are correctly predicted. When a small number of attributes is involved in the prediction, the results are not as good; for example, when only two attributes are selected, the predictions range from 25.9% to 86.7%.

FUTURE TRENDS

Currently, organizations use BPMSs, such as WfMSs, to define, enact, and manage a wide range of distinct applications (Q-Link Technologies, 2002), such as insurance claims, bank loans, bioinformatic experiments (Hall, Miller, Arnold, Kochut, Sheth, & Weise, 2003), health-care procedures (Anyanwu, Sheth, Cardoso, Miller, & Kochut, 2003), and telecommunication services (Luo, Sheth, Kochut, & Arpinar, 2003).

In the future, I expect to see a wider spectrum of applications managing processes in organizations. According to the Aberdeen Group's estimates, spending in the business process management software sector (which includes workflow systems) reached $2.26 billion in 2001 (Cowley, 2002).

The concept of path mining can be used effectively in many business applications, for example, to estimate the QoS of Web processes and workflows (Cardoso, Miller, Sheth, Arnold, & Kochut, 2004), because the estimation requires the prediction of paths. Organizations operating in modern markets, such as e-commerce activities and distributed Web services interactions, require QoS management. Appropriate quality control leads to the creation of quality products and services; these, in turn, fulfill customer expectations and achieve customer satisfaction (Cardoso, Sheth, & Miller, 2002).

CONCLUSION

BPMSs, Web processes, workflows, and workflow systems represent fundamental technological infrastructures that efficiently define, manage, and support business processes. The data generated from the execution and management of Web processes can be used to discover and extract knowledge about process executions and structure.

I have shown that one important area of Web processes to analyze is path mining. I have demonstrated how path mining can be achieved by using data-mining techniques, namely classification, to extract path knowledge from Web process logs. From my experiments, I conclude that classification methods are a good solution for performing path mining on administrative and production Web processes.

REFERENCES

Agrawal, R., Gunopulos, D., & Leymann, F. (1998). Mining process models from workflow logs. Proceedings of the Sixth International Conference on Extending Database Technology, Spain.

Anyanwu, K., Sheth, A., Cardoso, J., Miller, J. A., & Kochut, K. J. (2003). Healthcare enterprise process development and integration. Journal of Research and Practice in Information Technology, 35(2), 83-98.

Cardoso, J., Bostrom, R. P., & Sheth, A. (2004). Workflow management systems and ERP systems: Differences, commonalities, and applications. Information Technology and Management Journal, 5(3-4), 319-338.

Cardoso, J., Miller, J., Sheth, A., Arnold, J., & Kochut, K. (2004). Quality of service for workflows and Web service processes. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 1(3), 281-308.

Cardoso, J., Sheth, A., & Miller, J. (2002). Workflow quality of service. Proceedings of the International Conference on Enterprise Integration and Modeling Technology and International Enterprise Modeling Conference, Spain.

Cowley, S. (2002, September 23). Study: BPM market primed for growth. Available from the InfoWorld Web site, http://www.infoworld.com

Hall, R. D., Miller, J. A., Arnold, J., Kochut, K. J., Sheth, A. P., & Weise, M. J. (2003). Using workflow to build an information management system for a geographically distributed genome sequence initiative. In R. A. Prade & H. J. Bohnert (Eds.), Genomics of plants and fungi (pp. 359-371). New York: Marcel Dekker.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. A Bradford Book.

Herbst, J., & Karagiannis, D. (1998). Integrating machine learning and workflow management to support acquisition and adaptation of workflow models. Proceedings of the Ninth International Workshop on Database and Expert Systems Applications.

Luo, Z., Sheth, A., Kochut, K., & Arpinar, B. (2003). Exception handling for conflict resolution in cross-organizational workflows. Distributed and Parallel Databases, 12(3), 271-306.

Q-Link Technologies. (2002). BPM2002: Market milestone report. Retrieved from http://www.qlinktech.com

Smith, H., & Fingar, P. (2003). Business process management (BPM): The third wave. Meghan-Kiffer Press.

Weijters, T., & van der Aalst, W. M. P. (2001). Process mining: Discovering workflow models from event-based data. Proceedings of the 13th Belgium-Netherlands Conference on Artificial Intelligence.

Weka. (2004). Weka [Computer software]. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/

KEY TERMS

Business Process: A set of one or more linked activities that collectively realize a business objective or goal, normally within the context of an organizational structure.

Business Process Management System (BPMS): Provides an organization with the ability to collectively define and model its business processes, deploy these processes as applications that are integrated with its existing software systems, and then provide managers with the visibility to monitor, analyze, control, and improve the execution of those processes.
Process Definition: The representation of a business process in a form that supports automated manipulation or enactment by a workflow management system.

Web Process: A set of Web services that carry out a specific goal.

Web Process Data Log: Records and stores events and messages generated by the enactment system during the execution of Web processes.

Web Service: Describes a standardized way of integrating Web-based applications by using open standards over an Internet protocol.

Workflow: The automation of a business process, in whole or in part, during which documents, information, or tasks are passed from one participant to another for action, according to a set of procedural rules.

Workflow Management System: A system that defines, creates, and manages the execution of workflows through the use of software, which is able to interpret the process definition, interact with participants, and, where required, invoke the use of tools and applications.
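The Workflow and Workflow Management System definitions above can be illustrated with a minimal enactment sketch. The process-definition format and the `run` helper are hypothetical, but the xor-split semantics (enable exactly one outgoing transition) and and-split semantics (enable all of them) follow the loan-process scenario:

```python
# A process definition: each service maps to (split_type, outgoing services).
# 'xor' enables exactly one outgoing transition; 'and' enables all of them.
PROCESS = {
    "FillLoanRequest": ("xor", ["CheckLoanType"]),
    "CheckLoanType": ("xor", ["CheckHomeLoan", "CheckEducationalLoan", "CheckCarLoan"]),
    "CheckHomeLoan": ("xor", []),
    "CheckEducationalLoan": ("xor", []),
    "CheckCarLoan": ("xor", []),
}

def run(process, start, choose):
    """Enact the process from `start`; `choose` picks the branch at xor-splits."""
    path, frontier = [], [start]
    while frontier:
        service = frontier.pop(0)
        path.append(service)
        split, outgoing = process[service]
        if not outgoing:
            continue
        frontier.extend(outgoing if split == "and" else [choose(service, outgoing)])
    return path

# Route a car-loan request: the xor-split at CheckLoanType picks CheckCarLoan.
path = run(PROCESS, "FillLoanRequest",
           lambda svc, outs: "CheckCarLoan" if "CheckCarLoan" in outs else outs[0])
```

The list returned by `run` is precisely the kind of path that the log extension described earlier would record on the last service of the instance.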
M. Narasimha Murty
Indian Institute of Science, India
Shalabh Bhatnagar
Indian Institute of Science, India
Pattern Synthesis for Large-Scale Pattern Recognition
new patterns. Computationally, this method can be less records that can then be synthesized are (bread, milk,
expensive than deriving a model. It is especially useful biscuits) and (coffee, milk, sugar). Here, milk is the P
for nonparametric methods, such as NNC- and Parzen- overlap. A compact representation in this case is shown
window-based density estimation (Duda et al., 2000), in Figure 1, where a path from left to right denotes a data
which directly use the training instances. Further, this item or pattern. So you get four patterns total (two
method can also result in reduction of the computa- original and two synthetic patterns) from the graph
tional requirements. shown in Figure 1. Association rules derived from asso-
This article presents two instance-based pattern syn- ciation rule mining (Han & Kamber, 2000) can be used
thesis techniques called overlap-based pattern syn- to find these kinds of dependencies. Generalization of
thesis and partition-based pattern synthesis and their this concept and its compact representation for large
corresponding compact representations. datasets are described in the paragraphs that follow.
If the set of features, F, can be arranged in an order
Overlap-based Pattern Synthesis such that F = {f1, f 2 , ..., fd } is an ordered set, with fk being
the k th feature and all possible three-block partitions
Pattern Synthesis for Large-Scale Pattern Recognition

Let F be the set of features (or attributes). There may exist a three-block partition of F, say, {A, B, C}, with the following properties. For a given class, there is a dependency (probabilistic) among features in A ∪ B. Similarly, features in B ∪ C have a dependency. However, features in A (or C) can affect those in C (or A) only through features in B. That is, to state it more formally, A and C are statistically independent, given B. Suppose that this is the case and you are given two patterns, X = (a1, b, c1) and Y = (a2, b, c2), such that a1 is a feature-vector that can be assigned to the features in A, b to the features in B, and c1 to the features in C. Similarly, a2, b, and c2 are feature-vectors that can be assigned to features in A, B, and C, respectively. Our argument, then, is that the two patterns, (a1, b, c2) and (a2, b, c1), are also valid patterns in the same class or category as X and Y. If these two new patterns are not already in the class of patterns, it is only because of the finite nature of the set. We call this generation of additional patterns an overlap-based pattern synthesis, because this kind of synthesis is possible only if the two given patterns have the same feature-values for the features in B. In the given example, feature-vector b is common between X and Y and therefore is called the overlap. This method is suitable only with discrete-valued features (which can also be of symbolic or categorical types). If more than one such partition exists, then the synthesis technique is applied sequentially with respect to the partitions in some order.

One simple example to illustrate this concept is as follows. Consider a supermarket sales database where two records, (bread, milk, sugar) and (coffee, milk, biscuits), are given. Assume that a known dependency exists between (a) bread and milk, (b) milk and sugar, (c) coffee and milk, and (d) milk and biscuits. The two new records that can then be synthesized, with milk as the overlap, are (bread, milk, biscuits) and (coffee, milk, sugar).

If each such partition can be represented as Pi = {Ai, Bi, Ci} such that Ai = (f1, ..., fa), Bi = (fa+1, ..., fb), and Ci = (fb+1, ..., fd), then the compact representation called the overlap pattern graph can be described with the help of an example.

Overlap Pattern Graph (OLP-graph)

Let F = (f1, f2, f3, f4, f5). Let two partitions satisfying the conditional independence requirement be P1 = {{f1}, {f2, f3}, {f4, f5}} and P2 = {{f1, f2}, {f3, f4}, {f5}}. Let three given patterns be (a,b,c,d,e), (p,b,c,q,r), and (u,v,c,q,w), respectively. Because (b,c) is common between the 1st and 2nd patterns, two synthetic patterns that can be generated are (a,b,c,q,r) and (p,b,c,d,e). Likewise, three other synthetic patterns that can be generated are (p,b,c,q,w), (u,v,c,q,r), and (a,b,c,q,w). (Note that the last synthetic pattern is derived from two earlier synthetic patterns.) A compact representation called the overlap pattern graph (OLP-graph) for the entire set (including both given and synthetic patterns) is shown in Figure 2, where a path from left to right represents a pattern. The graph is constructed by inserting the given patterns, whereas the patterns that can be extracted out of the graph form the entire synthetic set consisting of both original and synthetic patterns. Thus, from the graph in Figure 2, a total of eight patterns can be extracted, five of which are new synthetic patterns.

Figure 2. OLP-graph

The OLP-graph can be constructed by scanning the given dataset only once, and the result is independent of the order in which the given patterns are considered. An approximate method for finding partitions, a method for construction of the OLP-graph, and its application to NNC are described in Viswanath, Murty, and Bhatnagar (2003). For large datasets, this representation drastically reduces the space requirement.
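The overlap-based synthesis described above can be sketched in a few lines of code. The following is an illustrative sketch (function and variable names are our own, not from the article) that closes a given pattern set under block swapping and regenerates the eight patterns of the OLP-graph example:

```python
def synthesize(patterns, partitions):
    """Close a pattern set under overlap-based synthesis.

    patterns:   set of equal-length tuples.
    partitions: list of (A, B, C) tuples of feature indices; B is the
                overlap block whose values must match before swapping.
    """
    closed = set(patterns)
    grew = True
    while grew:
        grew = False
        for blockA, blockB, blockC in partitions:
            for x in list(closed):
                for y in list(closed):
                    # Same overlap values -> swap the C-block of x with y's.
                    if all(x[i] == y[i] for i in blockB):
                        z = tuple(y[i] if i in blockC else x[i]
                                  for i in range(len(x)))
                        if z not in closed:
                            closed.add(z)
                            grew = True
    return closed

# The article's example: F = (f1..f5) with partitions P1 and P2.
P1 = ((0,), (1, 2), (3, 4))
P2 = ((0, 1), (2, 3), (4,))
given = {("a", "b", "c", "d", "e"),
         ("p", "b", "c", "q", "r"),
         ("u", "v", "c", "q", "w")}
closure = synthesize(given, [P1, P2])
# 3 given + 5 synthetic patterns, matching the Figure 2 example.
```

Note that the OLP-graph of the article stores this closure compactly without enumerating it; the brute-force closure above is only meant to make the synthesis rule itself concrete.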
Mukesh Mohania
IBM India Research Lab, India
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Physical Data Warehousing Design
…in two ways: (1) by using relational tables to project a pseudo-multi-dimensional model (an example is the Informix Red Brick Warehouse); and (2) by using true multi-dimensional data structures, such as arrays (an example is the Hyperion Essbase OLAP Server). The advantage of the MOLAP architecture is that it provides a direct multi-dimensional view of the data, whereas the ROLAP architecture is just a multi-dimensional interface to relational data. On the other hand, the ROLAP architecture has two major advantages: (i) it can be used and easily integrated into other existing relational database systems; and (ii) relational data can be stored more efficiently than multi-dimensional data.

Data warehousing query operations include standard SQL operations, such as selection, projection, and join. In addition, data warehousing supports various extensions to aggregate functions, such as percentile functions (e.g., top 20th percentile of all products), rank functions (e.g., top 10 products), mean, mode, and median. One of the important extensions to the existing query language is support for multiple group-bys, defined through the roll-up, drill-down, and cube operators. Roll-up corresponds to doing a further group-by on the same data object. Note that the roll-up operator is order sensitive; that is, when it is defined in the extended SQL, the order of the columns (attributes) matters. The function of a drill-down operation is the opposite of roll-up.

…applications have driven the growth of the DBMS industry in the past three decades and will doubtless continue to be important. One of the main objectives of relational systems is to maximize transaction throughput and minimize concurrency conflicts. However, these systems generally have limited decision support functions and do not extract all the necessary information required for faster, better, and more intelligent decision making for the growth of an organization. For example, it is hard for an RDBMS to answer the following query: What are the supply patterns for product ABC in New Delhi in 2003, and how were they different from the year 2002? Therefore, it has become important to support analytical processing capabilities in organizations for (1) the efficient management of organizations, (2) effective marketing strategies, and (3) efficient and intelligent decision making. OLAP tools are well suited for complex data analysis, such as multi-dimensional data analysis, and for assisting in decision support activities, which access data from a separate repository called a data warehouse that selects data from many operational legacy, and possibly heterogeneous, data sources. The following table summarizes the differences between OLTP and OLAP.

MAIN THRUST
Materialized Views

Materialized views are used to precompute and store aggregated data, such as the sum of sales. They also can be used to precompute joins, with or without aggregations. So, materialized views are used to reduce the overhead associated with expensive joins or aggregations for a large or important class of queries. Two major problems related to materializing the views are (1) the view-maintenance problem and (2) the view-selection problem. Data in the warehouse can be seen as materialized views generated from the underlying multiple data sources. Materialized views are used to speed up query processing on large amounts of data. These views need to be maintained in response to updates in the source data. This often is done using incremental techniques that access data from the underlying sources. In a data-warehousing scenario, accessing base relations can be difficult; sometimes data sources may be unavailable, since these relations are distributed across different sources. For these reasons, self-maintainability of the view is an important issue in data warehousing. The warehouse views can be made self-maintainable by materializing some additional information, called auxiliary relations, derived from the intermediate results of the view computation. Several algorithms, such as the counting algorithm and the exact-change algorithm, have been proposed in the literature for maintaining materialized views.

To answer queries efficiently, a set of views that are closely related to the queries is materialized at the data warehouse. Note that not all possible views are materialized, as we are constrained by some resource like disk space, computation time, or maintenance cost. Hence, we need to select an appropriate set of views to materialize under some resource constraint. The view selection problem (VSP) consists of selecting a set of materialized views that satisfies the query response time under some resource constraints. Studies have shown that this problem is NP-hard. Most of the proposed algorithms for the VSP are static. This is because each algorithm starts with a set of frequently asked queries (known a priori) and then selects a set of materialized views that minimize the query response time under some constraint. The selected materialized views will benefit only queries belonging to the set of a priori known queries. The disadvantage of this kind of algorithm is that it contradicts the dynamic nature of decision support analysis. Especially for ad-hoc queries, where the expert user is looking for interesting trends in the data repository, the query pattern is difficult to predict.

Indexing Techniques

Indexing has been the foundation of performance tuning for databases for many years. It creates access structures that provide faster access to the base data relevant to the restriction criteria of queries. The size of the index structure should be manageable, so that benefits can be accrued by traversing such a structure. The traditional indexing strategies used in database systems do not work well in data warehousing environments. Most OLTP transactions typically access a small number of rows; most OLTP queries are point queries. B-trees, which are used in the most common relational database systems, are geared toward such point queries and are well suited for accessing a small number of rows. An OLAP query, in contrast, typically accesses a large number of records for summarizing information. For example, an OLTP transaction would typically query for a customer who booked a flight on TWA 1234 on April 25; an OLAP query, on the other hand, would be more like "give me the number of customers who booked a flight on TWA 1234 in one month." The second query would access many more records, and such queries are a type of range query. The B-tree indexing scheme is not suited to answering OLAP queries efficiently. An index can be built on a single column or on multiple columns of a table (or view). An index can be either clustered or non-clustered. An index can be defined on one table (or view) or on many tables using a join index. In the data warehouse context, when we talk about indexes, we refer to two different things: (1) indexing techniques and (2) the index selection problem. A number of indexing strategies have been suggested for data warehouses: value-list index, projection index, bitmap index, bit-sliced index, data index, join index, and star join index.

Data Partitioning and Parallel Processing

The data partitioning process decomposes large tables (fact tables, materialized views, indexes) into multiple (relatively) small tables by applying selection operators. Consequently, partitioning offers significant improvements in availability, administration, and table scan performance (Oracle9i). Two types of partitioning are possible to decompose a table: vertical and horizontal. In vertical fragmentation, each partition consists of a set of columns of the original table. In horizontal fragmentation, each partition consists of a set of rows of the original table. Two versions of horizontal fragmentation are available: primary horizontal fragmentation and derived horizontal fragmentation. The primary horizontal partitioning (HP) of a relation is performed using predicates that are defined on that table. On the other hand, the derived partitioning of a table results from predicates defined on another relation. In a ROLAP context, data partitioning is applied as follows (Bellatreche et al., 2002): it starts by fragmenting the dimension tables and then, by using derived horizontal partitioning, it decomposes the fact table into several fact fragments. Moreover, by partitioning the data of a ROLAP schema (star schema or snowflake schema) among a set of processors, OLAP queries can be executed in parallel, potentially achieving a linear speedup and thus significantly improving query response time (Datta et al., 1998). Therefore, data partitioning and parallel processing are two complementary techniques for reducing query processing cost in data warehousing environments.

FUTURE TRENDS

It has been seen that many enterprises are moving toward building Operational Data Store (ODS) solutions for real-time business analysis. The ODS gets data from one or more Enterprise Resource Planning (ERP) systems and keeps the most recent version of information for analysis rather than the history of the data. Since Client Relationship Management (CRM) offerings have evolved, there is a need for active integration of CRM with the ODS for real-time consulting and marketing (i.e., how to integrate the ODS with CRM via a messaging system for real-time business analysis).

Another recent trend is that many enterprises are moving from data warehousing solutions to information integration (II). II refers to a category of middleware that lets applications access data as though they were in a single database, whether or not they are. It enables the integration of data and content sources to provide real-time read and write access, in order to transform data for business analysis and data warehousing, and to place data for performance, currency, and availability. That is, we envisage that there will be more focus on integrating data and content rather than only integrating structured data, as is done in data warehousing.

CONCLUSION

Data warehousing design is quite different from the design of transactional database systems, commonly referred to as Online Transaction Processing (OLTP) systems. A data warehouse tends to be extremely large, and the information in a warehouse usually is analyzed in a multi-dimensional way. The main objective of a data warehousing design is to facilitate efficient query processing and maintenance of materialized views. To achieve this objective, it is important that the relevant data be materialized in the warehouse. Therefore, the problems of selecting materialized views and maintaining them are very important and have been addressed in this article. To further reduce the query processing cost, the data can be partitioned; partitioning helps in reducing irrelevant data access and eliminates costly joins. However, partitioning at too fine a granularity can increase the data access and processing cost. The third problem is index selection. We found that judicious index selection does reduce the cost of query processing, but we also showed that indices on materialized views improve the performance of queries even more. Since indices and materialized views compete for the same resource (storage), we found that it is possible to apply heuristics to distribute the storage space among materialized views and indices so as to efficiently execute queries and maintain materialized views and indexes. It has been seen that enterprises are moving toward building data warehousing and operational data store solutions.
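To make the incremental view-maintenance idea concrete, here is a minimal sketch (our own illustration, much simpler than the counting and exact-change algorithms cited above) of a materialized sum-of-sales view kept consistent under source inserts and deletes. A per-group count is the auxiliary information that makes the aggregate self-maintainable for deletions:

```python
def apply_delta(view, delta):
    """Maintain a materialized SUM/COUNT view per group under deltas.

    view:  {group: [total, count]}
    delta: iterable of (op, group, value), with op "+" for an inserted
           source row and "-" for a deleted one.
    The count lets us drop a group exactly when its last row is deleted,
    without re-reading the base data.
    """
    for op, group, value in delta:
        total, count = view.get(group, [0, 0])
        if op == "+":
            view[group] = [total + value, count + 1]
        else:
            view[group] = [total - value, count - 1]
            if view[group][1] == 0:
                del view[group]
    return view

# Materialized "sum of sales by product" view, refreshed incrementally.
view = {}
apply_delta(view, [("+", "bread", 5), ("+", "milk", 3), ("+", "bread", 2)])
apply_delta(view, [("-", "milk", 3)])   # last milk row deleted -> group gone
```

The same delta-propagation pattern extends to joins and other distributive aggregates, which is what the algorithms in the literature formalize.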
KEY TERMS

Dimension Table: A table containing the data for one dimension within a star schema. The primary key is used to link to the fact table, and each level in the dimension has a corresponding field in the dimension table.

Fact Table: The central table in a star schema, containing the basic facts or measures of interest. Dimension fields are also included (as foreign keys) to link to each dimension table.

Horizontal Partitioning: Distributing the rows of a table into several separate tables.

Join Index: Built by translating restrictions on the column value of a dimension table to restrictions on a fact table. The index is implemented using one of two representations: row id or bitmap, depending on the cardinality of the indexed column.

Legacy Data: Data that you already have and use. Most often, this takes the form of records in an existing database on a system in current use.

Measure: A numeric value stored in a fact table or cube. Typical examples include sales value, sales volume, price, stock, and headcount.

Star Schema: A simple database design in which dimensional data are separated from fact or event data. A large dimensional model is another name for a star schema.
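The star schema and the roll-up and cube operators described in this article can be illustrated together with a small sketch (the tables, contents, and names below are invented for illustration). A fact table is joined to its dimension tables, and the cube operator aggregates the measure over every subset of the grouping columns; each coarser subset corresponds to a roll-up:

```python
from itertools import combinations

# Toy star schema: a fact table keyed by dimension ids (hypothetical data).
product_dim = {1: ("bread", "food"), 2: ("soap", "household")}  # id -> (name, category)
city_dim = {10: ("New Delhi", "India"), 20: ("Mumbai", "India")}  # id -> (city, country)
fact = [(1, 10, 5.0), (1, 20, 3.0), (2, 10, 2.0)]  # (product_id, city_id, sales)

def cube(rows, dims, measure):
    """Aggregate `measure` over every subset of grouping columns (CUBE).

    Returns {grouping_columns: {group_key: aggregated_value}}; the empty
    grouping () is the grand total, i.e., the full roll-up.
    """
    out = {}
    for r in range(len(dims) + 1):
        for cols in combinations(dims, r):
            grp = {}
            for row in rows:
                key = tuple(row[c] for c in cols)
                grp[key] = grp.get(key, 0.0) + row[measure]
            out[cols] = grp
    return out

# Join each fact row with its dimensions, then cube on (category, country).
joined = [{"category": product_dim[p][1],
           "country": city_dim[c][1],
           "sales": s}
          for p, c, s in fact]
result = cube(joined, ("category", "country"), "sales")
```

Here `result[()]` holds the grand total and `result[("category",)]` the roll-up to category level, mirroring how a cube materializes every group-by a roll-up or drill-down might request.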
Andrew L. Betz
Progressive Insurance, USA
James H. Drew
Verizon Laboratories, USA
…efficiently utilized. Marketing departments focus on initiatives that increase infrastructure usage to improve both customer retention and ongoing revenue. Engineering and operations departments focus on the cost of …

(Figure: Capacity, Peak Demand, and Avg Demand.)
Predicting Resource Usage for Capital Efficient Marketing
Figure 2. Flowchart describing data sources and data mining operations used in predicting busy-hour impact of marketing initiatives
…minutes could potentially increase peak usage. The quantification of this effect is complicated by the corporate reality of myriad rate plans and geographically extensive and complicated peak usage patterns. In this study, we use data mining methods to analyze customer, call detail, rate plan, and cell-site location data to predict the effect of marketing initiatives on busy-hour network utilization. This will enable forecasting the network cost of service for marketing initiatives, thereby leading to optimization of capital outlay.

MAIN THRUST

Ideally, the capital cost of a marketing initiative is obtained by determining the existing capacity and the increased capacity required under the new initiative, and then factoring in the cost of the additional capital; data for a study like this would come from a corporate data warehouse (Berson & Smith, 1997) that integrates data from relevant sources. Unfortunately, such detailed cost data are not available in most corporations and businesses. In fact, in many situations, the connection between promotional marketing initiatives and capital cost is not even recognized. In this case study, we therefore need to assemble relevant data from different and disparate sources in order to predict the busy-hour impact of marketing initiatives.

Data

The parallelograms in the flowchart in Figure 2 indicate essential data sources for linking marketing initiatives to busy-hour usage. Customer data characterize the customer by indicating a customer's mobile phone number(s), lifetime value, access charge, subscribed rate plan, and peak and off-peak minutes used. Rate plan data provide details for a given rate plan, including monthly charges; allowed peak, off-peak, and weekend minutes of use; per-minute charges for excess use; long distance and roaming charges; and so forth. Call detail data, for every call placed in a given time period, provide the originating and terminating phone numbers (and, hence, originating and terminating customers), the cell sites used in handling the call, call duration, and other call details. Cell site location data indicate the geographic location of cell sites, the capacity of each cell site, and details about the physical and electromagnetic configuration of the radio towers.

Data Mining Process

Figure 2 provides an overview of the analysis and data-mining process. The numbered processes are described in more detail to illustrate how the various components are integrated into an exploratory tool that allows marketers and network engineers to evaluate the effect of proposed initiatives.

1. Cell-Site Clustering: Clustering cell sites using geographic location (latitude, longitude) results in cell site clusters that capture the underlying population density, with cluster area generally inversely proportional to population. This is a natural consequence of the fact that heavily populated urban areas tend to have more cell towers to cover the large call volumes and provide good signal coverage. The flowchart for cell site clustering is included in Figure 3, with results of k-means clustering (Hastie, Tibshirani & Friedman, 2001) for the San
Francisco area shown in Figure 4. The cell sites in the area are grouped into four clusters, each cluster approximately circumscribed by an oval.

Figure 3. Steps 1 and 2 in the data mining process outlined in Figure 2
2. Predictive Modeling: The predictive modeling
stage, shown in Figure 3, merges customer and rate
plan data to provide the data attributes (features)
used for modeling. The target for modeling is ob-
tained by combining cell site clusters with network
usage information. Note that the sheer volume of
call detail data makes its summarization and merg-
ing with other customer and operational data a
daunting one. See Berry and Linoff (2000) for a
discussion. The resulting dataset has related cus-
tomer characteristics and rate plan details for every
customer, matched up with that customer's actual
network usage and the cell sites providing service
for that customer. This data can then be used to
build a predictive model. Feature selection (Liu &
Motoda, 1998) is performed, based on correlation
to target, with grouped interactions taken into
account. The actual predictive model is based on
linear regression (Hand, Mannila & Smyth, 2001;
Hastie, Tibshirani & Friedman, 2001), which, for
this application, performs similarly to neural network
models (Haykin, 1994). Note that the sheer number
of customer and rate plan characteristics requires
the variable reduction capabilities of a data-mining
solution.
Figure 4. Clustering cell sites in the San Francisco area, based on geographic position. Each spot represents a cell
site, with the ovals showing approximate location of cell site clusters.
Figure 5. Regression modeling results showing predicted vs. actual busy-hour usage for the cell site clusters shown in Figure 4 (per-cluster fits include y = 0.3562x + 0.5269, R² = 0.3776; y = 0.3066x + 0.6334, R² = 0.3148; and y = 0.3457x + 0.4318, R² = 0.3591)

…customers and rate plans are clustered into a small number of groups. Without such clustering, the complex data combinations and analyses needed to predict busy-hour usage will result in very little predictive power. Clusters ameliorate the situation by providing a small set of representative cases to use in the prediction. After some exploration, we decide on using four clusters for both customers and rate plans. Results of the clustering for customers and applicable rate plans in the San Francisco area are shown in Figures 6 and 7.

…the expected number of customers who would subscribe to the plan and predict the busy-hour usage on targeted cell sites. The process is outlined in Figure 8. Inputs labeled A, B, and C come from the …
Figure 6. Customer clustering flowchart and results of customer clustering. The key attributes identifying the four clusters are access charge (ACC_CHRG), lifetime value (LTV_PCT), peak minutes of use (PEAK_MOU), and total calls (TOT_CALL).

Figure 7. Rate plan clustering flowchart and clustering results. The driving factors that distinguish rate plan clusters include per-minute charges (intersystem, peak, and off-peak) and minutes of use.
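The k-means procedure used above for cell-site, customer, and rate-plan clustering can be sketched in a few lines (a toy illustration with invented coordinates, not the study's data):

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on (x, y) points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, labels

# Two well-separated "cell site" groups (hypothetical coordinate offsets).
sites = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
cents, labels = kmeans(sites, k=2)
```

For customer and rate-plan clustering the same algorithm applies, with the two geographic coordinates replaced by feature vectors such as access charge, LTV, and minutes of use.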
Figure 8. Predicting busy-hour usage at a cell site for a given marketing initiative
Therefore, we validate the analytic model by estimating its impact and comparing the estimate with known parameters.

We began this estimation by identifying the 10 most frequent rate plans in the data warehouse. From this list, we selected one rate plan for validation. Based on marketing area and geographic factors, we assume an equal inwards projection for each customer cluster. Of course, our approach allows for, but does not require, different inward projections for the customer clusters. This flexibility could be useful, for example, when price elasticity data suggest inward projection differences by cluster.

Starting at the cluster level, we applied the predictive model to estimate the average busy-hour usage for each customer on each cell tower. These cluster-level predictions were disaggregated to the cellular-tower level by assigning busy-hour minutes proportionate to total minutes for each cell tower within the respective cellular-tower cluster. Following this, we then calculated the actual busy-hour usage per customer of that rate plan across the millions of call records. In Figure 9, a scatter plot of actual busy-hour usage against predicted busy-hour usage, with an individual cell tower now the unit of analysis, reveals an R² correlation measure of 0.13.

Figure 9. Scatter plot showing actual vs. predicted busy-hour usage for each cell site, for a specific rate plan (Rate Plan RDCDO; fitted line y = 1.31x - 0.1175, R² = 0.1257)

The estimated model accuracy dropped from R²s in the mid 0.30s for cluster-level data (Figure 5) to about 0.13 when the data were disaggregated to the cellular-tower level (Figure 9). In spite of the relatively low R² value, the correlation is statistically significant, indicating that this approach can make contributions to the capital estimates of marketing initiatives. However, the model accuracy on disaggregated data was certainly lower than the effects observed at the cluster level. The reasons for this loss in accuracy probably can be attributed to the fine-grained disaggregation, the large variability among the cell sites in terms of busy-hour usage, the proportionality assumption made in disaggregating, and model sensitivity to inward projections. These data point out both an opportunity (that rate plans can be intelligently targeted to specific locations with excess capacity) and a threat (that high busy-hour volatility would lead engineering to be cautious in allowing usage-increasing plans).

Figure 10. Lifetime value (LTV) distributions for customers with below- and above-average usage and BH usage indicate that high-LTV customers also impact the network with high BH usage

Business Insights

In the context of the analytic model, the data also can reveal some interesting business insights. First, observe the customer lifetime value density plots as a function of the strain that customers place on the network. The left panel in Figure 10 shows LTV density for customers with below-average total usage and below-average busy-hour usage. The panel on the right shows LTV density for customers with above-average total usage and above-average busy-hour usage. Although the predominant thinking in marketing circles is that higher LTV is always better, the data suggest this reasoning should be tempered by whether
Figure 11. Exploring customer clusters and rate plan clusters in the context of BH usage shows that customers in a cluster who subscribe to a class of rate plans impact BH usage differently from customers in a different cluster (panels A and B plot the proportion of heavy BH users by customer cluster, for rate plan clusters 2 and 3)
the added value in revenue offsets the disproportionate strain on network resources. This is the basis for a fundamental tension between the marketing and engineering functions in large businesses.

Looking at busy-hour impact by customer cluster and rate plan cluster is also informative, as shown in Figure 11. For example, if we define heavy BH users as customers who are above average in total minutes as well as busy-hour minutes, we can see main effect differences across customer clusters (Figure 11a). This is not entirely surprising, as we have already seen that LTV was a major determinant of customer cluster, and heavy BH customers also skewed towards having higher LTV. However, there was an unexpected crossover interaction of rate plan cluster by customer cluster when heavy BH users were the target (Figure 11b). The implication is that, controlling for revenue, certain rate plan types are more network-friendly, depending on the customer cluster under study. Capital-conscious marketers in theory could tailor their rate plans to minimize network impact by tuning rate plans more precisely to customer segments.

FUTURE TRENDS

The proof-of-concept case study sketched here describes an analytical effort to optimize the capital costs of marketing programs, while promoting customer satisfaction and retention by utilizing the business insights gained to tailor marketing programs to better match the needs and requirements of customers (Lenskold, 2003). As database systems incorporate data-mining engines (e.g., Oracle data mining [Oracle, 2004]), future software and customer relationship management applications will automatically incorporate such an analysis to extract optimal recommendations and business insights for maximizing business return on investment and customer satisfaction, leading to effective and profitable one-on-one customer relationship management (Brown, 2000; Gilmore & Pine, 2000; Greenberg, 2001).

CONCLUSION

We have made the general argument that a comprehensive company data warehouse and broad sets of models analyzed with modern data mining techniques can resolve tactical differences between different organizations within the company and provide a systematic and sound basis for business decision making. The role of data mining in so doing has been illustrated here in mediating between the marketing and engineering functions in a wireless telephony company, where statistical models are used to target specific customer segments with rate plan promotions in order to increase overall usage (and, hence, retention), while more efficiently using, but not exceeding, network capacity. A further practical advantage of this data-mining approach is that the customer groups, cell site locations, and rate plan promotions are all simplified through clustering, which condenses their characteristic representation and facilitates productive discussion among upper-management strategists.

We have sketched several important practical business insights that come from this methodology. A basic concept to be demonstrated to upper management is that busy-hour usage, the main driver of equipment capacity needs, varies greatly by customer segment and by general cell site grouping (see Figure 9) and that usage can be differently volatile over site groupings (see Figure 5). These differences point out the potential need for targeting specific customer segments with specific rate plan promotions in specific locations. One interesting example is illustrated in Figure 11, where two customer segments respond differently to each of two candidate rate plans, one group decreasing its BH usage under one plan and the other decreasing it under the other plan. This leads to the possibility that certain rate plans should be offered to specific customer groups in locations where there is little excess equipment capacity, but others can be offered where there is more slack in capacity.

Of course, there are many organizational, implementation, and deployment issues associated with this integrated approach. First, all involved business functions must accept the validity of the modeling and its results, and this concurrence requires the active support of the upper management overseeing each function. Second, these models should be repeatedly run to assure their relevance in a dynamic business environment, as wireless telephony is in our illustration. Third, provision should be made for capturing the sales and engineering consequences of the decisions made from this approach, to be built into future models. Despite these organizational challenges, business decision makers informed by these models hopefully may resolve a fundamental business rivalry.

REFERENCES

Abramowicz, W., & Zurada, J. (2000). Knowledge discovery for business information systems. Kluwer.

Berry, M.A., & Linoff, G.S. (2000). Mastering data mining: The art and science of customer relationship management. Wiley.

Berry, M.A., & Linoff, G.S. (2004). Data mining techniques: For marketing, sales, and customer relationship management. Wiley.
Berson, A., & Smith, S. (1997). Data warehousing, data mining and OLAP. McGraw-Hill.

Brown, S.A. (2000). Customer relationship management: Linking people, process, and technology. John Wiley.

Drew, J., Mani, D.R., Betz, A., & Datta, P. (2001). Targeting customers with statistical and data-mining techniques. Journal of Services Research, 3(3), 205-219.

Gilmore, J., & Pine, J. (2000). Markets of one. Harvard Business School Press.

Green, J.H. (2000). The Irwin handbook of telecommunications. McGraw-Hill.

Greenberg, P. (2001). CRM at the speed of light: Capturing and keeping customers in Internet real time. McGraw-Hill.

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann.

Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Bradford Books.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.

Haykin, S. (1994). Neural networks: A comprehensive foundation. Prentice-Hall.

KEY TERMS

Capital-Efficient Marketing: Marketing initiatives that explicitly take into account and optimize, if possible, the capital cost of provisioning service for the introduced initiative or promotion.

Feature Selection: The process of identifying those input attributes that contribute significantly to building a predictive model for a specified output or target.

Inwards Projection: Estimating the expected number of customers that would sign on to a marketing promotion, based on customer characteristics and profiles.

k-Means Clustering: A clustering method that groups items that are close together, based on a distance metric like Euclidean distance, to form clusters. The members in each of the clusters can be described succinctly using the mean (or centroid) of the respective cluster.

Lifetime Value: A measure of the profit-generating potential, or value, of a customer; a composite of expected tenure (how long the customer stays with the business provider) and expected revenue (how much a customer spends with the business provider).

Predictive Modeling: Use of statistical or data-mining methods to relate attributes (input features) to targets (outputs) using previously existing data for training in such a manner that the target can be predicted for new data based on the input attributes alone.
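As an illustration of the k-means procedure defined above (assign each item to its nearest centroid, then move each centroid to the mean of its members), here is a minimal sketch; the sample points, the choice of k, and the fixed iteration count are invented for this example and are not from the encyclopedia entry:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Group 2-D points into k clusters by squared Euclidean distance."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)          # initial cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: the two natural groups
```

Each final centroid then summarizes its cluster succinctly, as the definition above notes.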
ENDNOTES

4. Busy hour (BH) is that hour during which the network utilization is at its peak. It is that hour during the day or night when the product of the average number of incoming calls and average call duration is at its maximum.
5. The well-established and widely used concept of customer lifetime value (LTV) captures expected revenue and expected tenure, preferably personalized for each individual customer (Drew et al., 2001; Mani et al., 1999).
6. Inwards projection is a term used to denote an estimate of the number of customers that would subscribe to a new (or enhanced) rate plan, with details reasonably summarizing characteristics of customers expected to sign on to such a plan.
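Endnote 4 defines the busy hour as a simple maximization over hourly traffic statistics. A sketch of that computation, assuming hypothetical hourly figures (the `hourly_stats` values are made up for illustration):

```python
# Hourly stats: hour of day -> (avg incoming calls, avg call duration in minutes).
# These figures are invented for illustration, not data from the article.
hourly_stats = {
    9:  (1200, 2.5),
    10: (1500, 3.0),
    11: (1400, 3.5),
    20: (1600, 2.0),
}

# Busy hour: the hour whose product of average incoming calls and average
# call duration (i.e., total offered traffic) is at its maximum.
busy_hour = max(hourly_stats,
                key=lambda h: hourly_stats[h][0] * hourly_stats[h][1])
print(busy_hour)  # 11, since 1400 * 3.5 = 4900 exceeds every other product
```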
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Privacy and Confidentiality Issues in Data Mining
power to extract previously unknown knowledge, data mining also has become one of the main targets of privacy advocates. In reference to data mining, there are two main concerns regarding privacy and confidentiality:

(1) Protecting the confidentiality of data and privacy of individuals against data mining methods.
(2) Enabling the data mining algorithms to be used over a database without revealing private and confidential information.

We should note that, although both of the tracks seem to be similar to each other, their purposes are different. In the first case, the data may not be confidential, but data-mining tools could be used to infer confidential information. Therefore, some sanitization techniques are needed to modify the database, so that privacy concerns are alleviated. In the second track, the data are considered confidential and are perturbed before they are given to a third party. Both of the approaches are necessary in order to achieve full privacy of individuals and will be discussed in detail in the following subsections.

[…] horizontally. If the schemas of the local databases are complementary to each other, then this means that the data are vertically distributed. The Purdue research group considered both horizontal and vertical data distribution for privacy-preserving association rule mining. In both cases, the individual association rules together with their statistical properties were assumed to be confidential. Also, secure multi-party computation techniques were employed, based on specialized encryption practices, to make sure that the confidential association rules were circulated among the participating companies in encrypted form. The resulting global association rules can be obtained in a private manner without each company knowing which rule belongs to which local database.

Both of the seminal works on privacy-preserving data mining (Agrawal & Srikant, 2000; Kantarcioglu & Clifton, 2002) have shaped the research in this area. They show how data mining can be done on private data in a centralized and distributed environment. Although a specific data-mining model was the target in both of these papers, these authors initiated ideas that could be applied to other data-mining models.
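The second track — perturbing confidential data before handing it to a third party — can be illustrated with additive Gaussian noise, the style of perturbation used in value-distortion schemes. This is a toy sketch with invented `ages` data and an arbitrary noise level, not the implementation of any of the cited works:

```python
import random

def perturb(values, sigma=25.0, seed=1):
    """Release a noisy copy of confidential values: each value gets
    independent Gaussian noise with mean 0 added to it."""
    rnd = random.Random(seed)
    return [v + rnd.gauss(0.0, sigma) for v in values]

ages = [23, 31, 45, 52, 38, 29, 61, 47] * 50   # confidential attribute, 400 rows
released = perturb(ages)                        # what the third party sees

true_mean = sum(ages) / len(ages)
released_mean = sum(released) / len(released)
# Individual records are distorted, but distribution-level statistics
# survive, which is what distribution-reconstruction techniques exploit.
print(round(true_mean, 1))
```

The point of the sketch is the asymmetry: any single released value is far from its original, yet aggregates computed over many rows remain close to the truth.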
part of the released data as a training set, which can be used further to predict the hidden data values. Chang and Moskovitz (2000) propose a feedback mechanism that will try to construct a prediction model and go back to the database to update it with the aim of blocking the prediction of confidential data values.

Another aspect of privacy and confidentiality against data mining is that data-mining results, such as the patterns in a database, could be confidential. For example, a database could be released with the thought that the confidential values are hidden and cannot be queried. However, there may be patterns in the database that are confidential. This issue was first pointed out by O'Leary (1991). Clifton and Marks (1996) further elaborate on the issue of patterns being confidential with specific examples of data-mining models, such as association rules. The association rules, which are very popular in retail marketing, may contain confidential information that should not be disclosed. In Verykios et al. (2004), a way of identifying confidential association rules and sanitizing the database to limit the disclosure of the association rules is discussed. Furthermore, the work illustrates ways in which heuristics for updating the database can reduce the significance of the association rules in terms of their support and confidence. The main idea is to change the transactions in the database that contribute to the support and confidence of the sensitive (or private) association rules in a way that the support and/or confidence of the rules decrease with a limited effect on the non-sensitive rules.

FUTURE TRENDS

Privacy and confidentiality issues in data mining will be more and more crucial as data collection efforts increase and the type of data collected becomes more diverse. A typical example of this is the usage of RFID tags and mobile phones, which reveal our sensitive location information. As more data become available and more tools are implemented to search this data, the privacy risks will increase even more. Consider the World Wide Web, which is a huge data repository with very powerful search engines working on top of it. An adversary can find the phone number of a person from some source and use the search engine Google to obtain the address of that person. Then, by applying the address to a tool such as Mapquest, door-to-door driving directions to the corresponding person's home could be found easily. This is a very simple example of how the privacy of a person could be in danger by integrating data from a couple of sources, together with a search engine over a data repository. Data integration over multiple data repositories will be one of the main challenges in privacy and confidentiality of data against data mining.

The New York Times article cited previously highlights another important aspect of privacy issues in data mining by saying, "Perhaps the strongest protection against abuse of information systems is Strong Audit mechanisms … we need to watch the watchers" (Markoff, 2002). This confirms that data-mining tools also should be monitored, and users' access to data via data-mining tools should be controlled. Although there is some initial work on this issue (Oliveira, Zaiane & Saygin, 2004), how this could be achieved is still an open problem waiting to be addressed, since there is no available access control mechanism specifically developed for data-mining tools similar to the one employed in database systems.

Privacy-preserving data-mining techniques have been proposed, but they may not fully preserve the privacy, as pointed out in Agrawal and Aggarwal (2001) and Evfimievski, Gehrke, and Srikant (2003). Therefore, privacy metrics and benchmarks are needed to assess the privacy threats and the effectiveness of the proposed privacy-preserving data-mining techniques or the privacy breaches introduced by data-mining techniques.

CONCLUSION

Data mining has found a lot of applications in the industry and government, due to its success in combining machine learning, statistics, and database fields with the aim of turning heaps of data into valuable knowledge. Widespread usage of the Internet, especially for e-commerce and other services, has led to the collection and storage of more data at a lower cost. Also, large data repositories about individuals and their activities, coupled with powerful data mining tools, have increased fears about privacy. Therefore, data mining researchers have felt the urge to address the privacy and confidentiality issues. Although privacy-preserving data mining is still in its infancy, some promising results have been achieved as the outcomes of initial studies. For example, data perturbation techniques enable data-mining models to be built on private data, and encryption techniques allow multiple parties to mine their databases as if their data were stored in a central database. The threat of powerful data-mining tools revealing confidential information also has been addressed. The initial results in this area have shown that confidential patterns in a database can be concealed by specific hiding techniques. For the advance of the data-mining technology, privacy issues need to be investigated more, and the current problems, such as privacy leaks in privacy-preserving data-mining algorithms, and the scalability issues need to be resolved. In summary, data-mining technology is needed to make our lives easier and to increase our safety standards, but, at the same time, privacy standards should
be established and enforced on data collectors in order to protect the privacy of data owners against the misuse of the collected data.
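The sanitization idea discussed in this article — changing transactions that contribute to a sensitive pattern's support until the pattern falls below the mining threshold — can be sketched as a greedy heuristic. The example database, the threshold, and the victim-item choice are invented for illustration; this is not the algorithm of Verykios et al. (2004):

```python
def support(db, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def hide_itemset(db, sensitive, min_sup):
    """Sanitize db in place so the sensitive itemset's support drops below
    min_sup, by deleting one of its items from supporting transactions.
    Toy greedy heuristic: stops as soon as the pattern is concealed."""
    victim = next(iter(sensitive))        # item to delete from transactions
    for t in db:
        if support(db, sensitive) < min_sup:
            break
        if sensitive <= t:
            t.discard(victim)             # t no longer supports the pattern
    return db

db = [{'bread', 'milk'}, {'bread', 'milk', 'beer'}, {'bread', 'milk'},
      {'beer'}, {'bread'}, {'milk'}]
sensitive = {'bread', 'milk'}             # support is initially 3/6 = 0.5
hide_itemset(db, sensitive, min_sup=0.34)
print(support(db, sensitive) < 0.34)      # True: the pattern is concealed
```

A fuller heuristic would also track the side effect on non-sensitive rules, which is exactly the trade-off the article describes.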
REFERENCES

Adam, N.R., & Wortmann, J.C. (1989). Security-control methods for statistical databases: A comparative study. ACM Computing Surveys, 21(4), 515-556.

Agrawal, D., & Aggarwal, C. (2001). On the design and quantification of privacy preserving data mining algorithms. Proceedings of ACM Symposium on Principles of Database Systems, Santa Barbara, California.

Agrawal, R., & Srikant, R. (2000). Privacy preserving data mining. Proceedings of SIGMOD Conference, Dallas, Texas.

Clifton, C., & Marks, D. (1996). Security and privacy implications of data mining. Proceedings of ACM Workshop on Data Mining and Knowledge Discovery, Montreal, Canada.

Evfimievski, A.V., Gehrke, J., & Srikant, R. (2003). Limiting privacy breaches in privacy preserving data mining. Proceedings of ACM Symposium on Principles of Database Systems, San Diego, California.

Farkas, C., & Jajodia, S. (2003). The inference problem: A survey. ACM SIGKDD Explorations, 4(2), 6-12.

Kantarcioglu, M., & Clifton, C. (2002). Privacy-preserving distributed mining of association rules on horizontally partitioned data. Proceedings of The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Madison, Wisconsin.

Markoff, J. (2002, December 19). Study seeks technology safeguards for privacy. New York Times (p. 18).

O'Leary, D.E. (1991). Knowledge discovery as a threat to database security. In G. Piatetsky-Shapiro, & W.J. Frawley (Eds.), Knowledge discovery in databases (pp. 507-516). AAAI/MIT Press.

Oliveira, S., Zaiane, O., & Saygin, Y. (2004). Secure association rule sharing. Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Sydney, Australia.

Saygin, Y., Verykios, V.S., & Clifton, C. (2001). Using unknowns to prevent the discovery of association rules. SIGMOD Record, 30(4), 45-54.

Schwartz, J., & Micheline, M. (2004, May 1). Airlines gave F.B.I. millions of records on travelers after 9/11. New York Times.

Sweeney, L. (2003). k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570.

Verykios, V.S., Elmagarmid, A., Bertino, E., Saygin, Y., & Dasseni, E. (2004). Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 434-447.

KEY TERMS

Data Perturbation: Modifying the data so that original confidential data values cannot be recovered.

Distributed Data Mining: Performing the data-mining task on data sources distributed in different sites.

K-Anonymity: A privacy metric, which ensures an individual's information cannot be distinguished from at least k-1 other people when a data source is disclosed.

Privacy Against Data Mining: Preserving the privacy of individuals against data-mining tools when disclosed data contain private information that could be extracted by data-mining tools.

Privacy Preserving Data Mining: Performing the data-mining task on private data sources (centralized or distributed).

Secure Multiparty Computation: Computing the result of an operation (e.g., sum, min, max) on private data (e.g., finding the richest person among a group of people without revealing the wealth of the individuals).

Statistical Database: Type of database system that is designed to support statistical operations while preventing operations that could lead to the association of individuals with confidential data.
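The secure multiparty sum named in the key terms can be approximated with a simple additive secret-sharing construction. The sketch below is a standard textbook scheme, not code from this article, and it elides the communication layer (all "parties" live in one process); the salary figures are invented:

```python
import random

def secure_sum(private_values, modulus=10**9, seed=7):
    """Each party splits its private value into random additive shares
    (mod a large modulus) and hands one share to every party; each party
    publishes only the sum of the shares it received. Summing those
    published values reveals the global total and nothing else."""
    rnd = random.Random(seed)
    n = len(private_values)
    received = [0] * n                          # shares accumulated per party
    for value in private_values:
        shares = [rnd.randrange(modulus) for _ in range(n - 1)]
        last = (value - sum(shares)) % modulus  # shares sum to value mod modulus
        for party, s in enumerate(shares + [last]):
            received[party] = (received[party] + s) % modulus
    return sum(received) % modulus

salaries = [62000, 71000, 58000]                # each party's confidential input
print(secure_sum(salaries))  # 191000: the total, with no individual exposed
```

Any single share (or share-sum) is uniformly random, which is why no individual value leaks; only the final total is meaningful.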
Privacy Protection in Association Rule Mining

Shamik Sural
Indian Institute of Technology, Kharagpur, India
globally valid results since a rule that is valid in one or more of the individual locations need not be valid over the entire data set.

Efforts have been made to develop methods that perform local operations at each site to produce intermediate results, which can then be used to obtain the final result in a secure manner. For example, it can be easily shown that if a rule has support > m% globally, it must have support > m% on at least one of the individual sites. This result can be applied to the distributed case with horizontally partitioned data (all sites have the same schema but each site has information on different entities). A distributed algorithm for this would work by requesting each site to send all rules with support at least m. For each rule returned, the sites are then asked to send the count of their items that support the rule, and the total count of all items at the site. Using these values, the global support of each rule can be computed with the assurance that all rules with support at least m have been found.

This method provides a certain level of information security since the basic data is not shared. However, the problem becomes more difficult if we want to protect not only the individual items at each site, but also how much each site supports a given rule. The above method reveals this information, which may be considered to be a breach of security depending on the sensitivity of any given application. Theoretical studies in the field of secure computation started in the late 1980s. In recent years, the focus has shifted more to the application field (Maurer, 2003). The challenge is to apply the available theoretical results in solving intricate real-world problems. Du & Atallah (2001) review and suggest a number of open secure computation problems, including applications in the field of computational geometry, statistical analysis and data mining. They also suggest a method for solving secure multi-party computational geometry problems and secure computation of dot products in separate pieces of work (Atallah & Du, 2001; Ioannidis et al., 2002).

In all the above-mentioned work, the secure computation problem has been treated with an approach to providing absolute zero knowledge, whereas the corporations may not always be willing to bear the cost of zero information leakage as long as they can keep the information shared within known bounds. In the next section, we discuss some of the important approaches to privacy preserving data mining with an emphasis on the algorithms developed for association rule mining.

MAIN THRUST

In this section, we describe the various levels of privacy protection possible while mining data and the corresponding algorithms to achieve the same. Identification of privacy concerns in data mining has led to a wide range of proposals in the past few years. The solutions can be broadly categorized as those belonging to the classes of data obfuscation, data summarization and data separation. The goal of data obfuscation is to hide the data to be protected. This is achieved by perturbing the data before delivering it to the data miner by either randomly modifying the data, swapping the values between the records or performing controlled modification of data to hide the secrets. Cryptographic techniques are often employed to encrypt the source data, perform intermediate operations on the encrypted data and then decrypt the values to get back the final result, with each site not knowing anything but the global rule. Summarization, on the other hand, attempts to make available innocuous summaries of the data, and therefore only the needed facts are exposed. Data separation ensures that only the trusted parties can see the data by making all operations and analysis to be performed either by the owner/creator of the data or by trusted third parties.

One application of the data perturbation technique is decision tree based classification to protect individual privacy by adding random values from a normal/Gaussian distribution of mean 0 to the actual data values (Agrawal & Srikant, 2000). Bayes' rule for density functions is then used to reconstruct the distribution. The approach is quite elegant since it provides a method for approximating the original data distribution, and not the original data values, by using the distorted data and information on the random data distribution. Similar data perturbation techniques can be applied to the mining of Boolean association rules also (Rizvi & Haritsa, 2002). It is assumed that the tuples in the database are fixed-length sequences of 0s and 1s. A typical example is the market basket application, where the columns represent the items sold by a supermarket, and each row describes, through a sequence of 1s and 0s, the purchases made by a particular customer (1 indicates a purchase and 0 indicates no purchase). One interesting feature of this work is a flexible definition of privacy; for example, the ability to correctly guess a value of 1 from the perturbed data can be considered a greater threat to privacy than correctly learning a 0.

For many applications such as market basket, it is reasonable to expect that the customers would want more privacy for the items they buy compared to the items they do not. There are primarily two ways of handling this requirement. In one method, the data is changed or perturbed to a certain extent to hide the exact information that can be extracted from the original data. In another approach, data is encrypted before running the data mining algorithms on it. While data perturbation techniques usually result in a transformation that leads to loss of information and the exact result cannot be determined,
cryptographic protocols try to achieve zero-knowledge transfer using lossless transformation. Cryptographic techniques have been developed for the ID3 classification algorithm with two parties having horizontally partitioned data (Lindell & Pinkas, 2000). While the approach is interesting, it is not very efficient in mining rules from very large databases. Also, completely secure multi-party computation may not actually be required in practical applications.

Although there may be many different data mining techniques, they often perform similar computations at various stages. For example, counting the number of items in a subset of the data shows up in both association rule mining and learning decision trees. There exist four popular methods for privacy-preserving computations that can be used to support different forms of data mining. These are: secure sum, set union, set intersection size and scalar product. Though not all methods are truly secure (in some, information other than the results is revealed), they do have provable bounds on the information released. In addition, they are efficient, as the communication and computation cost is not significantly increased through addition of the privacy-preserving component.

At this stage, it is important to understand that not all the association rules have equal need for protection. One has to analyze the sensitivity of the various association rules mined from a database (Saygin et al., 2002). From the large amount of data made available to the public through various sources, a malicious user may be able to extract association rules that were meant to be protected. Reducing the support or the confidence can hide the sensitive rules. Evfimievski et al. (2002) investigate the problem of applying randomization techniques in association rule mining. One of their most important contributions is the application of privacy preserving techniques in rule mining from categorical data. They also provide a formal definition for privacy breaches. It is quite challenging to extend this work by combining randomization techniques with cryptographic techniques to make the scheme more robust.

Kantarcioglu & Clifton (2002) propose a cryptographic technique for mining association rules in a horizontally partitioned database. It assumes that the transactions are distributed among n sites. The global support count of an itemset is the sum of all the local support counts. An itemset A is globally supported if the global support count of A is greater than s% of the total transaction database size. The global confidence of a rule A => B can be given as support(A ∪ B)/support(A), and a k-itemset is called a globally large k-itemset if it is globally supported. Quasi-commutative hash functions are used for secure computation of set unions that determine globally frequent candidate itemsets from locally frequent candidate itemsets. The globally frequent k-itemsets are determined from the candidate itemsets with a secure protocol.

Since the goal is to determine if the support exceeds a threshold rather than to learn the exact support of a rule, the secure sum computation is slightly altered and, instead of sending the computed values to each other, the sites perform secure comparison among themselves. If the goal were to have a totally secure method, the union step would have to be eliminated. However, using the secure union method gives higher efficiency with provably controlled disclosure of some minor information (e.g., the number of duplicate items and the candidate sets). The validity of even this disclosed information can be reduced by noise addition, as each site can add some fake large itemsets to its actual locally large itemsets. In the pruning phase, the fake items can be eliminated.

In contrast to the above method, computing association rules in vertically partitioned data is even more challenging (Vaidya & Clifton, 2002). Here the items are partitioned and each itemset is split between sites. Most steps of the traditional Apriori algorithm can be done locally at each of the sites (Agrawal & Srikant, 1994). The crucial step involves finding the support count of an itemset. If the support count of an itemset is securely computed, it can be checked if the support is greater than the threshold to determine whether the itemset is frequent. Consider the entire transaction database to be a Boolean matrix where 1 represents the presence of an item (column) in a transaction (row), while 0 correspondingly represents an absence. Then the support count of an itemset is the scalar product of the vectors representing the sub-itemsets with both parties. An algorithm to compute the scalar product securely is sufficient for secure computation of the support count.

Most of the above protocols assume a semi-honest model, where the parties involved will honestly follow the protocol but can later try to infer additional information from whatever data they receive through the protocol. One result of this is that parties are not allowed to give spurious input to the protocol. If a party is allowed to give spurious input, they can probe to determine the value of a specific item at other parties. For example, if a party gives the input (0, …, 0, 1, 0, …, 0), the result of the scalar product (1 or 0) tells the malicious party if the other party has the transaction corresponding to the 1. Attacks of this type can be termed probing attacks. All of the protocols currently suggested in the literature are susceptible to such probing attacks. Better techniques, which work even in the malicious model, are needed to guard against this.

FUTURE DIRECTIONS

The ongoing work in the field has suggested several interesting directions for future research. Existing
approaches should be extended to cover new classes of privacy breaches and to ascertain the theoretical limits on discoverability for a given level of privacy (Agrawal & Aggarwal, 2001). Potential research needs to concentrate on combining the randomization and cryptographic protocols to get the strengths of both without the weaknesses of either. Privacy estimation formulas used in data perturbation techniques can be refined so as to include the effects of using the mining output to re-interrogate the distorted database. Extending association rule mining of vertically partitioned data to multiple parties is an important research topic in itself, especially considering collusion between the parties.

Much of the current work on vertically partitioned data is limited to Boolean association rule mining. Categorical attributes and quantitative association rule mining are significantly more complex problems. Most importantly, a wide range of data mining algorithms exists, such as classification, clustering and sequence detection, and the effect of privacy constraints on these algorithms is also an interesting area of future research (Verykios et al., 2004). The final goal of researchers is to develop methods enabling any data mining operation that can be done at a single site to be done across various sources, while respecting the privacy policies of the sources.

CONCLUSION

Several algorithms have been proposed to address the conflicting goals of supporting privacy and accuracy while mining association rules on large databases. Once this field of research is matured, a toolkit of privacy-preserving distributed computation techniques needs to be built that can be assembled to solve specific real-world problems (Clifton et al., 2003). Current techniques address the problem of performing one secure computation, and using that result to perform the next computation reveals intermediate information that may not be part of the final results. Controlled disclosure is guaranteed by evaluating whether the real results together with the extra information violate privacy constraints. This approach, however, becomes more difficult with iterative techniques, as intermediate results from several iterations may reveal a lot of information. Proving that this does not violate privacy is a difficult problem.

REFERENCES

Agrawal, D., & Aggarwal, C.C. (2001). On the design and quantification of privacy preserving data mining algorithms. In Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 247-255). Santa Barbara, California.

Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In ACM SIGMOD Conference on Management of Data (pp. 439-450). Dallas, TX, USA.

Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2002). A data mining architecture for distributed environments. Innovative Internet Computing Systems, 27-38.

Atallah, M.J., & Du, W. (2001). Secure multi-party computational geometry. In Seventh International Workshop on Algorithms and Data Structures (pp. 165-179). Providence, Rhode Island, USA.

Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., & Zhu, M. (2003). Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations, 4(2), 28-34.

Du, W., & Atallah, M.J. (2001). Secure multi-party computation problems and their applications: A review and open problems. In New Security Paradigms Workshop (pp. 11-20). Cloudcroft, New Mexico, USA.

Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 217-228). Edmonton, Canada.

Han, J., & Kamber, M. (2003). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann Publishers.

Ioannidis, I., Grama, A., & Atallah, M. (2002). A secure protocol for computing dot-products in clustered and distributed environments. In International Conference on Parallel Processing (pp. 279-285).

Kantarcioglu, M., & Clifton, C. (2002). Privacy-preserving distributed mining of association rules on horizontally partitioned data. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery.

Lindell, Y., & Pinkas, B. (2000). Privacy preserving data mining. Advances in Cryptology (pp. 36-54).

Maurer, U. (2003). Secure multi-party computation made simple. In Third Conference on Security in Communication Networks (SCN'02) (pp. 14-28). Lecture Notes in Computer Science (Vol. 2576). Berlin: Springer-Verlag.

Rizvi, S.J., & Haritsa, J.R. (2002). Maintaining data privacy in association rule mining. In Twenty-Eighth International Conference on Very Large Data Bases (pp. 682-689).
Saygin, Y., Verykios, V.S., & Elmagarmid, A.K. (2002). Privacy preserving association rule mining. In Twelfth International Workshop on Research Issues in Data Engineering (pp. 151-158).

Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 639-644). Edmonton, Alberta, Canada.

Verykios, V.S., Bertino, E., Fovino, I.N., Provenza, L.P., Saygin, Y., & Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, 33(1), 50-57.

KEY TERMS

Association Rule: A relation between the occurrences of a set of items and another set of items in a large data set.

Cryptographic Data Mining Techniques: Methods that encrypt individual data before running data mining algorithms, so that the final result is also available in an encrypted form.

Distributed Data Mining: Mining information from a very large set of data spread across multiple locations without transferring the data to a central location.

Horizontally Partitioned Data: A distributed architecture in which all the sites share the same database schema but have information about different entities. The union of all the rows across all the sites forms the complete database.

Itemset: A set of one or more items that are purchased together in a single transaction.

Privacy Preserving Data Mining: Algorithms and methods that are used to mine rules from a distributed set of data in which the sites reveal as little detailed information as possible.

Secure Multi-Party Computation: Computation of an overall result based on the data of a number of users, in which each individual user learns only the final result at the end of the computation and nothing else.

Vertically Partitioned Data: A distributed architecture in which the different sites store different attributes of the data. The union of all these attributes or columns together forms the complete database.
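The two partitioning schemes defined in the key terms can be illustrated with a small sketch; the toy database, site split, and attribute names below are hypothetical:

```python
# Illustrative sketch of horizontally vs. vertically partitioned data.
# The tiny "database" and the site split are made-up examples.

# Full database: rows are entities, columns are attributes.
full_db = [
    {"id": 1, "age": 34, "purchase": "milk"},
    {"id": 2, "age": 51, "purchase": "bread"},
    {"id": 3, "age": 28, "purchase": "beer"},
]

# Horizontal partitioning: every site has the SAME schema, DIFFERENT entities.
site_a = full_db[:2]               # rows for entities 1 and 2
site_b = full_db[2:]               # row for entity 3
assert site_a + site_b == full_db  # union of rows rebuilds the database

# Vertical partitioning: sites store DIFFERENT attributes of the SAME
# entities, joined on a common key (id).
site_x = [{"id": r["id"], "age": r["age"]} for r in full_db]
site_y = [{"id": r["id"], "purchase": r["purchase"]} for r in full_db]
rejoined = [{**x, **y} for x, y in zip(site_x, site_y)]  # join on id
assert rejoined == full_db         # union of columns rebuilds the database
```

In a privacy-preserving setting, the point is that no site ever holds `full_db`; the mining protocol must work from the fragments without materializing the union.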
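As a concrete toy instance of secure multi-party computation, the classic ring-based secure-sum idea lets sites compute a total while each site sees only a masked running value. This is an illustrative sketch under simplifying assumptions (honest participants, no collusion), not a hardened protocol:

```python
import random

def secure_sum(site_values, modulus=1_000_003):
    """Toy ring-based secure sum: the initiating site masks its input with a
    random offset r; each subsequent site adds its own value modulo M; the
    initiator finally removes r. No site on the ring sees another site's
    individual value, only masked running totals. (Illustrative only.)"""
    r = random.randrange(modulus)
    running = (site_values[0] + r) % modulus   # initiator masks its input
    for v in site_values[1:]:                  # each site adds its value
        running = (running + v) % modulus
    return (running - r) % modulus             # initiator removes the mask

assert secure_sum([10, 20, 30]) == 60
```

Real secure multi-party protocols for association rule mining build more elaborate primitives (secure set union, secure scalar product) on top of ideas like this.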
Profit Mining
Senqiang Zhou
Simon Fraser University, Canada
Ke Wang
Simon Fraser University, Canada
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
decision-making. These works, however, did not propose concrete solutions to the actionability problem. Recently, there were several works applying association rules to address business-related problems. (Brijs, Swinnen, Vanhoof & Wets, 1999; Wong, Fu & Wang, 2003; Wang & Su, 2002) studied the problem of selecting a given number of items for stocking. The goal is to maximize the profit generated by the selected items or customers. These works present one important step beyond association rule mining, i.e., addressing the issue of converting a set of individual rules into a single actionable model for recommending actions in a given scenario.

There were several attempts to generalize association rules to capture more semantics, e.g., (Lin, Yao & Louie, 2002; Yao, Hamilton & Butz, 2004; Chan, Yang & Shen, 2003). Instead of a uniform weight associated with each occurrence of an item, these works associate a general weight with an item and mine all itemsets that pass some threshold on the aggregated weight of items in an itemset. Like association rule mining, these works did not address the issue of converting a set of rules or itemsets into a model for recommending actions.

Collaborative filtering (Resnick & Varian, 1997) makes recommendations by aggregating the opinions (such as ratings of movies) of several advisors who share the taste of the customer. Built on this technology, many large commerce web sites help their customers find products. For example, Amazon.com uses Book Matcher to recommend books to customers; Moviefinder.com recommends movies to customers using the We Predict recommender system. For more examples, please refer to (Schafer, Konstan & Riedl, 1999). The goal is to maximize the hit rate of recommendation. For items of varied profit, maximizing profit is quite different from maximizing hit rate. Also, collaborative filtering relies on carefully selected item endorsements for similarity computation, and on a good set of advisors to offer opinions. Such data are not easy to obtain. The ability to recommend prices, in addition to items, is another major difference between profit mining and other recommender systems.

Another application where data mining is heavily used for business targets is direct marketing. See (Ling & Li, 1998; Masand & Shapiro, 1996; Wang, Zhou, Yeung & Yang, 2003), for example. The problem is to identify buyers using data collected from previous campaigns, where the product to be promoted is usually fixed and the best guess is about who is likely to buy. Profit mining, on the other hand, is to guess the best item and price for a given customer. Interestingly, these two problems are closely related to each other. We can model the direct marketing problem as a profit mining problem by including customer demographic data as part of her transactions and including a special target item NULL representing no recommendation. Now, each recommendation of a non-NULL item (and price) corresponds to identifying a buyer of the item. This modeling is more general than traditional direct marketing in that it can identify buyers for more than one type of item and promotion strategies.

Profit Mining

We solve profit mining by extracting patterns from a set of past transactions. A transaction consists of a collection of sales of the form (item, price). A simple price can be substituted by a promotion strategy, such as buy one get one free or X quantity for Y dollars, that provides sufficient information for deriving the price. The transactions were collected over some period of time, and there could be several prices even for the same item if sales occurred at different times. Given a collection of transactions, we find recommendation rules of the form {s1, …, sk} → <I, P>, where I is a target item, P is a price of I, and each si is a pair of a non-target item and a price. An example is (Perfume, price=$20) → (Lipstick, price=$10). This recommendation rule can be used to recommend Lipstick at the price of $10 to a customer who bought Perfume at the price of $20. If the recommendation leads to a sale of Lipstick of quantity Q, it generates (10 - C) * Q profit, where C is the cost of Lipstick.

Several practical considerations would make recommendation rules more useful. First, items on the left-hand side in si can be item categories instead, to capture category-related patterns. Second, a customer may have paid a higher price if a lower price was not available at the shopping time. We can incorporate the domain knowledge that paying a higher price implies the willingness to pay a lower price (for exactly the same item) to search for stronger rules at lower prices. This can be done through multi-level association mining (Srikant and Agrawal, 1995; Han and Fu, 1995), by modeling a lower price as a more general category than a higher price. For example, the sale {<chicken, $3.8>} in a transaction would match any of the following more general sales in a rule: <chicken, $3.8>, <chicken, $3.5>, <chicken, $3.0>, chicken, meat, food. Note that the last three sales are generalized by climbing up the category hierarchy and dropping the price.

A key issue is how to make a set of individual rules work as a single recommender. Our approach is to rank rules by the recommendation profit. The recommendation profit of a rule r is defined as the average profit of the target item in r among all transactions that match r. Note that the rank by average profit implicitly takes into account both confidence and profit, because a high
average profit implies that both confidence and profit are high. Given a new customer, we pick the highest-ranked matching rule to make a recommendation.

Before making a recommendation, however, over-fitting rules that work only for observed transactions, but not for new customers, should be pruned, because our goal is to maximize profit on new customers. The idea is as follows. Instead of ranking rules by observed profit, we rank rules by projected profit, which is based on the estimated error of a rule, adapted from a technique for pruning classifiers (Quinlan, 1993). Intuitively, the estimated error will increase for a rule that matches a small number of transactions. Therefore, over-fitting rules tend to have a larger estimated error, which translates into a lower projected profit and a lower rank.

For a detailed exposition and experiments on real-life and synthetic data sets, the reader is referred to (Wang, Zhou & Han, 2002).

FUTURE TRENDS

The profit mining proposed here is only the first, but important, step in addressing the ultimate goal of data mining. To make profit mining more practical, several issues need further study. First, it is quite likely that the recommended item tends to be an item that the customer would buy independently of the recommendation. Obviously, such items need not be recommended, and recommendation should focus on those items that the customer may buy if informed, but may not otherwise. Recommending such items likely brings in additional profit. Second, the current model maximizes only the profit of a one-shot selling effect; therefore, a sale in a large quantity is favored. In reality, a customer may regularly shop at the same store over a period of time, in which case a sale in a large quantity will affect the shopping frequency of the customer, and thus the profit. In this case, the goal is maximizing the profit from recurring customers over a period of time. Another interesting direction is to incorporate feedback on whether a certain recommendation is rejected or accepted, to improve future recommendations.

The current work has focused on the information captured in past transactions. As pointed out in the Introduction, other factors such as competitors' offers, recommendations by friends or customers, consumer fashion, psychological issues, convenience, etc. can affect the customer's decision. Addressing these issues requires additional knowledge, such as competitors' offers, and computers may not be the most suitable tool. One solution could be suggesting several best recommendations to the domain expert, the store manager or salesperson in this case, who makes the final recommendation to the customer after factoring in the other considerations.

CONCLUSION

Profit mining is a promising data mining approach because it addresses the ultimate goal of data mining. In this article, we study profit mining in the context of the retailing business, but the principles and techniques illustrated should be applicable to other applications. For example, items can be general actions, and prices can be a notion of utility resulting from actions. In addition, items can be used to model customer demographic information such as Gender, in which case the price component is unused.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993, May). Mining association rules between sets of items in large databases. ACM Special Interest Group on Management of Data (SIGMOD) (pp. 207-216), Washington D.C., USA.

Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association rules. International Conference on Very Large Data Bases (VLDB) (pp. 487-499), Santiago de Chile.

Brijs, T., Swinnen, G., Vanhoof, K., & Wets, G. (1999, August). Using association rules for product assortment decisions: A case study. International Conference on Knowledge Discovery and Data Mining (KDD) (pp. 254-260), San Diego, USA.

Chan, R., Yang, Q., & Shen, Y. (2003, November). Mining high utility itemsets. IEEE International Conference on Data Mining (ICDM) (pp. 19-26), Melbourne, USA.

Domingos, P. (1999, August). MetaCost: A general method for making classifiers cost-sensitive. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 155-164), San Diego, USA.

Han, J., & Fu, Y. (1995, September). Discovery of multiple-level association rules from large databases. International Conference on Very Large Data Bases (VLDB) (pp. 420-431), Zurich, Switzerland.

Kleinberg, J., Papadimitriou, C., & Raghavan, P. (1998, December). A microeconomic view of data mining. Data Mining and Knowledge Discovery Journal, 2(4), 311-324.

Lin, T. Y., Yao, Y.Y., & Louie, E. (2002, May). Value added association rules. Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference PAKDD (pp. 328-333), Taipei, Taiwan.
Ling, C., & Li, C. (1998, August). Data mining for direct marketing: Problems and solutions. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 73-79), New York, USA.

Margineantu, D. D., & Dietterich, T. G. (2000, June-July). Bootstrap methods for the cost-sensitive evaluation of classifiers. International Conference on Machine Learning (ICML) (pp. 583-590), San Francisco, USA.

Masand, B., & Shapiro, G. P. (1996, August). A comparison of approaches for maximizing business payoff of prediction models. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 195-201), Portland, USA.

Pednault, E., Abe, N., & Zadrozny, B. (2002, July). Sequential cost-sensitive decision making with reinforcement learning. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 259-268), Edmonton, Canada.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Resnick, P., & Varian, H.R. (1997). CACM special issue on recommender systems. Communications of the ACM, 40(3), 56-58.

Schafer, J. B., Konstan, J. A., & Riedl, J. (1999, November). Recommender systems in e-commerce. ACM Conference on Electronic Commerce (pp. 158-166), Denver, USA.

Silberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6), 970-974.

Srikant, R., & Agrawal, R. (1995, September). Mining generalized association rules. International Conference on Very Large Data Bases (VLDB) (pp. 407-419), Zurich, Switzerland.

Wang, K., & Su, M. Y. (2002, July). Item selection by hub-authority profit ranking. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 652-657), Edmonton, Canada.

Wang, K., Zhou, S., & Han, J. (2002, March). Profit mining: From patterns to actions. International Conference on Extending Database Technology (EDBT) (pp. 70-87), Prague, Czech Republic.

Wang, K., Zhou, S., Yeung, J. M. S., & Yang, Q. (2003, March). Mining customer value: From association rules to direct marketing. International Conference on Data Engineering (ICDE) (pp. 738-740), Bangalore, India.

Wong, R. C. W., Fu, A. W. C., & Wang, K. (2003, November). MPIS: Maximal-profit item selection with cross-selling considerations. IEEE International Conference on Data Mining (ICDM) (pp. 371-378), Melbourne, USA.

Yao, H., Hamilton, H. J., & Butz, C. J. (2004, April). A foundational approach for mining itemset utilities from databases. SIAM International Conference on Data Mining (SIAMDM) (pp. 482-486), Florida, USA.

Zadrozny, B., & Elkan, C. (2001, August). Learning and making decisions when costs and probabilities are both unknown. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 204-213), San Francisco, USA.

KEY TERMS

Association Rule: An association rule has the form I1 → I2, where I1 and I2 are two itemsets. The support of an association rule is the support of the itemset I1 ∪ I2, and the confidence of the rule is the ratio of the support of I1 ∪ I2 to the support of I1.

Classification: Given a set of training examples in which each example is labeled by a class, build a model, called a classifier, to predict the class label of new examples that follow the same class distribution as the training examples. A classifier is accurate if the predicted class label is the same as the actual class label.

Cost Sensitive Classification: The error of a misclassification depends on the type of the misclassification. For example, the error of misclassifying Class 1 as Class 2 may not be the same as the error of misclassifying Class 1 as Class 3.

Frequent Itemset: The support of an itemset refers to the percentage of transactions that contain all the items in the itemset. A frequent itemset is an itemset with support above a pre-specified threshold.

Over-fitting Rule: A rule that has high performance (e.g., high classification accuracy) on observed transactions but performs poorly on future transactions. Such rules should be excluded from decision-making systems (e.g., recommenders). In many cases, over-fitting rules are generated due to noise in the data set.

Profit Mining: In a general sense, profit mining refers to data mining aimed at maximizing a given objective function over decision making for a targeted population (Wang, Zhou & Han, 2002). Finding a set of rules that pass a given threshold on some interestingness measure (such as association rule mining or its variation) is not profit
mining because of the lack of a specific objective function to be maximized. Classification is a special case of profit mining where the objective function is accuracy and the targeted population consists of future cases. This article examines a specific problem of profit mining, i.e., building a model for recommending target products and prices with the objective of maximizing net profit.

Transaction: A transaction is some set of items chosen from a fixed alphabet.
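The multi-level matching used in the article's chicken example, where an observed sale at a higher price matches rule-side sales of the same item at lower prices, and bare items or category ancestors with the price dropped, can be sketched as follows; the category hierarchy is a made-up example:

```python
# Sketch of multi-level generalization of sales: paying a higher price
# implies willingness to pay a lower price for the same item, and an item
# also matches its category ancestors with the price dropped.
# The category tree below is a hypothetical illustration.

parent = {"chicken": "meat", "meat": "food"}   # child -> parent category

def matches(observed, rule_sale):
    """True if an observed sale (item, price) matches a rule-side sale,
    which is either an (item, price) pair or a bare item/category name."""
    item, price = observed
    if isinstance(rule_sale, tuple):           # same item, same or lower price
        r_item, r_price = rule_sale
        return r_item == item and r_price <= price
    anc = item                                 # climb the category hierarchy
    while anc is not None:
        if rule_sale == anc:
            return True
        anc = parent.get(anc)
    return False

sale = ("chicken", 3.8)
assert matches(sale, ("chicken", 3.5))         # willingness to pay less
assert matches(sale, "meat")                   # ancestor, price dropped
assert not matches(sale, ("chicken", 4.0))     # a higher price never matches
```

This mirrors the article's example: the sale <chicken, $3.8> matches <chicken, $3.8>, <chicken, $3.5>, <chicken, $3.0>, and the price-free generalizations chicken, meat, and food.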
INTRODUCTION

Graphical models such as Bayesian networks (BNs) (Pearl, 1988) and decomposable Markov networks (DMNs) (Xiang, Wong & Cercone, 1997) have been applied widely to probabilistic reasoning in intelligent systems. Figure 1 illustrates a BN and a DMN on a trivial uncertain domain: A virus can damage computer files, and so can a power glitch. A power glitch also causes a VCR to reset. The BN in (a) has four nodes, corresponding to four binary variables taking values from {true, false}. The graph structure encodes a set of dependence and independence assumptions (e.g., that f is directly dependent on v and p, but is independent of r once the value of p is known). Each node is associated with a conditional probability distribution conditioned on its parent nodes (e.g., P(f | v, p)). The joint probability distribution is the product P(v, p, f, r) = P(f | v, p) P(r | p) P(v) P(p). The DMN in (b) has two groups of nodes that are maximally pair-wise connected, called cliques. Each clique is associated with a probability distribution (e.g., clique {v, p, f} is assigned P(v, p, f)). The joint probability distribution is P(v, p, f, r) = P(v, p, f) P(r, p) / P(p), where P(p) can be derived from one of the clique distributions. The networks, for instance, can be used to reason about whether there are viruses in the computer system, after observations on f and r are made.

Construction of such networks by elicitation from domain experts can be very time-consuming. Automatic discovery (Neapolitan, 2004) by exhaustively testing all possible network structures is intractable. Hence, heuristic search must be used. This article examines a class of graphical models that cannot be discovered using the common heuristics.

Figure 1. (a) A trivial example BN; (b) a corresponding DMN

BACKGROUND

Let V be a set of n discrete variables x1, …, xn (in what follows, we will focus on finite, discrete variables). Each variable xi has a finite space Si = {xi,1, xi,2, …, xi,Di} of cardinality Di. When there is no confusion, we write xi,j as xij for simplicity. The space of a set V of variables is defined by the Cartesian product of the spaces of all variables in V, that is, SV = S1 × … × Sn (or ∏i Si). Thus, SV contains the tuples made of all possible combinations of values of the variables in V. Each tuple is called a configuration of V, denoted by v = (x1, …, xn).

Let P(xi) denote the probability function over xi and P(xij) denote the probability value P(xi = xij). A probabilistic domain model (PDM) over V defines the probability values of every configuration for every subset A ⊆ V. Let P(V) or P(x1, …, xn) denote the joint probability distribution (JPD) function over x1, …, xn and P(x1j1, …, xnjn) denote the probability value of a configuration (x1j1, …, xnjn). We refer to the function P(A) over A ⊆ V as the marginal distribution over A and P(xi) as the marginal distribution of xi. We refer to P(x1j1, …, xnjn) as a joint parameter and P(xij) as a marginal parameter of the corresponding PDM over V.

For any three disjoint subsets of variables W, U and Z in V, subsets W and U are called conditionally independent given Z, if

P(W | U, Z) = P(W | Z)

for all possible values in W, U and Z such that P(U, Z) > 0. Conditional independence signifies the dependence mediated by Z. This allows the dependence among W ∪ U ∪ Z to be modeled over the subsets W ∪ Z and U ∪ Z separately. Conditional independence is the key property explored through graphical models.

Subsets W and U are said to be marginally independent (sometimes referred to as unconditionally independent) if

P(W | U) = P(W)

for all possible values of W and U such that P(U) > 0. When two subsets of variables are marginally independent, there is no dependence between them. Hence, each subset