
Decision Support Systems xxx (2011) xxx–xxx

Contents lists available at ScienceDirect

Decision Support Systems


journal homepage: www.elsevier.com/locate/dss

Information mining: Reflections on recent advancements and the road ahead in data, text, and media mining
Article info

Available online xxxx
Keywords: Data mining; Text mining; Media mining; Information mining; Opinion and sentiment analysis

Abstract
In this introduction, we briefly summarize the state of data and text mining today. Taking a very broad view, we use the term "information mining" to refer to the organization and analysis of structured or unstructured data that can be quantitative, textual, and/or pictorial in nature. The key question, in our view, is: How can we transform data (in the very broad sense of this term) into actionable knowledge, that is, knowledge that we can use in pursuit of a specified objective(s)? After detailing a set of key components of information mining, we introduce each of the papers in this volume and detail the focus of their contributions. © 2011 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, several authors [3,11] have emphasized the explosion in data availability, especially at the micro level. The authors stressed the challenges posed both by the sheer expansion of data gathered and maintained by individual firms and by the dynamic explosion of what may be today's largest data repository, Internet content. The often fleeting presence of Internet data adds to the challenge facing investigators seeking to capture and utilize the expanding data. Despite the challenges, the authors make a convincing argument that opportunities abound as we have moved from data-constrained to data-enabled research.

Perhaps nowhere are the opportunities greater than in the arena of data mining. For years, data miners struggled with the sufficiency of observations and the lack of detail provided by many observations. We all seemed to crave more data and at a finer micro level. Today, these pleas are seldom heard. Data miners have quickly adapted to the data-rich environment and developed more and more interesting techniques and effective applications.

In some situations, the very potential effectiveness of data mining can be provocative. On-line poker sites and television broadcasts of poker tournaments have provided significant opportunity for micro data gathering and analysis of individual player habits and strategies. Astute individual observation, personal mental analysis, and individual processes to "read" an opponent have been lauded as key characteristics of great poker players. But when computerized data mining entered the picture, the poker powers got nervous. In 2009, the on-line poker site Full Tilt Poker revoked Brian Townsend's Red Pro status for one month due to Townsend's use of that wicked tool, data mining (see Ref. [7]).
Full Tilt Poker's terms and conditions include the following (#8 and #9):

(8) Full Tilt Poker prohibits the use of external player assistance programs ("EPA Programs") which are designed to provide users with an unfair advantage over their opponents. Full Tilt Poker defines "external" to mean computer software (other than the Full Tilt Poker game client), and non-software-based databases or profiles (e.g., web sites and subscription services). Full Tilt Poker defines an "unfair advantage" as a user accessing or compiling information on other players beyond that which the user has personally observed through his or her own game play.

(9) The use of artificial intelligence including, without limitation, "robots" is strictly forbidden in connection with Full Tilt Poker. All actions taken in relation to Full Tilt Poker's games and tournaments must be executed personally by players through the user interface accessible by use of the game software. You agree that Full Tilt Poker may take steps to detect and prevent the use of prohibited EPA programs. These steps may include, but are not limited to, the examination of software programs running concurrently with the Full Tilt Poker software on the player's computer. (see http://www.fulltiltpoker.com).

0167-9236/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2011.01.008

While Full Tilt's desire for fairness in poker games may stymie creative data compilation and analysis, such constraints are thankfully rare. Micro-level data gathering and mining are now mainstream activities that range across virtually all business enterprises, from groceries to NBA basketball, from restaurants to Wall Street, from the doctor's office to military operations.

Historically, data mining has focused heavily on numerical data. Quite simply, numerical data was the dominant and most easily managed source of available information. A devoted group of researchers did center attention on pattern recognition for image processing and refinement, but image uniqueness and irregularity were significant obstacles (see Ref. [5]). Today's flood in information availability seems to have been nearly balanced by the expanding conduits of computing power and analytic ability that are channeling data into knowledge. The term "data" has come to include textual material and images along with numerical content. The term "data mining" has been joined by "text mining" and "media mining".

Please cite this article as: R. Gopal, et al., Information mining: Reflections on recent advancements and the road ahead in data, text, and media mining, Decis. Support Syst. (2011), doi:10.1016/j.dss.2011.01.008

R. Gopal et al. / Decision Support Systems xxx (2011) xxx–xxx

Here, we take a very broad view, using the term "information mining" to refer to the organization and analysis of structured or unstructured data that can be quantitative, textual, and/or pictorial in nature. We do not limit consideration to any specific set of techniques or methods. The key question, in our view, is: How can we transform data (in the very broad sense of this term) into actionable knowledge, that is, knowledge that we can use in pursuit of a specified objective(s)? In addition, we include as an integral part of information mining the task of staying alert for unexpected knowledge, that is, knowledge that may not relate to a specified objective for a particular information mining instance, but which can be useful in other problem domains.

In the remainder of this paper, we set out a brief synopsis of where we are today in these fields, highlighting recent major developments. We also discuss various analytic techniques and approaches whose relevance and use cut across data domains as well as those whose impact has been specific to a single data type. A comprehensive analysis of all aspects of data, text, and media mining is beyond the scope of this effort. For the interested reader, we have tried to provide notes on and references to thorough review articles that focus on individual domains within the comprehensive moniker of information mining.

In the third section of this paper, we provide brief introductions to the eight papers presented in this special issue, a set of papers that should help the reader obtain a good sense of the breadth of the field of data mining as well as the latest directions of expansion in application areas and in emerging techniques within various domains. We follow this section with a few reflections and a short epilog.

2. So where is information mining today?
In academia, one way to tell that a field or domain has reached the first level of maturity is through the appearance of a noticeable number of survey papers summarizing research to date in the field. Perhaps the second level of maturity is demonstrated by a proliferation of survey papers directed at summarizing research in a number of specialized sub-fields within the specific domain. In information mining, surveys have appeared fairly regularly since the early 2000s, with increasing emphasis in recent years on specific subsets including clustering techniques, visualizations, wavelet applications, web mining, trend detection in text mining, direct marketing, financial data mining, temporal information mining, and fraud detection (see Refs. [3–6,8–10,13–16] and [17]).

The use of information mining techniques continues to expand into more and more application areas. Standard techniques, such as clustering, continue to expand in popularity and breadth of application. But challenges have been raised to the development and use of processes commonly referred to as "black box" techniques. For example, in the U.S., equal credit opportunity legislation and subsequent regulations [1] require transparency in various credit decisions, with the lender having a duty to clearly demonstrate the satisfaction of various requirements. Though neural networks are "well-known (to be) ... amongst the best performing techniques for credit scoring because of their universal approximation property... their complex mathematical internal workings essentially turn them into black box, opaque structures, which limit their comprehensibility and prohibit their practical use" [1] (also, see discussion in Ref. [2]).

The recent upheavals across global economies have raised black box modeling concerns to a new level. The speed and significance of the crisis raised an alarm and brought us to a realization that the complexity and opaqueness of financial information mining and modeling were at such a level that no one seemed to understand the actual risk, triggers, or impact trails of a system shock. While questions remain on the form and rigor of new regulations, there is certainly greater emphasis on clarity and comprehensibility in financial data mining and modeling processes. Thus, we note that data mining today is, by necessity, placing greater emphasis on transparency, tractability, and validation.

The landscape of information mining is also undulating rapidly as researchers strive to develop ways to take advantage of the exploding magnitude of textual information being created and shared in the blogosphere. At first, straightforward word counts, word pairings, and text clustering processes were used to develop inferences about textual content or author attitude. As research efforts began to tackle more complex analyses of the exploding textual information domain, new approaches and algorithms appeared. The area of opinion and sentiment mining has matured to the point where a thorough survey paper has now appeared [12]. Rather than using focus groups or individual customer feedback, companies can (and some say, must) now look to the blogosphere for commentary on their products, comparison with products of competitors, or even views on their corporate image. The blogosphere is international and also multilingual. Thus investigators not only have to determine appropriate methods for binary representations of text in one language; they must also be able to map among multiple languages in a consistent way to binary values.

In Fig. 1, we set forth a summary representation of the elements of information mining: data, techniques, tasks, application area, and final objective. We then use the figure as a way to organize our discussion of the current state of information mining, noting particular challenges facing us today. We first note that the amount of data of all types continues to grow exponentially. The development of new technologies and techniques now makes it fairly easy to access and collect data in increasing forms and types. Further, the sets of potentially relevant data continue to grow as the set of application areas expands.
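The early text-mining steps mentioned above (word counts and word pairings) can be made concrete with a minimal sketch. The Python below is an illustration only and is not drawn from any paper in this issue: the function names (`tokenize`, `word_counts`, `word_pair_counts`) and the two sample blog snippets are hypothetical, and the simple regular-expression tokenizer is an assumption that ignores stemming, stop words, and multilingual text.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(posts):
    """Count how often each word appears across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(tokenize(post))
    return counts


def word_pair_counts(posts):
    """Count adjacent word pairs (bigrams) within each post."""
    pairs = Counter()
    for post in posts:
        tokens = tokenize(post)
        # zip pairs each token with its immediate successor.
        pairs.update(zip(tokens, tokens[1:]))
    return pairs


# Hypothetical blog snippets, invented purely for illustration.
posts = [
    "The new phone is great and the battery is great",
    "Battery life is poor but the screen is great",
]

counts = word_counts(posts)
pairs = word_pair_counts(posts)

print(counts.most_common(3))
print(pairs[("is", "great")], pairs[("is", "poor")])
```

Counts of this kind are only raw inputs; inferring content or author attitude from them would require the clustering and sentiment techniques surveyed in Ref. [12].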
Expanding data, new technique development and/or enhancement, advancing computing power, and increasingly cheap data storage alternatives have combined to enable academics and information mining professionals to undertake more complex tasks and tackle more challenging objectives.

Fig. 1 also provides the linkage to the next two sections. We first discuss how each of the manuscripts in this special issue contributes to advancing the state of information mining. This is followed by a section devoted to a discussion of take-aways in the form of challenging future research questions related to each of the key components in Fig. 1. What new types of data are on the horizon? What application areas are emerging? What new tasks and techniques appear to hold high value?

3. Special issue manuscripts and their contributions to advancing information mining

The process of compiling a special issue certainly includes many challenges and much time. But when the contributors work as diligently and productively as ours have, the challenges melt away and time moves swiftly. In turn, we briefly consider the contributions of each of the papers.

In the first paper, "Post-Retrieval Search Hit Clustering to Improve Information Retrieval Effectiveness: Two Digital Forensics Case Studies", Beebe et al. provide a contribution in expanding innovative techniques in an emerging application area (see Fig. 1). Their work extends text mining and information retrieval research in the digital forensics arena. They demonstrate the information retrieval effectiveness in terms of precision, recall, and overhead compared against the current approach in the field.

Khansa and Liginlal's paper, "Predicting Stock Market Returns from Malicious Attacks: A Comparative Analysis of Vector Autoregression and Time-Delayed Neural Networks", examines the expected benefits of investing in information security using two complementary methods. The authors provide contributions in at least two of the elements presented in Fig. 1. To the Technique set, they add a new model approach and form. To the Application Area set, they add a novel risk assessment of malicious attacks, with potential applications to investment portfolio management and hedging.


The essential components of information mining

The inputs:
I. Data type (numerical, textual, media/multimedia)
II. Application area (marketing, software engineering, financial engineering, counter terrorism, bio-informatics, ...)
III. Techniques (SVM, k-means clustering, logit regression, decision trees, neural nets, specialized algorithms, ...)
IV. Tasks (classification, prediction, pattern matching, measuring complexity, evaluation, sentiment analysis, ...)

The output:
Final objective (enhanced profits, accurate medical diagnosis, risk reduction/mitigation, ...)

Fig. 1. The key components of information mining.

In their paper, "A Framework for Exploring Organizational Structure in Dynamic Social Networks", Qiu and Lin work in the familiar domain of social networks. Focusing on the discovery of organizational structure within dynamic social networks, the authors construct a new data structure, the Community Tree, and develop an innovative algorithm that uses this data structure in organizational structure discovery. Their efforts add new components to the Data and Technique groupings in Fig. 1.

Kim et al.'s "Collaborative User Modeling for Enhanced Content Filtering in Recommender Systems" also deals with a familiar application area and objective, developing personalized recommender systems. But they provide a new method of building a user model and incorporate a collaborative enrichment process, and thus expand Fig. 1's Technique grouping.

The controversy over black box information mining techniques reached a high point during the recent global financial market meltdowns. Martens et al.'s current paper, "Performance of Classification Models from a User Perspective", builds on the early innovative algorithm development by a subset of these authors to present a comprehensive approach to the development of classification models that satisfy specific standards for accuracy, comprehensibility, and justifiability. Of special note is their development of a novel metric to measure model justifiability. The authors also note that the ability to provide models that satisfy high comprehensibility and justifiability standards leads to the acceptability of such models in new domains. Thus this work contributes directly to both the Task (justification) and Technique segments in Fig. 1. In addition, it contributes to the Application Areas segment by enhancing the acceptability of data mining in new arenas.

Coelho et al.'s special issue contribution, "Multi-Objective Design of Hierarchical Consensus Functions for Clustering Ensembles via Genetic Programming", focuses on adding to the Technique and Task sets of Fig. 1. They present a novel GP-based approach (MCHPF) formed as a hybrid of advanced strategies of multi-objective clustering and clustering ensembles. By considering complementary validation indices, the authors' techniques enable the identification of useful partitions for the many different structures available in a given dataset. In their discussion of future work, the authors suggest the potential usefulness of their approach in document clustering, image segmentation, market segmentation, vector quantization, and topographic map formation, all emphasizing the expanding nature of Fig. 1's Application Areas.

In the paper "Improving the Ranking Quality of Medical Image Retrieval using a Genetic Feature Selection Method", da Silva and his co-authors demonstrate the innovative use of information mining techniques in analyzing medical images. This work illustrates the breadth of information mining and the potential for valuable analyses of image media. Noting the importance of finding appropriate image features for identifying pathologically similar images to support physicians in patient diagnosis, therapy, and surgery, the authors develop a novel process for image retrieval and demonstrate its effectiveness. In terms of the elements presented in Fig. 1, this work adds to Data, Application Area, and Technique.

The final paper, O'Leary's "Blog Mining Review and Extensions: From Each According to His Opinion", focuses on the rapidly expanding information content of blogs. O'Leary's work demonstrates the extraction of knowledge in various ways from blog tags and blog content. He applies the extraction process to examine the relationship between blog chatter and sales, and between blog content and public image. Emphasizing the innovative use of the emerging plethora of blog content, the paper contributes directly to Fig. 1's Data, Application Area, and Tasks segments while demonstrating the potential for blog mining in pursuing outcomes such as enhancing public image and increasing record sales.

We are indebted to this international array of researchers whose work has contributed to the many elements of information mining. The breadth of their analyses reflects the breadth of the field. The innovativeness of their approaches reflects the need for and value of creative thinking in information mining. The field is young and vibrant with many challenges ahead. We now briefly proffer our reflections on key integrating themes that cut across the many facets of information mining. Building on these points, we set out what we regard as key topics for future research.

4. Reflections

We began by underscoring the explosion in the types and amount of data that are now available. While once we were data poor, we now strive to avoid drowning in data. Our task today is to deal with data of all forms in ways that mine valuable information from data of all types. There are some unifying themes that can help organize and structure future research. Here we consider a small, and certainly nonexhaustive, set of such themes.

4.1. Seeking the message

Image mining and sentiment analysis both commonly seek the message, one from the visual content and one from the textual content. But what if we switched the objective? What if we were able to shape images or text material to communicate a specific message? Consider the potential value from applications in education and training. What form of image best gets critical information (i.e., the correct medical information) across to doctors? What images and textual material best get critical messages across to medical students? To date, media modification discussions have been limited to music, including ways to filter out noise and unwanted sounds or to turn music into images. We suggest that information mining can be used in other settings to structure images or text as a powerful tool to convey messages.

4.2. Connecting data mining and theory building in an iterative fashion

Information mining and traditional theory building have commonly been viewed as alternative rather than complementary methods. We suggest that an approach that takes the two methods as complementary and inter-related may provide significant benefits. As Martens and his co-authors emphasize, there is a growing demand that comprehensibility and justification be added to accuracy as requirements in information mining yield. Along with the requirements of comprehensibility and justification (though in different ways than that proposed by Martens et al.), a key element of theory building is verification, a combination of testing the implications of the theory and assessing the accuracy of the model. Noting these similarities between information mining and theory building, we suggest that researchers may develop an iterative approach that uses information mining outcomes as inputs into the theory construction and validation processes. The approach could involve only a few iterations or be an ongoing process. Though this technique may be of little use in certain information mining areas (e.g., short-term and quick turn-around decision making), the areas of medicine, marketing, and counter terrorism strike us as potential candidates.

4.3. Updating and longevity of findings

Our interactions with information mining professionals and researchers lead us to assert quite confidently that we all seem to agree that mining results/outcomes need to be updated (re-estimated) as new data emerges or underlying conditions shift. While we may agree in the abstract, there seems little agreement on the appropriate frequency of updating or on what timeline, circumstances, or occurrences should trigger or necessitate updating or an entirely new analysis. In today's world, there is constant updating of information, information sources, model forms, and techniques. How should these changes impact information mining updating/re-estimation decisions?

This leads us to suggest research inquiries directed at addressing these issues and seeking to identify possible general principles of model updating, where application-specific characteristics enter as explanatory variables that shift the model updating process as appropriate for the application setting. One possible avenue for this research is to begin with a variety of individual case studies. Taken together, these case studies might suggest some set of unifying general principles and situation-specific characteristics that might form foundation elements of a theory of updating.

5. Epilog

With respect to Shakespeare for adapting his works, we perhaps doth speak too much! Thus we stop with our words and urge the reader to move on to the real content of this volume, the research of our contributors. We appreciate all that they have put into their contributions and hope that you will find their work helpful and insightful. Enjoy.

References
[1] B. Baesens, C. Mues, M. De Backer, J. Vanthienen, Building intelligent credit scoring systems using decision tables, in: Camp Olivier, Joaquim B.L. Filipe, Hammoudi Slimane, Piattini Mario (Eds.), Building Intelligent Credit Scoring Systems Using Decision Tables, Springer, Netherlands, 2005.
[2] B. Baesens, R. Setiono, C. Mues, J. Vanthienen, Using neural network rule extraction and decision tables for credit-risk evaluation, Management Science 49 (3) (2003) 312–329.
[3] R. Bapna, P. Goes, R. Gopal, J.R. Marsden, Moving from data-constrained to data-enabled research: experiences and challenges in collecting, validating, and analyzing large-scale e-commerce data, Statistical Science 21 (2) (2006) 116–130.
[4] P. Berkhin, J. Kogan, C. Nicholas, M. Teboulle, A survey of clustering data mining techniques, Grouping Multidimensional Data, Springer Berlin Heidelberg, 2006, pp. 25–71.
[5] M.C.F. de Oliveira, H. Levkowitz, From visual data exploration to visual data mining: a survey, IEEE Transactions on Visualization and Computer Graphics 9 (3) (2003) 378–394.
[6] J.A. Harding, M. Shahbaz, S. Srinivas, K. Kusiak, Data mining in manufacturing: a review, Journal of Manufacturing Science and Engineering 128 (2006) 969–975.
[7] http://www.pokerlistings.com/full-tilt-suspends-brian-townsend-14503 (last accessed August, 2010).
[8] R. Kosala, H. Blockeel, Web-mining research: a survey, ACM SIGKDD Explorations Newsletter 2 (1) (2000) 1–15.
[9] S. Laxman, P.S. Sastry, A survey of temporal data mining, Sadhana 31 (2) (2006) 173–198.
[10] T. Li, Q. Li, S. Zhu, M. Ogihara, A survey on wavelet applications in data mining, ACM SIGKDD Explorations Newsletter 4 (2) (2002) 49–68.
[11] J.R. Marsden, The internet and DSS: massive, real-time data availability is changing the DSS landscape, Information Systems and e-Business Management 6 (2) (2008) 193–203.
[12] B. Pang, L. Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval 2 (1–2) (2008) 1–135.
[13] C. Phua, V. Lee, K. Smith, R. Gayler, A Comprehensive Survey of Data Mining-based Fraud Detection Research, Clayton School of Information Technology, Monash University. Available at http://clifton.phua.googlepages.com/frauddetection-survey.pdf.
[14] N.R.S. Raghavan, Data mining in e-commerce: a survey, Sadhana 30 (2 & 3) (2005) 275–289.
[15] J. Srivastava, R. Cooley, M. Deshpande, P. Tan, Web usage mining: discovery and applications of usage patterns from web data, SIGKDD Explorations 1 (2) (2000) 12–23.
[16] W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G.H. Schimm, A.J.M.M. Weijters, Workflow mining: a survey of issues and approaches, Data & Knowledge Engineering 47 (2) (2003) 237–267.


Ram D. Gopal is the GE Capital Endowed Professor of Business in the School of Business, University of Connecticut. He serves as the Ph.D. director for the Department of Operations and Information Management. His current research interests are in the areas of economics of intellectual property rights, data management and security, and online market design and performance evaluation. His research has appeared in Information Systems Research, Management Science, Operations Research, Journal of Business, Journal of Law and Economics, Journal of Management Information Systems, Decision Sciences, and other journals and conference proceedings. He is a senior editor for Information Systems Research and serves on the editorial boards of the Journal of Database Management and Information Systems Frontiers.

Ram Gopal
Department of Operations and Information Management, School of Business, University of Connecticut, United States
Corresponding author. E-mail address: ram.gopal@business.uconn.edu.

James R. Marsden¹
Department of Operations and Information Management, School of Business, University of Connecticut, United States
Katholieke Universiteit Leuven, Faculty of Business and Economics, Department of Decision Sciences and Information Management, Belgium

Jan Vanthienen
Katholieke Universiteit Leuven, Faculty of Business and Economics, Department of Decision Sciences and Information Management, Belgium

James R. Marsden is the Treibick Family Endowed Chair in e-Business and Board of Trustees Distinguished Professor at the Department of Operations and Information Management (OPIM), University of Connecticut. He has been at UConn since 1993 as Professor, serving fifteen years (1993–2008) as Head of OPIM. He helped develop both the Connecticut Information Technology Institute and the Treibick Electronic Commerce Initiative and currently serves as Executive Director of both. Jim also served for nine years as the UConn Director of edgelab, the unique ongoing research partnership between GE and UConn. Dr. Marsden has a lengthy publication record in market innovation and analysis, economics of information, artificial intelligence, and production theory. His research work has appeared or is forthcoming in Management Science; Journal of Law and Economics; American Economic Review; Journal of Economic Theory; Journal of Political Economy; IEEE Transactions on Systems, Man, and Cybernetics; Computer Integrated Manufacturing Systems; Decision Support Systems; Journal of Management Information Systems; and numerous other academic journals. He received his A.B. from the University of Illinois and his M.S. and Ph.D. from Purdue University. Also holding a J.D., Jim has been admitted to both the Kentucky and Connecticut Bar.

Jan Vanthienen is professor of information systems at Katholieke Universiteit Leuven, Department of Decision Sciences and Information Management, Information Systems Group, where he teaches courses on business intelligence, systems analysis, business information systems, and information management. His current research interests include information and knowledge management, business rules and processes, business intelligence, information systems analysis and design, and computer-based training. He is a founding member of the Leuven Institute for Research in Information Systems (LIRIS), and a member of the ACM and the IEEE Computer Society. He is chairholder of the PricewaterhouseCoopers Chair on E-Business at K.U. Leuven and co-chairholder of the Microsoft Research Chair on Intelligent Environments. He received the Belgian Francqui Chair 2009 at FUNDP and is co-founder and president-elect of the Benelux Association for Information Systems (BENAIS). In the past, he was director of the Postgraduate Program in Management (K.U. Leuven) and, as vice-chairman of the Department of Applied Economics, director of research.

¹ Professor Marsden was a Visiting Professor at the Department of Decision Sciences and Information Management, Faculty of Business and Economics, Katholieke Universiteit Leuven from November, 2009 to April, 2010.
