Abstract. Data mining is considered the process of finding previously unknown patterns and trends in a database in order to build predictive models that forecast future outcomes; it can also be considered a confluence of techniques drawn from statistics and mathematics. According to their characteristics, these techniques can be classified into two major categories: supervised learning and unsupervised learning. The Internet has changed the way people live, the way they find, communicate and use information, and the way they conduct commerce, marketing and advertising. For this reason, the purpose of this paper is to give a brief survey of some techniques, some applications and some trends in the field of data mining.
1 Introduction
In the ordinary course of life, a human being stores in a mental warehouse immense amounts of data about the objects, situations and events he experiences, as well as the relations and links among them; thus, when he confronts or tries to predict situations similar to those he has lived before, he explores his mind, searching, gathering and selecting information¹ from his mental store to understand, modify or explain the facts in question. Because data mining was created to handle very large data sets, this activity must be supported by models for storing, retrieving, understanding and using data, for obtaining the necessary knowledge about a given situation, and for updating the information and relations that the data may present. Data mining models must therefore interoperate across different platforms, regardless of the system the user has built. To ease work with large streams of data, data mining offers users tables to work with, although the data are often found in transactional form. An assembly task is therefore needed to build those tables, and it must take into account the timing and special conditions required by large databases [10]. For this reason, in the next sections I will try to explain and briefly explore the state of the art in data mining, beginning in section 2 with some commonly used techniques and continuing in section 3 with some applications; in section 4, I present some trends detected in the technical literature; finally, in section 5, I offer some considerations about data mining as it relates to knowledge management.
¹ Information is more useful to him than raw data because, besides the data themselves, it covers the relations among them as well as their importance and significance.
Specialists in the data mining field use several techniques, which fall into two main groups. Supervised learning is concerned with the values of attributes; this group involves neural networks, decision trees, classification (in which records are assigned to predefined categories), genetic algorithms, and statistical techniques such as regression. Unsupervised learning is concerned with finding trends or patterns in datasets without prior knowledge of the patterns sought; this group relies mainly on association rules and clustering techniques.
present similarities are grouped and mapped to the same or neighboring map units and, at each unit, there are links to the document database. In text mining it is common to use the WEBSOM method, which is based on Self-Organizing Maps. Working with the Encyclopaedia Britannica, in the article entitled Mining massive document collections by the WEBSOM method, Krista Lagus, Samuel Kaski and Teuvo Kohonen present several arguments for creating Self-Organizing Maps of very large text document collections [3].
2.1.2 Decision Trees
This model relies on tree-shaped structures that represent, in graphic form, predetermined rules for deciding which output class or value a specific element must be assigned to. It includes classification trees and regression trees, in which branches that represent noise or outliers need to be identified and removed. Its most representative examples are the ID3 algorithm and the C4.5 algorithm, which have been analyzed and developed by several authors in order to provide algorithms that support mining in sequence databases.
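As an illustration of how an ID3-style tree chooses its splitting rule, the following minimal sketch computes the information gain of each attribute and picks the best one. The toy records, the attribute names (outlook, windy) and the helper names are assumptions made for this example only; they do not come from any of the surveyed works, and a full implementation would also recurse on each partition and prune noisy branches.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the records on `attribute`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain (the ID3 split criterion)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy records: the class depends only on "outlook".
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

root = best_attribute(rows, labels, ["outlook", "windy"])
print(root)
```

In this toy set, splitting on outlook separates the two classes completely, while splitting on windy leaves each partition as mixed as the original data, so outlook would become the root of the tree.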
evaluating performance of different data mining techniques, including logistic regression and rule-based classification [14].
2.4 Clustering
This technique partitions a dataset into groups so as to maximize the similarity among elements in the same group and minimize the similarity between elements of different groups. It is an unsupervised process that iteratively merges components sharing many variations, and it gives an idea of the distribution of the data. Clustering quality depends on the similarity measure and on how the applied method implements it; similarity is determined as a function of distance, which in turn depends on the data type. In other words, this technique groups vectors by distance, according to the similarity of their characteristics. Based on the Classification Algorithm by Preferential Clustered Link (CPCL), in the article entitled Mining Textual Data through Term Variant Clustering: the TermWatch system, Fidelia Ibekwe-Sanjuan and Eric Sanjuan describe an experiment on unsupervised text mining performed on a corpus of scientific titles and abstracts from 16 information retrieval journals, and present an overview of the TermWatch system, describing its processing stages: term extraction, followed by terminological variation identification and, finally, clustering [5].
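The idea of iteratively merging the closest groups according to a distance function can be sketched as follows. This is a generic single-link agglomerative procedure over numeric vectors with Euclidean distance, not the CPCL algorithm used by TermWatch; the sample points, the threshold value and the function name are assumptions chosen for illustration.

```python
import math


def single_link_clusters(points, threshold):
    """Merge the two closest clusters repeatedly until the closest pair
    is farther apart than `threshold` (single-link distance)."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the two nearest members.
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical two-dimensional vectors.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = single_link_clusters(points, threshold=2.0)
print(clusters)
```

With a different data type (for example, terms or documents), only the distance function would need to change; the merging loop stays the same, which is why the choice of similarity measure weighs so heavily on clustering quality.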
Generalized Sequential Pattern (GSP), proposed by Srikant and Agrawal to discover sequential patterns in which items may be located at any level of a hierarchy.
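The basic primitive behind sequential pattern mining, counting how many sequences in a database contain a candidate pattern, can be sketched as below. This is only the support-counting step under simplifying assumptions: it ignores the item hierarchy and the time constraints handled by GSP, and the purchase sequences and function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if the itemsets of `pattern` occur, in order, as subsets of
    itemsets of `sequence` (not necessarily contiguous)."""
    position = 0
    for itemset in sequence:
        if position < len(pattern) and pattern[position] <= itemset:
            position += 1
    return position == len(pattern)


def support(database, pattern):
    """Fraction of sequences in `database` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


# Hypothetical customer purchase sequences; each step is a set of items.
database = [
    [{"bread"}, {"milk", "eggs"}, {"butter"}],
    [{"bread", "milk"}, {"butter"}],
    [{"milk"}, {"bread"}],
]
pattern = [{"bread"}, {"butter"}]

print(support(database, pattern))
```

A pattern is usually called frequent when its support reaches a user-chosen minimum; level-wise algorithms then extend only the frequent patterns to generate longer candidates.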
Some applications
It is almost impossible to mention, still less to list, every application of data mining at the present time; it is possible, however, to mention its main or general applications, such as marketing analysis, decision support systems, fraud detection, business management, games, terrorism analysis, internet development, human resources, genetics, text mining, clustering, categorization, extraction, transformation and loading of information, pattern recognition, and more.
which is extended from ad hoc techniques, considered to be computationally tractable heuristics, such as Model-based Interval Estimation (MBIE), which builds a model to construct an exploration policy, and randomized strategies, used in practice to trade off exploration and exploitation [17].
comparison among several data mining classification methods, such as naive Bayes, logistic regression, decision tree, decision table, neural network, k-nearest neighbor, and support vector machine (with and without incorporating domain knowledge), focusing on the domain of indirect bank lending for automobiles. In this work, the authors show that domain expertise captured in the form of a partial knowledge base can significantly improve the performance of a wide variety of classifiers on relatively small data sets, and that incorporating domain knowledge affects different classifiers to different degrees [18].
As I mentioned before, Fayyad and Ramasamy also consider that, with the aggressive growth of disk storage, our ability to capture and store data has far outpaced our ability to process and use it. In the last part of the 20th century and the beginning of the 21st, data mining has had a strong impact on knowledge management and business practices; thus, Business Intelligence has emerged as one of the most popular applications of data mining techniques, which means that all participants in the KDD field must observe and respect technical standards in order to obtain effective, mutually compatible and adaptable solutions. For that reason, Fayyad and Ramasamy also mention that transparency and data fusion represent two major challenges for the growth of technology development and of the data mining market, because the problem of building and maintaining useful data warehouses remains one of the great obstacles to successful data mining.
Among the main applications of data mining are those related to artificial intelligence, stochastic models, decision trees and clustering, which are used in different disciplines, such as business (business intelligence), marketing applied to customer preferences, insurance operations, human resources, internet behaviour, terrorism analysis, scientific analysis, games analysis, engineering (including genetic engineering), informatics (artificial neural networks and genetic algorithms), and more. Jeffrey Hsu, in his article entitled Data Mining Trends and Developments: The Key Data Mining Technologies and Applications for the 21st Century, foresaw some trends in the data mining field, such as distributed data mining, hypertext/hypermedia mining, ubiquitous data mining, and multimedia, spatial, and time series/sequential data mining [7]. Now that the internet has changed how information and knowledge are acquired and multidimensional data processing is available, organizations can take advantage of it to develop, instead of traditional marketing, strategies that extend on-line purchase models in real time. In this context, several authors have studied and proposed different alternatives for discovering fuzzy sequential patterns, as extensions of the Apriori and GSP algorithms. At present, another sector of the informatics community is moving away from addressing pure cost-sensitive learning or information retrieval tasks, and many developers are focusing on utility-based data mining (UBDM), which is concerned with maximizing the utility of the data mining process when there are competing costs and benefits. UBDM covers the utility of methods and applications that address the costs associated with acquiring data, the costs of learning from the data, and the costs and benefits of using the learned knowledge, as well as the detection of rare events of high utility value; it has also been used in some works related to Experimental Design and Game Theory [22, 23, 24].
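The trade-off between the cost of acquiring data and the benefit of using the learned knowledge can be illustrated with a small sketch. The learning curve, the cost and benefit figures, and the function name are purely hypothetical assumptions for this example; they are not taken from the cited works, which treat the problem in far more detail.

```python
def total_utility(n_examples, accuracy, cost_per_example,
                  value_per_correct, n_decisions):
    """Benefit of applying the classifier minus the cost of acquiring
    the training examples (learning cost is ignored in this sketch)."""
    benefit = accuracy * value_per_correct * n_decisions
    return benefit - n_examples * cost_per_example


# Hypothetical learning curve: training-set size -> expected accuracy.
learning_curve = {100: 0.70, 1000: 0.85, 10000: 0.90}

# Choose the training-set size that maximizes total utility.
best_size = max(
    learning_curve,
    key=lambda n: total_utility(n, learning_curve[n],
                                cost_per_example=0.5,
                                value_per_correct=1.0,
                                n_decisions=5000),
)
print(best_size)
```

Under these assumed figures the largest training set gives the most accurate classifier but not the highest utility, because the extra accuracy does not pay for the extra data; this is the kind of balance that utility-based approaches try to optimize.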
Conclusions
After the survey carried out in this paper, it is possible to say that, at present, data mining is a powerful tool that helps organizations find relevant information hidden in their data warehouses or data marts, in order to predict future behaviours or trends, to allow managers to make decisions with real knowledge of their situation, and to add value to their strategic decisions.
Acknowledgements
The author thanks CONACYT for the scholarship granted under number 16841 and file 215363; likewise, the author thanks SIP-IPN for the scholarship granted through project 20082201.
References
1. www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm, visited in August 2008.
2. Brydon, M., and Gemino, A., 2008, Classification trees and decision-analytic feed forward control: a case study from the video game industry, Faculty of Business Administration, Simon Fraser University, Burnaby, BC, Canada, www.springerlink.com, visited in August 2008.
3. Costas, G., Data Mining, www.scribd.com/doc/2202958/, visited in September 2008.
4. es.wikipedia.org/wiki/Data_Mining, visited in September 2008.
5. Fayyad, U., and Ramasamy, U., 2002, Evolving Data Mining into Solutions for Insights: Our ability to capture and store data far outpaces our ability to process and exploit it, Communications of the ACM, August 2002, Vol. 45, No. 8, ACM 0002-0782/02/0800.
6. Ghoting, A., Parthasarathy, S., and Otey, M. E., 2008, Fast mining of distance-based outliers in high-dimensional datasets, www.springerlink.com, visited in August 2008.
7. Hsu, J., Data Mining Trends and Developments: The Key Data Mining Technologies and Applications for the 21st Century, Information Systems, Fairleigh Dickinson University, Madison, USA, www.sciencedirect.com, visited in August 2008.
8. Huang, Z., et al., 2006, Large-scale regulatory network analysis from microarray data: modified Bayesian network learning and association rule mining, Decision Support Systems, www.sciencedirect.com, visited in August 2008.
9. Ibekwe-Sanjuan, F., and Sanjuan, E., Mining Textual Data through Term Variant Clustering: the TermWatch system, www.springerlink.com, visited in August 2008.
10. Kügel, A., and Ohlebusch, E., 2008, A space efficient solution to the frequent string mining problem for many databases, Faculty of Engineering and Computer Sciences, University of Ulm, Germany, www.springerlink.com, visited in August 2008.
11. Lagus, K., Kaski, S., and Kohonen, T., 2004, Mining massive document collections by the WEBSOM method, Helsinki University of Technology, Neural Networks Research Centre, www.sciencedirect.com, visited in August 2008.
12. Markel, G., and Ruz, C., What is Data Mining?, www.ieee.org, visited in September 2008.
13. Martens, D., et al., 2008, Predicting going concern opinion with data mining, DSS, doi:10.1016/j.dss.2008.01.003, www.sciencedirect.com, visited in August 2008.
14. www.monografias.com/trabajos/datamining/datamining.shtml, visited in September 2008.
15. Raghu, T. S., and Chen, H., 2006, Cyber infrastructure for homeland security: Advances in information sharing, data mining, and collaboration systems, www.sciencedirect.com, visited in August 2008.
16. Raïssi, Ch., Calders, T., and Poncelet, P., 2008, Mining conjunctive sequential patterns, www.springerlink.com, visited in August 2008.
17. Rokach, L., Naamani, L., and Shmilovici, A., 2008, Pessimistic cost-sensitive active learning of decision trees for profit maximizing targeting campaigns, www.springerlink.com, visited in August 2008.
18. Sinha, A. P., and Zhao, H., 2008, Incorporating domain knowledge into data mining classifiers: An application in indirect lending, Sheldon B. Lubar School of Business, University of Wisconsin-Milwaukee, Decision Support Systems, www.sciencedirect.com, visited in August 2008.
19. Song, M., and van der Aalst, W. M. P., 2008, Towards Comprehensive Support for Organizational Mining, www.sciencedirect.com, visited in August 2008.
20. Sun, J., et al., 2008, Two heads better than one: pattern discovery in time-evolving multi-aspect data, www.springerlink.com, visited in August 2008.
21. Wan-Jui, L., Jung-Yi, J., and Shi-Jue, L., 2008, Mining fuzzy periodic association rules, www.sciencedirect.com, visited in August 2008.
22. Weiss, G. M., and Tian, Y., 2008, Maximizing classifier utility when there are data acquisition and modeling costs, www.springerlink.com, visited in August 2008.
23. Weiss, G. M., Zadrozny, B., and Saar-Tsechansky, M., 2008, Special issue on utility-based data mining, www.springerlink.com, visited in August 2008.
24. Yao, H., and Hamilton, H. J., 2005, Mining item set utilities from transaction databases, www.sciencedirect.com, visited in August 2008.
25. Yen-Liang Chen and Tony Cheng-Kui Huang, 2008, A novel knowledge discovering model for mining fuzzy multi-level sequential patterns in sequence databases, Data & Knowledge Engineering, http://www.elsevier.com/locate/datak, visited in August 2008.