
A brief survey of the Data Mining field

Alfonso Hernández Medrano


Escuela Superior de Ingeniería Mecánica y Eléctrica, Sección de Estudios de Postgrado e Investigación, Unidad Profesional Adolfo López Mateos, Edificio 5 tercer piso; Colonia Lindavista; México, 07738, D. F. Tel. (55) 5729-6000 Exts. 54663 y 54621, ahdezm@prodigy.net.mx, ahm10832@hotmail.com

Abstract. Data Mining is considered a process of finding previously unknown patterns and trends in a database in order to create predictive models of future outcomes; it can also be considered a confluence of techniques provided by statistics and mathematics. Given the diversity of their characteristics, these techniques can be classified into two major categories: supervised learning and unsupervised learning. Nowadays, the Internet has changed the way we live, the way we find, communicate and use information, and the way people do commerce, marketing and advertising. For this reason, the purpose of this paper is to give a brief survey of some techniques, some applications and some trends in the field of data mining.

1 Introduction
Throughout his life, the human being stores in his mental warehouse an immense amount of data about the objects, situations and events he experiences, as well as the relations or links among them; thus, when he confronts or tries to predict situations similar to those he has lived before, he explores his mind, searching, gathering and selecting information¹ from his mental store in order to understand, modify or explain the facts in question. Because Data Mining was created to handle and control very large sets of data, it must be supported by models that make it possible to store, retrieve, understand and use the data, to obtain the necessary knowledge about a situation, and to update the information and relations that the data present. Data mining models must therefore interoperate across different application platforms, regardless of the system the user has built. To facilitate tasks with large streams of data, Data Mining offers users tables to work with, although the data are often found in transactional form; an assembling task is therefore needed to obtain those tables, and it must take into account the timing and special conditions required by large databases [10]. For this reason, in the next sections I will try to explain and briefly explore the state of the art in data mining, beginning in section 2 with some techniques people normally use and following with some applications in section 3; in section 4, I present some trends detected in the technical literature; finally, in section 5, I present some considerations about data mining in relation to knowledge management.

¹ Information is more useful to him because, besides the data, it concerns the relations among data as well as their importance and significance.

2 State-of-the-art of some Data Mining techniques

Specialists in the data mining field use several techniques, which fall into two main groups. Supervised Learning is concerned with the value of attributes; this group involves neural networks, decision trees, classification (in which records are assigned to predefined categories), genetic algorithms and statistical techniques such as regression. Non-supervised Learning is concerned with finding trends or patterns in datasets without previous knowledge of the patterns sought; this group relies mainly on association rules and clustering techniques.

2.1 Artificial Neural Networks


With this technique the user may solve problems that are computationally difficult and are related to association, evaluation or pattern recognition. The technique relies on models that are not specified in advance but learn through training, and whose framework resembles a biological neural network: a large number of simple, interconnected processing elements, called neurons, that can learn and adapt to noisy environments, generating their own rules through parallel processing in order to find a response to an input pattern. The best known types of neural networks are the Perceptron, Associative Memories (known as Hopfield networks) and Self-Organized Maps (known as SOM or Kohonen Maps). To obtain better performance, Artificial Neural Networks (ANN) can be combined with either qualitative or quantitative techniques, such as Fuzzy Logic, Genetic Algorithms or Statistics. It should be noted that, to perform well, ANN require parallel processing and pattern recognition capabilities; in the data mining field, there are studies that apply the concept of taxonomy to pattern recognition mining. In the knowledge management field, evaluating patterns in the data streams selected by the Data Mining process makes it possible to identify knowledge, which is the core of the knowledge discovery process.
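As an illustration of how a network of this kind learns through training, the following minimal sketch trains a single Perceptron with the classical perceptron learning rule. The function names, the toy AND data set and the parameter values are assumptions made only for this example; they do not come from the surveyed literature.

```python
def predict(weights, bias, x):
    """Threshold neuron: fire (1) when the weighted sum exceeds zero."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule over (input_vector, target) pairs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            # Weights change only when the neuron misclassifies the sample.
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Illustrative, linearly separable problem: the logical AND of two inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
predictions = [predict(weights, bias, x) for x, _ in AND_SAMPLES]
```

A single Perceptron can only separate classes with a straight line (or hyperplane), which is why the example uses a linearly separable problem.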

2.1.1 Self-Organized Maps


As I mentioned in the previous section, Self-Organized Maps are a kind of artificial neural network able to discover by itself correlations or traces in datasets in order to find, in a non-supervised way, patterns or characteristics hidden in them. One of their applications is text mining, in which documents that present similarities are grouped and mapped to the same or neighboring map units, and each unit holds links to the document database. Text mining commonly uses the WEBSOM method, which is based on Self-Organized Maps. Working with the Encyclopaedia Britannica, in the article entitled Mining massive document collections by the WEBSOM method, Krista Lagus, Samuel Kaski and Teuvo Kohonen present diverse arguments about creating document Self-Organized Maps of very large text document collections [11].
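The mapping of similar documents to the same or neighboring units can be sketched with a very small one-dimensional Self-Organized Map. This is not the WEBSOM implementation; the toy document vectors, the function names and every parameter value below are assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the map unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))


def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D map; learning rate and neighborhood radius decay linearly."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1 - epoch / epochs
        lr, radius = lr0 * frac, radius0 * frac
        for x in data:
            bmu = best_matching_unit(weights, x)
            for i in range(n_units):
                # Units close to the winner on the map are pulled more strongly.
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                weights[i] = [w + lr * h * (v - w)
                              for w, v in zip(weights[i], x)]
    return weights


# Hypothetical documents described by three normalized term frequencies.
DOCS = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.0), (0.0, 0.1, 0.9), (0.1, 0.0, 0.8)]
som = train_som(DOCS)
units = [best_matching_unit(som, d) for d in DOCS]  # map unit of each document
```

After training, `units` holds the map position of each document, so documents with similar term profiles can be retrieved by looking at the same or adjacent positions.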

2.1.2 Decision Trees

This model relies on tree-shaped structures that represent graphically the predetermined rules used to decide which output class or value must be assigned to a specific element. It includes classification trees and regression trees, in which branches representing noise or outliers need to be identified and removed. Its most representative examples are the ID3 and C4.5 algorithms, which have been analyzed and developed by several authors in order to provide algorithms that may support mining in sequence databases.
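ID3 chooses the attribute to test at each node by information gain, that is, the reduction in class entropy obtained by splitting on that attribute. The sketch below computes both quantities; the data set, attribute names and function names are hypothetical and serve only to illustrate the criterion.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after the split."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Illustrative records: "outlook" determines the class, "windy" does not.
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
gain_outlook = information_gain(ROWS, "outlook", "play")
gain_windy = information_gain(ROWS, "windy", "play")
```

On this data a tree builder following the ID3 criterion would split on `outlook` first, because it has the larger gain.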

2.2 Outlier Detection


In general, the datasets to be analyzed contain subsets of data with variability or mistakes; sometimes it is useful to study them in order to detect abnormalities, while on other occasions the best course is to discard them because they could distort or contaminate the results. Lior Rokach, Lihi Naamani and Armin Shmilovici present, in their article Pessimistic cost-sensitive active learning of decision trees for profit maximizing targeting campaigns, a framework for selecting which potential customers should be approached with a new product in order to maximize net profit. In such a method, it is possible to identify different types of errors, such as variability in observations, shortcomings of the applied technique, planning deficiencies and execution mistakes [17].

2.3 Classification Models


Classification models represent knowledge as IF-THEN rules and are commonly used to rank the likelihood of outcomes or to estimate the probabilities of uncertain outcomes, called classes. They have many applications in classifying characteristics or predicting missing or unknown numeric values, generating one rule for each path from the root of a tree to each of its leaves. Logistic Regression, on the other hand, is a widely used method for classification. In a work called Predicting going concern opinion with data mining, Martens, D., et al. present a rule-based classification model, built by means of a technique called AntMiner+, to help auditors screen potential clients or to serve as a decision aid for identifying severely distressed clients that might require further consideration; they also evaluate the performance of different data mining techniques, including logistic regression and rule-based classification [13].
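The idea of generating one IF-THEN rule per path from the root to a leaf can be sketched as a short recursive traversal. The nested-dictionary tree format, the attribute names and the function names are assumptions of this example and are unrelated to AntMiner+.

```python
def extract_rules(node, conditions=()):
    """Yield one (conditions, class_label) pair per root-to-leaf path."""
    if not isinstance(node, dict):  # a leaf stores the class label
        yield list(conditions), node
        return
    for value, child in node["branches"].items():
        test = (node["attribute"], value)
        yield from extract_rules(child, conditions + (test,))


def format_rule(conditions, label):
    """Render a rule in IF ... THEN form."""
    body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
    return f"IF {body} THEN class = {label}"


# Hypothetical decision tree for a lending decision.
TREE = {
    "attribute": "income",
    "branches": {
        "high": "approve",
        "low": {
            "attribute": "debt",
            "branches": {"high": "reject", "low": "approve"},
        },
    },
}
RULES = [format_rule(c, label) for c, label in extract_rules(TREE)]
```

Each string in `RULES` can be read independently of the tree, which is what makes rule-based classifiers easy for a domain expert to inspect.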

2.4 Clustering
This technique partitions datasets into groups so as to obtain maximum similarity among elements of the same group and minimum similarity between elements of different groups. It is a non-supervised process that iteratively merges components sharing many variations and gives an idea of the distribution of the data. Clustering quality relies on the similarity measure and on the way the applied method implements it; similarity is determined as a function of distance, which depends on the data type. In other words, this technique groups vectors by distance, according to the similarity of their characteristics. Based on the Classification Algorithm by Preferential Clustered Link (CPCL), in the article entitled Mining Textual Data through Term Variant Clustering: the TermWatch system, Fidelia Ibekwe-SanJuan and Eric SanJuan describe an experiment on unsupervised text mining, performed on a corpus of scientific titles and abstracts from 16 information retrieval journals, and present an overview of the TermWatch system, describing its processing stages: term extraction, identification of terminological variation and, finally, clustering [9].
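The role of distance as the similarity function can be illustrated with k-means, a common distance-based clustering method. Note that k-means is used here only as a familiar example; it is not the CPCL algorithm of TermWatch, and the points, initial centroids and function name are assumptions.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Assign each point to its nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids


# Two visually separated groups of 2-D points (illustrative data).
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(POINTS, initial_centroids=[(1, 1), (8, 8)])
```

Replacing `math.dist` with another distance function is the natural way to adapt the sketch to other data types, which is the dependence on data type mentioned above.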

2.5 Pattern Recognition


Sequential pattern mining is a useful method for discovering customers' purchasing patterns over time from a transactional database. In the article called A novel knowledge discovering model for mining fuzzy multi-level sequential patterns in sequence databases, Yen-Liang Chen and Tony Cheng-Kui Huang propose a fuzzy multi-level sequential mining algorithm, called FMSM (developed as an extension of the GSP algorithm²), as well as the CROSS-FMSM algorithm, to define and discover multi-level sequential patterns. As a slight modification of the FMSM algorithm, which employs various minimum support thresholds for different levels, the authors define the CROSS-FMSM algorithm, which uses only one minimum support threshold for all levels [25]. On the other hand, in the article entitled Mining conjunctive sequential patterns, Chedy Raïssi, Toon Calders and Pascal Poncelet present an approach to mining non-derivable sequential patterns and show its use in mining association rules for sequences; they mention that those patterns could have high potential for real-life applications (such as network monitoring and biomedical fields) and make it possible to obtain sequential association rules with all the classical statistical metrics [16].

² Generalized Sequential Pattern, an algorithm proposed by Srikant and Agrawal to discover sequential patterns, where items can be located across all levels of a hierarchy.
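The notion of a minimum support threshold used by these algorithms rests on counting how many customer sequences contain a candidate pattern. The sketch below shows only that counting step for crisp (non-fuzzy) data; it is not FMSM, CROSS-FMSM or GSP, and the purchase database and function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence as an ordered subsequence.

    Both arguments are lists of item sets; each pattern element must be a
    subset of a distinct, later-or-equal positioned sequence element.
    """
    pos = 0
    for element in pattern:
        # Greedily match each pattern element at the earliest position.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def sequence_support(database, pattern):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


# Hypothetical purchase histories: one list of baskets per customer.
DATABASE = [
    [{"bread"}, {"milk", "butter"}, {"jam"}],
    [{"bread", "milk"}, {"jam"}],
    [{"milk"}, {"bread"}, {"butter"}],
    [{"bread"}, {"jam"}],
]
support_bread_then_jam = sequence_support(DATABASE, [{"bread"}, {"jam"}])
```

A pattern is called frequent when its support reaches the chosen minimum support threshold; for instance, with a threshold of 0.5 the pattern "bread, later jam" would qualify on this toy database.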

2.6 Genetic Algorithms


Similar to human reasoning, in order to handle large streams or packages of data, Data Mining bases its tasks on models that show the relations among data, leaving to the user the task of interpreting them and obtaining the knowledge needed to control present events or predict future situations. However, although most work in Data Mining has focused on machine learning in order to achieve better induction algorithms, it is important not to overlook which algorithms are best for a specific situation. In this context, genetic algorithms are optimization techniques that imitate natural selection, mutation and genetic recombination. In genetics itself, changes in the expression levels of genes across different samples can provide information that allows the network of regulatory relations among those genes to be reverse engineered. Bayesian Networks, which help to model gene regulatory networks, have been used to represent and learn such networks from microarray data, and have been applied in forecasting, manufacturing control, diagnosis and other activities to infer causal relationships among random variables and to generate causal network structures. Association Rule Mining, on the other hand, which provides the correlation between genes and the direction of these relationships, has normally been used to study consumer purchasing patterns in retail stores or in systems such as customer relationship management, network communications or image processing. Seeking to build a rough causal network from a small amount of data for a relatively large number of variables, in an article called Large-scale regulatory network analysis from microarray data: modified Bayesian network learning and association rule mining, Zan Huang et al. present two scalable algorithms for learning large-scale gene regulatory networks: the Modified Bayesian Network learning algorithm (MBN) and the Modified Association Rule mining algorithm (MAR) [8].
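The selection, recombination and mutation steps that characterize genetic algorithms can be sketched on a toy optimization problem (maximizing the number of 1 bits in a string). The problem, the operators chosen (tournament selection, one-point crossover, elitism) and all parameter values are assumptions made for illustration only.

```python
import random


def fitness(bits):
    """Toy objective: the number of 1 bits in the individual."""
    return sum(bits)


def genetic_algorithm(n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        children = [best[:]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            # Selection: each parent is the fittest of three random candidates.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            # Recombination: one-point crossover.
            cut = rng.randint(1, n_bits - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = genetic_algorithm()
```

Because of elitism, the values in `history` never decrease; how quickly they approach the optimum depends on the random seed and on the parameter values.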

2.7 Association Rules


This non-supervised technique provides a framework that allows the user to predict future behavior patterns from the simultaneous occurrence of variable values; an association between two attributes exists when particular values of each attribute occur together with relatively high frequency. Association rules thus try to find associations or connections between objects. In the article called Mining fuzzy periodic association rules, Wan-Jui Lee, Jung-Yi Jiang and Shie-Jue Lee present fuzzy algorithms for mining fuzzy periodicities in temporal databases and the fuzzy periodic association rules within them³; for this purpose, they first assign fixed values to the support vector threshold and the fuzzy weighting factor, and different values to the match ratio threshold. The authors propose crisp, cyclic and fuzzy components in order to define fuzzy calendar schemas and fuzzy periodic calendars that show fuzzy periodic behaviors [21].

³ As they mention at the beginning of that article, periodic behaviors can be found in activities such as planning, banking, or event logs of computer networks, which may be useful for understanding the behavior of clients accessing the web.
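The "relatively high frequency" of joint occurrence is usually quantified with the support and confidence of a rule. The following sketch computes both for crisp transactions; it does not cover the fuzzy or periodic extensions discussed above, and the baskets and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Illustrative market baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(BASKETS, {"bread", "milk"})
rule_confidence = confidence(BASKETS, {"bread"}, {"milk"})
```

A rule such as "bread implies milk" is normally reported only when both values exceed user-chosen minimum thresholds.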

2.8 Outliers Definition


Outliers are understood as points that are highly unlikely to occur; they can be considered anomalies in a dataset caused by malicious content, faulty collection or bad data. The definition of an outlier used here takes a point's distance to its nearest neighbor as a measure of unusualness, for which log-linear performance as a function of the number of data points can be obtained on many real low-dimensional datasets. In the article Fast mining of distance-based outliers in high-dimensional datasets, Amol Ghoting, Srinivasan Parthasarathy and Matthew Eric Otey present Recursive Binning and Re-Projection, a two-phase algorithm for fast mining of distance-based outliers in high-dimensional datasets that scales log-linearly with the number of data points and linearly with the number of dimensions. It first preprocesses the dataset into bins, so that points close to each other in space are likely to be placed in the same bin; after that, it uses an extension of the nested loop (NL) algorithm that operates over bins: if the nearest neighbors of a data point are relatively close, the point is considered normal; otherwise it is considered an outlier [6].
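The distance-based definition can be made concrete with the plain nested-loop approach, which scores every point by the distance to its k-th nearest neighbor. This is the naive quadratic baseline, not the binning-based algorithm of Ghoting et al.; the points and function names are assumptions of the example.

```python
import math


def knn_distance(points, i, k=1):
    """Distance from points[i] to its k-th nearest neighbor."""
    distances = sorted(math.dist(points[i], p)
                       for j, p in enumerate(points) if j != i)
    return distances[k - 1]


def top_outliers(points, k=1, n=1):
    """Indices of the n points with the largest k-NN distance."""
    scores = [(knn_distance(points, i, k), i) for i in range(len(points))]
    return [i for _, i in sorted(scores, reverse=True)[:n]]


# Four points forming a tight square plus one distant point (toy data).
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = top_outliers(POINTS, k=1, n=1)
```

Every point is compared with every other point, so the cost grows quadratically with the number of points; reducing that cost is the motivation for the binning phase described above.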

3 Some applications

It is almost impossible to mention, let alone list, every application of Data Mining at the present time, but it is at least possible to mention its main or general applications, such as marketing analysis, decision support systems, fraud detection, business management, games, terrorism, internet development, human resources, genetics, text mining, clustering, categorization, extraction-transformation-loading of information, pattern recognition, and more.

3.1 Enterprise Management


In their work called Towards Comprehensive Support for Organizational Mining, Minseok Song and Wil M.P. van der Aalst present three basic types of process mining (discovery, conformance and extension) used to monitor and improve real organizational processes, focusing on three different perspectives [19]: the process perspective, which focuses on the order of activities; the organizational perspective, which considers which performers are involved and how they are related, distinguishing three types of mining (organizational model mining, social network analysis, and information flows between organizational entities); and the case perspective, which focuses on the properties of elements. As they state, their work focuses not only on social networks of originators, but also on mining organizational models and analyzing relationships between organizational entities. On the other hand, using a decision tree, Lior Rokach, Lihi Naamani and Armin Shmilovici, in the article Pessimistic cost-sensitive active learning of decision trees for profit maximizing targeting campaigns, present a Pessimistic Active Learning algorithm that extends ad hoc techniques considered to be computationally tractable heuristics, such as Model-based Interval Estimation (MBIE), which builds a model in order to construct an exploration policy, and randomized strategies, which are used to trade off exploration and exploitation in practice [17].

3.2 Insurance and Banking Negotiations


Because experts often find it difficult to articulate the heuristics or rules of thumb they use to solve a problem efficiently, acquiring their expertise is usually a difficult and challenging task. In their work called Incorporating domain knowledge into data mining classifiers: an application in indirect lending, Atish P. Sinha and Huimin Zhao present a comparison among several data mining classification methods (naive Bayes, logistic regression, decision tree, decision table, neural network, k-nearest neighbor and support vector machine), with and without incorporating domain knowledge, focusing on the domain of indirect bank lending for automobiles. The authors show that domain expertise captured in the form of a partial knowledge base can significantly improve the performance of a wide variety of classifiers on relatively small data sets, and that incorporating domain knowledge affects different classifiers to different degrees [18].

3.3 Internet users' behavior


The Internet has changed the way we live, find and use information, communicate and do commerce and marketing, and the way people create and disseminate advertising. For this reason, in order to adapt web sites to users' requirements, Data Mining can help web site owners and developers learn about the profiles of their visitors and about the way data are submitted during web transactions. Fayyad and Uthurusamy consider that a model of the data can be a model of the entire data set and can be predictive; it can be used, say, to anticipate future customer behavior (such as the likelihood that a customer is or is not happy, based on historical data of interaction with a particular company) [5].

3.4 Games configurations


In order to establish or determine the best game strategies, pattern recognition is useful for producing appropriate databases and game procedures or problems to solve. In this context, in order to predict video game sales, and relying on data from the video game industry and a commercial data mining package, Michael Brydon and Andrew Gemino, in the article called Classification trees and decision-analytic feed forward control: a case study from the video game industry, present a methodology for transforming data mining output into a complete decision-analytic model and assess the feasibility of applying the methodology to a large, complex, real-world decision problem [2].

3.5 Engineering applications


Data Mining has been widely used to monitor the physical and operating conditions of the various elements of electric power systems by means of data obtained through information technologies; clustering techniques such as Self-Organized Maps help to detect abnormal conditions or to estimate the source of these abnormalities.

3.6 National Security


Even though developed countries lead research on national security, after the terrorist attacks carried out in large cities like New York, London and Madrid, leaders had to recognize that information sharing needs to be re-examined and adapted for such purposes. For this reason, in the article called Cyber infrastructure for homeland security: Advances in information sharing, data mining, and collaboration systems, T. S. Raghu and Hsinchun Chen analyze the state of the art of information sharing, data mining and collaboration systems in the context of homeland security [15].

4 Data mining trends found

As I mentioned before, Fayyad and Uthurusamy also consider that, with the aggressive growth of disk storage, our ability to capture and store data has far outpaced our ability to process and use it [5]. In the last part of the 20th century and the beginning of the 21st, Data Mining has strongly affected knowledge management and business practices; Business Intelligence has emerged as one of the most popular applications of data mining techniques, which means that all participants in the KDD field must observe technical standards in order to obtain effective, mutually compatible and adaptable solutions. For that reason, Fayyad and Uthurusamy also mention that transparency and data fusion represent two major challenges for the growth of technology development and of the data mining market, because the problem of building and maintaining useful data warehouses remains one of the great obstacles to successful data mining.

Among the main applications of data mining are those related to artificial intelligence, stochastic models, decision trees and clustering, which are used in different disciplines, such as business (business intelligence), marketing applied to customers' preferences, insurance operations, human resources, internet behaviour, terrorism analysis, scientific analysis, games analysis, engineering (including genetic engineering), informatics (artificial neural networks and genetic algorithms), and more. Jeffrey Hsu, in his article entitled Data Mining Trends and Developments: The Key Data Mining Technologies and Applications for the 21st Century, foresaw some trends in the data mining field, such as distributed data mining, hypertext/hypermedia mining, ubiquitous data mining, and multimedia, spatial, and time series/sequential data mining [7]. Nowadays, when the Internet has changed the acquisition of information and knowledge and multidimensional data processing is available, social organizations could take advantage of it to develop, instead of traditional marketing, strategies that extend on-line purchase models in real time. In this context, several authors have studied and proposed different alternatives for discovering fuzzy sequential patterns as extensions of the Apriori and GSP algorithms. At present, another sector of the informatics community is moving away from pure cost-sensitive learning or information retrieval tasks, and many developers are focusing on utility-based data mining (UBDM), which is concerned with maximizing the utility of the data mining process when there are competing costs and benefits, with the utility of methods and applications that address the costs of acquiring data, the costs of learning from the data, and the costs and benefits of using the learned knowledge, and with the detection of rare events of high utility value; it has also been used in some works related to Experimental Design and Game Theory [22, 23, 24].

5 Conclusions
After the survey carried out in this paper, it is possible to say that, at present, data mining is a powerful tool that helps organizations find relevant information hidden in their Data Warehouses or Data Marts, in order to predict future behaviours or trends and to allow managers to make decisions based on real knowledge of their situation, adding value to their strategic decisions.

Acknowledgements
The author of this work is grateful to CONACYT for the scholarship granted under number 16841 and file 215363; likewise, the author thanks SIP-IPN for the scholarship granted through project 20082201.

References
1. www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm, visited in August 2008.
2. Brydon, M., and Gemino, A., 2008, Classification trees and decision-analytic feed forward control: a case study from the video game industry, Faculty of Business Administration, Simon Fraser University, Burnaby, BC, Canada, www.springerlink.com, visited in August 2008.
3. Costas, G., Data Mining, www.scribd.com/doc/2202958/, visited in September 2008.
4. es.wikipedia.org/wiki/Data_Mining, visited in September 2008.
5. Fayyad, U., and Uthurusamy, R., 2002, Evolving Data Mining into Solutions for Insights: Our ability to capture and store data far outpaces our ability to process and exploit it, Communications of the ACM, August 2002, Vol. 45, No. 8, ACM 0002-0782/02/0800.
6. Ghoting, A., Parthasarathy, S., and Otey, M. E., 2008, Fast mining of distance-based outliers in high-dimensional datasets, www.springerlink.com, visited in August 2008.
7. Hsu, J., Data Mining Trends and Developments: The Key Data Mining Technologies and Applications for the 21st Century, Information Systems, Fairleigh Dickinson University, Madison, USA, www.sciencedirect.com, visited in August 2008.
8. Huang, Z., et al., 2006, Large-scale regulatory network analysis from microarray data: modified Bayesian network learning and association rule mining, Decision Support Systems, www.sciencedirect.com, visited in August 2008.
9. Ibekwe-SanJuan, F., and SanJuan, E., Mining Textual Data through Term Variant Clustering: the TermWatch system, www.springerlink.com, visited in August 2008.
10. Kügel, A., and Ohlebusch, E., 2008, A space efficient solution to the frequent string mining problem for many databases, Faculty of Engineering and Computer Sciences, University of Ulm, Germany, www.springerlink.com, visited in August 2008.
11. Lagus, K., Kaski, S., and Kohonen, T., 2004, Mining massive document collections by the WEBSOM method, Helsinki University of Technology, Neural Networks Research Centre, www.sciencedirect.com, visited in August 2008.
12. Markel, G., and Ruz, C., What is Data Mining?, www.ieee.org, visited in September 2008.
13. Martens, D., et al., 2008, Predicting going concern opinion with data mining, Decision Support Systems, doi:10.1016/j.dss.2008.01.003, www.sciencedirect.com, visited in August 2008.
14. www.monografias.com/trabajos/datamining/datamining.shtml, visited in September 2008.
15. Raghu, T. S., and Chen, H., 2006, Cyber infrastructure for homeland security: Advances in information sharing, data mining, and collaboration systems, www.sciencedirect.com, visited in August 2008.
16. Raïssi, Ch., Calders, T., and Poncelet, P., 2008, Mining conjunctive sequential patterns, www.springerlink.com, visited in August 2008.
17. Rokach, L., Naamani, L., and Shmilovici, A., 2008, Pessimistic cost-sensitive active learning of decision trees for profit maximizing targeting campaigns, www.springerlink.com, visited in August 2008.
18. Sinha, A. P., and Zhao, H., 2008, Incorporating domain knowledge into data mining classifiers: An application in indirect lending, Sheldon B. Lubar School of Business, University of Wisconsin-Milwaukee, Decision Support Systems, www.sciencedirect.com, visited in August 2008.
19. Song, M., and van der Aalst, W. M. P., 2008, Towards Comprehensive Support for Organizational Mining, www.sciencedirect.com, visited in August 2008.
20. Sun, J., et al., 2008, Two heads better than one: pattern discovery in time-evolving multi-aspect data, www.springerlink.com, visited in August 2008.
21. Lee, W.-J., Jiang, J.-Y., and Lee, S.-J., 2008, Mining fuzzy periodic association rules, www.sciencedirect.com, visited in August 2008.
22. Weiss, G. M., and Tian, Y., 2008, Maximizing classifier utility when there are data acquisition and modeling costs, www.springerlink.com, visited in August 2008.
23. Weiss, G. M., Zadrozny, B., and Saar-Tsechansky, M., 2008, Special issue on utility-based data mining, www.springerlink.com, visited in August 2008.
24. Yao, H., and Hamilton, H. J., 2005, Mining item set utilities from transaction databases, www.sciencedirect.com, visited in August 2008.
25. Chen, Y.-L., and Huang, T. C.-K., 2008, A novel knowledge discovering model for mining fuzzy multi-level sequential patterns in sequence databases, Data & Knowledge Engineering, http://www.elsevier.com/locate/datak, visited in August 2008.
