
Contents

1. Introduction
2. Data warehouse and OLAP technology
3. Data preprocessing
4. Mining association rules in large databases
5. Classification and prediction
6. Cluster analysis
7. Mining complex types of data
8. Trends in data mining

Ch1. Introduction

[Figure: DB, Data Mining, Information]

1.1 What Motivated Data Mining?


Data mining has attracted a great deal of attention in the information industry in recent years, mainly because of the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, and market analysis to engineering design and science exploration.

1.2 What is Data Mining?

Fig. 1. Data mining: searching for knowledge in your data.

Fig. 2. Data mining as a step in the process of knowledge discovery.

1.2 What is Data Mining?


Data cleaning: to remove noise and inconsistent data.
Data integration: where multiple data sources may be combined.
Data selection: where data relevant to the analysis task are retrieved from the database.
Data transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations.
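The four preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two record lists, the field names (`customer`, `city`, `amount`), and the rule "drop records with a missing amount" are hypothetical and do not come from the slides.

```python
# Hypothetical records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "city": "Lyon", "amount": 20.0},
    {"customer": "c2", "city": "Lyon", "amount": None},  # incomplete record
    {"customer": "c3", "city": "Nice", "amount": 35.0},
]
store_b = [
    {"customer": "c1", "city": "Lyon", "amount": 5.0},
    {"customer": "c4", "city": "Nice", "amount": 15.0},
]


def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, city):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["city"] == city]


def transform(records):
    """Data transformation: aggregate the amounts per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


# Apply the steps in the order listed on the slide.
combined = integrate(clean(store_a), clean(store_b))
prepared = transform(select(combined, "Lyon"))
print(prepared)
```

In practice the cleaning rule depends on the data; dropping incomplete records is just the simplest choice for a sketch.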

1.2 What is Data Mining?


Data mining: an essential process where intelligent methods are applied in order to extract data patterns.
Pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures.
Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.

[Figure components: graphical user interface; pattern evaluation; data mining engine; knowledge base; database or data warehouse server; data cleaning, data integration, and filtering; database; data warehouse]

Fig. 3. Architecture of a typical data mining system.

1.3 Data Mining On What Kind of Data


Relational database
Data warehouse
Transactional database
Advanced database systems and advanced database applications:
    Object-oriented database
    Object-relational database
    Spatial database
    Temporal database and time-series database
    Text database and multimedia database
    Heterogeneous database and legacy database

1.3 Data Mining On What Kind of Data


Relational database: a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
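As a small illustration of these terms, the following sketch uses Python's standard `sqlite3` module with an in-memory database. The table name `customer`, its attributes, and the two tuples are hypothetical and chosen only for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A table has a unique name and a set of attributes (columns).
# cust_id is the unique key that identifies each tuple (row).
conn.execute(
    "CREATE TABLE customer ("
    "cust_id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)
conn.executemany(
    "INSERT INTO customer (cust_id, name, age) VALUES (?, ?, ?)",
    [(1, "Ann", 27), (2, "Bob", 41)],
)

# Each returned tuple describes one object by its attribute values.
rows = conn.execute(
    "SELECT cust_id, name, age FROM customer ORDER BY cust_id"
).fetchall()
print(rows)

# A second tuple with an existing key is rejected by the database.
try:
    conn.execute("INSERT INTO customer VALUES (1, 'Eve', 30)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.close()
```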

Data warehouse: a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

1.3 Data Mining On What Kind of Data


Transactional database: consists of a file where each record represents a transaction.
Object-oriented databases: are based on the object-oriented programming paradigm, where, in general terms, each entity is considered as an object.
Spatial databases: contain spatial-related information. Such databases include geographic (map) databases, VLSI chip design databases, and medical and satellite image databases. Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, such as a park. Other patterns may describe the climate of mountainous areas located at various altitudes.

1.3 Data Mining On What Kind of Data


Temporal databases and time-series databases: both store time-related data. A temporal database usually stores relational data that include time-related attributes.

Data mining techniques can be used to find the characteristics of object evolution, or the trend of changes for objects in the database.

1.4 Association Analysis


Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for market basket or transaction data analysis.
Association rules are of the form $X \Rightarrow Y$, that is, $A_1 \wedge \dots \wedge A_m \Rightarrow B_1 \wedge \dots \wedge B_n$, where $A_i$ (for $i \in \{1, \dots, m\}$) and $B_j$ (for $j \in \{1, \dots, n\}$) are attribute-value pairs.

1.4 Association Analysis


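Support: the items must appear in many baskets.
Confidence: the rule $X \Rightarrow Y$ holds with confidence c if a fraction c of the transactions that contain X also contain Y.
$$\text{Support}(X) = P(X), \qquad \text{Confidence}(X \Rightarrow Y) = \frac{P(X \cup Y)}{P(X)}$$
Here $P(X \cup Y)$ is the probability that a transaction contains all the items of both X and Y.
Ex: $\text{age}(X, \text{"20...29"}) \wedge \text{income}(X, \text{"20K...29K"}) \Rightarrow \text{buys}(X, \text{"CD player"})$ [support = 2%, confidence = 60%]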
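To make the two measures concrete, here is a minimal Python sketch that computes them by counting transactions. The five baskets and the item names are hypothetical; they are not taken from the slides.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions containing x, the fraction also containing y."""
    return count(set(x) | set(y), transactions) / count(x, transactions)


s = support({"bread", "milk"}, baskets)       # 3 of 5 baskets
c = confidence({"bread"}, {"milk"}, baskets)  # 3 of the 4 bread baskets
print(f"support={s:.0%}, confidence={c:.0%}")
```

Counting directly avoids dividing one rounded probability by another; the ratio of counts equals the ratio of the two probabilities in the formula.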

1.4 Association Analysis


The problem of mining association rules is decomposed into the following two steps:
1. Discover the large itemsets, i.e., the itemsets whose transaction support is above a predetermined minimum support.
2. Use the large itemsets to generate the association rules for the database.
Algorithms: Apriori, AprioriTID, Boolean, ...
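Step 1 can be sketched as a level-wise search in the style of Apriori. The code below is a simplified illustration, not a tuned implementation: it assumes a minimum support given as an absolute transaction count, and it uses the four-transaction example database from this chapter (TIDs 100 to 400) as input. The function and variable names are my own.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {itemset: support count} for every large itemset."""
    transactions = [frozenset(t) for t in transactions]

    def keep_large(candidates):
        # Count each candidate and keep those meeting the minimum support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support_count}

    items = {item for t in transactions for item in t}
    current = keep_large({frozenset([i]) for i in items})
    found = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of two large (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = keep_large(candidates)
        found.update(current)
        k += 1
    return found


db = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]
large_itemsets = apriori(db, min_support_count=2)
for itemset, n in sorted(large_itemsets.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

The prune step is what keeps the search tractable: an itemset can only be large if all of its subsets are large, so ABC is discarded without counting because AB is not large.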

1.4 Association Analysis


Database

TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

1.4 Association Analysis


Large itemset: BCE (min support = 2, min confidence = 80%)

B => CE [conf=67%]
C => BE [conf=67%]
E => BC [conf=67%]
BC => E [conf=100%]
BE => C [conf=67%]
CE => B [conf=100%]

Only BC => E and CE => B reach the minimum confidence of 80%.
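The confidences listed above can be checked with a short Python sketch of step 2, rule generation from one large itemset. It assumes the example database with TIDs 100 to 400; the helper names `count` and `rules_from` are hypothetical.

```python
from itertools import combinations

# The example database: one set of items per transaction.
db = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in db if set(itemset) <= t)


def rules_from(itemset, min_confidence):
    """Split a large itemset into lhs => rhs and keep the confident rules."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = count(itemset) / count(lhs)
            if conf >= min_confidence:
                rhs = itemset - lhs
                rules.append(("".join(sorted(lhs)), "".join(sorted(rhs)), conf))
    return rules


strong = rules_from("BCE", min_confidence=0.8)
for lhs, rhs, conf in strong:
    print(f"{lhs} => {rhs} [conf={conf:.0%}]")
```

For example, BE appears in three transactions and BCE in two, so BE => C has confidence 2/3, which rounds to 67% and falls below the threshold.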

1.5 Classification and Prediction


Classification is the process of finding a set of models that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.

The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
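The train-then-predict idea can be illustrated with a very small classifier. The sketch below uses a nearest-centroid model, chosen here only because it is short; the slides do not name a specific method. The training samples, the features (age, income in thousands), and the labels are hypothetical.

```python
def train(samples):
    """Derive a model: one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Predict the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


# Training data: objects whose class label is known (hypothetical values).
training = [
    ((22, 25), "buys"),
    ((27, 28), "buys"),
    ((45, 60), "does_not_buy"),
    ((52, 70), "does_not_buy"),
]
model = train(training)

# Objects whose class label is unknown.
print(predict(model, (30, 32)))
print(predict(model, (50, 66)))
```

Decision trees, classification rules, or neural networks would follow the same two phases: derive a model from labeled data, then apply it to unlabeled objects.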

1.5 Classification and Prediction

Fig. 4. A simple classification example.

1.6 Cluster Analysis


Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label.
Objects are clustered based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
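One common way to group unlabeled points is k-means, sketched below in plain Python. It is used here as an example of a clustering method, not as the method the slides prescribe. The nine 2-D points and the choice of initial centroids are hypothetical, and the fixed iteration count is a simplification.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customer locations; no class labels are given.
points = [
    (1, 1), (1, 2), (2, 1),
    (8, 8), (9, 8), (8, 9),
    (1, 8), (2, 9), (1, 9),
]
# Fixed starting centroids keep the run deterministic.
centroids, clusters = kmeans(points, [points[0], points[3], points[6]])
for center, members in zip(centroids, clusters):
    print(center, members)
```

Points that end up in the same cluster are close to each other (high intraclass similarity) and far from the other centroids (low interclass similarity).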

1.6 Cluster Analysis

Fig. 5. A 2-D plot of customer data with respect to customer locations in a city, showing 3 data clusters.

1.7 Classification of Data Mining Systems

Fig. 6. Data mining as a confluence of multiple disciplines.

1.8 Major Issues in Data Mining


Mining methodology and user interaction issues:
These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.

1.8 Major Issues in Data Mining


Performance issues:
These include efficiency, scalability, and parallelization of data mining algorithms.

Issues relating to the diversity of database types:
Handling of relational and complex types of data.
Mining information from heterogeneous databases and global information systems.
