Data Mining

Contents
1. Introduction
2. Data warehouse and OLAP technology
3. Data preprocessing
4. Mining association rules in large databases
5. Classification and prediction
6. Cluster analysis
7. Mining complex types of data
8. Trends in data mining

Ch1. Introduction
Diagram: DB, Data Mining, Information.
1.1 What Motivated Data Mining?

Data mining has attracted a great deal of attention in the information industry in recent years, mainly because of the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge. The information and knowledge gained can be used in applications ranging from business management, production control, and market analysis to engineering design and science exploration.
1.2 What is Data Mining?
Fig. 1. Data mining: searching for knowledge in your data.
Fig. 2. Data mining as a step in the process of knowledge discovery.

1. Data cleaning: to remove noise and inconsistent data.
2. Data integration: where multiple data sources may be combined.
3. Data selection: where data relevant to the analysis task are retrieved from the database.
4. Data transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations.
5. Data mining: an essential process where intelligent methods are applied in order to extract data patterns.
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.
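
The steps above can be sketched as a small pipeline. This is only an illustration in Python, assuming two hypothetical in-memory sources (`store_a`, `store_b`) of sales records. The field names, the cleaning rule, the "share of revenue" pattern, and the `min_share` threshold are all assumptions made for the example and are not part of the original material.

```python
from collections import Counter

# Hypothetical raw sources; None marks a missing (noisy) value.
store_a = [
    {"customer": "c1", "item": "milk", "amount": 2.0},
    {"customer": "c2", "item": "bread", "amount": None},
    {"customer": "c1", "item": "bread", "amount": 1.5},
]
store_b = [
    {"customer": "c3", "item": "milk", "amount": 2.5},
    {"customer": "c3", "item": "pen", "amount": 0.5},
    {"customer": "c2", "item": "milk", "amount": 2.0},
]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Data transformation: aggregate the amount spent per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Data mining (toy): each item's share of the total amount."""
    grand_total = sum(totals.values())
    return {item: t / grand_total for item, t in totals.items()}

def evaluate(shares, min_share):
    """Pattern evaluation: keep shares that reach the threshold."""
    return {item: s for item, s in shares.items() if s >= min_share}

integrated = integrate(clean(store_a), clean(store_b))
selected = select(integrated, items={"milk", "bread"})
totals = transform(selected)
patterns = evaluate(mine(totals), min_share=0.5)

# Knowledge presentation: print the surviving patterns.
for item, share in patterns.items():
    print(f"{item} accounts for {share:.0%} of the selected sales")
```

Each function corresponds to one step, so a step can be swapped for a more realistic one without touching the others.
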
Components shown in the figure: graphical user interface; pattern evaluation; data mining engine; knowledge base; database or data warehouse server; data cleaning, data integration, and filtering; database; data warehouse.
Fig. 3. Architecture of a typical data mining system.
1.3 Data Mining On What Kind of Data

Relational database
Data warehouse
Transactional database
Advanced database systems and advanced database applications:
  Object-oriented database
  Object-relational database
  Spatial database
  Temporal database and time-series database
  Text database and multimedia database
  Heterogeneous database and legacy database

Relational database: a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
Data warehouse: a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

Transactional database: consists of a file where each record represents a transaction.
Object-oriented databases: are based on the object-oriented programming paradigm, where, in general terms, each entity is considered an object.
Spatial databases: contain spatial-related information. Such databases include geographic (map) databases, VLSI chip design databases, and medical and satellite image databases. Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, such as a park. Other patterns may describe the climate of mountainous areas located at various altitudes.

Temporal databases and time-series databases: both store time-related data. A temporal database usually stores relational data that include time-related attributes.
Data mining techniques can be used to find the characteristics of object evolution, or the trend of changes for objects in the database.
1.4 Association Analysis

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for market basket or transaction data analysis.
Association rules are of the form $X \Rightarrow Y$, that is, $A_1 \land \dots \land A_m \rightarrow B_1 \land \dots \land B_n$, where $A_i$ (for $i \in \{1, \dots, m\}$) and $B_j$ (for $j \in \{1, \dots, n\}$) are attribute-value pairs.

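Support: the items must appear together in many baskets (transactions).
Confidence: the fraction of transactions containing $X$ that also contain $Y$, that is, how often the rule $X \Rightarrow Y$ holds.

$\mathrm{Support}(X) = P(X)$
$\mathrm{Confidence}(X \Rightarrow Y) = \dfrac{P(X \cup Y)}{P(X)}$

Here $P(X \cup Y)$ is the probability that a transaction contains every item of both $X$ and $Y$.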
Ex: $\text{age}(X, \text{20-29}) \land \text{income}(X, \text{20K-29K}) \Rightarrow \text{buys}(X, \text{CD player})$ [support = 2%, confidence = 60%]

The problem of mining association rules is decomposed into the following two steps:
1. Discover the large itemsets, i.e., the sets of items that have transaction support above a predetermined minimum support.
2. Use the large itemsets to generate the association rules for the database.
Algorithms: Apriori, AprioriTID, Boolean, ...

Database

TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

Large itemset: BCE (min support = 2, min confidence = 80%)

Rule        Confidence
B => CE     67%
C => BE     67%
E => BC     67%
BC => E     100%
BE => C     67%
CE => B     100%
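
The two steps can be checked against the example database with a short Python sketch. It assumes a simplified level-wise search in the spirit of Apriori (it omits the subset-pruning optimization), and the function names `support_count`, `large_itemsets`, and `rules_from` are chosen for this illustration only.

```python
from itertools import combinations

# The example database: TID -> set of items.
database = {
    100: {"A", "C", "D"},
    200: {"B", "C", "E"},
    300: {"A", "B", "C", "E"},
    400: {"B", "E"},
}
MIN_SUPPORT = 2        # absolute number of transactions
MIN_CONFIDENCE = 0.8

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in database.values() if itemset <= items)

def large_itemsets():
    """Step 1: level-wise search for itemsets meeting MIN_SUPPORT."""
    items = sorted(set().union(*database.values()))
    current = [frozenset([i]) for i in items
               if support_count(frozenset([i])) >= MIN_SUPPORT]
    found = []
    while current:
        found.extend(current)
        size = len(current[0]) + 1
        # Join step: unions of large itemsets that are one item bigger.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        current = [c for c in candidates
                   if support_count(c) >= MIN_SUPPORT]
    return found

def rules_from(itemset):
    """Step 2: every rule X => Y with X, Y non-empty and X | Y == itemset."""
    rules = []
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            x = frozenset(lhs)
            confidence = support_count(itemset) / support_count(x)
            rules.append((x, itemset - x, confidence))
    return rules

large = large_itemsets()
bce_rules = rules_from(frozenset("BCE"))
strong = [(x, y) for x, y, c in bce_rules if c >= MIN_CONFIDENCE]

for x, y, c in bce_rules:
    status = "kept" if c >= MIN_CONFIDENCE else "dropped"
    print(f"{''.join(sorted(x))} => {''.join(sorted(y))} "
          f"[conf={c:.0%}] {status}")
```

With a minimum confidence of 80%, only BC => E and CE => B survive; the four rules at 67% are dropped.
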
1.5 Classification and Prediction

Classification is the process of finding a set of models that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
Fig. 4. A simple classification example.
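
As a rough illustration of deriving a model from training data and using it to predict an unknown class label, the Python sketch below uses a nearest-centroid classifier. The slides do not name a particular method, so this technique is swapped in purely as an example, and the training data (age, income in thousands) and the class labels are hypothetical.

```python
# Hypothetical training data: (age, income in thousands) -> known class label.
training_data = [
    ((25, 22), "buys"),
    ((28, 27), "buys"),
    ((23, 25), "buys"),
    ((52, 80), "does_not_buy"),
    ((47, 65), "does_not_buy"),
    ((60, 72), "does_not_buy"),
]

def train(data):
    """Derive the model: one centroid (mean feature vector) per class."""
    sums, counts = {}, {}
    for features, label in data:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(s / counts[label] for s in acc)
            for label, acc in sums.items()}

def predict(model, features):
    """Predict the class whose centroid is closest to `features`."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: sq_dist(model[label]))

model = train(training_data)

# Objects whose class label is unknown.
print(predict(model, (26, 24)))
print(predict(model, (55, 70)))
```

The model here is just the pair of class centroids; other model forms, such as rules or decision trees, would play the same role.
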
1.6 Cluster Analysis

Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label.
Objects are clustered based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
Fig. 5. A 2-D plot of customer data with respect to customer locations in a city, showing 3 data clusters.
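
A plot like this can be reproduced in miniature with k-means, sketched below in Python. The slides do not name a clustering algorithm, so k-means is used only as a common example. The customer locations are hypothetical, similarity is assumed to be Euclidean closeness, and the initial centroids are picked by hand so the run is deterministic.

```python
# Hypothetical customer locations (x, y) in a city; no class labels.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.5),
]

def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(cluster):
    """Centroid of a non-empty list of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until nothing changes."""
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return clusters, centroids

# Hand-picked starting centroids, one per expected cluster.
clusters, centroids = k_means(points, [points[0], points[3], points[6]])

for centroid, cluster in zip(centroids, clusters):
    print(f"centroid ({centroid[0]:.2f}, {centroid[1]:.2f}): {len(cluster)} customers")
```

Assigning each point to its nearest centroid keeps similar points together, and recomputing the centroids pulls the groups apart, which is one concrete way to pursue high intraclass and low interclass similarity.
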
1.7 Classification of Data Mining Systems
Fig. 6. Data mining as a confluence of multiple disciplines.
1.8 Major Issues in Data Mining

Mining methodology and user interaction issues:
These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.
Performance issues:
These include efficiency, scalability, and parallelization of data mining algorithms.

Issues relating to the diversity of database types:
Handling of relational and complex types of data
Mining information from heterogeneous databases and global information systems
