
Data Mining: An Overview from a Database Perspective

Motivation: Necessity is the Mother of Invention

Data mining (knowledge discovery in databases):

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

Necessity: the data explosion problem. Computerized data collection tools and mature database technology have led to tremendous amounts of data stored in databases.

We are drowning in data, but starving for knowledge!

Why Data Mining? Potential Applications

Marketing
Corporate Analysis
Fraud Detection
Other Applications

Marketing

Sales analysis: associations between product sales (e.g., beer and diapers).

Customer profiling: data mining can tell you what types of customers buy what products.

Identifying customer requirements:
Identify the best products for different customers.
Use prediction to find what factors will attract new customers.

Corporate Analysis

Finances: cash flow analysis and prediction.

Resources: summarize and compare the resources and spending.

Competition: compare with competitors by summarizing data to the same level.

Fraud Detection

Auto insurance fraud: association rule mining can detect groups of people who stage accidents to collect on insurance.

Money laundering: since 1993, the US Treasury's Financial Crimes Enforcement Network has used a data mining application to detect suspicious money transactions.

Other Applications

Sports teams: the New York Knicks use data mining to gain a competitive advantage.

Astronomy: the California Institute of Technology and the Palomar Observatory discovered 22 quasars with the help of data mining.

Banking: Security Pacific/Bank of America uses data mining to help with commercial lending decisions and to prevent fraud.

Data Mining: Major Issues

Diversity of data mining tasks: summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.

Diversity of data: relational, transactional, data warehouse, spatial, text, multimedia, active, object-oriented, Web, etc.

Efficiency and scalability.

Expression and visualization of data mining results.

Data mining applications, social issues (security and

Data Mining: Classification

Different views, different classifications:
the kinds of knowledge to be mined,
the kinds of database to be mined on,
the kinds of techniques adopted.

Knowledge to be mined:
Summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.

Database to be mined on:


Relational, transactional, data warehouse, spatial, text, multimedia, active, object-oriented, Web, etc.

Techniques adopted:
Database, statistics, visualization, machine learning,

Data Mining: A KD Process


Data mining: the core of the knowledge discovery process.

[Figure: the knowledge discovery process. Stages shown: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

From OLAP to OLAP Mining

Construction of data warehouse and computation of data cubes.

OLAP: On-Line Analytical Processing.

OLAP operations: drilling/rolling, pivoting, slicing/dicing, filtering, etc.

OLAP mining (OLAM): integration of OLAP with data mining.

On-line interactive mining: mining intertwined with drilling, slicing and dicing, pivoting, etc.

Dynamic swapping of mining tasks.
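The OLAP operations named above can be illustrated on a tiny in-memory fact table. This is only a sketch under stated assumptions: the table `FACTS`, the dimension names in `DIMS`, and the helper functions are hypothetical, and a real OLAP engine works on precomputed data cubes rather than Python lists.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter, sales).
DIMS = ("region", "product", "quarter")
FACTS = [
    ("East", "beer", "Q1", 100),
    ("East", "beer", "Q2", 120),
    ("East", "diapers", "Q1", 80),
    ("West", "beer", "Q1", 90),
    ("West", "diapers", "Q2", 60),
]


def roll_up(facts, keep):
    """Roll up: sum sales over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(facts, dim, value):
    """Slice: keep only the facts where dimension `dim` equals `value`."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


def dice(facts, conditions):
    """Dice: keep facts whose dimensions fall in the allowed value sets."""
    return [
        row for row in facts
        if all(row[DIMS.index(d)] in allowed for d, allowed in conditions.items())
    ]


if __name__ == "__main__":
    print(roll_up(FACTS, ("region",)))
    print(roll_up(slice_(FACTS, "quarter", "Q1"), ("product",)))
```

Drilling down is the reverse of `roll_up`: pass more dimensions in `keep` to see the same totals at a finer level.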

Why OLAP Mining?

Integration of data mining with data warehouse and OLAP technologies.

Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.

Interactive characterization, comparison, association, classification, clustering, prediction.

Integration of different data mining functions, e.g., characterized classification, first clustering and then association, etc.

Data Mining: OLAM Architecture


[Figure: OLAM architecture. Components shown: User GUI API, OLAM Engine, OLAP Engine, Data Cube API, Meta Data, Data Cube, ODBC/OLEDB, Data Warehouse, Database]

Mining Data Dispersion Characteristics

Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.

Numerical dimensions correspond to sorted intervals:
Data dispersion: analyzed with multiple granularities of precision.
Boxplot or quantile analysis on sorted intervals.

Dispersion analysis on computed measures:
Folding measures into numerical dimensions.
Boxplot or quantile analysis on the transformed cube.
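A boxplot is drawn from a five-number summary, which can be sketched with Python's standard library. The function names are hypothetical, the quartile convention (`method="inclusive"`) is one of several in common use, and the 1.5 x IQR outlier fence is the usual boxplot rule of thumb rather than something the slides specify.

```python
import statistics


def five_number_summary(values):
    """Return min, first quartile, median, third quartile, and max."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


def outliers(values, k=1.5):
    """Values lying more than k * IQR outside the quartiles."""
    s = five_number_summary(values)
    iqr = s["q3"] - s["q1"]
    low, high = s["q1"] - k * iqr, s["q3"] + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    sales = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    print(five_number_summary(sales))
    print(outliers(sales))
```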

Visualization of Data Dispersion: Boxplot Analysis

Mining Discriminant Rules


Discrimination: comparison of two or more classes.

Strategy:
Collect the relevant data into the target class and the contrasting class, respectively.
Generalize both classes to the same high-level concepts.
Compare tuples with the same high-level descriptions.
Present for every tuple its description and two numbers: support (distribution within a single class) and comparison (distribution between classes).
Highlight the tuples with strong discriminant features.

Relevance analysis:
Find attributes (features) which best distinguish different classes.
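The two numbers attached to each generalized tuple in the strategy, support and comparison, can be sketched as follows. The exact formulas are an assumption, since the slides do not define them: here support is the share of the target class covered by a tuple, and comparison is the share of that tuple's total count that falls in the target class. The function name and sample data are hypothetical.

```python
from collections import Counter


def discriminant_table(target, contrasting):
    """Map each generalized target tuple to (support, comparison).

    support    = count in target class / size of target class
    comparison = count in target class / count in both classes
    (assumed definitions, see the lead-in)
    """
    t_counts, c_counts = Counter(target), Counter(contrasting)
    table = {}
    for tup, n in t_counts.items():
        table[tup] = (n / len(target), n / (n + c_counts[tup]))
    return table


if __name__ == "__main__":
    # Tuples already generalized to (major, grade level).
    graduates = [("CS", "high")] * 3 + [("CS", "low")]
    undergraduates = [("CS", "high")] + [("CS", "low")] * 3
    for tup, (sup, comp) in discriminant_table(graduates, undergraduates).items():
        print(tup, f"support={sup:.0%}", f"comparison={comp:.0%}")
```

A tuple with a comparison value near 100% occurs almost only in the target class, which is what makes it a strong discriminant feature.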

Mining Association Rules

Association rule mining:

Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.

Applications:

Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.

Rule form: LHS => RHS [support, confidence].

Examples:
buys(x, diapers) => buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) => grade(x, A) [1%, 75%]
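The two bracketed numbers on a rule are its support (the fraction of transactions containing both sides) and its confidence (the fraction of transactions containing the left-hand side that also contain the right-hand side). A minimal sketch, assuming transactions are given as Python sets and using hypothetical function names:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(LHS and RHS) / support(LHS); assumes LHS occurs at least once."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


if __name__ == "__main__":
    baskets = [
        {"diapers", "beer"},
        {"diapers", "beer", "milk"},
        {"diapers"},
        {"milk"},
    ]
    print(support(baskets, {"diapers", "beer"}))
    print(confidence(baskets, {"diapers"}, {"beer"}))
```

This only scores a given rule. Finding all rules above a support and confidence threshold needs a search procedure over itemsets, which the sketch does not attempt.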

Mining Different Kinds of Association Rules

Boolean vs. quantitative associations:
Association on discrete vs. continuous data.

Single dimension vs. multiple dimensional associations:
E.g., association on items bought vs. on multiple predicates.

Single level vs. multiple-level analysis:
E.g., what brand of beers is associated with what brand of diapers?

Simple vs. constraint-based:
E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

Association vs. correlation analysis:
Association does not necessarily imply correlation.

Classification

Data categorization based on a set of training objects.

Applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis, etc.

Example: classify a set of diseases and provide the symptoms which describe each class or subclass.

The classification task: based on the features present in the class-labeled training data, develop a description or model for each class. It is used for classification of future test data, better understanding of each class, and prediction of certain properties and behaviors.

Data classification methods: decision trees (e.g., ID3, C4.5), statistics, neural networks, rough sets, etc.
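Decision-tree learners in the ID3 family choose the attribute to split on by information gain, the drop in class-label entropy after the split. The sketch below computes only that measure, not a full tree. The credit-approval rows and all names are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of `labels` minus the weighted entropy after splitting on `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Hypothetical class-labeled training data for credit approval.
    rows = [
        {"income": "high", "student": "no"},
        {"income": "high", "student": "yes"},
        {"income": "low", "student": "no"},
        {"income": "low", "student": "yes"},
    ]
    labels = ["approve", "approve", "reject", "reject"]
    for attr in ("income", "student"):
        print(attr, information_gain(rows, labels, attr))
```

In this toy data `income` separates the classes perfectly and `student` not at all, so a tree builder using this measure would split on `income` first.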

Major Classification Methods

Decision tree-based classification:
Training set vs. test set or cross-validation.
Overfitting problem and tree pruning.
Boosting techniques.

Bayesian classification:
Naive Bayesian classification.
Bayesian belief networks.
Boosting techniques (e.g., AdaBoost).

Neural network approach:
Multi-layer networks and back-propagation.

Genetic algorithms:
Genetic operators and fitness function selection.

Three Categories of Clustering Techniques

Partitioning-based:
Basically enumerate various partitions and then score them by some criterion. K-means, K-medoids, etc.

Hierarchy-based:
Create a hierarchical decomposition of the set of data (or objects) using some criterion.

Model-based:
A model is hypothesized for each of the clusters; the goal is to find the best fit of the data to that model. E.g., Bayesian classification (AutoClass), Cobweb.
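As an illustration of the partitioning-based category, here is a minimal K-means sketch. Rather than enumerating partitions, it alternates between assigning points to the nearest center and recomputing centers, which is the usual heuristic. Taking the first `k` points as initial centers is a simplifying assumption made to keep the example deterministic, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Return (centers, clusters) after at most `iterations` refinement rounds."""
    centers = [tuple(p) for p in points[:k]]  # naive, deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.0, 2.0), (10.0, 10.0), (10.0, 11.0)]
    centers, clusters = kmeans(data, k=2)
    print(centers)
    print(clusters)
```

The score being reduced is the sum of squared distances from each point to its cluster center. The result depends on the initial centers, so practical implementations usually try several starts.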

Database Clustering Methods


CLARANS (Ng & Han 94): an extension to the k-medoid algorithm based on randomized search.

BIRCH (Zhang et al. 96): CF tree (a balanced tree structure).

DBSCAN (EKXS 96): connects regions of sufficiently high density into clusters.

STING (WYM 97): a hierarchical cell structure that stores statistical information.

CLIQUE (Agrawal et al. 98): clusters high-dimensional data.

Time-Series Data Mining

Trend and deviation analysis


Find trend (data evolution regularity) and deviations.
Regression analysis, visualization techniques.

Subsequence analysis: similarity search


Subsequence matching: normalization + matching
Template specification: shape and macro specification.

Sequential pattern analysis


Sequential association rules

Periodicity analysis
full periods vs. partial periods, cyclic association

Similarity Search in Data Mining

Faloutsos et al. (1994): extract features from each window; Fourier transform and R*-tree structure.

Agrawal et al. (1995): amplitude scaling, offset translation; distance is determined from the sequence envelopes.

Agrawal et al. (1995): SDL pattern language to encode queries about shapes.

Jagadish et al. (1997): domain-independent framework.
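The idea of "normalization + matching" can be sketched with a brute-force sliding window. This is not any of the cited methods: it uses neither Fourier features, an R*-tree index, nor sequence envelopes. It only shows why offset translation and amplitude scaling let two sequences of the same shape match despite different levels and magnitudes. All names are hypothetical.

```python
from math import sqrt


def normalize(seq):
    """Offset translation (subtract the mean), then amplitude scaling (divide by the std)."""
    mean = sum(seq) / len(seq)
    std = sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)  # a flat window has no shape to scale
    return [(x - mean) / std for x in seq]


def best_match(series, query):
    """Return (start index, distance) of the window most similar in shape to `query`."""
    q = normalize(query)
    w = len(query)
    best = None
    for start in range(len(series) - w + 1):
        window = normalize(series[start:start + w])
        d = sqrt(sum((a - b) ** 2 for a, b in zip(window, q)))
        if best is None or d < best[1]:
            best = (start, d)
    return best


if __name__ == "__main__":
    series = [5, 5, 5, 10, 20, 10, 5, 5]
    query = [1, 2, 1]  # a small "peak" shape at a different scale
    print(best_match(series, query))
```

The scan costs time proportional to the series length times the window length per query, which is the cost the indexed approaches listed above are designed to avoid.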

Periodic Pattern Search in Time-Related Data Sets

Full cycle analysis:


Fourier transformation, other statistical analysis methods

Fragment-wise cyclic behavior analysis:


Example: Jack reads the NY Times at 9:00am every day.
Given (natural) periods vs. arbitrary periods.
A data cube and OLAP-based technique (Gong and Han 98).

Cyclic association rules:


Associations which form cycles.

Cyclic Association Rules (B. Özden, S. Ramaswamy, A.
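Fragment-wise (partial) periodicity, as in the NY Times example, asks how reliably one event recurs at a fixed position within each period while ignoring the other positions. A minimal sketch for a given period follows. It is a simplification for illustration only, not the data cube technique cited above, and the function name and sample log are hypothetical.

```python
def periodic_confidence(events, period, offset, item):
    """Fraction of complete periods in which `item` occurs at position `offset`."""
    cycles = len(events) // period
    if cycles == 0:
        return 0.0
    hits = sum(events[c * period + offset] == item for c in range(cycles))
    return hits / cycles


if __name__ == "__main__":
    # Three "days" of three time slots each (hypothetical activity log).
    log = [
        "paper", "coffee", "email",
        "paper", "tea", "email",
        "paper", "coffee", "call",
    ]
    print(periodic_confidence(log, period=3, offset=0, item="paper"))
    print(periodic_confidence(log, period=3, offset=1, item="coffee"))
```

With a given (natural) period such as a day, only the offsets need to be scanned. With arbitrary periods, the same check would have to be repeated for each candidate period length.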

Systems for Data Warehousing

Arbor Software: Essbase

Oracle: Express/Data-mart Suite.


Informix: Meta-Cube.

Cognos: PowerPlay
Redbrick Systems: Redbrick Warehouse

Microstrategy: DSS/Server

Microsoft: PLATO (SQL-Server 7.0) [OLEDB for OLAP]

Systems for Data Mining

IBM: Intelligent Miner.

SAS Institute: Enterprise Miner.

Silicon Graphics: MineSet.

Integral Solutions Ltd.: Clementine.

Information Discovery Inc.: Data Mining Suite.

DBMiner Technology Inc.: DBMiner.

Rutgers: DataMine; GMD: Explora; Univ. Munich: VisDB.

Major Approaches in Data Mining Systems

Database-oriented approach: IBM Intelligent Miner.

OLAM approach: DBMiner.

Machine learning: AQ15, ID3/C4.5/C5.0, Cobweb.


Rough sets, fuzzy sets: Datalogic/R, 49er, etc.

Statistical approaches, e.g., SAS Enterprise Miner.


Neural network approach: Cognos 4Thought.

Conclusions

Data Mining: A rich, promising, young field with broad applications and many challenging research issues.

Data mining tasks: characterization, association, classification, clustering, prediction, sequence and pattern analysis, etc.

Data mining domains: relational, transactional, text, spatial, time-series, multimedia, active DBs, data warehouses, and WWW.

Data mining methods: data-intensive, statistics, visualization, information science, and other disciplines.

Progress: scalable methods and multi-task systems.

OLAM: on-line analytical mining holds high promise for the integration of OLAP and mining.

Future Work

Theoretical foundations of data mining.

Implementation and new data mining methodologies:
A set of well-tuned, standard mining operators.
Data and knowledge visualization tools.
Integration of multiple data mining strategies.

Data mining in advanced information systems: spatial, multimedia, Web mining.

Data mining applications: content browsing, query optimization, multi-resolution model, etc.

Social issues: a threat to security and privacy.
