Necessity: the data explosion problem. Computerized data collection tools and mature database technology lead to tremendous amounts of data stored in databases.
Applications
Marketing
Customer profiling: data mining can tell you what types of customers buy what products.
Corporate Analysis
Finances: cash flow analysis and prediction.
Resources: summarize and compare the resources and spending.
Competition: compare with competitors by summarizing data to the same level.
Fraud Detection
Auto insurance fraud: association rule mining can detect a group of people who stage accidents to collect on insurance.
Money laundering: since 1993, the US Treasury's Financial Crimes Enforcement Network agency has used a data-mining application to detect suspicious money transactions.
Other Applications
Sports teams: the New York Knicks use data mining to gain a competitive advantage.
Astronomy: the California Institute of Technology and the Palomar Observatory discovered 22 quasars with the help of data mining.
Banking: Security Pacific/Bank of America uses data mining to help with commercial lending decisions and to prevent fraud.
Diversity of data mining tasks: summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.
Diversity of data: relational, transactional, data warehouse, spatial, text, multimedia, active, object-oriented, Web, etc.
Efficiency and scalability.
Expression and visualization of data mining results.
Data mining applications, social issues (security and
Knowledge to be mined:
Summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.
Techniques adopted:
Database, statistics, visualization, machine learning,
[Figure: process diagram with the labels Databases, Data Cleaning, Data Integration, and Data Mining.]
Construction of data warehouse and computation of data cubes.
OLAP: On-Line Analytical Processing.
OLAP operations: drilling/rolling, pivoting, slicing/dicing, filtering, etc.
OLAP mining (OLAM): integration of OLAP with data mining.
On-line interactive mining: mining intertwined with drilling, slicing and dicing, pivoting, etc.
Dynamic swapping of mining tasks.
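The OLAP operations named above can be illustrated on a toy fact table. This is a minimal sketch only: the table, the dimension names, and the helper functions `roll_up` and `slice_` are hypothetical and do not come from any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter, sales).
DIMS = ("region", "product", "quarter")
FACTS = [
    ("East", "TV", "Q1", 100),
    ("East", "TV", "Q2", 120),
    ("East", "PC", "Q1", 200),
    ("West", "TV", "Q1", 80),
    ("West", "PC", "Q2", 150),
]

def roll_up(facts, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def slice_(facts, dim, value):
    """Keep only the facts whose dimension `dim` equals `value`."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]

# Roll up to the region level, then slice on Q1 and roll up by product.
by_region = roll_up(FACTS, ["region"])
q1_by_product = roll_up(slice_(FACTS, "quarter", "Q1"), ["product"])
```

Drilling down is the reverse of `roll_up`: pass a longer `keep` list to see the same totals at a finer level.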
Integration of data mining with data warehouse and OLAP technologies.
Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Interactive characterization, comparison, association, classification, clustering, prediction.
Integration of different data mining functions, e.g., characterized classification, first clustering and then association, etc.
[Figure: OLAM architecture with the components OLAM Engine, OLAP Engine, Data Cube API, Data Cube, Meta Data, ODBC/OLEDB, Data Warehouse, and Database.]
median, max, min, quantiles, outliers, variance, etc.
Data dispersion: analyzed with multiple granularities of precision.
Boxplot or quantile analysis on sorted intervals.
Folding measures into numerical dimensions.
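The dispersion measures listed above can be computed with the Python standard library. The sketch below assumes the common boxplot convention of flagging values more than 1.5 interquartile ranges outside the quartiles; the slides do not specify an outlier rule, and the sample data is made up.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4)
    return data[0], q1, q2, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
spread = statistics.variance(sample)  # sample variance
```

`statistics.quantiles` uses the "exclusive" interpolation method by default, so other tools may report slightly different quartiles for the same data.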
Collect the relevant data into the target class and the contrasting class, respectively.
Generalize both classes to the same high-level concepts.
Compare tuples with the same high-level descriptions.
Present for every tuple its description and two numbers:
support: distribution within a single class;
comparison: distribution between classes.
Highlight the tuples with strong discriminant features.
Relevance Analysis:
Find attributes (features) which best distinguish different classes.
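The two numbers reported for each generalized tuple in class comparison can be sketched as follows. The exact formulas are an assumption: support is taken as the tuple's share of the target class, and comparison as the target class's share of all occurrences of that tuple. The example classes and attribute values are hypothetical.

```python
from collections import Counter

def compare_classes(target, contrast):
    """Map each generalized tuple in the target class to (support, comparison)."""
    t_counts, c_counts = Counter(target), Counter(contrast)
    result = {}
    for desc, n in t_counts.items():
        support = n / len(target)            # distribution within the target class
        comparison = n / (n + c_counts[desc])  # distribution between the classes
        result[desc] = (support, comparison)
    return result

# Hypothetical generalized tuples: (major, GPA level).
graduates = [("CS", "high")] * 3 + [("Math", "high")]
undergraduates = [("CS", "high")] + [("CS", "low")] * 3

table = compare_classes(graduates, undergraduates)
```

A tuple with a comparison value near 1 occurs almost only in the target class, which makes it a strong discriminant feature.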
Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.
Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.
Rule form: LHS → RHS [support, confidence].
Examples:
buys(x, diapers) → buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) → grade(x, A) [1%, 75%]
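The support and confidence figures attached to a rule can be computed directly from transaction data. The sketch below uses the standard definitions: support is the fraction of transactions containing every item in the rule, and confidence is that support divided by the support of the left-hand side. The baskets are made-up data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Support of LHS and RHS together, divided by the support of LHS."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)

# Hypothetical market baskets.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

rule_support = support(BASKETS, {"diapers", "beer"})
rule_confidence = confidence(BASKETS, {"diapers"}, {"beer"})
```

`confidence` raises `ZeroDivisionError` when the left-hand side never occurs; a real miner would prune such candidates first.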
Classification
Data categorization based on a set of training objects.
Applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis, etc.
Example: classify a set of diseases and provide the symptoms which describe each class or subclass.
The classification task: based on the features present in the class-labeled training data, develop a description or model for each class. It is used for classification of future test data, better understanding of each class, and prediction of certain properties and behaviors.
Data classification methods: decision trees (e.g., ID3, C4.5), statistics, neural networks, rough sets, etc.
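Decision-tree learners such as ID3 pick the attribute to split on by information gain. The sketch below shows only that selection step, not a full tree builder, and the credit-approval rows are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the rows on `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical class-labeled training data.
ROWS = [
    {"income": "high", "student": "yes"},
    {"income": "high", "student": "no"},
    {"income": "low", "student": "yes"},
    {"income": "low", "student": "no"},
]
LABELS = ["approve", "approve", "reject", "reject"]

best_attr = max(["income", "student"],
                key=lambda a: information_gain(ROWS, LABELS, a))
```

Here `income` separates the classes perfectly, so it yields the full one bit of gain, while `student` yields none.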
Bayesian classification:
Naive Bayesian classification.
Bayesian belief networks.
Boosting techniques (e.g., AdaBoost).
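A naive Bayesian classifier multiplies the class prior by per-attribute conditional probabilities, treating attributes as independent given the class. The sketch below handles categorical attributes only and assumes add-one (Laplace) smoothing, which the slides do not mention; the weather data is hypothetical.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count classes, per-class attribute values, and attribute domains."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, attribute index) -> value counts
    domains = defaultdict(set)           # attribute index -> values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains

def predict(model, row):
    """Return the class with the largest prior times smoothed likelihoods."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total
        for i, value in enumerate(row):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score *= (value_counts[(label, i)][value] + 1) / (n + len(domains[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (outlook, temperature) -> play?
ROWS = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
LABELS = ["no", "no", "yes", "yes", "yes"]

model = train(ROWS, LABELS)
```

Calling `predict(model, ("rainy", "cool"))` then scores each class and returns the more probable one.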
Genetic algorithms:
Genetic operators and fitness function selection.
Partitioning-based:
Enumerate various partitions and then score them by some criterion.
K-means, K-medoids, etc.
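K-means is the best-known partitioning method. The sketch below is the textbook assign-then-recompute loop with squared Euclidean distance; the starting centroids are passed in explicitly to keep the run deterministic, and the points are made up.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update until nothing changes."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, groups = kmeans(POINTS, [(1, 1), (8, 8)])
```

K-medoids follows the same loop but restricts each center to be an actual data point, which makes it less sensitive to outliers.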
Hierarchy-based:
Create a hierarchical decomposition of the set of data (or objects) using some criterion.
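One way to build such a hierarchical decomposition is bottom-up (agglomerative) merging. The sketch below assumes single linkage on one-dimensional points, which is only one of several possible criteria; the data is made up.

```python
def agglomerate(points, target_clusters):
    """Repeatedly merge the two closest clusters (single linkage)."""
    clusters = [[p] for p in points]
    merges = []  # (left cluster, right cluster, distance) for each merge
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

final, history = agglomerate([1, 2, 6, 7, 20], target_clusters=2)
```

The `history` list records the merge order, which is the information a dendrogram would display.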
Model-based:
A model is hypothesized for each of the clusters; the goal is to find the best fit of the data to that model.
E.g., Bayesian classification (AutoClass), Cobweb.
CLARANS (Ng & Han 94): an extension to the k-medoid algorithm based on randomized search.
BIRCH (Zhang et al. 96): CF tree (a balanced tree structure).
DBSCAN (EKXS 96): connects regions of sufficiently high density into clusters.
STING (WYM 97): a hierarchical cell structure that stores statistical information.
CLIQUE (Agrawal et al. 98): clusters high-dimensional data.
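The density-based idea behind DBSCAN can be sketched in a few lines. This is a simplified, brute-force illustration rather than the published implementation: neighborhoods are found by scanning all points, and the parameter names `eps` and `min_pts` and the sample points are chosen here for illustration.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    labels = {}

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if j in labels:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return [labels[i] for i in range(len(points))]

POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
assignment = dbscan(POINTS, eps=1.5, min_pts=3)
```

The isolated point at (50, 50) has too few neighbors to start or join a cluster, so it is reported as noise.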
Periodicity analysis
full periods vs. partial periods, cyclic association
Faloutsos et al. (1994): extract features from each window; Fourier transform and R*-tree structure.
Agrawal et al. (1995): amplitude scaling, offset translation; distance is determined from the sequence envelopes.
Agrawal et al. (1995): SDL pattern language to encode queries about shapes.
Jagadish et al. (1997): domain-independent framework.
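The window-feature idea can be illustrated by keeping only the first few Fourier coefficients of each window. This is a rough sketch, not the cited authors' method: the R*-tree index is omitted, and the window contents and the choice of `k = 2` coefficients are assumptions. With the 1/sqrt(n) scaling used here, Parseval's theorem guarantees that the distance between feature vectors never exceeds the distance between the full windows.

```python
import cmath
import math

def dft_features(window, k=2):
    """First k coefficients of the discrete Fourier transform, scaled by 1/sqrt(n)."""
    n = len(window)
    features = []
    for f in range(k):
        coeff = sum(x * cmath.exp(-2j * math.pi * f * t / n)
                    for t, x in enumerate(window))
        features.append(coeff / math.sqrt(n))
    return features

def distance(a, b):
    """Euclidean distance; works for real or complex components."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical windows taken from two time series.
w1 = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
w2 = [1.0, 2.0, 2.5, 4.5, 3.0, 1.5, 1.0, 0.5]

feature_dist = distance(dft_features(w1), dft_features(w2))
true_dist = distance(w1, w2)
```

Because `feature_dist` is a lower bound on `true_dist`, a search in the low-dimensional feature space can discard candidates without missing any true match, and only the survivors need a full comparison.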
Cognos: PowerPlay
Redbrick Systems: Redbrick Warehouse
Microstrategy: DSS/Server
Microsoft: PLATO (SQL-Server 7.0) [OLEDB for OLAP]
IBM: Intelligent Miner
SAS Institute: Enterprise Miner
Silicon Graphics: MineSet
Integral Solutions Ltd.: Clementine
Information Discovery Inc.: Data Mining Suite
DBMiner Technology Inc.: DBMiner
Rutgers: DataMine
GMD: Explora
Univ. Munich: VisDB
Conclusions
Data Mining: A rich, promising, young field with broad applications and many challenging research issues.
Data mining tasks: characterization, association, classification, clustering, prediction, sequence and pattern analysis, etc.
Data mining domains: relational, transactional, text, spatial, time-series, multimedia, active DBs, data warehouses, and WWW.
Data mining methods: Data-intensive, statistics, visualization, information science, and other disciplines.
Progress: scalable methods and multi-task systems.
OLAM: on-line analytical mining holds high promise for the integration of OLAP and mining.
Future Work
Theoretical foundations of data mining.
Implementation and new data mining methodologies:
A set of well-tuned, standard mining operators.
Data and knowledge visualization tools.
Integration of multiple data mining strategies.
Data mining in advanced information systems: spatial, multimedia, Web mining.
Data mining applications: content browsing, query optimization, multiresolution model, etc.
Social issues: a threat to security and privacy.