Data Mining
Data mining, also known as knowledge discovery in databases (KDD), is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
May 4, 2012
Data mining infers knowledge from the data held in order to answer queries, e.g.:
What characteristics do customers share who lapsed their policies and how do they differ from those who renewed their policies? Why is the Child Care Policy so profitable?
Data mining
Mining of gold from rock or sand is called gold mining rather than rock or sand mining.
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
Cross-market analysis
Customer profiling
Data mining can tell you what types of customers buy what products (clustering or classification).
Identifying the best products for different customers.
Use prediction to find what factors will attract new customers.
Various multidimensional summary reports.
Statistical summary information (data central tendency and variation).
Cash flow analysis and prediction.
Contingent claim analysis to evaluate assets.
Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.).
Resource planning
Competition
Monitor competitors and market directions.
Group customers into classes and apply a class-based pricing procedure.
Set pricing strategy in a highly competitive market.
Data Warehousing and Data Mining
Approach
use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
Auto insurance: detect a group of people who stage accidents to collect on insurance.
Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network).
Medical insurance: detect professional patients and ring of doctors and ring
Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and analyze effectiveness of Web marketing, improving Web site organization, etc.
Data Mining
Choosing the data mining algorithm(s).
Data mining: search for patterns of interest.
visualization, transformation, removing redundant patterns, and techniques used to present the mined knowledge to the user
Database
Databases, data warehouses and other information repositories hold the data necessary for data mining. The database or data warehouse server fetches the relevant data based on the user's data mining request. The knowledge base holds the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. The data mining engine ideally consists of a set of functional modules for tasks such as summarization, classification, cluster analysis, regression, association, correlation analysis, outlier analysis and evolution analysis.
The pattern evaluation module performs interestingness measurement and filters discovered patterns by using interestingness thresholds. It also interacts with the data mining module so as to focus the search toward interesting patterns. The user interface module establishes communication between users and the data mining system. This module allows users to browse the database or data warehouse, evaluate mined patterns, and visualize the patterns in different forms.
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories:
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. In a relational database, data is stored in tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple represents an object identified by a unique key and described by a set of attribute values.
A data warehouse is a repository of information collected from multiple sources (possibly located at different geographical locations), stored under a unified schema, and usually residing at a single site.
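To make the relational terms above concrete (a named table, its attributes, and tuples identified by a unique key), here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and rows are made up for illustration and do not come from the slides.

```python
import sqlite3

# In-memory database; the table, columns and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " cust_id INTEGER PRIMARY KEY,"  # unique key identifying each tuple
    " name TEXT,"
    " age INTEGER,"
    " income REAL)"
)
rows = [
    (1, "Asha", 34, 52000.0),
    (2, "Ben", 47, 61000.0),
    (3, "Chen", 29, 38000.0),
]
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", rows)

# A relational query selects tuples by their attribute values.
older = conn.execute(
    "SELECT name FROM customer WHERE age > 30 ORDER BY cust_id"
).fetchall()
print(older)
```

Each row inserted is one tuple, and the query filters tuples on the value of the age attribute.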
Data warehouses are constructed via a process of data cleaning, integration, selection, reduction and transformation, followed by data loading and periodic data refreshing. The data are organized around major subjects such as customer, item, supplier, and activity. They are stored to provide a historical perspective (say, for the past 5-10 years) and are typically summarized.
A Data Warehouse (DW) is designed for facilitating querying and analysis. Often designed as OLAP (On-Line Analytical Processing) systems, these databases contain read-only data that can be queried and analyzed far more efficiently as compared to regular OLTP application databases. In this sense an OLAP system is designed to be read-optimized.
Database
Designed for real-time business operations.
Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
Optimized for validation of incoming data during transactions; uses validation data tables.
Data warehouse
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
Other databases
Transactional databases: A transactional database consists of a file where each record represents a transaction, specified by a transaction identity number and a list of items making up the transaction.
Note that most relational database systems do not support nested relational structures such as the list of items in this example. This type of database facilitates market basket analysis, which enables one to bundle groups of items together as a strategy for maximizing sales.
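As a rough illustration of market basket analysis on such a file, the sketch below counts how often each pair of items is bought together. The transaction IDs and items are hypothetical, and pair counting is only one simple way to look for bundling candidates.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional file: transaction ID -> list of items.
transactions = {
    "T100": ["bread", "milk", "butter"],
    "T200": ["bread", "butter"],
    "T300": ["milk", "eggs"],
    "T400": ["bread", "milk", "butter", "eggs"],
}

pair_counts = Counter()
for items in transactions.values():
    # Count each unordered pair of items once per transaction.
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

# The most frequently co-purchased pair is a candidate bundle.
best_pair, best_count = pair_counts.most_common(1)[0]
print(best_pair, best_count)
```

With this made-up data, bread and butter appear together in three of the four transactions, more often than any other pair.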
Object-Relational Databases
The object-relational data model inherits essential concepts of object-oriented databases, where each entity is considered an object. For example, objects can be individual employees, customers, or items. Data and code relating to an object are encapsulated into a single unit. Each object has associated with it the following:
A set of variables that describe the object. These correspond to attributes in the entity-relationship and relational models.
A set of messages that the object can use to communicate with other objects, or with the rest of the database system.
A set of methods, where each method holds the code to implement a message. Upon receiving a message, the method returns a value in response.
Objects that share a common set of properties can be grouped into an object class. Each object is an instance of its class.
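The ideas of variables, messages, methods and classes map loosely onto classes in an object-oriented programming language. The following is only an analogy in Python, not how an object-relational database is implemented; the Employee class and its fields are hypothetical.

```python
class Employee:
    """Hypothetical object class; every Employee object is an instance of it."""

    def __init__(self, name, monthly_salary):
        # Variables describing the object (attributes in the relational model).
        self.name = name
        self.monthly_salary = monthly_salary

    def annual_salary(self):
        # Method: the code that implements the "annual_salary" message.
        return 12 * self.monthly_salary


e = Employee("Asha", 4000)
# Sending the message to the object returns a value in response.
print(e.annual_salary())
```

Data (name, monthly_salary) and code (annual_salary) are encapsulated in one unit, which is the point the slide makes.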
Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, such as a park. Other patterns may describe the climate of mountainous areas located at various altitudes, or describe the change in trend of metropolitan poverty rates based on city distances from major highways.
Data streams
Many applications involve the generation and analysis of a new kind of data, called stream data, where data flow in and out of an observation platform (or window) dynamically. Such data streams have the following unique features:
huge or possibly infinite volume, dynamically changing, flowing in and out in a fixed order, allowing only one or a small number of scans, and demanding fast (often real-time) response time.
Typical examples of data streams include various kinds of scientific and engineering data, time-series data, and data produced in other dynamic environments, such as power supply, network traffic, stock exchange, telecommunications, web click streams, video surveillance (inspection), and weather or environment monitoring. Because data streams are normally not stored in any kind of data repository, effective and efficient management and analysis of stream data pose great challenges to researchers.
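The single-scan constraint can be illustrated with a sliding-window average that reads each element once and keeps only the most recent values in memory. This is a minimal sketch; the function name, the window size and the sample readings are assumptions made for the example.

```python
from collections import deque


def sliding_mean(stream, window):
    """Yield the mean of the last `window` readings, scanning the stream once."""
    recent = deque()
    total = 0.0
    for x in stream:
        recent.append(x)
        total += x
        if len(recent) > window:
            total -= recent.popleft()  # the oldest reading leaves the window
        yield total / len(recent)


readings = iter([10, 20, 30, 40, 50])  # stand-in for an unbounded source
means = list(sliding_mean(readings, window=3))
print(means)
```

Memory use depends only on the window size, not on the length of the stream, which is what makes the approach usable when the volume is huge or unbounded.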
[Figure: data mining as a confluence of multiple disciplines: statistics, machine learning, pattern recognition, visualization, algorithms, and other disciplines]
General functionality
Descriptive data mining: Descriptive mining tasks characterize the general properties of the data in the database. They use human-interpretable patterns that describe the data.
Predictive data mining: Predictive mining tasks perform inference on the current data in order to make predictions. They use some variables to predict unknown or future values of other variables.
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
Descriptive: Summarization or Characterization, Discrimination, Association, Clustering, Sequential Pattern Discovery, etc.
Predictive: Classification and Prediction, Regression, Time series analysis
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Summarization or characterization: This provides the general characteristics or features of a target class of data. For instance, one may want to find the characteristics (such as 40-50 years old, employed, etc.) of customers who spend more than Rs. 10,000 a year in a shopping mall. The system should allow users to drill down on any dimension, such as occupation, to view those customers according to their type of employment. The output of the characterization process is often presented in the form of pie charts, bar charts, curves, multidimensional data cubes and multidimensional tables.
The output of such a process may provide a general comparative profile of the customers, such as: 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either seniors or youths and have no university degree. Drilling down on a dimension such as occupation, or adding a new dimension such as income level, may help in finding even more discriminative features between the two classes.
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
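The two rules above can be checked against the five transactions by computing support and confidence. These two measures are not named on this slide; they are the usual way to score association rules and are used here as an assumption. The helper names are illustrative.

```python
# The five transactions from the table above, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)


def support(itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset) / len(transactions)


def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return count(lhs | rhs) / count(lhs)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs | rhs),
          "confidence:", confidence(lhs, rhs))
```

Milk appears in four transactions and Coke accompanies it in three of them, so {Milk} --> {Coke} has confidence 3/4. {Diaper, Milk} appears in three transactions and Beer accompanies it in two, so the second rule has confidence 2/3.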
Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
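As one possible illustration of clustering with Euclidean distance, the sketch below runs a plain k-means loop on a handful of two-dimensional points. The slides do not name a specific clustering algorithm, so k-means, the sample points and the choice of initial centers are all assumptions.

```python
import math


def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Index of the center with the smallest Euclidean distance to p.
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the cluster; keep the old center if it is empty.
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else old
            for cl, old in zip(clusters, centers)
        ]
    return clusters, centers


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centers = kmeans(points, centers=[points[0], points[3]])
print(clusters)
```

The three points near the origin end up in one cluster and the three points near (8.5, 9) in the other, so points within a cluster are closer to one another than to points in the other cluster.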
Market Segmentation:
Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
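A similarity measure based on term frequencies could look like the sketch below, which uses cosine similarity between word-count vectors. Cosine similarity is an assumption (the slide does not specify the measure), and the three tiny documents are made up.

```python
import math
from collections import Counter


def term_freq(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = {
    "d1": "stock market prices fall",
    "d2": "market prices rise as stock rallies",
    "d3": "team wins the football match",
}
tf = {name: term_freq(text) for name, text in docs.items()}
print(cosine(tf["d1"], tf["d2"]), cosine(tf["d1"], tf["d3"]))
```

Documents d1 and d2 share three terms and score well above zero, while d1 and d3 share none and score zero, so a clustering step built on this measure would group d1 with d2.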
Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
(A B) (C) (D E)
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
Each record contains a set of attributes; one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
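The train/test procedure described above can be sketched with a very simple model. The example uses a 1-nearest-neighbour classifier purely because it is short; the slides do not prescribe a particular model, and the records, labels and split point are hypothetical.

```python
import math

# Hypothetical labelled records: (attributes, class).
records = [
    ((1.0, 1.2), "No"), ((0.8, 1.0), "No"), ((1.1, 0.9), "No"),
    ((3.0, 3.2), "Yes"), ((3.1, 2.9), "Yes"), ((2.9, 3.0), "Yes"),
    ((1.0, 1.0), "No"), ((3.0, 3.0), "Yes"),
]

# Divide the given data: first six records for training, last two for testing.
training_set, test_set = records[:6], records[6:]


def classify(x):
    """1-nearest-neighbour model: copy the class of the closest training record."""
    _, label = min(training_set, key=lambda rec: math.dist(x, rec[0]))
    return label


# Accuracy is measured only on records the model has not seen during training.
correct = sum(classify(x) == y for x, y in test_set)
accuracy = correct / len(test_set)
print(accuracy)
```

In practice the split is usually random rather than positional; a fixed split is used here only to keep the example deterministic.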
[Figure: a classifier is learned from the Training Set to produce a Model, which is then applied to the Test Set; class labels are Yes/No]
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers, e.g., type of business, where they stay, how much they earn, etc.
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes, e.g., when does a customer buy, what does he buy, how often does he pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
Sky Survey Cataloging
Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
Approach:
Segment the image.
Measure image attributes (features), 40 of them per object.
Model the class based on these features.
Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
[Figure: Class: Stages of Formation (Early, Intermediate, Late); Attributes; Data Size]
Regression (Predictive)
Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in the statistics and neural network fields.
Examples:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
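For the linear case, a minimal sketch is an ordinary least-squares fit of a straight line, shown here for the advertising example. The function name and the four data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Made-up data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(advertising, sales)

# Predict sales for an expenditure value not present in the data.
predicted = a + b * 5.0
print(a, b, predicted)
```

Because the sample data lie exactly on the line y = 1 + 2x, the fit recovers intercept 1 and slope 2, and the prediction for an expenditure of 5 is 11.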
A data mining system/query may generate thousands of patterns; not all of them are interesting.
Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of
Subjective: based on the user's belief in the data, e.g., unexpectedness (contradicting a user's belief), novelty, actionability (strategic information on which the users can act), etc.
Can a data mining system find all the interesting patterns?
It is useless and inefficient to generate all of the possible patterns. Instead, user-provided constraints and interestingness measures should be used to focus the search.
Can a data mining system find only the interesting patterns?
Approaches:
First generate all the patterns and then filter out the uninteresting ones.
Generate only the interesting patterns: mining query optimization.
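The first approach (generate, then filter) can be sketched as a threshold test applied to already-mined patterns. The pattern list, the use of support and confidence as the measures, and the threshold values are all assumptions for illustration.

```python
# Hypothetical mined patterns with precomputed interestingness measures.
patterns = [
    {"rule": "r1", "support": 0.60, "confidence": 0.75},
    {"rule": "r2", "support": 0.40, "confidence": 0.67},
    {"rule": "r3", "support": 0.55, "confidence": 0.50},
]


def interesting(pattern, min_support=0.5, min_confidence=0.7):
    """Keep a pattern only if it meets both user-provided thresholds."""
    return (pattern["support"] >= min_support
            and pattern["confidence"] >= min_confidence)


# Generate-then-filter: every pattern is produced first, then screened.
kept = [p["rule"] for p in patterns if interesting(p)]
print(kept)
```

The second approach would instead push such thresholds into the mining step itself, so that patterns failing them are never generated.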