Data mining – An Overview

IIM Udaipur
In this session we shall learn
• Data Mining - who cares? What is it? Where is it used?
• Some concepts in Data Mining
• Learning types
• Typical steps in Data Mining
What is Data Mining
• “Extracting useful information from large data
sets” – Hand, Mannila, and Smyth (2001)
• “Data mining is the process of exploration and
analysis, by automatic or semi-automatic
means, of large quantities of data in order to
discover meaningful patterns and rules” –
Berry and Linoff (1997)
What is Data Mining
• “[Data mining is] statistics at scale and speed”
- Pregibon (1999)
• “[Data Mining is] the process of discovering
meaningful correlations, patterns and trends
by sifting through large amounts of data
stored in repositories. Data mining employs
pattern recognition technologies, as well as
statistical and mathematical techniques” –
Gartner Group
Where is it used
• Medical research (or broadly, Health research)
• Science and Engineering research
• Military
• Intelligence
• Security
• Business research
• Sports
• And many more….
In the Business World
• From a list of prospective customers, which are most likely
to respond?
• Which customers are most likely to commit fraud?
• Which loan applications are likely to default?
• Which customers are most likely to abandon a subscription service (telephone, magazine, etc.)?
In the Business World
All the questions above can be answered
through classification techniques – logistic
regression or classification trees!
• Individuals whose data best matches that of the existing customers
• Higher probability of committing fraud
• Probability of leaving
How did they get here
• Statistical tools – linear regression, logistic
regression, discriminant analysis, principal
components analysis, clustering techniques,
time series analysis and forecasting
• Computer Science tools (machine learning
techniques) – classification trees, artificial
neural networks (ANN), support vector
machines (SVM)
How did they get here
• Shmueli, Patel and Bruce’s (2010) extension of
Pregibon’s (1999) idea of data mining –
“statistics at scale, speed, and simplicity”
“Big” Data and Data Mining
• Walmart captured 20 million transactions per
day in a 10-terabyte database in 2003
• Lyman and Varian (2003) estimate that 5 exabytes of information were produced in 2002 (1 exabyte = 1 million terabytes)
• Scannable bar codes, POS devices, GPS
• Growth of Internet
• Advancement in computational facilities
“Big” Data and Data Mining
• Data warehouses – Central repositories of
integrated data from one or more disparate
sources.
• Data marts – subsets of a data warehouse;
focus on single subjects such as sales, finance
or marketing
Useful Books on Data Mining
Techniques for this course
• “Data Mining for Business Intelligence” –
Shmueli, Patel, and Bruce (Textbook)
• “Data Mining and Business Analytics with R” –
Johannes Ledolter
• “Data Mining Techniques” – Linoff and Berry
• “An Introduction to Statistical Learning” –
James, Witten, Hastie, and Tibshirani
Core ideas in Data Mining
• Data exploration - Reviewing and examining the
data to see what messages they hold
- full understanding of the data may require a
reduction in its scale or dimension
- Data transformations
- Missing data
- Dealing with outliers
- Dealing with predictors of different types
Core ideas in Data Mining
• Data visualization – graphical exploration of the data
to see what information they hold
- Looking at each variable separately, as well as at
relationships between variables
- For numerical variables - histograms, boxplots
- For categorical variables - bar charts, dot plots
- For pairs of numerical variables, to look for possible relationships and their type – scatter plots
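As an illustration, a minimal sketch in base R – the data frame housing and its columns price, region and area are hypothetical:

hist(housing$price)                  # histogram of a numerical variable
boxplot(housing$price)               # boxplot of the same variable
barplot(table(housing$region))       # bar chart of a categorical variable
plot(housing$area, housing$price)    # scatter plot of a pair of numerical variables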
Core ideas in Data Mining
• Data reduction - Reduction of complex data
into simpler data. Instead of dealing with
thousands of product types, we might want
to put them in a smaller number of groups.
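A minimal sketch of such a grouping using k-means clustering in R – the data frame products and its numeric columns are hypothetical, and five groups is only an illustration:

set.seed(1)                                     # for reproducibility
km <- kmeans(scale(products[, c("price", "weight", "monthly_sales")]), centers = 5)
products$group <- km$cluster                    # each product type now carries a group label
table(products$group)                           # sizes of the resulting groups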
Core ideas in Data Mining
• Prediction - Predict the value of a numerical
(more specifically, continuous) variable
- Examples - sales, revenue, performance
- Each row is a case (unit, subject)
- Each column is a variable
- Technique: Multiple linear regression
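A minimal sketch in R – the data frame sales_data, the outcome revenue and the predictors are hypothetical:

fit <- lm(revenue ~ ad_spend + price, data = sales_data)   # multiple linear regression
summary(fit)                                               # coefficients, R-squared, etc.
predict(fit, newdata = new_stores)                         # predicted revenue for new cases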
Core ideas in Data Mining
• Classification – classifying units according to their
characteristics.
- Most basic form of data analysis
- Examples: (a) a loan applicant can repay on time, repay late, or declare bankruptcy; (b) the recipient of an offer can respond or not respond; (c) purchase / no purchase; (d) fraud / no fraud
- Each row of data is a case (customer, tax return, applicant)
- Each column is a variable
- Target variable is often binary (yes / no)
- Techniques: Logistic regression; Discriminant analysis; k-Nearest neighbors; Classification trees; Artificial Neural Networks
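A minimal sketch of logistic regression in R – the data frame loans, the 0/1 outcome default and the predictors are hypothetical:

fit <- glm(default ~ income + loan_amount, data = loans, family = binomial)
p   <- predict(fit, newdata = new_applicants, type = "response")   # estimated probability of default
pred_class <- ifelse(p > 0.5, "default", "no default")             # classify using a 0.5 cutoff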
Core ideas in Data Mining
• Association rules – Analysis of associations among items
purchased.
- Also called “affinity analysis”
- Data on transactions
- “What goes with what?”
- The “recommender” system of Amazon.com or
Netflix.com
- “Our records show you bought X, you may also like Y”
- Market Basket Analysis - based on simple conditional probability concepts
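A minimal sketch of the underlying conditional probability calculation for one rule, {bread} => {butter} – the 0/1 data frame baskets (one row per transaction, one column per item) is hypothetical:

support_xy <- mean(baskets$bread == 1 & baskets$butter == 1)   # P(bread and butter)
confidence <- support_xy / mean(baskets$bread == 1)            # P(butter | bread)
lift       <- confidence / mean(baskets$butter == 1)           # confidence relative to P(butter)

Packages such as arules (its apriori function) automate this search over all items and rules.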
Core ideas in Data Mining
• Predictive analytics - Combination of
classification, prediction and (to some extent)
association rules.
Learning Types
• Supervised learning algorithms
• Unsupervised learning algorithms
Supervised Learning Algorithms
• used in classification and prediction
• must have data available in which the value of the outcome of interest is known
• partition the data into two (sometimes three) parts – training data, validation data, and test data
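A minimal sketch of such a partition in R – the data frame dat is hypothetical and the 50/30/20 split is only an illustration:

set.seed(1)                                     # for reproducibility
idx <- sample(1:3, nrow(dat), replace = TRUE, prob = c(0.5, 0.3, 0.2))
train      <- dat[idx == 1, ]                   # used to fit ("train") the models
validation <- dat[idx == 2, ]                   # used to compare models and pick the best one
test       <- dat[idx == 3, ]                   # held out to assess the final model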
Supervised Learning Algorithms
Training partition:
• typically the largest partition
• contains the data used to build various models
we are examining
• this is the data from which the classification or
prediction algorithm “learns”, or is “trained”,
about the relationships between the outcome
and predictor variables
Supervised Learning Algorithms
Validation partition:
• after the algorithm has learned from the
training data, it is applied to the validation
data, to see how well it does
• used to assess the performance of each
model, so that we can compare the models,
and pick the best one
• sometimes also used to fine-tune, and hence improve, the model
Supervised Learning Algorithms
Test partition:
• If many different models are being examined, we may save this third partition to assess the performance of the finally chosen model on new data
• Also called a “holdout”, or “evaluation”
partition
Supervised Learning Algorithms
Examples:
• Simple and Multiple Linear Regression
• Logistic Regression
• Discriminant Analysis
• k-Nearest Neighbors
• Classification and Regression Trees
• Artificial Neural Networks
• Support Vector Machines
Unsupervised Learning Algorithms
• used where there is no outcome variable to
predict or to classify
• no “learning” from cases where such an
outcome variable is known
Unsupervised Learning Algorithms
Examples:
• Association Rules
• Dimension Reduction Methods (such as,
principal component analysis)
• Clustering Techniques
Some typical steps in Data Mining
• Develop an understanding of the purpose of the
data mining project
• Obtain the data set to be used in the analysis
• Explore, clean and preprocess the data
• Reduce the data (if necessary). For supervised Data Mining, separate the data into training, validation and test data sets
• Determine the data mining task (classification,
prediction, clustering etc.)
Some typical steps in Data Mining
• Choose the data mining techniques to be used
• Use algorithms to perform the task
• Interpret the results of the algorithms, and
compare the models (in case there are many)
• Deploy the model that performs the best
But most importantly:
The Understanding!
Before getting into any algorithm, develop an understanding of the problem at hand first!
Additional
SOME USEFUL CONCEPTS
Obtaining Data - Sampling
• Data Mining typically deals with large,
sometimes huge, databases
• Algorithms and models are typically applied to
a sample from a database, to produce
statistically valid results
• Once we develop and select a final model, we
use it to “score” the observations in the larger
databases
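A minimal sketch in R – the data frame big_db, the 0/1 outcome response and the predictors are hypothetical:

set.seed(1)
samp <- big_db[sample(nrow(big_db), 10000), ]                        # build the model on a manageable sample
fit  <- glm(response ~ age + income, data = samp, family = binomial)
big_db$score <- predict(fit, newdata = big_db, type = "response")    # "score" every record in the full database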
Rare Event Oversampling
• The event of interest may be a rare one
sometimes
• Example – customers purchasing a product in
response to a mailing
• Sampling may yield too few “interesting”
cases to effectively train a model
• Solution – oversample the rare cases to get a
more balanced dataset
• Use carefully!!
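One simple way to oversample in R – the data frame dat and the 0/1 outcome y (1 = rare event) are hypothetical:

rare   <- dat[dat$y == 1, ]                                             # the few "interesting" cases
common <- dat[dat$y == 0, ]
set.seed(1)
rare_over <- rare[sample(nrow(rare), nrow(common), replace = TRUE), ]   # sample rare cases with replacement
balanced  <- rbind(common, rare_over)                                   # roughly balanced training data

Model performance should still be judged on data with the original class proportions.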
Pre-processing and Cleaning the Data
• Types of variables – numeric and categorical
• Numeric variables – Continuous and Integer
• Categorical variables – Ordered and
Unordered
• Dummies for categorical variables – XLMiner
cannot create dummies itself while R can; for
XLMiner, dummy variables need to be created
manually
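A minimal sketch in R – the data frame dat and the categorical column region are hypothetical:

dat$region <- factor(dat$region)          # declare the column as categorical
X <- model.matrix(~ region, data = dat)   # one 0/1 dummy per level (minus a baseline)
head(X)

Functions such as lm() and glm() create these dummies internally once the column is a factor.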
Data Transformations
• Transforming the predictors may be necessary for various reasons, e.g., the modeling technique may require the predictors to be on a common scale
• Centering and Scaling
- Increases numerical stability of some models
- Loss of interpretability
• Skewness transformations
- Diagnosis from the skewness formula
- Usual transformations: log, square-root, inverse
- Box-Cox transformations
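A minimal sketch in R – the data frame dat and the skewed predictor income are hypothetical:

dat$income_z   <- scale(dat$income)       # center to mean 0 and scale to standard deviation 1
dat$income_log <- log(dat$income + 1)     # log transform to reduce right skew (+1 guards against zeros)

The boxcox function in the MASS package can suggest a Box-Cox transformation for a fitted linear model.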
Detecting Outliers
• An outlier is an observation that is “extreme”, being distant from the rest of the data
• Check for obvious data recording errors
• “Even with a thorough understanding of the data, outliers
can be hard to define” – Kuhn and Johnson (2013)
• Once detected (if at all), domain knowledge is necessary to decide whether or not to delete it
• In some contexts, detecting outliers is the Data Mining
exercise itself (airport security screening); it is called
“anomaly detection”
• Outliers can have significant influence on some models,
e.g., regression analysis
• Models resistant to outliers: Tree-based algorithms
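A simple screen in R – the data frame dat and the variable income are hypothetical, and any flagged case still needs a domain-knowledge review:

z <- (dat$income - mean(dat$income, na.rm = TRUE)) / sd(dat$income, na.rm = TRUE)
dat[which(abs(z) > 3), ]            # cases more than 3 standard deviations from the mean
boxplot.stats(dat$income)$out       # values beyond the boxplot whiskers (1.5 * IQR rule)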
Handling Missing Data
• Missing values in a real dataset are unavoidable
• Informative missingness
• Mostly occurs for certain predictors
• Solution 1: Omission
- may be the most feasible solution sometimes
- usually not a problem for large datasets
• Solution 2: Imputation
- use of statistical / machine learning techniques to
impute the missing values by reasonable substitutes
- for example, use of k-nearest neighbours algorithm
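A minimal sketch of both solutions in R – the data frame dat and the column income are hypothetical:

complete <- na.omit(dat)                                          # Solution 1: drop rows with any missing value
dat$income[is.na(dat$income)] <- mean(dat$income, na.rm = TRUE)   # Solution 2 (simplest form): mean imputation

Add-on packages (e.g., VIM) provide k-nearest-neighbour imputation as a more refined alternative.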
Dealing with Predictors of Different
Types
• Removing predictors that are not useful
- Advantages
- Zero-variance and near-zero-variance predictors
- Between-predictor correlations
- Understanding before removal is very
important though
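A minimal base R sketch of the zero-variance and between-predictor correlation checks – the data frame X of numeric predictors is hypothetical:

zero_var <- sapply(X, function(col) var(col, na.rm = TRUE) == 0)   # constant columns carry no information
names(X)[zero_var]
cors <- cor(X, use = "pairwise.complete.obs")
which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)           # strongly correlated predictor pairs

The caret package offers nearZeroVar() and findCorrelation() for the same purpose.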
Predictive Power and Overfitting
• How well will the model perform when applied to new data?
• We want our model to generalize beyond the
dataset we have at hand
• Data partitioning
• Over-fitting
• Cross-Validation
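The simplest check is to compare the error on the training partition with the error on the validation partition – the fitted model fit, the numeric outcome y and the partitions are hypothetical:

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(train$y, predict(fit, newdata = train))             # error on data the model has already seen
rmse(validation$y, predict(fit, newdata = validation))   # error on new data

A validation error much larger than the training error is the classic sign of over-fitting.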
References
1. “Data Mining for Business Intelligence” by G.
Shmueli, N. Patel and P. Bruce
2. “Applied Predictive Modeling” by M. Kuhn
and K. Johnson
3. “Data Mining and Statistics for Decision
Making” by S. Tuffery
