Sie sind auf Seite 1von 47

Fakultt fr Elektrotechnik und Informatik

Institut fr Verteilte Systeme


Fachgebiet Wissensbasierte Systeme (KBS)

Data Mining I
Summer semester 2017

Lecture 1: Introduction
Lectures: Prof. Dr. Eirini Ntoutsi
Tutorials: Tai Le Quy and Damianos Melidis
Outline

Why to study Data Mining?


Why we need Data Mining?
What is Knowledge Discovery in Databases (KDD) and Data Mining?
Main data mining tasks
Course logististics
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Introduction 2


Why to study Data Mining famous quotes*

A breakthrough in machine learning would be worth ten Microsofts (Bill Gates, Chairman,
Microsoft)
Machine learning is the next Internet (Tony Tether, Director, DARPA)
Machine learning is the hot new thing (John Hennessy, President, Stanford)
Web rankings today are mostly a matter of machine learning (Prabhakar Raghavan, Dir. Research,
Yahoo)
Machine learning is going to result in a real revolution (Greg Papadopoulos, Former CTO, Sun)
Machine learning today is one of the hottest aspects of computer science (Steve Ballmer, CEO,
Microsoft)

*Source: Pedro Domingos http://courses.cs.washington.edu/courses/cse446/15sp/slides/intro.pdf

Data Mining I: Introduction 3


Why to study Data Mining - Data Scientist: The sexiest job of 21st century

If sexy means having rare qualities that are much in demand, data scientists are already
there. They are difficult and expensive to hire and, given the very competitive market for their
services, difficult to retain. There simply arent a lot of people with their combination of
scientific background and computational and analytical skills.

Source: Harvard Business Review. Data Scientist: The Sexiest Job of the 21st Century. October 2012 link

Data Mining I: Introduction 4


Data Mining Data Science Big Data Machine Learning Analytics

New fancy words for knowledge discovery from data


Data mining, machine learning have been focusing on knowledge discovery from data for decades
Well defined set of tasks and solutions
Big data and analytics are more business terms and ill-defined

Big data is like teenage sex: everyone talks about it, nobody really knows
how to do it, everyone thinks everyone else is doing it, so everyone claims
they are doing it.
Source: Dan Ariely, Duke University

Though nowadays we have more data than ever and the infrastructure to deal with it
more opportunities and challenges for data mining and machine learning communities

Data Mining I: Introduction 5


Ever increasing interest

Source: Google trends, query on 19.4.2017


Data Mining I: Introduction 6
Outline

Why to study Data Mining?


Why we need Data Mining?
What is Knowledge Discovery in Databases (KDD) and Data Mining?
Main data mining tasks
Course logististics
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Introduction 7


Why we need Data Mining

Huge amounts of data are collected nowadays from different application domains
We are drowning in information but starving for knowledge John Naibett link
The amount and the complexity of the collected data does not allow for manual analysis.

Supermarkets Banks Biology


Astronomy

IoT
Telecommunication

Internet

Data Mining I: Introduction 8


Examples of data sources: The Internet

Internet users (Source: http://www.internetlivestats.com/internet-users/)

Web 2.0: A world of opinions

Data Mining I: Introduction 9


Examples of data sources: Internet of things

The Internet of Things (IoT) is the network of physical objects or "things" embedded with electronics,
software, sensors, and network connectivity, which enables these objects to collect and exchange
data.
Source: https://en.wikipedia.org/wiki/Internet_of_Things

During 2008, the number of things connected to the


internet surpassed the number of people on earth By
2020 there will be 50 billion vs 7.3 billion people (2015).

These things are everything, smartphones, tablets,


refrigerators . cattle.

Source: http://blogs.cisco.com/diversity/the-internet-of-things-
infographic
Image source:http://tinyurl.com/prtfqxf

Data Mining I: Introduction 10


Examples of data sources: data intensive science

Increasingly, scientific breakthroughs will


be powered by advanced computing
capabilities that help researchers
manipulate and explore massive datasets.

-The Fourth Paradigm Microsoft

Examples of e-science applications:


Earth and environment
Health and wellbeing
E.g., The Human Genome Project (HGP)
Citizen science
Scholarly communication
Basic science
E.g., CERN

Slide from:http://research.microsoft.com/en-us/um/people/gray/talks/nrc-cstb_escience.ppt
Data Mining I: Introduction 11
From data to knowledge

Data Methods Knowledge

Call records Outlier Detection Detect fraud cases

Bank transactions Classification Customer credibility


for loan applications

Customer transactions Which products people


Association rules
from supermarkets/ tend to buy together?
online stores

Telescope images What is the class


Classification of a star? E.g., early, intermediate
or late formation
Data Mining I: Introduction 12
Outline

Why to study Data Mining?


Why we need Data Mining?
What is Knowledge Discovery in Databases (KDD) and Data Mining?
Main data mining tasks
Course logististics
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Introduction 13


What is KDD

Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data.
[Fayyad, Piatetsky-Shapiro, and Smyth 1996]

Remarks:
valid: the discovered patterns should also hold for new, previously unseen problem instances.
novel: at least to the system and preferably to the user
potentially useful: they should lead to some benefit to the user or task
ultimately understandable: the end user should be able to interpret the patterns either immediately or after
some post-processing

Data Mining I: Introduction 14


Data
Selection:
Select a relevant dataset or

Data Mining I: Introduction


focus on a subset of a dataset
File / DB/
Target data

Preprocessing/Cleaning:
Integration of data from
different data sources
Noise removal
[Fayyad, Piatetsky-Shapiro & Smyth, 1996]

Missing values
Preprocessed data

Transformation:
Select useful features
Feature transformation/
discretization
Dimensionality reduction
The KDD process and the Data Mining step

Data Mining:
Transformed data

Search for patterns of


interest

Evaluation:
Evaluate patterns based on
interestingness measures
Patterns

Statistical validation of the


Models
Visualization
Descriptive Statistics
15
Knowledge
The interdisciplinary nature of KDD 1/2

Statistics
Machine Learning
Data visualization

KDD Databases

Pattern recognition

Algorithms Other disciplines

Data Mining I: Introduction 16


The interdisciplinary nature of KDD 2/2

Statistics Machine Learning


Model based inference Theory + methods
Focus on numerical data Focus on small datasets
[Berthold & Hand 1999] [Mitchell 1997]

KDD
Databases
Scalability to large data sets
New data types (web data, micro-arrays, social data ...)
Integration with commercial databases
[Chen, Han & Yu 1996]

Data Mining I: Introduction 17


Terminology drift: data mining data science Source: http://www.kdd.org/

Source: Google trends, query on 19.4.2017


Data Mining I: Introduction 18
Outline

Why to study Data Mining?


Why we need Data Mining?
What is Knowledge Discovery in Databases (KDD) and Data Mining?
Main data mining tasks
Course logististics
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Introduction 19


Supervised vs Unsupervised learning

There are two different ways of learning from data:


Unsupervised learning/ Descriptive:
Discover groups of similar objects within the data
Rely on the characteristics/ features of the data
There is no a priori knowledge about the partitioning of the data.
e.g., Clustering, Outlier detection, Association rules
Supervised learning/ Predictive:
Learns to predict output from input.
The output/ class labels is predefined, e.g. in a loan application it might be yes or no.
A set of labeled examples (training set) is provided as input to the learning model. The goal of the model is to extract some
kind of rules for labeling future data.
e.g., Classification, Regression, Outlier detection

The majority of the methods operate on the so called feature vectors, i.e., vectors of numerical features.
There are numerous methods though that work on other type of data like text, sets, graphs
Data Mining I: Introduction 20
Clustering definition

Clustering can be defined as the decomposition of a set of objects into subsets of similar objects (the
so called clusters)
Given a set of data points, each having a set of attributes, and a similarity measure among them, find
clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.

The different clusters represent different classes of objects; the number of the classes and their
meaning is not known in advance.

Data Mining I: Introduction 21


Clustering: an example

Height [cm]

Cluster 2 Cluster 1

Width[cm]

Each point described in terms of its height and width


No information on the actual classes (nails, paper clips) is available to the clustering algorithm.

Data Mining I: Introduction 22


Clustering applications 1/2

Application: Market Segmentation


Goal: subdivide a market into distinct subsets of customers where any subset may
conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related
information.
E.g., age, income, education, family status, .

Find clusters of similar customers.


Measure the clustering quality by observing buying patterns of customers in same cluster vs.
those from different clusters.

Data Mining I: Introduction 24


Clustering applications 2/2

Application: Document clustering


Find groups of documents (topics) that are similar to each other based on the important
terms appearing in them.
Approach:
Identify important terms in each document.
Form a similarity measure between documents.
Cluster based on the similarity measure.

Gain:
Help the end user to navigate in the collection of documents (based on the extracted clusters).
Utilize the clusters to relate a new document or search term to clustered documents.

Check for example, Google News.

Data Mining I: Introduction 25


Example: Google News

Data Mining I: Introduction 26


Classification definition

Given a collection of records (training set)


Each record contains a set of attributes, one of the attributes is the class attribute.
The class variable is nominal (categorical), e.g, {fraud,normal}, {churn, no churn}, {yes, no}

Find a model for class attribute as a function of the values of other attributes.
The goal is to learn from the already labeled training data, the rules to classify new
objects based on their attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided
into training and test, with training set used to build the model and test set used to validate it.

Data Mining I: Introduction 27


Classification: an example
Height [cm]

Screw
Training data
Nails
Paper clips

New object

Width[cm]
The goal is to learn a mapping from the height, width space to the class space (nails, screw,paper
clips)
For the new objects, the result of the classification if one of the class labels {nails, screw,paper clips}

Data Mining I: Introduction 28


Classification applications 1/2

Application: Fraud Detection


Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on its account-holder as attributes.
When does a customer buy, what does he buy, how often he pays on time, etc
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.

Data Mining I: Introduction 29


Classification applications 2/2

Application: Churn prediction in telco


Goal: Predict whether a customer is likely to be lost to a competitor
Approach:
Use detailed record of transactions with each of the past and present customers, to find attributes.
How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital
status, etc.
Label the customers as loyal or disloyal (class attribute).
Find a model for customer loyalty
Use this model to predict churn and organize possible retain strategies.

Data Mining I: Introduction 30


Outlier detection

Outlier detection is defined as identification of non-typical data


Outliers might indicate
possible abuse of credit cards, mobile phones
data errors
device failures

Data Mining I: Introduction 35


Application

Analysis of the SAT.1-Ran-Soccer-Database (Season 1998/99)


375 players
Primary attributes: Name, #games, #goals, playing position (goalkeeper, defense, midfield, offense),
Derived attribute: Goals per game
Outlier analysis (playing position, #games, #goals)
Result: Top 5 outliers

Rank Name # games #goals position Explanation


1 Michael Preetz 34 23 Offense Top scorer overall
2 Michael Schjnberg 15 6 Defense Top scoring defense player
3 Hans-Jrg Butt 34 7 Goalkeeper Goalkeeper with the most goals
4 Ulf Kirsten 31 19 Offense 2nd scorer overall
5 Giovanne Elber 21 13 Offense High #goals/per game

Note: Outliers is not necessarily a negative term.


Data Mining I: Introduction 36
Regression

Predict a value of a given continuous valued variable based on the values of other variables,
assuming a linear or nonlinear model of dependency.
Similar to classification, but the feature-result to be learned is continuous rather than discrete

Given this data, a friend has a house 750 square


feet - how much can they be expected to get?

Data Mining I: Introduction 37


Application: Precision farming

production
production
curve
Soil parameters
Weather
Fertilizers Water capacity

Fertilizers

Create a production curve depending on multiple parameters like soil characteristics, weather, used
fertilizers.
Only the appropriate amount of fertilizers given the environmental settings (soil, weather) will result
in maximum yield.
Controlling the effects of over-fertilization on the environment is also important
Data Mining I: Introduction 38
Association rules

a,b,c,d,e
a,b,c,d,e
b,c,d
b,c,d In 5 out of 5 cases (100 %)
a,b,c,d
a,b,c,d In 5 out of 7 cases it holds that:
a,b,c,d,e
a,b,c,d,e (~71 %) If b,c then d also exists.
a,c,e,f
a,c,e,f b,c,d appear together
d,c,e,f
d,c,e,f
a,b,c,d,f
a,b,c,d,f

Task: Find all rules in the database, in the following form:


If x, y, z are contained in a set M, then t is also contained in M with a probability of at least X%.

Data Mining I: Introduction 39


Application: Market basket analysis

Possible generalizations:
Paprika-Chips Snacks
Enrichment of customer data

Data
Warehouse
Shopping basket

Association rules

Result:
Frequently purchased items together may be better to be positioned close to each other: E.g. since diapers
are often purchased together with beers => Place beer in the way from diapers to the checkout
Generate recommendations for customers with similar baskets:
=> e.g. Customers that bought Star Wars, might be also interested in The lord of the rings .

Data Mining I: Introduction 40


Outline

Why to study Data Mining?


Why we need Data Mining?
What is Knowledge Discovery in Databases (KDD) and Data Mining?
Main data mining tasks
Course logististics
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Introduction 41


About the course 1/2

Class schedule
Lectures: Wednesdays, 14:00 - 15:30, Multimedia-Hrsaal (3703 - 023), Appelstrae 4.
Exercises: directly after, i.e., Wednesdays, 15:45 - 16:30, Multimedia-Hrsaal (3703 - 023), Appelstrae 4.
StudIP as a common announcement place
Up to date announcements and material
Office hours:
Eirini: Fridays, 13:00-14:00, Appelstrae 4, 2nd floor, room 203.
Tai: Thursdays, 14:00-15:00, room 1535 Appelstrae 9, 15th floor.
Damianos: Thursdays, 14:00-15:00, room 240 Appelstrae 4, 2nd floor.
Ideally sent us an email before ({ntoutsi, tai, melidis}@l3s.de). Use Data Mining I in the title.
Exam:
Written exam
The exam will be based on the material discussed in the class plus the tutorials.
Exam date: 8.9.2017

Data Mining I: Introduction 42


About the course 2/2

Tutorials
In the class, you (start) solving the weekly exercises with help from the TAs
You submit the completed solutions till next week (on Tuesday)
You (or, TAs) present the best solution of previous week exercises (this takes place first)
Bonus 10%
Project
Focus on the complete KDD pipeline for two learning tasks: classification and clustering
First project will be announced on lecture 5
Bonus 15%

Data Mining I: Introduction 43


Overview of the lectures (current planning)

1. Introduction

2. Feature spaces

3. Association Rules

4. Classification

5. Clustering

6. Outlier Detection

? Regression

Data Mining I: Introduction 44


Textbook and recommended readings

Textbook:
Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Addison-
Wesley, 2006
Recommended readings
Han J., Kamber M., Pei J., Data Mining: Concepts and Techniques, 3rd ed.,
Morgan Kaufmann, 2011

Mitchell T. M., Machine Learning, McGraw-Hill, 1997

Witten I. H., Frank E., Data Mining: Practical Machine Learning Tools and
Techniques, Morgan Kaufmann Publishers, 2005.

Data Mining I: Introduction 45


Online resources

Mining of Massive Datasets book by Anand Rajaraman and Jerey D. Ullman


http://infolab.stanford.edu/~ullman/mmds.html
Machine Learning class by Andrew Ng, Stanford
http://ml-class.org/
Introduction to Databases class by Jennifer Widom, Stanford
http://www.db-class.org/course/auth/welcome

Kdnuggets: Data Mining and Analytics resources


http://www.kdnuggets.com/

Data Mining I: Introduction 46


Tools

Several options for either commercial or free/ open source tools


Check an up to date list at: http://www.kdnuggets.com/software/suites.html
Commercial tools offered by major vendors
e.g., IBM, Microsoft, Oracle
Free/ open source tools

R
Weka SciPy + NumPy

Elki

Orange
Rapid Miner (free, commercial versions)

Data Mining I: Introduction 47


Outline

Why to study Data Mining?


Why we need Data Mining?
What is Knowledge Discovery in Databases (KDD) and Data Mining?
Main data mining tasks
Course logististics
Things you should know from this lecture
Homework/ Tutorial

Data Mining I: Introduction 48


Things you should know from this lecture

KDD definition
KDD process
DM step
Supervised vs Unsupervised learning
Main DM tasks
Clustering
Classification
Regression
Association rules mining
Outlier detection

Data Mining I: Introduction 49


Outline

Why to study Data Mining?


Why we need Data Mining?
What is Knowledge Discovery in Databases (KDD) and Data Mining?
Main data mining tasks
Course logististics
Things you should know from this lecture
Homework/ Tutorial

Data Mining I: Introduction 50


Homework/ Tutorial

Homework: Think of some real world applications that you find suitable for KDD.
Why?
What type of patterns would you look for?
Would you approach it as a supervised or unsupervised learning task?

Readings:
Tan P.-N., Steinbach M., Kumar V book, Chapter 1.
U. Fayyad, et al. (1995), From Knowledge Discovery to Data Mining: An Overview Advances in Knowledge
Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.

Data Mining I: Introduction 51


Acknowledgement

The slides are based on


KDD I lecture at LMU Munich (Johannes Afalg, Christian Bhm, Karsten Borgwardt, Martin Ester, Eshref
Januzaj, Karin Kailing, Peer Krger, Eirini Ntoutsi, Jrg Sander, Matthias Schubert, Arthur Zimek, Andreas
Zfle)
Introduction to Data Mining book slides at http://www-users.cs.umn.edu/~kumar/dmbook/
Pedro Domingos Machine Lecture course slides at the University of Washington
Machine Learning book by T. Mitchel slides at http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html
Old Data Mining course slides at LUH by Prof. Udo Lipeck

Data Mining I: Introduction 52

Das könnte Ihnen auch gefallen