
Relational Data Mining

Donato Malerba

Dipartimento di Informatica
Università degli studi di Bari
malerba@di.uniba.it
http://www.di.uniba.it/~malerba/
Overview
• Single-table assumption
• (Multi-)relational data mining and ILP
• FO representations
• Upgrading propositional DM systems to FOL
• A case study: Mining Association rules
• Conclusions

MRDM – Prof. D. Malerba 2


Standard Data Mining Approach
• Most existing data mining approaches look for patterns in
a single table of data (or DB relation)

ID    Name   First Name  Street      City     Sex  Social Status  Income  Age  Response
3478  Smith  John        38 Lake St  Seattle  M    single         160k    32   Y
3479  Doe    Jane        45 Sea St   Venice   F    married        180k    45   N
…     …      …           …           …        …    …              …       …    …

• Each row represents an object and columns represent properties of objects.
• This is the single-table assumption.
Standard Data Mining Approach
• In the customer table we can add as many attributes about our customers
as we like.
  - A person’s number of children
• For other kinds of information the single-table assumption turns out to be
a significant limitation
  - Add information about orders placed by a customer, in particular
    - Delivery and payment modes
    - With which kind of store the order was placed (size, ownership, location)
  - For simplicity, no information on the goods ordered

ID    Name   First Name  …  Response  Delivery mode  Payment mode  Store size  Store type   Location
3478  Smith  John        …  Y         regular        cash          small       franchise    city
3479  Doe    Jane        …  N         express        credit        large       independent  rural
…     …      …           …  …         …              …             …           …            …
Standard Data Mining Approach
• This solution works fine for once-only customers
• What if our business has repeat customers?
• Under the single-table assumption we can
1. Make one entry for each order in our customer table

ID    Name   First Name  …  Response  Delivery mode  Payment mode  Store size  Store type  Location
3478  Smith  John        …  Y         regular        cash          small       franchise   city
3478  Smith  John        …  Y         express        check         small       franchise   city
…     …      …           …  …         …              …             …           …           …

• We have the usual problems of non-normalized tables
  - Redundancy, anomalies, …
Standard Data Mining Approach
• One line per order ⇒ analysis results will really be about
orders, not customers, which is not what we might want!
2. Aggregate order data into a single tuple per customer.

ID    Name   First Name  …  Response  No. of orders  No. of stores
3478  Smith  John        …  Y         3              2
3479  Doe    Jane        …  N         2              2
…     …      …           …  …         …              …

• No redundancy. Standard DM methods work fine, but
  - There is a lot less information in the new table
  - What if the payment mode and the store type are important?
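
The information loss caused by aggregation can be seen in a minimal sketch. The order rows are hypothetical (chosen so that the counts agree with the aggregated table), and the name `aggregate_orders` is an assumption made for illustration only.

```python
from collections import defaultdict

# Hypothetical order rows: (customer_id, store_id, delivery_mode, payment_mode)
orders = [
    (3478, 12, "regular", "cash"),
    (3478, 19, "regular", "cash"),
    (3478, 12, "express", "check"),
    (3479, 12, "express", "credit"),
    (3479, 19, "regular", "credit"),
]

def aggregate_orders(rows):
    """Collapse order rows into one tuple per customer: (no. of orders, no. of stores)."""
    counts = defaultdict(int)
    stores = defaultdict(set)
    for cust, store, _delivery, _payment in rows:
        counts[cust] += 1
        stores[cust].add(store)
    return {cust: (counts[cust], len(stores[cust])) for cust in counts}

summary = aggregate_orders(orders)
print(summary)
```

After aggregation the delivery and payment modes are no longer available to the mining algorithm, which is exactly the limitation raised in the last bullet.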
Relational Data
• A database designer would represent the information in
our problem as a set of tables (or relations)

Customer table:
ID    Name   First Name  Street      City     Sex  Social Status  Income  Age  Response
3478  Smith  John        38 Lake St  Seattle  M    single         160k    32   Y
3479  Doe    Jane        45 Sea St   Venice   F    married        180k    45   N
…     …      …           …           …        …    …              …       …    …

Order table:
Cust ID  Order ID  Store ID  Delivery mode  Payment mode
3478     213444    12        regular        cash
3478     372347    19        regular        cash
3478     334555    12        express        check
…        …         …         …              …

Store table:
Store ID  Size   Type         Location
12        small  franchise    city
19        large  independent  rural
…         …      …            …
Relational Data Mining
• (Multi-)Relational data mining algorithms can analyze data
distributed in multiple relations, as they are available in
relational database systems.
• These algorithms come from the field of inductive logic
programming (ILP)
• ILP has been concerned with finding patterns expressed as
logic programs
• Initially, ILP focussed on automated program synthesis from
examples
• In recent years, the scope of ILP has broadened to cover
the whole spectrum of data mining tasks (association rules,
regression, clustering, …)



ILP successes in scientific fields
• In the field of chemistry/biology
  - Toxicology
  - Prediction of diterpene classes from nuclear magnetic
    resonance (NMR) spectra
• Analysis of traffic accident data
• Analysis of survey data in medicine
• Prediction of ecological biodegradation rates
The first commercial data mining systems with ILP
technology are becoming available.


Relational patterns
• Relational patterns involve multiple relations from a relational database.
• They are typically stated in a more expressive language than
patterns defined on a single data table.
  - Relational classification rules
  - Relational regression trees
  - Relational association rules
IF Customer(C1,N1,FN1,Str1,City1,Zip1,Sex1,SoSt1,In1,Age1,Resp1)
AND order(C1,O1,S1,Deliv1,Pay1)
AND Pay1 = credit_card
AND In1 ≥ 108000
THEN Resp1 = Yes
Relational patterns
The rule above, written as a logic clause:
good_customer(C1) ←
customer(C1,N1,FN1,Str1,City1,Zip1,Sex1,SoSt1,In1,Age1,Resp1) ∧
order(C1,O1,S1,Deliv1,credit_card) ∧
In1 ≥ 108000
This relational pattern is expressed in a subset of first-order logic!
A relation in a relational database corresponds to a predicate in
predicate logic (see deductive databases)
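
How such a pattern is evaluated over two relations can be sketched as follows. The tuples are reduced to the attributes the pattern uses, the credit_card order of customer 3479 is a hypothetical row, and the income test is assumed to be "greater than or equal to"; `good_customers` is an illustrative name, not part of any system mentioned here.

```python
# Customer tuples reduced to (id, income, response); values follow the customer table.
customers = [
    (3478, 160_000, "Y"),
    (3479, 180_000, "N"),
]

# Order tuples: (customer_id, order_id, store_id, delivery_mode, payment_mode).
# The last row is hypothetical and only serves to make the pattern fire.
orders = [
    (3478, 213444, 12, "regular", "cash"),
    (3478, 372347, 19, "regular", "cash"),
    (3478, 334555, 12, "express", "check"),
    (3479, 400001, 19, "express", "credit_card"),
]

def good_customers(customers, orders, min_income=108_000):
    """Customers with income >= min_income (assumed comparison) and a credit_card order."""
    paid_by_card = {cust for cust, _o, _s, _d, pay in orders if pay == "credit_card"}
    return [cid for cid, income, _resp in customers
            if income >= min_income and cid in paid_by_card]

result = good_customers(customers, orders)
print(result)
```

The existential join on the customer identifier is what a single-table learner cannot express without first flattening or aggregating the order relation.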



Relational decision tree

Equivalent Prolog program:


class(sendback) :- worn(X), not_replaceable(X), !.
class(fix) :- worn(X), !.
class(keep).
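
The cuts make the three clauses behave as an ordered decision list: the first clause whose body succeeds determines the class. A minimal Python sketch of the same behaviour follows; encoding a machine as a list of parts with boolean `worn` and `replaceable` fields is an assumption made for illustration.

```python
def classify(parts):
    """Classify a machine from its parts (hypothetical dict encoding of each part)."""
    # class(sendback) :- worn(X), not_replaceable(X), !.
    if any(p["worn"] and not p["replaceable"] for p in parts):
        return "sendback"
    # class(fix) :- worn(X), !.
    if any(p["worn"] for p in parts):
        return "fix"
    # class(keep).
    return "keep"

machine = [
    {"worn": True, "replaceable": True},
    {"worn": False, "replaceable": False},
]
print(classify(machine))
```

Each test quantifies existentially over the parts of the example, which is what distinguishes a relational decision tree from a propositional one.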
Relational regression rule
[Figure: background knowledge and induced model]


Relational association rule
Relational database

LIKES                HAS                  PREFERS
KID     OBJECT       KID     OBJECT       KID     OBJECT     TO
Joni    ice-cream    Joni    ice-cream    Joni    ice-cream  pudding
Joni    dolphin      Joni    piglet       Joni    pudding    raisins
Elliot  piglet       Elliot  ice-cream    Joni    giraffe    gnu
Elliot  gnu                               Elliot  lion       ice-cream
Elliot  lion                              Elliot  piglet     dolphin

likes(KID, piglet), likes(KID, ice-cream) ⇒ likes(KID, dolphin) (9%, 85%)
likes(KID, A), has(KID, B) ⇒ prefers(KID, A, B) (70%, 98%)


First-order representations
• An example is a set of ground facts, that is a set of tuples
in a relational database
• From the logical point of view this is called a (Herbrand)
interpretation because the facts represent all atoms
which are true for the example, thus all facts not in the
example are assumed to be false.
• From the computational point of view each example is a
small relational database or a Prolog knowledge base
• A Prolog interpreter can be used for querying an example.



FO representation (ground clauses)
• Example:
eastbound(t1) :-
car(t1,c1),rectangle(c1),short(c1),none(c1),two_wheels(c1),
load(c1,l1),circle(l1),one_load(l1),
car(t1,c2),rectangle(c2),long(c2),none(c2),three_wheels(c2),
load(c2,l2),hexagon(l2),one_load(l2),
car(t1,c3),rectangle(c3),short(c3),peaked(c3),two_wheels(c3),
load(c3,l3),triangle(l3),one_load(l3),
car(t1,c4),rectangle(c4),long(c4),none(c4),two_wheels(c4),
load(c4,l4),rectangle(l4),three_loads(l4).
• Background theory:
polygon(X) :- rectangle(X).
polygon(X) :- triangle(X).
• Hypothesis:
eastbound(T) :- car(T,C),short(C),not none(C).


Background knowledge

• As background knowledge is visible for each
example, all the facts that can be derived from the
background knowledge and an example are part of
the extended example.
• Formally, an extended example is the minimal
Herbrand model of the example and the
background theory.
• When querying an example, it suffices to assert the
background knowledge and the example; the Prolog
interpreter will do the necessary derivations.
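
For the two polygon rules of the trains background theory, the extended example can be computed by simple forward chaining to a fixpoint. The sketch below is deliberately restricted to rules with one unary head and one unary body literal; that restriction, the tuple encoding of facts and the name `extend_example` are assumptions made for illustration.

```python
def extend_example(facts, rules):
    """Close a set of ground facts under rules of the form head(X) :- body(X)."""
    model = set(facts)
    changed = True
    while changed:  # repeat until no new fact can be derived
        changed = False
        for head, body in rules:
            for fact in list(model):
                if fact[0] == body and len(fact) == 2:
                    derived = (head, fact[1])
                    if derived not in model:
                        model.add(derived)
                        changed = True
    return model

# Part of the t1 example, encoded as (predicate, arg, ...) tuples.
example = {
    ("car", "t1", "c1"), ("rectangle", "c1"), ("load", "c1", "l1"), ("circle", "l1"),
    ("car", "t1", "c3"), ("rectangle", "c3"), ("load", "c3", "l3"), ("triangle", "l3"),
}
# polygon(X) :- rectangle(X).   polygon(X) :- triangle(X).
background = [("polygon", "rectangle"), ("polygon", "triangle")]

extended = extend_example(example, background)
print(sorted(f for f in extended if f[0] == "polygon"))
```

A Prolog interpreter derives the same facts on demand by backward chaining, so the extended example does not have to be materialized in practice.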



Learning from interpretations
• The ground-clause representation is characteristic of an ILP
setting known as learning from interpretations.
• Similar to older work on structural matching.
• It is common to several relational data mining systems,
such as
  - CLAUDIEN: searches for a set of clausal regularities that hold on
    the set of examples
  - TILDE: top-down induction of logical decision trees
  - ICL: Inductive Constraint Logic (upgrade of CN2)
• It contrasts with the classical ILP setting employed by
the systems PROGOL and FOIL.


FO representation (flattened)
• Example:
eastbound(t1).
• Background theory:
car(t1,c1).      car(t1,c2).        car(t1,c3).      car(t1,c4).
rectangle(c1).   rectangle(c2).     rectangle(c3).   rectangle(c4).
short(c1).       long(c2).          short(c3).       long(c4).
none(c1).        none(c2).          peaked(c3).      none(c4).
two_wheels(c1).  three_wheels(c2).  two_wheels(c3).  two_wheels(c4).
load(c1,l1).     load(c2,l2).       load(c3,l3).     load(c4,l4).
circle(l1).      hexagon(l2).       triangle(l3).    rectangle(l4).
one_load(l1).    one_load(l2).      one_load(l3).    three_loads(l4).
• Hypothesis:
eastbound(T) :- car(T,C),short(C),not none(C).
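
The coverage test for this hypothesis against the flattened facts can be sketched as follows. Only the predicates used by the hypothesis are encoded, negation is read as negation as failure, and the function name `covers` is an assumption made for illustration.

```python
# Facts from the flattened representation, encoded as (predicate, arg, ...) tuples.
facts = {
    ("car", "t1", "c1"), ("car", "t1", "c2"), ("car", "t1", "c3"), ("car", "t1", "c4"),
    ("short", "c1"), ("long", "c2"), ("short", "c3"), ("long", "c4"),
    ("none", "c1"), ("none", "c2"), ("peaked", "c3"), ("none", "c4"),
}

def covers(train, facts):
    """eastbound(T) :- car(T,C), short(C), not none(C)."""
    cars = [f[2] for f in facts if f[0] == "car" and f[1] == train]
    # 'not none(C)' succeeds when none(C) is absent from the facts.
    return any(("short", c) in facts and ("none", c) not in facts for c in cars)

print(covers("t1", facts))
```

Train t1 is covered because car c3 is short and has a peaked roof; removing that car leaves only short cars without a roof and long cars, so the test fails.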



FO representation (terms)

• Example:
eastbound([c(rectangle,short,none,2,l(circle,1)),
           c(rectangle,long,none,3,l(hexagon,1)),
           c(rectangle,short,peaked,2,l(triangle,1)),
           c(rectangle,long,none,2,l(rectangle,3))]).
• Background theory: empty
• Hypothesis:
eastbound(T) :- member(C,T),arg(2,C,short),
                not arg(3,C,none).


FO representation (strongly typed)
• Type signature:
data Shape  = Rectangle | Hexagon | …; data Length = Long | Short;
data Roof   = None | Peaked | …; data Object = Circle | Hexagon | …;
type Wheels = Int; type Load = (Object,Number); type Number = Int
type Car    = (Shape,Length,Roof,Wheels,Load); type Train = [Car];
eastbound::Train->Bool;
• Example:
eastbound([(Rectangle,Short,None,2,(Circle,1)),
           (Rectangle,Long,None,3,(Hexagon,1)),
           (Rectangle,Short,Peaked,2,(Triangle,1)),
           (Rectangle,Long,None,2,(Rectangle,3))]) = True
• Hypothesis: eastbound(t) = (exists \c -> member(c,t) &&
     proj2(c)==Short && proj3(c)!=None)
• Example language: Escher™ functional logic programming


FO representation (database)
LOAD_TABLE
LOAD  CAR  OBJECT     NUMBER
l1    c1   circle     1
l2    c2   hexagon    1
l3    c3   triangle   1
l4    c4   rectangle  3
…     …    …          …

TRAIN_TABLE
TRAIN  EASTBOUND
t1     TRUE
t2     TRUE
…      …
t6     FALSE
…      …

CAR_TABLE
CAR  TRAIN  SHAPE      LENGTH  ROOF    WHEELS
c1   t1     rectangle  short   none    2
c2   t1     rectangle  long    none    3
c3   t1     rectangle  short   peaked  2
c4   t1     rectangle  long    none    2
…    …      …          …       …       …

SELECT DISTINCT TRAIN_TABLE.TRAIN FROM TRAIN_TABLE, CAR_TABLE
WHERE TRAIN_TABLE.TRAIN = CAR_TABLE.TRAIN AND
CAR_TABLE.LENGTH = 'short' AND CAR_TABLE.ROOF != 'none'
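
The query can be tried on the rows shown in the tables with Python's built-in `sqlite3` module. This is a sketch: only the rows visible on the slide are loaded, so trains t2 and t6 have no cars here, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TRAIN_TABLE (TRAIN TEXT, EASTBOUND TEXT);
    CREATE TABLE CAR_TABLE (CAR TEXT, TRAIN TEXT, SHAPE TEXT,
                            LENGTH TEXT, ROOF TEXT, WHEELS INTEGER);
    INSERT INTO TRAIN_TABLE VALUES ('t1', 'TRUE'), ('t2', 'TRUE'), ('t6', 'FALSE');
    INSERT INTO CAR_TABLE VALUES
        ('c1', 't1', 'rectangle', 'short', 'none',   2),
        ('c2', 't1', 'rectangle', 'long',  'none',   3),
        ('c3', 't1', 'rectangle', 'short', 'peaked', 2),
        ('c4', 't1', 'rectangle', 'long',  'none',   2);
""")

# The hypothesis as an SQL query: trains with a short car whose roof is not 'none'.
rows = conn.execute("""
    SELECT DISTINCT TRAIN_TABLE.TRAIN FROM TRAIN_TABLE, CAR_TABLE
    WHERE TRAIN_TABLE.TRAIN = CAR_TABLE.TRAIN AND
          CAR_TABLE.LENGTH = 'short' AND CAR_TABLE.ROOF != 'none'
""").fetchall()
covered = [row[0] for row in rows]
print(covered)
conn.close()
```

The join between TRAIN_TABLE and CAR_TABLE plays the role of the shared variable T in the Prolog hypothesis.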



Individual-centered representation
• The database contains information on a number of
trains.
• Each train is an individual.
• The database can be partitioned according to individual
to obtain a ground-clause representation
• Problem: sometimes individuals share common parts.
• Example: we want to discriminate
black and white figures on the basis of their
position.
Each geometric figure is an individual.


Object-centered representation

The whole sequence is an object, which can be represented by
a multiple-head ground clause:

black(x11) ∧ black(x12) ∧ white(x13) ∧ black(x14) :-
    first(x11), crl(x11), next(x12,x11), crl(x12),
    sqr(x13), crl(x14), next(x14,x13), next(x13,x12)

This is the representation adopted in ATRE.


How to upgrade propositional DM algorithms to first-order
1. Identify the propositional DM system that best matches the DM task
2. Use interpretations to represent examples
3. Upgrade the representation of propositional hypotheses: replace
attribute-value tests with first-order literals and modify the coverage test
accordingly.
4. Structure the search space by a more-general-than relation that
works on first-order representations
  - θ-subsumption
5. Adapt the search operators for searching the corresponding rule
space
6. Use a declarative bias mechanism to limit the search space
7. Implement
8. Evaluate your (first-order) implementation on propositional and
relational data
9. Add interesting extra features
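
The more-general-than relation of step 4 can be sketched as a naive θ-subsumption test between two conjunctions of atoms: C θ-subsumes D if some substitution θ makes every atom of Cθ occur in D. The encoding is an assumption made for illustration: atoms are tuples, variables are strings starting with an upper-case letter, and the variables of D are treated as constants.

```python
def is_var(term):
    """Prolog convention: variables start with an upper-case letter."""
    return term[:1].isupper()

def theta_subsumes(c, d):
    """True if some substitution maps every atom of c onto an atom of d."""
    d = list(d)

    def match(atoms, theta):
        if not atoms:
            return True
        first, rest = atoms[0], atoms[1:]
        for target in d:
            if target[0] != first[0] or len(target) != len(first):
                continue
            new_theta = dict(theta)
            ok = True
            for s, t in zip(first[1:], target[1:]):
                if is_var(s):
                    # Bind s to t, or check consistency with an earlier binding.
                    if new_theta.setdefault(s, t) != t:
                        ok = False
                        break
                elif s != t:
                    ok = False
                    break
            if ok and match(rest, new_theta):
                return True
        return False

    return match(list(c), {})

general = [("is_a", "X", "large_town")]
specific = [("is_a", "X", "large_town"), ("intersects", "X", "R"), ("is_a", "R", "road")]
print(theta_subsumes(general, specific), theta_subsumes(specific, general))
```

The backtracking search is exponential in the worst case, which is one reason why step 6 asks for a declarative bias that keeps the hypotheses small.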
Mining association rules: a case study
A set I of literals called items.
A set D of transactions t such that t ⊆ I.
X ⇒ Y (s%, c%) Association rule
"IF a pattern X appears in a transaction t, THEN the pattern
Y tends to hold in the same transaction t"
• X ⊂ I, Y ⊂ I, X ∩ Y = ∅
• s% = p(X ∪ Y) support
• c% = p(Y|X) = p(X ∪ Y) / p(X) confidence
Agrawal, Imielinski & Swami.
Mining association rules between sets of items in large databases.
Proc. SIGMOD 1993
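
Support and confidence can be computed directly from these definitions, reading p(X ∪ Y) as the fraction of transactions that contain every item of X and Y. The five transactions below are hypothetical, and the function names are assumptions made for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """p(Y|X) = support(X ∪ Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Hypothetical market baskets.
transactions = [frozenset(t) for t in (
    {"bread", "butter", "cheese"},
    {"bread", "cheese", "beer"},
    {"bread", "butter"},
    {"beer"},
    {"bread", "butter", "cheese", "beer"},
)]

s = support({"bread", "butter", "cheese"}, transactions)
c = confidence({"bread", "butter"}, {"cheese"}, transactions)
print(f"support={s:.2f} confidence={c:.2f}")
```

Here two of the five baskets contain bread, butter and cheese, and two of the three baskets with bread and butter also contain cheese.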



What is an association rule?
Example: market basket analysis.
Each transaction is the list of items bought by a customer on a
single visit to a store. It is represented as a row in a table

   Bread  Butter  Cheese  Beer
1  yes    yes     yes     no
2  yes    no      yes     yes
3  …      …       …       …

IF a customer buys bread and butter THEN he also buys cheese
(20%, 66%)
means: 20% of customers buy bread, butter and cheese, and
66% of customers who buy bread and butter also buy cheese
Mining association rules
The propositional approach

Problem statement
Given:
• a set of transactions D
• a pair of thresholds, minsup and minconf
Find
all association rules that have support and
confidence greater than minsup and minconf
respectively.



Mining association rules
The propositional approach

Problem decomposition
• Find large (or frequent) itemsets
• Generate highly-confident association rules

Representation issues
• The transaction set D may be a data file, a relational
table or the result of a relational expression
• Each transaction is a binary vector



Mining association rules
The propositional approach

Solution to the first sub-problem


The APRIORI algorithm (Agrawal & Srikant, 1999)
Find large 1-itemsets
Cycle on the size (k>1) of the itemsets
  - APRIORI-gen: generate candidate k-itemsets from
    large (k-1)-itemsets
  - Generate large k-itemsets from candidate k-itemsets
    (cycle on the transactions in D)
until no more large itemsets are found.
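
The level-wise loop can be sketched in a few lines. This is a simplified illustration rather than the published algorithm: candidate generation joins any two large (k-1)-itemsets whose union has size k, the threshold test is assumed to be "support ≥ minsup", and the transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    n = len(transactions)

    def frequent(candidates):
        # One pass over the transactions per level to count candidate supports.
        result = {}
        for cand in candidates:
            sup = sum(cand <= t for t in transactions) / n
            if sup >= minsup:
                result[cand] = sup
        return result

    large = frequent({frozenset([i]) for t in transactions for i in t})
    all_large = dict(large)
    k = 2
    while large:
        prev = set(large)
        # Join step: unions of two large (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        large = frequent(candidates)
        all_large.update(large)
        k += 1
    return all_large

transactions = [frozenset(t) for t in (
    {"bread", "butter", "cheese"},
    {"bread", "cheese", "beer"},
    {"bread", "butter"},
    {"beer"},
    {"bread", "butter", "cheese", "beer"},
)]
large_itemsets = apriori(transactions, minsup=0.5)
print(sorted(sorted(s) for s in large_itemsets))
```

With these baskets the candidate {bread, butter, cheese} is pruned without counting, because its subset {butter, cheese} is not large.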
Mining association rules
The propositional approach

Solution to the second sub-problem


• For every large itemset Z, find all non-empty subsets X
of Z
• For every subset X, output a rule of the form X ⇒ (Z − X) if
support(Z)/support(X) ≥ minconf.
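
Rule generation from the large itemsets can be sketched as follows. The supports are hypothetical, the subset X = Z is skipped because its consequent would be empty (an assumption of this sketch), and `generate_rules` is an illustrative name.

```python
from itertools import combinations

def generate_rules(large, minconf):
    """Return (X, Z - X, support, confidence) for every rule meeting minconf."""
    rules = []
    for z, sup_z in large.items():
        for size in range(1, len(z)):  # non-empty proper subsets only
            for x in map(frozenset, combinations(sorted(z), size)):
                # Every subset of a large itemset is large, so large[x] exists.
                conf = sup_z / large[x]
                if conf >= minconf:
                    rules.append((x, z - x, sup_z, conf))
    return rules

# Hypothetical large itemsets with their supports.
large = {
    frozenset({"bread"}): 0.8,
    frozenset({"butter"}): 0.6,
    frozenset({"cheese"}): 0.6,
    frozenset({"bread", "butter"}): 0.6,
    frozenset({"bread", "cheese"}): 0.6,
}
rules = generate_rules(large, minconf=0.9)
for x, y, sup, conf in rules:
    print(sorted(x), "=>", sorted(y), f"({sup:.0%}, {conf:.0%})")
```

No further pass over the transactions is needed in this step, since all required supports were computed while finding the large itemsets.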

Relevant work
Agrawal & Srikant (1999). Fast Algorithms for Mining Association Rules,
in Readings in Database Systems, Morgan Kaufmann Publishers.
Han & Fu (1995). Discovery of Multiple-Level Association Rules from
Large Databases, in Proc. 21st VLDB Conference



Mining association rules
The ILP approach

Problem statement
Given:
• a deductive relational database D
• a pair of thresholds, minsup and minconf
Find
all association rules that have support and
confidence greater than minsup and minconf
respectively.



Mining association rules
The ILP approach

Problem decomposition
• Find large (or frequent) atomsets
• Generate highly-confident association rules

Representation issues
A deductive relational database is a relational database
which may be represented in first-order logic as follows:
• Relation → Set of ground facts (EDB)
• View → Set of rules (IDB)


Mining association rules
The ILP approach
Example: relational database

LIKES                HAS                  PREFERS
KID     OBJECT       KID     OBJECT       KID     OBJECT     TO
Joni    ice-cream    Joni    ice-cream    Joni    ice-cream  pudding
Joni    dolphin      Joni    piglet       Joni    pudding    raisins
Elliot  piglet       Elliot  ice-cream    Joni    giraffe    gnu
Elliot  gnu                               Elliot  lion       ice-cream
Elliot  lion                              Elliot  piglet     dolphin

likes(joni, ice-cream)                        (atom)
likes(KID, piglet), likes(KID, ice-cream)     (atomset)

likes(KID, piglet), likes(KID, ice-cream) ⇒ likes(KID, dolphin) (9%, 85%)
likes(KID, A), has(KID, B) ⇒ prefers(KID, A, B) (70%, 98%)
Mining association rules
The ILP approach

Solution to the first sub-problem


The WARMR algorithm (Dehaspe & De Raedt, 1997)
L. Dehaspe & L. De Raedt (1997). Mining Association Rules in Multiple
Relations, Proc. Conf. Inductive Logic Programming
Compute large 1-atomsets
Cycle on the size (k>1) of the atomsets
  - WARMR-gen: generate candidate k-atomsets from large
    (k-1)-atomsets
  - Generate large k-atomsets from candidate k-atomsets
    (cycle on the observations loaded from D)
until no more large atomsets are found.


Mining association rules
The ILP approach

WARMR                                   APRIORI
• Breadth-first search on               • Breadth-first search on
  the atomset lattice                     the itemset lattice
• Loading of an observation o           • Loading of a transaction t
  from D (query result)                   from D (tuple)
• Largeness of candidate atomsets       • Largeness of candidate itemsets
  computed by a coverage test             computed by a subset check


Mining association rules
The ILP approach
Pattern space (patterns listed from most specific to most general)

false
Q1 = is_a(X, large_town) ∧ intersects(X, R) ∧ is_a(R, road)
Q2 = is_a(X, large_town) ∧ intersects(X, Y)
Q3 = is_a(X, large_town)
true


Mining association rules
The ILP approach
Candidate generation

Refinement step (operator under θ-subsumption):
is_a(X, large_town), intersects(X,R), is_a(R, road)
    ↓
is_a(X,large_town), intersects(X,R), is_a(R,road), adjacent_to(X,W), is_a(W, water)

Pruning step:
Does it θ-subsume infrequent patterns? (yes / no)
Mining association rules
The ILP approach
Candidate evaluation
is_a(X, large_town), intersects(X,R), is_a(R, road), adjacent_to(X,W), is_a(W, water)

Query posed to the database D:
?- is_a(X, large_town),
   intersects(X,R), is_a(R, road),
   adjacent_to(X,W), is_a(W, water)

Answers:
<X=barletta,R=a14,W=adriatico>
<X=bari,R=ss16bis,W=adriatico>
...

Large? (yes / no)
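
Candidate evaluation amounts to running the atomset as a conjunctive query and counting how many observations it covers. The sketch below uses a naive backtracking evaluator over ground facts. The facts are hypothetical (chosen to reproduce the two answers listed above, with altamura and ss96 added as an uncovered case), and measuring support as the fraction of large towns bound to X is an assumption.

```python
def is_var(term):
    """Prolog convention: variables start with an upper-case letter."""
    return term[:1].isupper()

def answers(query, facts, theta=None):
    """Yield every substitution that makes all atoms of the query true in facts."""
    theta = theta or {}
    if not query:
        yield theta
        return
    first, rest = query[0], query[1:]
    for fact in facts:
        if fact[0] != first[0] or len(fact) != len(first):
            continue
        new_theta = dict(theta)
        # Variables are bound (or checked against earlier bindings); constants must match.
        if all(new_theta.setdefault(s, t) == t if is_var(s) else s == t
               for s, t in zip(first[1:], fact[1:])):
            yield from answers(rest, facts, new_theta)

# Hypothetical database D.
facts = [
    ("is_a", "barletta", "large_town"), ("is_a", "bari", "large_town"),
    ("is_a", "altamura", "large_town"),
    ("is_a", "a14", "road"), ("is_a", "ss16bis", "road"), ("is_a", "ss96", "road"),
    ("is_a", "adriatico", "water"),
    ("intersects", "barletta", "a14"), ("intersects", "bari", "ss16bis"),
    ("intersects", "altamura", "ss96"),
    ("adjacent_to", "barletta", "adriatico"), ("adjacent_to", "bari", "adriatico"),
]
query = [("is_a", "X", "large_town"), ("intersects", "X", "R"), ("is_a", "R", "road"),
         ("adjacent_to", "X", "W"), ("is_a", "W", "water")]

solutions = list(answers(query, facts))
covered = {s["X"] for s in solutions}
towns = {f[1] for f in facts if f[0] == "is_a" and f[2] == "large_town"}
support = len(covered) / len(towns)
print(solutions, support)
```

The candidate is large when this support reaches minsup; distinct values of X are counted rather than answer tuples, so a town that intersects several roads is still counted once.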
Mining association rules
The ILP approach

is_a(X, large_town), intersects(X,R), is_a(R, road), adjacent_to(X,W), is_a(W, water)

Rule generation

is_a(X, large_town), intersects(X,R), is_a(R, road), is_a(W, water)
⇒ adjacent_to(X,W) (62%, 86%)

High confidence? (yes / no)


Conclusions and future work
• Multi-relational data mining: more data mining than
logic program synthesis
  - choice of representation formalisms
  - input format more important than output format
  - data modelling, e.g. object-oriented data mining
  - new learning tasks and evaluation measures
Reference
Saso Dzeroski and Nada Lavrac, editors,
Relational Data Mining,
Springer-Verlag, September 2001