Beruflich Dokumente
Kultur Dokumente
Expert System
Prof. Glenn Shafer
Fall 2001
Anonymous
1
Abstract
This paper is a literature review of a new algorithm for mining association rules in
a spatial database. Spatial data mining, or knowledge discovery in large spatial
databases, is the process of extracting implicit knowledge, spatial relations, or
other patterns not explicitly stored in spatial databases. Recently, there has been a
lot of research in data mining and these studies led to a set of interesting
techniques, including methods for mining strong association and dependency
rules, attribute-oriented induction for mining characteristic and discriminant rules,
etc. Such studies set a foundation and provide some interesting methods for the
exploration of highly promising spatial data mining techniques.
Based on previous studies on spatial data mining and mining association rules in
transaction-based databases, this paper will introduce and study an interesting
method for mining strong spatial association rules in large spatial databases [6].
Discovery of spatial association rules may disclose interesting relationship among
spatial and non-spatial data in large spatial database and thus it represents a new
and promising direction in spatial data warehousing and spatial data mining.
Basically the method that will be presented in this paper explores efficient mining
of spatial association rules at multiple approximation and abstraction levels. It
proposes first to perform less costly, approximate spatial computation to obtain
approximate spatial relationships at a high abstraction level and then refine the
spatial computation only for those data or predicates whose refined computation
may contribute to the discovery of strong association rules. Such two-step spatial
mining algorithm facilitates mining strong spatial association rules at multiple
concept levels by a top-down, progressive deepening technique [6].
This method is based on the assumption that a user has reasonably good
knowledge on what kind of rules he wants to find from the database, and that
there exists good knowledge, such as concept or operation hierarchies, for spatial
or non-spatial generalization. Such assumptions may rule out naive users and
complex spatial databases with poorly understood structures.
2
1. Introduction to spatial data mining
a. General introduction
A spatial database stores a large amount of space-related data, such as maps,
preprocessed remote sensing or medical imaging data, and VLSI chip layout data.
Spatial databases have many features distinguish them from relational databases.
They carry topological or distance information, usually organized by sophisticated,
multidimensional spatial indexing structures that are accessed by spatial data access
methods and often require spatial reasoning, geometric computation, and spatial
knowledge representation techniques.
Specifically, spatial data mining can also be categorized based on the kinds of rules to
be discovered in spatial databases. A spatial characteristic rule is a general
description of a set of spatial-related data. For example, the description of the general
weather patterns in a set of geographic regions is a spatial characteristic rule. A
spatial discriminant rule is the general description of the contrasting or
discriminating features of a class of spatial-related data from other classes. For
example, the comparison of two weather patterns in two geographic regions is a
spatial discriminant rule. A spatial association rule, which will be discussed in this
paper, is a rule that describes the implication of one or a set of features by another set
of features in spatial databases. For example, a rule like “most cities in Canada are
close to the Canada-US border” is a spatial association rule [6].
There have been some interesting studies related to the mining of spatial databases. In
this paper, I will study the extension of the techniques for mining association rules in
transaction-based databases.
3
A spatial association rule is a rule of the form “A→ B, where A and B are sets of
predicates and some of which are spatial ones. In a large database, many association
relationships may exist, but some of them may occur rarely or may not hold in most
cases. People are only interested in the association rules that occur very strongly, i.e.,
which occur frequently and hold in most cases. Due to this fact, the concepts of
minimum support and minimum confidence are introduced. Informally, the support
of a pattern A in a set of spatial objects S is the probability that a member of S
satisfies pattern A, and the confidence of A → B is the probability that pattern B
occurs if pattern A occurs. A user or an expert may specify thresholds to confine the
rules to be discovered as strong ones.
For example, we may find that 92% of cities within British Columbia (BC) and
adjacent to water are close to USA, which associates predicates is_a, within, and
adjacent_to and spatial predicate close_to in the following format:
is_a(X, city) ∧ within(X, BC) ∧ adjacent_to(X, water) → close_to(X, USA). (92%)
Although such rules are usually not 100% true, they have some nontrivial and
valuable knowledge about spatial associations, and thus it is interesting to discover
them from large spatial databases. Also, various kinds of spatial predicates can
constitute a spatial association rule. Examples include distance information such as
close_to and far_away, topological relations like intersect, overlap and disjoint, and
spatial orientation like left_of and west_of.
4
However, it typically assumes statistical independence among the spatially distributed
data, which is not true in reality since spatial objects are always inter-related. This
assumption violates Tobler’s first law of Geography: everything is related to
everything else, but nearby things are more related than distant things. In other words,
the values of attributes of nearby spatial objects tend to systematically affect each
other. In spatial statistics, an area within statistics devoted to the analysis of spatial
data, is called spatial autocorrelation, where researchers have created, adapted, and
applied statistical techniques to spatial data. For example, in image processing and
vision, Markov Random Field (MRF) is a popular model to incorporate context for
image segmentation and classification.
Also, knowledge mining in image databases, which can be treated as a major type of
spatial databases, has been studied recently. Method for the classification of sky
objects and another method for recognition of volcanoes on the surface of Venus are
studied, where classification trees were used to make final decisions.
Finally, the spatial data mining techniques are closely related to traditional data
mining methods in relational databases. In most cases, we first study mining
algorithms for traditional cases, then apply them to spatial data to see if it is feasible
or if it needs more modifications.
5
a. Deep insights into spatial association rules
Various kinds of spatial predicates can be involved in spatial association rules. They
may represent topological relationships between spatial objects, such as disjoint,
intersects, inside/outside, adjacent_to, covers/covered_by, equal, etc. They may also
represent spatial orientation or ordering, such as left, right, north, east, etc, or some
distance information, such as close_to, far_away, etc.
For deep insights into the mining of spatial association rules, let’s first introduce one
formal definition.
To facilitate the specification of the specification of the primitives for spatial data
mining, an SQL-like spatial data mining query interface, which is designed based on
a spatial SQL, has been proposed to explain the following example, which is
thoroughly studied in [6]:
Where the attribute “geo” represents a spatial object (a point, line, area, etc) whose
spatial pointer is stored in a tuple of the relation (a row of the table) and points to a
6
geographic map. The attribute “type” of a relation is used to categorize the types of
spatial objects in the relation. For example, the types of road could be (national
highway, local highway, street), and the type for water could be (ocean, sea, inlets,
lakes, rivers, bay, creeks). The boundary relation specifies the boundary between two
Administrative regions. The omitted fields could be other pieces of information, such
as the area of a lake and the flow of a river.
Suppose a user is interested in finding within the map of British Columbia (BC) the
strong spatial association relationships between large towns and other “near_by”
objects including mines, country boundaries, water and major highways. The SQL
query could be presented below:
In this query, a relation variable X is used to represent one of a set of four variables
{R, W, M, B}, a predicate close_to(A, B) says that a spatial object A is close to
another spatial object B, and g_closed_to is a predefined generalized predicate which
covers a set of spatial predicates: intersect, adjacent_to, contains, close_to. Moreover,
close_to is a condition dependent predicate and is defined by a set of knowledge
rules, for example, if X is a town and Y is a country, then X is close to Y if their
distance is within 80 mile, however, “close_to” between a town and a road will be
defined by a smaller distance such as 5 miles.
Also, spatial predicates (topological relations) should be arranged into a hierarchy for
computation of approximation spatial relations using coarse resolution at a high
7
concept level and refine the computation when it is confined to a set of more focused
candidate objects. See the following for an example:
g_close_to
not_disjoint close_to
Firstly, the set of relevant data is retrieved by execution of the data retrieval methods
of the data mining query, which extracts the following data sets whose spatial portion
is inside BC: (1) towns: only large towns; (2) roads: only divided highways; (3)
water: only seas, oceans, large lakes and large rivers; (4) mines: any mines; (5)
boundary: only the boundary of BC and USA.
8
Since many people may not be satisfied with approximate spatial relationships, such
as g_close_to, more detailed spatial computation are needed to performed to find the
refined or precise spatial relationship in the spatial predicate hierarchy, thus we have
the following refined computation which is performed on the large predicate sets, i.e.,
those retained in the g_close_to table. Each g_close_to predicate is replaced by one or
a set of concrete predicates such as intersect, adjacent_to, close_to, inside, etc. Such a
process results in Table 2.
Table 2 forms a base for the computation of detailed spatial relationships at multiple
concept levels. Based in this, the level-by-level detailed computation of large
predicates and the corresponding association rules is presented. The computation
starts at the top-most level and computes large predicates at this level. For example,
for each row in the Table 2, i.e., for each large town, if the water attribute is
nonempty, the count of water is incremented by one. Such a count accumulation
forms 1-predicate rows (with k=1) of Table 3 where the support count registered. If
the support count of a row is smaller than the minimum support threshold, the row is
removed from the table. Suppose the minimum support is set to 50% at level 1, a row
whose count is less than 20 is removed. Similarly, the 2-predicate rows (with k=2)
are formed by the pair-wise combination of the large 1-predicates, with their support
counts accumulated by checking against Table 2, and the rows with the count smaller
than the minimum support will be removed. The same procedure applies to 3-
predicates computation. Finally, the computation of large k-predicates results in Table
3.
9
2 <adjacent_to, water>, <intersects, highway> 25
2 <adjacent_to, water>, <close_to, us_boundary> 23
2 <close_to, us_boundary>, <intersects, highway> 26
3 <adjacent_to, water>, <close_to, us_boundary>, 22
<intersects, highway>
Table3: large k-predicates sets at the top concept level (for 40 large towns in BC)
Thirdly, spatial association rules can be extracted directly from Table 3. For
example, since <intersects, highway> has a support count of 29, and <adjacent_to,
water> and <intersects, highway> has count of 25, and 25/29 = 86%, we get the
following association rule: is_a(X, large_town) ∧ intersects(X, highway) →
adjacent_to(X, water). (86%). Since we are only dealing with large towns, is_a(X,
large_town) is added here in the antecedent of the rule. If we set the minimum
confidence threshold at 90%, this rule would have been removed from the list of the
association rules to be generated.
Finally, after mining rules at the highest level of the concept hierarchy, large k-
predicates can be computed in the same way at the lower concept levels, which are
Table 4 and Table 5. And similarly, spatial association rules can be derived directly
from these 2 tables for detail level 2 and 3.
Table 4: large k-predicates sets at the second level (for 40 large towns in BC)
Table 5: large k-predicates sets at the third level (for 40 large towns in BC)
10
For example, the following two rules can be derived from these 2 tables:
is_a(X, large_towns) → adjacent_to(X, seas) (52.5%: 21/40 towns) level 2
is_a(X, large_towns) ∧ adjacent_to(X, George_strait) → close_to(X, US). (78%) level 3
Notice that only the descendants of the large 1-predicates will be examined at a lower
concept level, and the mining process stops at the lowest level of the hierarchies or when
an empty large 1-predicate set is derived.
The above rule mining process can be summarized in the following algorithm:
Algorithm: mining the spatial association rules defined by Definition 1 in a large spatial
database.
Input: a spatial database, a mining query, and a set of thresholds:
1. a database consists of 3 parts: a spatial database SDB containing a set of
spatial objects; a relational database RDB describing nonspatial properties of
spatial objects; and a set of concept hierarchies.
2. a query consists of 3 parts: a reference class S; a set of task-relevant classes
for spatial objects C1, …, Cn; a set of task-relevant spatial relations.
3. two thresholds: minimum support and minimum confidence for each level l of
description.
Output: strong multiple-level spatial association rules for the relevant sets of objects and
relations.
Method: mining spatial association rules proceeds as follows:
Step 1: Task_relevant_DB := extract_task_objects(SDB, RDB);
Step 2: Coarse_predicate_DB :=
coarse_spatial_computation(Task_relevant_DB);
Step 3: Large_coarse_predicate_DB :=
filtering_with_minimum_support(Coarse_predicate_DB);
Step 4: Fine_predicate_DB :=
refined_spatial_computation(Large_coarse_predicate_DB);
Step 5: Find_large_predicates_and_mine_rules(Fine_predicate_DB).
Pseudo code:
where LL[l] is the large predicate set table at level l, and L[l, k] is the large k-predicate
set table at level l. The syntax procedure is similar to Pascal.
(1) Procedure find_large_predicates_and_mine_rules(DB);
(2) for (i :=1; L[i, 1] ≠ 0 and i < max_level; i++) do begin
(3) L[i,1] := get_large_1_predicate_sets(DB,i);
(4) for (k :=2; L[i,k-1] ≠0; k++) do begin
(5) Pk := get_candidate_set(L[i,k-1]);
(6) foreach object s in S do begin
(7) Ps := get_subsets(Pk,s); {Candidates satisfied by s}
(8) foreach candidate p ∈ Ps do p.support++;
(9) end;
(10) L[i,k] := {p ∈Pk|p.support ≥ minsup[i]};
(11) end;
(12) LL[i] := Uk L[i,k];
(13) output := generate_association_rules(LL[i]);
11
(14) end
(15) end
Thirdly, the spatial data mining algorithm developed above has the following major
strength for mining spatial association rules as stated in [6]:
12
4). Detailed spatial computation: performed once and used for knowledge mining at
multiple levels
The computation of support counts at each level can be performed by scanning
through the same computed spatial predicate table.
4. Conclusions
Basically, the algorithm presented in this paper discusses efficient mining procedures
for spatial association rules, which explores techniques at multiple approximation and
abstraction levels. It proposes first to perform less costly, approximate spatial
computation to obtain approximate spatial relationships at a high abstraction level and
then refine the spatial computation only for those data or predicates whose refined
computation may contribute to the discovery of strong association rules. Such two-
13
step spatial mining algorithm facilitates mining strong spatial association rules at
multiple concept levels by a top-down, progressive deepening technique.
This method is based on the assumption that a user has reasonably good knowledge
on what kind of rules he wants to find from the database, and that there exists good
knowledge, such as concept or operation hierarchies, for spatial or non-spatial
generalization. Such assumptions may rule out naive users and complex spatial
databases with poorly understood structures or knowledge, which needs more studies
in the future.
References:
[1] Tom Barclay, Jim Gray and Don Slutz, “Microsoft TerraServer: a spatial data
warehouse,” Proceedings of the 2000 ACM SIGMOD on Management of data,
pages 307-318.
[2] Peter Baumann, “Web-enabled Raster GIS Services for Large Image and Map
Databases,” Proceedings of the ACM DEXA2001, pages 870 - 874.
[4] Volker Coors, Volker Jung, “Using VRML as an Interface to the 3D Data
Warehouse”, Proceedings of the third symposium on Virtual reality modeling
language, 1998, Page 121 - 129 .
[5] Martin Ester, Hans-Peter Kriegel, Jorg Sabder, “Knowledge Discovery in Spatial
Databases”, Invited Paper at 23rd German Conf. on Artificial Intelligence (KI ’99), Bonn,
Germany, 1999..
[7] Shashi Shekhar, Sanjay Chawla, Siva Ravadam Andrew Fetterer, Xuan Liu
and Chang-tien Lu, “Spatial Databases – Accomplishments and Research Needs,”
IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 1, 1999.
14