2015-08
Information Systems
journal homepage: www.elsevier.com/locate/infosys
Article info
Article history: Received 13 January 2014; received in revised form 21 October 2014; accepted 26 January 2015; available online 13 February 2015.
Abstract
Besides integers and short character strings, current applications require storing photos,
videos, genetic sequences, multidimensional time series, large texts and several other
types of complex data elements. Comparing them by equality is a fundamental concept
for the Relational Model, the data model currently employed in most Database Management Systems. However, evaluating the equality of complex elements is usually meaningless,
since exact matches almost never occur. Instead, it is much more meaningful to evaluate
their similarity. Consequently, the set-theoretical concept that a set cannot include the
same element twice also loses its meaning for data sets that must be handled by
similarity, suggesting that a new, equivalent definition is needed for them. Targeting
this problem, this paper introduces the new concept of similarity sets, or SimSets for
short. Specifically, our main contributions are: (i) we highlight the central properties of
SimSets; (ii) we develop the basic algorithm required to create SimSets from metric
datasets, ensuring that they always meet the required properties; (iii) we formally define
unary and binary operations over SimSets; and (iv) we develop an efficient algorithm to
perform these operations. To validate our proposals, the SimSet concept and the tools
developed for them were employed over both real world and synthetic datasets. We
report results on the adequacy, stability, speed and scalability of our algorithms with
regard to increasing data sizes.
© 2015 Elsevier Ltd. All rights reserved.
Keywords:
Metric space
Similarity sets
Independent dominating sets
Graphs
1. Introduction
Traditional Database Management Systems (DBMS) depend
heavily on the concept that a set never includes
the same element twice. In fact, Set Theory is the main
mathematical foundation for most existing data models,
including the Relational Model [1]. However, current applications are much more demanding on the DBMS, requiring it
Corresponding author.
E-mail addresses: ives@icmc.usp.br (I.R.V. Pola),
robson@icmc.usp.br (R.L.F. Cordeiro), caetano@icmc.usp.br (C. Traina),
agma@icmc.usp.br (A.J.M. Traina).
http://dx.doi.org/10.1016/j.is.2015.01.011
0306-4379/© 2015 Elsevier Ltd. All rights reserved.
Table 1
Table of symbols.
Symbol                                      Description
$\mathsf{T}, \mathsf{T}_1, \mathsf{T}_2$    Database relations
$\mathbb{R}, \mathbb{S}, \mathbb{T}$        Data domains
$R, S, T$                                   Sets of elements, from $\mathbb{R}, \mathbb{S}, \mathbb{T}$ respectively
$r_i, s_i, t_i$                             Elements in a set ($r_i \in R$, $s_i \in S$, $t_i \in T$)
$\sigma$                                    Relational selection operator
$\hat{\sigma}$                              Similarity selection operator
$\hat{\delta}$                              Similarity extraction operator
$\xi$                                       Similarity threshold
$d$                                         Distance function (metric)
$\hat{=}_\xi$                               Sufficiently similar operator
$\equiv$                                    Equivalent expression
$\not\equiv$                                Not equivalent expression
$\hat{R}, \hat{S}, \hat{T}$                 ξ-similarity-sets
$\hat{G}_\xi$                               A ξ-similarity graph
$P_{\hat{S}_\xi}(S)$                        Similarity cover of $S$
$\varphi$                                   A ξ-extraction policy
$\hat{\cup}$                                The similarity union operator
$\hat{\cap}$                                The similarity intersection operator
How to handle SimSets on real data?
3. Background
The problem of extracting SimSets that best represent
the original set S involves representing the similarities
AkNNQ (All k-nearest neighbor queries), developing a technique to prune branches that exploits trigonometric properties
to reduce the search space, based on spherical regions. That
technique was also employed by Jacox [14] to improve
similarity join algorithms over high dimensional datasets,
leading to the method called QuickJoin. An improvement on
the QuickJoin method was proposed by Fredriksson et al. [15]
based on the QuickSort algorithm, modifying components of
the previous QuickJoin method to improve the execution time.
The performance of AESA (Approximating and Eliminating
Search Algorithm) and its derived methods was improved by
Ban [16]. It uses geometric properties to define lower bounds
for approximate distances over the metric spaces. Most of the
trigonometric properties derive from geometrical properties
of the embedding space and are applicable only to positive
semi-definite metric spaces. Fortunately, as the authors highlight, many of the commonly used metric spaces are positive
semi-definite [17]. Zhu et al. [18] investigated the cosine
similarity problem. As in several studies that require a minimum threshold to perform the computation, this work proposes to use the cosine similarity to retrieve the top-k most
correlated object pairs. They first identify an upper bound
property, by exploiting a diagonal traversal strategy, and then
they retrieve the required pairs. Some related works also focus
on solving particular problems, such as sensor placement [19]
and classification [20].
The selection of representative objects that are not too
similar to each other has already been addressed in metric
spaces as a selection of pivots. The Sparse Spatial
Selection (SSS) method [21] is a greedy algorithm able to choose a
set of objects (i.e., pivots) that are not too close to each
other, based on a distance threshold. SSS targets the use
of few pivots to improve the efficiency of similarity searches in metric spaces. Although the set of pivots is
in some ways connected to the concept of SimSets, SSS
does not necessarily obtain the same result that the
SimSets may produce, since the latter control the cardinality of the resulting dataset.
Result diversification can improve the quality of results
retrieved by user queries [22]. The usual approach is to
assume a distance threshold that identifies elements
considered very similar in the query result. Then,
similar elements in the result are filtered so that the final result
contains only dissimilar elements [23,24]. In this paper, we
do not focus on filtering or on improving the quality of similarity
queries, but on extracting sets of elements with no similarity
between them under a threshold ξ.
Clustering techniques dealing with complex data usually
explore subspaces of high dimensionality datasets to find
patterns. A recent survey can be found in [25]. Although many
methods are typically super-linear in space or execution time,
the Halite method [26] is a fast and scalable clustering method
that uses a top-down strategy to analyze the point distribution in the full dimensional space, performing a multiresolution partition of the space to create a counting-tree
data structure. The tree is then searched to find clusters
by using convolution masks, and the final clusters are blended
together if they overlap.
Although clustering methods can be adapted to generate SimSets, some limitations come out when the elements are not distributed in clusters, or the clusters do not
($s_i \hat{=}_\xi s_i$) and symmetric ($s_i \hat{=}_\xi s_j \Rightarrow s_j \hat{=}_\xi s_i$), but it is not
transitive. Therefore, for any $s_h, s_i, s_j \in S$ such that $s_h \hat{=}_\xi s_i$
and $s_i \hat{=}_\xi s_j$, it is not guaranteed that $s_h \hat{=}_\xi s_j$. In fact, it is
easy to prove that the $\hat{=}_\xi$ comparison operator is transitive if and only if $\xi = 0$, and in this case $\hat{=}_0$ is equivalent
to $=$.
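The lack of transitivity can be checked on three collinear points. The following is a minimal sketch, assuming the Euclidean distance and reading "sufficiently similar" as $d(s_i, s_j) \leq \xi$; the helper name `is_similar` and the sample points are hypothetical and chosen only for illustration.

```python
import math


def is_similar(a, b, xi):
    """Return True when a and b are xi-similar, i.e. d(a, b) <= xi (Euclidean d assumed)."""
    return math.dist(a, b) <= xi


# Hypothetical points on a line, spaced 1.5 apart, with threshold xi = 2.
s_h, s_i, s_j = (0.0, 0.0), (1.5, 0.0), (3.0, 0.0)
xi = 2.0

first_pair = is_similar(s_h, s_i, xi)   # distance 1.5
second_pair = is_similar(s_i, s_j, xi)  # distance 1.5
outer_pair = is_similar(s_h, s_j, xi)   # distance 3.0
print(first_pair, second_pair, outer_pair)
```

Here the two inner pairs are similar while the outer pair is not, which is exactly the situation in which the operator fails to be transitive.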
The idea that a set never has the same element twice
must be revisited when pairs of ξ-similar elements
are regarded as being the same, so they should not occur
twice in the set. To represent this idea we propose the
concept of similarity-set as follows.
Definition 5.2 (ξ-Similarity-set). A ξ-similarity-set $\hat{S}$ is a
set of elements from a metric domain $M = \langle \mathbb{S}, d \rangle$, $\hat{S} \subset \mathbb{S}$,
1 The symbol ^ superimposed on an operator or set indicates the
similarity-based variant of the operator or set.
Fig. 1. (a) A 2-dimensional toy dataset; (b)–(d) examples of 2-simsets produced from our toy dataset using the Euclidean distance; (e) corresponding
ξ-similarity graph for $\xi = 2$.
$s_i \hat{=}_\xi s_j$. We say that $\hat{S}$ is a ξ-similarity-set (or a ξ-simset
for short) that has $\xi$ as its similarity threshold.
Extracting a ξ-simset $\hat{S}$ from a set $S$ involves representing all ξ-similarities occurring among pairs of elements
in $S$ in a ξ-similarity graph, which is defined as follows.
Definition 5.3 (ξ-Similarity graph). Let $S$ be a dataset in a
metric space and $\xi$ be a similarity threshold. The corresponding ξ-similarity graph $\hat{G}_\xi(S) = \{V, E\}$ is a graph where
each node $v_i \in V$ corresponds to an element $s_i \in S$, and
there is an edge $(v_i, v_j) \in E$ if and only if $s_i \hat{=}_\xi s_j$.
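The graph of Definition 5.3 can be materialized with a quadratic scan over all pairs. The sketch below assumes the Euclidean distance, stores the graph as a dictionary of adjacency sets, and omits self-loops; the function name `sim_graph` and the sample points are hypothetical, not the authors' implementation.

```python
import math
from itertools import combinations


def sim_graph(points, xi, dist=math.dist):
    """Build adjacency sets: nodes i and j are linked iff dist(points[i], points[j]) <= xi."""
    adj = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= xi:
            adj[i].add(j)
            adj[j].add(i)
    return adj


# Hypothetical 2-dimensional dataset: two close pairs and one isolated point.
points = [(0, 0), (1, 0), (5, 5), (6, 5), (20, 20)]
graph = sim_graph(points, xi=2)
print(graph)
```

A node with an empty adjacency set has no ξ-similar partner, so it can belong to every ξ-simset extracted from the dataset.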
Fig. 1(e) illustrates the ξ-similarity graph for the example dataset in Fig. 1(a), considering $\xi = 2$. It is important to
note that, given the ξ-simgraph of $S$, the following rule
allows checking if $S$ is a ξ-simset or not:
$s_k \in S \setminus \hat{S}$ there is at least one element $s_i \in \hat{S}$, such that
$s_k \hat{=}_\xi s_i$.
Notice that although a ξ-simset $\hat{S} \subseteq \mathbb{S}$ exists independently of $S$, a ξ-extraction from $S$ is always associated
with $S$. Following this definition, a ξ-extraction of $S$ with maximum or minimum cardinality corresponds to a maximum or minimum independent dominating set of
graph $\hat{G}_\xi(S)$. Also, note that it is possible to ξ-extract several distinct ξ-simsets from a given set $S$. We call the set
of all possible ξ-extractions of ξ-simsets from $S$ its similarity cover, which is defined as follows.
Definition 5.5 (Similarity-cover). Let $S \subseteq \mathbb{S}$ be a set exist
$|\hat{S}| \geq |\hat{S}_i| \;\; \forall \hat{S}_i \in P_{\hat{S}_\xi}(S)$
$|\hat{S}| \leq |\hat{S}_i| \;\; \forall \hat{S}_i \in P_{\hat{S}_\xi}(S)$
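For very small graphs, the similarity cover and its largest and smallest members can be enumerated by brute force. This is a minimal sketch, assuming that a ξ-extraction is a set of nodes with no edge between them (no two ξ-similar elements) that is linked to every remaining node; the function names and the star-shaped graph are hypothetical, and the exhaustive search is exponential, so it is only meant to illustrate the two cardinality conditions.

```python
from itertools import combinations


def is_extraction(adj, subset):
    """Check that subset is independent and that every other node has a neighbour in it."""
    chosen = set(subset)
    independent = all(adj[v].isdisjoint(chosen) for v in chosen)
    dominating = all(v in chosen or adj[v] & chosen for v in adj)
    return independent and dominating


def similarity_cover(adj):
    """Enumerate every extraction of the graph by exhaustive search (tiny graphs only)."""
    nodes = sorted(adj)
    return [set(c)
            for k in range(len(nodes) + 1)
            for c in combinations(nodes, k)
            if is_extraction(adj, c)]


# Hypothetical star graph: node 0 is xi-similar to nodes 1, 2 and 3.
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
cover = similarity_cover(adj)
smallest = min(cover, key=len)  # satisfies the "<=" condition
largest = max(cover, key=len)   # satisfies the ">=" condition
print(cover, smallest, largest)
```

On this graph the cover holds two simsets: the center alone and the three leaves together.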
Table 2
Properties of ξ-simset operators. Univocal indicates whether the result set is univocal or not.
Columns: Meaning, Property, Univocal. Row groups: Identity, Commutativity, Query policies, Insert policies, Replace.
The union: $\hat{S} \mathbin{\hat{\cup}} \hat{R}$;
The intersection: $\hat{S} \mathbin{\hat{\cap}} \hat{R}$;
The difference: $\hat{S} \mathbin{\hat{-}} \hat{R}$;
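[min] The Union operator $\hat{T} = \hat{S} \mathbin{\hat{\cup}} \hat{R}$ chooses the
elements aimed at generating a result ξ-simset
with the minimum cardinality, such that $\forall s_i \in \hat{S} \rightarrow \exists t_j \in \hat{T} \mid s_i \hat{=}_\xi t_j$ and $\forall r_i \in \hat{R} \rightarrow \exists t_j \in \hat{T} \mid r_i \hat{=}_\xi t_j$.
$\rightarrow \exists t_j \in \hat{T} \mid s_i \hat{=}_\xi t_j$ and $\forall r_i \in \hat{R} \rightarrow \exists t_j \in \hat{T} \mid r_i \hat{=}_\xi t_j$.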
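$\hat{S} \mathbin{\hat{\cap}} \hat{R}$ is the set of all elements $t_i$ in $\hat{S} \cup \hat{R}$ (set union) such
that ($t_i \in \hat{S} \wedge \exists r_j \in \hat{R} \mid t_i \hat{=}_\xi r_j$) or ($t_i \in \hat{R} \wedge \exists s_k \in \hat{S} \mid t_i \hat{=}_\xi s_k$). The
policy follows the same one defined for the union operator,
so the corresponding operators for the [min], [Max], [Leave] and [Replace] policies exist.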
Definition 5.9 (ξ-similarity difference). The ξ-similarity
no element $r_i \in \hat{R}$ and $s_i \hat{=}_\xi r_j$. There is only one way to
perform the $\hat{-}$ operator, thus no policy is assigned to the
ξ-similarity difference operation.
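Because the difference has a single possible result, it is the simplest binary operator to sketch. The code below assumes that the ξ-similarity difference keeps the elements of the first simset that are not ξ-similar to any element of the second one, and that the distance is Euclidean; the name `sim_difference` and the sample data are hypothetical.

```python
import math


def sim_difference(s_hat, r_hat, xi, dist=math.dist):
    """Keep the elements of s_hat whose distance to every element of r_hat exceeds xi."""
    return [s for s in s_hat if all(dist(s, r) > xi for r in r_hat)]


# Hypothetical simsets with threshold xi = 2.
s_hat = [(0, 0), (10, 0)]
r_hat = [(1, 0)]
result = sim_difference(s_hat, r_hat, xi=2)
print(result)
```

Only (10, 0) survives, since (0, 0) lies within the threshold of (1, 0).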
Notice that the extraction of a ξ-simset over the result
of a regular operator is not equivalent to the execution
of the corresponding simset-based operator over the
ξ-extractions of the original sets. For example, considering the union operator, $\text{Distinct}(S, \xi, \varphi) \mathbin{\hat{\cup}} \text{Distinct}(R, \xi, \varphi) \not\equiv \text{Distinct}(S \cup R, \xi, \varphi)$.
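[Leave] The insert operation $\hat{S}' = \text{Insert}(\hat{S}, s_i) \equiv \hat{S} \mathbin{\hat{\cup}} \{s_i\}$ inserts $s_i$ into $\hat{S}$ only if $\nexists s_j \in \hat{S} \mid s_i \hat{=}_\xi s_j$,
otherwise does nothing;
$\text{RInsert}(\hat{S}, s_i) \equiv \hat{S} \mathbin{\hat{\cup}} \{s_i\}$ removes every element
$s_j \in \hat{S} \mid s_i \hat{=}_\xi s_j$ from $\hat{S}$ and inserts $s_i$ into $\hat{S}$.
[min] The shrink operation $\hat{S}' = \text{Shrink}(\hat{S}, s_i) \equiv \hat{S} \mathbin{\hat{\cup}} \{s_i\}$ evaluates the cardinality of the subset
$R \subseteq \hat{S} \mid s_j \in R \Rightarrow s_i \hat{=}_\xi s_j$; if $|R| > 1$ then removes
$R \subseteq \hat{S} \mid s_j \in R \Rightarrow s_i \hat{=}_\xi s_j$; if $|R| > 1$ then does nothing,
ξ-similarity difference of $\hat{S}$ and the unitary set $\{s_i\}$, $\hat{S} \mathbin{\hat{-}} \{s_i\}$.
Notice that, as the ξ-similarity difference operator is
insensitive to policies, there is only one operator
$\hat{S}' = \text{Delete}(\hat{S}, s_i) \equiv \hat{S} \mathbin{\hat{-}} \{s_i\}$.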
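The maintenance operations that survive in the text can be sketched on a list-based simset. This is an illustrative sketch under stated assumptions: Euclidean distance, [Leave] read as "insert only when no ξ-similar element exists", [Replace] read as "evict every ξ-similar element, then insert", and Delete read as the ξ-similarity difference with a unitary set. All function names are hypothetical.

```python
import math


def similar_to(s_hat, s_i, xi, dist=math.dist):
    """Return the elements of s_hat that are xi-similar to s_i."""
    return [s for s in s_hat if dist(s, s_i) <= xi]


def insert_leave(s_hat, s_i, xi):
    """[Leave]: add s_i only when nothing already stored is xi-similar to it."""
    if similar_to(s_hat, s_i, xi):
        return list(s_hat)
    return list(s_hat) + [s_i]


def insert_replace(s_hat, s_i, xi):
    """[Replace]: remove every element xi-similar to s_i, then add s_i."""
    near = similar_to(s_hat, s_i, xi)
    return [s for s in s_hat if s not in near] + [s_i]


def delete(s_hat, s_i, xi):
    """Delete: remove every element xi-similar to s_i."""
    near = similar_to(s_hat, s_i, xi)
    return [s for s in s_hat if s not in near]


# Hypothetical simset with threshold xi = 2 (its two elements are 5 apart).
s_hat = [(0, 0), (5, 0)]
left = insert_leave(s_hat, (1, 0), 2)
replaced = insert_replace(s_hat, (1, 0), 2)
grown = insert_leave(s_hat, (10, 0), 2)
deleted = delete(s_hat, (1, 0), 2)
print(left, replaced, grown, deleted)
```

Each function returns a new list, so the input simset is left untouched and the results of different policies can be compared side by side.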
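Definition 5.12 (ξ-query). The ξ-query centered over an
returns $s_i$ if $\exists s_j \in \hat{S} \mid s_i \hat{=}_\xi s_j$, or $\emptyset$ otherwise.
$\hat{S} \mathbin{\hat{\cap}} \{s_i\}$ returns the subset $R \subseteq \hat{S} \mid s_j \in R \Rightarrow s_i \hat{=}_\xi s_j$.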
Notice that query operators are subject only to the extraction policies [min] and [Max]. However, it is interesting to
notice that maintenance policies would produce equivalent
results, as S^ fsi g
^ S^ fsi g, and S^ fsi g
^ S^ fsi g.
A final operation worth exploring for ξ-simsets is the
aggregation of elements in a dataset $S$ into groups, each
each $s_i \in \hat{S}$ as the representative of all the elements $s_g \in S$
such that $s_i \hat{=}_\xi s_g$. Thus we define the following SQL syntax
extension to represent similarity grouping:
SELECT f(atrib)
FROM Tab
GROUP BY SIMILARITY
catrib UP TO ξ USING distance_function |
[CONSTRAINT name];
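The grouping semantics can be illustrated outside SQL. The sketch below assumes that the representatives form a ξ-simset already extracted from the dataset and that the distance is Euclidean. Since an element may be ξ-similar to more than one representative, it is assigned here to the nearest one; that tie-breaking rule is an assumption of this sketch, not something stated in the text. The name `group_by_similarity` is hypothetical.

```python
import math
from collections import defaultdict


def group_by_similarity(elements, representatives, xi, dist=math.dist):
    """Assign each element to its nearest representative, provided it lies within xi."""
    groups = defaultdict(list)
    for e in elements:
        nearest = min(representatives, key=lambda r: dist(e, r))
        if dist(e, nearest) <= xi:
            groups[nearest].append(e)
    return dict(groups)


# Hypothetical dataset and representatives with threshold xi = 2.
elements = [(0, 0), (1, 0), (5, 0), (6, 0)]
representatives = [(0, 0), (5, 0)]
groups = group_by_similarity(elements, representatives, xi=2)
print(groups)
```

An aggregate function such as a count or an average could then be applied to each group, mirroring what `f(atrib)` does in the SQL extension.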
dataset $S$: $\hat{G}_\xi(S) = \text{SimGraph}(S, \xi)$;
2. Obtain the ξ-similarity-set $\hat{S}$ from graph $\hat{G}_\xi(S)$, following the $\varphi \in \{$[min], [Max]$\}$ policy: $\hat{S} = \text{Distinct}(S, \xi, \varphi)$.
Algorithm 1. $\hat{S} = \text{Distinct}(S, \xi, \varphi)$.
1   Array Totals[|S|];
2   Set INodes, JNodes, SimSet;
3   SimSet ← ∅; GraphS ← SimGraph(S, ξ);
4   foreach node si in GraphS with zero degree do
5       remove si from GraphS and insert it into SimSet;
6   Totals[i] ← degree of node si in graph GraphS;
7   Min0 ← smallest value in Totals;
8   INodes ← nodes having degree Min0;
9   while edges exist in GraphS do
10      if φ is Max then
11          randomly select a node si from INodes;
12          remove every node linked to si from GraphS;
13          remove si from GraphS and insert it into SimSet;
14      if φ is min then
15          JNodes ← set of nodes linked to nodes in INodes;
16          foreach si in JNodes do
17              C1[i] ← number of nodes with degree Min0 linked to si;
18          foreach si in JNodes do
19              C2[i] ← −C1[i];
20              foreach sj in JNodes connected to si do
21                  C2[i] ← C2[i] + C1[j];
22          Min1 ← smallest value in C2;
23          randomly select a node si from JNodes having C2[i] = Min1;
24          remove every node linked to si from GraphS;
25          remove si from GraphS and insert it into SimSet;
26      update Totals;
27      foreach node si in GraphS with zero degree do
28          remove si from GraphS and insert it into SimSet;
29      update Min0 and INodes;
30  end
Fig. 2. Steps performed by algorithm $\hat{S} = \text{Distinct}(S, \xi, \varphi)$ to obtain a ξ-simset from $\hat{G}_\xi$, using policy $\varphi$. (a) The original dataset, (b)–(d) steps to obtain the
ξ-simset using the [Max] policy, (e)–(g) steps to obtain the ξ-simset using the [min] policy.
algorithm is based on the node degrees, which are evaluated in Lines 6–8. During the execution, as nodes are
being dropped from the graph, the variables Totals, Min0
and INodes are kept up to date, in Lines 26 and 29.
Nodes that become isolated during the execution of the
algorithm are moved from the graph into the ξ-simset, as
shown in Line 27. Notice that evaluating the node degrees
requires a full scan over the dataset, but thereafter, as
those variables are locally maintained, no additional scan
is required.
Let us first exemplify how the [Max] policy is evaluated on our example graph from Fig. 2(a). In the first
iteration of Line 9, nodes {p1, p2, p5, p6, p12, p13, p19, p21, p22,
p24, p25} have the smallest count of edges (degree). So, one of
them is removed from the graph and inserted into the result,
in Line 10. Fig. 2(b) depicts the case when one of such nodes
(in this example, node p13) is inserted in the result, and all
the nodes linked to it are removed from the graph (in this
example node p11 is removed). Fig. 2(c) shows the graph
after all of those nodes have been inserted into the result and
the nodes linked to them removed, in successive iterations of
Line 9. Now, as nodes {p7, p10, p14, p17} have two edges each,
they are inserted into the result in the next iterations of Line
9, as shown in Fig. 2(d). As no other node remains in the
graph, the result set is the final ξ-simset, which has 16 nodes
in this example. In fact, no SimSet with more than 16 nodes
exists.
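The [Max] walk-through above follows a minimum-degree greedy rule, which can be sketched compactly. The code is a simplified illustration, not the authors' implementation: it recomputes the minimum degree on every iteration instead of maintaining Totals, Min0 and INodes incrementally, treats isolated nodes as ordinary minimum-degree nodes, and takes the graph as a dictionary of adjacency sets. The name `distinct_max` and the sample graph are hypothetical. Being a greedy heuristic, it is not guaranteed to reach the maximum cardinality on every graph.

```python
import random


def distinct_max(adj, rng=random):
    """Greedy [Max] sketch: keep a minimum-degree node, drop its neighbours, repeat."""
    graph = {v: set(n) for v, n in adj.items()}  # work on a copy of the graph
    simset = set()
    while graph:
        min_degree = min(len(n) for n in graph.values())
        candidates = sorted(v for v, n in graph.items() if len(n) == min_degree)
        chosen = rng.choice(candidates)
        removed = graph[chosen] | {chosen}
        simset.add(chosen)
        for v in removed:
            del graph[v]
        for neighbours in graph.values():
            neighbours -= removed  # keep the remaining adjacency sets consistent
    return simset


# Hypothetical star graph: node 0 is linked to nodes 1, 2 and 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
result = distinct_max(star)
print(result)
```

Whichever leaf is drawn first, the center is discarded and the other leaves become isolated, so the three leaves always form the result on this graph.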
Let us now exemplify how the algorithm works for
policy [min], which is processed in Line 14. Once more
we employ the graph in Fig. 2(a) as our running
example. In the first iteration of Line 9, the nodes having
the smallest count of edges are the same ones selected
with the [Max] policy. However, these nodes are now
employed in Line 15 to select those nodes directly linked
to them, which results in the set {p3, p4, p11, p18, p20, p23}. In
Line 16, it is counted how many neighbors in INodes each
of the selected nodes has, resulting in the counting vector
C1 = {p3: 2, p4: 2, p11: 2, p18: 1, p20: 2, p23: 2}. Next, these
counts are discounted from the total number of nodes with
the minimum degree linked to each of their neighbors
in Line 18, resulting in the counting vector C2 = {p3: 0,
p4: 0, p11: −2, p18: 3, p20: −1, p23: −1}. As p11 is the sole
node with the minimum count (−2) in C2, it is moved into
the result, also removing all of its neighbors from the graph,
which is done in Lines 22–25.
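The scoring step of the [min] policy can be reproduced numerically. The sketch below assumes that each score starts at the negated C1 count and then accumulates the C1 counts of the neighbours that are also in JNodes, which reads the minimum count in the example as −2. The C1 values are the ones reported in the text; the three edges among JNodes are not read from Fig. 2 but inferred so that the scores agree with the reported ones. The name `c2_scores` is hypothetical.

```python
def c2_scores(c1, jnode_edges):
    """C2[i] = -C1[i] + sum of C1[j] over the JNodes neighbours j of node i."""
    c2 = {node: -count for node, count in c1.items()}
    for a, b in jnode_edges:
        c2[a] += c1[b]
        c2[b] += c1[a]
    return c2


# C1 as reported in the running example.
c1 = {"p3": 2, "p4": 2, "p11": 2, "p18": 1, "p20": 2, "p23": 2}
# Edges among JNodes: an assumption inferred from the reported scores.
jnode_edges = [("p3", "p4"), ("p18", "p20"), ("p18", "p23")]

c2 = c2_scores(c1, jnode_edges)
best = min(c2, key=c2.get)
print(c2, best)
```

The lowest score belongs to p11, the node that covers two minimum-degree nodes without competing with any other candidate, which is why it is moved into the result first.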
In the next iteration of Line 9, the same values result in
the counting vectors C1 and C2, this time without node
p11. Now, both nodes p20 and p23 tie with the minimum
count −1, so either of them can be inserted into the result.
Fig. 2(e) shows the graph just after this second iteration, in
which we chose to insert node p20 into the result. Proceeding, nodes p4 and p23 are moved to the result, as shown
in Fig. 2(f). Finally, nodes p9 and p15 are successively
moved, so the SimSet that corresponds to the final graph
shown in Fig. 2(g) is obtained, which in this example has
10 nodes. In fact, no SimSet with fewer than 10 nodes can
be extracted.
Whenever a node is moved from the graph into the
result, the graph and the corresponding degree of the
remaining nodes are updated in Line 26. Disconnected
nodes are also moved to the resulting SimSet, in Line 27.
Fig. 3. Binary operations for Similarity-sets: only nodes in $\hat{S} \mathbin{\hat{\cap}} \hat{R}$ may have edges.
and $\hat{R}$. The main point to be attained is that the results hold
the ξ-simset requirements. We developed one algorithm
in $\hat{R}$ (Line 1) and then creates its ξ-Similarity Graph $\hat{G}_\xi(B)$
GraphS ← SimGraph(B, ξ);
if Op = $\hat{-}$ then
    foreach node si in GraphS that is an element in $\hat{R}$ do
        Remove every element sj from B linked to si in GraphS;
        Remove si from B;
    Return B as SimSet;
if Op = $\hat{\cap}$ then
    foreach node si in GraphS with zero degree do
        Remove si from B and from GraphS;
if φ = [Max] or φ = [min] then
    Return Distinct(GraphS, φ) as SimSet;
if φ = [Leave] then
    foreach edge (si, sj) ∈ GraphS | si ∈ $\hat{S}$
Return B as SimSet;
End;
intersection $\hat{S} \mathbin{\hat{\cap}} \hat{R}$ or difference $\hat{S} \mathbin{\hat{-}} \hat{R}$, following any
policy $\varphi \in \{$[min], [Max], [Leave], [Replace]$\}$. It is shown as
Algorithm 2.
Notice that if we generate a (traditional) set B with all
as $B = \hat{S} \cup \hat{R}$) and then extracts a ξ-Similarity Graph $\hat{G}_\xi(B)$,
then its edges will always link one element from $\hat{S}$ to one
bag B, regardless of the operation being union or intersection. If the policy is either [Max] or [min], algorithm
Distinct is directly applied in step 17 to obtain the result.
If the policy is [Leave] or [Replace], respectively steps 19 or 23
remove elements participating in edges from the proper
Fig. 4. Results of executing $\hat{S} = \text{Distinct}(S, \xi, \varphi)$ over synthetic graphs generated by the preferential attachment method with varying numbers of nodes and
edges. (a) $\hat{S}$ node cardinality in the [Max] policy, (b) $\hat{S}$ node cardinality in the [min] policy, (c) time spent using the [Max] policy, (d) time spent using the
[min] policy, (e) key description. Axes: number of nodes after Distinct (x100) and time spent (seconds), both versus number of edges (% of complete graph); the key lists the number of nodes, from 1000 to 21000.
Difference: $\hat{S} \mathbin{\hat{-}} \hat{R}$;
Intersection: $\hat{S} \mathbin{\hat{\cap}} \hat{R}$;
Union: $\hat{S} \mathbin{\hat{\cup}} \hat{R}$;
Each plot in Fig. 6 shows the time spent to perform
each operator, varying the number of nodes. Also, each
curve corresponds to graphs with different number of
Fig. 5. Results of executing $\hat{S} = \text{Distinct}(S, \xi, \varphi)$ over synthetic graphs with random edge distribution, varying the number of nodes but keeping the
proportion of edges per node. (a) $\hat{S}$ node cardinality in the [Max] policy, (b) $\hat{S}$ node cardinality in the [min] policy, (c) time spent using the [Max] policy, (d)
time spent using the [min] policy, (e) key description. Axes: number of nodes after Distinct, and time on a logarithmic scale with a reference line of slope 1, both versus number of nodes; the key lists the number of edges (%): 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.8%, 1%.
difference operations $\hat{S} \mathbin{\hat{-}} \hat{R}$, while Fig. 6(b) shows the time
number of edges in SimGraph($\hat{S}$ UNION ALL $\hat{R}$, ξ) increases, the time spent for performing the Union operation
also increases.
8.3. Evaluation of the USA cities dataset
This section reports the results of experiments that
employ the USA Cities dataset, aiming at validating the
Distinct algorithm over real datasets. As the dataset contains the geographical positioning of cities, that is, points
distributed in a Euclidean two-dimensional space, it is also
useful to obtain an intuition of the algorithm behavior over
a dataset that is more easily understandable to us. A possible scenario using this dataset would be the possibility to
Fig. 6. Results of executing $\hat{T} = \text{BinOp}(\hat{S}, \hat{R}, \xi, Op, \varphi)$ over synthetic graphs with random edge distribution, varying the number of nodes while keeping
the proportion of edges per node. Axes: logarithmic scale versus dataset size (S + R), from 10000 to 30000; the key lists the proportion of edges: 0.2%, 0.3%, 0.4%, 0.8%, 1%.
[Figure panels: number of edges in SimGraph (x100000) and two further measures versus distance radius for Range Join (kilometers), from 10 to 60; legend: Distinct - Max policy.]
Table 3
USA road networks graphs dataset.
Graph name    Number of nodes    Number of edges
-             264,346            733,846
-             321,270            800,172
-             435,666            1,057,066
Table 4
Statistics for the [Max] policy.
Graph name
-             128,738 (49%)      67.95
-             164,444 (51%)      89.83
-             223,532 (51%)      171.02
Table 5
Statistics for the [min] policy.
Graph name
-             84,745 (32%)       36.05
-             111,606 (35%)      69.72
-             148,278 (34%)      122.90
Source: http://www.agritempo.gov.br/estacoes.html
Fig. 8. (a) Distribution of the existing network of meteorological stations in Brazil. (b) Distribution of stations selected from the Climate dataset using
Distinct with min policy.
Fig. 9. (a) Distribution of the existing network of meteorological stations in the southeast and south of Brazil. (b) Distribution of stations selected from the
existing real station network using Distinct with min policy.
Fig. 10. (a) Location of new stations to build according to the similarity set resulting from $\hat{S} \mathbin{\hat{-}} \hat{R}$. (b) Result of a similarity union $\hat{R} \mathbin{\hat{\cup}} \hat{S}$ between real
stations and new stations from simulated data.
9. Conclusion
In this paper we introduced the novel concept of similarity-sets, or SimSets for short. A set is a data structure
holding a collection of elements that has no duplicates. A
$s_i, s_j \in \hat{S}$ such that $d(s_i, s_j) \leq \xi$, where $d(s_i, s_j)$ measures the
(dis)similarity between both elements and $\xi$ is a similarity
threshold. The ξ-simset concept extends the idea of sets in a
way that SimSets with similarity threshold ξ do not include
[12] C. Xiao, W. Wang, X. Lin, J.X. Yu, G. Wang, Efficient similarity joins for near-duplicate detection, ACM Trans. Database Syst. 36 (2011) 15:1–15:41. URL: http://doi.acm.org/10.1145/2000824.2000825.
[13] T. Emrich, F. Graf, H.-P. Kriegel, M. Schubert, M. Thoma, Optimizing all-nearest-neighbor queries with trigonometric pruning, in: SSDBM, 2010, pp. 501–518. URL: http://portal.acm.org/citation.cfm?id=1876037.1876079.
[14] E.H. Jacox, H. Samet, Metric space similarity joins, ACM Trans. Database Syst. 33 (2008) 7:1–7:38. URL: http://doi.acm.org/10.1145/1366102.1366104.
[15] K. Fredriksson, B. Braithwaite, Quicker similarity joins in metric spaces, in: N. Brisaboa, O. Pedreira, P. Zezula (Eds.), Similarity Search and Applications, Lecture Notes in Computer Science, vol. 8199, Springer, Berlin, Heidelberg, 2013, pp. 127–140.
[16] T. Ban, Y. Kadobayashi, On tighter inequalities for efficient similarity search in metric spaces, IAENG Int. J. Comput. Sci. 35 (2008) 392–402.
[17] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, England, 2004.
[18] S. Zhu, J. Wu, H. Xiong, G. Xia, Scaling up top-k cosine similarity search, Data Knowl. Eng. 70 (1) (2011) 60–83.
[19] F.-S. Lin, P.L. Chiu, A near-optimal sensor placement algorithm to achieve complete coverage-discrimination in sensor networks, IEEE Commun. Lett. 9 (1) (2005) 43–45, http://dx.doi.org/10.1109/LCOMM.2005.01027.
[20] L.I. Kuncheva, L.C. Jain, Nearest neighbor classifier: simultaneous editing and feature selection, Pattern Recognit. Lett. 20 (11–13) (1999) 1149–1156.
[21] O. Pedreira, N.R. Brisaboa, Spatial selection of sparse pivots for similarity search in metric spaces, in: Proceedings of the 33rd Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM '07, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 434–445.
[22] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, S. Yahia, Efficient computation of diverse query results, in: IEEE 24th International Conference on Data Engineering, ICDE 2008, 2008, pp. 228–236.
[23] T. Skopal, V. Dohnal, M. Batko, P. Zezula, Distinct nearest neighbors queries for similarity search in very large multimedia databases, in: Proceedings of the Eleventh International Workshop on Web Information and Data Management, WIDM '09, ACM, New York, NY, USA, 2009, pp. 11–14.
[24] V. Gil-Costa, R. Santos, C. Macdonald, I. Ounis, Sparse spatial selection for novelty-based search result diversification, in: R. Grossi, F. Sebastiani, F. Silvestri (Eds.), String Processing and Information Retrieval, Lecture Notes in Computer Science, vol. 7024, Springer, Berlin, Heidelberg, 2011, pp. 344–355.
[25] H.-P. Kriegel, P. Kröger, A. Zimek, Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Trans. Knowl. Discov. Data 3 (1) (2009) 1:1–1:58.
[26] R.L.F. Cordeiro, A.J.M. Traina, C. Faloutsos, C. Traina Jr., Halite: fast and scalable multiresolution local-correlation clustering, IEEE Trans. Knowl. Data Eng. 25 (2) (2013) 387–401.
[27] K. Stefanidis, G. Koutrika, E. Pitoura, A survey on representation, composition and application of preferences in database systems, ACM Trans. Database Syst. 36 (3) (2011) 19:1–19:45. URL: http://doi.acm.org/10.1145/2000824.2000829.
[28] M. Drosou, E. Pitoura, Search result diversification, SIGMOD Rec. 39 (1) (2010) 41–47, http://dx.doi.org/10.1145/1860702.1860709. URL: http://doi.acm.org/10.1145/1860702.1860709.
[29] D. Chakrabarti, Y. Zhan, C. Faloutsos, R-MAT: a recursive model for graph mining, in: SDM, 2004, p. 5.
[30] T.L. Black, The new NMC mesoscale eta model: description and forecast examples, Weather Forecast. 9 (1994) 265–284. doi:10.1175/1520-0434(1994)009<0265:TNNMEM>2.0.CO;2.
[31] I.R.V. Pola, R.L.F. Cordeiro, C. Traina Jr., A.J.M. Traina, A new concept of sets to handle similarity in databases: the SimSets, in: Similarity Search and Applications - 6th International Conference, SISAP 2013, A Coruña, Spain, October 2–4, 2013, Proceedings, 2013, pp. 30–42.