
IPASJ International Journal of Computer Science (IIJCS)

Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm


A Publisher for Research Motivation ........ Email:editoriijcs@ipasj.org
Volume 5, Issue 10, October 2017 ISSN 2321-5992

Improving Cache Technique Performance for Advanced Operations on High-Dimensional Data

Sahithi H1, M. Ravi2, Dr. D. Baswaraj3, Dr. M. Janga Reddy4

1 PG Student, Department of CSE, CMR Institute of Technology, Hyderabad (Telangana), India
2 Assistant Professor, Department of CSE, CMR Institute of Technology, Hyderabad (Telangana), India
3 Professor, Department of CSE, CMR Institute of Technology, Hyderabad (Telangana), India
4 Professor, Department of CSE, CMR Institute of Technology, Hyderabad (Telangana), India

ABSTRACT
In this paper, we present an efficient tree-based indexing technique, called iDistance, for K nearest neighbor (KNN) search in a high-dimensional metric space. iDistance partitions the data according to a space- or data-partitioning strategy and selects a reference point for each partition. The data points in each partition are transformed into a one-dimensional value based on their distance to their partition's reference point. This enables the points to be indexed with a B+-tree structure and KNN search to be performed as a one-dimensional range search. The choice of partitions and reference points adapts the index structure to the data distribution. We conducted extensive experiments to evaluate the iDistance technique and report results demonstrating its effectiveness. We also present a cost model for iDistance KNN search, which can be exploited in query optimization.
Keywords: High Dimensional Data, KNN Search, iDistance, LSH, Caching Framework

1. INTRODUCTION
Many emerging data applications, such as image, time series, and scientific databases, manipulate high-dimensional data. In these applications, one of the most frequently used yet expensive operations is to find objects in the high-dimensional database that are similar to a given query object. Nearest neighbor search is a central requirement in such cases. There is a long stream of research on the nearest neighbor search problem, and a large number of multidimensional indexes have been developed for this purpose. Existing multidimensional indexes such as R-trees have been shown to be inefficient even for supporting range queries in high-dimensional databases; however, they form the basis for indexes designed for high-dimensional databases.

To reduce the effect of high dimensionality, the use of larger fanouts, dimensionality reduction techniques, and filter-and-refine strategies have been proposed. Indexes have also been specifically designed to facilitate metric-based query processing. However, linear scan remains an efficient search strategy for similarity search. This is because data points tend to be nearly equidistant from query points in a very high-dimensional space. While linear scan is efficient in terms of sequential I/O, every point incurs an expensive distance computation when linear scan is used for the nearest neighbor problem. For fast response to queries, with some tolerance for errors (i.e., answers may not necessarily be the nearest neighbors), approximate nearest neighbor (NN) search indexes such as the P-Sphere tree have been proposed. The P-Sphere tree works well on static databases and provides answers with assigned accuracy. It achieves its efficiency by duplicating data points in data clusters based on a sample query set. Generally, most of these structures are not adaptive to data distributions; consequently, they tend to perform well for some datasets and poorly for others.

In this paper, we present iDistance, a new technique for KNN search that can be adapted to different data distributions. In our technique, we first partition the data and define a reference point for each partition. Then we index the distance of each data point to the reference point of its partition. Since this distance is a simple scalar, with a small mapping effort to keep partitions distinct, a classical B+-tree can be used to index it. As such, it is easy to graft our technique on top of an existing commercial relational database. This is important because most commercial DBMSs today do not support indexes beyond the B+-tree and the R-tree (or one of their variants). The effectiveness of iDistance depends on how the data is partitioned and how the reference points are selected. For a KNN query centered at q, a range query with radius r is issued. The iDistance KNN search algorithm searches the index from the query point outward, and for each partition that intersects the query sphere, a range query is executed. If the algorithm finds K elements that are nearer than r to q at the end of the search, it terminates. Otherwise, it extends the search radius by r, and the search continues to examine the unexplored regions in the partitions that intersect the query sphere.
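The mapping and the expanding-radius search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a sorted list stands in for the B+-tree, the constant c is assumed larger than any distance within a partition, and all names are illustrative.

```python
import bisect
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(points, refs, c):
    """Map each point p to the 1-D key i*c + dist(p, O_i), where O_i is
    p's nearest reference point and c keeps partitions' key ranges
    distinct. Returns (sorted_keys, entries) as a B+-tree stand-in."""
    entries = []
    for p in points:
        i = min(range(len(refs)), key=lambda j: dist(p, refs[j]))
        entries.append((i * c + dist(p, refs[i]), p))
    entries.sort(key=lambda e: e[0])
    return [e[0] for e in entries], entries

def knn_search(q, k, refs, c, keys, entries, step):
    """Expand the search radius by `step` until k points closer than
    the current radius r have been found."""
    r = step
    while True:
        cands = []
        for i, o in enumerate(refs):
            d_q = dist(q, o)
            # Partition i intersects the query sphere where the annulus
            # of distances [d_q - r, d_q + r] overlaps its key range;
            # a 1-D range search fetches the candidates in that annulus.
            lo = i * c + max(d_q - r, 0.0)
            hi = i * c + d_q + r
            for j in range(bisect.bisect_left(keys, lo),
                           bisect.bisect_right(keys, hi)):
                cands.append(entries[j][1])
        found = sorted(set(map(tuple, cands)), key=lambda p: dist(q, p))
        # Correct to stop only once the kth candidate lies within r.
        if len(found) >= k and dist(q, found[k - 1]) <= r:
            return found[:k]
        r += step
```

Because every candidate's true distance is re-checked, pulling a few extra points from a neighboring key range does not affect correctness, only cost.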

2. RELATED WORK
LSH is a standard approach for similarity search on high-dimensional data. As a result, there are various implementations of LSH available online, such as E2LSH, LSHKIT, LikeLike, LSH-Hadoop, LSH on GPU, and optimal LSH. Among these, LikeLike, LSH-Hadoop, and LSH-GPU were designed specifically for parallel computation models (MapReduce, Hadoop, and GPU, respectively). LikeLike and LSH-Hadoop are distributed LSH implementations, but they do not promise high performance. LSH-GPU, on the other hand, is directed towards delivering high levels of performance, but it is unable to handle large datasets because of the current memory capacity limitations of GPUs. To the best of our knowledge, none of these implementations is designed for standard multi-core processors, and they are unable to handle the large-scale real-time applications considered in this paper.
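The implementations listed above all build on the same primitive. As a hedged illustration (not taken from any of those codebases), the standard p-stable scheme computes h(v) = floor((a.v + b) / w) with Gaussian a, and each hash table looks points up by a key formed from k such functions:

```python
import random

def make_hash(dim, w, seed=None):
    """One p-stable LSH function h(v) = floor((a.v + b) / w), where a has
    i.i.d. standard Gaussian entries (2-stable) and b ~ Uniform[0, w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    def h(v):
        return int((sum(ai * vi for ai, vi in zip(a, v)) + b) // w)
    return h

def make_table_key(dim, k, w, seed):
    """A bucket key for one table: the concatenation of k independent
    hash values. An index uses L such tables."""
    fns = [make_hash(dim, w, seed * 1000 + i) for i in range(k)]
    def g(v):
        return tuple(h(v) for h in fns)
    return g
```

A naive index evaluates k*L of these functions per query, which is exactly the cost that the cache-friendly hashing variant discussed below targets.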

There are several variants of LSH implementations for distributed systems that reduce communication costs through data clustering and clever placement of similar data on nearby nodes. Performing clustering, however, requires the use of load balancing techniques, since queries are directed to some nodes but not others. We show that even with a uniform data distribution, the communication cost for running nearest neighbor queries on large clusters is small, with very little load imbalance. Even with alternative custom techniques for data distribution, every node eventually runs a standard LSH algorithm. Therefore, a high-performing LSH implementation that achieves performance near hardware limits is a useful contribution to the community that can be adopted by all.
Our paper introduces a new cache-friendly variant of all-pairs LSH hashing that computes all LSH hash functions faster than a naive algorithm. There are other fast algorithms for computing LSH functions, notably one based on the fast Hadamard transform. However, the all-pairs approach scales linearly with the sparsity of the input vectors, while the fast Hadamard transform computation takes at least O(D log D) time, where D is the input dimension (which in our case is about 500,000). As a result, our adaptation of the all-pairs technique is much more efficient for the applications we address. The difficulty of parameter selection for the LSH algorithm is a well-known issue. Our approach is similar to earlier work in that we decompose the query time into separate terms (hashing, bucket search, etc.), estimate them as functions of the parameters k and L, and then optimize those parameters. However, our cost model is much more elaborate. First, we incorporate similarity into the model. Second, we separate the cost of the computation that depends on the number of distinct points colliding with the query from the cost that is a function of the total number of collisions. As a result, our model is very accurate, predicting the actual performance of our algorithm within a 15-25% margin of error. LSH has been previously used for similarity search over Twitter data. Specifically, prior work applied LSH to Twitter data for the purpose of first story detection, i.e., finding tweets that are highly dissimilar to all preceding tweets. In order to reduce the query time, they compare the query point only to a constant number of points that collide with the query most often. This approach works well for their application, but it may not be applicable to more general problems and domains. To extend their algorithm to streaming data, they keep bin sizes constant and overwrite old points when a bin gets full. As a result, each point is kept in multiple bins, and the expiration time is not well defined. Overall, the heuristic variant of LSH introduced in that work is accurate and fast enough for the specific purpose of detecting new topics. In this paper, however, we present an efficient general implementation of LSH that is far more scalable and provides well-defined correctness guarantees. In short, our work introduces a high-performance, in-memory, multithreaded, distributed nearest neighbor query system capable of handling large amounts of streaming data, something no previous work has achieved.
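The parameter-tuning idea above (decompose query time into per-phase terms, then optimize over k and L) can be sketched as a toy model. Every unit cost and collision probability below is a made-up placeholder, not a value from this paper; the point is only the shape of the optimization.

```python
def query_cost(k, L, n, dim, p2=0.3, t_hash=1.0, t_probe=5.0, t_cand=2.0):
    """Hypothetical cost model: total query time as a sum of per-phase
    terms, each a function of the LSH parameters k (hash length) and
    L (number of tables). p2 is an assumed collision probability for a
    far point under one hash function; the unit costs t_* are
    assumptions, not measured values."""
    hashing = t_hash * k * L * dim        # evaluate k*L hash functions
    probing = t_probe * L                 # one bucket lookup per table
    # expected number of colliding far points across all L tables
    candidates = t_cand * L * n * (p2 ** k)
    return hashing + probing + candidates

def recall(k, L, p1=0.9):
    """Probability that a near point (per-function collision
    probability p1) collides with the query in at least one table."""
    return 1.0 - (1.0 - p1 ** k) ** L

def tune(n, dim, target_recall=0.9):
    """Grid-search (k, L), keeping the modeled recall above the target
    and minimizing the modeled cost. Returns (cost, k, L)."""
    best = None
    for k in range(4, 33, 4):
        for L in range(1, 65):
            if recall(k, L) < target_recall:
                continue
            c = query_cost(k, L, n, dim)
            if best is None or c < best[0]:
                best = (c, k, L)
    return best
```

Larger k shrinks the candidate term exponentially but forces more tables (and hashing work) to hold recall, so the model trades the terms off against each other, which is the essence of the cost-based tuning described above.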

3. FRAMEWORK
In our caching problem, the main research question is how to exploit the limited memory size and query workload to reduce the candidate refinement time. In order to boost the cache hit ratio, we propose to cache conservative approximations of the data points (i.e., representing each point in very few bits). Such a conservative representation provides lower and upper distance bounds, which can be used to prune unpromising candidates and identify true kNN results early.

Figure 1: Caching framework

First, we illustrate the caching framework in Figure 1. It consists of three phases:

(1) Candidate generation: retrieve a candidate set of object identifiers from the hash tables.
(2) Candidate reduction: run our proposed technique to reduce the number of candidates before entering candidate refinement.
(3) Candidate refinement: for all object identifiers in the candidate set, fetch their data points from the data point file on disk, then compute their distances to q and determine the k nearest results.

Phases 1 and 3 apply existing work directly and incur I/O: in Phase 1, we apply an existing index (e.g., disk-based LSH), and in Phase 3, we apply a multi-step kNN search technique. Phase 2 incurs no I/O; it runs our proposed technique to reduce the number of candidates before entering Phase 3. Since the candidate refinement phase is dominated by the I/O cost, we express the refinement cost in terms of I/O.

4. EXPERIMENTAL RESULTS
We demonstrate the prevalence of our caching answer on the real dataset (SOGOU, 29.7 GB) with a true question log.
We use C2LSH because the index. Figure 2, shows plots the typical query response time of our greatest caching method
(HC-O), caching exact information points (EXACT) and while not cache (i.e., C2LSH) for different cache size cs.
First, actual performs a lot of better than C2LSH (i.e., without cache). Second, our caching method (HC-O)
outperforms C2LSH by up to 5 times. Third, our best technique achieves the simplest performance once the cache size
reaches only 1/3 of the information size (i.e., 9G).

Figure 2: Query Response Time


5. CONCLUSION
Similarity search is of growing importance and is often most useful for objects represented in a high-dimensional attribute space. A central problem in similarity search is to find the points in the dataset nearest to a given query point. In this paper, we have presented a simple and efficient technique, called iDistance, for k nearest neighbor (kNN) search in a high-dimensional metric space, and we have investigated a caching solution to reduce the candidate refinement time.

Acknowledgement
We thank all the authors and research scholars whose work we referred to in this paper.


