
IJSRD - International Journal for Scientific Research & Development | Vol. 4, Issue 04, 2016 | ISSN (online): 2321-0613

Outlier Detection Using Anti-hubs
Miss. Gavale Swati S.1 Prof. Kahate Sandip2
1M.E. Student 2Assistant Professor
1,2Department of Computer Engineering
1,2Sharadchandra Pawar College of Engineering, Dumbarwadi, Otur, Pune, India
Abstract— Distance-based outlier detection methods fail as the dimensionality of the data increases. The problem arises from irrelevant and redundant features, which make the distances between points nearly indistinguishable. The reverse nearest neighbors of a point P are the points that have P in their k-nearest-neighbor lists. Some points appear frequently in the k-nearest-neighbor lists of other points, while others appear only rarely; the rarely occurring points are called anti-hubs. Recent proposals use anti-hubs for unsupervised outlier detection, but they suffer from the high computational cost of finding outliers: the cost and time needed to identify anti-hubs grow with the dimensionality of the data. To avoid this, the irrelevant and redundant features of high-dimensional data should be removed first. Removing redundant features increases efficiency, and a feature selection method is used to remove them.
Key words: Nearest Neighbor, Outlier Detection, Reverse Nearest Neighbors

I. INTRODUCTION
Many applications need intrusion and outlier detection, and outlier detection methods are classified as unsupervised, semi-supervised, or supervised. The classification depends on whether labels are available for the instances on which outlier detection is to be applied: supervised and semi-supervised outlier detection require correct instance labels. Distance-based outlier detection is the most popular and effective method for unsupervised outlier detection. In distance-based outlier detection, normal points have small distances among themselves, while outliers lie at large distances from the rest. As the dimensionality increases, however, distance-based methods fail to find outliers.
Unsupervised methods can detect outliers under the assumption that all data attributes are purposeful, i.e. not noisy. This work investigates the relation between high dimensionality and the outlier nature of instances. Some points appear frequently in the k-nearest-neighbor lists of other points, and some appear only infrequently; the latter are called anti-hubs.
The reverse-nearest-neighbor (RNN) concept is used for outlier detection, but there is no theoretical proof that explains the relation between the outlier nature of points and their reverse nearest neighbors. S. Ramaswamy et al. [6] stated that the reverse neighbor count is affected as the dimensionality of the data increases, so there is a need to investigate how RNN-based outlier detection methods are affected by dimensionality. This work therefore:
1) Discusses the problems of outlier detection in high dimensionality and shows how unsupervised methods can be used for outlier detection.
2) Investigates how anti-hubs are related to the outlier nature of a point.
3) Proposes, based on the relation between anti-hubs and outliers, two methods for high- and low-dimensional data that express the outlierness of points, beginning with the method ODIN (Outlier Detection using In-degree Number).
The existing system incurs a large computation cost and time to calculate the reverse nearest neighbors of all points, so using anti-hubs for outlier detection is a computationally heavy task. The computational complexity depends on the data dimensionality: as the dimensionality of the data increases, the complexity of the computation increases. Feature selection is therefore introduced to remove irrelevant and redundant features: all features are ranked, and only the required features are used for the reverse-nearest-neighbor computation; the outlier score is then calculated from the reverse nearest neighbors. According to previous studies, if the distribution of the data is unknown, Euclidean distance is the best choice.

II. LITERATURE SURVEY
Edwin M. Knorr et al. [1] described how to find outliers in large multidimensional datasets. Earlier systems could only deal with low-dimensional attributes; here outlier detection is done efficiently for large, k-dimensional datasets even for large values of k, and the detected outliers carry a clear meaning that supports knowledge discovery. Several algorithms for finding outliers are proposed and analyzed.
Amit Singh et al. [2] stated that reverse nearest neighbor queries are widely used in applications such as decision support systems, data streaming, document profiling, marketing, and bioinformatics, and that answering them over high-dimensional data is hard. The paper proposes a solution to reverse-nearest-neighbor queries in high-dimensional datasets using k-nearest neighbors and reverse nearest neighbors. The problem of finding RNNs in high dimensions had not been covered in the past; the authors discuss the challenges, present some important observations about high-dimensional RNNs, and then propose an approximate solution for answering RNN queries in high dimensions.
Ke Zhang et al. [3] described outlier detection as a large challenge in KDD applications for certain datasets. Existing outlier detection methods are not effective on scattered datasets because of fixed-pattern and parameter-setting problems. The paper introduces a local distance-based outlier factor (LDOF) to compute the outlierness of objects in scattered datasets.
Mahbod Tavallaee et al. [4] stated that researchers in anomaly detection face many problems in detecting novel attacks, and that the KDDCUP'99 dataset, although very useful and widely used in evaluations, has weaknesses that also affect signature-based IDSs. To solve these issues the paper proposes a new dataset, NSL-KDD, which keeps only the required records: redundant and bad-quality records are removed.
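To make the anti-hub idea concrete, the following is a small illustrative sketch (not code from this or any surveyed paper): it computes each point's k-occurrence N_k, i.e. how often the point appears in other points' k-nearest-neighbor lists, and flags the point with the lowest in-degree, as ODIN does. Euclidean distance and the toy dataset are assumptions of this example.

```python
# Illustrative sketch of k-occurrence (N_k) and ODIN-style flagging.
# Euclidean distance is assumed, as suggested by the paper.
from math import dist

def knn_lists(points, k):
    """For each point, return the indices of its k nearest neighbors."""
    lists = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        lists.append(others[:k])
    return lists

def k_occurrence(points, k):
    """N_k(x): how often each point appears in other points' k-NN lists.
    A low count marks an anti-hub; ODIN flags such points as outliers."""
    counts = [0] * len(points)
    for neighbors in knn_lists(points, k):
        for j in neighbors:
            counts[j] += 1
    return counts

# Toy data: a tight cluster of four points plus one far-away point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = k_occurrence(data, k=2)
# The isolated point never appears in anyone's k-NN list (an anti-hub).
outlier = min(range(len(data)), key=lambda i: scores[i])
```

Note that the isolated point has a k-occurrence of zero: no other point counts it among its nearest neighbors, which is exactly the anti-hub behavior the paper exploits.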

All rights reserved by www.ijsrd.com 500
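The distance-based scores that recur throughout this survey can be illustrated with the kth-nearest-neighbor distance measure of Ramaswamy et al. [6], where a point's outlier score is its distance to its kth nearest neighbor. The sketch below is an illustration of that idea, not code from the cited paper; Euclidean distance and the toy data are assumptions.

```python
# Illustrative sketch of the k-th nearest neighbor distance score
# (the measure of Ramaswamy et al. [6]); Euclidean distance assumed.
from math import dist

def kth_nn_distance(points, k):
    """Score each point by the distance to its k-th nearest neighbor;
    larger scores indicate stronger outliers."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Toy data: a tight cluster of four points plus one far-away point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = kth_nn_distance(data, k=2)
# The isolated point receives by far the largest score.
top = max(range(len(data)), key=lambda i: scores[i])
```

As the survey notes, scores of this kind degrade in high dimensions because all pairwise distances become similar, which is what motivates the anti-hub and feature-selection approach of this paper.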

Markus M. Breunig et al. [5] described that it is often more useful to assign each object a degree of being an outlier than a binary label. This degree, called the local outlier factor (LOF) of an object, depends on how the object is placed relative to its neighborhood. Their evaluation shows that this approach of finding local outliers is practical, that it can identify outliers and improve one's understanding of the data, and that the results can improve the effect of standard clustering and k-NN algorithms.
Sridhar Ramaswamy et al. [6] proposed ranking points by their distance to their kth nearest neighbor, so that outliers can be read from the top of the ranking. They analyzed a partition-based algorithm for mining outliers; the experimental analysis shows a significant reduction in input/output and a significant speedup in runtime.
Hans-Peter Kriegel et al. [7] used a different technique for finding outliers. Past distance-based approaches suffer from the curse of high dimensionality: nearest-neighbor distances lose meaning, the data become sparse, and indexing loses efficiency. The paper proposes angle-based outlier detection (ABOD), which assesses the angles between the difference vectors from a point to other points and does not depend on the size of the dataset. An experimental study on real-life and synthetic datasets shows that ABOD outperforms the alternatives and is useful for high-dimensional data.
Charu C. Aggarwal et al. [8] described how, in high-dimensional datasets, distance-based approaches degrade due to the curse of dimensionality. They studied the dimensionality curse from the point of view of the distance metrics used to measure similarity between objects, and showed that fractional distance metrics provide more useful results in full-dimensional datasets.
Edwin M. Knorr et al. [9] stated that existing outlier work focuses on the identification aspect and does not provide any intensional knowledge of the outliers. They showed what kind of intensional knowledge can be provided and how the computation of such knowledge can be optimized: for the first issue they proposed strongest and weak outliers, and for the second they proposed naive and semi-naive algorithms.
Stephen D. Bay et al. [10] declared that scoring examples by their distance to neighboring examples is a popular approach to finding unusual examples in a dataset, but that existing methods require high computation. They used a simple nested-loop algorithm that is quadratic in the worst case but gives near-linear time performance when the data is in random order and a simple pruning rule is used. Average-case analysis suggests that much of the efficiency comes from the fact that the time to process non-outliers, the majority of examples, does not depend on the size of the dataset. Bay et al. tested the algorithm on real high-dimensional datasets with millions of examples and showed that the near-linear scaling holds over several orders of magnitude.
Milos Radovanovic et al. [11] discussed the issues of outlier detection in the case of high data dimensionality and showed how outlier detection in high-dimensional data can be done using unsupervised methods. The work also inquires how anti-hubs are associated with a point's outlier nature and, based on the relation between anti-hubs and outliers, proposes two ways of using k-occurrence information for outlier detection in high- and low-dimensional data, showing the outlierness of points and beginning with the method ODIN (Outlier Detection using in-degree Number).

III. EXISTING SYSTEM
The existing system takes a set of instances and aims to find the irregular ones; this is the main task of outlier detection in intrusion detection and in many other real-life applications, and in many KDD applications finding outliers is more interesting than finding common patterns. In outlier detection, the outlier label is a binary value (0, 1). The existing method for outlier detection using reverse nearest neighbors suffers from high computation and memory requirements: first the k-nearest-neighbor list of each point is evaluated, and the reverse nearest neighbors are then derived from these lists.
Limitations of the existing system:
 Calculating the reverse nearest neighbors of all points takes high computation cost and time.
 Using anti-hubs for outlier detection is a computationally heavy task.
 The computational complexity depends on the data dimensionality; with the curse of dimensionality, the computational complexity increases.

IV. PROBLEM DEFINITION
There is a need to overcome these performance and computation issues in RNN-based outlier detection. A feature selection method is applied on the data to recover from this problem: removing irrelevant and redundant features before the reverse-nearest-neighbor step reduces the computation. In this way the effect of the curse of dimensionality is reduced compared to plain distance-based approaches.

V. PROPOSED SYSTEM
To remove the drawbacks of the existing system, the proposed system architecture contains three main steps:
1) Feature selection
2) Find reverse nearest neighbors
3) Find the outlier score of each instance
The proposed system is designed to handle the effect of the curse of dimensionality.
A. Feature Selection:
All features are ranked according to their importance, and the required features are selected for finding reverse nearest neighbors.
B. Find Reverse Nearest Neighbor:
The data restricted to the selected features is used for searching the reverse nearest neighbors: first the k-nearest-neighbor list of each point is evaluated.
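The three-step pipeline can be sketched as below. This is a hedged illustration under stated assumptions, not the authors' implementation: the paper does not specify the ranking criterion for features, so variance ranking is assumed purely for illustration, and Euclidean distance is used as the paper suggests.

```python
# Hedged sketch of the proposed pipeline: (1) rank and select features,
# (2) compute reverse nearest neighbors on the reduced data, (3) flag
# anti-hubs (points with few reverse neighbors) as outliers.
# Variance ranking in step 1 is an assumption of this example.
from math import dist

def select_features(points, n_keep):
    """Step 1: rank features by variance and keep the top n_keep."""
    dims = len(points[0])
    def variance(d):
        col = [p[d] for p in points]
        mean = sum(col) / len(col)
        return sum((v - mean) ** 2 for v in col) / len(col)
    ranked = sorted(range(dims), key=variance, reverse=True)
    keep = sorted(ranked[:n_keep])
    return [tuple(p[d] for d in keep) for p in points]

def knn_lists(points, k):
    """k-nearest-neighbor lists on the selected features."""
    return [sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))[:k]
            for i in range(len(points))]

def reverse_nn(points, k):
    """Step 2: RNN(P) = the points that have P in their k-NN list."""
    rnn = [[] for _ in points]
    for i, neighbors in enumerate(knn_lists(points, k)):
        for j in neighbors:
            rnn[j].append(i)
    return rnn

# Toy data: the third feature is near-constant noise and gets dropped.
raw = [(0, 0, 0.01), (0, 1, 0.02), (1, 0, 0.01), (1, 1, 0.02), (9, 9, 0.01)]
reduced = select_features(raw, n_keep=2)
rnn = reverse_nn(reduced, k=2)
# Step 3: the point with the fewest reverse neighbors is the anti-hub.
outlier = min(range(len(raw)), key=lambda i: len(rnn[i]))
```

Because the noisy feature is removed before the neighbor search, the distance computations run on fewer dimensions, which is the source of the time and memory savings claimed for the proposed system.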

From the k-nearest-neighbor list of each point, the reverse nearest neighbors are derived.
C. Outlier Score of Each Point:
Previous methods considered the k-occurrence of a point as its outlier score; a smaller k-occurrence indicates a higher outlier score. The proposed system follows the existing system to calculate the outlier score of a point: the outlier score of a point P is the sum of the k-occurrence scores of the k nearest neighbors of P.

VI. CONCLUSION
Many applications need intrusion or outlier detection. Existing work concludes that reverse-nearest-neighbor outlier detection can use anti-hubs, but using anti-hubs for outlier detection is a computationally heavy task, and the computational complexity increases with the data dimensionality. To avoid this, the removal of irrelevant features before the application of reverse nearest neighbors is introduced. The actual results show that the proposed system maintains accuracy while reducing the time and memory requirements of outlier detection.

REFERENCES
[1] E. M. Knorr, R. T. Ng, and V. Tucakov, "Distance-based outliers: Algorithms and applications," VLDB J., vol. 8, nos. 3–4, pp. 237–253, 2000.
[2] A. Singh, H. Ferhatosmanoglu, and A. Saman Tosun, "High dimensional reverse nearest neighbor queries," in Proc. 12th ACM Conf. Inform. Knowl. Manage., 2003, pp. 91–98.
[3] K. Zhang, M. Hutter, and H. Jin, "A new local distance-based outlier detection approach for scattered real-world data," in Proc. 13th Pacific-Asia Conf. Knowl. Discovery Data Mining, 2009, pp. 813–822.
[4] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, "A detailed analysis of the KDD CUP 99 data set," in Proc. 2nd IEEE Symp. Comput. Intell. Secur. Defense Appl., 2009, pp. 1–6.
[5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," SIGMOD Rec., vol. 29, no. 2, pp. 93–104, 2000.
[6] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," SIGMOD Rec., vol. 29, no. 2, pp. 427–438, 2000.
[7] H.-P. Kriegel, M. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2008, pp. 444–452.
[8] C. C. Aggarwal, A. Hinneburg, and D. Keim, "On the surprising behavior of distance metrics in high dimensional spaces," in Proc. 8th Int. Conf. Database Theory, 2001, pp. 420–434.
[9] E. Knorr and R. Ng, "Finding intensional knowledge of distance-based outliers," in Proc. of the VLDB Conference, September 1999, pp. 211–222.
[10] E. Knorr and R. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proc. of the VLDB Conference, September 1998, pp. 392–403.
[11] M. Radovanovic, A. Nanopoulos, and M. Ivanovic, "Reverse nearest neighbors in unsupervised distance-based outlier detection," 2015.