An Efficient Temporal Query Search for Time-Sensitive Queries

NVSS Subrahmanyam 1, Manoj Kumar S.K.A 2
1 Final M.Tech Student, 2 Assistant Professor
1,2 Dept. of CSE, Pydah College of Engineering, Boyapalem, Visakhapatnam, AP, India
Abstract: In this paper we propose a novel approach for searching outsourced data. Outsourced data is usually encrypted before storage to preserve privacy, and traditional timestamp-based and Boolean approaches are not optimal and do not scale to large datasets. Our approach searches the encrypted information in the outsourced data by maintaining a search table that records the relation between each search keyword, the publishing date, and the documents related to it, together with the relevance score of the keyword with respect to each document; results are displayed to the user according to the resulting document ranking.

I. INTRODUCTION

Over the past three decades, probabilistic models of document retrieval have been studied comprehensively. These approaches can generally be characterized as methods of estimating the probability that a document is relevant to a user's query. One component of a probabilistic retrieval model is the indexing model, i.e., a model of how indexing terms are assigned to documents. We argue that current indexing models have not led to improved retrieval results, and we believe this is due to two unwarranted assumptions made by these models. We take a different approach, based on non-parametric estimation, that allows us to relax these assumptions. We have implemented our approach, and empirical results on two different collections and query sets are significantly better than the standard tf.idf method of retrieval.

We now take a brief look at some existing models of document indexing, beginning with the 2-Poisson model due to Bookstein and Swanson [1]. Its task was to assign a subset of the words contained in a document (the specialty words) as index terms, and its probability model was intended to identify useful indexing terms through differences in their rate of occurrence between documents that are elite for a given keyword, i.e., documents that would satisfy a user posing that single term as a query, and documents without the property of eliteness. The success of the 2-Poisson model has been somewhat limited, but it should be noted that an approximation due to Robertson, intended to behave similarly to the 2-Poisson model, has been quite successful [12]. Other researchers have proposed a mixture of more than two Poisson distributions in order to better fit the observed data. Margulis proposed the n-Poisson model and tested the idea empirically; the conclusion of this study was that a mixture of n Poisson distributions provides a very close fit to the data. In a certain sense this is not surprising: for large values of n, one can fit a very complex distribution arbitrarily closely with a mixture of n parametric models, provided one has enough data to estimate the parameters [18]. What is somewhat surprising is the closeness of fit for relatively small values of n reported by Margulis [10]. Nevertheless, the n-Poisson model has not brought about increased retrieval effectiveness in spite of its close fit to the data. In any event, the semantics of the underlying distributions are less obvious in the n-Poisson case than in the 2-Poisson case, where they model the concept of eliteness. Apart from the adequacy of the available indexing models, estimating the parameters of these models is a crucial problem; researchers have looked at this problem from a variety of perspectives, and we discuss several of these approaches.
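To make the 2-Poisson idea concrete, the following sketch (an illustrative addition, not part of any of the models cited above) scores how well a two-component Poisson mixture fits a term's per-document occurrence counts; the rates lam_elite and lam_non_elite and the mixing weight are assumed placeholder values.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing k occurrences under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def two_poisson_log_likelihood(counts, lam_elite, lam_non_elite, mix):
    """Log-likelihood of per-document term counts under a 2-Poisson mixture.

    mix is the assumed proportion of documents that are 'elite' for the term.
    """
    log_l = 0.0
    for k in counts:
        p = mix * poisson_pmf(k, lam_elite) + (1 - mix) * poisson_pmf(k, lam_non_elite)
        log_l += math.log(p)
    return log_l

# Example: occurrences of one term across six documents.
counts = [0, 0, 1, 0, 4, 5]
print(two_poisson_log_likelihood(counts, lam_elite=4.0, lam_non_elite=0.3, mix=0.3))
```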
In addition, as previously mentioned, many current indexing models make assumptions about the data that we feel are unwarranted: the parametric assumption, and the assumption that documents are members of pre-defined classes. In our approach we relax these two assumptions. Rather than making parametric assumptions, as is done in the 2-Poisson model where terms are assumed to follow a mixture of two Poisson distributions, the data will, as Silverman put it, be allowed to speak for themselves [16]. We feel it is unnecessary to construct a parametric model of the data when we have the actual data; instead, we rely on non-parametric methods.

II. RELATED WORK

Regarding the second assumption, the 2-Poisson model was originally based on the idea of eliteness [7]. It was assumed that a document elite for a given term would satisfy a user who posed that single term as a query. Since that time, the prevailing view has come to be that multiple-term queries are more practical. In general, this requires a combinatorial explosion of elite sets for all possible subsets of terms in the collection. We take the view that each query needs to be looked at individually and that documents will not necessarily fall cleanly into elite and non-elite sets. In order to relax these assumptions and to avoid the difficulties imposed by separate indexing and retrieval models, we have developed an approach to retrieval based on probabilistic language modeling. Our proposed approach provides a conceptually simple but explanatory model of retrieval.

A) Comparative Analysis

Traditional approaches work on basic topic similarity. In this study we use three classes of topic-similarity measures: association, correlation, and distance. In this section we describe each of these classes. However, we must first describe the term distributions used by these classes. These distributions are constructed over each topic overview using

P(t) = ntf(t) / Σ_{t'∈T} ntf(t'), so that Σ_{t∈T} P(t) = 1,
where ntf(t) is the normalised term frequency [2] of term t in the set of terms T taken from the topic overview. This set (i.e., all unique terms from the topic title, summary, description, and example relevant documents) is extracted and ranked by P(t), the probability that term t is relevant to that topic; we divide by the sum of all ntf(t) values to ensure the probabilities sum to one. We calculate the similarity between each pair of topics using the topic distributions and the association, correlation, and distance measures. Each measure is based on the intersection between the topic sets (i.e., terms that occur in both topic overviews). In the proposed approach, apart from topic similarity, we also consider time sensitivity as an important factor during search. In this paper we identify that, for a useful class of queries over news archives that we call time-sensitive queries, topic similarity is not sufficient for ranking documents; for such queries, the publication time of the documents is important and should be considered in conjunction with topic similarity to derive the final ranking of documents, whereas in the previous approach we considered the temporal relevance.
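As an illustration of the term distribution above, the following sketch builds P(t) from normalised term frequencies and compares two topic overviews on the intersection of their term sets. The overlap measure and all names are our own illustrative choices, standing in for the association, correlation, and distance measures the paper refers to.

```python
from collections import Counter

def term_distribution(terms):
    """P(t) = ntf(t) / sum of ntf over all terms in the topic overview."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def overlap_similarity(dist_a, dist_b):
    """Association-style measure: shared probability mass over the
    intersection of the two topic term sets (illustrative choice)."""
    shared = set(dist_a) & set(dist_b)
    return sum(min(dist_a[t], dist_b[t]) for t in shared)

topic1 = term_distribution("madrid bombing train attack madrid".split())
topic2 = term_distribution("madrid train station attack casualties".split())
print(overlap_similarity(topic1, topic2))
```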
B) Answering Time-Sensitive Queries with Language Models

To answer general time-sensitive queries, we want to identify not just the documents relevant to the query but also the relevant time slots. Craswell et al. [11] introduced a framework to complement the topical relevance of a document for a query with additional evidence. We build on this framework and on the idea of splitting a document d into a content component c_d and a temporal component t_d, where c_d is the content of d and t_d is the time when d was published. We can then express p(d|q) as the probability that c_d is topically relevant to q and that t_d is a time period relevant to q. Using basic rules of probability, we have

p(d|q) = p(c_d, t_d | q) = p(c_d | q) · p(t_d | c_d, q).

C) Answering Time-Sensitive Queries with BM25

Above we showed how to integrate the temporal relevance p(t_d|q) into language models. We now describe a similar integration into the probabilistic relevance model (PRM), a leading state-of-the-art approach suggested by Robertson et al. [6], [7], [14]. In defining the PRM, Robertson et al. state the following principle: to produce the optimal ranking of a set of documents as an answer for a query at hand, the documents should be ranked by the posterior probability of belonging to the relevance class R of the query. According to this principle, Robertson et al. showed that ranking the documents by the odds of their being observed in R produces the optimal ranking, and introduced the following general PRM framework:

p(R | d, q) ∝_q log( p(R | d, q) / p(R̄ | d, q) ) ∝_q log( p(d | R, q) / p(d | R̄, q) ),

where R̄ denotes the non-relevance class and ∝_q denotes rank equivalence for the query q.
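A minimal sketch of the decomposition above, assuming the topical probability p(c_d|q) and a temporal relevance p(t_d|q) are already available from some baseline retrieval model and from a distribution over time slots; the helper names and the example scores are hypothetical.

```python
def time_sensitive_score(topical_score, temporal_score):
    """Combine content relevance p(c_d|q) with time relevance p(t_d|q),
    mirroring p(d|q) = p(c_d|q) * p(t_d|c_d, q) under the simplifying
    assumption that t_d is independent of c_d given q."""
    return topical_score * temporal_score

# Hypothetical per-document scores: (doc_id, p(c_d|q), p(t_d|q)).
candidates = [("d1", 0.40, 0.10), ("d2", 0.30, 0.60), ("d3", 0.25, 0.50)]
ranked = sorted(candidates, key=lambda x: time_sensitive_score(x[1], x[2]), reverse=True)
print([doc for doc, _, _ in ranked])
```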
D) Answering Time-Sensitive Queries with Pseudo-Relevance Feedback

We now discuss our efforts to integrate temporal relevance into a pseudo-relevance feedback technique. Specifically, we focus on Indri's pseudo-relevance feedback technique, an adaptation of Lavrenko and Croft's relevance models. In the first stage of this technique, a baseline retrieval is performed to identify the top-k documents for the query at hand. According to Lavrenko and Croft, these top-k documents are then used to analyze the universe of unigram distributions and estimate the probability that a word w appears in a document relevant to the query; this estimated probability is used to select the top-m representative words or phrases most related to the query. In the second stage, a second retrieval with query expansion is performed using the identified words or phrases. To integrate time into this pseudo-relevance feedback mechanism, we can bias, in an appropriate manner, the choice of the top-k documents used in the first stage of query processing. We experimented with several ways to bias this choice of top-k documents based on time, but unfortunately none of these variations resulted in any gain in result accuracy for time-sensitive queries. It seems that a technique that re-ranks the most topically relevant documents and then mines them for other words related to the query does not discover any useful words that were not already captured by the original technique. The cause of this shortcoming is most likely that the techniques are limited by their exposure to the same sets of documents. For brevity, we omit any further discussion of this technique.
III. PROPOSED WORK
Queries issued over a news archive are sometimes about recent events or breaking news. Li and Croft [2] developed a time-sensitive approach for processing such recency queries. Their approach processes a recency query by computing traditional topic-relevance scores for each document and then boosting the scores of the most recent documents, so as to privilege recent articles over older ones. Language models [9] have been used as a successful approach to rank the documents in a collection according to their topical relevance to a query: to estimate the relevance of a document d to a query q, the conditional probability p(d|q) that d is topically relevant to q is computed. This retrieval model defines p(d|q) as being proportional to p(d) · p(q|d), where p(d) is the prior probability that d is relevant and p(q|d) is the probability that query q would be generated from document d. In the original language models and in later modifications, the prior p(d) is ignored, since it is assumed to be uniform and constant for all documents. For recency queries, Li and Croft suggest modifying the model to combine two elements, time relevance and topical relevance: specifically, they define the prior p(d) of document d as a function of the document creation date, so that recent documents are given a greater prior value than older documents.
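A small sketch of a recency prior in the spirit of Li and Croft's proposal [2], assuming an exponential decay of the prior with document age; the decay rate and the way the prior is folded into the final score are illustrative assumptions, not the exact published formulation.

```python
import math

def recency_prior(age_in_days, decay_rate=0.01):
    """Exponentially decaying prior p(d): newer documents get larger values.
    The decay rate is an assumed illustrative constant."""
    return math.exp(-decay_rate * age_in_days)

def recency_adjusted_score(topical_score, age_in_days):
    """Rank by p(d) * p(q|d): topical relevance weighted by the recency prior."""
    return recency_prior(age_in_days) * topical_score

docs = [("old_article", 0.80, 400), ("fresh_article", 0.60, 2)]
ranked = sorted(docs, key=lambda d: recency_adjusted_score(d[1], d[2]), reverse=True)
print([name for name, _, _ in ranked])
```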
In this paper we observe that, for an important class of queries over news archives that we call time-sensitive queries, topic similarity is not sufficient for ranking. For such queries, the publication time of the documents is important and should be considered in conjunction with topic similarity to derive the final document ranking. Most current methods for searching over large archives of timed documents incorporate time in a relatively crude manner: users can submit a keyword query, say "Madrid bombing", and restrict the results to articles written within a given time period, or alternatively sort the results by the publication date of the articles. Unfortunately, users do not always know the appropriate time intervals for their queries, and placing the burden on the users to explicitly handle time during querying is not desirable.
Fig. 1. Architecture
We propose an efficient search implementation that uses frequency computations and timestamps to produce effective results for a user-specified query, integrating the traditional relevance-score method with a timestamp-based approach. In this approach the data owner outsources the data to the server. The data owner has a collection of n data files C = (F1, F2, ..., Fn) that he wants to outsource to the server in encrypted form, while still keeping the capability to search through them for effective data utilization. To do so, before outsourcing, the data owner first builds a secure searchable index I from a set of m distinct keywords W = (w1, w2, ..., wm) extracted from the file collection C, and stores both the index I and the encrypted file collection C on the server. After the search, the retrieved data is organized according to the ranking.
A) Outsourcing of Documents
The data owner outsources the data to the server. Before hosting the data on the server, he extracts the unique keywords from the documents, encrypts the data for privacy and security, and maintains an index table containing the cipher keyword, document id, timestamp, and frequency of the search keyword for later retrieval by users.
B) Encrypted Search Implementation
When the user enters a query, it is first forwarded to the server. The user authenticates himself with his key, the query keyword is converted into its cipher form, and the server checks for an exact match in the index table; it then obtains the occurrences of the word (its frequency) in the documents, in cipher form, and retrieves the publishing time of each matching document.
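A minimal sketch of this lookup, assuming a deterministic keyword-encryption function (so equal keywords yield equal ciphertexts) and an in-memory index table; the hash-based encryption stand-in and all field names are illustrative assumptions, not the paper's actual construction.

```python
import hashlib

def cipher_keyword(keyword, key):
    """Deterministic stand-in for keyword encryption: a keyed hash."""
    return hashlib.sha256((key + keyword).encode()).hexdigest()

# Hypothetical index table: cipher keyword -> list of (doc_id, frequency, publish_time).
index_table = {
    cipher_keyword("bombing", "secret-key"): [("F1", 5, "2004-03-11"), ("F7", 2, "2004-03-15")],
}

def encrypted_search(query_keyword, key):
    """Convert the query keyword to cipher form and look up matching postings."""
    return index_table.get(cipher_keyword(query_keyword, key), [])

print(encrypted_search("bombing", "secret-key"))
```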
C) Algorithm for Index Table Generation

1. Read the document F.
2. Segment the document term-wise and encrypt each term with the key.
3. Calculate the term frequency (TF), inverse document frequency (IDF), and publishing time (P_T).
4. Generate the index table (I_table) and upload the files to the server.
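An illustrative sketch of the index-generation steps above, reusing the deterministic keyword-encryption stand-in from the earlier search sketch; the IDF follows the ln(1 + N/f_t) form used in the relevance score later in the paper, and all function and field names are our own assumptions.

```python
import hashlib
import math
from collections import Counter

def cipher_keyword(keyword, key):
    """Deterministic stand-in for term encryption (illustrative, not the paper's scheme)."""
    return hashlib.sha256((key + keyword).encode()).hexdigest()

def build_index_table(documents, key):
    """documents: dict of doc_id -> (text, publish_time).
    Returns: cipher term -> list of (doc_id, tf, idf, publish_time)."""
    n_docs = len(documents)
    doc_freq = Counter()                 # number of files containing each term (f_t)
    per_doc_counts = {}
    for doc_id, (text, _) in documents.items():
        counts = Counter(text.lower().split())
        per_doc_counts[doc_id] = counts
        doc_freq.update(counts.keys())

    index_table = {}
    for doc_id, (text, publish_time) in documents.items():
        for term, tf in per_doc_counts[doc_id].items():
            idf = math.log(1 + n_docs / doc_freq[term])
            entry = (doc_id, tf, idf, publish_time)
            index_table.setdefault(cipher_keyword(term, key), []).append(entry)
    return index_table

docs = {"F1": ("madrid bombing train bombing", "2004-03-11"),
        "F2": ("election results madrid", "2004-03-15")}
print(len(build_index_table(docs, "secret-key")))
```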
D) Ranking Implementation

In this module, for the query entered by the user, the system first calculates the number of occurrences of the keyword within each document (treated as the term frequency) and the number of occurrences of the keyword across all documents, and retrieves the distinct files that contain the search keyword.
After the results are decrypted from the server, they are ranked by their file relevance scores and organized for the user according to the score of the documents with respect to the keywords and their timestamps.
E) Algorithm for Rank Implementation
1. Read the query Q and the key for authentication.
2. Retrieve the relevant information from the index table (I_table).
3. Calculate the file relevance score F_score and the publishing time of the documents.
4. Sort the documents D(F1, F2, ..., Fn) by file relevance score and publishing time (P_T).
5. Return the documents.

File relevance score:

Score(Q, F_d) = Σ_{t∈Q} (1/|F_d|) · (1 + ln f_{d,t}) · ln(1 + N/f_t)

Here Q denotes the searched keywords; f_{d,t} denotes the TF of term t in file F_d; f_t denotes the number of files that contain term t; N denotes the total number of files in the collection; and |F_d| is the length of file F_d, obtained by counting the number of indexed terms, which functions as the normalization factor.
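A sketch of the relevance score above, computed from index-table postings like those produced by the earlier index-generation sketch; the posting format and variable names are illustrative assumptions.

```python
import math
from collections import defaultdict

def file_relevance_scores(postings_per_term, doc_lengths, n_files):
    """Score(Q, F_d) = sum over query terms of (1/|F_d|) * (1 + ln f_{d,t}) * ln(1 + N/f_t).

    postings_per_term: for each query term, a list of (doc_id, tf) postings.
    doc_lengths: doc_id -> number of indexed terms in the file (|F_d|).
    """
    scores = defaultdict(float)
    for postings in postings_per_term:
        f_t = len(postings)                       # number of files containing the term
        for doc_id, tf in postings:
            scores[doc_id] += (1.0 / doc_lengths[doc_id]) * (1 + math.log(tf)) * math.log(1 + n_files / f_t)
    return dict(scores)

postings = [[("F1", 2), ("F2", 1)]]               # one query term, found in two files
print(file_relevance_scores(postings, {"F1": 4, "F2": 3}, n_files=2))
```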
A comparative analysis of the traditional and proposed approaches is shown below.

Fig. Comparative analysis
IV. CONCLUSION

We presented an efficient search implementation combining a frequency-based approach with a timestamp-based approach. For a user query it first computes the file relevance score in terms of term frequency, number of distinct files, and index terms, and in addition it takes the timestamp of each document into account. As the comparative-analysis graph shows, this approach gives more efficient results than the traditional frequency-based approach.
REFERENCES

[1] R. Jones and F. Diaz, "Temporal Profiles of Queries," ACM Trans. Information Systems, vol. 25, no. 3, article 14, 2007.
[2] X. Li and W.B. Croft, "Time-Based Language Models," Proc. 12th ACM Conf. Information and Knowledge Management (CIKM '03), 2003.
[3] D. Metzler and W.B. Croft, "Combining the Language Model and Inference Network Approaches to Retrieval," Information Processing and Management, vol. 40, no. 5, pp. 735-750, Sept. 2004.
[4] S.E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau, "Okapi at TREC," Proc. Fourth Text REtrieval Conf. (TREC-4), 1994.
[5] S.E. Robertson, "Overview of the Okapi Projects," J. Documentation, vol. 53, no. 1, pp. 3-7, 1997.
[6] K.S. Jones, S. Walker, and S.E. Robertson, "A Probabilistic Model of Information Retrieval: Development and Comparative Experiments - Part 1," Information Processing and Management, vol. 36, no. 6, pp. 779-808, 2000.
[7] K.S. Jones, S. Walker, and S.E. Robertson, "A Probabilistic Model of Information Retrieval: Development and Comparative Experiments - Part 2," Information Processing and Management, vol. 36, no. 6, pp. 809-840, 2000.
[8] W. Dakka, L. Gravano, and P.G. Ipeirotis, "Answering General Time-Sensitive Queries," Proc. 17th ACM Conf. Information and Knowledge Management (CIKM '08), pp. 1437-1438, 2008.
[9] J.M. Ponte and W.B. Croft, "A Language Modeling Approach to Information Retrieval," Proc. 21st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '98), 1998.
[10] F. Song and W.B. Croft, "A General Language Model for Information Retrieval," Proc. Eighth ACM Conf. Information and Knowledge Management (CIKM '99), 1999.
[11] N. Craswell, S.E. Robertson, H. Zaragoza, and M. Taylor, "Relevance Weighting for Query Independent Evidence," Proc. 28th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '05), 2005.
[12] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proc. Seventh Int'l World Wide Web Conf. (WWW '98), 1998.
[13] V. Lavrenko and W.B. Croft, "Relevance-Based Language Models," Proc. 24th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '01), 2001.
[14] S.E. Robertson, "The Probability Ranking Principle in IR," Readings in Information Retrieval, pp. 281-286, Morgan Kaufmann, 1997.
[15] S.E. Robertson, S. Walker, and M. Hancock-Beaulieu, "Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive Track," Proc. Seventh Text REtrieval Conf. (TREC-7), 1998.
[16] N. Craswell, H. Zaragoza, and S.E. Robertson, "Microsoft Cambridge at TREC-14: Enterprise Track," Proc. 14th Text REtrieval Conf. (TREC-14), 2005.
[17] K. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, J. Klavans, A. Nenkova, C. Sable, B. Schiffman, and S. Sigelman, "Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster," Proc. Second Int'l Conf. Human Language Technology (HLT '02), 2002.
[18] R. Krovetz, "Viewing Morphology as an Inference Process," Proc. 16th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '93), 1993.
[19] F. Diaz, Personal Communication, 2007.
[20] E.M. Voorhees and D. Harman, "Overview of TREC-9," Proc. Ninth Text REtrieval Conf. (TREC-9), 2001.
BIOGRAPHIES
Nistala Venkata Satya Surya Subrahmanyam completed his MCA (Master of Computer Applications) at Rajah RSRK Ranga Rao College, Bobbili, Vizianagaram (Dist). He has since worked as Assistant Professor at Thandra Paparaya Institute of Science & Technology, Komatipalli (V), Bobbili (M), and is currently pursuing his M.Tech at Pydah College of Engineering & Technology, Gambheeram, Visakhapatnam, Andhra Pradesh. His research interests include Data Mining and Network Security.
S.K.A. Manoj received his B.Sc. degree at A.V.N. College, Andhra University, Visakhapatnam, his MCA at Bullayya College, Andhra University, Visakhapatnam, and his M.Tech in Computer Science and Engineering at Pydah College, JNTU Visakhapatnam. His areas of interest include Data Mining and Data Warehousing, Artificial Intelligence, Software Engineering, and Computer Networks. He is currently Assistant Professor in the Department of C.S.E at Pydah College of Engineering & Technology, Visakhapatnam.