This document is a bona fide certificate for a project report titled "A CONTEXT SENSITIVE INDEXING MODEL FOR INFORMATION RETRIEVAL" submitted by Ms. M. VANISHREE. It certifies that the work was carried out under the supervision of Dr. A. VINCENT ANTONY KUMAR and Ms. R. SUDHA. The document also acknowledges and thanks the people involved in the successful completion of the project.
BONAFIDE CERTIFICATE
Certified that this project report titled "A CONTEXT SENSITIVE INDEXING MODEL FOR INFORMATION RETRIEVAL" is the bonafide work of Ms. M. VANISHREE (921312412018), who carried out the research under my supervision. Certified further that, to the best of my knowledge, the work reported herein does not form part of any other project report or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.
SIGNATURE
Dr.A.VINCENT ANTONY KUMAR, Ph.D.,
HEAD OF THE DEPARTMENT
DEPT OF IT
PSNA COLLEGE OF ENGG & TECH
DINDIGUL

SIGNATURE
Ms.R.SUDHA, M.E.,
SUPERVISOR
ASSOCIATE PROF/IT
PSNA COLLEGE OF ENGG & TECH
DINDIGUL
Submitted for the Project Viva Voce examination held on____________________
Internal Examiner External Examiner
ABSTRACT
Document summarization condenses an original document into a shorter version, so that by reading the summary the user gains knowledge of the whole document. Sentence-extraction-based single-document summarization uses a graph-based algorithm to calculate the saliency of each sentence in the document, and the most salient sentences are extracted to build the summary. In the existing model, however, the indexing weight of a term remains independent of the other terms appearing in the document, and the context in which the term occurs is overlooked when the indexing weight is assigned. Context-independent document indexing is therefore the problem that arises in the existing model, and it can be solved by using a context-sensitive document indexing model. A document normally consists of topical and non-topical terms, and the ratio of topical words is higher in a summary of a document than in the original document. Under a lexical association metric, the lexical association between two topical terms should be higher than that between two non-topical terms or between a topical and a non-topical term. K-means clustering is used to determine the context-sensitive indexing weights. These context-based word indexing weights are used to calculate the similarity between any two sentences; to do so, the document is first split into sentences. Using the sentence similarity values, the document is summarized efficiently, which improves performance on the document summarization task. Once the document is summarized, information is retrieved according to user queries using the context-sensitive document indexing approach.
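As a rough illustration of the sentence-similarity and saliency steps described in the abstract, the following is a minimal Python sketch, not the implementation used in this project. It assumes the context-sensitive indexing weights are already available as a plain dictionary (the `weights` argument is hypothetical, and the k-means step that would produce it is not shown). Cosine similarity and a sum-of-similarities saliency score are stand-ins chosen for brevity; the report does not state which similarity measure or graph algorithm is used.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def weighted_vector(sentence, weights):
    """Map a sentence to {term: count * indexing weight}.

    `weights` is an assumed, precomputed term -> weight dictionary;
    terms without a positive weight are dropped.
    """
    counts = Counter(re.findall(r"[a-z]+", sentence.lower()))
    return {t: c * weights[t] for t, c in counts.items() if weights.get(t, 0.0) > 0.0}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def summarize(text, weights, k=1):
    """Return the k most salient sentences, in document order."""
    sentences = split_sentences(text)
    vectors = [weighted_vector(s, weights) for s in sentences]
    # Saliency (assumption): sum of similarities to every other sentence.
    scores = [
        sum(cosine(vectors[i], vectors[j]) for j in range(len(sentences)) if j != i)
        for i in range(len(sentences))
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(text, weights, k=2)` returns the two sentences that share the most heavily weighted terms with the rest of the document; a sentence made up only of unweighted words scores zero and is never preferred over a sentence with weighted terms.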
ACKNOWLEDGEMENT
I take this opportunity to express my sincere gratitude to the people who have been instrumental in the successful completion of the first phase of my project. Firstly, I thank GOD ALMIGHTY, without whose blessings this project would not have become a reality. I would like to express my sincere thanks to (Late) Thiru.R.S.KOTHANDARAMAN, founder of my institution, Chairperson Smt.K.DHANALAKSHMI AMMAL and the respected DIRECTORS, who gave me constant support in completing the first phase of my project. I would like to express my thanks to our Principal, Dr.S.SAKTHIVEL, M.E., Ph.D., for his support throughout my project. I express my thanks to our Head of the Department, Dr.A.VINCENT ANTONY KUMAR, Ph.D., for his attention and encouragement towards my project work. I also express my sincere gratitude to my project guide, Ms.R.SUDHA, M.E., Associate Professor in the Department of Information Technology, for the keen interest she has shown from time to time. I would also like to express my gratitude to our project coordinator, Dr.P.GANESHKUMAR, M.E., Ph.D., for his help during my project work. I also thank my parents and friends for their constant encouragement and moral support throughout my venture.
TABLE OF CONTENTS
CHAPTER NO TITLE PAGE NO
ABSTRACT iii
LIST OF FIGURES viii
LIST OF ABBREVIATIONS ix
1 INTRODUCTION
1.1 Overview 1
1.1.1 Foundation to Data Mining 3
1.1.2 Data Mining classes 4
1.1.3 Types of Data Mining 5
1.1.4 Architecture of Data Mining 8
1.1.5 Advantages and disadvantages 9
1.2 Introduction to Text Mining 9
1.2.1 Information Filtering 10
1.2.2 Collecting document 10
1.2.3 Document standardization 11
1.2.4 Document classification 11
1.2.5 Information Retrieval 12
1.2.6 Clustering Document by similarity 13
1.2.7 Tokenization 13
1.2.8 Clustering & Organizing Document 14
1.2.9 Lemmatization 14
1.2.10 Inflectional stemming 14
1.2.11 Document similarity 16
2 LITERATURE SURVEY 17
2.1 Single-document and multi-document summarization 17
2.2 Word classification and hierarchy 18
2.3 TSCAN 19
2.4 Document summarization and keyword extraction 20
2.5 Text summarization 21
2.6 Graph-based multi-document summarization 22
2.7 A language independent algorithm for single and multiple document summarization 23
2.8 Using topic themes for multi-document summarization 24
2.9 Word clustering and disambiguation based on co-occurrence data 25
2.10 Multi-document summarization via the minimum dominating set
3.2.3 Sentence similarity value calculation 30
3.2.4 Information retrieval 30
4 SYSTEM ANALYSIS 31
4.1 Existing System 31
4.2 Proposed System 32
5 SYSTEM DESCRIPTION 33
5.1 System Architecture 34
5.2 System Environment 35
5.3 Software Description 36
6 CONCLUSION AND FUTURE ENHANCEMENT 42
REFERENCES 43
LIST OF FIGURES
FIGURE NO FIGURE NAME PAGE NO
1.1 Architecture for data mining 8
1.2 Generality of basic technique 10
1.2.5 Retrieving matched documents 12
1.2.6 Computing similarity of two documents 13
5.1 System Architecture 34
LIST OF ABBREVIATIONS
SERIAL NO ABBREVIATION EXPANSION
1. KDD Knowledge discovery in databases
2. NLP Natural language processing
3. LDC Linguistic Data Consortium
4. TEI Text Encoding Initiative
5. IR Information retrieval
6. TF Term frequency
7. IDF Inverse document frequency
8. C# C Sharp
9. IDE Integrated Development Environment
10. CBSU Cluster-based sentence utility
11. CSIS Cross-sentence informational subsumption
12. MLE Maximum likelihood estimation
13. MMR Maximal marginal relevance