Anna University, Chennai 600 025

ANNA UNIVERSITY, CHENNAI 600 025

BONAFIDE CERTIFICATE
Certified that this project report titled "A CONTEXT SENSITIVE INDEXING MODEL
FOR INFORMATION RETRIEVAL" is the bonafide work of MS. M.VANISHREE
(921312412018), who carried out the research under my supervision. Certified further that,
to the best of my knowledge, the work reported herein does not form part of any other
project report or dissertation on the basis of which a degree or award was conferred on an
earlier occasion on this or any other candidate.

SIGNATURE
Dr.A.VINCENT ANTONY KUMAR, Ph.D.,
HEAD OF THE DEPARTMENT
DEPT OF IT
PSNA COLLEGE OF ENGG & TECH
DINDIGUL

SIGNATURE
Ms.R.SUDHA, M.E.,
SUPERVISOR
ASSOCIATE PROF/IT
PSNA COLLEGE OF ENGG & TECH
DINDIGUL

Submitted for the Project Viva Voce examination held on ____________________

Internal Examiner External Examiner

ABSTRACT
Document summarization condenses an original document into a shorter version, so
that a reader can grasp the content of the whole document from the summary alone.
Sentence-extraction-based single-document summarization uses a graph-based
algorithm to calculate the saliency of each sentence in the document, and the most
salient sentences are extracted to build the summary. In the existing model, however,
the indexing weight of a term remains independent of the other terms appearing in
the document, and the context in which the term occurs is overlooked when the
indexing weight is assigned. Context-independent document indexing is therefore the
problem that arises in the existing model.
Context-independent document indexing can be addressed by using a context-sensitive
document indexing model. A document normally consists of topical and non-topical
terms, and the ratio of topical words is higher in a summary of a document than in
the original document. Under a lexical association metric, the association between
two topical terms should be higher than the association between two non-topical
terms or between a topical and a non-topical term.
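As an illustration of the lexical association hypothesis, the following is a minimal sketch. The abstract does not name the metric, so pointwise mutual information (PMI) over sentence-level co-occurrence is assumed here; the function name `pmi_table` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """Return PMI for every unordered word pair that co-occurs in a sentence.

    PMI is an assumed stand-in for the lexical association metric.
    Probabilities are estimated as the fraction of sentences containing
    a word (or both words of a pair).
    """
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for sent in sentences:
        words = set(sent)
        word_count.update(words)
        pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))
    pmi = {}
    for pair, c in pair_count.items():
        a, b = tuple(pair)
        p_ab = c / n
        p_a = word_count[a] / n
        p_b = word_count[b] / n
        pmi[pair] = math.log(p_ab / (p_a * p_b))
    return pmi

# Toy corpus: "index" and "weight" play the role of topical terms,
# "the" and "of" the role of non-topical terms.
sentences = [
    ["index", "weight", "the"],
    ["index", "weight", "of"],
    ["the", "of", "term"],
    ["the", "cat", "of"],
]
pmi = pmi_table(sentences)
topical = pmi[frozenset(("index", "weight"))]
non_topical = pmi[frozenset(("the", "of"))]
mixed = pmi[frozenset(("index", "the"))]
print(topical > non_topical, topical > mixed)
```

In this toy corpus the topical pair receives a higher association score than both the non-topical pair and the mixed pair, which is the ordering the hypothesis expects.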
K-means clustering is used to determine the context-sensitive indexing weights. The
context-based word indexing weights are then used to calculate the similarity
between any two sentences. To calculate sentence similarity, the document is first
split into its sentences. Using the sentence similarity values, the document is
summarized efficiently, which improves performance on the document summarization
task. Once the document is summarized, information is retrieved according to the
user's queries. Information retrieval is performed using the context-sensitive
document indexing approach.
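The sentence similarity step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the project: the context-based word weights are supplied as a hand-written, hypothetical dictionary (in the project they would come from the indexing model), cosine similarity is assumed as the similarity measure, and the helper names `split_sentences`, `weighted_vector` and `cosine` are illustrative.

```python
import math
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def weighted_vector(sentence, weights):
    """Map each word to its accumulated indexing weight (unknown words get 0)."""
    vec = {}
    for tok in re.findall(r"[a-z]+", sentence.lower()):
        vec[tok] = vec.get(tok, 0.0) + weights.get(tok, 0.0)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical context-based weights: topical words high, function words low.
weights = {"indexing": 0.9, "weights": 0.8, "context": 0.7, "the": 0.05}

text = ("Indexing weights depend on context. "
        "Context shapes the indexing weights. "
        "The weather is nice.")
sentences = split_sentences(text)
vectors = [weighted_vector(s, weights) for s in sentences]
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(len(sentences), sim_01 > sim_02)
```

Sentences that share highly weighted topical words score as similar, while a sentence sharing no weighted words scores zero. A graph-based summarizer could use these pairwise values as edge weights.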

ACKNOWLEDGEMENT
I take this opportunity to express my sincere gratitude to the people who have
been instrumental in the successful completion of the first phase of my project.
Firstly, I thank GOD ALMIGHTY, without whose blessings this project would not
have become a reality.
I would like to express my sincere thanks to (Late)
Thiru.R.S.KOTHANDARAMAN, founder of my institution, Chairperson
Smt.K.DHANALAKSHMI AMMAL and the respected DIRECTORS, who gave
me constant support in completing the first phase of my project.
I would like to express my thanks to our Principal, Dr.S.SAKTHIVEL, M.E.,
Ph.D., for his support throughout my project.
I express my thanks to our Head of the Department, Dr.A.VINCENT ANTONY
KUMAR, Ph.D., for his attention and encouragement towards my project work.
I also express my sincere gratitude to my project guide, Ms.R.SUDHA, M.E.,
Associate Professor in the Department of Information Technology, for the keen
interest she has shown from time to time.
I would also like to express my gratitude to our project coordinator,
Dr.P.GANESHKUMAR, M.E., Ph.D., for his help during my project work. I also
thank my parents and friends for their constant encouragement and moral support
throughout my venture.

TABLE OF CONTENTS

CHAPTER NO TITLE PAGE NO

ABSTRACT iii

LIST OF FIGURES viii

LIST OF ABBREVIATIONS ix
1 INTRODUCTION

1.1 Overview 1

1.1.1 Foundation to Data Mining 3

1.1.2 Data Mining classes 4

1.1.3 Types of Data Mining 5

1.1.4 Architecture of Data Mining 8

1.1.5 Advantages and disadvantages 9

1.2 Introduction to Text Mining 9

1.2.1 Information Filtering 10

1.2.2 Collecting document 10

1.2.3 Document standardization 11

1.2.4 Document classification 11

1.2.5 Information Retrieval 12

1.2.6 Clustering Document by similarity 13

1.2.7 Tokenization 13

1.2.8 Clustering & Organizing Document 14

1.2.9 Lemmatization 14

1.2.10 Inflectional stemming 14

1.2.11 Document similarity 16
2 LITERATURE SURVEY 17

2.1 Single-document and multi-document summarizations. 17

2.2 Word classification and hierarchy 18

2.3 TSCAN 19

2.4 Document summarization and keyword extraction 20

2.5 Text summarization. 21

2.6 Graph based Multi document Summarizations. 22

2.7 A language independent algorithm for single and multiple document summarizations 23

2.8 Using topic themes for multi-document summarization 24

2.9 Word clustering and disambiguation based on co-occurrence data 25

2.10 Multi-document summarization via the minimum dominating set 26

2.11 Centroid-based summarization 27
3 PROJECT DESCRIPTION 28

3.1 Modules 28

3.2 Module description 28

3.2.1 Lexical association. 28

3.2.2 Context based word indexing 29

3.2.3 Sentences similarity value calculation. 30

3.2.4 Information retrieval 30
4 SYSTEM ANALYSIS 31

4.1 Existing System 31

4.2 Proposed System 32
5 SYSTEM DESCRIPTION 33

5.1 System Architecture 34

5.2 System Environment 35

5.3 Software Description 36
6 CONCLUSION AND FUTURE ENHANCEMENT 42

REFERENCES 43

LIST OF FIGURES

FIGURE NO FIGURE NAME PAGE NO
1.1 Architecture for data mining 8
1.2 Generality of basic technique 10
1.2.5 Retrieving matched documents 12
1.2.6 Computing similarity of two documents 13
5.1 System Architecture 34

LIST OF ABBREVIATIONS

SERIAL NO ABBREVIATION EXPANSION
1. KDD Knowledge discovery in databases.
2. NLP Natural language processing.
3. LDC Linguistic Data Consortium.
4. TEI Text Encoding Initiative.
5. IR Information retrieval.
6. TF Term frequency.
7. IDF Inverse document frequency.
8. C# C Sharp.
9. IDE Integrated Development Environment.
10. CBSU Cluster-based sentence utility.
11. CSIS Cross-sentence informational subsumption.
12. MLE Maximum likelihood estimation.
13. MMR Maximal marginal relevance.
