
Agenda

Introduction
Index representation
Size of index
What causes increased size of index
Compression technique
Two types of document numbers in index
Two address tables
Comparison of compression technique
Application of indexing
How search engine maintains index
Building a dictionary of terms
Layout of index
End of index and challenges

Introduction
Information retrieval today means retrieving information from huge databases that may hold terabytes of data or more. To find a piece of information in such a database we cannot rely on linear search, which is far too slow for real-time applications. One solution is indexing.

We will talk about how indexing helps us retrieve data from a huge database. As the data grows day by day, the index itself becomes large, so we will discuss a compression technique for the index itself. Finally, we will discuss how indexing is used in search engines such as Google, AltaVista and Excite to retrieve information.

Index representation

size of index
Bits per keyword entry

where N is the number of records in the collection.
Total size = total number of bits over all keyword entries.

What causes increased size of index


The primary cause is the document numbers stored in the index.
These document numbers can be very long, and each search keyword has its own collection of document numbers, so the size of the index grows.
The solution is to use a compression technique.

Compression technique
Steps:
1. Divide the document numbers into two groups:
   1) document numbers that are not repetitive, e.g. 24567
   2) document numbers that are repetitive, e.g. 222223
2. Apply the compression technique to the repetitive numbers only.
3. First reduce the repetitive document number, e.g. 222222331 becomes 2B331.
4. Represent the reduced document number in binary for storage, according to table 6. E.g. for 2B331:
   binary representation without table: 1101001111101101011111111011
   binary representation with table: 10 1011 0011 0011 0001
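The worked example above can be checked with a short sketch. It assumes that each symbol of the reduced number is stored as a 4-bit group with leading zeros dropped, which is consistent with the bit string shown. The run table itself is not reproduced in these slides, so `RUN_CODES` is hypothetical: its single entry (a run of six digits is written as the digit followed by `B`) is only inferred from the example 222222331 -> 2B331. The function names are illustrative too.

```python
from itertools import groupby

# Hypothetical run table: only this one entry can be inferred from the
# example 222222331 -> 2B331; the real table is not shown in the slides.
RUN_CODES = {6: "B"}


def reduce_runs(doc_number: str, run_codes=RUN_CODES) -> str:
    """Replace each run of repeated digits that has a code in the table."""
    out = []
    for digit, group in groupby(doc_number):
        run_length = len(list(group))
        if run_length in run_codes:
            out.append(digit + run_codes[run_length])
        else:
            out.append(digit * run_length)  # no code known: keep the digits
    return "".join(out)


def to_nibbles(symbols: str) -> str:
    """Encode every symbol as 4 bits (assumption), then drop leading zeros."""
    bits = "".join(format(int(symbol, 16), "04b") for symbol in symbols)
    return bits.lstrip("0")


reduced = reduce_runs("222222331")          # "2B331"
plain = format(int("222222331"), "b")       # binary of the whole number
packed = to_nibbles(reduced)                # binary of the reduced form

print(reduced, len(plain), len(packed))
```

For this one document number the plain binary form takes 28 bits and the reduced form takes 18 bits.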


Two types of Document numbers in index


Compressible document number
Non compressible document number

The address table is divided in two: compressible document numbers have one address table and non-compressible document numbers have a separate one.
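A minimal sketch of this split is given here. The slides do not state the rule that decides whether a document number counts as repetitive, so the `min_run` threshold and both function names are assumptions made for illustration only.

```python
from itertools import groupby


def longest_run(doc_number: str) -> int:
    """Length of the longest run of one repeated digit (non-empty input)."""
    return max(len(list(group)) for _, group in groupby(doc_number))


def split_address_tables(doc_numbers, min_run=3):
    """Sort document numbers into two tables.

    min_run is a hypothetical threshold: a number is treated as
    compressible when some digit repeats at least min_run times in a row.
    """
    compressible, non_compressible = [], []
    for number in doc_numbers:
        if longest_run(number) >= min_run:
            compressible.append(number)
        else:
            non_compressible.append(number)
    return compressible, non_compressible


compressible, non_compressible = split_address_tables(
    ["24567", "222223", "222222331"]
)
print(compressible)
print(non_compressible)
```

With the example numbers from the compression steps, 24567 goes to the non-compressible table and the other two go to the compressible one.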

Two Address tables


Non compressible document number

Compressible document number

Comparison Of compression technique

Application of indexing
Search engines

How search engine maintains index

Building a dictionary of terms


Steps:
1. Extract tokens from the documents by fragmenting each document.
2. Analyse the tokens discovered.

Methods for analysing tokens:
1. Case folding.
2. Stemming using Porter's algorithm.
3. Elimination of white space.
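The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method any particular search engine uses: `crude_stem` is a deliberately simple suffix stripper standing in for Porter's algorithm (it is not Porter's algorithm), and the function names and sample documents are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Fragment a document into tokens; empty pieces (extra white space) are dropped."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def crude_stem(token):
    """Stand-in for a stemmer: strips a few suffixes. NOT Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_dictionary(docs):
    """Map each analysed term to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            term = crude_stem(token.lower())  # case folding, then stemming
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "Indexing  speeds searching",
    2: "Search engines index documents",
}
dictionary = build_dictionary(docs)
print(dictionary)
```

After analysis, "Indexing" and "index" fall under the same term, so one dictionary entry lists both documents.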

Layout of index

The layout describes how the index is distributed across the different machines. There are two methods:
1. Document-based partitioning
2. Term-based partitioning
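The difference between the two methods can be sketched on a tiny index. The assignment rules used here (document number modulo the number of machines, and a CRC32 hash of the term) are assumptions chosen for illustration; the slides do not say how documents or terms are assigned to machines.

```python
import zlib


def partition_by_document(index, num_shards):
    """Each machine holds the postings of its own documents, for every term."""
    shards = [dict() for _ in range(num_shards)]
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            # Assumed rule: document number modulo the number of machines.
            shards[doc_id % num_shards].setdefault(term, []).append(doc_id)
    return shards


def partition_by_term(index, num_shards):
    """Each machine holds the complete posting list of its own terms."""
    shards = [dict() for _ in range(num_shards)]
    for term, doc_ids in index.items():
        # Assumed rule: a stable hash of the term picks the machine.
        shard = zlib.crc32(term.encode("utf-8")) % num_shards
        shards[shard][term] = list(doc_ids)
    return shards


index = {"index": [1, 2, 3], "search": [2, 4], "engine": [3]}
by_doc = partition_by_document(index, 2)
by_term = partition_by_term(index, 2)
print(by_doc)
print(by_term)
```

With document-based partitioning a query for one term has to be sent to every machine and the results merged; with term-based partitioning it goes only to the machine that owns that term.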

End of index and challenges


Challenges in search engines:
1. Speed of query retrieval.
2. Overhead of the increasing size of the index.
3. Cost.

Thank You