Lucene lecture at Pisa
http://lucene.sourceforge.net/talks/pisa/

Lucene

Doug Cutting
cutting@apache.org

November 24, 2004
University of Pisa

Prelude
my background...
please interrupt with questions
blog this talk now so that we can search for it later
(using a Lucene-based blog search engine, of course)
In this course, Paolo and Antonio have presented many techniques.
I present real software that uses many of these techniques.

Lucene is
software library for search
open source
not a complete application
set of java classes
active user and developer communities
widely used, e.g., by IBM and Microsoft

Lucene Architecture

[draw on whiteboard for reference throughout talk]

Lucene API
org.apache.lucene.document
org.apache.lucene.analysis
org.apache.lucene.index
org.apache.lucene.search

Package: org.apache.lucene.document
A Document is a sequence of Fields.
A Field is a <name, value> pair.
name is the name of the field, e.g., title, body, subject, date, etc.
value is text.
Field values may be stored, indexed or analyzed (and, now, vectored).

Example
public Document makeDocument(File f) throws FileNotFoundException {
  Document doc = new Document();
  doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));

  doc.add(new Field("modified",
      DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
      Store.YES, Index.UN_TOKENIZED));

  // Reader implies Store.NO and Index.TOKENIZED
  doc.add(new Field("contents", new FileReader(f)));

  return doc;
}

Example (continued)
field     stored  indexed  analyzed
path      yes     yes      yes
modified  yes     yes      no
contents  no      yes      yes

Package: org.apache.lucene.analysis
An Analyzer is a TokenStream factory.
A TokenStream is an iterator over Tokens.
input is a character iterator (Reader)
A Token is a tuple <text, type, start, length, positionIncrement>
text (e.g., “pisa”).
type (e.g., “word”, “sent”, “para”).
start & length offsets, in characters (e.g., <5,4>)
positionIncrement (normally 1)
standard TokenStream implementations are
Tokenizers, which divide characters into tokens, and
TokenFilters, e.g., stop lists, stemmers, etc.

Example
public class ItalianAnalyzer extends Analyzer {
  private Set stopWords =
    StopFilter.makeStopSet(new String[] {"il", "la", "in"});

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopWords);
    result = new SnowballFilter(result, "Italian");
    return result;
  }
}

Package: org.apache.lucene.index
Term is <fieldName, text>
index maps Term → <df, <docNum, <position>* >*>
e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
new: term vectors!

Example
IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
  writer.addDocument(makeDocument(files[i]));
}
writer.close();

Some Inverted Index Strategies


1. batch-based: use file-sorting algorithms (textbook)
+ fastest to build
+ fastest to search
- slow to update
2. b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
+ fast to search
- update/build does not scale
- complex implementation
3. segment based: lots of small indexes (Verity)
+ fast to build
+ fast to update
- slower to search

Lucene's Index Algorithm


two basic algorithms:
1. make an index for a single document
2. merge a set of indices

incremental algorithm:
maintain a stack of segment indices
create index for each incoming document
push new indexes onto the stack

let b=10 be the merge factor; M=∞

for (size = 1; size < M; size *= b) {
  if (there are b indexes with size docs on top of the stack) {
    pop them off the stack;
    merge them into a single index;
    push the merged index onto the stack;
  } else {
    break;
  }
}

optimization: single-doc indexes kept in RAM, saves system calls

notes:
average b·log_b(N)/2 indexes
N=1M, b=2 gives just 20 indexes
fast to update and not too slow to search
batch indexing w/ M=∞, merge all at end
equivalent to external merge sort, optimal
segment indexing w/ M<∞
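
As a rough illustration of the incremental loop above, here is a minimal, self-contained Java sketch. SegmentStack, its Index record, and the inline merge step are hypothetical stand-ins for Lucene's segment machinery, not its actual code.

import java.util.ArrayDeque;
import java.util.Deque;

public class SegmentStack {
  static final int B = 10;               // merge factor b
  static final long M = Long.MAX_VALUE;  // M=∞: no cap on segment size

  // Hypothetical single-segment index; only its document count matters here.
  record Index(int docCount) {}

  final Deque<Index> stack = new ArrayDeque<>();

  void addDocument() {
    stack.push(new Index(1));            // index for one incoming document
    for (long size = 1; size < M; size *= B) {
      if (!topHasRun(size)) break;       // need b indexes of this size on top
      int docs = 0;
      for (int i = 0; i < B; i++) docs += stack.pop().docCount();
      stack.push(new Index(docs));       // merge them and push the result back
    }
  }

  private boolean topHasRun(long size) {
    int seen = 0;
    for (Index idx : stack) {            // ArrayDeque iterates from the top
      if (idx.docCount() != size) break;
      if (++seen == B) return true;
    }
    return false;
  }
}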

Indexing Diagram
b=3
11 documents indexed
stack has four indexes
grayed indexes have been deleted
5 merges have occurred

Index Compression

For keys in the Term → ... map, use the technique from Paolo's slides:

For values in the Term → ... map, use the technique from Paolo's slides:

VInt Encoding Example

Value    First byte  Second byte  Third byte

0        00000000
1        00000001
2        00000010
...
127      01111111
128      10000000    00000001
129      10000001    00000001
130      10000010    00000001
...
16,383   11111111    01111111
16,384   10000000    10000000     00000001
16,385   10000001    10000000     00000001
...

This provides compression while still being efficient to decode.
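
To make the table concrete, here is a small, self-contained Java sketch of VInt encode/decode (low seven bits per byte, least-significant group first, high bit set on every byte except the last). It illustrates the format only; it is not Lucene's actual implementation.

import java.io.ByteArrayOutputStream;

public class VInt {
  // Encode a non-negative int, least-significant 7 bits first;
  // the high bit of each byte means "another byte follows".
  static byte[] encode(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
    return out.toByteArray();
  }

  // Decode a VInt from the start of buf.
  static int decode(byte[] buf) {
    int value = 0, shift = 0, i = 0;
    byte b;
    do {
      b = buf[i++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  public static void main(String[] args) {
    for (int v : new int[] {0, 127, 128, 16383, 16384}) {
      byte[] bytes = encode(v);
      System.out.println(v + " -> " + bytes.length + " byte(s), decodes to " + decode(bytes));
    }
  }
}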

Package: org.apache.lucene.search
primitive queries:
TermQuery: match docs containing a Term
PhraseQuery: match docs w/ sequence of Terms
BooleanQuery: match docs matching other queries.
e.g., +path:pisa +content:“Doug Cutting” -path:nutch
new: SpanQuery
derived queries:
PrefixQuery, WildcardQuery, etc.

Example
Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel"));

PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));

BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT);

Search Algorithms
From Paolo's slides:

Lucene's Disjunctive Search Algorithm


described in http://lucene.sf.net/papers/riao97.ps
since all postings must be processed
goal is to minimize per-posting computation
merges postings through a fixed-size array of accumulator buckets
performs boolean logic with bit masks
scales well with large queries

[draw a diagram to illustrate?]
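
In lieu of a diagram, here is a heavily simplified Java sketch of the bucket idea; the names and the flat int[][] postings representation are assumptions for illustration, not Lucene's code. Documents are scored window by window; each bucket accumulates a score and a bit mask recording which query terms hit that document, so boolean constraints can be tested with cheap mask operations.

import java.util.Arrays;

public class DisjunctiveMatch {
  // postings[t] is the sorted doc-id list for query term t (weights omitted; up to 32 terms).
  static void match(int[][] postings, int maxDoc) {
    final int WINDOW = 1024;                  // fixed number of accumulator buckets
    float[] score = new float[WINDOW];
    int[] mask = new int[WINDOW];             // bit t set => term t matched this doc
    int[] cursor = new int[postings.length];  // current position in each postings list

    for (int base = 0; base < maxDoc; base += WINDOW) {
      Arrays.fill(score, 0f);
      Arrays.fill(mask, 0);
      for (int t = 0; t < postings.length; t++) {
        int[] docs = postings[t];
        while (cursor[t] < docs.length && docs[cursor[t]] < base + WINDOW) {
          int bucket = docs[cursor[t]] - base;
          score[bucket] += 1.0f;              // a real scorer adds a per-term weight
          mask[bucket] |= 1 << t;
          cursor[t]++;
        }
      }
      for (int i = 0; i < WINDOW; i++) {
        if (mask[i] != 0) {                   // boolean logic would test the mask here
          System.out.println("doc " + (base + i) + " score " + score[i]);
        }
      }
    }
  }
}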

Lucene's Conjunctive Search Algorithm


From Paolo's slides:

Algorithm

use a linked list of pointers into the doc lists,
initially sorted by doc

loop:
  if all are at the same doc, record a hit
  skip first to-or-past last and move it to the end of the list
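
A toy Java sketch of this loop, with a hypothetical Cursor standing in for Lucene's term-docs iterator and its skip machinery. Because each advanced cursor is appended at the end, the list stays ordered by current doc, so first == last implies all cursors agree.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

public class ConjunctiveMatch {
  // Cursor over one sorted doc-id list; advanceTo skips to-or-past target.
  static class Cursor {
    final int[] docs;
    int pos = 0;
    Cursor(int[] docs) { this.docs = docs; }
    int doc() { return pos < docs.length ? docs[pos] : Integer.MAX_VALUE; }
    void advanceTo(int target) { while (doc() < target) pos++; }
  }

  static void match(int[][] postings) {
    if (postings.length == 0) return;
    List<Cursor> sorted = new ArrayList<>();
    for (int[] p : postings) sorted.add(new Cursor(p));
    sorted.sort(Comparator.comparingInt(Cursor::doc));  // initially sorted by doc
    Deque<Cursor> list = new ArrayDeque<>(sorted);

    while (true) {
      Cursor first = list.pollFirst();
      Cursor last = list.isEmpty() ? first : list.peekLast();
      if (last.doc() == Integer.MAX_VALUE) return;      // a list is exhausted: done
      if (first.doc() == last.doc()) {                  // all are at the same doc
        System.out.println("hit: doc " + first.doc());
        first.advanceTo(last.doc() + 1);                // step past the hit
      } else {
        first.advanceTo(last.doc());                    // skip first to-or-past last
      }
      list.addLast(first);                              // move first to end of list
    }
  }
}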

Scoring
From Paolo's slides:

It is very much like Lucene's Similarity.
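
For reference, the classic Lucene Similarity is documented (in later releases) as roughly:

score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} tf(t,d) · idf(t)² · boost(t) · norm(t,d)

with tf(t,d) = sqrt(freq) and idf(t) = 1 + log(numDocs / (docFreq(t) + 1)) by default.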

Lucene's Phrase Scoring


approximate phrase IDF with sum of terms
compute actual tf of phrase
slop penalizes slight mismatches by edit-distance
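
For example, the slop is set directly on a PhraseQuery; a minimal sketch reusing the "content" field from the earlier examples:

// "leaning tower" as a sloppy phrase: positions may differ by up to 2 moves.
PhraseQuery towerQuery = new PhraseQuery();
towerQuery.add(new Term("content", "leaning"));
towerQuery.add(new Term("content", "tower"));
towerQuery.setSlop(2);  // exact matches still score highest; sloppier matches score lower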

Thanks!
And there's lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.

Finally, search for this talk on Technorati.
