
SURVEY OF OPEN SOURCE FULL TEXT SEARCH SOLUTIONS

Curtis Spencer, curtis@cruxlux.com


SEARCH WITHOUT FULL TEXT
• SQL “LIKE '%product%'”
– Easy to set up, but…
– SQL statements get too complex (giant OR clauses and (?,?,?,?,?,?,?,?,?,?) placeholder lists)
– Indexes on many columns become unwieldy and slow down inserts
– Limited to prefixes of bigger text columns
– Separation of Power: Join Index vs. Full Text
• Outsource to Google
– Hosted Solution
– Can only reach data that you actually render to HTML
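To make the LIKE approach above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `products` table, its columns, and the `like_search` helper are hypothetical and exist only for illustration. The point is how the WHERE clause grows by one OR branch per searched column, and that results come back unranked.

```python
import sqlite3

def like_search(conn, term, columns):
    """Search every listed column with LIKE '%term%' (hypothetical helper).

    Column names are interpolated into the SQL, so they must come from
    trusted code, never from user input.
    """
    # One "column LIKE ?" branch per column, glued together with OR.
    where = " OR ".join(f"{col} LIKE ?" for col in columns)
    sql = f"SELECT id, name FROM products WHERE {where} ORDER BY id"
    params = [f"%{term}%"] * len(columns)
    return sql, conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, name TEXT, description TEXT, tags TEXT)"
)
conn.executemany(
    "INSERT INTO products (name, description, tags) VALUES (?, ?, ?)",
    [
        ("Poker chips", "Clay chips for the card table", "games"),
        ("Baseball glove", "Leather glove, a great product", "sports"),
        ("Desk lamp", "LED lamp", "office product"),
    ],
)

sql, rows = like_search(conn, "product", ["name", "description", "tags"])
print(sql)   # the WHERE clause has three LIKE branches for three columns
print(rows)  # rows are matched, not ranked: there is no notion of relevance
```

A leading wildcard such as '%product%' generally prevents an ordinary B-tree index from being used, so each added column tends to mean more scanning as well as more SQL.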
FULL TEXT GOAL
• Return matches by relevance rather than pure equality value match
• Precision vs. Recall
– Precision: Are the results accurate?
– Recall: Did we get all the results we expected?
• Natural Language Search
– Queries such as “What is the fastest animal?”
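Precision and recall as described above can be computed from two sets: the documents a query returned and the documents that were actually relevant. The sketch below uses the standard definitions; the function name and the sample document IDs are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents came back; five documents were actually relevant.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Returning every document drives recall to 1.0 and precision toward zero, which is why the two are usually weighed against each other.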
FULL TEXT IMPLEMENTATION
• Inverted Index Data Structure
– Maps each word to the on-disk locations of the documents that contain it
• Tokenization, Stopwords
– Internationalization Challenges
• Basic Query Languages
– Boolean match, relevance, proximity, etc.
– “World Series +Poker –Baseball”
Based on Apple’s Search Kit implementation
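The pieces listed above (tokenization, stopwords, an inverted index, and a boolean query) fit together roughly as in the following sketch. It is a toy under stated assumptions, not Search Kit's design: postings are in-memory sets of document IDs rather than disk locations, the stopword list is a small made-up sample, and ranking simply counts how many query terms a document contains.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """'+term' is required, '-term' is excluded, bare terms only add score."""
    required, excluded, optional = [], [], []
    for raw in query.split():
        if raw[0] == "+":
            bucket = required
        elif raw[0] in "-\u2013":  # accept a hyphen or an en dash
            bucket = excluded
        else:
            bucket = optional
        bucket.extend(tokenize(raw))

    if required:
        candidates = set.intersection(*(index.get(t, set()) for t in required))
    else:
        candidates = set().union(*(index.get(t, set()) for t in optional))
    for term in excluded:
        candidates = candidates - index.get(term, set())

    def score(doc_id):
        return sum(1 for t in required + optional
                   if doc_id in index.get(t, set()))

    # Highest score first; ties broken by document ID.
    return sorted(candidates, key=lambda d: (-score(d), d))

docs = {
    1: "World Series of Poker final table",
    2: "Baseball World Series game seven",
    3: "Online poker strategy",
    4: "World poker tour and baseball charity game",
}
index = build_index(docs)
print(search(index, "World Series +Poker -Baseball"))  # [1, 3]
```

Document 1 ranks first because it matches all three positive terms; documents 2 and 4 are dropped by the exclusion.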
LANGUAGE STEMMING
• Reduce inflected words to their root
– Increase recall
– Decrease inverted index size
• Internationalization Challenges
– Language detection of the dataset to determine which stemming algorithm to use
– Complexity is proportional to the language's level of morphology
• Porter Stemming Algorithm
– Examples: names -> name, departed -> depart, Mariners -> marin, Marin -> marin
• The Snowball project provides stemming implementations for many languages.
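The effect of stemming on recall and index size can be illustrated with a deliberately naive suffix stripper. This is not the Porter algorithm, which applies several rule phases with conditions on the remaining stem; `toy_stem` is a hypothetical stand-in that handles only a few English suffixes, and it does not reproduce the Mariners -> marin example.

```python
def toy_stem(word):
    """Strip one common English suffix (naive; NOT the Porter algorithm)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

forms = ["depart", "departs", "departed", "departing"]
stems = {toy_stem(w) for w in forms}
print(stems)              # {'depart'}: four surface forms, one index entry
print(toy_stem("names"))  # name
print(toy_stem("Marin"))  # marin
```

Because the query and the indexed text pass through the same function, a search for "departing" can match a document that only contains "departed". For real use, a maintained Porter or Snowball implementation is the safer choice.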
MYSQL FULL TEXT
• Pluses
– Integrated into MySQL
– Easy to use without learning a new library
• Minuses
– Indexes bigger than memory tend to be slow
– Scalability options are limited
– Can slow down insertions, deletions
– CJK (Chinese, Japanese, Korean) support is lacking
SPHINX
• Pluses
– Very Fast
– Supports many data sources
– Retrieval can be integrated into MySQL
– Distributed Searching is a scaling option

• Minuses
– Configuration can be tricky
– Live index updates are accomplished by delta indexing
– Internationalization (besides Russian) is left as an exercise for the reader
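The idea behind delta indexing, mentioned in the minuses above, is to keep a large main index that is rebuilt rarely and a small delta index that absorbs recent documents, with searches consulting both. The sketch below shows only that concept in plain Python. It is not Sphinx's API or configuration format, and the class and method names are hypothetical.

```python
from collections import defaultdict

class DeltaIndexedSearch:
    """Conceptual main + delta index (illustration only, not Sphinx code)."""

    def __init__(self):
        self.main = defaultdict(set)   # large, rebuilt infrequently
        self.delta = defaultdict(set)  # small, holds recent documents

    @staticmethod
    def _add(index, doc_id, text):
        for word in text.lower().split():
            index[word].add(doc_id)

    def rebuild_main(self, docs):
        """Full (slow) rebuild from all documents; the delta starts empty."""
        self.main = defaultdict(set)
        for doc_id, text in docs.items():
            self._add(self.main, doc_id, text)
        self.delta = defaultdict(set)

    def add_document(self, doc_id, text):
        """New documents touch only the small delta index."""
        self._add(self.delta, doc_id, text)

    def merge(self):
        """Fold the delta into the main index, then reset the delta."""
        for word, ids in self.delta.items():
            self.main[word] |= ids
        self.delta = defaultdict(set)

    def search(self, word):
        """Query both indexes and combine the postings."""
        word = word.lower()
        return sorted(self.main.get(word, set()) | self.delta.get(word, set()))

s = DeltaIndexedSearch()
s.rebuild_main({1: "poker night", 2: "baseball game"})
s.add_document(3, "poker tournament")  # visible without a full rebuild
before_merge = s.search("poker")
s.merge()
after_merge = s.search("poker")
print(before_merge, after_merge)  # [1, 3] [1, 3]
```

The trade-off is operational: something has to schedule the merges, and every query pays for looking in two places until the merge happens.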
LUCENE/SOLR
• Pluses
– Java, so easy to integrate into client software as well as web applications
– Stable
– Distributed Searching
– Powerful Query Language
– Extensible API
– Good Internationalization Support
• Minuses
– Java
– Configuration is a pain
WHEN TO USE WHAT
Questions?
THANK YOU!
