Building a Vector Space Search Engine in Perl


By Maciej Ceglowski on February 19, 2003 12:00 AM


Contents
- A Few Words About Vectors
- Getting Down to Business
- Building the Search Engine
- Making It Better
- Further Reading

"Why waste time reinventing the wheel, when you could be reinventing the engine?" - Damian Conway

As a Perl programmer, sooner or later you'll get an opportunity to build a search engine. Like many programming tasks - parsing a date, validating an e-mail address, writing to a temporary file - this turns out to be easy to do, but hard to get right. Most people end up with some kind of reverse index: a data structure that associates words with lists of documents. Onto this, they graft a scheme for ranking the results by relevance. Nearly every search engine in use today - from Google on down - works on the basis of a reverse keyword index.

You can write such a keyword engine in Perl, but as your project grows, you will inevitably find yourself gravitating to some kind of relational database system. Since databases are customized for fast lookup and indexing, it's no surprise that most keyword search engines make heavy use of them. But writing code for them isn't much fun. More to the point, companies like Google and Atomz already offer excellent, free search services for small Web sites. You can get an instant search engine with a customizable interface, and spend no time struggling with Boolean searches, text highlighting, or ranking algorithms. Why bother duplicating all that effort?

As Perl programmers, we know that laziness is a virtue. But we also know that there is more than one way to do things. Despite the ubiquity of reverse-index search, there are many other ways to build a search engine. Most of them originate in the field of information retrieval, where researchers are having all kinds of fun. Unfortunately, finding documentation about these alternatives isn't easy. Most of the material available online is either too technical or too impractical to be of use on real-world data sets. So programmers are left with the false impression that vanilla keyword search is all there is.

In this article, I want to show you how to build and run a search engine using a vector-space model, an alternative to reverse-index lookup that does not require a database, or indeed any file storage at all. Vector-space search engines eliminate many of the disadvantages of keyword search without introducing too many disadvantages of their own. Best of all, you can get one up and running in just a few dozen lines of Perl.

A Few Words About Vectors

Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart. To search the collection, we project a query into the term space and calculate the distance from the query vector to each of the document vectors in turn. Documents that fall within a certain threshold distance get added to our result set. If all this sounds like gobbledygook to you, don't worry - it will become clearer when we write the code.
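If a small, concrete example helps before we get to the real thing, here is a toy sketch in plain Perl (my own illustration; the article's actual implementation comes later): two tiny documents become word-count vectors over a shared vocabulary, and the cosine of the angle between those vectors measures how similar the documents are.

use strict;
use warnings;

# Two documents as word => count hashes.
my %doc_a = ( cats => 1, like => 1, chicken => 1 );
my %doc_b = ( dogs => 1, like => 1, chicken => 1 );

# The union of their vocabularies forms the axes of the term space.
my %vocab = ( %doc_a, %doc_b );
my @axes  = sort keys %vocab;

# Dot product and squared lengths, dimension by dimension.
my ( $dot, $len_a, $len_b ) = ( 0, 0, 0 );
for my $word (@axes) {
    my $count_a = $doc_a{$word} || 0;
    my $count_b = $doc_b{$word} || 0;
    $dot   += $count_a * $count_b;
    $len_a += $count_a**2;
    $len_b += $count_b**2;
}

my $cosine = $dot / ( sqrt($len_a) * sqrt($len_b) );
printf "similarity: %.3f\n", $cosine;    # 0.667: two of three words shared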
The vector-space data model gives us a search engine with several useful features:

- Searches take place in RAM; there is no disk or database access
- Queries can be arbitrarily long
- Users don't have to bother with Boolean logic or regular expressions
- It's trivially easy to do 'find similar' searches on returned documents
- You can set up a 'saved results' basket, and do similarity searches on the documents in it
- You get to talk about 'vector spaces' and impress your friends

Getting Down to Business

The easiest way to understand the vector model is with a concrete example. Let's say you have a collection of 10 documents, and together they contain 50 unique words. You can represent each document as a vector by counting how many times each word appears in the document, and moving that distance along the appropriate axis. So if Document A contains the sentence "cats like chicken", then you find the axis for cats, move along one unit, and then do the same for like and chicken. Since the other 47 words don't appear in your document, you don't move along the corresponding axes at all. Plot this point and draw a line to it from the origin, and you have your document vector. Like any vector, it has a magnitude (determined by how many times each word occurs) and a direction (determined by which words appeared, and their relative abundance). There are two things to notice here:


The first thing to notice is that we throw away all information about word order, so there is no guarantee that the vector will be unique. If we had started with the sentence "chickens like cats" (ignoring plurals for the moment), we would have ended up with an identical document vector, even though the documents are not the same (there's a quick demonstration of this right after the numbered list below). This may seem like a big limitation, but it turns out that word order in natural language contains little information about content - you can infer most of what a document is about by studying the word list. Bad news for English majors, good news for us.

The second thing to notice is that with three non-zero values out of a possible 50, our document vector is sparse. This will hold true for any natural language collection, where a given document will contain only a tiny proportion of all the possible words in the language. This makes it possible to build in-RAM search engines even for large collections, although the details are outside the scope of this article. The point is, you can scale this model up quite far before having to resort to disk access.

To run a vector-space search engine, we need to do the following:

1. Assemble a document collection
2. Create a term space and map the documents into it
3. Map an incoming query into the same term space
4. Compare the query vector to the stored document vectors
5. Return a list of nearest documents, ranked by distance
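Here is the quick demonstration promised above (a toy check in plain Perl, my own illustration rather than the article's code): both word orders collapse to the same bag of words, and therefore to the same document vector.

use strict;
use warnings;

# Reduce a sentence to a canonical word => count signature.
sub bag_of_words {
    my %bag;
    $bag{ lc $_ }++ for split /\s+/, shift;
    return join ',', map { "$_=$bag{$_}" } sort keys %bag;
}

# Same words, different order: the signatures are identical.
print bag_of_words('cats like chicken') eq bag_of_words('chicken like cats')
    ? "same vector\n"
    : "different vectors\n";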

Now let's see how to implement these steps in Perl.

Building the Search Engine

We'll make things easy by starting with a tiny collection of four documents, each just a few words long:

    "The cat in the hat"
    "A cat is a fine pet."
    "Dogs and cats make good pets."
    "I haven't got a hat."

Our first step is to find all the unique words in the document set. The easiest way to do this is to convert each document into a word list, and then combine the lists together into one. Here's one (awful) way to do it:


sub get_words {
    my ( $text ) = @_;
    my %doc_words;
    do { $doc_words{$_}++ }
        for map  { lc }
            map  { /([a-z\-']+)/i }
            split /\s+/s, $text;
    return %doc_words;
}

The subroutine splits a text string on whitespace, takes out all punctuation except hyphens and apostrophes, and converts everything to lower case before returning a hash of words. The curious do statement is just a compact way of tallying words into a lookup hash: %doc_words will end up containing our word list as its keys, and its values will be the number of times each word appeared.
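For example (an illustrative snippet, not from the original article), the first document in our collection comes back like this:

my %words = get_words( "The cat in the hat" );
# %words now holds: ( the => 2, cat => 1, in => 1, hat => 1 )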


If we run this 'parser' on all four documents in turn, then we get a master word list:

    a and cat cats dogs fine good got hat haven't i in is make pet pets the

Notice that many of the words in this list are junk words - pronouns, articles, and other grammatical flotsam that's not useful in a search context. A common procedure in search engine design is to strip out words like these using a stop list. Here's the same subroutine with a rudimentary stop list added in, filtering out the most common words:

our %stop_hash;
our @stop_words = qw/ i in a to the it have haven't was but is be from /;
$stop_hash{$_}++ foreach @stop_words;

December 2001 (7) November 2001 (5) October 2001 (9) September 2001 (7)

sub get_words {

    # Now with stop list action!
    my ( $text ) = @_;
    my %doc_words;
    do { $doc_words{$_}++ }
        for grep { !exists $stop_hash{$_} }
            map  { lc }
            map  { /([a-z\-']+)/i }
            split /\s+/s, $text;
    return %doc_words;
}

A true stop list would be longer, and tailored to fit our document collection. You can find a real stop list in the DATA section of Listing 1, along with a complete implementation of the search engine described here. Note that because of Perl's fast hash lookups, we can have a copious stop list without paying a big price in program speed. And because word frequencies in natural language obey a power-law distribution, getting rid of the most common words removes a disproportionate amount of bulk from our vectors. Here is what our word list looks like after we munge it with the stop list:

    cat cats dogs fine good got hat make pet pets

We've narrowed the list down considerably, which is good. But notice that our list contains a couple of variants ("cat" and "cats", "pet" and "pets") that differ only in number. Also note that someone who searches on 'dog' in our collection won't get any matches, even though 'dogs' in the plural form is a valid hit. That's bad.

This is a common problem in search engine design, so of course there is a module on CPAN to solve it. The bit of code we need is called a stemmer: a set of heuristics for removing suffixes from English words, leaving behind a common root. The stemmer we can get from CPAN uses the Porter stemming algorithm, an imperfect but excellent way of finding word stems in English.

use Lingua::Stem;

We'll wrap the stemmer in our own subroutine, to hide the clunky Lingua::Stem syntax:

sub stem {
    my ( $word ) = @_;
    my $stemref = Lingua::Stem::stem( $word );
    return $stemref->[0];
}

And here's how to fold it into the get_words subroutine:

    do { $doc_words{ stem($_) }++ }
        for grep { !exists $stop_hash{$_} }
            map  { lc }
            map  { /([a-z\-']+)/i }
            split /\s+/s, $text;

Notice that we apply our stop list before we stem the words. Otherwise, a valid word like beings (which stems to be) would be caught by the overzealous stop list. It's easy to make little slips like this in search algorithm design, so be extra careful.
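It's worth a quick look at what the stemmer actually does to our vocabulary (an illustrative snippet using the stem() wrapper above, not from the original article):

print join( ' ', map { stem($_) } qw/ cats dogs pets hats / ), "\n";
# prints: cat dog pet hat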


With the stemmer added in, our word list now looks like this:

    cat dog fine good got hat make pet

Much better! We have halved the size of our original list, while preserving all of the important content. Now that we have a complete list of content words, we're ready for the second step: mapping our documents into the term space. Because our collection has a vocabulary of eight content words, each of our documents will map onto an eight-dimensional vector. Here is one example:

# "A cat is a fine pet"
$vec = [ 1, 0, 1, 0, 0, 0, 0, 1 ];

The sentence "A cat is a fine pet" contains three content words. Looking at our sorted list of words, we find cat, fine,
and pet at positions one, three, and eight respectively, so we create an anonymous array and put ones at those positions, with zeroes everywhere else. If we wanted to go in the opposite direction, then we could take the vector and look up the non-zero values at the appropriate positions in a sorted word list, getting back the content words in the document (but no information about word order). The problem with using Perl arrays here is that they won't scale. Perl arrays eat lots of memory, and there are no native functions for comparing arrays to one another. We would have to loop through our arrays, which is slow. A better way to do it is to use the PDL module, a set of compiled C extensions to Perl made especially for matrix algebra. You can find it on CPAN. PDL stands for "Perl Data Language", and it is a powerful language indeed, optimized for doing math operations on enormous multidimensional matrices. All we'll be using is a tiny slice of its functionality - the equivalent of driving our Ferrari to the mailbox. It turns out that a PDL vector (or "piddle") looks similar to our anonymous array:

use PDL;
my $vec = piddle [ 1, 0, 1, 0, 0, 0, 0, 1 ];

> print $vec
[1 0 1 0 0 0 0 1]

We give the piddle constructor the same anonymous array as an argument, and it converts it to a smaller data structure, requiring less storage. Since we already know that most of the values in each document vector will be zero (remember how sparse natural language is), passing full-length arrays to the piddle constructor might get a little cumbersome. So instead we'll use a shortcut to create a zero-filled piddle, and then set the non-zero values explicitly. For this we have the zeroes function, which takes the size of the vector as its argument:

my $num_words = 8;
my $vec = zeroes $num_words;

> print $vec
[0 0 0 0 0 0 0 0]

To set one of the zero values to something else, we'll have to use some obscure PDL syntax:

my $value  = 3;
my $offset = 4;

index( $vec, $offset ) .= $value;

> print $vec
[0 0 0 0 3 0 0 0]

Here we've said "take this vector, and set the value at position 4 to 3". This turns out to be all we need to create a document vector. Now we just have to loop through each document's word list and set the appropriate values in the corresponding vector. Here's a subroutine that does the whole thing:

# $word_count is the total number of words in the collection
# %index is a lookup hash of word => position in the master list

sub make_vector {
    my ( $doc ) = @_;
    my %words  = get_words( $doc );
    my $vector = zeroes $word_count;

    foreach my $w ( keys %words ) {
        my $value  = $words{$w};
        my $offset = $index{$w};
        index( $vector, $offset ) .= $value;
    }
    return $vector;
}
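As a quick sanity check (an illustrative snippet, not in the original article; it assumes $word_count is 8 and %index maps our eight stems, in sorted order, to offsets 0 through 7), the example document from earlier should come back out as the vector we built by hand:

print make_vector( "A cat is a fine pet." ), "\n";
# [1 0 1 0 0 0 0 1]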


Now that we can generate a vector for each document in our collection, as well as turn an incoming query into a query vector (by feeding the query to the make_vector subroutine), all we're missing is a way to calculate the distance between vectors. There are many ways to do this. One of the simplest (and most intuitive) is the cosine measure. Our intuition is that document vectors with many words in common will point in roughly the same direction, so the angle between two document vectors is a good measure of their similarity. Taking the cosine of that angle gives us a value from zero to one, which is handy. Documents with no words in common will have a cosine of zero; documents that are identical will have a cosine of one. Partial matches will have an intermediate value - the closer that value is to one, the more relevant the document. The formula for calculating the cosine is this:

    cos = ( V1 * V2 ) / ( ||V1|| x ||V2|| )

where V1 and V2 are our vectors, the vertical bars indicate the 2-norm, and the * indicates the inner product. You can take the math on faith, or look it up in any book on linear algebra. With PDL, we can express that relation easily:

sub cosine {
    my ( $vec1, $vec2 ) = @_;
    my $n1  = norm $vec1;
    my $n2  = norm $vec2;
    my $cos = inner( $n1, $n2 );   # inner product
    return $cos->sclr();           # converts PDL object to a Perl scalar
}

We can normalize the vectors to unit length with the norm function, because we're not interested in their absolute magnitudes, only the angle between them. Now that we have a way of computing the distance between vectors, we're almost ready to run our search engine. The last piece of the puzzle is a subroutine that takes a query vector and compares it against all of the document vectors in turn, returning a ranked list of matches:

sub get_cosines {
    my ( $query_vec ) = @_;
    my %cosines;
    while ( my ( $id, $vec ) = each %document_vectors ) {
        my $cosine = cosine( $vec, $query_vec );
        next unless $cosine > $threshold;
        $cosines{$id} = $cosine;
    }
    return %cosines;
}

This gives us back a hash with document IDs as its keys and cosines as its values. We'll call this subroutine from a search subroutine that will be our module's interface with the outside world:

sub search {
    my ( $query ) = @_;
    my $query_vec = make_vector( $query );
    my %results   = get_cosines( $query_vec );
    return %results;
}

All that remains is to sort the results by cosine (in descending order), format them, and display them to the user. You can find an object-oriented implementation of this code in Listing 1, complete with a built-in stop list and some small changes to make the search go faster (for the curious: we normalize the document vectors before storing them, to save having to do it every time we run the cosine subroutine).
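That speedup is worth a closer look. Here is a minimal sketch of the idea (my own illustration, reusing the %document_vectors hash and the PDL functions from above): normalize each stored vector once, at index time, so that at query time the cosine reduces to a bare inner product.

# Normalize every stored document vector once, up front...
$document_vectors{$_} = norm $document_vectors{$_}
    for keys %document_vectors;

# ...so the per-document cosine no longer needs a norm call.
sub cosine_fast {
    my ( $norm_doc_vec, $query_vec ) = @_;
    return inner( $norm_doc_vec, norm $query_vec )->sclr();
}

(In practice you would also normalize the query vector once, outside the comparison loop, rather than once per document as shown here.)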

Once we've actually written the code, using it is straightforward:

use Search::VectorSpace;

my @docs = get_documents_from_somewhere();

my $engine = Search::VectorSpace->new( docs => \@docs );

$engine->build_index();
$engine->set_threshold( 0.8 );

while ( my $query = <> ) {
    my %results = $engine->search( $query );
    foreach my $result ( sort { $results{$b} <=> $results{$a} }
                         keys %results ) {
        print "Relevance: ", $results{$result}, "\n";
        print $result, "\n\n";
    }

p r i n t" N e x tq u e r y ? \ n " ; } And there we have it, an instant search engine, all in Perl. Making It Better There are all kinds of ways to improve on this basic model. Here are a few ideas to consider:


Making It Better

There are all kinds of ways to improve on this basic model. Here are a few ideas to consider:

Better Parsing
Our get_words subroutine is rudimentary - the code equivalent of a sharpened stick. For one thing, it will completely fail on text containing hyperlinks, acronyms, or XML. It also won't recognize proper names or terms that contain more than one word (like "Commonwealth of Independent States"). You can make the parser smarter by stripping out HTML and other markup with a module like HTML::TokeParser, and by building in a part-of-speech tagger to find proper names and noun phrases (look for our own Lingua::Tagger::En, coming soon on CPAN).

Non-English Collections
Perl has great Unicode support, and the vector model doesn't care about language, so why limit ourselves to English? As long as you can write a parser, you can adapt the search to work with any language. Most likely you will need a special stemming algorithm. This can be as easy as pie for some languages (Chinese, Italian) and really hard for others (Arabic, Russian, Hungarian); it depends entirely on the morphology of the language. You can find published stemming algorithms online for several Western European languages, including German, Spanish, and French.

Similarity Search
It's easy to add a "find similar" feature to your search engine. Just use an existing document vector as your query, and everything else falls into place. If you want to do a similarity search on multiple documents, add their vectors together.

Term Weighting
Term weighting is a fancy way of saying "some words are more important than others". Done properly, it can greatly improve search results. You calculate weights when building document vectors. Local weighting assigns values to words based on how many times they appear in a single document, while global weighting assigns values based on word frequency across the entire collection. The intuition is that rare words are more interesting than common words (global weighting), and that words that appear once in a document are not as relevant as words that occur multiple times (local weighting). A sketch of a simple global weighting scheme follows this list.

Incorporating Metadata
If your documents have metadata descriptors (dates, categories, author names), then you can build those into the vector model. Just add a slot for each category, as you did for your keywords, and apply whatever kind of weighting you desire.

Exact Phrase Matching
You can add arbitrary constraints by applying a chain of filters to your result set. An easy way to do exact phrase matching is to loop through your results with a regular expression. This kind of post-processing is also a convenient way to sort your results by something other than relevance.
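Here is the promised sketch of the global-weighting intuition (my own illustration, not code from Listing 1): an IDF-style weight that grows as a word becomes rarer in the collection. At vector-building time, each word count would be multiplied by its word's global weight before storage.

# %doc_freq maps each word to the number of documents containing it;
# $num_docs is the size of the collection. Both are assumed to be
# gathered at index time.
sub global_weights {
    my ( $num_docs, $doc_freq ) = @_;
    my %weight;
    for my $word ( keys %$doc_freq ) {
        # A word in every document gets weight log(1) = 0; a word in
        # one document out of a thousand gets about 6.9.
        $weight{$word} = log( $num_docs / $doc_freq->{$word} );
    }
    return %weight;
}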
Further Reading

There's a vast body of material available on search engine design, but little of it is targeted at the hobbyist or beginner. The following are good places to start:

http://hotwired.lycos.com/webmonkey/code/97/16/index2a.html?collection=perl
This Webmonkey article dates back to 1997, but it's still the best tutorial on writing a reverse index search engine in Perl.

http://moskalyuk.com/software/perl/search/kiss.htm
An example of a simple keyword search engine - no database, just a deep faith in Perl's ability to parse files quickly.

http://www.movabletype.org/download.shtml
Movable Type is a popular weblog application, written in Perl. The search engine lives in MT::App::Search, and supports several advanced features. It's not pretty, but it's real production code.

http://jakarta.apache.org/lucene/docs/index.html
Lucene is a Java-based keyword search engine, part of the Apache project. It's a well-designed, open-source search engine, intended for larger projects. The documentation discusses some of the challenges of implementing a large search engine; it's worth reading even if you don't know Java.

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=3391
Foundations of Statistical Natural Language Processing, MIT Press (hardcover textbook). Don't be put off by the title: this book is a fantastic introduction to all kinds of issues in text search, and includes a thorough discussion of vector space models.

http://www.nitle.org/lsi/intro/
An introduction to latent semantic indexing, the vector model on steroids. The document is aimed at nontechnical readers, but gives some more background information on using vector techniques to search and visualize data collections. The adventurous can also download some Perl code for latent semantic indexing at http://www.nitle.org/lsi.php. Both the code and the article come from my own work for the National Institute for Technology and Liberal Education.

Tags: search engine vector space
