Lucene Sail

The LuceneSail is an [http://lucene.apache.org/java/ Apache Lucene]-based, fulltext-enabled RDF storage layer above existing storage.
It is based on the Sesame2 platform but can, in theory, be used on any RDF store, because Sesame's stacked architecture allows this. Currently, [LuceneSailFlavors there are three flavors] of the LuceneSail.

The paper:
 * [http://nepomuk.semanticdesktop.org/xwiki/bin/download/Main1/Publications/Minack%202008.pdf The Sesame LuceneSail. RDF Queries with Full-Text Search] Nepomuk Technical Report, 2008

= Contact =
 * Chris Fluit - developer
 * LeoSauermann - developing
 * Enrico Minack - developing
 * Gunnar Grimnes - developing
 * Alex Vigdor - developing

There is no "lead" developer; individual developers may or may not update the code and care for patches. For issues, contacting everyone is recommended.

= Status =
This is stable enough for us.
 * it is in extensive use by NEPOMUK developers and it works fine
 * it slows down changes to the database compared to a store without the LuceneSail

The version used for NEPOMUK can be browsed here:
 * [http://repo.aduna-software.org/svn/org.openrdf/sesame-ext/lucenesail/trunk/] publicly retrievable
 * [http://repo.aduna-software.org/viewvc/org.openrdf/sesame-ext/lucenesail/trunk/] viewvc
 * [https://repo.aduna-software.org/svn/org.openrdf/sesame-ext/lucenesail/trunk/] developers (https!)

= Installation =
LuceneSail is part of RdfRepository in NEPOMUK. It can also be used in a sail stack inside a normal openrdf/sesame installation.

See http://www.openrdf.org/forum/mvnforum/viewthread?thread=1528 for a discussion about the factory and configuration.

= Support =
NEPOMUK does not offer free support for the LuceneSail. You can ask [http://www.aduna-software.com Aduna] or [mailto:leo.sauermann@dfki.de DFKI] for commercial support, or try the [http://www.openrdf.org/forum/mvnforum/listthreads?forum=15 sesame forum].

= Query language =
== Example ==
Search for any resource that has an rdfs:label value containing the string "person":
{{{
PREFIX search: <http://www.openrdf.org/contrib/lucenesail#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x ?score ?snippet WHERE {
  ?x search:matches ?match .
  ?match search:query "person" ;
         search:property rdfs:label ;
         search:score ?score ;
         search:snippet ?snippet .
}
}}}
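
When the search term comes from user input, the query pattern in the example can be assembled programmatically. The following is a minimal sketch, not part of the LuceneSail code: the function names `build_search_query` and `escape_literal` are hypothetical, and only the `search:` vocabulary and the query shape are taken from the example. It assumes that leaving out the `search:property` triple is acceptable to the sail, and it only produces the query text; sending it to a repository is left to whatever client you use.

```python
# Hypothetical helpers for illustration; not part of the LuceneSail API.
SEARCH_NS = "http://www.openrdf.org/contrib/lucenesail#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal.

    Other special characters (e.g. newlines) are not handled in this sketch.
    """
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_search_query(term, property_iri=None):
    """Return a SPARQL query string using the search: vocabulary.

    term         -- the Lucene query string, e.g. "person"
    property_iri -- optional full IRI of the property to restrict the search to
    """
    lines = [
        f"PREFIX search: <{SEARCH_NS}>",
        "SELECT ?x ?score ?snippet WHERE {",
        "  ?x search:matches ?match .",
        f'  ?match search:query "{escape_literal(term)}" ;',
    ]
    if property_iri is not None:
        # Restrict the search to a single property.
        lines.append(f"    search:property <{property_iri}> ;")
    lines.append("    search:score ?score ;")
    lines.append("    search:snippet ?snippet .")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_search_query("person", RDFS_NS + "label"))
```

Using the full property IRI in angle brackets avoids having to declare an extra prefix for every property a caller might pass in.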
== Details ==
The query is expressed as a virtual resource with virtual properties, connected to the resource to find by a virtual property. If this is too much "virtual" to understand, read [http://jena.sourceforge.net/ARQ/extension.html#propertyFunctions on here].

The parameters are:
 * search:matches - connects the resource to be found with the query. subject = resource to be found, object = formulated query
 * search:query - the Lucene fulltext query string of the query
 * search:property - [optional] '''restrict''' the search to only this '''property'''. If omitted, all literal properties will be searched
 * search:score - [optional] bind the '''score''' for an individual returned hit to this variable (must be a variable)
 * search:snippet - [optional] bind a '''highlighted snippet''' for each hit to this variable (must be a variable)

The '''query''' part can be any Lucene '''term''' expression; you can use the documented [http://lucene.apache.org/java/docs/queryparsersyntax.html#Term%20Modifiers Lucene Term Modifiers] in your query. In short, those are:
 * ?,* - wildcards
 * ~ - fuzzy match
 * see more at [http://lucene.apache.org/java/docs/queryparsersyntax.html#Term%20Modifiers Lucene Term Modifiers]

Highlighting is when "... you get a small excerpt of the document, with the '''key words highlighted''' so that you can spot the context where the '''word''' appeared....". In this implementation, the result uses HTML's <b></b> markers around the highlighted word.

= Details: what is stored, how =
The LuceneSail stores the fulltext of all literal values stored into the RdfRepository. The sail is part of the sail stack of the RdfRepository; it is triggered before inference (inferred triples won't be indexed, to optimize storage).

When resources extracted by the DataWrapper are stored (crawled resources from datasources), the fulltext of the resource is also stored into the RDF repository and therefore the LuceneSail.
The fulltext is stored as plaintext without markup (formatting); alphanumerical characters and punctuation are indexed. The conversion to plaintext is done by the DataWrapper, not by the LuceneSail or the RdfRepository.

Inside the LuceneSail, the fulltext is stored as Lucene documents. A Lucene Document consists of key-value pairs, allowing you to store and search on many different properties. Each Lucene Document represents one RDF resource. There is a special field "uri" marking the URI of the resource. Triples are then stored by using the predicate URI as the field name and the object literal value as the field value. Another field "context" is used to capture all context(s) that contributed to a Lucene Document (here the word context means "the fourth column in a triplestore making it a quadstore", do not mix it up with other meanings of the word context). Usually a resource is defined in one context, but there can be multiple.

All fields in Lucene are stored as "STORED" fields (in comparison to "INDEXED"). There are two reasons for this:
 * The index has to be updateable. Changes to the properties of a resource result in the Lucene Document being re-created and the existing document replaced. This is due to the architecture of Lucene, which does not allow "editing" stored documents.
 * Result highlighting (available as a Lucene Contribution) requires the fields to be stored. Highlighting is when "... you get a small excerpt of the document, with the '''key words highlighted''' so that you can spot the context where the '''word''' appeared...."

Storing the fulltext both in the RdfRepository and in the Lucene index '''doubles the needed disk space''' but allows a quick update of the Lucene storage. This could be optimized by storing the fulltext only in the RdfRepository, but this needs a tighter integration with the underlying sails (it would be good to integrate the fulltext storage right into the storage mechanisms of the NativeSail; then the read operations needed to update the Lucene index could access the data directly). If you are interested in implementing this optimization, contact the developers (see the Contact section above).

The LuceneSail fulltext index is activated for the '''main repository only'''. For extra repositories, like the config repository, we have not added fulltext search support. This can be changed, so that you can pass options when creating repositories (do you want to code this configuration? contact LeoSauermann).

= Re-Indexing =
You will notice that the '''fulltext index may get corrupted''' after a few weeks of usage; this can happen when you don't shut down the system gracefully, or because of bugs. We have anticipated this: there is a "reindexing option". Go to [http://localhost:8181/org.semanticdesktop.services.rdfrepository/repositoryactions.jsp?repositoryid=main your debug RDFRepository] page and press the '''re-index this repository''' button; it's at the bottom.

= Dependencies =
You may notice that the Lucene OSGi jars in the Eclipse Target Platform are nowhere to be downloaded from, they just appeared.
 * source:trunk/java/EclipseTargetPlatform2.0/server/eclipse/plugins - here are our Lucene Jars

They were created in 5 minutes by Leo, by taking the release and fumbling with the manifests.
If you want to update them, do the same: download a release of Lucene (best done using Maven), use the Eclipse option "create new plugin from JAR" and use the manifests in the existing jars as a basis.

= Development =
We use Aduna's code repository for development, so the results will outlive the NEPOMUK project. You can read the [https://wiki.aduna-software.org/confluence/display/SES/Developing+Sesame Documentation on developing sesame], which also applies here.
 * Check out the code from svn
   * [http://repo.aduna-software.org/svn/org.openrdf/sesame-ext/lucenesail/trunk]
 * get Maven 2
 * try to compile everything using Maven; this makes sure all dependencies are downloaded (then you don't need the Eclipse Maven plug-in mentioned below)
   * mvn compile
 * Then let Maven create the Eclipse project file
   * mvn eclipse:eclipse
 * open the project in Eclipse.

For a much tighter integration of Maven into Eclipse, see [https://wiki.aduna-software.org/confluence/display/DEV/Maven Aduna Maven wiki].

Note: if you want to commit code to the LuceneSail SVN repository, you need an Aduna account and you have to use the https URL of the SVN repository.

To '''test''' if you did everything right, go to the command line inside the lucene folder and run this; it will compile the project, run the tests, and print some results:
{{{
mvn test
}}}
To '''build a release''' you do
{{{
mvn jar:jar
}}}
Since you have set up Eclipse already, you can also open the src/test/java folder of the LuceneSail project, right-click on org.openrdf.sail.lucene.TestAll.java and select Run As -> JUnit Test.

If you have problems:
 * Some Aduna projects require tweaking the Maven configuration, see [https://wiki.aduna-software.org/confluence/display/DEV/Maven+repository+bootstrap+instructions Maven Repository Bootstrap Instructions]

= External References =
The paper:
 * [http://nepomuk.semanticdesktop.org/xwiki/bin/download/Main1/Publications/Minack%202008.pdf The Sesame LuceneSail. RDF Queries with Full-Text Search] Nepomuk Technical Report, 2008
