IR - Themis Framework

Information Retrieval Assignment
Name: Roska Danang Jaya
Student ID (NIM): 113070250

Themis - Information Retrieval Framework

Themis is an Information Retrieval (IR) framework for comparing natural language documents. It implements theoretical retrieval models (currently the Vector Space Model (VSM) and the enhanced Topic-based Vector Space Model (eTVSM)) and common algorithms used in the IR domain (currently the Porter stemmer). These algorithms can be reused when implementing or configuring IR models.

Themis also supports the evaluation of IR models: managing IR test collections, conducting evaluations, collecting performance measurements, and performing statistical tests (an initial version is in development).

Themis is implemented as PostgreSQL schemas together with PL/pgSQL procedures that expose its functionality. The Themis API (currently Java) is designed to access this functionality.

VSM Themis Java API tutorial

This tutorial shows how to write a Java program that uses the Themis Java API to connect to the Themis backend and perform VSM document similarity judgments:

    import java.sql.SQLException;
    import java.util.Iterator;
    import java.util.List;

    import org.themis.ir.Document;
    import org.themis.ir.vsm.VSM;
    import org.themis.util.LETTERCASE;
    import org.themis.util.PREPROCESS;
    import org.themis.util.STEMMER;

    public class TestVSM {
        public static void main(String[] args) throws SQLException, ClassNotFoundException {
            VSM vsm = new VSM("localhost", "themis", "postgres", "postgres");
            vsm.clear();

            vsm.setParameter(PREPROCESS.LETTERCASE.toString(), LETTERCASE.UPPER.toString());
            vsm.setParameter(PREPROCESS.STEMMER.toString(), STEMMER.PORTER.toString());

            vsm.addStopword("a");
            vsm.addStopword("the");

            vsm.addDocument("URL1", "a red car", false);
            vsm.addDocument("URL2", "the red auto", false);

            List<Document> res = vsm.searchFull("red car", 0, 10);

            Iterator<Document> i = res.iterator();
            while (i.hasNext()) {
                Document doc = i.next();
                System.out.println(String.format("%1s \t | \t %1.4f \t | \t %3s",
                        doc.getId(), doc.getSimilarity(), doc.getContent()));
            }
        }
    }

Let us walk through the code extract step by step:

1. First we create a VSM object, vsm, that connects to the Themis backend (we connect to the database "themis" at "localhost"; the username and password are both "postgres"):

    VSM vsm = new VSM("localhost","themis","postgres","postgres");

2. Then we clear the VSM IR model (remove all VSM document models). Note that the VSM model configuration stays unchanged:

    vsm.clear();

3. Configure the VSM model so that all words in the documents are treated as uppercase (letter casing then does not influence similarity judgments) and so that the Porter stemmer is used when deriving VSM document similarity judgments:

    vsm.setParameter(PREPROCESS.LETTERCASE.toString(), LETTERCASE.UPPER.toString());
    vsm.setParameter(PREPROCESS.STEMMER.toString(), STEMMER.PORTER.toString());

4. Add stopwords to use with the VSM model:

    vsm.addStopword("a");
    vsm.addStopword("the");

5. Build VSM document models for the documents "a red car" and "the red auto". Each document is given a URL so that a reference to the original document is kept. The false flag indicates that the document is not a query and is intended for later retrieval:

    vsm.addDocument("URL1", "a red car", false);
    vsm.addDocument("URL2", "the red auto", false);

6. Perform the VSM document search for the query "red car". The second and third arguments specify the interval of retrieved documents (start from document 0, the most relevant one, and retrieve 10 documents):

    List<Document> res = vsm.searchFull("red car", 0, 10);

7. The last chunk of code prints the retrieved results to the console. If you did everything right, you should see the output:

    1 | 1.0 | a red car
    2 | 0.5 | the red auto

The output means that the document "a red car" was retrieved as the most relevant document to the search query with the VSM similarity level of 1.0. The document "the red auto" is the second most relevant document with the similarity level of 0.5.
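
The similarity levels of 1.0 and 0.5 can be reproduced by hand. The following is a minimal sketch in plain Java that does not use Themis at all. It assumes that the similarity is the cosine between raw term-count vectors, that words are uppercased, and that the stopwords "a" and "the" are dropped. Stemming is left out of the sketch. The class VsmSketch and its methods termCounts and cosine are hypothetical names chosen for illustration; how Themis actually weights terms inside its PL/pgSQL procedures is not shown in this tutorial.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: not part of the Themis API.
public class VsmSketch {
    // Stopwords from the tutorial, stored in uppercase to match the preprocessing.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("A", "THE"));

    // Uppercase the text, split on whitespace, drop stopwords, count each term.
    // Assumption: raw term counts are used as vector components (no stemming, no idf).
    public static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toUpperCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity: dot(a, b) / sqrt(|a|^2 * |b|^2); 0 if either vector is empty.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double squares = squaredNorm(a) * squaredNorm(b);
        return squares == 0.0 ? 0.0 : dot / Math.sqrt(squares);
    }

    // Sum of squared components of a term-count vector.
    static double squaredNorm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int c : v.values()) {
            sum += c * c;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termCounts("red car");
        // Prints 1.0000: the query and "a red car" share both remaining terms.
        System.out.printf(Locale.ROOT, "%.4f%n", cosine(query, termCounts("a red car")));
        // Prints 0.5000: only RED is shared, so 1 / (sqrt(2) * sqrt(2)).
        System.out.printf(Locale.ROOT, "%.4f%n", cosine(query, termCounts("the red auto")));
    }
}
```

Under these assumptions, "a red car" reduces to the same two terms as the query, giving a similarity of 1.0, while "the red auto" shares only one of its two terms with the query, giving 0.5.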
Source : Themis
