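import java.sql.SQLException;
import java.util.Iterator;
import java.util.List;
// Imports for the Themis classes (VSM, Document, PREPROCESS, LETTERCASE, STEMMER)
// are also required; their package names are not shown in this extract.

public class TestVSM {
    public static void main(String[] args) throws SQLException, ClassNotFoundException {
        // Connect to the Themis backend: host, database, username, password.
        VSM vsm = new VSM("localhost", "themis", "postgres", "postgres");

        // Remove all stored document models; the configuration is kept.
        vsm.clear();

        // Configure preprocessing: uppercase all words and use the Porter stemmer.
        vsm.setParameter(PREPROCESS.LETTERCASE.toString(), LETTERCASE.UPPER.toString());
        vsm.setParameter(PREPROCESS.STEMMER.toString(), STEMMER.PORTER.toString());

        // Register stopwords.
        vsm.addStopword("a");
        vsm.addStopword("the");

        // Index two documents; false marks them as documents, not queries.
        vsm.addDocument("URL1", "a red car", false);
        vsm.addDocument("URL2", "the red auto", false);

        // Retrieve up to 10 documents, starting from the most relevant one (index 0).
        List<Document> res = vsm.searchFull("red car", 0, 10);

        // Print ID, similarity and content of each result.
        Iterator<Document> i = res.iterator();
        while (i.hasNext()) {
            Document doc = i.next();
            System.out.println(String.format("%1s \t | \t %1.4f \t | \t %3s",
                    doc.getId(), doc.getSimilarity(), doc.getContent()));
        }
    }
}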
Let us go through the code extract line by line:

1. First we create a VSM object, vsm, that connects to the Themis backend (we connect to the database "themis" at "localhost"; the username and password are both "postgres"):
    VSM vsm = new VSM("localhost","themis","postgres","postgres");

2. Then we clear the VSM IR model (remove all VSM document models). Note that the VSM model configuration stays unchanged:
    vsm.clear();

3. Configure the VSM model: treat all words in the documents as uppercase (so letter casing does not influence similarity judgments) and use the Porter stemmer when deriving VSM document similarity judgments:
    vsm.setParameter(PREPROCESS.LETTERCASE.toString(), LETTERCASE.UPPER.toString());
    vsm.setParameter(PREPROCESS.STEMMER.toString(), STEMMER.PORTER.toString());

4. Add the stopwords to use with the VSM model:
    vsm.addStopword("a");
    vsm.addStopword("the");

5. Build VSM document models for the documents "a red car" and "the red auto". Each document is given a URL that serves as a reference to the original document. The false flag indicates that the document is not a query and is intended for later retrieval:
    vsm.addDocument("URL1", "a red car", false);
    vsm.addDocument("URL2", "the red auto", false);

6. Perform the VSM document search for the query "red car". The second and third arguments specify the interval of retrieved documents (start from document 0, the most relevant one, and retrieve 10 documents):
    List<Document> res = vsm.searchFull("red car", 0, 10);

7. The last chunk of code prints the retrieved results to the console. If you did everything right, you should see the output:
    1 | 1.0 | a red car
    2 | 0.5 | the red auto
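To make the effect of the letter-case and stopword settings in steps 3 and 4 more concrete, here is a minimal sketch in plain Java. It is not Themis code: the class PreprocessSketch and the method preprocess are hypothetical names, the whitespace tokenization is an assumption, and stemming is left out to keep the example short. It only illustrates the idea that both documents are reduced to their content-bearing terms before any similarity is computed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not part of the Themis API.
class PreprocessSketch {

    // Splits the text on whitespace, upper-cases each token and drops stopwords.
    // The stopword set is expected to contain upper-case entries.
    // Stemming is deliberately omitted from this sketch.
    static List<String> preprocess(String text, Set<String> stopwords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for blank input
            }
            String term = token.toUpperCase(Locale.ROOT);
            if (!stopwords.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stopwords = Set.of("A", "THE");
        System.out.println(preprocess("a red car", stopwords));    // [RED, CAR]
        System.out.println(preprocess("the red auto", stopwords)); // [RED, AUTO]
    }
}
```

Under these assumptions, "a red car" and the query "red car" end up with the same term list, while "the red auto" shares only one term with the query.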
Each row of the output shows the document ID, its similarity to the query, and the document content. The document "a red car" was retrieved as the most relevant document for the search query, with a VSM similarity of 1.0. The document "the red auto" is the second most relevant document, with a similarity of 0.5.
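The two similarity values can be reproduced by hand with a plain cosine similarity over raw term counts, once stopwords are removed and the words are uppercased. The sketch below shows that calculation. It is an assumption that this matches what happens here: the text does not say which term weighting Themis uses internally, and CosineSketch, termFrequencies and cosine are hypothetical names, not Themis API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; the weighting actually used by Themis is not stated.
class CosineSketch {

    // Counts how often each term occurs (raw term frequency).
    static Map<String, Integer> termFrequencies(List<String> terms) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : terms) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    // Sum of squared counts, i.e. the squared Euclidean length of the vector.
    static double squaredLength(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return sum;
    }

    // Cosine of the angle between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double denominator = Math.sqrt(squaredLength(a) * squaredLength(b));
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    public static void main(String[] args) {
        // Terms as they would look after uppercasing and stopword removal.
        Map<String, Integer> query = termFrequencies(List.of("RED", "CAR"));
        Map<String, Integer> doc1 = termFrequencies(List.of("RED", "CAR"));
        Map<String, Integer> doc2 = termFrequencies(List.of("RED", "AUTO"));

        System.out.println(String.format(Locale.ROOT, "%.4f", cosine(query, doc1))); // 1.0000
        System.out.println(String.format(Locale.ROOT, "%.4f", cosine(query, doc2))); // 0.5000
    }
}
```

For the second document the query and the document share one term out of two each, so the dot product is 1 and both vectors have length sqrt(2), giving 1 / 2 = 0.5.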
Source: Themis