Fatih Korkmaz
Introduction:
Why do we need facets? What are facets? Where do we know them from? What is the outcome of the work: recall, precision, efficiency. Which approach is better, and where do the differences lie?
1. Motivation
Universität Konstanz 17.01.2012
Scenario: looking for a book at a flea market... Because nothing is in order, the search takes very long, and after 30 books you no longer feel like searching.
1. Motivation
FACETED SEARCH
Seminar: Faceted Search Technology
Scenario: library. Because everything is ordered, the search is very fast. Books are sorted by genre, or chronologically and alphabetically.
2. Introduction
Quote: [DakkaIpeirotis2008]
Faceted interfaces represent a new powerful paradigm that proved to be a successful complement to keyword searching.
Predefined facet interfaces:
- eBay
- flight search engines
- (online) rental stores (e.g. video, book, ...)
- online stores
What are facets? Facets provide a new, very successful search paradigm, i.e. facets are searched for and used very often. ---DEMO--- ebay: fussball (football). These are predefined facets: people create the facets and keep updating them.
2. Introduction
Major shortcoming of systems that use multifaceted hierarchies: they require two manual steps:
1. Manually identify the dimensions/facets that can be used to browse a collection
2. Manually create the hierarchies for each dimension
www.ebay.com
How do companies that create facets by hand do it? Video rental store example, divided into 2 steps: 1. Determine the facets (intuitively). 2. Group the facets, e.g. the adults-only (18+) section of the video store.
2. Introduction
Problem
Predefined facets have some critical aspects:
- possible weaknesses
- scope of view
- Everyone has a different view: cheap flights, good hotels, large images (Google) -> recently also "larger than...".
- Wrong category on eBay: a car part for a VW may also fit an Audi. eBay: dummy mobile phones.
- A new item fits no category; I do not know which category it belongs to?! Tablet example.
- No facets exist, and creating them by hand is very laborious. Online platform with recipes.
3. Automatic Extraction
Paper: [DakkaIpeirotisWood2005]
W. Dakka, P. G. Ipeirotis and K. R. Wood, Automatic Construction of Multifaceted Browsing Interfaces, 2005
Paper: [DakkaDayalIpeirotis2006]
W. Dakka, R. Dayal and P. G. Ipeirotis, Automatic Discovery of Useful Facet Terms, 2006
Paper: [DakkaIpeirotis2008]
W. Dakka and P. G. Ipeirotis, Automatic Extraction of Useful Facet Hierarchies from Text Databases, 2008
My paper is from 2008; the preliminary work goes back to 2005. Multifacets become hierarchies. 2006 is interim work, work in progress: a report of about 5 pages on the work. I will not go into detail on the 2005 and 2006 papers, because that would go beyond the scope of this talk.
3. Automatic Extraction
Steps for automatically constructing multifaceted hierarchies:
1. Finding feasible facets in the collection by means of hypernyms and WordNet, building encircling sets
2. Hierarchy construction in an RDBMS
3. Visualizing and laying out the facets by rankings
The algorithm has some limitations:
1. Supervised: facets are limited to the facets that appeared in the training set
2. WordNet has low coverage of named entities
Supervised algorithm, meaning that we first teach our algorithm with training sets. 1. Search for facets in the text with the help of WordNet. 2. Put them into a hierarchy tree. 3. Lay out the facets, e.g. using the frequencies of relations.
3. Automatic Extraction
Improvements over the previous work:
1. Automatic term detection in an unsupervised way from free text
2. Automatic grouping of facets that belong to the same parent facet
3. Expanding to more external resources [work in progress]
Only interim work: they no longer use a training set. 2. Two methods with which they determine the facets. 3. Extension to several external sources; approaches and ideas were mentioned.
3. Automatic Extraction
2005 vs. 2006
2005:
- training set, supervised
- term detection: WordNet
- Frequency-based Ranking, Set-cover Ranking, Merit-based Ranking
- hierarchy construction: Subsumption Hierarchy, Efficient Hierarchy Construction
2006 (research in progress):
- free text, unsupervised
- term detection: WordNet, Wikipedia (named entities), Yahoo! Term Extractor
- Frequency-based Ranking, Rank-based Ranking
3. Automatic Extraction
Dakka, Ipeirotis Automatic Extraction of Useful Facet Hierarchies from Text Databases, 2008
The algorithm for automatic facet discovery is partitioned into 3 steps:
1. Identify the important terms within the document that are useful for characterizing the content
2. Query one or more external sources and get context terms. Add them to the document as a set
3. Analyze the frequency of the terms and identify the candidate facet terms
1. Find important words that characterize the text. 2. Use external sources to find context terms. Context term for Angela Merkel: Bundeskanzlerin (Federal Chancellor). But where do I get this information from? 3. Analyze the words and select the facets: which words are the most informative, i.e. which words represent the most documents.
3. Automatic Extraction
The documents, the original terms and the context terms will be stored in a database
important terms
3. Automatic Extraction
LingPipe
Yahoo! Term Extractor
- Input: a text document
- Output: a list of significant words
- (They have observed empirically that the quality of the returned terms is high.)
- No further information about the algorithm is available.
Wikipedia Terms
- A term in a document matches the title of a Wikipedia entry.
- They exploit the redirect pages to improve the coverage (e.g. "Hillary Clinton" and "Hillary R. Clinton" redirect to "Hillary Rodham Clinton").
1. LingPipe is open source: input is text, output is entities. 2. Yahoo!: no information about the algorithm, but according to tests it works very well. 3. Names are redirected to the right source; otherwise the system recognizes them incorrectly.
3. Automatic Extraction
2. Step: Query external resources to get context terms
Pilot study phenomenon: the terms for the useful facets do not usually appear in the news stories.
[Figure: example terms and their context terms: Steve Jobs, Apple, Jacques René Chirac]
Phenomenon: when I sell my Audi, I do not write "car" in the listing; I write that I am selling my Audi.
3. Automatic Extraction
a) Google
- query Google with a term
- retrieve the most frequent words and phrases as context terms
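The Google step can be sketched roughly as follows. This is a minimal Python sketch, not the authors' implementation: the function name `context_terms`, the stop-word list and the example snippets are assumptions made for illustration, and the snippets stand in for whatever text a search for the term returns. It counts single words only, not phrases.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = frozenset({"the", "of", "a", "and", "is", "in", "was"})

def context_terms(term, snippets, k):
    """Return the k most frequent words in the snippets, excluding the query term itself."""
    term_words = set(term.lower().split())
    counts = Counter()
    for snippet in snippets:
        for word in re.findall(r"[a-z]+", snippet.lower()):
            if word not in STOPWORDS and word not in term_words:
                counts[word] += 1
    return [word for word, _ in counts.most_common(k)]

# Made-up snippets standing in for search results.
snippets = [
    "Jacques Chirac was the President of France",
    "Chirac, President of France and former mayor of Paris",
    "France elected Chirac president in 1995",
]
top_terms = context_terms("Jacques Chirac", snippets, 2)
```

With these snippets the two most frequent remaining words are "president" and "france", which is the kind of context term the slide refers to.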
3. Automatic Extraction
b) WordNet Hypernyms
- hypernyms for each word, excluding named entities
- WordNet yields very good results
- Example: vehicle is the hypernym of car, truck, motorcycle, bus
c) Wikipedia Synonyms
- use the Wikipedia redirect pages to identify variations of the same term
- Example: "Hillary Clinton", "Hillary R. Clinton" and "Clinton, Hillary Rodham" all redirect to "Hillary Rodham Clinton"
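The redirect idea can be illustrated with a small lookup table. This is only a sketch under assumptions: the `redirects` dictionary is a hand-made stand-in for the Wikipedia redirect pages, and the helper names `canonical` and `variations` are hypothetical.

```python
# Hand-made stand-in for Wikipedia redirect pages: variant title -> target title.
redirects = {
    "Hillary Clinton": "Hillary Rodham Clinton",
    "Hillary R. Clinton": "Hillary Rodham Clinton",
    "Clinton, Hillary Rodham": "Hillary Rodham Clinton",
}

def canonical(term, redirects):
    """Follow redirects until a title without a redirect is reached."""
    seen = set()
    while term in redirects and term not in seen:
        seen.add(term)  # guard against redirect loops
        term = redirects[term]
    return term

def variations(term, redirects):
    """All known titles that resolve to the same entry as the given term."""
    target = canonical(term, redirects)
    same = {t for t in redirects if canonical(t, redirects) == target}
    return sorted(same | {target})

all_names = variations("Hillary R. Clinton", redirects)
```

Every variant in `all_names` can then be added to the document as a context term for the same entity.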
3. Automatic Extraction
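d) Wikipedia Graph
Links that appear in Wikipedia entries can offer valuable clues, which are exploited by tf.idf-style scoring. Every entry is a node and every link is an edge.
Association measure between two Wikipedia entries $t_1$ and $t_2$:
$$\mathrm{assoc}(t_1 \rightarrow t_2) = \frac{\log\left(N / \mathrm{in}(t_2)\right)}{\mathrm{out}(t_1)}$$
Here $N$ is the total number of Wikipedia entries, $\mathrm{in}(t_2)$ the number of links pointing to $t_2$, and $\mathrm{out}(t_1)$ the number of links leaving $t_1$.
The result is a top-k set with the highest-scoring terms.
[Figure: Wikipedia link graph around the entry "Hasekura Tsunenaga" with "is_linked to" edges]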
Wikipedia is stored offline on the server; every entry is a node and every link is an edge. N is the total number of Wikipedia entries. With N/in(t2) I see how many incoming edges t2 has. I divide that by the number of outgoing edges of t1, and then I have the association value of the two terms.
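The association score can be tried out on a toy link graph. The sketch below is an illustration under assumptions: the graph is a plain dictionary from entry to the set of entries it links to, the entries besides "Hasekura Tsunenaga" are invented, and the function names are hypothetical.

```python
import math

def assoc(t1, t2, links):
    """Score of the link t1 -> t2: log(N / in(t2)) / out(t1)."""
    n = len(links)  # N: total number of entries
    in_t2 = sum(1 for targets in links.values() if t2 in targets)
    out_t1 = len(links[t1])
    return math.log(n / in_t2) / out_t1

def top_k_context_terms(t1, links, k):
    """The k linked entries with the highest association score."""
    scored = sorted(((assoc(t1, t2, links), t2) for t2 in links[t1]), reverse=True)
    return [t2 for _, t2 in scored[:k]]

# Toy graph (invented): entry -> entries it links to.
links = {
    "Hasekura Tsunenaga": {"Samurai", "Japan"},
    "Samurai": {"Japan"},
    "France": {"Japan"},
    "Japan": set(),
}
best = top_k_context_terms("Hasekura Tsunenaga", links, 1)
```

In this toy graph "Japan" is linked from three entries and "Samurai" from only one, so the rarer target "Samurai" gets the higher score, which is the tf.idf-style effect described on the slide.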
3. Automatic Extraction
terms -> context terms
Jacques Chirac -> President, France, Politician, ...
Hasekura Tsunenaga -> Japanese, Japanese Samurai, ...
Hillary Clinton -> Hillary R. Clinton, United States of America, ...
... -> ...
After both steps, i.e. term extraction and context term search, we have the following picture. Now the important facets have to be filtered out of all the possible ones.
3. Automatic Extraction
Comparative Term Frequency Analysis
The algorithm can be split into two steps:
1. Count the frequency of a facet term in the original and in the expanded database
- Frequency-based Shifting
- Rank-based Shifting
2. Verification that the difference in frequency is statistically significant
- Log-Likelihood Statistic
Because of the phenomenon, we count how often a term occurs in the original text and how often in the expanded data. Log-likelihood: is the computed difference in frequency statistically significant enough?
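The significance check can be sketched with the common log-likelihood ratio for a 2x2 contingency table. The slides do not give the exact formula, so the form below (document frequency of a term in the original versus the expanded database, compared against the pooled expectation) is an assumption, as are the function name and the example counts.

```python
import math

def log_likelihood(df_orig, df_exp, n_orig, n_exp):
    """Log-likelihood ratio (G^2) for a term's document frequency in two collections.

    df_*: number of documents containing the term, n_*: collection sizes.
    Assumed form: 2 * sum(observed * ln(observed / expected)) over the 2x2 table.
    """
    p = (df_orig + df_exp) / (n_orig + n_exp)  # pooled probability of the term
    observed = [df_orig, n_orig - df_orig, df_exp, n_exp - df_exp]
    expected = [n_orig * p, n_orig * (1 - p), n_exp * p, n_exp * (1 - p)]
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Invented counts: a term in 5 of 100 original and 60 of 100 expanded documents.
g2 = log_likelihood(5, 60, 100, 100)
# 3.84 is the chi-square critical value for one degree of freedom at the 5% level.
is_significant = g2 > 3.84
```

A term whose frequency is the same in both databases gets a value of 0, so only terms that really became more frequent after the expansion pass the check.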
3. Automatic Extraction
If both Shift_f(t) and Shift_r(t) are positive, the verification can be done with the log-likelihood statistic.
Drawback of the first measure: terms with a high frequency may be favored (important terms with a low frequency receive less attention). Hence the normalization: every term has a rank, the terms are sorted by rank and divided into bins. B(t) compensates for the favoring by dampening high-frequency terms. Only when both functions are positive is the term verified with the log-likelihood statistic.
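The two shift functions can be sketched as follows. The slides do not show the formulas, so this is an assumption-laden illustration: the frequency shift is taken as the document frequency in the expanded database minus that in the original one, the bin of a term is taken as B(t) = ceil(log2(rank(t))), and the rank shift as the original bin minus the expanded bin. All names and example counts are hypothetical.

```python
import math

def rank_bins(df):
    """Assumed binning: B(t) = ceil(log2(rank)), where rank 1 is the most frequent term."""
    ordered = sorted(df, key=lambda t: (-df[t], t))
    return {t: math.ceil(math.log2(rank)) for rank, t in enumerate(ordered, start=1)}

def candidate_facet_terms(df_orig, df_exp):
    """Terms whose frequency shift and rank shift are both positive."""
    bins_orig, bins_exp = rank_bins(df_orig), rank_bins(df_exp)
    # Assumption: a term missing from the original database gets the bin after the last rank.
    missing_bin = math.ceil(math.log2(len(df_orig) + 1))
    candidates = []
    for term in df_exp:
        shift_f = df_exp[term] - df_orig.get(term, 0)
        shift_r = bins_orig.get(term, missing_bin) - bins_exp[term]
        if shift_f > 0 and shift_r > 0:
            candidates.append(term)
    return sorted(candidates)

# Invented document frequencies before and after adding context terms.
df_orig = {"audi": 5, "sale": 4, "car": 1}
df_exp = {"car": 9, "vehicle": 6, "audi": 5, "sale": 4}
candidates = candidate_facet_terms(df_orig, df_exp)
```

Here "car" and "vehicle" rise in both frequency and rank bin after the expansion, while "audi" and "sale" do not, which mirrors the Audi example from the pilot study.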
3. Automatic Extraction
Definition
A subsumptive hierarchy is a classification of objects from the general to the specific. In fact, a subsumptive hierarchy is an "IS-A hierarchy". It describes the relationship between the levels: a lower-level object "is a" member of the higher class.
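One way to derive such a hierarchy from term co-occurrence is the subsumption test associated with [SandersonCroft1999]. The sketch below is an assumption, not the slides' own formula: term x is taken to subsume term y if x occurs in at least 80% of the documents that contain y, while y does not occur in all documents that contain x. The threshold, the function name and the example documents are illustrative.

```python
def subsumes(x, y, docs, threshold=0.8):
    """Assumed test: x subsumes y if P(x|y) >= threshold and P(y|x) < 1."""
    with_x = {i for i, terms in enumerate(docs) if x in terms}
    with_y = {i for i, terms in enumerate(docs) if y in terms}
    if not with_x or not with_y:
        return False
    both = len(with_x & with_y)
    return both / len(with_y) >= threshold and both / len(with_x) < 1

# Invented documents, each represented by its set of terms.
docs = [
    {"vehicle", "car"},
    {"vehicle", "truck"},
    {"vehicle", "car"},
    {"vehicle"},
]
vehicle_over_car = subsumes("vehicle", "car", docs)
car_over_vehicle = subsumes("car", "vehicle", docs)
```

In this example "vehicle" subsumes "car" but not the other way round, so "car" would be placed below "vehicle" in the hierarchy.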
4. Evaluation
For the evaluation of the techniques they used the following three data sets:
- Single Day of NYT (SNYT): 1,000 news stories from one day in November 2005
- Single Day of Newsblaster (SNB): 17,000 news stories from one day in November 2005 from 24 news sources (test with multiple data sources)
- Month of NYT (MNYT): 30,000 news stories from one month
4. Evaluation
Recall, measured using the Amazon Mechanical Turk service [AMT-Link]: annotators read a story and identify terms that can be used as facets.
For the extraction of named entities, Google and Wikipedia Graph work better than Wikipedia Synonyms and WordNet. In combination, however, they achieve a very high recall.
Explain Amazon Mechanical Turk. Google and Wikipedia are very good at finding terms that do not occur in the texts. These are best suited as facets.
4. Evaluation
Precision, measured using the Amazon Mechanical Turk service [AMT-Link]: annotators examine two tasks:
- Are the facets useful?
- Is the term accurately placed?
The highest-precision hierarchies were generated by WordNet. This is not surprising: hypernyms come from a hierarchy. Google lowers the precision, and Wikipedia gives more precise hierarchies.
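For reference, recall and precision over sets of facet terms can be computed as below. This is a generic sketch, not the paper's evaluation code: the annotator judgments are invented, and treating a term as correct when annotators accept it is a simplifying assumption.

```python
def recall(extracted, annotated):
    """Share of the annotator-identified facet terms that the system also extracted."""
    return len(extracted & annotated) / len(annotated)

def precision(extracted, accepted):
    """Share of the extracted facet terms that annotators accepted."""
    return len(extracted & accepted) / len(extracted)

# Invented example: terms from annotators versus terms from the system.
annotated = {"politics", "france", "president", "election"}
extracted = {"france", "president", "sports"}
r = recall(extracted, annotated)      # 2 of 4 annotated terms were found
p = precision(extracted, annotated)   # 2 of 3 extracted terms were accepted
```

Adding more external resources tends to enlarge `extracted`, which can raise recall while lowering precision, the trade-off described on the two evaluation slides.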
4. Evaluation
The efficiency of this approach is affected by the Yahoo! Term Extractor and by Google as an external resource.
- term extraction (Yahoo!): 2-3 seconds per document
- term extraction (without): ~100 documents per second
- doc. expansion (Google): 1 second per document
- doc. expansion (without): >100 documents per second
- facet term selection: a few milliseconds
1-2 seconds
excluding Yahoo! and Google: ~100 documents in 3-4 seconds!
including Yahoo! and Google: ~100 documents in 6 minutes!
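The two totals can be checked with a short calculation. The per-step figures are the ones from the slide; taking 2.5 seconds as the midpoint of "2-3 seconds" and 100 documents per second for the steps without web services are assumptions.

```python
def batch_seconds(n_docs, seconds_per_doc):
    """Total time for n_docs when every document passes through all listed steps."""
    return n_docs * sum(seconds_per_doc)

# With web services: Yahoo! term extraction (~2.5 s, assumed midpoint) + Google expansion (1 s).
with_services = batch_seconds(100, [2.5, 1.0])
with_services_minutes = with_services / 60

# Without web services: extraction and expansion at about 100 documents per second each.
without_services = batch_seconds(100, [1 / 100, 1 / 100])
```

This gives roughly 350 seconds (just under 6 minutes) with the web services, and about 2 seconds without them, which together with the remaining 1-2 seconds listed on the slide matches the 3-4 seconds stated there.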
The advantage of this method lies in the external sources that are included; at the same time, these techniques cause large efficiency losses. The question now is: how important are Google and Yahoo!? Unfortunately, the authors did not answer that.
5. Related Work
Paper: [StoicaHearstRichardson2007]
Emilia Stoica, Marti A. Hearst, Megan Richardson, Automating Creation of Hierarchical Faceted Metadata Structures, 2007
How do they do that?
- Using WordNet for the extraction of facets
- Using IS-A relationships for building the tree
- Compressing the tree
5. Related Work
Stoica, Hearst, Richardson: Automating Creation of Hierarchical Faceted Metadata Structures, 2007
Algorithm: Step 1
Selection of terms which have a distribution larger than a threshold. Terms which can be adjectives, verbs or nouns will be deleted (e.g. hurry - hurried).
5. Related Work
Stoica, Hearst, Richardson: Automating Creation of Hierarchical Faceted Metadata Structures, 2007
Algorithm: Step 2
- If a term has only one sense in WordNet, put it in the tree.
- If not, expand to WordNet Domains [Magnini2000]: semi-automatic annotation with one of 200 Dewey Decimal Classification labels (manual intervention).
- Merging the paths
If a word is unambiguous, such as banana, it fits under fruit and can be added. WordNet is not complete, so WordNet Domains is consulted. Every noun is then classified semi-automatically with one of 200 DDC labels. There are categories such as history, literature, zoology. Path merging: parents are merged.
5. Related Work
Stoica, Hearst, Richardson: Automating Creation of Hierarchical Faceted Metadata Structures, 2007
Algorithm: Step 3
Now add the paths of target terms which are ambiguous in WordNet. If there are two paths, take the path with more assignments.
Example "date": 20 paths for the day sense and 700 paths for the fruit sense, so I take fruit.
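The selection rule from this step can be written down in a few lines. This is a sketch under assumptions: paths are represented as tuples of node names, the node names themselves are invented, and `assignments` holds how many terms were already placed under each path (the counts 700 and 20 are the ones from the example).

```python
from collections import Counter

def choose_path(candidate_paths, assignments):
    """Pick the candidate path under which the most terms were already assigned."""
    return max(candidate_paths, key=lambda path: assignments[path])

# Invented paths for the ambiguous term "date", with the counts from the example.
fruit_path = ("entity", "food", "fruit")
day_path = ("entity", "time", "day")
assignments = Counter({fruit_path: 700, day_path: 20})

chosen = choose_path([day_path, fruit_path], assignments)
```

Because `Counter` returns 0 for unknown paths, a candidate path that has no assignments yet simply loses against any path that already has some.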
5. Related Work
Stoica, Hearst, Richardson: Automating Creation of Hierarchical Faceted Metadata Structures, 2007
Algorithm: Step 4
- Delete parents that have fewer than k children.
- Eliminate a child whose name appears within the parent's name.
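The two compression rules can be sketched on a nested-dictionary tree. This is one possible reading, stated as an assumption: a deleted parent hands its children to the level above, an eliminated child hands its own children to its parent, the root and the leaves are never deleted, and "appears within" is interpreted as a substring match. The function name and the example tree are hypothetical.

```python
def compress(name, children, k, is_root=True):
    """Return the (name, subtree) nodes that replace this node after compression."""
    kept = {}
    for child, grandchildren in children.items():
        for sub_name, sub_tree in compress(child, grandchildren, k, is_root=False):
            if sub_name in name:
                # Rule 2: the child's name appears within the parent's name,
                # so drop the child and keep its own children instead.
                kept.update(sub_tree)
            else:
                kept[sub_name] = sub_tree
    if not is_root and 0 < len(kept) < k:
        # Rule 1: a parent with fewer than k children is removed and
        # its children move up one level.
        return list(kept.items())
    return [(name, kept)]

# Invented example tree below the root "food".
tree = {
    "course": {
        "dessert": {
            "frozen dessert": {"sherbet": {}, "ice cream sundae": {"sundae": {}}},
            "pudding": {},
        }
    }
}
compressed = compress("food", tree, k=2)
```

With k = 2, the node "course" disappears because it has only one child, and "sundae" is eliminated because its name appears within "ice cream sundae".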
5. Related Work
Stoica, Hearst, Richardson: Automating Creation of Hierarchical Faceted Metadata Structures, 2007
5. Related Work
Algorithmic Comparison
Term Detection
- StoicaHearstRichardson2007: term distribution via WordNet
- DakkaIpeirotis2008: Yahoo! Term Extraction, LingPipe, Wikipedia
Facet Detection
- StoicaHearstRichardson2007: assign ambiguous terms to Dewey Decimal Classification; prune top-level categories by cropping the tree
- DakkaIpeirotis2008: Google, Wikipedia Graph, Wikipedia Synonyms, WordNet Hypernyms
Tree Construction
- StoicaHearstRichardson2007: building a core tree and extending it with IS-A paths; compressed tree with eliminated top levels
- DakkaIpeirotis2008: Subsumption Hierarchy
5. Related Work
Who is better?
variety of areas
external resources
term detection
6. Conclusion
DakkaIpeirotis2008
Pros
- unsupervised
- expandable
- adaptable (various languages)
- it is easy to extract facets from large datasets
- good recall and precision
Cons
- external resources need much running time
- no 100% functionality
- some articles can be inserted into wrong facets
6. Conclusion
Missing points and possible future work:
- improve the efficiency, make the algorithms more efficient
- integrate more specialized resources
- use ontologies to combine WordNet and Wikipedia in a single resource
- use dynamic programming or rather offline calculation for faster results when including Yahoo! Term Extractor and Google
- implement the tree building process from SHR2007 instead of subsumption hierarchies
Sources
[DakkaIpeirotis2008]: Automatic Extraction of Useful Facet Hierarchies from Text Databases, Wisam Dakka, Panagiotis G. Ipeirotis, 2008
[DakkaIpeirotisWood2005]: Automatic Construction of Multifaceted Browsing Interfaces, Wisam Dakka, Panagiotis G. Ipeirotis, Kenneth R. Wood, 2005
[DakkaDayalIpeirotis2006]: Automatic Discovery of Useful Facet Terms, Wisam Dakka, Rishabh Dayal, Panagiotis G. Ipeirotis, 2006
[StoicaHearstRichardson2007]: Automating Creation of Hierarchical Faceted Metadata Structures, Emilia Stoica, Marti A. Hearst, Megan Richardson, 2007
[Magnini2000]: Integrating subject field codes into WordNet, Bernardo Magnini, 2000
[AMT-Link]: Amazon Mechanical Turk Service, http://www.mturk.com
[SandersonCroft1999]: Deriving concept hierarchies from text, M. Sanderson, W. B. Croft, 1999