
Directed Study Report

“Jasmine” Search Engine


Chris Demwell

Under Supervision By
Dr. Ljiljana Trajkovic
School of Engineering Science
Simon Fraser University
1 ABSTRACT ....................................................................................................................... 4

2 INTRODUCTION: FUZZY LOGIC AND SEARCHING ............................................... 4

3 CRITERIA FOR COMPARISON OF SEARCH TECHNOLOGIES ............................. 5

4 THE STATE OF THE ART .............................................................................................. 6


4.1 EXISTING SEARCH ENGINES .......................................................................................... 6
4.1.1 Term Based Engines............................................................................................... 6
4.1.2 Popularity Based Engines ...................................................................................... 6
4.1.3 Semantic Engines................................................................................................... 7
4.1.4 Clustering Based Engines ...................................................................................... 8
4.2 RECENT RESEARCH ....................................................................................................... 9
5 JASMINE CONCEPTUAL DISCUSSION..................................................................... 12

6 JASMINE FUNCTIONAL AND ARCHITECTURAL REQUIREMENTS.................. 13


6.1 GENERAL DESCRIPTION............................................................................................... 13
6.1.1 Product Perspective ............................................................................................. 13
6.1.2 Product Functions................................................................................................ 14
6.1.3 User Characteristics ............................................................................................ 15
6.1.4 General Constraints............................................................................................. 15
6.1.5 Assumptions and Dependencies............................................................................ 16
6.2 SPECIFIC REQUIREMENTS ............................................................................................ 16
6.2.1 Functional Requirements ..................................................................................... 16
6.2.2 External Interface Requirements .......................................................................... 19
6.2.3 Performance Requirements .................................................................................. 22
6.2.4 Attributes ............................................................................................................. 23
7 CONCLUSION ................................................................................................................ 23

8 REFERENCES ................................................................................................................ 23

9 FURTHER READING .................................................................................................... 25


9.1 ARTICLES .................................................................................................................... 25
9.2 ONLINE RESOURCES.................................................................................................... 25
10 FIGURES...................................................................................................................... 26

1 Abstract
Language, and therefore text, is a process of negotiated meaning. Because of individual
and cultural schema, there is never a perfect or even a near-perfect correlation in understanding
meaning. Therefore, the grey, approximated area is where miscommunication and search engine
error can occur.
Fuzzy Logic has been shown to be an invaluable tool for engineers building systems which
perform approximate reasoning to accomplish their tasks. However, even though approximate
reasoning is desirable in searching for documents according to similarity between topics, little
work has pursued the application of Fuzzy Logic methods to text searching.
This document presents a brief introduction to Fuzzy Logic and Searching and presents
several methods in use by active public search engines on the World Wide Web, then discusses a
selection of current research in intelligent searching. With this as background, a platform is
specified for further research into Fuzzy Logic based approximate searching. This system, named
“Jasmine” for the purposes of discussion, searches a database of documents characterised using a
researcher-defined document characterisation algorithm and a researcher-defined fuzzy similarity
measurement algorithm.
With this system, a researcher may study the performance of various fuzzy similarity
measurement algorithms in a practical situation with real-world queries, using a stable platform
for user interface and database interaction. Jasmine therefore represents a means by which
characterisation and matching techniques may be compared experimentally.

2 Introduction: Fuzzy Logic and Searching


Fuzzy Logic is a method for grouping items into sets which does not require that an item be
either in or out of a set completely [MENDEL95]. Instead, it requires that the item be a fuzzy
member of a set, where that membership can be expressed as a real value between 0 and 1. Fuzzy
membership of 0 means that the item is not a member of the set at all, fuzzy membership of 1
means that the item is completely a member of the set, and any membership value in between
means that the item is partly a member of this set. Essentially, a fuzzy set is one in which the
boundary of the set is gradual rather than crisp.
For example, imagine that we wanted to classify a group of men into the groups “short
men”, “medium-height men”, and “tall men”. While a man who was three feet tall would almost
certainly fall into the “short men” category with a fuzzy membership value of 1, a man who
stands 5 feet, 10 inches tall could be described as “tall” or “medium-height” to some degree of
accuracy. We could assign this man membership values of 0.7 for “tall men” and 0.3 for
“medium-height men” and be quite correct.
Fuzzy sets can be characterised by a fuzzy membership function, a function which takes as
input a number of attributes of the items to be classified, and provides as output a fuzzy
membership value between 0 and 1. In some fuzzy literature, the fuzzy membership function is
discussed as if it were functionally equivalent to the fuzzy set.
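The height example above can be sketched with a simple piecewise-linear membership function. The breakpoints below (66 and 74 inches) are illustrative assumptions, not values from this report:

```python
def tall_membership(height_inches: float) -> float:
    """Fuzzy membership in the set "tall men".

    Fully outside the set below 66 inches, fully inside above 74 inches,
    and a gradual (rather than crisp) boundary in between.
    """
    low, high = 66.0, 74.0
    if height_inches <= low:
        return 0.0
    if height_inches >= high:
        return 1.0
    return (height_inches - low) / (high - low)

print(tall_membership(36))  # a three-foot-tall man: 0.0, not "tall" at all
print(tall_membership(70))  # 5'10": 0.5, partly a member of "tall men"
print(tall_membership(78))  # 6'6": 1.0, completely "tall"
```

A full classifier would define similar functions for "short men" and "medium-height men"; the memberships need not sum to 1 unless the functions are designed that way.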
It is intuitively obvious that fuzzy sets represent a better approximation to human reasoning
than crisp logic. Humans do not generally consider complex concepts as being “black and white”.
Such is also the case when searching for documents; rarely is it the case that only a perfect
example of the document one is looking for will do, and rarely does one find a perfect example.
Searching at its simplest takes the form of matching some known information expected in
the wanted documents, a “key”, against a number of documents in order to find a document or
group of documents that is most similar to the key. It is common to use some kind of
characterisation, or indexing, of the document and the key that discards all but the relevant
information from the key and the documents. Frequently this characterisation takes the form of a

pre-processing such as the removal of commonly used meaningless “stop” words, followed by a
summarisation of the document. For example, articles such as “a”, “the”, and “and” would be
removed, and the frequency of occurrence of the remaining words in the document would be
counted and stored. In this way, comparisons can be done between the characterisation of the key
and the characterisation of the documents. This is advantageous when the characterisation
improves the ratio of relevant information in the document to the amount of misleading or
meaningless text, or when processing the document and the key allows faster identification of
interesting information.
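This characterisation step can be sketched in a few lines; the stop-word list below is a tiny illustrative sample, not a complete one:

```python
from collections import Counter

# A small illustrative stop-word list; a real engine would use a longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "in", "to", "is"}

def characterise(text: str) -> Counter:
    """Characterise a document by the frequency of its remaining words
    after stop words are removed."""
    words = [w for w in text.lower().split() if w.isalpha()]
    return Counter(w for w in words if w not in STOP_WORDS)

summary = characterise("boating on a lake of cream and the boating of boats")
print(summary)  # "boating" counted twice; "on", "a", "of", "and", "the" discarded
```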
Fuzzy logic can be applied to the searching task at each of its three main phases. Firstly, it
can be applied to the query subtask; the searcher can submit a key that has weighted parts instead
of weighting every part of the search key evenly. The key would consist of several search terms
and explicitly and arbitrarily defined fuzzy membership values for each part of the key, which
would be defined by the searcher.
Secondly, it can be applied to the characterisation of the documents in some way; each
document would be described by a number of data components and associated fuzzy membership
functions that would denote to what degree each data component characterised the document.
Finally, fuzzy logic can be applied to searching by modification of the pattern-matching
subtask. Considering the characterisation of the key as the input to a fuzzy membership function
for the fuzzy set of documents similar to one of the documents, we can return a membership value
between 0 and 1 describing how much the key and the document are similar. This can be used for
each document in turn, resulting in a list of documents which can be ordered by how similar they
are to the search key.
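The third phase can be illustrated with a deliberately simple similarity measure, the fraction of key terms present in a document. This is only a placeholder for whichever researcher-defined fuzzy measure Jasmine would actually plug in, and the documents are invented:

```python
def similarity(key_terms, doc_terms):
    """Membership of a document in the fuzzy set "documents similar to
    the key", expressed as a value between 0 and 1."""
    key = set(key_terms)
    if not key:
        return 0.0
    return len(key & set(doc_terms)) / len(key)

# Hypothetical characterised documents and a two-term key.
docs = {
    "d1": ["boating", "lake", "fishing"],
    "d2": ["cream", "recipes"],
    "d3": ["boating", "cream", "lake"],
}
key = ["boating", "cream"]
ranked = sorted(docs, key=lambda d: similarity(key, docs[d]), reverse=True)
print(ranked)  # "d3" ranks first: it is the only document matching both key terms
```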
Both the second and final approach have been considered in the creation of the Jasmine
search framework. The first approach was rejected: asking the user to disambiguate the query
regardless of whether ambiguity actually exists forces too much calculation and complexity on
the user, who should not have to deal with such things [NORMAN90]. Instead, the software should do as much
of the interpretation as is possible, either by searching through fuzzy data characterisations or
fuzzily matching keys to documents, or both.

3 Criteria for Comparison of Search Technologies


There is no way to describe the performance of a search technology without relying to some
degree on the satisfaction of the searcher with the results of the search. However, there are two
uniformly recognised measures that quantify it: precision and recall.
Precision is the proportion of retrieved documents that are relevant to the query; that is, they
are related to what the searcher wanted to find. If all the documents returned are relevant, then the
query has 100% precision.
Recall is the proportion of relevant documents that are retrieved by a query; that is, the
relevant documents that are found at all. If all the relevant documents are found, the query can be
said to have 100% recall.
The perfect query would have 100% precision and 100% recall. In practice, increasing either
of these values tends to be a trade-off for a reduced value of the other. That is, for a given search
algorithm, increasing precision, for example by constraining the list of documents found to only
those that are certainly relevant, tends to decrease recall. In this case, some relevant documents
may not be shown because the engine was uncertain as to whether or not they were relevant.
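Both measures follow directly from the sets of retrieved and relevant documents; the document identifiers below are made up for illustration:

```python
def precision(retrieved, relevant):
    """Proportion of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Proportion of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d5"}
print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant: 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant documents were found
```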

4 The State of the Art

4.1 Existing Search Engines


Search engines commonly used on the World Wide Web use a variety of different methods
to find documents on the web. Of these, four of the methods most relevant to the Jasmine
framework will be discussed.

4.1.1 Term Based Engines


Term-based searching considers only the terms contained in documents. Originally done
by matching the exact text of the key against the exact text of the documents, this is most often
done today by pre-processing and searching for “important” words and descendants of word
stems. A key of “boating on a lake of cream” would return not only results containing “boating”
but also “boat”, “boater”, and so on; it would not, however, favour documents that contain the
stop words “of”, “on”, or “a”. Term-based queries are often made
using Boolean logic, of the form “boating AND cream AND NOT (canoe OR yacht)”. The words
“and”, “or” and “not” are reserved as non-search terms, and instead direct the search engine to
change the way it does queries to include and exclude the returned document sets.

Figure 1: Engines like Altavista.com (shown) and Yahoo.com use term-based searching.

While term-based searching is relatively crude, it is the basis of almost all other
kinds of searching in some way or another. Altavista.com, yahoo.com, and other engines all use
variants of term-based searching although some have recently begun to implement more
sophisticated search algorithms. In Figure 1, we can see that term-based searches have been done
by the altavista.com search engine in a number of databases including the web search database, a
reviewed sites database, and a database of other searches. However, no attempt is made to
disambiguate the query before the user looks at it. Visible in this section of the search results, we
can see four different senses of the term “cream” without any indication that the engine considers
the terms differently. The searcher must search through the returned sites manually using
contextual clues provided, like “Skin care products enriched with Dead Sea minerals”, which
does a lot to disambiguate “Dead Sea Care”. Furthermore, the construction of the query has a
large effect on the authoritativeness and topic of the sites returned.
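The Boolean query above can be hand-coded against a document's term set; a real term-based engine would of course parse arbitrary queries and also match stemmed variants such as "boat":

```python
def matches(doc_terms: set) -> bool:
    """Evaluate "boating AND cream AND NOT (canoe OR yacht)" against a
    document characterised as a set of terms."""
    return ("boating" in doc_terms
            and "cream" in doc_terms
            and not ({"canoe", "yacht"} & doc_terms))

print(matches({"boating", "cream", "lake"}))   # True: both required terms, no excluded terms
print(matches({"boating", "cream", "canoe"}))  # False: "canoe" is excluded
```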

4.1.2 Popularity Based Engines

Popularity-based searching takes into account the popularity of the document under
examination in order to find documents that are considered more authoritative based on their
popularity, the assumption being that there is a higher probability that a given search will be
satisfied by a more popular document.

Figure 2: Google's query results are popular, but ambiguous.

Introduced by Chakrabarti, Dom, Kumar, et al. in [CHAKRABARTI+98], this method has been
gaining popular acknowledgement through sites such as google.com (http://www.google.com).
This is often easier to do on the World Wide Web than in plain text
databases, as the web contains implicit information about endorsement of sites in its link
structure. Each hyperlink is considered an endorsement of the linked-to site by the linking
document. Links from more endorsed sites can be weighted more on subsequent iterations. One
algorithm that exploits the link structure is called HITS (Hyperlink-Induced Topic Search) and
was developed by Kleinberg in [KLEINBERG99]. Popularity-based searching is very effective in
finding authoritative sites, even if they are authoritative on a topic that has nothing to do with what
the searcher intended. It relies entirely on probability to determine intent, although this is not to
say that it is impossible to use other strategies. As with term-based search engines, the user must
disambiguate manually although it is likely that the returned sites will be more authoritative and
therefore closer to what the user wanted to find.

4.1.3 Semantic Engines


Semantic searching makes an attempt to determine the meaning behind the query's text, rather
than blindly searching for the query terms and forcing the searcher to sort out the topic of those
sites by context. Often, this takes the form of matching words in the search key against words in
an ontology - a hierarchical organisation of terms according to the specificity of their meanings -
in a database. Once the word is found, the search engine may disambiguate the user's query by
effectively having the user walk down the ontological tree from the search term to the most
specific semantic subclass of that term. In our example, the user might walk down from "cream"
to "double Devonshire heavy cream".

Figure 3: Both Oingo and Simpli (shown) disambiguate terms according to their respective
ontologies.
Several new web search engines, including Oingo.com and Simpli.com use an ontological
method of disambiguation. From [WEB1]:

“After you enter a term(s) into the search field, SimpliFind matches the term to a proprietary
relational knowledgebase called SimpliNet™ that automatically generates word concepts and
associations. If the term that is entered is recognized, the SimpliNet database retrieves a list of
concepts and generates a pull-down menu based on those concepts.”

Semantic searching appears to be a much better way to characterise documents, because one can
characterise the document in terms of concepts rather than terms, and because intuitively,
searchers are almost always interested in locating concepts rather than terms. Further, since the
search engine does not have to search documents that contain only alternate meanings of the
search terms, the search may take up fewer system resources on the search engine’s hardware.
However, it is very difficult to automate the process of generating the ontology, and equally
difficult to automate the recognition of these terms inside the searchable documents. Furthermore,
this disambiguation does not perform any actual searching; rather, it narrows the field by
allowing ambiguous words to be specified down to their contextual meanings. Some other
technique must thereafter be applied to search through the documents for those that contain the
contextual meaning of the term in question.
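The walk down the ontological tree might look like the following; the hierarchy is invented for illustration and is not SimpliNet's actual knowledgebase:

```python
# Each term maps to its more specific subclasses (a directed tree).
ONTOLOGY = {
    "cream": ["dairy cream", "skin cream", "cream (band)"],
    "dairy cream": ["double Devonshire heavy cream", "light cream"],
}

def refine(term: str) -> list:
    """Return the more specific senses a searcher could walk down to."""
    return ONTOLOGY.get(term, [])

# The engine presents these choices and the user walks down the tree:
step1 = refine("cream")        # first-level disambiguation choices
step2 = refine("dairy cream")  # reaches "double Devonshire heavy cream"
print(step1)
print(step2)
```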

4.1.4 Clustering Based Engines


Clustering Based Engines use a post-search process to group query results into a number of
discernible clusters based on statistical measures. Data mining techniques are used to consolidate
somewhat similar pages into groups, which are hopefully recognisable as reasonable, conceptual
groupings such as one might find in a semantic ontology. The engine tries to allow the searcher to
ignore parts of the query results not relevant to the search, rather than pruning the irrelevant
information out first as in semantic searching. There is no guarantee that the engine will produce
useful clusters, however. Vivisimo.com (A search engine spun off of research from Carnegie
Mellon University) disclaims in [WEB2] that “...our technology is not perfect: the diligent user
will surely spot an occasional annotation that only a machine would make up.”

Figure 4: Vivisimo uses term based searching, then performs clustering on the results.
The obvious advantage to clustering is that the engine does not require an expensive ontology like
semantic engines do, while still allowing the searcher to ignore unrelated query results.
Unfortunately, the full search must still be performed, taking up the full amount of the search
engine’s resources for that query, and the clusters produced may not be interesting, or even
understandable to the searcher. Like semantic engines, clustering does no searching in and of
itself; all the searching must be done by another search engine, such as one or more term-based
search engines.
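A crude stand-in for such post-search clustering is to group returned results that share any characterising term; Vivisimo's actual statistical method is more sophisticated, and the results below are invented:

```python
def cluster_results(results):
    """Greedily group (title, terms) query results that share any term."""
    clusters = []
    for title, terms in results:
        for cluster in clusters:
            if cluster["terms"] & terms:
                cluster["titles"].append(title)
                cluster["terms"] |= terms
                break
        else:
            # No overlap with any existing cluster: start a new one.
            clusters.append({"titles": [title], "terms": set(terms)})
    return clusters

results = [
    ("Ben & Jerry's", {"ice", "cream", "dessert"}),
    ("Cream nightclub", {"club", "music"}),
    ("Gelato guide", {"ice", "dessert"}),
]
clusters = cluster_results(results)
print(len(clusters))  # 2: the two dessert pages merge; the nightclub stands alone
```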

4.2 Recent Research


Searching technology has been the topic of a significant amount of research in recent years. Fields
such as data mining (commonly known in the field as Knowledge Discovery in Databases or
KDD), and digital libraries have produced large amounts of work regarding searching for
documents, using statistical methods which are elegant and complex. However, there is very little
overlap between fuzzy logic research and research into search technologies. In order to find a
basis from which the Jasmine specification could be built, it was necessary to consider papers
from both fields and synthesise a common ground to work from.
It is very difficult to process plain text documents in a complex manner without first
extracting information about the document and identifying special kinds of information about it.
For a simple example, one might like to extract the author of the document, so that one could
search for documents by author in a term-based manner. In order to perform more sophisticated
searching, one might like to find key words and phrases, match against an ontological1 database,
translate into machine-readable symbolic language, produce document summaries, or extract
statistical measures about the document.

1 An ontology is a hierarchical organisation of terms according to the specificity of their meanings. It could
be represented by a directed graph with a branching factor greater than one, since general concepts tend to
subsume more than one more specific term.

Frank et al. have proposed an algorithm in [FRANK+99] for automatically extracting
keyphrases2 from text documents which relies on naive Bayesian classification to crisply separate
words into or out of the set of keyphrases. Candidate phrases are generated by splitting the
document into a list of phrases and eliminating most with simple tests; the surviving candidates
are then classified by a naive Bayesian classifier based on the specificity of the phrase to the document and the placement
of the first occurrence of the phrase in the document. These keyphrases can then be used to
summarise the subject of a document. This algorithm increases in accuracy when information
about the knowledge domain being searched can be used to help select phrases that are more
descriptive. With an automatic keyphrase identification algorithm such as this, it would be
possible to identify descriptive words and phrases within documents and compare these
keyphrases and keywords across documents. When combined with an ontological database,
keyphrase or keyword meanings might be inferred and thereby fitted into the ontology.
Thereafter search keys could be matched against the keyphrases, and more general as well as
more specific meanings could be associated with the keyphrases that would not ordinarily be
found in a term by term manner. Because the keyphrase is likely to contain several keywords, it is
reasonable to expect that documents which match the generalised concepts of many of those
keywords are more likely to be similar to the document at hand than those which match the
generalised concept of only one keyword.
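The two features driving that classification, specificity of the phrase to the document and the position of its first occurrence, might be computed as below. This sketch is simplified to single words rather than phrases, and the corpus statistics are invented rather than taken from [FRANK+99]:

```python
import math

def keyphrase_features(word, doc_words, doc_freq, n_docs):
    """Return (specificity, first_occurrence) for a candidate keyphrase.

    Specificity is a tf x idf style score; first_occurrence is the
    relative position of the word's first appearance in the document.
    """
    tf = doc_words.count(word) / len(doc_words)
    idf = math.log(n_docs / (1 + doc_freq.get(word, 0)))
    first = doc_words.index(word) / len(doc_words)
    return tf * idf, first

doc = ["fuzzy", "search", "engines", "rank", "fuzzy", "documents"]
specificity, first = keyphrase_features("fuzzy", doc, {"fuzzy": 3}, n_docs=100)
print(specificity, first)  # "fuzzy" is frequent here, rare elsewhere, and appears early
```

A Bayesian classifier would then compare these feature values against their likelihoods among known keyphrases and non-keyphrases in training data.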
The ontology, which must be created in order to make use of such a sophisticated system, is
difficult to construct. It is unreasonable to build by hand a comprehensive database relating
meanings of any reasonably large number of words, so some automated system must be devised.
Paliouras, Karkaletsis, and Spyropoulos attempted to use decision tree machine learning
techniques to disambiguate a known text in [KARKALETSIS+99]. They found that decision trees
containing about 1000 nodes disambiguated words with a precision and accuracy of about 90%,
albeit with a recall of about 60%. This means that although the tree tends to be conservative in
declaring a recognised sense, the system correctly recognised or rejected words 90% of the time.
The authors suggest that the conservatism of the decision tree may be due to an artefact of the
training process, that is that there were many more negative training examples (where the word
was not used in the assumed sense) than positive ones.
Many times, it is desirable to be able to search through a text that is not a hypertext.
Many databases of knowledge are made up of unenriched, flat text, and for many kinds of writing
it takes more effort to produce a hypertext than a plain text. Regardless, there are advantages to
reading and searching a hypertext, namely that one can use the information implicit in the link
structure to locate information as well as to assess the authoritativeness of linked-to documents as
in [CHAKRABARTI+99]. In [KIM+99], Kim, Nam, and Shin outline a method to construct
hypertexts from flat text data whose output closely resembles hypertexts created from flat text by
human experts. Although the paper's language presents a barrier to this reader, it is
clear that the technique uses both statistical and semantic information to place hyperlinks and
hyperlink targets, although the semantic information used is entirely derived from pre-existing
thesauri. The statistical similarity measure is obtained by a similar technique to that used in
[FRANK+99], a technique referred to as tf × idf, together with an inner vector product. The text is broken
into blocks via the TextTiling technique [HEARST93]. The hypertext link structure is then
constructed by inserting a link from a keyword to a text block whenever that block contains a
sufficient “weight” of the keyword. It is unclear from the paper how this weight is obtained, although I assume it involves
the density of the keyword within the block and within the text, similarly to the method described
in [FRANK+99].
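The tf × idf weighting and inner vector product can be sketched as follows; the document-frequency table is invented for illustration:

```python
import math

def tfidf_vector(words, doc_freq, n_docs):
    """tf x idf weights for one document, as a sparse vector (dict)."""
    vec = {}
    for w in set(words):
        tf = words.count(w) / len(words)
        idf = math.log(n_docs / (1 + doc_freq.get(w, 0)))
        vec[w] = tf * idf
    return vec

def inner_product(a, b):
    """Similarity as the inner product of two sparse tf-idf vectors."""
    return sum(a[w] * b.get(w, 0.0) for w in a)

# Invented corpus statistics: each term's document frequency in 100 documents.
df = {"boating": 10, "lake": 20, "cream": 15}
v1 = tfidf_vector(["boating", "lake", "boating"], df, n_docs=100)
v2 = tfidf_vector(["boating", "cream"], df, n_docs=100)
v3 = tfidf_vector(["cream", "cream"], df, n_docs=100)
print(inner_product(v1, v2) > inner_product(v1, v3))  # True: v1 and v2 share "boating"
```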
When searching through hypertexts, it is possible to extract information from the link
structure itself. As outlined by Chakrabarti et al., in [CHAKRABARTI+98] and again in

2 A keyphrase is a phrase that properly describes the document it occurs in. It can be a single keyword,
which is the degenerate case. Keyphrases are often used to topically characterise a document.

[CHAKRABARTI+99], the link structure of the World Wide Web has an underlying social
connotation that confers an endorsement by the linking document upon the linked-to document.
Even if this is not true in every case (some links are merely navigational, or are paid
advertisements), it will intuitively be true in aggregate. The Hyperlink Induced Topic Search computes hubs,
documents which link to a variety of topically similar documents, and defines the subject of target
documents by the descriptions that linking documents provide for the link targets. By associating
a hub weight and an authority weight to each page found in a directed graph formed from the
documents resulting from a simple term search, it is possible to iteratively update weighting to
reflect endorsements from more or less heavily weighted documents. This follows the intuitive
concept that a “good” hub has many outgoing links to “good” authorities, and “good” authorities
have many incoming links from “good” hubs. In the first iteration, each document is assigned an
even weight, but on each successive iteration the weights are updated. Other interesting
information that can be mined from the link structure includes more or less enclosed
“communities” of documents that refer to each other in what is often a topically similar
collection. This algorithm is very efficient in that most of the processing can be done once as a
pre-processing step and then re-used for all subsequent queries. It fails, however, to address
semantic ambiguity and will tend to generalise narrowly focussed queries into more popular
results.
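A minimal sketch of this iteration on a toy link graph; the normalisation and fixed iteration count here are simplistic compared with Kleinberg's presentation:

```python
def hits(links, iterations=20):
    """Iteratively update hub and authority weights over a directed
    link graph given as {page: [linked-to pages]}."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    hub = {p: 1.0 for p in pages}   # every page starts with even weight
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub weights linking to it.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        # A page's hub weight is the sum of the authority weights it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise so the weights stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# A toy graph: h1 and h2 are hubs; a1 receives more hub endorsement than a2.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
print(auth["a1"] > auth["a2"])  # True: a1 has more incoming hub weight
```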
In 1997, Weinstein and Alloway described in [ALLOWAY+97] a successful project to
build a set of ontologies for the University of Michigan Digital Library (online at
http://www.si.umich.edu/UMDL/ ) using a distributed system of intelligent agents3. Ontologies
were used not just to classify information or disambiguate text terms, but also to model the
metadata of the library, including all services, licenses, and content. In this system, agents import
and export services to and from each other as well as the users of the system. Each agent is
designed to deal with only one local topic, but since they can provide services to each other, an
agent that takes a query from the user may solicit information from other agents that can help it
solve the problem. Because nearly all information is stored in ontologies, negotiation can occur to
fit information from one agent into the ontology of the seeking agent, either by walking up the
ontologies to find a common supertopic, or walking down the ontologies to find a group of
common subtopics. In the case that there is no common ground, it would then be possible to enlist
a third party agent to mediate the exchange. A system such as this allows for diverse and
changing library content by permitting the use of domain-specific phrase sense extraction
techniques such as that described in [KARKALETSIS+99], while still synthesising these domains
into a unified, searchable digital library.
Anne Veling and Peter van der Weerd describe in [VELING+98] a technique that
automatically disambiguates the user's query by clustering the information in the document
library. The documents were clustered based on "word co-occurrence networks", which are
defined in their paper as consisting "of concepts that are linked if they often appear close to each
other in the database". Veling and van der Weerd's work is based on the assumption that words
which have similar meanings tend to be located close to each other in the database. This
assumption is not always true, and as a result some of the clusters found do not define a coherent
concept (or, as the authors put it, are "non-intuitive clusters"). Be that as it may, this characterisation
of the document appears very lightweight. The calculations are fast enough that Veling and van
der Weerd were able to process three hundred thousand documents in half an hour on a 200 MHz
x86-compatible processor (one might assume an Intel Pentium was used), and only four hundred
megabytes of hard drive space. Although more detailed information about the hardware is not
available, one might expect that this "simple desktop PC" could be a limiting factor.

3 An agent is a computer program that can perform some task and can communicate with other agents.
Agents can be semiautonomous: they manage their own resources and dynamically seek out new resources
(provided by other agents) to complete their tasks.

There is very little work to be found regarding the application of fuzzy logic to the
searching task. The bulk of work which seems applicable generally pertains to incorporating
fuzzy reasoning into databases, such as in [BOUAZIZ+98], wherein Antoni Wolski and Tarik
Bouaziz propose a method by which traditional crisp database triggers may be replaced by fuzzy
ones. A database trigger is a piece of logic which, upon certain conditions being true, executes
and modifies the database in some way or performs some external action. While this work
discusses some potentially useful techniques in detail, they are somewhat beyond the scope of
this document.

5 Jasmine Conceptual Discussion


Language, and therefore text, is a process of negotiated meaning. Because of individual
and cultural schema, there is never a perfect or even a near-perfect correlation in understanding
meaning. Therefore, the grey or "fuzzy" area is where miscommunication and search
engine error can occur. It is this fuzzy area that is negotiated between people during
communication and which must be interpreted by Jasmine.
Because language is a social construct, it is not ordered into finite categories. Items such as
synonyms, homonyms, and words with multiple meanings overlap and cause confusion. Ideally, it
would be possible to represent the contextual meaning of a word as a probability, modelled as a
fuzzy membership function. Just as in a fuzzy set, when a sample deviates towards the edges of
the set, strength of membership – in this case, shared understanding – is lost. The curve of the
graph of the fuzzy membership function would represent the incidence of correlative meaning.
The purpose of this document is to prepare the groundwork for an implementation of a
system such as FuzzyBase, the system proposed in [RUBIN+98]. It is obvious from this paper
and a basic knowledge of the field that building FuzzyBase is far from trivial. FuzzyBase would
require complex logic to accomplish its task well, and elegant logic to accomplish it quickly.
Before this can be accomplished, a framework must be built to accommodate research into the
key components of the search engine.
In order to create this framework for future work in the area of fuzzy searching, it was
necessary to make some assumptions about how the fuzzy logic methodology would be applied to
the searching task: namely, that fuzzy logic would be used to search through fuzzy data
characterisations, to fuzzily match keys to documents, or both. However, I have
endeavoured throughout the project to maintain a modularity that would allow researchers to
change the method used for document characterisation as well as the fuzzy membership function
for the set of similar documents.
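The intended modularity can be illustrated by treating both the characterisation method and the membership function as interchangeable callables. The names below and the particular term-frequency and overlap implementations are hypothetical stand-ins, not methods Jasmine prescribes.

```python
from typing import Callable, Dict, List, Tuple

Characterizer = Callable[[str], Dict[str, float]]
Membership = Callable[[Dict[str, float], Dict[str, float]], float]

def term_frequency(text: str) -> Dict[str, float]:
    """One possible characterizer: normalised term frequencies."""
    counts: Dict[str, float] = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0.0) + 1.0
    total = sum(counts.values()) or 1.0
    return {word: n / total for word, n in counts.items()}

def overlap_membership(key_vec: Dict[str, float],
                       doc_vec: Dict[str, float]) -> float:
    """One possible membership function: term mass shared by the vectors."""
    return sum(min(v, doc_vec.get(term, 0.0)) for term, v in key_vec.items())

def search(key: str, docs: List[str], characterize: Characterizer,
           membership: Membership) -> List[Tuple[str, float]]:
    """Score every document; researchers swap in their own components."""
    key_vec = characterize(key)
    scored = [(doc, membership(key_vec, characterize(doc))) for doc in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Replacing the characterizer or the membership function requires no change to `search` itself, which is the property the framework is meant to preserve.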
It is likely that a tool such as Jasmine will be used mainly as a test bed for proposed
algorithms and techniques. The eventual implementation of a search engine that would result
from this project will most likely include a variety of technologies such as the ones discussed in
section 4.2. With these technologies alone, it would be possible to start with a plain text
document library, extract key phrases, build an ontology from them, insert appropriately located
hyperlinks into the documents and mine some if not all of those links for clues to
authoritativeness and community, and insert the metadata into a distributed, intelligent system of
agents for users to interact with.
It is likely that further extensions would also be useful for the improvement of the
system. For example, it has been shown in many documents, such as [COOLEY+00] and
[GREENBERG+97], that usage mining of hypertext can yield patterns which hold useful
information about how a system is used and by extension, what services are useful and interesting
to what users. Other papers such as [HOFMANN+99] indicate that explicit information from
users is useful, as in aggregation it can be a powerful tool. If Jasmine is to be extended to search
for multimedia information such as pictures, sound, video, or software, further effort must be
spent on integrating more complex metadata standards (as in [BALDONADO+99], [WEB3] and
[WEB4]) into the metadata of the engine itself. Implementation of these extensions will depend
heavily on the implementation of the document characterisation and matching systems of the
Jasmine engine and are therefore beyond the scope of the current phase of the project.

6 Jasmine Functional and Architectural Requirements

6.1 General Description

[Diagram: Client, Web Server, Jasmine Logic, Output Transform Logic, Database Transform
Logic, Data Libraries, and Data Mart, connected by Control, Crisp Query Results, Diverse Data
Formats, and Unique Data Format flows.]
Figure 5: Level 0 Data Flow Diagram, showing data flow around and through the Jasmine
system.

6.1.1 Product Perspective

Jasmine exists as a logic layer between a web server and a data mart. It uses a web server for user
control input and query output via HTML forms, sessions, and cookie queries running over HTTP
or some variant such as HTTPS. For simplicity, we will hereafter only consider HTTP. Jasmine
receives data via queries from a single data mart that may pool many subsidiary data sources.

Client
The client will be a user-accessible computer running software that allows access to documents
via HTTP and capable of browsing those documents via an HTML 4.0 compliant rendering
engine. The browser will be able to store and return cookies, and will execute JavaScript code on
the client side.

Control / Output
Control messages from the browser will be contained in the HTTP requests from the browser,
generated as replies to the HTML forms in the documents served by the web server. Output will
be in HTML and be contained in the HTTP body.

Web Server
The web server will handle HTTP connections and will package each HTTP request into a form
easily handled by the Jasmine logic. It mediates between HTTP clients and the Jasmine logic,
relaying the data that the logic returns.

Web Server / Jasmine communication (not shown in diagram)


The Jasmine logic will read the packaged HTTP request from the web server and will provide the
web server with the HTML content to send to the client.

Crisp Query Results


Database queries will be performed in a crisp manner. Data will be queried by the Jasmine logic
from the data mart through a single interface, such as a Structured Query Language (SQL) or
database-connectivity interface.

Data Mart
The data mart is a unified representation of the data from the original data libraries, transformed
and integrated to appear as one library of a standard format which Jasmine will be designed to
interpret. In the interest of performance, the data mart may actually store a transformed copy of
the data in the libraries; if reducing storage load matters more than the performance boost, the
loading and transformation can instead happen on the fly. Logic in the data mart will handle the
querying of all data libraries, even if a data library has no native query service (such as a flat file
with no associated database server).

Database Transform Logic


Logic that will transform the native library’s data format into a format which Jasmine is designed
to interpret. One transform logic component must exist for each native format to be transformed,
even if a single data library contains diverse formats. These components will allow Jasmine to
search existing data libraries without requiring the libraries to be completely transformed for the
task.
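The one-component-per-native-format idea can be sketched as below. The two native formats (CSV and JSON), the registry pattern, and the uniform list-of-records representation are all assumptions made for illustration; the actual uniform Jasmine format is left open by this specification.

```python
import csv
import io
import json

def transform_csv(raw: str):
    """Transform a CSV library's records into the assumed uniform format."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform_json(raw: str):
    """Transform a JSON library's records into the assumed uniform format."""
    records = json.loads(raw)
    return records if isinstance(records, list) else [records]

# One transform-logic component per native format, as the requirement states.
TRANSFORMS = {"csv": transform_csv, "json": transform_json}

def to_uniform(raw: str, fmt: str):
    """Dispatch raw library data to the transform for its native format."""
    return TRANSFORMS[fmt](raw)
```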

Data Format
Queries to a data library will be made using a method appropriate to that library. The format of
the results of these queries will depend on the method used and on the library in question.

Data Library
Data is stored in a Data Library. The library could be implemented using a relational database, a
flat file, an HTML document or a network data resource.

6.1.2 Product Functions


Jasmine’s functions can be easily grouped by user category (see section 6.1.3, User
Characteristics).

Searching functions
Searchers will be able to locate documents relevant to a particular topic by interacting with
Jasmine through forms presented to them through their browser.

Content Administration Functions
Access to a configuration store via the Administrator web tool will allow Content Administrators
to configure Jasmine to use new libraries available in the data mart. The security level bound
to the account in the security database will restrict access to System Administration Functions.

System Administration Functions


The System Administrators will be able to access a web configuration utility through a web
server, possibly the same web server as the one that serves Searchers’ requests. Through this
interface, the System Administrator will be able to shut down the Jasmine system or initiate a
backup of Jasmine’s persistent data stores including log file rotation and duplication of
configuration files. They will be able to kick off indexing or re-indexing of digital libraries that
Content Administrators have set up in the data mart and configured Jasmine to handle. The
security level bound to the account in the security database will restrict access to Content
Administration Functions.

6.1.3 User Characteristics


Jasmine users fall into the categories of Searchers, Content Administrators, and System
Administrators.

Searchers
Searchers are users who connect to Jasmine using a browser and seek to find a document or set of
related documents relevant to a particular topic. Most will have experience with traditional search
engines and will be familiar with the functions of the browser software that they are using.
Almost none will read an instruction document, but they will require online help to fall back on.
Many Searchers may be using Jasmine at any one time.

Content Administrators
Content Administrators will be responsible for the configuration and maintenance of Jasmine’s
searchable data libraries, including the addition and deletion of new digital libraries to be indexed
as well as handling the selection of correct data transformations for the digital libraries. These
users will understand the uniform Jasmine data interface format as well as the format of the data
for addition. They will be experienced with using Jasmine from a Searcher perspective and so
will be able to use the Searcher functionality to test their results. There may be many Content
Administrators working with Jasmine at one time.

System Administrators
System Administrators will be responsible for startup and shutdown of the Jasmine system as
well as backup of Jasmine’s persistent data and kicking off indexing of new digital libraries.
System Administrators are responsible for creating and deleting Administrator accounts as well as
handling security issues for Searchers. System Administrators will require printed manuals in
addition to detailed descriptions of the configuration of the instance of Jasmine they are
concerned with. To avoid concurrency issues, no two System Administrators can work on
Jasmine at once.

6.1.4 General Constraints


Jasmine can only interact with the user via a web browser, and is therefore limited in what it can
do by the bounds of what the browser allows. Due to technology that allows the extension of the
browser’s functionality, this constraint can be made very small indeed. Due to basic security
considerations, the Searcher must never be permitted to voluntarily add information directly into
Jasmine or into the data mart behind Jasmine, and Jasmine must never be party to information
stored by other software on the Searcher’s client computer system.

6.1.5 Assumptions and Dependencies


Jasmine is dependent on information stored in the correct format for it to retrieve and
display. If the format of the data mart is incorrect, or if the data mart does not exist, Jasmine
cannot function. It is also dependent on the web server that handles the data stream to and from
the client computer, as Jasmine itself cannot communicate using HTTP. While robustness will be maintained by
checking for error states where these dependencies are not met, it is generally assumed that other
system inputs and outputs, such as the client computer and browser, function properly.

6.2 Specific requirements

6.2.1 Functional Requirements

Overview
Jasmine will provide the following functionality:
• Access for Searchers via an interface to a web server and HTML 4.0 over HTTP
• Access for Content Administrators via the web server as above
• Access for System Administrators via the web server as above
• A searchable fuzzy index into a body of HTML documents
• An interface accessible by a web server which provides HTML results for searches made
through the web server.

Level 1 Data Flow Diagram

[Diagram: the processes Handle Web Server Input/Output, Characterize Request, Check Fuzzy
Membership in Fuzzy Similarity Sets, Limit and Format Output, and Administrate System
exchange data (HTTP requests, results in HTML, request characterization vectors, search results,
references to documents similar to the request, document characterization vectors, document data
and metadata, and account information) with the Request Log, HTML Style / Formatting
Guidelines, Security Database, Configuration Store, and Data Mart.]
Figure 6: Level 1 Data Flow Diagram, showing data flowing between processes within
Jasmine.
The Handle Web Server Input/Output process is responsible for unpacking the HTTP request data
passed in from the web server and handing it off to the rest of the system in a format that is more
friendly to the system. This may involve putting the request data into an implementation-specific
data type, or even inserting the data into a queue. This process is also responsible for producing
the HTML data for output through the web server back to the client. It reads the output format
from a data store, the HTML Style / Formatting Guidelines, allowing the format to be changed
easily. It writes complete, anonymous requests to the Request Log for usage analysis. It will not
write requests to the log file which are larger than a System Administrator-defined threshold.

The Administrate System process is responsible for startup and shutdown of the Jasmine system
as well as backup of Jasmine’s persistent data and kicking off indexing of new digital libraries. It
is also responsible for creating, modifying, and deleting Administrator and Searcher accounts in
the security database and modifying the system Configuration store. Finally, it handles input and
output for the System Administrator’s administration tool interface via the Handle Web Server
Input/Output process.

The Characterize Request process takes a complete request as input and produces a
characterization of the request which describes all important data in the request which is required
to check fuzzy membership in the Check Fuzzy Membership in Fuzzy similarity sets process. The
exact data required is dependent on the specific implementation of the fuzzy membership
function.

The Check Fuzzy Membership in Fuzzy similarity sets process compares the characterization of
the key to the characterization of each document in the searchable libraries. A better search
algorithm than a linear scan is possible if care is taken to store characterizations in the data mart
in a known order; however, since the data mart is external to the Jasmine system, this cannot be
assumed.
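The linear scan can be sketched as follows. Cosine similarity over characterization vectors is used here purely as a placeholder membership function, since the specification deliberately leaves that choice to the researcher.

```python
import math

def cosine_membership(key_vec, doc_vec):
    """Placeholder membership: cosine similarity between two term vectors."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in key_vec.items())
    norm_key = math.sqrt(sum(w * w for w in key_vec.values()))
    norm_doc = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_key == 0.0 or norm_doc == 0.0:
        return 0.0
    return dot / (norm_key * norm_doc)

def scan_mart(key_vec, stored_vectors):
    """Linear scan over every (doc_id, vector) pair in the data mart."""
    return [(doc_id, cosine_membership(key_vec, vec))
            for doc_id, vec in stored_vectors]
```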

The Limit and Format Output process limits the listing of document references by removing from
the list any documents which fall below some threshold membership in the fuzzy set of
documents similar to the search key. It also places the data into the format expected by the
Handle Web Server Input/Output process.
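A minimal sketch of the limiting step, assuming membership values in [0, 1] and a Searcher-selected threshold (the 0.8 default is taken from the Membership Threshold Control described elsewhere in this document):

```python
def limit_results(scored_docs, threshold=0.8):
    """Drop documents below the membership threshold and sort the rest
    so the strongest matches are listed first."""
    kept = [(doc, m) for doc, m in scored_docs if m >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

hits = limit_results([("a.html", 0.95), ("b.html", 0.40), ("c.html", 0.81)])
# hits == [("a.html", 0.95), ("c.html", 0.81)]
```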

System Data Stores

Note that for simplicity, there is no distinction made between system data files, databases, or
caches here.

The Request Log contains complete, anonymous requests that are made to the system, along with
basic information such as the time and date of each request and the number of returned
documents. No information about the originator of the request is stored. The requests are stored
verbatim, so that Jasmine may be further evolved by Content Administrator examination of the
types of queries made and the results of those queries.

The HTML Style / formatting guidelines file contains a description of the format to display all
HTML-coded information to the Searcher or to the System Administrator. The file will describe
styles of font and colour to be used, as well as page layout and structure that is to be used in
displaying the data.

The Security Database contains usernames and encrypted passwords for authentication of
Jasmine users as well as security level information. Authentication will occur by encrypting
provided passwords and comparing the encrypted forms. This way, the system need not store the
password in the clear at any time. The security level can be used to store information about the
user’s authority to perform various kinds of actions, including whether they should have access to
Searcher, System Administrator, or Content Administrator functions.
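The compare-the-encrypted-forms scheme can be sketched as below. Salted SHA-256 from the Python standard library stands in for whichever one-way function an implementation would actually choose; the salt handling and record layout are assumptions for illustration.

```python
import hashlib
import os

def hash_password(password: str, salt: bytes = None):
    """Return (salt, digest); only these, never the password, are stored."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.sha256(salt + password.encode("utf-8")).hexdigest()
    return salt, digest

def verify_password(password: str, salt: bytes, stored_digest: str) -> bool:
    """Re-hash the provided password and compare the encrypted forms,
    so the system never needs the password in the clear."""
    return hash_password(password, salt)[1] == stored_digest
```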

The Data Mart contains all the documents that can be searched by Jasmine as well as
precalculated characterization vectors for those documents which have been indexed by Jasmine.
Note that the Data Mart is an entity external to Jasmine; however, because Jasmine can both read
from and write to it, it is also considered a data store, just as a local database would be.

System Exceptions
In the case of a recoverable exception in the form of either a system or user error, the system will
return a message to the user declaring that a recoverable error has occurred and that the current
task has been aborted. A control will be provided with this message, allowing the user to revise
and reattempt the action he or she was attempting when the error occurred. A recoverable
exception is defined as an exception that, once caught, will not leave Jasmine in an unstable or
unworkable state.

In the case of an unrecoverable exception in the form of a system or user error, the system will
automatically initiate a restart of the Jasmine system in order to preserve system integrity as much
as possible. If the unrecoverable exception takes the form of an error in an external entity that
Jasmine requires to function, Jasmine will provide an error message to all currently connected
users and initiate an immediate system shutdown in order to preserve system integrity.

6.2.2 External Interface Requirements

User Interfaces
The user interfaces for Jasmine differ for Searchers, Content Administrators, and System
Administrators. Authentication is not covered in this section because it is implementation-
specific; if authentication is performed, its mechanism will vary between implementations.

Searchers
Searchers will have a simple interface to Jasmine:

Input Page
Search Key Field: A field will be provided which allows the Searcher to input the search key. A
prompt labelled “Find Documents similar to:” or something homologous will be associated with
this field.

Membership Threshold Control: The Searcher will be able to select a threshold for membership
limitation from a limited number of pre-set values. This value will default to 0.8, which can
reasonably be expected to exclude most documents. This is important lest the Searcher be forced
to wait for a very long time as an excessively large number of document references are displayed.

Submit Control: A control will be provided that will allow Searchers to indicate that they have
finished entering the search key and that the search may begin. The label “Search” or something
homologous will be associated with this field.

Result Page
Result References: A list of hypertext links to all the documents in the library whose
membership in the fuzzy set of documents similar to the search key exceeds the threshold
requested in the Membership Threshold Control, displayed along with their fuzzy membership
values. By selecting one of these links, the Searcher can access the document referred to in the
link.

Return to Input Page: A control will be provided to allow the user to return to the Input Page.

Content Administrators
Content Administrators will have access to some or all of the following controls, depending on
access rules set by the System Administrators.

Activate New Library Page


Data Source Location Field: A field that contains the location, such as a URL specifying the
connection to use to communicate with the new Library. For example, this is what a JDBC URL
connecting to an Oracle database via the Oracle Thin Driver looks like:

jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)
(PORT=1521)(HOST=123.123.123.123)))(CONNECT_DATA=(SID=GND1)))

Activation Control: A control will be provided that will allow Content Administrators to
indicate that they have finished entering the URL into the page. The new library will be activated
once it has been indexed; if it has not yet been indexed, it will become active after the completion
of an indexing event that a System Administrator must schedule.

Remove Library Page


Library Status List: A list will enumerate all Data Libraries known to the Jasmine system, along
with their status (active or disabled), size, usage rate, and approximate query hit rate. Any number
of these items can be marked by the user to indicate a selection group.

Deactivate Library Control: A control will be provided that will allow Content Administrators
to indicate that they have finished selecting Libraries, and that all the selected Libraries should be
deactivated but the Library locations should be retained by the Jasmine system. Further queries
will fail to find documents inside these libraries until they are reactivated. If any already
deactivated libraries are selected when this control is activated there will be no effect on their
status.

Remove Library Control: A control will be provided that will allow Content Administrators to
indicate that they have finished selecting Libraries, and that all the selected Libraries should be
deactivated and the locations of the selected Libraries should be removed from the system.
Further queries will fail to find documents inside these libraries. The Library Status List will no
longer list them, and the Content Administrator will need to add the Libraries again to reactivate
them.

Edit Library Page


Library Status List: A list will enumerate all Data Libraries known to the Jasmine system, along
with their status (active or disabled), size, usage rate, and approximate query hit rate. Any number
of these items can be marked by the user to indicate a selection group.

Data Source Location Field: A field that contains the location, such as a URL specifying the
connection to use to communicate with the Library. For example, this is what a JDBC URL
connecting to an Oracle database via the Oracle Thin Driver looks like:

jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)
(PORT=1521)(HOST=123.123.123.123)))(CONNECT_DATA=(SID=GND1)))

Update Control: A control will be provided that will allow Content Administrators to indicate
that they have finished entering the URL into the page. The new library will be activated if it has
been indexed; if it has not yet been indexed, it will become active after the completion of an
indexing event that a System Administrator must schedule.

System Administrators
System Administrators will have access to some or all of the following controls, depending on
access rules determined by their security level. At least one Administrator must have access to
setting security levels.

Control Jasmine Page

Jasmine Status: This display element will indicate to the System Administrator the current status
of the Jasmine system.

Shutdown Jasmine Control: When activated, this control begins a shutdown of the Jasmine
system, saving all persistent data to storage and closing all open files and connections. If Jasmine
is already shut down, this control has no effect.

Start Jasmine Control: When activated, this control begins a startup of the Jasmine system if it
is currently shut down. If Jasmine is already started, this control has no effect.

Rotate Logs Now Control: Rotates all Jasmine logs down one place (i.e., foo.log.1 becomes
foo.log.2, and foo.log becomes foo.log.1). Any foo.log.0 is deleted before this happens.
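The rotation rule above can be sketched as follows. The fixed depth, and the choice to delete the deepest numbered log first, are assumptions made for the illustration rather than details fixed by this specification.

```python
import os

def rotate_logs(base: str = "foo.log", depth: int = 3):
    """Shift every numbered log down one place (foo.log.1 -> foo.log.2,
    foo.log -> foo.log.1); the deepest log is deleted first."""
    deepest = f"{base}.{depth}"
    if os.path.exists(deepest):
        os.remove(deepest)
    for n in range(depth - 1, 0, -1):
        numbered = f"{base}.{n}"
        if os.path.exists(numbered):
            os.rename(numbered, f"{base}.{n + 1}")
    if os.path.exists(base):
        os.rename(base, f"{base}.1")
```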

Rotate Configuration Files Now Control: Rotates all Jasmine configuration files down one
place (i.e., foo.conf.1 becomes foo.conf.2, and foo.conf becomes foo.conf.1). Any foo.conf.0 is
deleted before this happens.

Configure Accounts Page


User Data Fields: One field will exist for the input of each field in a user database record. This
could be the bare minimum such as merely a username, user number, and password, or it could
include more detailed information as desired by the implementor.

Search for User Control: When activated, the system will return a list of all records which
contain the elements entered in the User Data Fields. If nothing has been entered in the User Data
Fields, the system will warn the System Administrator that all records will be returned and will
allow the System Administrator to abort or continue. Clicking on any of the returned record
listings will show the System Administrator the User Information Page for that user record.

Create New User Control: When activated, the system will display the Create New User Dialog
to obtain any needed information not specified in the User Data Fields and to display the result of
the creation operation.

Create New User Dialog


The sole purpose of the Create New User Dialog is to obtain information from the System
Administrator and to report the result of the new user creation operation. This
function does not need to be implemented as a separate page.

The Create New User Dialog will allow input of any information not specified for the new user in
the User Information Fields on the Configure Accounts Page and will report the result of the add
operation. If the operation failed, a human-readable description of the error encountered will be
displayed if possible; if not, any available debug information will be returned along with a
message that human-readable debug information was not available.

User Information Page


User Data Fields: One field will exist for the input of each field in a user database record. This
could be the bare minimum such as merely a username, user number, and password or it could
include more detailed information as desired by the implementor. These fields will be editable,
but will contain the database record’s information by default.

Update User Record Control: When activated, this control updates the database with the
information in the User Data Fields. Data previously in the associated record in the database will
be discarded.

Delete User Record Control: When activated, this control deletes the user record from the
database.

Communication Interfaces
All of Jasmine will use TCP/IP communication with external entities. The higher level protocols
running over these connections will vary depending on the nature of the entity connected to.

6.2.3 Performance Requirements

Response Times
Response times for Jasmine will be dictated by the expectations of the user, who will probably be
used to searching with other search engines. The following requirements are based on the
guidelines in [SHNEIDERMAN98], adjusted for the extreme expense of reducing the search time
on potentially vast databases.

Login times for all users will be less than one second. Of necessity, the time it will take to return
search data will vary; system load, data mart size, and query complexity all contribute to longer
search times. It is key to provide adequate user feedback when asking users to wait; therefore, if
the delay before returning some value can be predicted to take more than 12 seconds, a progress
indicator will be shown as an estimate of the time remaining in the operation. This progress
indicator must be time-based, not task-based in that progression from less to more complete must
be related to the actual amount of time remaining, rather than the number of tasks completed.

For very long operations, including operations that take more than one minute to complete, the
user will be warned of the estimated completion time. If the operation is estimated to take more
than 10 minutes to complete, the user will not be permitted to perform the operation so as to
reduce the likelihood that other users’ use of the system will be interrupted. Otherwise,
confirmation will be requested and the operation may proceed at the user’s request, although at a
lower priority to other, less demanding queries.

Throughput
Jasmine shall be able to handle at least 10 concurrent Searchers, exactly one concurrent System
Administrator, and at least 5 concurrent Content Administrators. It must be able to handle 19,000
transactions per day (assuming 3 minutes of usage from login to logout).

Storage Capacity
Due to the uncertainty involved with the associated entities, these storage space requirements
estimates are for the Jasmine system itself, not the web server, data mart, or any other related
modules.

The configuration files are allotted at least 5 Mb. The code which forms the system is allotted at
least 10 Mb. The system logs are allotted at least 95 Mb.

Total: at least 110 Mb.

6.2.4 Attributes

Availability
The system will be online at all times unless a full system backup is required, including a
complete image of all code and logs. Otherwise, all persistent modifiable data can be backed up
without shutting down the system. The system will be offline at any time when the hardware or
software of any part of the system is being upgraded, excepting those parts of hardware or
software which can be interchanged without perturbation to the system (such as hot-pluggable
RAID components). The system will be shut down if any attempt fails to write to a log file.

Security
Usernames and passwords of all users shall be stored in the database. The System Administrators
will be able to configure Jasmine to use the correct database by editing a file on the host system;
the security of that system will then control access to the configuration files in question. In the
database, the shadow password technique or another method of passphrase encryption shall be
employed, so that passwords are never stored in the clear. Note that in future revisions, Jasmine
may be upgraded to optionally use a Public Key Infrastructure for authentication.

Hardware
In theory, Jasmine could run on a single host, even a PC. Alternatively, it could be run in a
distributed manner on several racks full of machines (even having different architectures). The
implementation must reflect this flexibility so that Jasmine supports modular, scalable
deployment and therefore can handle high loads.

Operating System
Jasmine could be implemented to run on any operating system which supports network
communication and multitasking.

7 Conclusion
From the basis of the literature and practical examples discussed in this paper, a platform
has been specified which will permit experimental comparison of fuzzy document
characterisation and pattern matching algorithms. This system, named “Jasmine” for the purposes
of discussion, searches a database of documents characterised using a researcher-defined
document characterisation algorithm and a researcher-defined fuzzy similarity measurement
algorithm. Should this system be implemented, it would permit the researcher to use a unified
interface and database interaction layer with which to quickly implement different methods of
performing fuzzy document characterisation, fuzzy key/characterisation matching, or both.

8 References
[RUBIN+98]
S. Rubin, M. H. Smith, and Lj. Trajkovic, ``FuzzyBase: an information – intelligent retrieval
system,'' Proc. 1998 IEEE Int. Conf. on Systems, Man, and Cybernetics, San Diego, CA,
Oct. 1998, TA11, pp. 2797-2802.
[MENDEL95]
Mendel, Jerry M: Fuzzy Logic Systems for Engineering: A Tutorial. Proceedings of the
IEEE, Vol. 83, No. 3, March 1995.
[NORMAN90]
Norman, Donald A. The Design of Everyday Things. Doubleday and Company, 1990.
[CHAKRABARTI+98]
S. Chakrabarti, B.E. Dom, and P. Indyk. Enhanced hypertext classification using hyper-links.
In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 307-318,
Seattle, Washington, June 1998.
[KLEINBERG99]
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of ACM,
46:604-632, 1999.
[WEB1]
http://corp.oingo.com/About/Infostructure/Infostructure.html, About the ‘Oingo
infostructure’. World Wide Web, 2001.
[WEB2]
http://www.vivisimo.com/vivisimo-1.1/html/FAQ.html, Frequently Asked Questions about
Vivísimo. World Wide Web, 2001.
[FRANK+99]
E. Frank, G. Paynter, I. Witten, C. Gutwin, and C. Nevill-Manning. Domain-Specific
Keyphrase Extraction. In Proc. 16th Joint Int. Conf. on Artificial Intelligence (IJCAI'99), pp.
668-673, Stockholm, Sweden, 1999.
[KIM+99]
Munseok Kim, Sejin Nam, and Dongwook Shin. Hypertext Construction using statistical and
semantic similarity. 16th Joint Int. Conf. on Artificial Intelligence (IJCAI'99), pp. 57-63,
Stockholm, Sweden, 1999.
[HEARST93]
M. Hearst. TextTiling: A quantitative approach to discourse segmentation. Technical report
93/24, University of California, Berkeley.
[CHAKRABARTI+99]
S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D.
Gibson, and J. M. Kleinberg. Mining the web's link structure. COMPUTER, 32:60-67, 1999.
[KARKALETSIS+99]
Vangelis Karkaletsis, Georgios Paliouras, and Constantine D. Spyropoulos. Learning Rules
for Large Vocabulary Word Sense Disambiguation. 16th Joint Int. Conf. on Artificial
Intelligence (IJCAI'99), pp. 674-679, Stockholm, Sweden, 1999.
[ALLOWAY+99]
G. Alloway and P. Weinstein, "Seed ontologies: growing digital libraries as distributed,
intelligent systems," in Proc. Second ACM International Conference on Digital Libraries,
pp. 83-91, Philadelphia, USA, 1999.
[VELING+98]
A. Veling and P. van der Weerd, "Conceptual grouping in word co-occurrence networks,"
in Proc. 16th Int. Joint Conf. on Artificial Intelligence (IJCAI'99), pp. 694-699,
Stockholm, Sweden, 1999.
[BOUAZIZ+98]
T. Bouaziz and A. Wolski, "Fuzzy triggers: incorporating imprecise reasoning into active
databases," in Proc. IEEE 14th International Conference on Data Engineering (ICDE'98), 1998.
[COOLEY+00]
R. Cooley, M. Deshpande, J. Srivastava, and P. N. Tan, "Web usage mining: discovery and
applications of usage patterns from web data," SIGKDD Explorations, 1:12-23, 2000.
[GREENBERG+97]
L. Tauscher and S. Greenberg, "How people revisit web pages: empirical findings and
implications for the design of history systems," International Journal of Human-Computer
Studies, Special Issue on World Wide Web Usability, 47:97-138, 1997.
[BALDONADO+99]

M. Baldonado, C.-C. K. Chang, L. Gravano, and A. Paepcke, "Metadata for digital libraries:
architecture and design rationale," in Proc. 16th Int. Joint Conf. on Artificial Intelligence
(IJCAI'99), pp. 694-699, Stockholm, Sweden, 1999.
[WEB3]
http://www.darmstadt.gmd.de/mobile/MPEG7/, The MPEG-7 web page. MPEG-7 is a
proposed standard for metadata description of multimedia information of varying kinds.
[WEB4]
http://dublincore.org/documents/, Recommendations of the Dublin Core Metadata
Initiative, an open forum concerned with "development of interoperable online metadata
standards that support a broad range of purposes and business models".
[SHNEIDERMAN98]
B. Shneiderman, Designing the User Interface: Strategies for Effective Human-Computer
Interaction. Addison Wesley Longman, Inc., USA, 1998.

9 Further Reading

9.1 Articles
1. E. Cox, The Fuzzy Systems Handbook. Cambridge, MA: AP Professional, 1994.
2. D. Konopnicki and O. Shmueli. W3QS: A query system for the world-wide web. In Proc.
1995 Int. Conf. Very Large Data Bases (VLDB'95), pp. 54-65, Zurich, Switzerland, Sept. 1995.
3. A. O. Mendelzon, G. A. Mihaila, and T. Milo. Querying the world-wide web. Int. Journal of
Digital Libraries, 1:54-67, 1995.
4. S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for
semistructured data. Int. Journal of Digital Libraries, 1:68-88, 1997.
5. L. V. S. Lakshmanan, F. Sadri, and S. Subramanian. A declarative query language for
querying and restructuring the web. In Proc. Int. Workshop Research Issues in Data
Engineering, Tempe, AZ, 1996.
6. G. Arocena and A. O. Mendelzon. WebOQL: Restructuring documents, databases, and webs.
In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), Orlando, Florida, Feb. 1998.
7. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc.
7th World Wide Web Conf. (WWW'98), Brisbane, Australia, 1998.
8. K. Wang, S. Zhou, and S. C. Liew. Building hierarchical classifiers using class proximity. In
Proc. 1999 Int. Conf. Very Large Data Bases (VLDB'99), pp. 363-374, Edinburgh, UK, Sept.
1999.
9. O. R. Zaïane, J. Han. WebML: Querying the world-wide web for resources and knowledge. In
Proc. Int. Workshop Web Information and Data Management (WIDM'98), pages 9-12,
Bethesda, MD, Nov. 1998.
10. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Morgan
Kaufmann Publishers, San Francisco, 2001; ISBN 1-55860-489-8.
11. M. Perkowitz and O. Etzioni. Adaptive web sites: Conceptual cluster mining. In Proc. 16th
Int. Joint Conf. on Artificial Intelligence (IJCAI'99), pp. 264-269, Stockholm, Sweden, 1999.
12. O. R. Zaïane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying
OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries
Conf. (ADL'98), pp. 19-29, Santa Barbara, California, Apr. 1998.

9.2 Online Resources


19. news:comp.ai.fuzzy, The comp.ai.fuzzy newsgroup.

20. http://www.cs.berkeley.edu/~mazlack/BISC/BISC-DBM.html The Berkeley Initiative in Soft
Computing’s Data Mining Special Interest Group.
21. http://www.oingo.com, Oingo, a search engine which employs a lexical ontology to
determine the meanings of terms, then seems to employ fuzzy category matching.
22. http://www.simpli.com, Simpli, a search engine which prompts the user to disambiguate
search terms that have multiple contextual meanings and also employs a lexical ontology. I
have asked for access to their restricted technical papers but have received no reply.
23. http://www.simpli.com/search_white_paper.html, Simpli’s searching white paper which
describes their search technology.
24. http://www.google.com/technology/index.html, About Google's Technology, the technology
behind the well-known Google search engine. Note that this approach is derivative of the
HITS algorithm.
25. http://www.sprawlnet.com/about.html, About SprawlNet’s Technology. SprawlNet uses
demographics and geographic information to categorize its users. It also uses some sort of
aggregate learning technique to improve result relevance.
26. http://www.northernlight.com/docs/about_company_mission.html, About Northern Light.
Northern Light classifies each document within an entire source collection into pre-defined
subjects and then, at query time, selects those subjects that best match the search results. Very
little information exists here, though it is interesting for performance comparisons.
27. http://www.fast.no/fast.php3?d=technology&c=fastsrch&h=2, About FAST Search &
Transfer’s technology, which describes in a high-level manner their hardware configuration
as well as their video & image compression (read: index) technology. FAST is the
company behind AllTheWeb.
28. http://www.pandia.com/index.html, Pandia, a site devoted to discussion and rating of web
search engines.

10 Figures
Figure 1: Engines like AltaVista.com (shown) and Yahoo.com use term-based searching.
Figure 2: Google's query results are popular, but ambiguous.
Figure 3: Both Oingo and Simpli (shown) disambiguate terms according to their respective ontologies.
Figure 4: Vivísimo uses term-based searching, then performs clustering on the results.
Figure 5: Level 0 data flow diagram, showing data flow around and through the Jasmine system.
Figure 6: Level 1 data flow diagram, showing data flowing between processes within Jasmine.
