Under Supervision By
Dr. Ljiljana Trajkovic
School of Engineering Science
Simon Fraser University
1 Abstract
Language, and therefore text, is a process of negotiated meaning. Because individual and
cultural schemata differ, there is never a perfect, or even near-perfect, correlation in the
understanding of meaning. This grey, approximated area is therefore where miscommunication and
search engine error can occur.
Fuzzy Logic has proven to be an invaluable tool for engineers building systems that
perform approximate reasoning to accomplish their tasks. However, even though approximate
reasoning is desirable in searching for documents according to similarity between topics, little
work pursues the application of Fuzzy Logic methods to text searching.
This document presents a brief introduction to Fuzzy Logic and Searching and presents
several methods in use by active public search engines on the World Wide Web, then discusses a
selection of current research in intelligent searching. With this as background, a platform is
specified for further research into Fuzzy Logic based approximate searching. This system, named
“Jasmine” for the purposes of discussion, searches a database of documents characterised using a
researcher-defined document characterisation algorithm and a researcher-defined fuzzy similarity
measurement algorithm.
With this system, a researcher may study the performance of various fuzzy similarity
measurement algorithms in a practical situation with real-world queries, using a stable platform
for user interface and database interaction. Jasmine therefore represents a means by which
characterisation and matching techniques may be compared experimentally.
pre-processing such as the removal of commonly used meaningless “stop” words, followed by a
summarisation of the document. For example, common words such as “a”, “the”, and “and” would be
removed, and the frequency of occurrence of the remaining words in the document would be
counted and stored. In this way, comparisons can be done between the characterisation of the key
and the characterisation of the documents. This is advantageous when the characterisation
improves the ratio of relevant information in the document to the amount of misleading or
meaningless text, or when processing the document and the key allows faster identification of
interesting information.
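As an illustration of this pre-processing, a minimal characterisation step might look like the following sketch; the stop word list and whitespace tokenisation here are simplifying assumptions, not part of any particular engine.

```python
from collections import Counter

# A tiny illustrative stop word list; real engines use much larger ones.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in"}

def characterise(text: str) -> Counter:
    """Remove stop words, then count the frequency of the remaining words."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)
```

The key and each document receive the same treatment, so their characterisations can be compared term by term.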
Fuzzy logic can be applied to the searching task at each of its three main phases. Firstly, it
can be applied to the query subtask; the searcher can submit a key that has weighted parts instead
of weighting every part of the search key evenly. The key would consist of several search terms
and explicitly and arbitrarily defined fuzzy membership values for each part of the key, which
would be defined by the searcher.
Secondly, it can be applied to the characterisation of the documents in some way; each
document would be described by a number of data components and associated fuzzy membership
functions that would denote to what degree each data component characterised the document.
Finally, fuzzy logic can be applied to searching by modification of the pattern-matching
subtask. Considering the characterisation of the key as the input to a fuzzy membership function
for the fuzzy set of documents similar to a given document, we can return a membership value
between 0 and 1 describing how much the key and the document are similar. This can be used for
each document in turn, resulting in a list of documents which can be ordered by how similar they
are to the search key.
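A minimal sketch of this third approach, under the assumption that both key and documents are characterised as term-frequency vectors and that cosine similarity serves as the fuzzy membership function; Jasmine itself deliberately leaves the similarity algorithm researcher-defined.

```python
import math
from collections import Counter

def fuzzy_similarity(key: Counter, doc: Counter) -> float:
    """Membership of `doc` in the fuzzy set of documents similar to `key`,
    here taken as the cosine of the two term-frequency vectors (in [0, 1])."""
    dot = sum(key[t] * doc[t] for t in key.keys() & doc.keys())
    norm = (math.sqrt(sum(v * v for v in key.values()))
            * math.sqrt(sum(v * v for v in doc.values())))
    return dot / norm if norm else 0.0

def rank(key: Counter, docs: dict) -> list:
    """Apply the membership function to each document in turn and order the
    document names by how similar they are to the search key."""
    return sorted(docs, key=lambda name: fuzzy_similarity(key, docs[name]),
                  reverse=True)
```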
Both the second and final approaches have been considered in the creation of the Jasmine
search framework. Asking the user to disambiguate the query regardless of whether ambiguity
actually exists is not required; it forces too much calculation and complexity on the user, who
should not have to deal with such things [NORMAN90]. Instead, the software should do as much
of the interpretation as is possible, either by searching through fuzzy data characterisations or
fuzzily matching keys to documents, or both.
4 The State of the Art
Figure 1: Engines like Altavista.com (shown) and Yahoo.com use term-based searching.
document sets. While term-based searching is relatively crude, it is the basis of almost all other
kinds of searching in some way or another. Altavista.com, yahoo.com, and other engines all use
variants of term-based searching although some have recently begun to implement more
sophisticated search algorithms. In Figure 1, we can see that term-based searches have been done
by the altavista.com search engine in a number of databases including the web search database, a
reviewed sites database, and a database of other searches. However, no attempt is made to
disambiguate the query before the user looks at it. In this section of the search results, we
can see four different senses of the term “cream” without any indication that the engine considers
the terms differently. The searcher must search through the returned sites manually using
contextual clues provided, like “Skin care products enriched with Dead Sea minerals”, which
does a lot to disambiguate “Dead Sea Care”. Furthermore, the construction of the query has a
large effect on the authoritativeness and topic of the sites returned.
Popularity-based searching takes into account the popularity of the document under
examination in order to find documents that are considered more authoritative based on their
popularity, the assumption being that there is a higher probability that a given search will be
satisfied by a more popular document. Introduced by Chakrabarti, Dom, Kumar, et al. in
[CHAKRABARTI+98], this method has been gaining popular acknowledgement through sites such as
google.com (http://www.google.com).

Figure 2: Google's query results are popular, but ambiguous. (Shown: results for the query
“cream”, including a Liverpool night club at www.cream.co.uk, Ben & Jerry's ice cream at
www.benjerry.com, and TV Cream at tv.cream.org.)

This is often easier to do on the World Wide Web than in plain text
databases, as the web contains implicit information about endorsement of sites in its link
structure. Each hyperlink is considered an endorsement of the linked-to site by the linking
document. Links from more endorsed sites can be weighted more on subsequent iterations. One
algorithm that exploits the link structure is called HITS (Hyperlink-Induced Topic Search) and
was developed by Kleinberg in [KLEINBERG99]. Popularity-based searching is very effective in
finding authoritative sites, even if they are authoritative on a subject that has nothing to do with what
the searcher intended. It relies entirely on probability to determine intent, although this is not to
say that it is impossible to use other strategies. As with term-based search engines, the user must
disambiguate manually although it is likely that the returned sites will be more authoritative and
therefore closer to what the user wanted to find.
Figure 3: Both Oingo and Simpli (shown) disambiguate terms according to their respective
ontologies.
Several new web search engines, including Oingo.com and Simpli.com use an ontological
method of disambiguation. From [WEB1]:
“After you enter a term(s) into the search field, SimpliFind matches the term to a proprietary
relational knowledgebase called SimpliNet™ that automatically generates word concepts and
associations. If the term that is entered is recognized, the SimpliNet database retrieves a list of
concepts and generates a pull-down menu based on those concepts.”
Semantic searching appears to be a much better way to characterise documents, because one can
characterise the document in terms of concepts rather than terms, and because intuitively,
searchers are almost always interested in locating concepts rather than terms. Further, since the
search engine does not have to search documents that contain only alternate meanings of the
search terms, the search may take up fewer system resources on the search engine’s hardware.
However, it is very difficult to automate the process of generating the ontology, and equally
difficult to automate the recognition of these terms inside the searchable documents. Furthermore,
this disambiguation does not perform any actual searching; rather, it narrows the field by
allowing ambiguous words to be specified down to their contextual meanings. Some other
technique must thereafter be applied to search through the documents for those that contain the
contextual meaning of the term in question.
Figure 4: Vivisimo uses term based searching, then performs clustering on the results.
The obvious advantage to clustering is that the engine does not require an expensive ontology like
semantic engines do, while still allowing the searcher to ignore unrelated query results.
Unfortunately, the full search must still be performed, taking up the full amount of the search
engine’s resources for that query, and the clusters produced may not be interesting, or even
understandable to the searcher. Like semantic engines, clustering does no searching in and of
itself; all the searching must be done by another search engine, such as one or more term-based
search engines.
1
An ontology is a hierarchical organisation of terms according to the specificity of their meanings. It could
be represented by a directed graph with a branching factor greater than one, since general concepts tend to
subsume more than one more specific term.
Frank et al. have proposed an algorithm in [FRANK+99] for automatically extracting
keyphrases2 from text documents which relies on naive Bayesian classification to crisply separate
words into or out of the set of keyphrases. Candidate phrases are generated by splitting the
document into a list of phrases and eliminating most of them with simple tests; the candidates are
then classified by a naive Bayesian classifier based on the specificity of the phrase to the
document and the position of the phrase's first occurrence in it. These keyphrases can then be used to
summarise the subject of a document. This algorithm increases in accuracy when information
about the knowledge domain being searched can be used to help select phrases that are more
descriptive. With an automatic keyphrase identification algorithm such as this, it would be
possible to identify descriptive words and phrases within documents and compare these
keyphrases and keywords across documents. When combined with an ontological database,
keyphrase or keyword meanings may be inferred, and therefore fitted into the ontology.
Thereafter search keys could be matched against the keyphrases, and more general as well as
more specific meanings could be associated with the keyphrases that would not ordinarily be
found in a term by term manner. Because the keyphrase is likely to contain several keywords, it is
reasonable to expect that documents which match the generalised concepts of many of those
keywords are more likely to be similar to the document at hand than those which match the
generalised concept of only one keyword.
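The candidate-and-features structure of such an algorithm can be sketched as follows; the phrase tests and exact feature definitions here are simplified stand-ins for those in [FRANK+99].

```python
import math

STOP = {"a", "an", "the", "and", "of", "in", "to"}

def candidate_phrases(text: str, max_len: int = 3) -> set:
    """Split the text into short word sequences and eliminate most with a
    simple test: a candidate may not begin or end with a stop word."""
    words = text.lower().split()
    out = set()
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
            phrase = words[i:j]
            if phrase[0] not in STOP and phrase[-1] not in STOP:
                out.add(" ".join(phrase))
    return out

def features(phrase: str, text: str, n_docs: int, doc_freq: dict) -> tuple:
    """Two features for the Bayesian classifier: specificity of the phrase to
    the document (tf * idf) and relative position of its first occurrence."""
    lowered = text.lower()
    tf = lowered.count(phrase)
    idf = math.log((n_docs + 1) / (doc_freq.get(phrase, 0) + 1))
    first = lowered.find(phrase) / max(len(lowered), 1)
    return tf * idf, first
```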
The ontology, which must be created in order to make use of such a sophisticated system, is
difficult to construct. It is unreasonable to build by hand a comprehensive database relating
meanings of any reasonably large number of words, so some automated system must be devised.
Paliouras, Karkaletsis, and Spyropoulos attempted to use decision tree machine learning
techniques to disambiguate a known text in [KARKALETSIS+99]. They found that decision trees
containing about 1000 nodes disambiguated words with a precision and accuracy of about 90%,
albeit with a recall of about 60%. This means that although the tree tends to be conservative in
declaring a recognised sense, the system correctly recognised or rejected words 90% of the time.
The authors suggest that the conservatism of the decision tree may be due to an artefact of the
training process, that is that there were many more negative training examples (where the word
was not used in the assumed sense) than positive ones.
Many times, it is desirable to be able to search through a text that is not a hypertext.
Many databases of knowledge are made up of unenriched, flat text, and for many kinds of writing
it takes more effort to produce a hypertext than a plain text. Regardless, there are advantages to
reading and searching a hypertext, namely that one can use the information implicit in the link
structure to locate information as well as to assess the authoritativeness of linked-to documents as
in [CHAKRABARTI+99]. In [KIM+99], Kim, Nam, and Shin outline a method to construct
hypertexts from flat text data, whose output has a high degree of similarity to hypertexts created
from flat text by human experts. Although the paper's language presents a barrier to the reader, it is
clear that the technique uses both statistical and semantic information to place hyperlinks and
hyperlink targets, although the semantic information used is entirely derived from pre-existing
thesauri. The statistical similarity measure is obtained by a similar technique to that used in
[FRANK+99], a technique referred to as tf × idf (term frequency times inverse document
frequency), together with an inner vector product. The text is broken into blocks via the
TextTiling technique [HEARST93]. The hypertext link structure is constructed by inserting a link
from a keyword to a text block whenever that block contains a sufficient “weight” of the keyword.
It is unclear from the paper how this weight is obtained, although I assume it involves
the density of the keyword within the block and within the text, similarly to the method described
in [FRANK+99].
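The statistical measure can be sketched as follows, with tf × idf weighting and an inner vector product; this is my reading of the technique, not the authors' exact formulation.

```python
import math
from collections import Counter

def tfidf_vector(terms: Counter, doc_freq: dict, n_docs: int) -> dict:
    """Weight each term frequency by idf = log(N / df); terms appearing in
    every document receive weight near zero and contribute little."""
    return {t: tf * math.log(n_docs / doc_freq[t])
            for t, tf in terms.items() if doc_freq.get(t)}

def inner_product(u: dict, v: dict) -> float:
    """Inner vector product of two weighted term vectors: the statistical
    similarity between, e.g., a keyword context and a text block."""
    return sum(w * v[t] for t, w in u.items() if t in v)
```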
When searching through hypertexts, it is possible to extract information from the link
structure itself. As outlined by Chakrabarti et al., in [CHAKRABARTI+98] and again in
2
A keyphrase is a phrase that properly describes the document in which it occurs. It can be a single
keyword, which is the degenerate case. Keyphrases are often used to topically characterise a document.
[CHAKRABARTI+99], the link structure of the World Wide Web has an underlying social
connotation that confers an endorsement by the linking document upon the linked-to document.
While this is not true in every case (some links are merely navigational, or are paid
advertisements), intuitively it holds in aggregate. The Hyperlink Induced Topic Search computes hubs,
documents which link to a variety of topically similar documents, and defines the subject of target
documents by the descriptions that linking documents provide for the link targets. By associating
a hub weight and an authority weight to each page found in a directed graph formed from the
documents resulting from a simple term search, it is possible to iteratively update weighting to
reflect endorsements from more or less heavily weighted documents. This follows the intuitive
concept that a “good” hub has many outgoing links to “good” authorities, and “good” authorities
have many incoming links from “good” hubs. In the first iteration, each document is assigned an
even weight, but on each successive iteration the weights are updated. Other interesting
information that can be mined from the link structure includes more or less enclosed
“communities” of documents that refer to each other in what is often a topically similar
collection. This algorithm is very efficient in that most of the processing can be done once as a
pre-processing step and then re-used for all subsequent queries. It fails, however, to address
semantic ambiguity and will tend to generalise narrowly focussed queries into more popular
results.
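The iterative hub/authority update described above can be sketched like this: a simplified HITS that omits the root-set expansion and anchor-text handling of the full algorithm.

```python
def hits(links: dict, iterations: int = 20):
    """Iterate hub and authority weights over a directed graph given as
    {page: [pages it links to]}, starting from even weights."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority has many incoming links from good hubs.
        auth = {p: sum(hub[q] for q, ts in links.items() if p in ts)
                for p in pages}
        # A good hub has many outgoing links to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise so the weights stay comparable across iterations.
        for vec in (auth, hub):
            norm = sum(v * v for v in vec.values()) ** 0.5 or 1.0
            for p in vec:
                vec[p] /= norm
    return hub, auth
```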
In 1997, Weinstein and Alloway described in [ALLOWAY+97] a successful project to
build a set of ontologies for the University of Michigan Digital Library (online at
http://www.si.umich.edu/UMDL/ ) using a distributed system of intelligent agents3. Ontologies
were used not just to classify information or disambiguate text terms, but also to model the
metadata of the library, including all services, licenses, and content. In this system, agents import
and export services to and from each other as well as the users of the system. Each agent is
designed to deal with only one local topic, but since they can provide services to each other, an
agent that takes a query from the user may solicit information from other agents that can help it
solve the problem. Because nearly all information is stored in ontologies, negotiation can occur to
fit information from one agent into the ontology of the seeking agent, either by walking up the
ontologies to find a common supertopic, or walking down the ontologies to find a group of
common subtopics. In the case that there is no common ground, it would then be possible to enlist
a third party agent to mediate the exchange. A system such as this allows for diverse and
changing library content by permitting the use of domain-specific phrase sense extraction
techniques such as that described in [KARKALETSIS+99], while still synthesising these domains
into a unified, searchable digital library.
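Walking up two ontologies to find a common supertopic can be sketched with a child-to-parent map; the map and topic names below are invented for illustration, and the UMDL agents negotiate over much richer structures.

```python
def ancestors(topic: str, parent: dict) -> list:
    """The chain of increasingly general topics from `topic` up to a root."""
    chain = [topic]
    while topic in parent:
        topic = parent[topic]
        chain.append(topic)
    return chain

def common_supertopic(a: str, b: str, parent: dict):
    """Walk up from both topics and return the most specific topic the two
    chains share, or None when there is no common ground (in which case a
    third-party agent might mediate the exchange)."""
    seen = set(ancestors(a, parent))
    for topic in ancestors(b, parent):
        if topic in seen:
            return topic
    return None
```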
Anne Veling and Peter van der Weerd describe in [VELING+98] a technique that
automatically disambiguates the user's query by clustering the information in the document
library. The documents were clustered based on "word co-occurrence networks", which are
defined in their paper as consisting "of concepts that are linked if they often appear close to each
other in the database". Veling and van der Weerd's work is based on the assumption that words
which have similar meanings tend to be located close to each other in the database. This
assumption is not always true, and as a result the clusters found do not always define a coherent
concept (or, as the authors put it, they can be "non-intuitive clusters"). Be that as it may, this
characterisation of the document appears very lightweight: the calculations are fast enough that
Veling and van der Weerd were able to process three hundred thousand documents in half an hour
on a 200 MHz x86-compatible processor (one might assume an Intel Pentium), using only four
hundred megabytes of hard drive space. Although more detailed information about the hardware is
not available, one might expect that this "simple desktop PC" could be a limiting factor.
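A word co-occurrence network of the kind described can be sketched as follows, linking two terms when they appear near each other often enough; the window size and count threshold here are arbitrary choices, not those of the paper.

```python
from collections import defaultdict

def cooccurrence_network(documents, window=5, min_count=2):
    """Link two terms if they occur within `window` words of each other at
    least `min_count` times across the whole collection."""
    counts = defaultdict(int)
    for text in documents:
        words = text.lower().split()
        for i, w in enumerate(words):
            # Count each pair within the sliding window, order-insensitively.
            for v in words[i + 1:i + 1 + window]:
                if w != v:
                    counts[tuple(sorted((w, v)))] += 1
    return {pair for pair, c in counts.items() if c >= min_count}
```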
3
An agent is a computer program that can perform some task and can communicate with other agents.
Agents can be semiautonomous: they manage their own resources and dynamically seek out new resources
(provided by other agents) to complete their tasks.
There is very little work to be found regarding the application of fuzzy logic to the
searching task. The bulk of work which seems applicable generally pertains to incorporating
fuzzy reasoning into databases, such as in [BOUAZIZ+98], wherein Antoni Wolski and Tarik
Bouaziz propose a method by which traditional crisp database triggers may be replaced by fuzzy
ones. A database trigger is a piece of logic which, upon certain conditions being true, executes
and modifies the database in some way or performs some external action. While this work
discusses some potentially useful techniques in detail, they are somewhat beyond the scope of
this document.
[WEB4]) into the metadata of the engine itself. Implementation of these extensions will depend
heavily on the implementation of the document characterisation and matching systems of the
Jasmine engine and are therefore beyond the scope of the current phase of the project.
Figure 5: Level 0 Data Flow Diagram, showing data flow around and through the Jasmine
system.
Jasmine exists as a logic layer between a web server and a data mart. It uses a web server for user
control input and query output via HTML forms, sessions, and cookie queries running over HTTP
or some variant such as HTTPS. For simplicity, we will hereafter only consider HTTP. Jasmine
receives data via queries from a single data mart that may pool many subsidiary data sources.
Client
The client will be a user accessible computer running software that allows access of documents
via HTTP and capable of browsing those documents via an HTML 4.0 compliant rendering
engine. The browser will be able to store and return cookies, and will execute JavaScript code on
the client side.
Control / Output
Control messages from the browser will be contained in the HTTP requests from the browser,
generated as replies to the HTML forms in the documents served by the web server. Output will
be in HTML and be contained in the HTTP body.
Web Server
The web server will handle HTTP connections and will package the HTTP request into a form
easily handled by the Jasmine logic. It mediates HTTP connections between clients and the data
returned by the Jasmine logic.
Data Mart
The data mart is a unified representation of the data from the original data libraries, transformed
and integrated to appear to be one library of a standard format which Jasmine will be designed to
interpret. In the interest of performance, it is possible that the data mart may actually store a
transformed copy of the data in the libraries, but if reducing storage load is more important than
the performance boost then the loading and transformation can happen on the fly. Logic in the
data mart will handle the querying of all data libraries, even if the data library has no native query
service (such as a flat file with no associated database server).
Data Format
These queries will be made using a method that is appropriate for query from the data library in
question. The format of the results of these queries will depend on the method used, and the
library in question.
Data Library
Data is stored in a Data Library. The library could be implemented using a relational database, a
flat file, an HTML document or a network data resource.
Searching functions
Searchers will be able to locate documents relevant to a particular topic by interacting with
Jasmine through forms presented to them through their browser.
Content Administration Functions
Access to a configuration store via the Administrator web tool will allow Content Administrators
to configure Jasmine to access new libraries accessible in the data mart. The security level bound
to the account in the security database will restrict access to System Administration Functions.
Searchers
Searchers are users who connect to Jasmine using a browser and seek to find a document or set of
related documents relevant to a particular topic. Most will have experience with traditional search
engines and will be familiar with the functions of the browser software that they are using.
Almost none will read an instruction document, but they will require online help to fall back on.
There may be many Searchers using Jasmine at any one time.
Content Administrators
Content Administrators will be responsible for the configuration and maintenance of Jasmine’s
searchable data libraries, including the addition and deletion of new digital libraries to be indexed
as well as handling the selection of correct data transformations for the digital libraries. These
users will understand the uniform Jasmine data interface format as well as the format of the data
for addition. They will be experienced with using Jasmine from a Searcher perspective and so
will be able to use the Searcher functionality to test their results. There may be many Content
Administrators working with Jasmine at one time.
System Administrators
System Administrators will be responsible for startup and shutdown of the Jasmine system as
well as backup of Jasmine’s persistent data and kicking off indexing of new digital libraries.
System Administrators are responsible for creating and deleting Administrator accounts as well as
handling security issues for Searchers. System Administrators will require printed manuals in
addition to detailed descriptions of the configuration of the instance of Jasmine they are
concerned with. To avoid concurrency issues, no two System Administrators can work on
Jasmine at once.
Jasmine or into the data mart behind Jasmine, and Jasmine must never be party to information
stored by other software on the Searcher’s client computer system.
Overview
Jasmine will provide the following functionality:
• Access for Searchers via an interface to a web server and HTML 4.0 over HTTP
• Access for Content Administrators via the web server as above
• Access for System Administrators via the web server as above
• A searchable fuzzy index into a body of HTML documents
• An interface accessible by a web server which provides HTML results for searches made
through the web server.
Level 1 Data Flow Diagram
Figure 6: Level 1 Data Flow Diagram, showing data flowing between processes within
Jasmine.
The Handle Web Server Input/Output process is responsible for unpacking the HTTP request data
passed in from the web server and handing it off to the rest of the system in a format that is more
friendly to the system. This may involve putting the request data into an implementation-specific
data type, or even inserting the data into a queue. This process is also responsible for producing
the HTML data for output through the web server back to the client. It reads from a data store, the
HTML Style / formatting guidelines, to get the format for this output, allowing the format to be
changed easily. It writes complete, anonymous requests to the Request Log for usage analysis. It
will not write requests to the log file that are larger than a System Administrator-defined threshold.
The Administrate System process is responsible for startup and shutdown of the Jasmine system
as well as backup of Jasmine’s persistent data and kicking off indexing of new digital libraries. It
is also responsible for creating, modifying, and deleting Administrator and Searcher accounts in
the security database and modifying the system Configuration store. Finally, it handles input and
output for the System Administrator’s administration tool interface via the Handle Web Server
Input/Output process.
The Characterize Request process takes a complete request as input and produces a
characterization of the request which describes all important data in the request which is required
to check fuzzy membership in the Check Fuzzy Membership in Fuzzy similarity sets process. The
exact data required is dependent on the specific implementation of the fuzzy membership
function.
The Check Fuzzy Membership in Fuzzy similarity sets process compares the characterization of
the key to the characterization of each document in the searchable libraries. It is possible to use a
better search algorithm than a linear scan if care is taken to store characterizations in the data
mart in a sorted or indexed order. Since the data mart is external to the Jasmine system, however,
this cannot be assumed.
The Limit and Format Output process limits the listing of document references by removing from
the list any documents which fall below some threshold membership in the fuzzy set of
documents similar to the search key. It also places the data into the format expected by the
Handle Web Server Input/Output process.
Note that for simplicity, there is no distinction made between system data files, databases, or
caches here.
The Request Log contains the complete, anonymous requests made to the system, along with
basic information such as the time and date of each request and the number of returned
documents. No information about the originator of the request is stored. The requests are stored
verbatim, so that Jasmine may be further evolved by Content Administrator examination of the
types of queries made and the results of those queries.
The HTML Style / formatting guidelines file contains a description of the format to display all
HTML-coded information to the Searcher or to the System Administrator. The file will describe
styles of font and colour to be used, as well as page layout and structure that is to be used in
displaying the data.
The Security Database contains usernames and encrypted passwords for authentication of
Jasmine users as well as security level information. Authentication will occur by encrypting
provided passwords and comparing the encrypted forms. This way, the system need not store the
password in the clear at any time. The security level can be used to store information about the
user’s authority to perform various kinds of actions, including whether they should have access to
Searcher, System Administrator, or Content Administrator functions.
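The document speaks of "encrypting" passwords; in practice this comparison scheme is usually realised with a one-way salted hash rather than reversible encryption. A sketch using Python's standard library follows; the iteration count is an arbitrary choice for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # arbitrary work factor chosen for this sketch

def make_record(password: str, salt: bytes = b"") -> tuple:
    """Derive a salted hash of the password; only (salt, hash) is stored,
    so the password never sits in the database in the clear."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def authenticate(password: str, salt: bytes, stored: bytes) -> bool:
    """Hash the provided password the same way and compare the hashed forms
    in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored)
```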
The Data Mart contains all the documents that can be searched by Jasmine as well as
precalculated characterization vectors for those documents which have been indexed by Jasmine.
Note that although the Data Mart is an external entity to Jasmine, its data can be read from and
written to by Jasmine, so it is also considered a data store, just like a local database.
System Exceptions
In the case of a recoverable exception in the form of either a system or user error, the system will
return a message to the user declaring that a recoverable error has occurred and that the current
task has been aborted. A control will be provided with this message, allowing the user to revise
and retry the action he or she was performing when the error occurred. A recoverable
exception is defined as an exception that, once caught, will not leave Jasmine in an unstable or
unworkable state.
In the case of an unrecoverable exception in the form of a system or user error, the system will
automatically initiate a restart of the Jasmine system in order to preserve system integrity as much
as possible. If the unrecoverable exception takes the form of an error in an external entity that
Jasmine requires to function, Jasmine will provide an error message to all currently connected
users and initiate an immediate system shutdown in order to preserve system integrity.
User Interfaces
The user interfaces for Jasmine differ for Searchers, Content Administrators, and System
Administrators. Authentication is not covered in this section because it is implementation-
specific; if authentication is performed, the method will vary with the implementation.
Searchers
Searchers will have a simple interface to Jasmine:
Input Page
Search Key Field: A field will be provided which allows the Searcher to input the search key. A
prompt labelled “Find Documents similar to:” or something homologous will be associated with
this field.
Membership Threshold Control: The Searcher will be able to select a threshold for membership
limitation from a limited number of pre-set values. This value will default to 0.8, which can
reasonably be expected to exclude most documents. This is important lest the Searcher be forced
to wait for a very long time as an excessively large number of document references are displayed.
Submit Control: A control will be provided that will allow Searchers to indicate that they have
finished entering the search key and that the search may begin. The label “Search” or something
homologous will be associated with this field.
Result Page
Result References: A list of hypertext links to all documents in the library whose membership
in the fuzzy set of documents similar to the search key exceeds the threshold requested with the
Membership Threshold Control is displayed, along with each document's fuzzy membership
value. By selecting one of these links, the Searcher can access the document referred to in the link.
Return to Input Page: A control will be provided to allow the user to return to the Input Page.
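The Result Page's selection rule can be sketched as follows; the function name and data shapes are assumptions for illustration:

```python
def matching_documents(memberships, threshold=0.8):
    """Return (document, membership) pairs whose membership in the fuzzy
    set of documents similar to the search key exceeds the threshold,
    ordered from best match to worst.

    memberships maps a document reference to a membership value in [0, 1];
    the 0.8 default mirrors the Membership Threshold Control's default.
    """
    hits = [(doc, mu) for doc, mu in memberships.items() if mu > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```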
Content Administrators
Content Administrators will have access to some or all of the following controls, depending on
access rules set by the System Administrators.
Activation Control: A control will be provided that will allow Content Administrators to
indicate that they have finished entering the URL into the page. The new library will be activated
immediately if it has already been indexed; otherwise, it will become active after the completion
of an indexing event, which a System Administrator must schedule.
Deactivate Library Control: A control will be provided that will allow Content Administrators
to indicate that they have finished selecting Libraries, and that all the selected Libraries should be
deactivated but the Library locations should be retained by the Jasmine system. Further queries
will fail to find documents inside these libraries until they are reactivated. If any already
deactivated libraries are selected when this control is activated there will be no effect on their
status.
Remove Library Control: A control will be provided that will allow Content Administrators to
indicate that they have finished selecting Libraries, and that all the selected Libraries should be
deactivated and the locations of the selected Libraries should be removed from the system.
Further queries will fail to find documents inside these libraries. The Library Status List will no
longer list them, and the Content Administrator will need to add the Libraries again to reactivate
them.
Data Source Location Field: A field that contains the location, such as a URL specifying the
connection to use to communicate with the Library. For example, this is what a JDBC URL
connecting to an Oracle database via the Oracle Thin Driver looks like:
jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(PORT=1521)(HOST=123.123.123.123)))(CONNECT_DATA=(SID=GND1)))
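A sketch of composing such a URL from its parts (the host and SID below are the placeholder values from the example above); note that when the URL is stored in a Java .properties file, each ':' and '=' must additionally be escaped with a backslash:

```python
def oracle_thin_url(host, port=1521, sid="GND1"):
    """Compose a JDBC URL for the Oracle Thin Driver from its components."""
    descriptor = (
        "(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)"
        f"(PORT={port})(HOST={host})))"
        f"(CONNECT_DATA=(SID={sid})))"
    )
    return f"jdbc:oracle:thin:@{descriptor}"
```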
Update Control: A control will be provided that will allow Content Administrators to indicate
that they have finished entering the URL into the page. The new library will be activated if it has
been indexed; if it has not yet been indexed, it will become active after the completion of an
indexing event that a System Administrator must schedule.
System Administrators
System Administrators will have access to some or all of the following controls, depending on
access rules determined by their security level. At least one System Administrator must be able
to set security levels.
Jasmine Status: This display element will indicate to the System Administrator the current status
of the Jasmine system.
Shutdown Jasmine Control: When activated, this control begins a shutdown of the Jasmine
system, saving all persistent data to storage and closing all open files and connections. If Jasmine
is already shut down, this control has no effect.
Start Jasmine Control: When activated, this control begins a startup of the Jasmine system if it
is currently shut down. If Jasmine is already started, this control has no effect.
Rotate Logs Now Control: Rotates all Jasmine logs down one place (i.e., foo.log.1 becomes
foo.log.2, and foo.log becomes foo.log.1); the oldest numbered log file is deleted before the
rotation takes place.
Rotate Configuration Files Now Control: Rotates all Jasmine configuration files down one
place (i.e., foo.conf.1 becomes foo.conf.2, and foo.conf becomes foo.conf.1); the oldest numbered
configuration file is deleted before the rotation takes place.
Search for User Control: When activated, the system will return a list of all records which
contain the elements entered in the User Data Fields. If nothing has been entered in the User Data
Fields, the system will warn the System Administrator that all records will be returned and allow
the System Administrator to abort or continue. Clicking on any of the returned record listings
will show the System Administrator the User Information Page for that user record.
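The matching rule can be sketched as a substring filter over the entered fields; the data shapes below are illustrative assumptions:

```python
def find_users(records, criteria):
    """Return records whose fields contain every non-empty entered value.
    With no criteria entered, every record matches, which is why the
    interface warns the System Administrator before proceeding."""
    entered = {field: value for field, value in criteria.items() if value}
    return [record for record in records
            if all(value.lower() in record.get(field, "").lower()
                   for field, value in entered.items())]
```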
Create New User Control: When activated, the system will display the Create New User Dialog
to obtain any needed information not specified in the User Data Fields and to display the result of
the creation operation.
The Create New User Dialog will allow input of any information not specified for the new user in
the User Information Fields on the Configure Accounts Page and will report the result of the add
operation. If the operation fails, a human-readable description of the error encountered will be
displayed where possible; otherwise, any available debug information will be returned along with
a message stating that a human-readable description was not available.
Update User Record Control: When activated, this control updates the database with the
information in the User Data Fields. Data previously in the associated record in the database will
be discarded.
Delete User Record Control: When activated, this control deletes the user record from the
database.
Communication Interfaces
All Jasmine communication with external entities will use TCP/IP. The higher-level protocols
running over these connections will vary depending on the nature of the connected entity.
Response Times
Response times for Jasmine will be dictated by the expectations of the user, who will probably be
used to searching with other search engines. The following requirements are based on the
guidelines in [SHNEIDERMAN98], adjusted for the extreme expense of reducing the search time
on potentially vast databases.
Login times for all users will be less than one second. Of necessity, the time it will take to return
search data will vary; system load, data mart size, and query complexity all contribute to longer
search times. It is important to provide adequate feedback when asking users to wait; therefore, if
the delay before returning a result is predicted to exceed 12 seconds, a progress indicator will be
shown with an estimate of the time remaining in the operation. This progress indicator must be
time-based, not task-based: its progression from less to more complete must reflect the actual
amount of time remaining rather than the number of tasks completed.
For very long operations, including operations that take more than one minute to complete, the
user will be warned of the estimated completion time. If the operation is estimated to take more
than 10 minutes to complete, the user will not be permitted to perform the operation so as to
reduce the likelihood that other users’ use of the system will be interrupted. Otherwise,
confirmation will be requested and the operation may proceed at the user’s request, although at a
lower priority than other, less demanding queries.
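Taken together, the rules in this section amount to a simple admission policy keyed on estimated duration; a minimal sketch, with action names chosen for illustration:

```python
def feedback_policy(estimated_seconds):
    """Map an operation's estimated duration to the required behaviour:
    reject operations over ten minutes, request confirmation (and run at
    lower priority) over one minute, show a time-based progress indicator
    over twelve seconds, and otherwise just run the operation."""
    if estimated_seconds > 600:
        return "reject"
    if estimated_seconds > 60:
        return "confirm_then_run_low_priority"
    if estimated_seconds > 12:
        return "show_progress_indicator"
    return "run"
```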
Throughput
Jasmine shall be able to handle at least 10 concurrent Searchers, exactly one concurrent System
Administrator, and at least 5 concurrent Content Administrators. It must be able to handle 19,000
transactions per day (assuming three minutes of usage from login to logout).
Storage Capacity
Due to the uncertainty involved with the associated entities, these storage space requirements
estimates are for the Jasmine system itself, not the web server, data mart, or any other related
modules.
The configuration files are allotted at least 5 MB. The code which forms the system is allotted at
least 10 MB. The system logs are allotted at least 95 MB.
6.2.4 Attributes
Availability
The system will be online at all times unless a full system backup is required, including a
complete image of all code and logs. Otherwise, all persistent modifiable data can be backed up
without shutting down the system. The system will be offline at any time when the hardware or
software of any part of the system is being upgraded, excepting those parts of hardware or
software which can be interchanged without perturbation to the system (such as hot-pluggable
RAID components). The system will be shut down if any attempt to write to a log file fails.
Security
Usernames and passwords of all users shall be stored in the database. The System Administrators
will be able to configure Jasmine to use the correct database by editing a file on the host system.
The security of that system will then control access to the configuration files in question. In the
database, the shadow password technique or another method of passphrase hashing shall be
employed, so that cleartext passwords are never stored. Note that in future revisions, Jasmine may
be upgraded to optionally use a Public Key Infrastructure for authentication.
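As an illustration of the storage requirement, a salted, iterated hash (PBKDF2 here, one modern stand-in for the shadow password technique the specification names) keeps cleartext passphrases out of the database:

```python
import hashlib
import secrets

def hash_passphrase(passphrase, salt=None):
    """Return (salt, digest) as hex strings; only these are stored."""
    salt = salt if salt is not None else secrets.token_hex(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), bytes.fromhex(salt), 100_000)
    return salt, digest.hex()

def verify_passphrase(passphrase, salt, stored_digest):
    """Recompute the digest with the stored salt; compare in constant time."""
    return secrets.compare_digest(hash_passphrase(passphrase, salt)[1],
                                  stored_digest)
```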
Hardware
In theory, Jasmine could run on a single host, even a PC. Alternatively, it could be run in a
distributed manner on several racks full of machines (even having different architectures). The
implementation must reflect this flexibility so that Jasmine supports modular, scalable
deployment and therefore can handle high loads.
Operating System
Jasmine could be implemented to run on any operating system which supports network
communication and multitasking.
7 Conclusion
From the basis of the literature and practical examples discussed in this paper, a platform
has been specified which will permit experimental comparison of fuzzy document
characterisation and pattern matching algorithms. This system, named “Jasmine” for the purposes
of discussion, searches a database of documents characterised using a researcher-defined
document characterisation algorithm and a researcher-defined fuzzy similarity measurement
algorithm. Should this system be implemented, it would permit the researcher to use a unified
interface and database interaction layer with which to quickly implement different methods of
performing fuzzy document characterisation, fuzzy key/characterisation matching, or both.
8 References
[RUBIN+98]
S. Rubin, M. H. Smith, and Lj. Trajkovic, ``FuzzyBase: an information-intelligent retrieval
system,'' Proc. 1998 IEEE Int. Conf. on Systems, Man, and Cybernetics, San Diego, CA,
Oct. 1998, TA11, pp. 2797-2802.
[MENDEL95]
J. M. Mendel. Fuzzy logic systems for engineering: a tutorial. Proceedings of the IEEE,
Vol. 83, No. 3, March 1995.
[NORMAN90]
Norman, Donald A. The Design of Everyday Things. Doubleday and Company, 1990.
[CHAKRABARTI+98]
S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks.
In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 307-318,
Seattle, Washington, June 1998.
[KLEINBERG99]
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM,
46:604-632, 1999.
[WEB1]
http://corp.oingo.com/About/Infostructure/Infostructure.html, About the ‘Oingo
infostructure’. World Wide Web, 2001.
[WEB2]
http://www.vivisimo.com/vivisimo-1.1/html/FAQ.html, Frequently Asked Questions about
Vivísimo. World Wide Web, 2001.
[FRANK+99]
E. Frank, G. Paynter, I. Witten, C. Gutwin, and C. Nevill-Manning. Domain-Specific
Keyphrase Extraction. In Proc. 16th Int. Joint Conf. on Artificial Intelligence (IJCAI'99),
pp. 668-673, Stockholm, Sweden, 1999.
[KIM+99]
Munseok Kim, Sejin Nam, and Dongwook Shin. Hypertext Construction using statistical and
semantic similarity. 16th Int. Joint Conf. on Artificial Intelligence (IJCAI'99), pp. 57-63,
Stockholm, Sweden, 1999.
[HEARST93]
M. Hearst. TextTiling: A quantitative approach to discourse segmentation. Technical Report
93/24, University of California, Berkeley, 1993.
[CHAKRABARTI+99]
S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D.
Gibson, and J. M. Kleinberg. Mining the web's link structure. COMPUTER, 32:60-67, 1999.
[KARKALETSIS+99]
Vangelis Karkaletsis, Georgios Paliouras, and Constantine D. Spyropoulos. Learning Rules
for Large Vocabulary Word Sense Disambiguation. 16th Int. Joint Conf. on Artificial
Intelligence (IJCAI'99), pp. 674-679, Stockholm, Sweden, 1999.
[ALLOWAY+99]
Gene Alloway and Peter Weinstein. Seed Ontologies: growing digital libraries as distributed,
intelligent systems. Proceedings of the second ACM International Conference on Digital
Libraries, pp. 83-91, Philadelphia, USA, 1999.
[VELING+98]
Anne Veling and Peter van der Weerd. Conceptual grouping in word co-occurrence
networks. 16th Int. Joint Conf. on Artificial Intelligence (IJCAI'99), pp. 694-699,
Stockholm, Sweden, 1999.
[BOUAZIZ+98]
Tarik Bouaziz and Anton Wolski. Fuzzy Triggers: Incorporating Imprecise Reasoning into
Active Databases. Proc. IEEE 14th International Conference on Data Engineering. 1998.
[COOLEY+00]
R. Cooley, M. Deshpande, J. Srivastava, and P. N. Tan. Web usage mining: Discovery and
applications of usage patterns from web data. SIGKDD Explorations, 1:12-23, 2000.
[GREENBERG+97]
L. Tauscher and S. Greenberg. How people revisit web pages: Empirical findings and
implications for the design of history systems. International Journal of Human Computer
Studies, Special issue on World Wide Web Usability, 47:97-138, 1997.
[BALDONADO+99]
Michelle Baldonado, Chen-Chuan K. Chang, Luis Gravano, and Andreas Paepcke. Metadata
for Digital Libraries: Architecture and Design Rationale. 16th Int. Joint Conf. on Artificial
Intelligence (IJCAI'99), pp. 694-699, Stockholm, Sweden, 1999.
[WEB3]
http://www.darmstadt.gmd.de/mobile/MPEG7/, The MPEG 7 web page. MPEG 7 is a
proposed standard for metadata description of multimedia information of varying kinds.
[WEB4]
http://dublincore.org/documents/, recommendations of the Dublin Core Metadata
Initiative, an open forum concerned with "development of interoperable online metadata
standards that support a broad range of purposes and business models".
[SHNEIDERMAN98]
Ben Shneiderman. Designing the User Interface: Strategies for Effective Human-Computer
Interaction. Addison Wesley Longman, Inc. USA, 1998.
9 Further Reading
9.1 Articles
1. E. Cox. The Fuzzy Systems Handbook. Cambridge, MA: AP Professional, 1994.
2. D. Konopnicki and O. Shmueli. W3QS: A query system for the world-wide-web. In Proc.
1995 Int. Conf. Very Large Data Bases (VLDB'95), pp. 54-65, Zurich, Switzerland, Sept. 1995.
3. A. O. Mendelzon, G. A. Mihaila, and T. Milo. Querying the world-wide web. Int. Journal of
Digital Libraries, 1:54-67, 1995.
4. S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for
semistructured data. Int. Journal of Digital Libraries, 1:68-88, 1997.
5. L. V. S. Lakshmanan, F. Sadri, and S. Subramanian. A declarative query language for
querying and restructuring the web. In Proc. Int. Workshop Research Issues in Data
Engineering, Tempe, AZ, 1996.
6. G. Arocena and A. O. Mendelzon. WebOQL: Restructuring documents, databases, and webs.
In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), Orlando, Florida, Feb. 1998.
7. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc.
7th World Wide Web Conf. (WWW'98), Brisbane, Australia, 1998.
8. K. Wang, S. Zhou, and S. C. Liew. Building hierarchical classifiers using class proximity. In
Proc. 1999 Int. Conf. Very Large Data Bases (VLDB'99), pp. 363-374, Edinburgh, UK, Sept.
1999
9. O. R. Zaïane, J. Han. WebML: Querying the world-wide web for resources and knowledge. In
Proc. Int. Workshop Web Information and Data Management (WIDM'98), pages 9-12,
Bethesda, MD, Nov. 1998.
10. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Morgan
Kaufmann Publishers, San Francisco, 2001; ISBN 1-55860-489-8.
11. M. Perkowitz and O. Etzioni. Adaptive web sites: Conceptual cluster mining. In Proc. 16th
Int. Joint Conf. on Artificial Intelligence (IJCAI'99), pp. 264-269, Stockholm, Sweden, 1999.
12. O. R. Zaïane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying
OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries
Conf. (ADL'98), pp. 19-29, Santa Barbara, California, Apr. 1998.
20. http://www.cs.berkeley.edu/~mazlack/BISC/BISC-DBM.html The Berkeley Initiative in Soft
Computing’s Data Mining Special Interest Group.
21. http://www.oingo.com , Oingo, a search engine which employs a lexical ontology to
determine meanings of terms, then seems to employ fuzzy category matching.
22. http://www.simpli.com, Simpli, a search engine which prompts the user to disambiguate
search terms which have multiple contextual meanings and also employs a lexical ontology. I
have asked for access to their restricted technical papers but had no reply.
23. http://www.simpli.com/search_white_paper.html, Simpli’s searching white paper which
describes their search technology.
24. http://www.google.com/technology/index.html, About Google’s Technology, the tech behind
the famous Google search engine. Note that this is derivative of the HITS algorithm.
25. http://www.sprawlnet.com/about.html, About SprawlNet’s Technology. SprawlNet uses
demographics and geographic information to categorize its users. It also uses some sort of
aggregate learning technique to improve result relevance.
26. http://www.northernlight.com/docs/about_company_mission.html, About Northern Light.
Northern Light classifies each document within an entire source collection into pre-defined
subjects and then, at query time, selects those subjects that best match the search results. Very
little information exists here, though it is interesting for performance comparisons.
27. http://www.fast.no/fast.php3?d=technology&c=fastsrch&h=2, About FAST Search &
Transfer’s technology, which describes in a high-level manner their hardware configuration
as well as about their video & image compression (read: index) technology. FAST is the
company behind AltaVista.
28. http://www.pandia.com/index.html, Pandia, a site devoted to discussion and rating of web
search engines.
10 Figures
Figure 1: Engines like AltaVista.com (shown) and Yahoo.com use term-based searching.
Figure 2: Google's query results are popular, but ambiguous.
Figure 3: Both Oingo and Simpli (shown) disambiguate terms according to their respective ontologies.
Figure 4: Vivísimo uses term-based searching, then performs clustering on the results.
Figure 5: Level 0 data flow diagram, showing data flow around and through the Jasmine system.
Figure 6: Level 1 data flow diagram, showing data flowing between processes within Jasmine.