TextWise LLC
274 North Goodman Street
Suite B273
Rochester, New York 14607
Copyright protection claimed includes all forms and matters of copyrightable material and
information now allowed by statutory or judicial law or hereinafter granted including without
limitation, material generated from the software programs which are displayed on the screen,
such as icons, screen displays, looks, etc.
The idea of representing and manipulating textual content in a vector formalism is an old and
well-accepted principle, dating at least as far back as the early seventies (cf. Salton), when the
formal Vector Space Model of information processing and retrieval was proposed. The success
of any particular use of the Vector Space Model in an application space depends on several factors.
Most implementations of text applications within the Vector Space Model use vectors derived
from a term-document matrix, where the term dimensions correspond to m individual keywords
or multi-word phrases extracted from the text of n documents, which together define the m × n
dimensions of the matrix. Typically, each cell of a term-document vector contains a weight
representing the strength of the keyword as an indicator of the document’s topic.
Given a vector representation with weighted terms within document vectors, one can find
documents most related to a keyword query, for example, by ranking the documents with the
highest vector weights for the keywords used in the query (e.g. using the vector cosine
matching formula).
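The weighting-and-ranking scheme just described can be sketched in a few lines of Python. The toy corpus, the raw term-frequency weights, and the query are illustrative assumptions; only the cosine formula follows the text directly.

```python
import math
from collections import Counter

# Toy corpus: each "document" is a list of tokens (an illustrative
# assumption; real systems index millions of documents).
docs = {
    "d1": ["car", "engine", "repair", "car"],
    "d2": ["coffee", "java", "roast"],
    "d3": ["automobile", "engine", "java", "programming"],
}

# Term-document vectors with simple term-frequency weights per cell.
vectors = {name: Counter(tokens) for name, tokens in docs.items()}

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query_terms):
    """Rank documents by cosine match against a keyword query."""
    q = Counter(query_terms)
    return sorted(vectors, key=lambda d: cosine(q, vectors[d]), reverse=True)

print(rank(["car", "engine"]))  # d1 first: it carries the heaviest query-term weights
```

A production system would replace raw term frequencies with a weighting such as TF-IDF, but the ranking step is unchanged.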
This basic implementation of the Vector Space Model has the key advantage that it is easy to
implement and is highly scalable to web-scale applications, as numerous successful search
engines have demonstrated. This approach, however, has reached its maximum performance
limit: it is constrained by its inability to capture meaning independently of the words used in a
text.
While this basic model has proven extremely successful for search applications returning
documents in response to keyword queries, many variations and extensions on this model have
been explored to overcome some of the problems inherent in using individual words as
representative of a document’s content.
One of the significant problem areas with using keywords to suggest text meaning is the issue
of synonymy and polysemy. Synonymy refers to the fact that more than one word can often
mean the same thing. A search for ‘car’ would not find documents that use the term
‘automobile’ unless there were some way to associate those words as the same ‘concept’.
Polysemy refers to the fact that a single word may have more than one completely unrelated
meaning. The word ‘java’ might be used to search for documents on programming, coffee, or
vacations in the Far East, and it can be impossible to discern which meaning is intended
without further information.
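The synonymy problem can be made concrete with a small sketch. The synonym table below is invented purely for illustration; a real system would derive such associations from a thesaurus or from training data.

```python
# Hypothetical synonym table (invented for illustration); a real system
# would derive these associations from a thesaurus or from training data.
SYNONYMS = {
    "car": {"car", "automobile", "auto"},
    "automobile": {"car", "automobile", "auto"},
}

def expand(query_terms):
    """Replace each query term by the set of terms sharing its concept."""
    expanded = set()
    for term in query_terms:
        expanded |= SYNONYMS.get(term, {term})
    return expanded

def matches(query_terms, doc_tokens):
    """True if any expanded query term occurs in the document."""
    return bool(expand(query_terms) & set(doc_tokens))

# A plain keyword match on 'car' would miss this document;
# the synonym expansion bridges the vocabulary gap.
print(matches(["car"], ["automobile", "dealer"]))  # True
```

Note that expansion tables do nothing for polysemy: ‘java’ would still match coffee and programming pages alike.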
• Natural Language Processing: A whole range of techniques for search can be grouped
under the heading of Natural Language Processing (NLP), encompassing those
approaches that attempt to gain a ‘better understanding’ of the content of text through
analyses and representations that approximate those used in human understanding of
language. These techniques range from widely used stemming algorithms that reduce
keywords from inflected forms (e.g. plural to singular) to some base form, to those that
attempt to achieve a deep knowledge representation of the events conveyed in a text
through many layers of linguistic processing. Again, while many simple techniques have
proven to be quite effective in improving search performance and have gained
widespread use, many other NLP techniques have failed due to high levels of human
intervention and ‘knowledge engineering’ required and a lack of scalability.
• Latent Semantic Indexing has been used in an attempt to overcome the very high
dimensionality of a matrix constructed of all keywords occurring across a document
collection while also trying to capture synonymous words (words that mean the same
thing). The LSI technique applies a computation called Singular Value Decomposition
(SVD) to the matrix to reduce the dimensionality of the space. Rather than vector
dimensions corresponding to terms from documents, the dimensions in the LSI space
are abstract representations of the underlying “latent semantics” of the text. LSI has not,
however, found widespread use in large-scale search applications due to issues with the
scalability of the SVD computation. While there are specific application spaces
where the LSI approach can be very effective, the scale of Web applications such as
search and contextual advertising render LSI intractable.
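The LSI idea can be sketched with NumPy’s SVD on a toy term-document matrix. The corpus and the choice of k = 2 latent dimensions are illustrative assumptions, not TextWise’s or LSI’s canonical settings.

```python
import numpy as np

# Toy 5-term x 4-document count matrix (illustrative data). Terms, in
# row order: car, automobile, engine, coffee, java. Documents 0-1 are
# about vehicles, documents 2-3 about coffee.
A = np.array([
    [2, 0, 0, 0],   # car
    [0, 2, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 2, 1],   # coffee
    [0, 0, 1, 2],   # java
], dtype=float)

# Truncated SVD: keep k latent dimensions instead of one per term.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document, in latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 use different head terms ('car' vs 'automobile'),
# yet land almost parallel in latent space because 'engine' co-occurs
# with both, while the coffee documents stay nearly orthogonal to them.
print(cos(docs_k[0], docs_k[1]), cos(docs_k[0], docs_k[2]))
```

The scalability problem the text mentions is visible in the shape of the computation: a full SVD over a web-scale term-document matrix is far more expensive than the inverted-index lookups a keyword engine performs.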
One of the more successful variations on the vector space keyword theme was introduced by
Google specifically for search applications using hyperlinked web pages. The PageRank
algorithm can reprioritize search results by measuring each page’s importance based on how
the page is connected to other pages.
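The intuition behind PageRank can be sketched as a short power iteration over a hypothetical four-page link graph. The graph and iteration count are invented for illustration; this is the core idea of the published algorithm, not Google’s production implementation.

```python
# A minimal PageRank power iteration over a hypothetical four-page
# link graph: each page repeatedly splits its current score among
# the pages it links to, plus a small uniform "teleport" share.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
d = 0.85                                      # standard damping factor
rank = {p: 1.0 / len(pages) for p in pages}   # start from a uniform score

for _ in range(50):                           # power iteration to (near) convergence
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        share = d * rank[p] / len(outs)       # each page splits its score
        for q in outs:                        # among the pages it links to
            new[q] += share
    rank = new

# C is linked to by A, B and D, so it accumulates the highest score.
print(max(rank, key=rank.get))  # C
```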
Trainable Semantic Vectors from TextWise
Trainable Semantic Vectors (TSV) technology is a new way of representing and analyzing
semantic information (meaning) in text using the vector space model. The Semantic
Signatures® produced by TSV provide a rich semantic representation of the multiple concepts
and topics contained in a text. Semantic Signatures® can be constructed for a wide range of
texts, including individual words, phrases, word-lists (e.g. metadata), short passages (such as
text advertisements), web pages, or full text documents (e.g. technical articles).
Unlike other techniques that require manual construction and maintenance of dictionaries,
thesauri, or ontologies, TSV automatically generates its own semantic dictionary during training.
This dictionary contains the vocabulary known to be relevant to the application domain, and
automatically provides a “definition” of each term within the established semantic dimensions
(see Figure 1 below).
[Figure 1: Saturation, Distribution, and Concentration matrices are combined into the semantic dictionary, which assigns each vocabulary word (Word 1 … Word k) a numeric weight in each semantic dimension (Dim 1 … Dim n).]
Comparing two Semantic Signatures® produces a score indicating the relevance of the match,
based on the positions of the two texts in the n-dimensional Euclidean semantic space.
TextWise uses Semantic Signatures® in our Contextual Matcher application, which operates in
the online contextual advertising market, placing relevant ads on web pages. (For more
information about this specific application, visit www.textwise.com.) Contextual ad placement
applies TSV technology in the Web domain: TSV defines a domain through the semantic
dimensions that will be used to represent text in that domain.
To train TSV for the web domain, we used high-quality training pages from the Open Directory
Project (ODP). These pages were automatically selected from 4 million web pages that
thousands of expert human editors had manually assigned into one or more of over 500,000
categories. Through our proprietary analysis, we transformed the 500,000 categories into
roughly 2,000 semantic dimensions useful for Semantic Signatures® and generated a
vocabulary of about 240,000 terms with full semantic definitions to be used in representing text
content.
[Figure 2: TSV Overview. ODP web pages (500K categories) are processed by TSV to build a semantic dictionary of words defined over 2K dimensions; word Semantic Signatures® from that dictionary are combined into page signatures that drive ad placement.]
For any new web page (or advertisement targeted at a web page), we construct a 2,000-
dimension Semantic Signature® as a mathematical combination of the weighted Semantic
Signatures® of all vocabulary from the page that appears in our 240,000-term semantic
dictionary (see Figure 2 above).
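This combination step can be sketched in miniature. The four-dimension dictionary below stands in for the real 2,000-dimension, 240,000-term one; every word, weight, and the simple averaging rule are invented for illustration, not TextWise’s proprietary combination.

```python
import numpy as np

# Hypothetical 4-dimension semantic dictionary standing in for the real
# ~2,000-dimension, ~240,000-term one; all entries invented for illustration.
dictionary = {
    "coffee":      np.array([0.9, 0.1, 0.0, 0.0]),
    "java":        np.array([0.5, 0.0, 0.5, 0.0]),
    "programming": np.array([0.0, 0.0, 0.9, 0.1]),
}

def page_signature(tokens):
    """Average the signatures of all in-dictionary tokens on a page,
    then unit-normalize the result for cosine matching."""
    vecs = [dictionary[t] for t in tokens if t in dictionary]
    if not vecs:
        return np.zeros(4)
    sig = np.mean(vecs, axis=0)
    return sig / np.linalg.norm(sig)

# 'syntax' is out-of-vocabulary and is simply ignored; the page's
# signature is dominated by the programming dimension.
sig = page_signature(["java", "programming", "syntax"])
print(sig.round(2))
```

Ads are represented the same way, so matching an ad to a page reduces to comparing two vectors in the same space.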
Semantic Signatures® Advantages
Semantic Signatures®, powered by TSV, offers a unique combination of advantages over other
approaches:
1. Understands the meaning of web pages. It expresses the world by combining over
2000 semantic dimensions that are directly derived from Internet taxonomies.
2. Comes with its own web-relevant semantic dictionary. It has 240,000 words and
phrases extracted from high-quality web pages spanning the World Wide Web. Each
word and phrase has its own Semantic Signatures® definition. Unlike simpler wordlists
and ontologies, the Semantic Signatures® semantic dictionary is specifically trained to
cover the Web and provides computable definitions that can be used directly to calculate
the meaning of text.
4. Provides a unique Semantic Signature® for every webpage and every advertisement.
Advertisers are no longer trapped with 500 other competitor ads that look exactly the
same to the placement engine just because they have the same keyword or are dumped
into the same category.
5. Highlights the topic strengths and subtle interactions of meaning within each
document that make that document unique. Since Semantic Signatures® identifies
both focused topic areas and rich sub-contexts, it can find the best match between a
webpage and an ad.
6. Provides flexible “fuzzy matching”. Unlike all-or-nothing methods that just place an
ad, Semantic Signatures® reports how good that placement is. Advertisers can control
how close is “close enough” by paying based on relevancy rather than bidding on
keywords.
7. Handles both keyword lists and full text. It can immediately import an existing
database of ad keywords and calculate new Semantic Signatures® to get accurate
matches against incoming web pages. But it can also leverage additional text
(marketing collateral, ad copy, product literature, customer descriptions, etc.) to create
even better Semantic Signatures®.
9. Avoids the classification failures associated with categories that do not align with
customer needs and ontologies that rapidly become obsolete. Although Semantic
Signatures® semantic dimensions are derived from a broad Internet taxonomy, we do not
actually classify web pages or ads into that taxonomy. Because Semantic Signatures®
are formed by combining the strengths of hundreds of topic areas, they are immune to
minor changes in the underlying taxonomy.
10. Overcomes the failure of keyword matching and classification. The failure of
contextual advertising to date can be traced directly to the inability of keyword matching
and classification techniques to deliver highly accurate, highly focused ad placement.
Semantic Signatures® can find the most appropriate ad for a webpage even if the two
use completely different vocabulary.
TextWise Intellectual Property Related to Trainable Semantic Vectors
US Patent No. 6,751,621, Trainable Semantic Vectors (TSV). Construction and use of TSVs for clustering, classification, and searching. Issued: 6/15/04.
US Patent No. 7,299,247, TSV for Clustering. Division patent to 6,751,621 adding additional method claims. Issued: 11/20/07.
US Patent Application, TSV for Contextual Advertising. Advertisement placement method and system using semantic analysis. Filed: 5/5/05.