
White Paper:

Trainable Semantic Vectors


& Semantic Signatures®

TextWise LLC
274 North Goodman Street
Suite B273
Rochester, New York 14607

© 2008 by TextWise, LLC. All Rights Reserved.

Copyright protection claimed includes all forms and matters of copyrightable material and
information now allowed by statutory or judicial law or hereinafter granted, including without
limitation material generated from the software programs which are displayed on the screen,
such as icons, screen displays, looks, etc.

Printed in the United States of America


The Vector Space Model

The idea of representing and manipulating textual content in a vector formalism is an old and
well-accepted principle, dating at least as far back as the early seventies [cf. Salton], when the
formal Vector Space Model of information processing and retrieval was proposed. The success
of any particular use of the Vector Space Model in an application space depends on:

i. What is used to construct the vector;
ii. How the vector weights are calculated;
iii. How vectors are matched or compared.

Most implementations of text applications within the Vector Space Model use vectors derived
from a term-document matrix, where the term dimensions correspond to m individual keywords
or multi-term phrases extracted from the text of n documents, which together define the m × n
dimensions of the matrix. Typically, each cell of a term-document vector contains a weight
representing the strength of the keyword in indicating the topic of the document.

Given a vector representation with weighted terms within document vectors, one can find
documents most related to a keyword query, for example, by ranking the documents with the
highest vector weights for the keywords used in the query (e.g. using the vector cosine
matching formula).
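As a minimal illustration of this basic scheme (the three-document corpus and the query are invented), the following Python sketch fills a term-document matrix with TF-IDF weights and ranks documents against a keyword query using the cosine formula:

```python
# Minimal sketch of the basic Vector Space Model: build a TF-IDF
# weighted term-document matrix and rank documents against a keyword
# query by cosine similarity. The toy corpus is illustrative only.
import math
from collections import Counter

docs = [
    "the car dealer sold the car",
    "coffee from java is strong coffee",
    "the java programming language",
]

# Term frequencies per document.
tf = [Counter(d.split()) for d in docs]
vocab = sorted({t for c in tf for t in c})

# Inverse document frequency for each term.
n = len(docs)
idf = {t: math.log(n / sum(1 for c in tf if t in c)) for t in vocab}

# TF-IDF weight vectors, one per document.
vectors = [{t: c[t] * idf[t] for t in c} for c in tf]

def cosine(a, b):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = {t: idf.get(t, 0.0) for t in "java coffee".split()}
ranked = sorted(range(n), key=lambda i: cosine(query, vectors[i]), reverse=True)
print(ranked)  # document 1 (the coffee/java text) ranks first
```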

This basic implementation of the Vector Space Model has the key advantage that it is easy to
implement and is highly scalable to web-scale applications, as has been demonstrated in
numerous successful search engines. This approach, however, has reached a performance
plateau: it is constrained by its inability to capture meaning independently of the words used
in a text.

While this basic model has proven extremely successful for search applications returning
documents in response to keyword queries, many variations and extensions on this model have
been explored to overcome some of the problems inherent in using individual words as
representative of a document’s content.

One of the significant problem areas with using keywords to suggest text meaning is the issue
of synonymy and polysemy. Synonymy refers to the fact that more than one word can express
the same meaning. A search for ‘car’ would not find documents that use the term ‘automobile’
unless there were some way to associate those words as the same ‘concept’. Polysemy refers
to the fact that a single word may have more than one completely unrelated meaning. The word
‘java’ might be used to search for documents on programming, coffee, or vacations in the Far
East, and it can be impossible to discern which meaning is appropriate without further
information.
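To make the synonymy failure concrete, here is a two-line illustration (the weights are invented): in pure keyword space, ‘car’ and ‘automobile’ occupy different dimensions, so their vectors are orthogonal and a relevant document scores exactly zero.

```python
# The synonymy failure in keyword space: 'car' and 'automobile' occupy
# different dimensions, so the term overlap between query and document
# is exactly zero even though the meaning matches.
query = {"car": 1.0}
doc = {"automobile": 1.0, "sold": 0.7}
overlap = sum(w * doc.get(t, 0.0) for t, w in query.items())
print(overlap)  # 0.0 -- a relevant document scores zero
```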

Implementing the Vector Space Model


Some of the efforts to improve vector space approaches include:

• Natural Language Processing: A whole range of techniques for search can be grouped
under the heading of Natural Language Processing (NLP), encompassing those
approaches that attempt to gain a ‘better understanding’ of the content of text through

analyses and representations that approximate those used in human understanding of
language. These techniques range from widely used stemming algorithms that reduce
keywords from inflected forms (e.g. plural to singular) to some base form, to those that
attempt to achieve a deep knowledge representation of the events conveyed in a text
through many layers of linguistic processing. Again, while many simple techniques have
proven to be quite effective in improving search performance and have gained
widespread use, many other NLP techniques have failed due to high levels of human
intervention and ‘knowledge engineering’ required and a lack of scalability.

• Classification techniques have also been applied in an attempt to overcome the
problems of using keywords to represent text meaning. Automated classification
techniques typically rely on an algorithm that is trained over a sample data set. What is
required is a classification schema (a set of categories into which text can be organized)
that matches the application domain, a training data set of documents that have already
been assigned to their appropriate categories, and an algorithm that learns from these
training assignments an association between the content of documents and the
categories to which they are assigned. This enables the classifier to assign new,
uncategorized documents to appropriate categories. This has worked well in well-defined
domains for which there is a classification schema that aligns well with the application
domain (e.g. patents) and in applications where there have historically been high levels
of manual labor involved in categorizing documents (e.g. news). But the manual effort
involved in creating and maintaining a classification schema and the lack of a large
training data set in many applications have been barriers to more widespread use.

• Latent Semantic Indexing (LSI) has been used in an attempt to overcome the very high
dimensionality of a matrix constructed of all keywords occurring across a document
collection while also trying to capture synonymous words. The LSI technique applies a
computation called Singular Value Decomposition (SVD) to the matrix to reduce the
dimensionality of the space. Rather than vector dimensions corresponding to terms from
documents, the dimensions in the LSI space are abstract representations of the
underlying “latent semantics” of the text (a small numerical sketch follows this list). LSI
has not, however, found widespread use in the large-scale search application space due
to issues with the scalability of the SVD computations. While there are specific
application spaces where the LSI approach can be very effective, the scale of Web
applications such as search and contextual advertising renders LSI intractable.
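The following sketch illustrates the LSI idea on a toy term-document matrix (the corpus and counts are invented): after a truncated SVD, ‘car’ and ‘automobile’, which never co-occur but appear in similar contexts, land close together in the reduced latent space.

```python
# Minimal LSI sketch (illustrative, not TextWise's method): apply a
# truncated SVD to a small term-document matrix so that 'car' and
# 'automobile', which co-occur with similar vocabulary, end up close
# in the reduced latent space.
import numpy as np

terms = ["car", "automobile", "engine", "coffee", "java"]
# Columns are documents; rows are term counts.
A = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # coffee
    [0, 0, 0, 1],   # java
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                              # keep the top-k latent dimensions
term_vecs = U[:, :k] * s[:k]       # terms positioned in latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(term_vecs[0], term_vecs[1]))  # car vs automobile: high
print(cos(term_vecs[0], term_vecs[3]))  # car vs coffee: near zero
```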

One of the more successful variations on the vector space keyword theme was introduced by
Google specifically for search applications using hyperlinked web pages. The PageRank
algorithm can reprioritize search results by measuring each page’s importance based on how
the page is connected to other pages.
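As an illustration of the idea (the four-page link graph is invented, and this is the textbook power-iteration formulation rather than Google's production system), a page's PageRank can be computed as the stationary distribution of a random surfer who follows links with probability d and jumps to a random page otherwise:

```python
# Power-iteration sketch of PageRank: a page's importance is the
# stationary probability of a random surfer on the link graph.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # page -> pages it links to
n, d = 4, 0.85

# Column-stochastic transition matrix of the link graph.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[dst, src] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(50):                      # iterate to convergence
    rank = (1 - d) / n + d * M @ rank
print(rank)  # page 2, linked to from everywhere, gets the highest score
```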

Trainable Semantic Vectors from TextWise
Trainable Semantic Vectors (TSV) technology is a new way of representing and analyzing
semantic information (meaning) in text using the vector space model. The Semantic
Signatures® produced by TSV provide a rich semantic representation of the multiple concepts
and topics contained in a text. Semantic Signatures® can be constructed for a wide range of
texts, including individual words, phrases, word-lists (e.g. metadata), short passages (such as
text advertisements), web pages, or full text documents (e.g. technical articles).

A Semantic Signature® represents a text through a weighted vector of typically thousands of
semantic dimensions. The weight of each vector entry represents the strength of the text along
that particular dimension. One can therefore visualize a document as being uniquely positioned
in an n-dimensional Euclidean semantic space.

Semantic dimensions are derived in a one-time training process from an appropriate
classification schema for the domain. Semantic dimensions may be ‘labeled’ with category
names taken from the schema, though in practice dimensions are likely to be used only
internally by automated processes.

Unlike other techniques that require manual construction and maintenance of dictionaries,
thesauri, or ontologies, TSV automatically generates its own semantic dictionary during training.
This dictionary contains the vocabulary known to be relevant to the application domain, and
automatically provides a “definition” of each term within the established semantic dimensions
(see Figure 1 below).

[Figure: the Saturation, Distribution, and Concentration matrices (each k words by n semantic
dimensions) combine to form the Semantic Dictionary matrix.]

Figure 1: Building a TSV Semantic Dictionary

A Semantic Signature® for a text can be constructed rapidly through a mathematical
combination of the semantic vectors of the vocabulary contained in that text. Having computed
Semantic Signatures® for two given texts, one can then rapidly compute a match score
indicating the relevance of the match based on the positions of the two texts in the n-
dimensional Euclidean semantic space.
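A minimal sketch of this composition-and-matching step follows, assuming a toy three-dimension dictionary and a simple additive combination; the actual dictionary, dimensions, and combination formula are TextWise's and are not disclosed in this paper.

```python
# Hedged sketch of composing Semantic Signatures as the text describes:
# each known term has a vector over semantic dimensions, a text's
# signature is a (here simply additive) combination of its terms'
# vectors, and two texts are matched by cosine in that space. The toy
# dictionary and the additive combination are assumptions.
import numpy as np

# dimensions: [autos, beverages, programming]
semantic_dictionary = {
    "car":        np.array([0.9, 0.0, 0.0]),
    "automobile": np.array([0.8, 0.0, 0.1]),
    "espresso":   np.array([0.0, 0.9, 0.0]),
    "java":       np.array([0.0, 0.4, 0.5]),
}

def signature(text):
    """Sum the semantic vectors of the in-dictionary vocabulary."""
    vecs = [semantic_dictionary[t] for t in text.split()
            if t in semantic_dictionary]
    v = np.sum(vecs, axis=0)
    return v / np.linalg.norm(v)

def match_score(a, b):
    return float(signature(a) @ signature(b))

# 'car' and 'automobile' now match even with no shared keywords.
print(match_score("my car broke down", "automobile repair shop"))  # high
print(match_score("my car broke down", "java espresso bar"))       # low
```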

TSV for the World Wide Web

TextWise uses Semantic Signatures® in its Contextual Matcher application, which operates in
the online contextual advertising market, placing relevant ads on web pages. (For more
information about this specific application, visit www.textwise.com.) Contextual Matcher
performs contextual ad placement by applying TSV technology in the Web domain. TSV defines
a domain through the semantic dimensions that will be used to represent text in that domain.

To train TSV for the web domain, we used high-quality training pages from the Open Directory
Project (ODP). These pages were automatically selected from 4 million web pages that
thousands of expert human editors had manually assigned into one or more of over 500,000
categories. Through our proprietary analysis, we transformed the 500,000 categories into
roughly 2,000 semantic dimensions useful for Semantic Signatures® and generated a
vocabulary of about 240,000 terms with full semantic definitions to be used in representing text
content.
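The proprietary analysis that performed this reduction is not described here. Purely as an illustration of the kind of collapse involved (and not TextWise's actual method), one simple approach is to truncate hierarchical category paths to a fixed depth:

```python
# Illustrative only: the paper says ~500,000 ODP categories were
# collapsed into ~2,000 semantic dimensions by a proprietary analysis.
# This sketch merely truncates hypothetical category paths to their
# top two levels to show one naive way such a reduction could look.
category_paths = [
    "Top/Computers/Programming/Languages/Java",
    "Top/Computers/Programming/Languages/Python",
    "Top/Recreation/Food/Drink/Coffee",
    "Top/Recreation/Travel/Asia/Indonesia",
]

def coarse_dimension(path, depth=2):
    """Drop the 'Top/' root and keep the first `depth` levels."""
    return "/".join(path.split("/")[1:1 + depth])

dimensions = sorted({coarse_dimension(p) for p in category_paths})
print(dimensions)
# ['Computers/Programming', 'Recreation/Food', 'Recreation/Travel']
```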

[Figure: ODP web pages, organized into 500K categories, are distilled by TSV into a Semantic
Dictionary of word vectors over roughly 2K dimensions; document Semantic Signatures® built
from that dictionary drive ad placement.]

Figure 2: TSV Overview

For any new web page (or advertisement targeted at a web page), we construct a 2,000-
dimension Semantic Signature® based on a mathematical combination of the weighted
Semantic Signatures® of all vocabulary from the page that is in our 240,000-term semantic
dictionary (see Figure 2 above).

Semantic Signatures® Advantages
Semantic Signatures®, powered by TSV, offers a unique combination of advantages over other
approaches:

1. Understands the meaning of web pages. It represents the world by combining over
2,000 semantic dimensions directly derived from Internet taxonomies.

2. Comes with its own web-relevant semantic dictionary. It has 240,000 words and
phrases extracted from high-quality web pages spanning the World Wide Web. Each
word and phrase has its own Semantic Signature® definition. Unlike simpler word lists
and ontologies, the Semantic Signatures® semantic dictionary is specifically trained to
cover the Web and provides computable definitions that can be used directly to calculate
the meaning of text.

3. Understands unstructured content with no human intervention. It does not require
any human markup, manual review, keyword assignment, or category assignment. It
simply reads a webpage, captures its underlying meaning, and automatically finds the
most appropriate advertisements that match that meaning.

4. Provides a unique Semantic Signature® for every webpage and every advertisement.
Advertisers are no longer trapped with 500 other competitor ads that look exactly the
same to the placement engine just because they have the same keyword or are dumped
into the same category.

5. Highlights the topic strengths and subtle interactions of meaning within each
document that make that document unique. Since Semantic Signatures® identifies
both focused topic areas and rich sub-contexts, it can find the best match between a
webpage and an ad.

6. Provides flexible “fuzzy matching”. Unlike all-or-nothing methods that just place an
ad, Semantic Signatures® reports how good that placement is. Advertisers can control
how close is “close enough” by paying based on relevancy rather than bidding on
keywords (see the sketch after this list).

7. Handles both keyword lists and full text. It can immediately import an existing
database of ad keywords and calculate new Semantic Signatures® to get accurate
matches against incoming web pages. But it can also leverage additional text
(marketing collateral, ad copy, product literature, customer descriptions, etc.) to create
even better Semantic Signatures®.

8. Bypasses the computational complexity of deep natural language understanding.
Full natural language processing is practical only for relatively small amounts of text, not
for real-time systems that need to handle millions of web pages a day. Semantic
Signatures® uses just the right amount of scalable semantics and linguistic knowledge to
capture the meaning and content of documents.

9. Avoids the classification failures associated with categories that do not align with
customer needs and ontologies that rapidly become obsolete. Although Semantic
Signatures® semantic dimensions are derived from a broad Internet taxonomy, we do not
actually classify web pages or ads into that taxonomy. Because Semantic Signatures®
are formed by combining the strengths of hundreds of topic areas, they are immune to
minor changes in the underlying taxonomy.

10. Overcomes the failure of keyword matching and classification. The failure of
contextual advertising to date can be traced directly to the inability of keyword matching
and classification techniques to deliver highly accurate, highly focused ad placement.
Semantic Signatures® can find the most appropriate ad for a webpage even if the two
use completely different vocabulary.
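A tiny sketch of the thresholded, score-based placement described in advantage 6 above (the scores, threshold, and relevance-scaled pricing rule are all invented for illustration):

```python
# Illustration of "fuzzy matching": instead of a binary place/don't-place
# decision, each candidate ad carries a relevance score, the advertiser
# chooses the cutoff, and price can be tied to the score. All numbers
# and the pricing rule here are invented.
candidate_ads = [("ad_a", 0.91), ("ad_b", 0.74), ("ad_c", 0.32)]
threshold = 0.70        # advertiser-chosen notion of "close enough"
base_price = 1.00       # illustrative price, scaled by relevance

for ad, score in candidate_ads:
    if score >= threshold:
        print(f"place {ad}: relevance {score:.2f}, "
              f"price {base_price * score:.2f}")
    else:
        print(f"skip {ad}: relevance {score:.2f} below threshold")
```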

TextWise Intellectual Property Related to Trainable Semantic Vectors

Patent Number                      Technology                         Description

US Patent No. 6,751,621           Trainable Semantic Vectors (TSV)    Construction and use of TSVs for
Issued: 6/15/04                                                       clustering, classification, and searching.

US Patent No. 7,299,247           TSV for Clustering                  Divisional patent to 6,751,621 adding
Issued: 11/20/07                                                      additional method claims.

US Divisional Patent Application  TSV for Classification              Divisional patent to 6,751,621 adding
                                                                      additional method claims.

US Divisional Patent Application  TSV for Search                      Divisional patent to 6,751,621 adding
                                                                      additional method claims.

US Patent Application             TSV for Contextual Advertising      Advertisement placement method and
Filed: 5/5/05                                                         system using semantic analysis.

US Patent Application             Contextual Advertising              Auction-based data retrieval method
Filed: 5/5/05                     w/ Relevancy Pricing                based on ad relevance.
