
Inspire…Educate…Transform.

Foundations of Text Mining and Search: Indexing, SVD

Dr. Manish Gupta
Sr. Mentor – Academics, INSOFE

Adapted from:
• https://web.stanford.edu/class/cs276a/handouts/lecture1.ppt
• nlp.stanford.edu/IR-book/ppt/18lsi.pptx
• www.ics.uci.edu/~lopes/teaching/cs221W12/slides/LSI.pptx
• www.csc.villanova.edu/~matuszek/fall2003/DocBased.ppt



Today’s Agenda
• Text Indexing



The Classic Search Model
• A user task gives rise to an info need, which is verbalized as a query.
• The query is sent to a search engine, which evaluates it over the document collection and returns results.
• Based on the results, the user may refine the query and search again (query refinement).



Sec. 1.1
Query on Unstructured Data
• Which plays of Shakespeare contain the words Brutus AND Caesar
but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
• Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near countrymen) not feasible
– Ranked retrieval (best documents to return)



Sec. 1.1
Term-Document Incidence Matrices

            Antony &   Julius   The      Hamlet  Othello  Macbeth
            Cleopatra  Caesar   Tempest
Antony         1          1        0        0       0        1
Brutus         1          1        0        1       0        0
Caesar         1          1        0        1       1        1
Calpurnia      0          1        0        0       0        0
Cleopatra      1          0        0        0       0        0
mercy          1          0        1        1       1        1
worser         1          0        1        1       1        0

Each entry is 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar BUT NOT Calpurnia.



Sec. 1.1
Incidence Vectors
• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus, Caesar and Calpurnia
(complemented) ➔ bitwise AND.
– 110100 AND 110111 AND 101111 = 100100

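To make the Boolean query concrete, here is a minimal Python sketch using the toy incidence vectors from the matrix above:

```python
# Toy incidence vectors from the slide: one bit per play.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
vectors = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

# Brutus AND Caesar AND NOT Calpurnia, computed bitwise per play.
result = [b & c & (1 - p) for b, c, p in
          zip(vectors["Brutus"], vectors["Caesar"], vectors["Calpurnia"])]
print([play for play, bit in zip(plays, result) if bit])
# ['Antony and Cleopatra', 'Hamlet']
```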



Sec. 1.1
Bigger Collections
• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among these.



Sec. 1.1
Can’t Build the Matrix
• A 500K x 1M matrix has half a trillion 0’s and 1’s.
• But it has no more than one billion 1’s. Why?
  – Each of the 1M documents has about 1000 words, so there are at most 10⁹ term occurrences in total.
  – The matrix is therefore extremely sparse.
• What’s a better representation?
  – We only record the 1 positions.



Sec. 1.2
Inverted Index
• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number.
• Can we use fixed-size arrays for this?

Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101

What happens if the word Caesar is added to document 14?



Sec. 1.2
Inverted Index
• We need variable-size postings lists.
  – On disk, a contiguous run of postings is normal and best.
  – In memory, can use linked lists or variable-length arrays.
• Each entry in a postings list is a posting; postings are sorted by docID.

Dictionary    Postings
Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101
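A minimal sketch of building such an index in Python (toy documents are illustrative; a real system would store postings on disk and compress them):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    # Postings must be sorted by docID for efficient merging.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "I did enact Julius Caesar",
        2: "so let it be with Caesar"}
index = build_inverted_index(docs)
print(index["caesar"])  # [1, 2]
```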
Sec. 1.2
Inverted Index Construction

Documents to be indexed:  Friends, Romans, countrymen.
            ↓ (Tokenizer)
Token stream:             Friends  Romans  Countrymen
            ↓ (Linguistic modules)
Modified tokens:          friend  roman  countryman
            ↓ (Indexer)
Inverted index:           friend → 2, 4;  roman → 1, 2;  countryman → 13, 16
Initial Stages of Text Processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, a state-of-the-art solution

• Normalization
– Map text and query terms to the same form
• You want U.S.A. and USA to match

• Stemming
– We may wish different forms of a root to match
• authorize, authorization

• Stop words
– We may omit very common words (or not)
• the, a, to, of
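A minimal sketch of these stages in Python (the Porter stemmer from NLTK is one common choice; the package and its API are assumptions of this sketch, not prescribed by the slides):

```python
import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed

STOP_WORDS = {"the", "a", "to", "of"}
stemmer = PorterStemmer()

def preprocess(text):
    # Normalization: lowercase and drop periods, so U.S.A. matches USA.
    text = text.lower().replace(".", "")
    # Tokenization: cut the character sequence into word tokens.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stop-word removal, then stemming so forms of a root match.
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The U.S.A. authorizes the authorization."))
# expected: ['usa', 'author', 'author']
```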



Sec. 1.2
Indexer Steps: Token Sequence
• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.



Sec. 1.2
Indexer Steps: Sort
• Sort by terms
  – And then by docID
• This is the core indexing step.



Sec. 1.2
Indexer Steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Document frequency information is added.



Sec. 1.3
Query Processing: AND
• Consider processing the query: Brutus AND Caesar
  – Locate Brutus in the Dictionary; retrieve its postings.
  – Locate Caesar in the Dictionary; retrieve its postings.
  – “Merge” the two postings (intersect the document sets):

Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34



Sec. 1.3
The Merge
• Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34

• If the list lengths are x and y, the merge takes O(x+y) operations.
• Crucial: postings sorted by docID.
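A sketch of the two-pointer intersection in Python (postings values are the toy lists from the slide):

```python
def intersect(p1, p2):
    """Two-pointer merge of sorted postings lists: O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```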



Sec. 1.3
Boolean Queries: Exact Match
• In the Boolean retrieval model, a query is a Boolean expression:
– Boolean Queries are queries using AND, OR and NOT to join query terms
• Views each document as a set of words
• Is precise: document matches condition or not.
– Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight



Sec. 1.3
Query Optimization
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

Brutus    → 2, 4, 8, 16, 32, 64, 128
Caesar    → 1, 2, 3, 5, 8, 16, 21, 34
Calpurnia → 13, 16

Query: Brutus AND Calpurnia AND Caesar



Sec. 1.3
Query Optimization Example
• Process terms in order of increasing document frequency:
  – start with the smallest postings list, then keep cutting further.
  – This is why we keep document frequency in the dictionary.

Brutus    → 2, 4, 8, 16, 32, 64, 128
Caesar    → 1, 2, 3, 5, 8, 16, 21, 34
Calpurnia → 13, 16

• Execute the query as (Calpurnia AND Brutus) AND Caesar.
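A sketch of this ordering in Python, reusing the intersect function above (postings are the toy lists from this slide):

```python
brutus    = [2, 4, 8, 16, 32, 64, 128]
caesar    = [1, 2, 3, 5, 8, 16, 21, 34]
calpurnia = [13, 16]

def intersect_many(postings_lists):
    """AND together several postings lists, smallest first."""
    ordered = sorted(postings_lists, key=len)  # increasing document frequency
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)   # intersect() defined earlier
        if not result:                         # an empty result can only stay empty
            break
    return result

print(intersect_many([brutus, caesar, calpurnia]))  # [16]
```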



Sec. 2.4
Phrase Queries
• We want to be able to answer queries such as “stanford university” –
as a phrase
• Thus the sentence “I went to university at Stanford” is not a match.
– The concept of phrase queries has proven easily understood by users; one of the
few “advanced search” ideas that works
– Many more queries are implicit phrase queries
• For this, it no longer suffices to store only <term : docs> entries.



Sec. 2.4.1
A First Attempt: Bi-word Indexes
• Index every consecutive pair of terms in the text as a phrase
• For example, the text “Friends, Romans, Countrymen” would generate
the biwords
– friends romans
– romans countrymen
• Each of these bi-words is now a dictionary term
• Two-word phrase query-processing is now immediate.



Sec. 2.4.1
Longer Phrase Queries
• Longer phrases can be processed by breaking them down
• stanford university palo alto can be broken into the Boolean query on
biwords:
stanford university AND university palo AND palo alto

Without the docs, we cannot verify that the docs matching the above
Boolean query do contain the phrase.

Can have false positives!



Sec. 2.4.1
Issues for Bi-word Indexes
• False positives, as noted before
• Index blowup due to bigger dictionary
– Infeasible for more than bi-words, big even for them

• Bi-word indexes are not the standard solution (for all bi-words) but
can be part of a compound strategy



Sec. 2.4.2
Solution 2: Positional Indexes
• In the postings, store, for each term, the position(s) at which its tokens appear:

<term, number of docs containing term;
 doc1: position1, position2, … ;
 doc2: position1, position2, … ;
 …>



Sec. 2.4.2
Positional Index Example
• For phrase queries, we use a merge algorithm recursively at the document level.
• But we now need to deal with more than just equality.

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?



Sec. 2.4.2
Processing a Phrase Query
• Extract inverted index entries for each distinct term: to, be, or, not.
• Merge their doc:position lists to enumerate all positions with “to be or
not to be”.
– to:
• 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...

– be:
• 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
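A sketch of positional matching for a two-term phrase in Python (positional postings are the toy values from this slide; a longer phrase repeats this pairwise):

```python
def phrase_match(pos1, pos2):
    """Find docs where term2 appears exactly one position after term1.

    pos1, pos2: dicts mapping docID -> sorted list of positions.
    """
    matches = {}
    for doc in pos1.keys() & pos2.keys():  # only docs containing both terms
        second = set(pos2[doc])
        hits = [p for p in pos1[doc] if p + 1 in second]
        if hits:
            matches[doc] = hits
    return matches

to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_match(to, be))  # {4: [16, 190, 429, 433]} -- "to be" occurs in doc 4
```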



Ch. 6
Ranked Retrieval
• Thus far, our queries have all been Boolean.
– Documents either match or don’t.
• Good for expert users with precise understanding of their needs and
the collection.
– Also good for applications: Applications can easily consume 1000s of results.
• Not good for the majority of users.
– Most users are incapable of writing Boolean queries.
– Most users don’t want to wade through 1000s of results.
• This is particularly true of web search.



Term Frequency TF
• The term frequency tft,d of term t in document d is defined as the
number of times that t occurs in d.
• Relevance does not increase proportionally with term frequency.
• So use log-frequency weighting.
• The log-frequency weight of term t in d is w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise.
• Score for a document-query pair: sum over terms t in both q and d:
  – score(q, d) = Σ_{t∈q∩d} (1 + log10 tf_{t,d})



Sec. 6.2.1
Inverse Document Frequency IDF
• Frequent terms are less informative than rare terms
• df_t is the document frequency of t: the number of documents that contain t
  – df_t ≤ N
• idf_t = log10(N / df_t)
• IDF has no effect on ranking for one-term queries
  – IDF affects the ranking of documents for queries with at least two terms
  – For the query capricious person, IDF weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.



Sec. 6.2.2
TF-IDF Weighting
• The tf-idf weight of a term is the product of its tf weight and its idf
weight.
  tf-idf_{t,d} = log10(1 + tf_{t,d}) · log10(N / df_t)
• Score for a document given a query:
  Score(q, d) = Σ_{t∈q∩d} tf-idf_{t,d}
• There are many variants
– How “tf” is computed (with/without logs)
– Whether the terms in the query are also weighted
– …
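A minimal sketch of tf-idf scoring in Python (toy corpus; the weighting follows the formula above):

```python
import math
from collections import Counter

docs = {1: ["caesar", "brutus", "caesar"],
        2: ["brutus", "calpurnia"],
        3: ["caesar", "mercy"]}
N = len(docs)
tf = {d: Counter(tokens) for d, tokens in docs.items()}
df = Counter(t for tokens in docs.values() for t in set(tokens))

def tf_idf(term, doc):
    if tf[doc][term] == 0:
        return 0.0
    return math.log10(1 + tf[doc][term]) * math.log10(N / df[term])

def score(query, doc):
    # Sum tf-idf over query terms present in the document.
    return sum(tf_idf(t, doc) for t in query)

for d in docs:
    print(d, round(score(["caesar", "brutus"], d), 3))
```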



Sec. 6.3
Binary → Count → Weight Matrix

            Antony &   Julius   The      Hamlet  Othello  Macbeth
            Cleopatra  Caesar   Tempest
Antony        5.25       3.18     0        0       0       0.35
Brutus        1.21       6.1      0        1       0       0
Caesar        8.59       2.54     0        1.51    0.25    0
Calpurnia     0          1.54     0        0       0       0
Cleopatra     2.85       0        0        0       0       0
mercy         1.51       0        1.9      0.12    5.25    0.88
worser        1.37       0        0.11     4.15    0.25    1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.



Ranking in Vector Space Model
• Represent both query and document as vectors in the |V|-dimensional space
• Use cosine similarity as the similarity measure
  – Incorporates length normalization automatically (longer vs. shorter documents)

  cos(q, d) = (q · d) / (|q| · |d|) = Σ_{i=1..|V|} q_i d_i / ( sqrt(Σ_i q_i²) · sqrt(Σ_i d_i²) )

• q_i is the tf-idf weight of term i in the query
• d_i is the tf-idf weight of term i in the document
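A sketch of cosine scoring in Python (toy vectors; in practice the documents would be the tf-idf columns above):

```python
import math

def cosine(q, d):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

query = [0.0, 1.0, 1.0]   # hypothetical tf-idf weights over a 3-term vocabulary
doc   = [2.0, 3.0, 0.0]
print(round(cosine(query, doc), 3))  # 0.588
```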



Today’s Agenda
• Text Indexing
• Singular Value Decomposition (SVD)

One hot encoding

CS224D Richard Socher
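The slide’s figure is not reproduced here. As a reminder, one-hot encoding represents each word as a |V|-dimensional vector that is all zeros except for a single 1; a minimal NumPy sketch (toy vocabulary is an assumption):

```python
import numpy as np

vocab = ["hotel", "motel", "car"]          # toy vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """|V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("motel"))                     # [0. 1. 0.]
# Every pair of distinct one-hot vectors is orthogonal:
print(one_hot("hotel") @ one_hot("motel"))  # 0.0 -- no notion of similarity
```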



Distributional Similarity Based Representations

CS224D Richard Socher



How to make neighbors represent words?
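The slide’s figure is not reproduced; the idea developed over the next slides is to represent a word by its co-occurrence counts with neighboring words, then reduce that matrix with SVD. A toy sketch of counting co-occurrences within a window (the corpus and window size are illustrative assumptions):

```python
from collections import Counter, defaultdict

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1  # symmetric context window

cooc = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[w][sent[j]] += 1

print(dict(cooc["like"]))  # {'i': 2, 'deep': 1, 'nlp': 1}
```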



Datasets in the Form of Matrices
• We are given n objects and d features describing the objects.
  – Each object has d numeric values describing it.
• Dataset
  – An n-by-d matrix A, where A_ij shows the “importance” of feature j for object i.
  – Every row of A represents an object.
• Goal
  – Understand the structure of the data, e.g., the underlying process generating the data.
  – Reduce the number of features representing the data.

Document Matrices
• A is an n-by-d matrix: n documents, d terms (e.g., theorem, proof, etc.).
• A_ij = frequency of the j-th term in the i-th document.
• Goal: find a subset of the terms that accurately clusters the documents.
PCA
(Figure: scatter of 2-d data points with the 1st and 2nd right singular vectors drawn as directions.)
• Input: 2-dimensional points.
• Output:
  – 1st (right) singular vector: direction of maximal variance.
  – 2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector.
  – σ1 measures how much of the data variance is explained by the first singular vector.
  – σ2 measures how much of the data variance is explained by the second singular vector.

SVD Decomposition

A = U Σ V^T, with A n×d, U n×r, Σ r×r, V^T r×d.

• U (V): orthogonal matrix containing the left (right) singular vectors of A.
  – Left singular vectors: eigenvectors of A·A^T; right singular vectors: eigenvectors of A^T·A.
• Σ: diagonal matrix containing the singular values of A: σ1 ≥ σ2 ≥ … ≥ σr.
• Exact computation of the SVD takes O(min{dn², d²n}) time.
  – The top k left/right singular vectors/values can be computed faster using Lanczos/Arnoldi methods.
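A quick check of the decomposition with NumPy (toy matrix):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])           # n=3 documents, d=2 terms

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)                                    # singular values, sorted descending
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: A = U Σ V^T
```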

Truncated SVD
• SVD is a means to the end goal.
• The end goal is dimensionality reduction, i.e., get another version of A computed from a reduced space in UΣV^T.
  – Simply zero out Σ after a certain row/column k: A_k = U Σ_k V^T (shapes as above).


Rank-k Approximations (A_k)

A_k = U_k Σ_k V_k^T, with A_k n×d, U_k n×k, Σ_k k×k, V_k^T k×d.

• U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A. Σ_k: diagonal matrix containing the top k singular values of A.
• A_k is the best rank-k approximation of A.
• Eckart-Young theorem: keeping the k largest singular values and setting all others to zero gives the optimal approximation of the original matrix A with respect to all rank-k matrices and the Frobenius norm.
• ‖A‖_F = sqrt(Σ_{i=1..n} Σ_{j=1..d} a_ij²)
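A sketch of the rank-k approximation with NumPy, continuing the example above:

```python
def rank_k_approx(A, k):
    """Best rank-k approximation of A in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A1 = rank_k_approx(A, 1)
err = np.linalg.norm(A - A1, "fro")
# Eckart-Young: the Frobenius error equals the energy of the dropped singular values.
print(np.isclose(err, np.sqrt(np.sum(s[1:] ** 2))))  # True
```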
Today’s Agenda
• Text Indexing
• Singular Value Decomposition (SVD)
• Latent Semantic Indexing (LSI)

Deficiencies with Conventional Automatic Indexing
• Synonymy: various words and phrases refer to the same concept (lowers recall).
• Polysemy: individual words have more than one meaning (lowers precision).
• Independence: no significance is given to two terms that frequently appear together.
• Latent semantic indexing addresses the first of these (synonymy) and the third (dependence).

Doc A: auto, engine, bonnet, tires, lorry, boot
Doc B: car, emissions, hood, make, model, trunk
Doc C: make, hidden, Markov, model, emissions, normalize

Synonymy: Docs A and B will have small cosine but are related.
Polysemy: Docs B and C will have large cosine but are not truly related.
Latent Semantic Analysis
• Bag-of-words methods: a document is only “about” the words in it.
• But people interpret documents in a richer context:
  – a document is about some domain and the concepts in it
  – reflected in the vocabulary but not limited to it
• LSI: developed at Bellcore (now Telcordia) in the late 1980s (1988); patented in 1989.
  – Aim: replace indexes that use sets of words by indexes that use concepts.
  – Latent: not visible on the surface.
  – Semantic: word meanings.

Word Co-Occurrences
• Bag-of-words approaches assume meaning is carried by vocabulary.
• The next step is to look at vocabulary groups: what words tend to occur together?
• Still a statistical approach, but a richer representation than single terms.
• E.g., looking for articles about Tiger Woods in an AP newswire database could bring up stories about the golfer, followed by articles about golf tournaments that don’t mention his name.
  – So we need to recognize that Tiger Woods is about golf.

Problem: Very High Dimensionality
• A vector of TF*IDF weights representing a document is high-dimensional.
• Need some way to trim the words looked at:
  – First, throw away anything “not useful” (stop words).
  – Second, identify clusters and pick representative terms.
• Use SVD.

Recall: Term-Document Matrix

            Anthony &  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra  Caesar   Tempest
anthony       5.25       3.18    0.0      0.0     0.0      0.35
brutus        1.21       6.10    0.0      1.0     0.0      0.0
caesar        8.59       2.54    0.0      1.51    0.25     0.0
calpurnia     0.0        1.54    0.0      0.0     0.0      0.0
cleopatra     2.85       0.0     0.0      0.0     0.0      0.0
mercy         1.51       0.0     1.90     0.12    5.25     0.88
worser        1.37       0.0     0.11     4.15    0.25     1.95
...

Can we transform this matrix so that we get a better measure of similarity between documents and queries?

Example of C = UΣV^T: All Four Matrices
(The worked numeric example of C, U, Σ, and V^T is shown as a figure and is not reproduced here.)

Dimensionality Reduction
• Key property: each singular value tells us how important its dimension is.
• By setting less important dimensions to zero, we keep the important information but get rid of the “details”.
• These details may
  – be noise – in that case, reduced LSI is a better representation because it is less noisy;
  – make things dissimilar that should be similar – again, reduced LSI is a better representation because it represents similarity better.
• Fewer details is better.

Reducing the Dimensionality to 2
• Actually, we only zero out singular values in Σ. This has the effect of setting the corresponding dimensions in U and V^T to zero when computing the product C = UΣV^T.
• Compute C2 = UΣ2V^T.

Original Matrix C vs. Reduced C2 = UΣ2V^T
• We can view C2 as a two-dimensional representation of the matrix: we have performed a dimensionality reduction to two dimensions.

Why is the LSI-Reduced Matrix “Better”?
• Similarity of d2 and d3 in the original space: 0.
• Similarity of d2 and d3 in the reduced space:
  0.52 × 0.28 + 0.36 × 0.16 + 0.72 × 0.36 + 0.12 × 0.20 + (−0.39) × (−0.08) ≈ 0.52
• “boat” and “ship” are semantically similar. The “reduced” similarity measure reflects this.
• What property of the SVD reduction is responsible for the improved similarity?
LSI Implementation
• Compute the SVD of the term-document matrix: C = UΣV^T.
• Reduce the space and compute reduced document representations.
• Map the query into the reduced space:
  – q2 = Σ2^-1 U2^T q
  – This follows from C2 = UΣ2V2^T ⇒ Σ2^-1 U^T C2 = V2^T.
• Compute the similarity of q2 with all reduced documents in V2.
• Output a ranked list of documents as usual.
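A sketch of these steps with NumPy (toy term-document matrix and k=2 are assumptions of this example):

```python
import numpy as np

# Toy term-document matrix C (terms x docs).
C = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
k = 2

U, s, Vt = np.linalg.svd(C, full_matrices=False)
Uk, Sk, Vk_t = U[:, :k], np.diag(s[:k]), Vt[:k, :]  # reduced space

# Map a query (here: terms 0 and 2) into the reduced space: q_k = S_k^-1 U_k^T q
q = np.array([1.0, 0.0, 1.0, 0.0])
qk = np.linalg.inv(Sk) @ Uk.T @ q

# Rank documents by cosine similarity in the reduced space.
docs_k = Vk_t.T                       # one reduced vector per document
sims = docs_k @ qk / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(qk))
print(np.argsort(-sims))              # docIDs, best match first
```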
SVD with These Hacks Leads to Good Word Representations
• An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al., 2005.
Problems with SVD
• Exact SVD is expensive for large matrices (recall the O(min{dn², d²n}) bound above), which motivates directly learned word vectors such as Word2Vec (next).


Word2Vec
• Stores each word as a point in space, where it is represented by a vector of a fixed number of dimensions (generally 300).
  – “Hello”: [0.4, -0.11, 0.55, 0.3 . . . 0.1, 0.02]
• Two model architectures: CBOW (predict a word from its context) and Skip-gram (predict context words from a word).
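A minimal training sketch using the gensim library (the package, toy corpus, and hyperparameters are assumptions of this example, not part of the slides):

```python
from gensim.models import Word2Vec  # assumes gensim >= 4 is installed

# Toy corpus: one tokenized sentence per list.
sentences = [["i", "like", "deep", "learning"],
             ["i", "like", "nlp"],
             ["i", "enjoy", "flying"]]

# sg=1 selects Skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["nlp"][:5])            # first few dimensions of the vector
print(model.wv.most_similar("nlp"))   # nearest neighbors by cosine
```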

Analogy tasks
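The figure for this slide is not reproduced; the classic example is vector arithmetic such as king − man + woman ≈ queen. A hedged sketch with the gensim model above (a toy corpus will not actually produce good analogies; a large training corpus is required):

```python
# Requires a model trained on a large corpus to work well.
result = model.wv.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71)] with a well-trained model
```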



Applications of Word Vectors
1. Word Similarity
Classic Methods : Edit Distance, WordNet, Porter’s Stemmer, Lemmatization using dictionaries
• Word vectors easily identify similar words and synonyms, since they occur in similar contexts.
  – Handles stemming (thought -> think), inflections, tense forms
  – e.g., think, thought, ponder, pondering
  – e.g., plane, aircraft, flight



Applications of Word Vectors
2. Machine Translation
Classic Methods : Rule-based machine translation, morphological
transformation



Applications of Word Vectors
3. Part-of-Speech and Named Entity Recognition
Classic Methods : Sequential Models (MEMM, Conditional Random Fields), Logistic Regression



Applications of Word Vectors
4. Relation Extraction
Classic Methods : OpenIE, Linear Programming models, Bootstrapping



Applications of Word Vectors
5. Sentiment Analysis
Classic Methods : Naive Bayes, Random Forests/SVM
• Classifying sentences as positive and negative
• Building sentiment lexicons using seed sentiment sets
• No need for classifiers: we can just use cosine distances to compare unseen reviews to known reviews.



Applications of Word Vectors
6. Co-reference Resolution
• Chaining entity mentions across multiple documents – can we find and unify the multiple contexts in which mentions occur?
7. Clustering
• Words in the same class naturally occur in similar contexts, and this feature vector can directly be used with any conventional clustering algorithm (K-Means, agglomerative, etc.). Humans don’t have to waste time hand-picking useful word features to cluster on.
8. Semantic Analysis of Documents
• Build word distributions for various topics, etc.



Take-away Messages
• Indexing
– Binary Incidence Matrices
– Inverted Index
– Positional Inverted Index
• SVD is used for latent semantic indexing, while EM has a number of other applications.
• Latent semantic indexing handles the problems of synonymy and word co-occurrence, thereby helping us move from a word-based document representation to a concept-based one.



Further Reading
• Chapters 1, 2, and 5 of the Manning-Raghavan-Schuetze book
– http://nlp.stanford.edu/IR-book/
• Chapter 3 (Web Search and Information Retrieval) from Mining the
Web
– http://www.cse.iitb.ac.in/soumen/mining-the-web/
• As We May Think -- Vannevar Bush
– http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-
think/303881/



Further Reading
• Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S.
(1988), "Using latent semantic analysis to improve information
retrieval." In Proceedings of CHI'88: Conference on Human Factors
in Computing, New York: ACM, 281-285.
• Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and
Harshman, R.A. (1990) "Indexing by latent semantic analysis."
Journal of the Society for Information Science, 41(6), 391-407.
• Foltz, P. W. (1990) "Using Latent Semantic Indexing for
Information Filtering". In R. B. Allen (Ed.) Proceedings of the
Conference on Office Information Systems, Cambridge, MA, 40-47.
• Chapters 16 and 18 of the Manning-Raghavan-Schuetze book
– http://nlp.stanford.edu/IR-book/
• http://en.wikipedia.org/wiki/Singular_value_decomposition
• http://en.wikipedia.org/wiki/Latent_semantic_indexing

