[Figure: the information retrieval process — a user's information need is formulated as a query and sent to a search engine, which evaluates it against a document collection (e.g., professional documents, cultural documents) and returns query results; the user may then refine the query.]
Term-document incidence matrix (1 = term occurs in the play):

            Antony and  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar   Tempest
Antony          1          1        0        0       0        1
Brutus          1          1        0        1       0        0
Caesar          1          1        0        1       1        1
Calpurnia       0          1        0        0       0        0
Cleopatra       1          0        0        0       0        0
mercy           1          0        1        1       1        1
worser          1          0        1        1       1        0
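A Boolean query such as "Brutus AND Caesar AND NOT Calpurnia" can be answered directly from the 0/1 rows of the incidence matrix above; a minimal sketch (function name is illustrative):

```python
# Rows of the incidence matrix above (columns = the six plays).
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

def and_and_not(a, b, c):
    # a AND b AND NOT c, element-wise over the document columns
    return [x & y & (1 - z) for x, y, z in zip(a, b, c)]

hits = and_and_not(incidence["Brutus"], incidence["Caesar"],
                   incidence["Calpurnia"])
matches = [p for p, h in zip(plays, hits) if h]
print(matches)  # ['Antony and Cleopatra', 'Hamlet']
```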
[Figure: an inverted index — a dictionary of terms, each pointing to a postings list sorted by docID; each entry in a postings list is a posting.]
The best place for students to learn Applied Engineering 10 http://www.insofe.edu.in
Sec. 1.2
Inverted Index Construction
[Figure: documents pass through a tokenizer, then linguistic modules (normalization, stemming), then the indexer, which produces the inverted index, e.g. friend → 2, 4; roman → 1, 2; countryman → 13, 16.]
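The indexing step itself is a small algorithm: collect (term, docID) pairs and keep each term's postings sorted by docID. A minimal sketch (the toy documents are illustrative, not from the slide):

```python
from collections import defaultdict

# Hypothetical toy collection; docIDs chosen so "friend" -> 2, 4
# and "roman" -> 1, 2, echoing the figure.
docs = {1: "roman", 2: "roman friend", 4: "friend"}

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():        # crude tokenization
            index[term].add(doc_id)
    # postings sorted by docID, as in the figure
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(docs)
print(index["friend"])  # [2, 4]
```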
Initial Stages of Text Processing
• Tokenization
– Cut character sequence into word tokens
• Deal with cases like “John’s” and “state-of-the-art”
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
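The four stages above can be sketched as one small pipeline. This is a deliberately toy version: the period-stripping rule and the suffix-stripping "stemmer" are illustrative stand-ins for real normalization and stemming:

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}

def preprocess(text):
    tokens = text.split()                                  # tokenization (crude)
    tokens = [t.lower().strip(".,;:!?") for t in tokens]   # case-folding, punctuation
    tokens = [t.replace(".", "") for t in tokens]          # U.S.A. -> usa (toy rule)
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    tokens = [re.sub(r"(ization|ize)$", "", t)             # toy stemmer so that
              for t in tokens]                             # authorize ~ authorization
    return tokens

result = preprocess("The U.S.A. wants to authorize the authorization")
print(result)  # ['usa', 'wants', 'author', 'author']
```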
Postings lists, sorted by docID (one list per dictionary term):

Brutus    → 2, 4, 8, 16, 32, 64, 128
Caesar    → 1, 2, 3, 5, 8, 13, 21, 34
Calpurnia → 13, 16
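Answering "Brutus AND Caesar" over these lists is the classic linear-time merge of two sorted postings lists; a sketch:

```python
def intersect(p1, p2):
    # two-pointer merge of postings lists sorted by docID
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
common = intersect(brutus, caesar)
print(common)  # [2, 8]
```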
Without the docs, we cannot verify that the docs matching the above
Boolean query do contain the phrase.
• Bi-word indexes are not the standard solution (for all bi-words) but
can be part of a compound strategy
Positional index entry for “be” (overall frequency 993427; each docID lists the positions at which “be” occurs):

<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434; ...>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

– be: 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
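A phrase query like “to be” is answered by intersecting positional postings: a document matches if some position of the second word is exactly one past a position of the first. A sketch — the “be” postings follow the slide, while the “to” positions are hypothetical, invented so one document matches:

```python
# "to" positions are hypothetical illustration; "be" follows the slide.
to_postings = {1: [4, 20], 4: [16, 190, 429, 433], 5: [6]}
be_postings = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}

def phrase_docs(first, second):
    # doc matches if some position in `second` is one past a position in `first`
    hits = []
    for doc in first.keys() & second.keys():
        positions = set(second[doc])
        if any(p + 1 in positions for p in first[doc]):
            hits.append(doc)
    return sorted(hits)

print(phrase_docs(to_postings, be_postings))  # [4]
```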
Tf-idf weight matrix:

            Antony and  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar   Tempest
Antony         5.25       3.18     0        0       0        0.35
Brutus         1.21       6.1      0        1       0        0
Caesar         8.59       2.54     0        1.51    0.25     0
Calpurnia      0          1.54     0        0       0        0
Cleopatra      2.85       0        0        0       0        0
mercy          1.51       0        1.9      0.12    5.25     0.88
worser         1.37       0        0.11     4.15    0.25     1.95
One hot encoding
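One-hot encoding maps each vocabulary word to a vector with a single 1 at that word's index; a minimal sketch (vocabulary and ordering are illustrative):

```python
# Hypothetical vocabulary; index order determines the position of the 1.
vocab = ["antony", "brutus", "caesar", "calpurnia"]

def one_hot(word, vocab):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("caesar", vocab))  # [0, 0, 1, 0]
```

Note the drawback this section builds toward: every pair of distinct one-hot vectors is orthogonal, so the representation carries no notion of similarity between words.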
Document Matrices
[Figure: a document matrix with n documents as rows and d terms (e.g., theorem, proof) as columns.]
SVD Decomposition
Truncated SVD
Deficiencies with Conventional Automatic Indexing
• Synonymy: Various words and phrases refer to the same concept (lowers recall)
• Polysemy: Individual words have more than one meaning (lowers precision)
• Independence: No significance is given to two terms that frequently appear
together
• Latent semantic indexing addresses the first of these problems (synonymy) and the third (term dependence)
Synonymy: related documents may nevertheless have a small cosine similarity.
Polysemy: documents sharing an ambiguous word may have a large cosine similarity without being truly related.
Latent Semantic Analysis
• Bag of Words methods: A document is only
"about" the words in it
• But people interpret documents in a richer
context
– a document is about some domain and concepts in it
– reflected in the vocabulary but not limited to it
• LSI: developed at Bellcore (now Telcordia) in the
late 1980s (1988). It was patented in 1989.
– Aim: Replace indexes that use sets of words by indexes that use
concepts
– Latent: Not visible on the surface
– Semantic: Word meanings
Word Co-Occurrences
• Bag of words approaches assume meaning is
carried by vocabulary
• Next step is to look at vocabulary groups; what
words tend to occur together?
• Still a statistical approach, but richer
representation than single terms
• E.g., looking for articles about Tiger Woods in an AP newswire database could bring up stories about the golfer, followed by articles about golf tournaments that don't mention his name.
– So we need to recognize that Tiger Woods is about
golf.
Problem: Very High Dimensionality
• A vector of TF*IDF representing a
document is high dimensional.
• Need some way to trim words looked at
– First, throw away anything “not useful” (stop
words)
– Second, identify clusters and pick
representative terms
• Use SVD
Recall: Term-document matrix
            Anthony and  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra    Caesar   Tempest
anthony        5.25        3.18     0.0      0.0     0.0      0.35
brutus         1.21        6.10     0.0      1.0     0.0      0.0
caesar         8.59        2.54     0.0      1.51    0.25     0.0
calpurnia      0.0         1.54     0.0      0.0     0.0      0.0
cleopatra      2.85        0.0      0.0      0.0     0.0      0.0
mercy          1.51        0.0      1.90     0.12    5.25     0.88
worser         1.37        0.0      0.11     4.15    0.25     1.95
...
Example of C = UΣV^T: All Four Matrices
Dimensionality Reduction
• Key property: Each singular value tells us how
important its dimension is.
• By setting less important dimensions to zero, we
keep the important information, but get rid of the
“details”.
• These details may
– be noise – in that case, reduced LSI is a better
representation because it is less noisy
– make things dissimilar that should be similar –
again reduced LSI is a better representation
because it represents similarity better.
• Fewer details are better
Reducing the Dimensionality to 2
• Actually, we only zero out singular values in Σ. This has the effect of setting the corresponding dimensions in U and V^T to zero when computing the product C = UΣV^T.
• Compute C2 = UΣ2V^T
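The procedure just described — full SVD, zero out all but the top-k singular values, recompose — can be sketched with NumPy (the matrix here is a toy, not the slide's full C):

```python
import numpy as np

# Toy term-document matrix (illustrative values).
C = np.array([[5.25, 3.18, 0.00, 0.00],
              [1.21, 6.10, 0.00, 1.00],
              [8.59, 2.54, 0.00, 1.51],
              [0.00, 1.54, 0.00, 0.00]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)  # s sorted descending
k = 2
s_k = s.copy()
s_k[k:] = 0.0                    # zero out all but the top-k singular values
C_k = U @ np.diag(s_k) @ Vt      # rank-k approximation C_k = U Σ_k V^T

rank = np.linalg.matrix_rank(C_k)   # at most k
err = np.linalg.norm(C - C_k)       # reconstruction error from dropped values
```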
Original Matrix C vs. Reduced C2 = UΣ2V^T
We can view C2 as a two-dimensional representation of the matrix: we have performed a dimensionality reduction to two dimensions.
Why is the LSI-Reduced Matrix “Better”?
• Similarity of d2 and d3 in the original space: 0.
• Similarity of d2 and d3 in the reduced space:
  0.52 × 0.28 + 0.36 × 0.16 + 0.72 × 0.36 + 0.12 × 0.20 + (−0.39) × (−0.08) ≈ 0.52
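The slide's arithmetic is just a dot product of the two reduced document vectors; recomputing it:

```python
# Reduced-space coordinates of d2 and d3, from the slide's computation.
d2 = [0.52, 0.36, 0.72, 0.12, -0.39]
d3 = [0.28, 0.16, 0.36, 0.20, -0.08]

sim = sum(x * y for x, y in zip(d2, d3))
print(round(sim, 2))  # 0.52
```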
CBOW
Skipgram
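The difference between the two architectures is visible in the training pairs they extract: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its context. A sketch on a toy sentence with window size 1 (sentence and window are illustrative; this generates pairs only, no training):

```python
sentence = ["the", "cat", "sat", "down"]
window = 1

skipgram_pairs = []   # (center word -> one context word)
cbow_pairs = []       # (all context words -> center word)
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window),
                              min(len(sentence), i + window + 1))
               if j != i]
    for c in context:
        skipgram_pairs.append((center, c))
    cbow_pairs.append((tuple(context), center))

print(skipgram_pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```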
Analogy tasks
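The standard analogy test solves "king − man + woman ≈ ?" by vector arithmetic followed by a nearest-neighbor search under cosine similarity. A sketch with hand-made 2-d toy vectors (not trained embeddings):

```python
import math

# Hypothetical toy vectors, constructed so the analogy works out.
vecs = {"king":  [0.9, 0.8], "queen": [0.9, 0.2],
        "man":   [0.1, 0.8], "woman": [0.1, 0.2],
        "apple": [0.5, 0.5]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, component-wise
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# nearest word to the target, excluding the query words themselves
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```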