BUSINESS INTELLIGENCE & ANALYTICS
(MS6840)

Introduction to Text Mining

Saji K Mathew, PhD
Professor, Department of Management Studies
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
Evolution

[Timeline figure: Rise of internet (1990-1995) → E-business model (1996-2000) and User modeling (1996-2000) → Web mining (since 2000), Web personalization (post 2000), and Targeted marketing (since 2000)]
Overview
} Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
} Text mining – first impose structure on the data, then mine the structured data
} Related disciplines: NLP (computer science), linguistics, cognitive psychology
Foundations
} Do worry about the philosophy of language
} A text without context is a pretext!
} Mental representation of language, and its expression in written form
1. Rationalist approach
} Language is formed in the mind not by the senses but is fixed in advance, presumably by genetic inheritance (Chomsky, 1986)
2. Empiricist approach
} Language learning is dominated by sensory inputs

Statistical NLP belongs to the second school

Natural Language Processing
} The meaning of a word is defined by the circumstances of its use (Wittgenstein, 1968)
} Linguistic analyses
} Lexical (word-level meaning)
} Syntactic (structure connecting words)
} Semantic (meaning as a whole, theme)

“The vodka was good, but the meat was rotten” – reputedly a machine translation of “The spirit is willing, but the flesh is weak”, a classic illustration of failure at the semantic level
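As a small illustration of the lexical and syntactic levels, a minimal sketch using NLTK (an assumption; the slides do not prescribe a toolkit, and the tokenizer/tagger models must be downloaded once):

```python
import nltk

# One-time model downloads (names as in recent NLTK releases):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "The vodka was good, but the meat was rotten"

tokens = nltk.word_tokenize(sentence)  # lexical level: individual words
tagged = nltk.pos_tag(tokens)          # syntactic level: part-of-speech tags

print(tagged)
# [('The', 'DT'), ('vodka', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ...]
```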


Text Mining Process
Corpus preparation
} Token: a string of contiguous alphanumeric characters with space on either side
} Words, punctuation, numbers
} Tokenization: identification of tokens in a text
} Document: a sequence of N words denoted by w = (w1, w2, …, wN), where wn is the nth word in the sequence
} Corpus: a collection of M documents denoted by D = {w1, w2, …, wM}
} Stemming (lemmatization): reduces the different forms of a word to one stem (normalization)
} Go, gone, going → go
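A minimal corpus-preparation sketch in Python using NLTK (an assumption; any tokenizer/lemmatizer would do). The WordNet lemmatizer maps go/gone/going onto the single base form go:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads: nltk.download("punkt"); nltk.download("wordnet")

document = "Go, gone, going: the analyst has gone to the client site."

# Tokenization: split the raw text into tokens (words, punctuation, numbers)
tokens = nltk.word_tokenize(document.lower())

# Keep alphanumeric tokens only (drop punctuation)
words = [t for t in tokens if t.isalnum()]

# Normalization: lemmatize as verbs so "gone" and "going" map to "go"
lemmatizer = WordNetLemmatizer()
normalized = [lemmatizer.lemmatize(w, pos="v") for w in words]

print(normalized)  # ['go', 'go', 'go', 'the', 'analyst', 'have', 'go', ...]
```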
Term–by–Document Matrix (TDM)

[Sample TDM figure: rows are Documents 1–6, columns are terms such as “investment risk”, “project management”, “software engineering”, “development”, “SAP”, …; each cell holds the count of the term in that document, and most cells are empty/zero]

Terms: words, n-grams (sequences of n words considered together)
Features: groups of terms
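A sketch of building such a matrix with scikit-learn's CountVectorizer (an assumption; get_feature_names_out requires scikit-learn 1.0 or later). Setting ngram_range to include bigrams lets multi-word terms such as "investment risk" become columns:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "investment risk in the project",
    "software engineering for SAP development",
    "project management and investment risk",
]

# Count unigrams and bigrams; each column of the matrix is one term
vectorizer = CountVectorizer(ngram_range=(1, 2))
tdm = vectorizer.fit_transform(documents)  # sparse document-by-term matrix

print(vectorizer.get_feature_names_out())  # the terms (columns)
print(tdm.toarray())                       # term counts per document (rows)
```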
Descriptive analysis
} Word counts, term frequency (tf), inverse document frequency (idf)
} Similarity between documents
} Correlation between words/n-grams
} Graphical analysis of text structures
} Word counts, word clouds, associations, sentiments…
Importance of words
Two approaches: (a) inclusion/exclusion using stop-words; (b) quantification using tf-idf
} Stop-words: words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. Stop-words can be added to or removed from source lexicons
} Quantifying what a document is about: tf × idf
Term frequency (tf):
tf(term, document) = (count of term in document) / (total words in document)
Inverse document frequency (idf):
idf(term) = ln(n_documents / n_documents containing term)

The statistic tf-idf = tf × idf is intended to measure how important a word is to a document in a collection (or corpus) of documents
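A minimal sketch computing tf-idf exactly as defined above, in plain Python (note that library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing):

```python
import math

documents = [
    "the vodka was good but the meat was rotten".split(),
    "the investment carries high investment risk".split(),
    "the software development project".split(),
]

def tf(term, doc):
    # term frequency: count of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency, as on the slide: ln(N / n_containing)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("investment", documents[1], documents))  # ~0.37: distinctive word
print(tf_idf("the", documents[1], documents))         # 0.0: idf = ln(1) = 0
```

A word that occurs in every document gets idf = ln(1) = 0, so its tf-idf vanishes no matter how frequent it is.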
Relationship between words
} Co-occurrence within documents
} Words that tend to co-occur within documents, not necessarily in sequence
} Words that co-occur with a given word in a document
} [Filter out uninteresting (stop) words]
} Co-occurrence of word pairs
} Bigram counts
} Pairwise correlation (phi coefficient): how often two words appear together relative to how often they appear separately, i.e. how much more likely it is that both words X and Y appear, or neither does, than that one appears without the other (see the sketch below)
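A minimal sketch of the phi coefficient for a word pair, in plain Python (the toy corpus is an illustrative assumption). With n11 = documents containing both words, n00 = neither, and n10/n01 = one without the other, phi = (n11·n00 - n10·n01) / sqrt((n11+n10)(n01+n00)(n11+n01)(n10+n00)):

```python
import math

# Each document represented as its set of (stop-word-filtered) words
documents = [
    {"investment", "risk", "project"},
    {"investment", "risk"},
    {"software", "development"},
    {"project", "development"},
]

def phi(x, y, docs):
    n11 = sum(1 for d in docs if x in d and y in d)      # both appear
    n10 = sum(1 for d in docs if x in d and y not in d)  # x only
    n01 = sum(1 for d in docs if x not in d and y in d)  # y only
    n00 = len(docs) - n11 - n10 - n01                    # neither appears
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

print(phi("investment", "risk", documents))         # 1.0: always together
print(phi("investment", "development", documents))  # -1.0: never together
```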
Cosine similarity
} Term frequency vectors are usually sparse (mostly zero elements)
} Zero values shared by two vectors (documents) are not informative; the non-zero values they have in common matter
} Cosine similarity measures the similarity between two documents as the cosine of the angle between their term vectors: cos θ = (A · B) / (‖A‖ ‖B‖)
} 0 means 90º (orthogonal, no similarity); 1 means 0º (full similarity)
[Same sample TDM as before: Documents 1–6 by terms such as “investment risk”, “project management”, “software engineering”, “development”, “SAP”; each document row is a term-frequency vector]
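A minimal cosine-similarity sketch using NumPy (an assumption), treating each document row of the TDM as a term-frequency vector:

```python
import numpy as np

# Term-frequency vectors of two documents (one entry per term)
doc1 = np.array([1, 1, 0, 0, 0])
doc2 = np.array([3, 0, 0, 1, 0])

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(doc1, doc2))  # ~0.67: partial overlap in terms
print(cosine_similarity(doc1, doc1))  # 1.0: a document fully matches itself
```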
Sentiment analysis
} a.k.a. opinion mining
} Words carry emotions
} Use sentiment lexicons to score/classify words
} e.g., AFINN (-5 to +5), Bing (positive/negative), NRC (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust)
} Approaches:
} Word-by-word scoring, POS tagging (see the sketch below)
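A minimal word-by-word scoring sketch with a tiny AFINN-style lexicon (the entries below are illustrative assumptions, not the actual AFINN scores):

```python
# Tiny AFINN-style lexicon: word -> score in [-5, +5] (illustrative values)
lexicon = {"good": 3, "great": 3, "bad": -3, "rotten": -3}

def sentiment_score(text):
    words = text.lower().replace(",", "").split()
    # Word by word: sum the scores of the words found in the lexicon
    return sum(lexicon.get(w, 0) for w in words)

print(sentiment_score("The vodka was good, but the meat was rotten"))  # 3 - 3 = 0
```

The net score of zero for the mixed sentence shows a limit of word-by-word scoring: contrast and negation are invisible at the lexical level, which is where POS tagging and deeper analysis can help.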
The case of Slumdog
} http://www.youtube.com/watch?v=AIzbwV7on6Q
} http://www.youtube.com/watch?v=LenAIw95L-s
Topic modeling
} Unsupervised document classification technique, similar to clustering
} Latent Dirichlet Allocation (LDA) is a probabilistic approach to topic modeling
} Treats each document as a mixture of topics, and each topic as a mixture of words
} Each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B”
} Example: a two-topic model of news, with one topic for “politics” (PM, parliament, budget) and one for “entertainment” (movies, dance, music). Here words can be shared between topics
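A minimal LDA sketch with scikit-learn (an assumption; gensim is another common choice), on a toy corpus mixing the “politics” and “entertainment” vocabularies above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "parliament debates the budget as the PM speaks",
    "the new movie features classical dance and music",
    "budget session of parliament opens with PM speech",
    "music award for the dance movie soundtrack",
]

# LDA works on raw term counts, not tf-idf
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic proportions

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-4:]]
    print(f"topic {k}:", top_words)     # each topic is a mixture of words

print(doc_topics.round(2))              # each document is a mixture of topics
```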
Parts of speech parse tree
[Figure: an example sentence parsed into a tree of part-of-speech constituents]
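As a stand-in for the figure, a minimal sketch using NLTK's Tree on the earlier example sentence (the bracketing is hand-written for illustration, not the output of a parser):

```python
from nltk import Tree

# Hand-written constituency parse (illustrative bracketing)
tree = Tree.fromstring(
    "(S (NP (DT The) (NN vodka)) (VP (VBD was) (ADJP (JJ good))))"
)
tree.pretty_print()  # renders the parse tree as ASCII art
```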
Language comprehension: cognitive psychology (Hunt & Ellis, 2004)
} Functions of language
} Speech act
} Question, command, request, etc.
} Propositional content
} Ideas, thoughts, etc. in one sentence
} Thematic structure
} Theme of a speech in a context
} Language structure
} Phonemes (basic sounds, e.g. vowels) and morphemes (smallest meaningful units: words and word parts)
} Linguistic analyses
} Lexical (word-level meaning)
} Syntactic (structure connecting words)
} Semantic (meaning as a whole, theme)
Web mining
} Data mining efforts on the web (web mining) fall into three categories:
} Content mining
} Mining the actual content of web pages: text, graphics and videos
} Structure mining
} Intra-page structure (tags) and inter-page structure (hyperlinks)
} Usage mining
} Web logs that describe patterns of use of the web: IP addresses, page references, time stamps (see the log-parsing sketch below)
} User profiling
} Users’ demographic information
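A minimal usage-mining sketch: extracting the IP address, time stamp, and page reference from one web-log line in Common Log Format (the sample line is an assumption; real log layouts vary):

```python
import re

# One line in Common Log Format: IP, identity, user, [timestamp], "request", status, bytes
log_line = ('192.168.1.10 - - [15/Mar/2024:10:12:45 +0530] '
            '"GET /products.html HTTP/1.1" 200 5120')

pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\d+)'
match = re.match(pattern, log_line)

if match:
    ip, timestamp, method, page, status, size = match.groups()
    # The fields usage mining typically starts from
    print(ip, timestamp, page)
```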
