Textmining
ANALYTICS
(MS6840)
Introduction to
Text mining
[Table: term counts per document (Documents 2 to 6); column headers not recoverable]
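The rows above appear to come from a document-term matrix, in which each row is a document and each column counts how often one term occurs in it. A minimal sketch of building such a matrix follows, assuming simple whitespace tokenization; the toy corpus and the function name `document_term_matrix` are invented for illustration and are not from the slides.

```python
from collections import Counter

# Hypothetical toy corpus; the document texts are invented for illustration.
docs = {
    "Document 1": "text mining finds patterns in text",
    "Document 2": "mining data",
    "Document 3": "text text text data",
}

def document_term_matrix(docs):
    """Return (sorted vocabulary, {document name: list of term counts})."""
    # Count the lowercase, whitespace-separated tokens of each document.
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    # The vocabulary is every distinct term seen in any document.
    vocab = sorted(set().union(*counts.values()))
    # One row per document, one column per vocabulary term (0 if absent).
    matrix = {name: [c[term] for term in vocab] for name, c in counts.items()}
    return vocab, matrix

vocab, matrix = document_term_matrix(docs)
print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Most cells are zero because each document uses only a small part of the vocabulary, which is why such matrices are usually stored in sparse form.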
Sentiment analysis
• Also known as opinion mining
• Words carry emotions
• Use sentiment datasets (lexicons) to score or classify words
  • E.g., AFINN (scores from -5 to +5), Bing (positive, negative), NRC (positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust)
• Approaches:
  • Word by word, part-of-speech (POS) tagging
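The word-by-word approach can be sketched as a lexicon lookup followed by a sum. The example below assumes an AFINN-style lexicon with integer scores from -5 to +5; the words and scores in `TOY_LEXICON` are made up for illustration and are not the real AFINN values, and the function names are hypothetical.

```python
import re

# Hypothetical AFINN-style lexicon: integer scores from -5 to +5.
# These values are invented for illustration, not taken from the real AFINN list.
TOY_LEXICON = {
    "good": 3, "great": 4, "love": 3,
    "bad": -3, "terrible": -4, "boring": -2,
}

def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the words in text; unknown words score 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

def classify(text, lexicon=TOY_LEXICON):
    """Label text as positive, negative, or neutral from the sign of its score."""
    score = sentiment_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ["A great movie with a boring ending", "Terrible, bad acting"]:
    print(sentence, sentiment_score(sentence), classify(sentence))
```

Because each word is scored in isolation, a phrase such as "not good" still scores as positive here. Handling negation needs context beyond single words.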
The case of slumdog
• http://www.youtube.com/watch?v=AIzbwV7on6Q
• http://www.youtube.com/watch?v=LenAIw95L-s
Topic modeling
• An unsupervised document classification technique, similar to clustering
• Latent Dirichlet Allocation (LDA) is a probabilistic approach to topic modeling
  • It treats each document as a mixture of topics, and each topic as a mixture of words
  • Each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
  • A two-topic model of news might have one topic for “politics” (PM, parliament, budget) and one for “entertainment” (movies, dance, music). Words can be shared between topics.
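The mixture idea can be made concrete with a little arithmetic: the probability of a word in a document is the sum, over topics, of the document's share of that topic times the topic's probability for that word. The sketch below illustrates only this mixture assumption, not how LDA estimates the proportions from data. The topic-word probabilities are invented, and mapping topic A to "politics" and topic B to "entertainment" is an assumption made for the example.

```python
# Hypothetical topic-word probabilities; each topic's values sum to 1.
# "budget" appears in both topics to show that words can be shared.
TOPICS = {
    "politics": {"parliament": 0.5, "budget": 0.5},
    "entertainment": {"movies": 0.4, "music": 0.4, "budget": 0.2},
}

# Document-topic proportions from the two-topic example
# (assuming topic A = politics and topic B = entertainment).
DOC_TOPICS = {
    "Document 1": {"politics": 0.9, "entertainment": 0.1},
    "Document 2": {"politics": 0.3, "entertainment": 0.7},
}

def word_distribution(doc_topics, topics):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    dist = {}
    for topic, weight in doc_topics.items():
        for word, prob in topics[topic].items():
            dist[word] = dist.get(word, 0.0) + weight * prob
    return dist

for name, mix in DOC_TOPICS.items():
    dist = word_distribution(mix, TOPICS)
    print(name, {word: round(p, 2) for word, p in sorted(dist.items())})
```

In this toy setup, Document 1 draws most of its probability mass from political words and Document 2 from entertainment words, while the shared word "budget" receives mass from both topics.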
Parts of speech parse tree
Language comprehension: cognitive psychology (Hunt & Ellis, 2004)
• Functions of language
  • Speech act
    • Question, command, request, etc.
  • Propositional content
    • The ideas or thoughts expressed in one sentence
  • Thematic structure
    • The theme of a speech in its context
• Language structure
  • Phonemes (basic sound units, such as a vowel) and morphemes (smallest units of meaning, such as a word or affix)
• Linguistic analyses
  • Lexical (word-level meaning)
  • Syntactic (structure connecting words)
  • Semantic (meaning as a whole, theme)
Web mining
• Data mining efforts on the web, known as web mining, fall into three categories:
• Content mining
  • Mining the actual content of web pages, covering text, graphics, and videos
• Structure mining
  • Intra-page (tags) and inter-page (hyperlinks) structure
• Usage mining
  • Web logs that describe patterns of web use: IP addresses, page references, time stamps
• User profiling
  • Users’ demographic information
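As an illustration of usage mining, the sketch below pulls the three fields named above (IP address, page reference, time stamp) out of web server log lines and summarizes them. It assumes the logs are in Common Log Format; the sample lines, addresses, and page names are invented, and `parse_log` is a hypothetical helper.

```python
import re
from collections import Counter

# Hypothetical log lines in Common Log Format; addresses and paths are invented.
LOG = """\
192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.2 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 512
192.0.2.1 - - [10/Oct/2023:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 512
"""

# Captures the IP address, time stamp, request method, page, and status code.
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_log(text):
    """Return one dict per log line that matches the expected format."""
    records = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            records.append(match.groupdict())
    return records

records = parse_log(LOG)

# How often each page was requested.
page_hits = Counter(record["page"] for record in records)

# The sequence of pages requested by each IP address, in log order.
pages_by_ip = {}
for record in records:
    pages_by_ip.setdefault(record["ip"], []).append(record["page"])

print(page_hits.most_common())
print(pages_by_ip)
```

Per-address page sequences like `pages_by_ip` are a common starting point for finding navigation patterns, although one IP address does not always correspond to one user.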