An introduction to
Web Mining
The starting point:
Key notions of Information Retrieval
Representation, storage, organization of, and access to information items
Focus is on the user information need
Information retrieval
information about a subject or topic
semantics is frequently loose
small errors are tolerated
IR system:
interpret contents of information items
generate a ranking which reflects relevance
notion of relevance is most important
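To make "generate a ranking which reflects relevance" concrete, here is a minimal sketch using TF-IDF term weighting. TF-IDF is only one common relevance model and is an assumption here; the slides do not name a specific one. The function name `rank` and the sample documents are illustrative.

```python
import math
from collections import Counter

def rank(query, docs):
    """Return (doc_index, score) pairs sorted by descending TF-IDF score."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        # Sum tf * idf over the query terms that occur in this document.
        score = sum(tf[t] * math.log(n / df[t])
                    for t in query.lower().split() if t in tf)
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Illustrative documents (not from the slides).
docs = [
    "web mining applies data mining to the web",
    "information retrieval ranks documents by relevance",
    "cooking pasta at home",
]
ranking = rank("web mining", docs)
```

The first entry of `ranking` is the document judged most relevant; documents sharing no term with the query score zero.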
Retrieval
Database
Browsing
New challenges
IR and KD
Information Retrieval
Knowledge Discovery*
*(better term for data mining)
patterns (knowledge)
Conceptually:
Pragmatically:
Web Mining
Knowledge discovery (aka data mining):
the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Web Mining:
the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources.
Navigation, queries,
content access & creation
CRISP-DM
CRoss Industry Standard Process for Data Mining
A data mining process model that describes commonly used approaches that expert data miners use to tackle problems.
Global patterns
Description
Clustering
K-means, EM, hierarchical clustering, ...
Hidden Markov Models
Link patterns (e.g., citation analysis à la Google)
Prediction
Classification
Bayes techniques, Decision trees, Support Vector Machines, ...
Regression
Time series analysis
Local patterns
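As one concrete instance of the clustering techniques listed above, the following is a minimal K-means sketch in plain Python. The initialisation (taking the first k points as starting centroids) and the sample points are assumptions chosen to keep the example deterministic; practical implementations usually use random or k-means++ initialisation.

```python
def kmeans(points, k, iters=20):
    """Cluster a list of numeric tuples into k groups (Lloyd's algorithm)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = []
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments no longer change
        centroids = new_centroids
    return centroids, clusters

# Illustrative 2-D points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

K-means is a global, descriptive method: it summarises the whole dataset by k centroids rather than predicting a target attribute.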
http://www.cs.washington.edu/research/textrunner/
http://quest.sandbox.yahoo.net
co-occurrence
at document level (see before) or sentence level
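The difference between the two levels can be sketched as follows. The tokenisation (lowercase alphabetic runs) and the sentence splitter (a simple regular expression) are simplifying assumptions, and the sample text is illustrative.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(text_units):
    """Count how many text units each unordered pair of terms shares."""
    counts = Counter()
    for unit in text_units:
        terms = sorted(set(re.findall(r"[a-z]+", unit.lower())))
        counts.update(combinations(terms, 2))
    return counts

doc = "Google ranks pages. Yahoo indexes pages."

# Document level: the whole document is one unit.
doc_level = cooccurrence([doc])

# Sentence level: each sentence is its own unit.
sentences = [s for s in re.split(r"[.!?]\s*", doc) if s]
sent_level = cooccurrence(sentences)
```

Here "google" and "yahoo" co-occur at document level but not at sentence level, which shows why the sentence level yields fewer but more specific associations.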
Examples:
Textrunner
Yahoo! Quest
KnowItAll (http://www.cs.washington.edu/research/knowitall/)
WolframAlpha
Blogs
...
www.blogpulse.com
Hyperlinks
PageRank
Blog networks
Viral marketing?
Opinion leadership?
...
Roots in
Bibliometrics
...
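The PageRank idea mentioned above can be sketched as a power iteration over a link graph. This is a simplified textbook version, not the production algorithm of any search engine; the damping factor 0.85, the iteration count, and the tiny example graph are assumptions.

```python
def pagerank(links, damping=0.85, iters=50):
    """links maps each page to the list of pages it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Every page receives a baseline share (the "random jump").
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links.get(v, [])
            if out:
                # A page passes its rank on in equal parts to its link targets.
                share = damping * rank[v] / len(out)
                for target in out:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[v] / n
        rank = new_rank
    return rank

# Illustrative link graph: page "c" is linked to by three other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

As in citation analysis, a page gains rank by being linked to from pages that themselves have rank, so "c" ends up ranked highest in this graph.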
E-commerce questions
Google Analytics
Search-engine questions
Visualize the Conversion Funnel
Customized Reporting (define your own metrics)
New challenges
Who is this?
(Sample from a search-query log)
Result
(a re-identified person)
Anonymized?
In Massachusetts, the Group Insurance Commission (GIC) is responsible for
purchasing health insurance for state employees. GIC collected patient-specific
data with nearly one hundred attributes per encounter along the lines of those
shown in the leftmost circle of Figure 1 [...] Because the data were believed to be
anonymous, GIC gave a copy of the data to researchers and sold a copy to
industry.
For twenty dollars I purchased the voter registration list for Cambridge
Massachusetts [...] The rightmost circle in Figure 1 shows that these data
included the name, address, ZIP code, birth date, and gender of each voter.
This information can be linked using ZIP code,
birth date and gender to the medical
information, thereby linking diagnosis,
procedures, and medications to particularly
named individuals.
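The linkage described above amounts to a join on the shared attributes. The sketch below uses entirely made-up records; the field names, values, and the function name `link_records` are hypothetical and serve only to illustrate the mechanism.

```python
def link_records(medical, voters, keys=("zip", "birth_date", "gender")):
    """Join the two datasets on the shared keys and keep unambiguous matches."""
    index = {}
    for voter in voters:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter)
    linked = []
    for record in medical:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:
            # Exactly one voter shares this combination: re-identified.
            linked.append((candidates[0]["name"], record["diagnosis"]))
    return linked

# Fabricated example data (no real persons or diagnoses).
medical = [
    {"zip": "00001", "birth_date": "1900-01-01", "gender": "F", "diagnosis": "diagnosis-A"},
    {"zip": "00002", "birth_date": "1900-02-02", "gender": "M", "diagnosis": "diagnosis-B"},
]
voters = [
    {"name": "Voter One", "zip": "00001", "birth_date": "1900-01-01", "gender": "F"},
    {"name": "Voter Two", "zip": "00002", "birth_date": "1900-02-02", "gender": "M"},
    {"name": "Voter Three", "zip": "00002", "birth_date": "1900-02-02", "gender": "M"},
]
linked = link_records(medical, voters)
```

Only the first medical record is linked to a name. The second matches two voters and therefore stays ambiguous.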
Given a target user t from the forum users, find similar users (in
terms of which items they related to) in the ratings dataset
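A minimal sketch of this matching step, assuming Jaccard similarity between item sets as the similarity measure. The slides do not state which measure was used, so this is a stand-in; the user IDs and item names are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two item sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(target_items, ratings, top_n=3):
    """Rank users of the ratings dataset by item overlap with the target user."""
    scored = [(user, jaccard(target_items, items)) for user, items in ratings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Items the target forum user t mentioned (hypothetical).
forum_mentions = {"movie1", "movie2", "movie3"}

# Items each pseudonymous user rated (hypothetical).
ratings = {
    "r1": {"movie1", "movie2", "movie3", "movie4"},
    "r2": {"movie5"},
    "r3": {"movie2", "movie6"},
}
result = most_similar(forum_mentions, ratings)
```

The top-ranked ratings user is the best candidate for being the same person as the forum user t.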
Evaluate:
Results
k-anonymity (Sweeney)
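A table is k-anonymous with respect to a set of quasi-identifiers if every combination of quasi-identifier values that occurs in it is shared by at least k records. The check can be sketched as below; the generalised values (truncated ZIP codes, birth decades) and the function name `is_k_anonymous` are illustrative assumptions.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Fabricated, already generalised records.
records = [
    {"zip": "0213*", "birth_year": "195*", "gender": "F", "diagnosis": "A"},
    {"zip": "0213*", "birth_year": "195*", "gender": "F", "diagnosis": "B"},
    {"zip": "0214*", "birth_year": "196*", "gender": "M", "diagnosis": "C"},
    {"zip": "0214*", "birth_year": "196*", "gender": "M", "diagnosis": "A"},
]
quasi_identifiers = ("zip", "birth_year", "gender")
two_anonymous = is_k_anonymous(records, quasi_identifiers, 2)
three_anonymous = is_k_anonymous(records, quasi_identifiers, 3)
```

With these records the table is 2-anonymous but not 3-anonymous, since each group contains exactly two records.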
Thanks! Questions?