

An introduction to
Web Mining

About me (an Information Retrieval view)

About me (a Data/Web Mining view)

The starting point:
Key notions of Information Retrieval
Representation, storage, organization of, and access to information items
Focus is on the user information need

User information need example:


Find all docs containing information on college tennis teams which: (1) are
maintained by a USA university and (2) participate in the NCAA
tournament.

Information retrieval
information about a subject or topic
semantics is frequently loose
small errors are tolerated
IR system:
interpret contents of information items
generate a ranking which reflects relevance
notion of relevance is most important
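The ranking idea can be sketched in a few lines of Python (a minimal illustration with made-up toy documents, not a real IR system): documents are scored by TF-IDF weights of the query terms, and the resulting order reflects relevance.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Rank documents by a simple TF-IDF relevance score for the query."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency of each term
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        # sum TF * IDF over the query terms present in the document
        score = sum(tf[t] * math.log(n / df[t])
                    for t in query.lower().split() if t in tf)
        scores.append((score, i))
    return [i for score, i in sorted(scores, reverse=True)]

docs = [
    "college tennis teams in the NCAA tournament",
    "cooking recipes for pasta",
    "university tennis and college sports",
]
ranking = tfidf_rank("college tennis NCAA", docs)
print(ranking)  # the first document is the most relevant
```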

Retrieval
Database
Browsing

Plan for this lecture:


From IR to Web Mining: some directions
Transition 1: different outputs

Transition 2: The material changes (1)


Specific looks at user-generated content
Transition 3: The questions change

Transition 4: The material changes (2)

New challenges

Transition 1: different outputs

The answer is 42.

A clustering example: still quite IR-ish

A clustering example with more information extraction

Going further: from clustering to ontology learning


IR and KD

Information Retrieval

Knowledge Discovery*
*(a better term for data mining)

IR and KD: Different ways of utilizing databases (DBs)


IR: retrieving the information from a DB that matches a user's information need
query (formal statement of information need)

object (an entity which stores information in a database)


KD: finding new knowledge about the real-world entities described in a DB
data/information (sometimes plus query)

patterns (knowledge)


IR and KD: confluences

Conceptually:

IR can be seen as a classification of objects into the classes


"relevant to the user's query" / "not relevant to the user's query"
(and classification is a typical KD task)

KD needs to extract the information from objects like documents,


in order to find new knowledge
(and information extraction is a typical IR task)

Pragmatically:

e.g. overlaps of topics and techniques in papers at SIGIR, SIGKDD


Web Mining
Knowledge discovery
(aka Data mining):
"the non-trivial process of
identifying valid, novel,
potentially useful, and
ultimately understandable
patterns in data" [1]
Web Mining:
the application of data mining
techniques to the content,
(hyperlink) structure, and usage
of Web resources
(usage: navigation, queries,
content access & creation)

Web mining areas:


Web content mining
Web structure mining
Web usage mining

What's different about Web mining (compared to
data mining in general)?

The data and the necessary data preparation steps

To some extent, the applicable techniques


The process part of knowledge discovery

CRISP-DM
CRoss-Industry Standard Process for Data Mining:
a data mining process model that describes approaches
expert data miners commonly use to tackle problems, in six
phases: business understanding, data understanding, data
preparation, modelling, evaluation, and deployment.


The structural/algorithmic part of knowledge discovery (modelling
in CRISP-DM): patterns, data mining tasks, methods (examples)

Global patterns

Description
Clustering
K-means, EM, hierarchical clustering, ...
Hidden Markov Models
Link patterns (e.g., citation analysis à la Google)
Prediction
Classification
Bayes techniques, Decision trees, Support Vector
Machines, ...
Regression
Time series analysis

Local patterns

Frequent itemsets, sequences, subgraphs


Apriori and methods derived from it
Association rules
Cliques (Web Communities)
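As a small illustration of the clustering task above, here is a plain-Python k-means sketch (toy 2-D data, no libraries assumed; real work would use an optimized implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-center assignment and mean update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: centers move to the cluster means
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers, clusters

# two well-separated toy groups
points = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.0, 9.9)]
centers, clusters = kmeans(points, 2)
print(clusters)
```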

Recall: from clustering to ontology learning


What is needed for that?


Basic steps of text mining


Towards "the answer is 42" ... well, let's start with:


"RFID research deals with technology or privacy" or
"antibiotics kill bacteria"


http://www.cs.washington.edu/research/textrunner/


http://quest.sandbox.yahoo.net


Towards "the answer is 42" ... well, let's start with:


"RFID research deals with technology or privacy" or
"antibiotics kill bacteria" (2)

Ontology learning comes in two flavours, based on:

co-occurrence
at document level (see before) or at sentence level

(more or less shallow) natural-language processing
tends to be at sentence level
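The co-occurrence flavour at sentence level can be sketched in a few lines (made-up toy sentences; real systems work on large corpora and use statistical association measures rather than raw counts):

```python
from itertools import combinations
from collections import Counter

sentences = [
    "rfid research deals with privacy",
    "rfid raises privacy concerns",
    "antibiotics kill bacteria",
]

# count how often each pair of terms co-occurs in the same sentence
cooc = Counter()
for s in sentences:
    terms = set(s.split())
    cooc.update(frozenset(p) for p in combinations(sorted(terms), 2))

# frequently co-occurring pairs suggest related concepts
print(cooc[frozenset({"rfid", "privacy"})])  # 2
```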


Open information extraction from the Web

Basic ideas of the second approach, which rests on shallow
linguistic processing:

Templates ("cities such as ..." gives instances of class city)

Generalised templates ("* such as *" gives classes and instances)

Semantic role labelling ("X kills bacteria" in different linguistic forms)

Query the Web for occurrences

exploit massive redundancy [and mutual constraints] as indicator of truth
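The template idea can be sketched with a regular expression (hypothetical toy text; real systems such as KnowItAll combine many patterns with Web-scale redundancy statistics):

```python
import re

text = ("He visited cities such as Paris, Berlin and Madrid. "
        "Antibiotics such as penicillin kill bacteria.")

# generalised template: '<class> such as <instance>(, <instance>)* (and <instance>)?'
pattern = re.compile(r"(\w+) such as ((?:\w+)(?:, \w+)*(?: and \w+)?)")

facts = {}
for cls, insts in pattern.findall(text):
    items = re.split(r", | and ", insts)
    facts.setdefault(cls.lower(), set()).update(items)

print(facts)  # classes mapped to their extracted instances
```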

Examples:

Textrunner

Yahoo! Quest

KnowItAll (http://www.cs.washington.edu/research/knowitall/)

Read the Web project (http://rtw.ml.cmu.edu/readtheweb.html)


Towards "the answer is 42" ... well, let's start with:


"RFID research deals with technology or privacy" or
"antibiotics kill bacteria" (3)

Note: of course, database/knowledge-base retrieval is another
option!

WolframAlpha

Linked Data Initiative

(not treated further here)


Who is José Zapatero?


Transition 2: The material changes (1)


New material: User-generated content


Product ratings and reviews

Blogs

Social network profiles

...


www.blogpulse.com


New material: network information

Hyperlinks

PageRank

Blog networks

Viral marketing?

Opinion leadership?

Social network relationships

...

Roots in

Social network analysis (sociology et al.)

Bibliometrics

...
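PageRank itself can be sketched as a short power iteration (toy three-page graph; damping factor 0.85 as in the original formulation; dangling pages spread their rank uniformly):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += d * share
            else:  # dangling page: distribute its rank over all pages
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
print(rank)  # page c, linked from both a and b, ranks highest
```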


New material: usage


The classic


New material: usage


Other examples

E-commerce questions

How do people utilize (or not) service options?

Which advertising campaign brings the most
visitors / customers?

E-commerce / information systems questions

(e.g. Google Analytics)

What do queries tell us about which content we
should inform about?

Search-engine questions

How can click-through behaviour aid
relevance assessments
re-ranking (learning to rank)
query recommendation

Personalization (based on explicit or implicit
features, e.g. gender prediction)

Example: Google Analytics


Advertising ROI

Visualize the
Conversion Funnel

Cross-Channel and Multimedia Tracking,
Benchmarking

Customized Reporting
(define your own
metrics)


Specific looks at user-generated content


Transition 3: The questions change

They talk about the meaning of life.

They are very depressed about it.


Transition 4: The material changes (2)


(More on the interface)


New challenges


Who is this?
(Sample from a search-query log)


Result
(a 1-identified person)


Anonymized?
In Massachusetts, the Group Insurance Commission (GIC) is responsible for
purchasing health insurance for state employees. GIC collected patient-specific
data with nearly one hundred attributes per encounter along the lines of those
shown in the leftmost circle of Figure 1 [...] Because the data were believed to be
anonymous, GIC gave a copy of the data to researchers and sold a copy to
industry.
For twenty dollars I purchased the voter registration list for Cambridge
Massachusetts [...] The rightmost circle in Figure 1 shows that these data
included the name, address, ZIP code, birth date, and gender of each voter.
This information can be linked using ZIP code,
birth date and gender to the medical
information, thereby linking diagnosis,
procedures, and medications to particularly
named individuals.
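The linkage described above can be illustrated with toy records (all values made up): joining the two datasets on the quasi-identifiers (ZIP, birth date, gender) re-attaches names to "anonymous" medical rows.

```python
# Toy, made-up records illustrating the linkage attack.
medical = [  # "anonymized": no names, but quasi-identifiers remain
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-01-02", "sex": "M", "diagnosis": "asthma"},
]
voters = [  # public voter roll: names plus the same quasi-identifiers
    {"name": "J. Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "R. Roe", "zip": "02140", "dob": "1970-03-03", "sex": "M"},
]

quasi = lambda r: (r["zip"], r["dob"], r["sex"])
index = {quasi(v): v["name"] for v in voters}

# join on (ZIP, birth date, gender): diagnoses are re-identified
reidentified = {index[quasi(m)]: m["diagnosis"]
                for m in medical if quasi(m) in index}
print(reidentified)  # {'J. Doe': 'hypertension'}
```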


Is this the same person?


Keeping identities apart: the basic setting

Paper published by the MovieLens team (collaborative-filtering
movie ratings), who were considering publishing a ratings
dataset; see http://movielens.umn.edu/

Public dataset: users mention films in forum posts

Private dataset (may be released, e.g., for research purposes):
users' ratings

Film IDs can easily be extracted from the posts

Observation: Every user will talk about items from a sparse
relation space (those generally few films s/he has seen)


Keeping identities apart: the computational problem

Given a target user t from the forum users, find similar users (in
terms of which items they related to) in the ratings dataset

Rank these users u by their likelihood of being t

Evaluate:

If t is in the top k of this list, then t is k-identified

Count percentage of users who are k-identified

E.g., measure likelihood by TF-IDF (m: item)
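This scoring can be sketched in Python (toy data; the paper's exact formula may differ; the sketch simply weights shared items by inverse document frequency, so rare films identify strongly):

```python
import math

def k_identify(mentioned, ratings_db, k=1):
    """Rank ratings users by summed IDF of items shared with the
    target's forum mentions; return the top-k candidates."""
    n = len(ratings_db)
    # df of item m: in how many users' ratings it appears
    df = {}
    for items in ratings_db.values():
        for m in items:
            df[m] = df.get(m, 0) + 1

    def score(items):
        # rare items (low df) are strong identity signals
        return sum(math.log(n / df[m]) for m in mentioned if m in items)

    ranked = sorted(ratings_db, key=lambda u: score(ratings_db[u]), reverse=True)
    return ranked[:k]

ratings_db = {
    "u1": {"Matrix", "Amelie", "Obscure Film X"},
    "u2": {"Matrix", "Titanic"},
    "u3": {"Titanic", "Amelie"},
}
# the target's forum posts mention these films:
top = k_identify({"Obscure Film X", "Matrix"}, ratings_db, k=1)
print(top)  # ['u1'] -> u1 is 1-identified
```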


Results


What do you think helps?


Background, Research Directions

k-anonymity (Sweeney)

privacy-preserving data mining, privacy-preserving data
publishing
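A k-anonymity check is simple to sketch (toy generalised records; which columns count as quasi-identifiers is an assumption): every combination of quasi-identifier values must be shared by at least k records.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the minimum group size over all quasi-identifier value
    combinations; the table is k-anonymous for that k (Sweeney)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [  # ZIP and age already generalised
    {"zip": "021**", "age": "40-49", "disease": "flu"},
    {"zip": "021**", "age": "40-49", "disease": "cancer"},
    {"zip": "021**", "age": "50-59", "disease": "flu"},
    {"zip": "021**", "age": "50-59", "disease": "flu"},
]
k_val = k_anonymity(records, ["zip", "age"])
print(k_val)  # 2 -> the table is 2-anonymous
```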

Not the whole story!


Thanks! Questions?

Further reading: An excellent textbook introduction

