

An introduction to
Web Mining

About me (an Information Retrieval view)

About me (a Data/Web Mining view)

The starting point:
Key notions of Information Retrieval
Representation, storage, organization of, and access to information items
Focus is on the user information need

User information need example:


Find all docs containing information on college tennis teams which: (1) are
maintained by a USA university and (2) participate in the NCAA
tournament.

Information retrieval
information about a subject or topic
semantics is frequently loose
small errors are tolerated
IR system:
interpret contents of information items
generate a ranking which reflects relevance
notion of relevance is most important
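The ranking idea can be sketched in a few lines of Python (a minimal illustration with made-up toy documents, not a real IR system): documents are scored by TF-IDF weights of the query terms, and the resulting order reflects relevance.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Rank documents by a simple TF-IDF relevance score for the query."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency of each term
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        # sum TF * IDF over the query terms present in the document
        score = sum(tf[t] * math.log(n / df[t])
                    for t in query.lower().split() if t in tf)
        scores.append((score, i))
    return [i for score, i in sorted(scores, reverse=True)]

docs = [
    "college tennis teams in the NCAA tournament",
    "cooking recipes for pasta",
    "university tennis and college sports",
]
ranking = tfidf_rank("college tennis NCAA", docs)
print(ranking)  # the first document is the most relevant
```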

Retrieval
Database
Browsing

Plan for this lecture:


From IR to Web Mining: some directions
Transition 1: different outputs

Transition 2: The material changes (1)


Specific looks at user-generated content
Transition 3: The questions change

Transition 4: The material changes (2)

New challenges

Transition 1: different outputs

The answer is 42.

A clustering example: still quite IR-ish

A clustering example with more information extraction

Going further: from clustering to ontology learning


IR and KD

Information Retrieval

Knowledge Discovery*
*(a better term for data mining)

IR and KD: Different ways of utilizing databases (DBs)


IR: retrieving the information from a DB that matches a user's information need
query (formal statement of information need)

object (an entity which stores information in a database)


KD: finding new knowledge about the real-world entities described in a DB
data/information (sometimes plus query)

patterns (knowledge)


IR and KD: confluences

Conceptually:

IR can be seen as a classification of objects into the classes


"relevant to the user's query" / "not relevant to the user's query"
(and classification is a typical KD task)

KD needs to extract the information from objects like documents,


in order to find new knowledge
(and information extraction is a typical IR task)

Pragmatically:

e.g. overlaps of topics and techniques in papers at SIGIR, SIGKDD


Web Mining
Knowledge discovery
(aka Data mining):
"the non-trivial process of
identifying valid, novel,
potentially useful, and
ultimately understandable
patterns in data" [1]
Web Mining:
the application of data mining
techniques to the content,
(hyperlink) structure, and usage
of Web resources
(usage: navigation, queries,
content access & creation)

Web mining areas:


Web content mining
Web structure mining
Web usage mining

What's different about Web mining (compared to
data mining in general)?

The data and the necessary data preparation steps

To some extent, the applicable techniques


The process part of knowledge discovery

CRISP-DM
CRoss-Industry Standard Process for Data Mining:
a data mining process model that describes approaches
expert data miners commonly use to tackle problems, in six
phases: business understanding, data understanding, data
preparation, modelling, evaluation, and deployment.


The structural/algorithmic part of knowledge discovery (modelling
in CRISP-DM): patterns, data mining tasks, methods (examples)

Global patterns

Description
Clustering
K-means, EM, hierarchical clustering, ...
Hidden Markov Models
Link patterns (e.g., citation analysis à la Google)
Prediction
Classification
Bayes techniques, Decision trees, Support Vector
Machines, ...
Regression
Time series analysis

Local patterns

Frequent itemsets, sequences, subgraphs


Apriori and methods derived from it
Association rules
Cliques (Web Communities)
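As a small illustration of the clustering task above, here is a plain-Python k-means sketch (toy 2-D data, no libraries assumed; real work would use an optimized implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-center assignment and mean update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: centers move to the cluster means
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers, clusters

# two well-separated toy groups
points = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.0, 9.9)]
centers, clusters = kmeans(points, 2)
print(clusters)
```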

Recall: from clustering to ontology learning


What is needed for that?


Basic steps of text mining


Towards "the answer is 42" ... well, let's start with:


"RFID research deals with technology or privacy" or
"antibiotics kill bacteria"


http://www.cs.washington.edu/research/textrunner/


http://quest.sandbox.yahoo.net


Towards "the answer is 42" ... well, let's start with:


"RFID research deals with technology or privacy" or
"antibiotics kill bacteria" (2)

Ontology learning comes in two flavours, based on:

co-occurrence
at document level (see before) or at sentence level

(more or less shallow) natural-language processing
tends to be at sentence level
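The co-occurrence flavour at sentence level can be sketched in a few lines (made-up toy sentences; real systems work on large corpora and use statistical association measures rather than raw counts):

```python
from itertools import combinations
from collections import Counter

sentences = [
    "rfid research deals with privacy",
    "rfid raises privacy concerns",
    "antibiotics kill bacteria",
]

# count how often each pair of terms co-occurs in the same sentence
cooc = Counter()
for s in sentences:
    terms = set(s.split())
    cooc.update(frozenset(p) for p in combinations(sorted(terms), 2))

# frequently co-occurring pairs suggest related concepts
print(cooc[frozenset({"rfid", "privacy"})])  # 2
```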


Open information extraction from the Web

Basic ideas of the second approach, which rests on shallow
linguistic processing:

Templates ("cities such as ..." gives instances of class city)

Generalised templates ("* such as *" gives classes and instances)

Semantic role labelling ("X kills bacteria" in different linguistic forms)

Query the Web for occurrences

exploit massive redundancy [and mutual constraints] as indicator of truth
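The template idea can be sketched with a regular expression (hypothetical toy text; real systems such as KnowItAll combine many patterns with Web-scale redundancy statistics):

```python
import re

text = ("He visited cities such as Paris, Berlin and Madrid. "
        "Antibiotics such as penicillin kill bacteria.")

# generalised template: '<class> such as <instance>(, <instance>)* (and <instance>)?'
pattern = re.compile(r"(\w+) such as ((?:\w+)(?:, \w+)*(?: and \w+)?)")

facts = {}
for cls, insts in pattern.findall(text):
    items = re.split(r", | and ", insts)
    facts.setdefault(cls.lower(), set()).update(items)

print(facts)  # classes mapped to their extracted instances
```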

Examples:

Textrunner

Yahoo! Quest

KnowItAll (http://www.cs.washington.edu/research/knowitall/)

Read the Web project (http://rtw.ml.cmu.edu/readtheweb.html)


Towards "the answer is 42" ... well, let's start with:


"RFID research deals with technology or privacy" or
"antibiotics kill bacteria" (3)

Note: of course, database/knowledge-base retrieval is another
option!

WolframAlpha

Linked Data Initiative

(not treated further here)


Who is José Zapatero?


Transition 2: The material changes (1)


New material: User-generated content


Product ratings and reviews

Blogs

Social network profiles

...


www.blogpulse.com


New material: network information

Hyperlinks

PageRank

Blog networks

Viral marketing?

Opinion leadership?

Social network relationships

...

Roots in

Social network analysis (sociology et al.)

Bibliometrics

...
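PageRank itself can be sketched as a short power iteration (toy three-page graph; damping factor 0.85 as in the original formulation; dangling pages spread their rank uniformly):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += d * share
            else:  # dangling page: distribute its rank over all pages
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
print(rank)  # page c, linked from both a and b, ranks highest
```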


New material: usage


The classic


New material: usage


Other examples

E-commerce questions

How do people utilize (or not) service options?

Which advertising campaign brings the most
visitors / customers?

E-commerce / information systems questions

(e.g. Google Analytics)

What do queries tell us about which content we
should inform about?

Search-engine questions

How can click-through behaviour aid
relevance assessments
re-ranking (learning to rank)
query recommendation

Personalization (based on explicit or implicit
features, e.g. gender prediction)

Example: Google Analytics


Advertising ROI

Visualize the
Conversion Funnel

Cross-Channel and Multimedia Tracking,
Benchmarking

Customized Reporting
(define your own
metrics)


Specific looks at user-generated content


Transition 3: The questions change

They talk about the meaning of life.

They are very depressed about it.


Transition 4: The material changes (2)


(More on the interface)


New challenges


Who is this?
(Sample from a search-query log)


Result
(a 1-identified person)


Anonymized?
In Massachusetts, the Group Insurance Commission (GIC) is responsible for
purchasing health insurance for state employees. GIC collected patient-specific
data with nearly one hundred attributes per encounter along the lines of those
shown in the leftmost circle of Figure 1 [...] Because the data were believed to be
anonymous, GIC gave a copy of the data to researchers and sold a copy to
industry.
For twenty dollars I purchased the voter registration list for Cambridge
Massachusetts [...] The rightmost circle in Figure 1 shows that these data
included the name, address, ZIP code, birth date, and gender of each voter.
This information can be linked using ZIP code,
birth date and gender to the medical
information, thereby linking diagnosis,
procedures, and medications to particularly
named individuals.
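The linkage described above can be illustrated with toy records (all values made up): joining the two datasets on the quasi-identifiers (ZIP, birth date, gender) re-attaches names to "anonymous" medical rows.

```python
# Toy, made-up records illustrating the linkage attack.
medical = [  # "anonymized": no names, but quasi-identifiers remain
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-01-02", "sex": "M", "diagnosis": "asthma"},
]
voters = [  # public voter roll: names plus the same quasi-identifiers
    {"name": "J. Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "R. Roe", "zip": "02140", "dob": "1970-03-03", "sex": "M"},
]

quasi = lambda r: (r["zip"], r["dob"], r["sex"])
index = {quasi(v): v["name"] for v in voters}

# join on (ZIP, birth date, gender): diagnoses are re-identified
reidentified = {index[quasi(m)]: m["diagnosis"]
                for m in medical if quasi(m) in index}
print(reidentified)  # {'J. Doe': 'hypertension'}
```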


Is this the same person?


Keeping identities apart: the basic setting

Paper published by the MovieLens team (collaborative-filtering
movie ratings), who were considering publishing a ratings
dataset; see http://movielens.umn.edu/

Public dataset: users mention films in forum posts

Private dataset (may be released, e.g., for research purposes):
users' ratings

Film IDs can easily be extracted from the posts

Observation: Every user will talk about items from a sparse
relation space (those generally few films s/he has seen)


Keeping identities apart: the computational problem

Given a target user t from the forum users, find similar users (in
terms of which items they related to) in the ratings dataset

Rank these users u by their likelihood of being t

Evaluate:

If t is in the top k of this list, then t is k-identified

Count percentage of users who are k-identified

E.g., measure likelihood by TF-IDF (m: item)
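This scoring can be sketched in Python (toy data; the paper's exact formula may differ; the sketch simply weights shared items by inverse document frequency, so rare films identify strongly):

```python
import math

def k_identify(mentioned, ratings_db, k=1):
    """Rank ratings users by summed IDF of items shared with the
    target's forum mentions; return the top-k candidates."""
    n = len(ratings_db)
    # df of item m: in how many users' ratings it appears
    df = {}
    for items in ratings_db.values():
        for m in items:
            df[m] = df.get(m, 0) + 1

    def score(items):
        # rare items (low df) are strong identity signals
        return sum(math.log(n / df[m]) for m in mentioned if m in items)

    ranked = sorted(ratings_db, key=lambda u: score(ratings_db[u]), reverse=True)
    return ranked[:k]

ratings_db = {
    "u1": {"Matrix", "Amelie", "Obscure Film X"},
    "u2": {"Matrix", "Titanic"},
    "u3": {"Titanic", "Amelie"},
}
# the target's forum posts mention these films:
top = k_identify({"Obscure Film X", "Matrix"}, ratings_db, k=1)
print(top)  # ['u1'] -> u1 is 1-identified
```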


Results


What do you think helps?


Background, Research Directions

k-anonymity (Sweeney)

privacy-preserving data mining, privacy-preserving data
publishing
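A k-anonymity check is simple to sketch (toy generalised records; which columns count as quasi-identifiers is an assumption): every combination of quasi-identifier values must be shared by at least k records.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the minimum group size over all quasi-identifier value
    combinations; the table is k-anonymous for that k (Sweeney)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [  # ZIP and age already generalised
    {"zip": "021**", "age": "40-49", "disease": "flu"},
    {"zip": "021**", "age": "40-49", "disease": "cancer"},
    {"zip": "021**", "age": "50-59", "disease": "flu"},
    {"zip": "021**", "age": "50-59", "disease": "flu"},
]
k_val = k_anonymity(records, ["zip", "age"])
print(k_val)  # 2 -> the table is 2-anonymous
```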

Not the whole story!


Thanks! Questions?

Further reading: An excellent textbook introduction

