
Web Mining

Outline
Motivation
Problems
Profits
Web Mining taxonomy
Algorithms, methods and applications
  Content mining
  Structure mining
  Usage mining
Conclusions and future research agenda

Motivation
The Internet as a source of information
  400 million pages, 1 million new pages per day, a new server every 2 hours
  Unstructured, lacking any standards, highly heterogeneous
  Off-the-shelf database solutions do not apply
  Search engines stall quickly (their indices cannot keep up)
  Wealth of information (financial, marketing, personal, scientific)
Possible profits
  Personalization of services
  Optimization of Web sites and resource usage
  Optimization of information retrieval

Problems
The Web seems too huge for effective data warehousing and data mining (the Internet Archive, www.archive.org/index1.html, is already in the order of tens of terabytes)
The complexity of Web pages is far greater than that of any traditional text document collection
Web pages lack a unifying structure
The Web is a highly dynamic information source
The Web serves many different user communities
Only a small portion of the information on the Web is relevant or useful to a given user
Web logs do not record enough information (cf. the W3C Extended Log format)

Web Mining taxonomy


Web Content Mining
Web Page Content Mining
Summarization of Web page contents (WebSQL, WebOQL, WebML, WebLog, W3QL)

Web Structure Mining


Search Result Mining
Summarization of search engine results (PageRank)

Capturing the Web's structure using link interconnections (HITS)

Web Usage Mining


General Access Pattern Mining
Uses KDD techniques to understand general user patterns (WUM, WEBMiner, WAP, WebLogMiner)

Customized Usage Tracking


Adaptive sites

Web Page Content Mining (1)


WebSQL, W3QL
Languages for querying and finding relevant documents on the Web, built on top of several search engines
They do not use any information about the structure of the Web, and do not perform data mining on the Web
Example: find all the computer science graduate students who mention prof. Mendelzon in their home page.

select x.url
from document x such that "http://www.cs.utoronto.ca/homepages.html" =>|-> x,
     anchor y such that base = x
where y.label contains "Mendelzon";

Web Page Content Mining (2)


WebOQL, WebLog
http://www.cs.toronto.edu/~gus/weboql/

Restructuring Web documents using Datalog-like rules or graph/tree representations; a two-way bridge between databases and the Web

select X&2 as X.text,
       [Tag:"A", Url:X.text, Text:X.text] + [Tag:"br"] as schema
from X in "http://lists.html" via ^*
where X.tag = "h2" and X!.tag = "ul"

Web Page Content Mining (3)


WebML and Multi-Layered Database Model
Physical and virtual artifacts (layer 0)
Generalized descriptions (layer 1)
More generalized descriptions (layer 2)
Distinguishes and separates meta-data from data; discovers resources without overloading servers and networks

Pros
Support for DB technology, high-level declarative interfaces and views, a global view of DB content, incremental updates

Challenges
The highly unstructured nature of the Web, a unified schema, automatic generation of the primitive layer

Examples of WebML queries


List the documents published in Europe and related to data mining
list *
from document in Europe related-to computing science
where one of keywords covered-by data mining

Describe the general characteristics, in relevance to authors' affiliations, publications, etc., of those documents that are popular on the Internet and are about data mining
mine description in-relevance-to authors.affiliation, publication, pub_date
from document related-to computing science
where one of keywords like data mining and access_frequency = high

Web Structure Mining

Web Structure Mining


(1970) Researchers in IR proposed methods of using citations among journal articles to evaluate the quality of research papers
Customer behavior: customers evaluate the quality of a product based on the opinions of other customers (instead of the product's description or advertisement)
Unlike journal citations, Web linkage has some unique features:
  not every hyperlink represents the endorsement we seek
  one authority will seldom have its Web page point to its competitors (Coca-Cola vs. Pepsi)
  authoritative pages are seldom self-descriptive (Yahoo! may not contain the description "Web search engine")

Web Search
There are two approaches:
  page rank: for discovering the most important pages on the Web (as used in Google)
  hubs and authorities: a more detailed evaluation of the importance of Web pages
Page Rank's definition of importance: a page is important if important pages link to it.

Page Rank (1)


Simple solution: create a stochastic matrix of the Web:
1. Each page i corresponds to row i and column i of the matrix.
2. If page j has n successors (links), then the (i,j)-th entry is 1/n if page i is one of these n successors of page j, and 0 otherwise.
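A minimal Python sketch of this construction, using as input the link structure of Example 1 below (page names and adjacency lists are spelled out for the example):

    # Build the column-stochastic transition matrix described above.
    # links[j] lists the successors of page j (the graph of Example 1 below).
    links = {"A": ["A", "B"], "B": ["A", "C"], "C": ["B"]}
    pages = sorted(links)                 # fixed ordering: A, B, C
    idx = {p: k for k, p in enumerate(pages)}
    n = len(pages)

    M = [[0.0] * n for _ in range(n)]
    for j, succs in links.items():
        share = 1.0 / len(succs)          # page j splits its importance evenly
        for i in succs:
            M[idx[i]][idx[j]] = share     # entry (i, j) = 1/n_j

    for row in M:
        print(row)                        # rows: [.5,.5,0], [.5,0,1], [0,.5,0]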


Page Rank (2)


The intuition behind this matrix: initially, each page has 1 unit of importance. At each round, each page shares whatever importance it has among its successors, and receives new importance from its predecessors. The importance of each page reaches a limit, which happens to be its component in the principal eigenvector of this matrix.
That importance is also the probability that a Web surfer, starting at a random page and following random links from each page, will be at the page in question after a long series of links.

Page Rank (3) Example 1


Assume that the Web consists of only three pages: A, B, and C, with the links shown below. Let [a, b, c] be the vector of importances for these three pages.

[Figure: link graph — A links to itself and to B; B links to A and to C; C links to B]

Page Rank Example 1 (cont.)


The equation describing the asymptotic values of these three variables is:

\begin{pmatrix} a \\ b \\ c \end{pmatrix} =
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} a \\ b \\ c \end{pmatrix}

We can solve equations like this one by starting with the assumption a = b = c = 1 and applying the matrix to the current estimates repeatedly. The first four iterations give the following estimates:

a = 1, 1, 5/4, 9/8, 5/4
b = 1, 3/2, 1, 11/8, 17/16
c = 1, 1/2, 3/4, 1/2, 11/16
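A minimal sketch of this repeated application in plain Python, reproducing the estimates above:

    # Power iteration for Example 1: v <- M v, starting from [1, 1, 1].
    M = [[0.5, 0.5, 0.0],
         [0.5, 0.0, 1.0],
         [0.0, 0.5, 0.0]]
    v = [1.0, 1.0, 1.0]                   # initial importances a = b = c = 1
    for step in range(4):
        v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]
        print(step + 1, v)                # 1.25 = 5/4, 1.0625 = 17/16, ...
    # with more iterations v approaches [6/5, 6/5, 3/5]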

In the limit, the solution is a = b = 6/5 and c = 3/5. That is, a and b each have the same importance, twice that of c.

Problems with Real Web Graphs


dead ends: a page that has no successors has nowhere to send its importance; eventually, all importance leaks out of the Web
spider traps: a group of one or more pages with no links out of the group will eventually accumulate all of the Web's importance

Page Rank Example 2


Assume now that the structure of the Web has changed: C has become a dead end. The new matrix describing transitions is:

M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix}

The first four steps of the iterative solution are:

a = 1, 1, 3/4, 5/8, 1/2
b = 1, 1/2, 1/2, 3/8, 5/16
c = 1, 1/2, 1/4, 1/4, 3/16

Eventually, each of a, b, and c becomes 0.

Page Rank Example 3


Assume now once more that the structure of the Web has changed: C now links only to itself, forming a spider trap. The new matrix describing transitions is:

M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}

The first four steps of the iterative solution are:

a = 1, 1, 3/4, 5/8, 1/2
b = 1, 1/2, 1/2, 3/8, 5/16
c = 1, 3/2, 7/4, 2, 35/16

c converges to 3, while a and b converge to 0.

Google's Solution to Dead Ends and Spider Traps


Instead of applying the matrix directly, tax each page some fraction of its current importance, and distribute the taxed importance equally among all pages. Example: with a 20% tax, the equations for the previous example become:

a = 0.8 · (1/2·a + 1/2·b + 0·c) + 0.2
b = 0.8 · (1/2·a + 0·b + 0·c) + 0.2
c = 0.8 · (0·a + 1/2·b + 1·c) + 0.2

The solution of this system is a = 7/11, b = 5/11, and c = 21/11.
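A sketch of the taxed iteration for this graph in plain Python; it converges to the solution quoted above:

    # Taxed PageRank: v <- 0.8 * M v + 0.2, for the spider-trap graph above.
    M = [[0.5, 0.5, 0.0],
         [0.5, 0.0, 0.0],
         [0.0, 0.5, 1.0]]
    v = [1.0, 1.0, 1.0]
    for _ in range(100):
        v = [0.8 * sum(M[i][j] * v[j] for j in range(3)) + 0.2
             for i in range(3)]
    print(v)    # ~[0.636, 0.455, 1.909] = [7/11, 5/11, 21/11]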


Google Anti-Spam Solution


Spamming is the attempt by many Web sites to appear to be about a subject that will attract surfers, without truly being about that subject. Solutions:
Google tries to match the words in your query to the words on Web pages. Unlike other search engines, Google tends to believe what others say about you in their anchor text, making it harder for you to appear to be about something you are not.
The use of Page Rank to measure importance also protects against spammers. The naive measure (the number of links into a page) can easily be fooled by a spammer who creates 1000 pages that mutually link to one another, while Page Rank recognizes that none of those pages has any real importance.
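A small illustration of that last point on a made-up graph (a clique of five mutually linking spam pages versus one page genuinely cited by three independent pages): the spam pages win on raw in-link count but not on taxed Page Rank.

    # Hypothetical graph: s0..s4 are spam pages linking only to one another;
    # g is a genuine page cited by three independent pages h1..h3.
    links = {f"s{k}": [f"s{j}" for j in range(5) if j != k] for k in range(5)}
    links.update({"h1": ["g"], "h2": ["g"], "h3": ["g"], "g": ["h1", "h2", "h3"]})

    rank = {p: 1.0 for p in links}
    for _ in range(100):                        # taxed iteration, 20% tax
        new = {p: 0.2 for p in links}
        for j, succs in links.items():
            for i in succs:
                new[i] += 0.8 * rank[j] / len(succs)
        rank = new

    indegree = {p: 0 for p in links}
    for succs in links.values():
        for i in succs:
            indegree[i] += 1
    print(indegree["s0"], rank["s0"])   # 4 links in, rank ~1.0 (just the tax baseline)
    print(indegree["g"], rank["g"])     # 3 links in, rank ~1.9 (real endorsement)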


Search Result Mining


PageRank (http://www.google.com, http://hci.stanford.edu/~page/papers/pagerank/)
  Citation-style importance ranking (an approximation of importance and quality)
  PageRank is a usage simulation (the random surfer)
  Visits only public data; cheap and effective
Drawbacks
  Databases, CGIs, non-HTML links, redirects
Enhancements
  Choice of starting points (major sites, sites related to the user), link distance as a factor

Web Structure Mining


CLEVER, Google
Notions of authorities (pages that provide the best source of information on a given topic) and hubs (pages that provide collections of links to authorities)
Pure text search is insufficient (authorities are not self-descriptive, so search engines miss them)
HITS (Hyperlink-Induced Topic Search)
  Sampling component: constructs a focused collection of several thousand Web pages likely to be rich in relevant authorities
  Weight-propagation component: determines numerical estimates of hub and authority weights
Applications: searching for cyber-communities, constructing taxonomies semi-automatically, assigning Web pages to categories

Hyperlink-Induced Topic Search (HITS)


The approach consists of two phases:
1. It uses the query terms to collect a starting set of about 200 pages from an index-based search engine: the root set of pages. The root set is expanded into a base set by including all the pages that the root-set pages link to, and all the pages that link to a page in the root set, up to a designated size cutoff, such as 2000-5000 pages.
2. A weight-propagation phase is initiated. This is an iterative process that determines numerical estimates of hub and authority weights. Links between two pages within the same Web domain usually serve a navigational function and thus do not confer authority; such links are excluded from the analysis.
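A sketch of the sampling phase under obvious simplifications (the link structure is passed in as in-memory dictionaries rather than fetched from a search engine; all names are illustrative):

    # Expand a root set into a base set: pages the root links to, plus
    # pages linking into the root, up to a size cutoff.
    def base_set(root, out_links, in_links, cutoff=5000):
        base = set(root)
        for page in root:
            for neighbor in out_links.get(page, []) + in_links.get(page, []):
                if len(base) >= cutoff:
                    return base
                base.add(neighbor)
        return base

    root = ["p1", "p2"]                       # hypothetical root-set pages
    out_links = {"p1": ["p3"], "p2": ["p3", "p4"]}
    in_links = {"p1": ["p5"], "p2": []}
    print(base_set(root, out_links, in_links))   # {'p1', ..., 'p5'}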

Hubs and Authorities


We define hub and authority in a mutually recursive way: a hub links to many authorities, and an authority is linked to by many hubs.
Authority: a page that offers information about a topic
Hub: a page that doesn't provide the information itself, but tells you where to find it
HITS uses a matrix formulation similar to that of Page Rank, but without the stochastic restriction: we count each link as 1, regardless of how many successors or predecessors a page has. Repeated application of the matrix leads to divergence, but we can introduce scaling factors to keep the computed values for each page within finite bounds.

Hubs and Authorities


Define a matrix A whose rows and columns correspond to Web pages, with entry A_{ij} = 1 if page i links to page j, and 0 if not. Let a and h be vectors whose i-th components are the authority and hubbiness of the i-th page, and let \alpha and \beta be suitable scaling factors. Then:
1. h = \alpha A a. That is, the hubbiness of each page is the sum of the authorities of all the pages it links to, scaled by \alpha.
2. a = \beta A^T h. That is, the authority of each page is the sum of the hubbinesses of all the pages that link to it, scaled by \beta.
Substituting one equation into the other:
a = \alpha\beta (A^T A) a
h = \alpha\beta (A A^T) h

Hubs and Authorities - Example


Consider a small Web of three pages A, B, and C, where A links to A, B, and C; B links to C; and C links to A and B. The adjacency matrix and the derived matrices are:

A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},
A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}

A A^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix},
A^T A = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}

Hubs and Authorities - Example


If we assume that \alpha = \beta = 1 and that the vectors h = [h_A, h_B, h_C] and a = [a_A, a_B, a_C] are each initially [1, 1, 1], the first three iterations of the equations for a and h give the following:

a_A = 1, 5, 24, 114
a_B = 1, 5, 24, 114
a_C = 1, 4, 18, 84
h_A = 1, 6, 28, 132
h_B = 1, 2, 8, 36
h_C = 1, 4, 20, 96
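A short Python sketch of these updates, reproducing the table above:

    # Iterate a <- A^T (A a) and h <- A (A^T h), starting from [1, 1, 1].
    A = [[1, 1, 1],
         [0, 0, 1],
         [1, 1, 0]]

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

    AT = [list(col) for col in zip(*A)]   # transpose of A
    a, h = [1, 1, 1], [1, 1, 1]
    for step in range(3):
        a = matvec(AT, matvec(A, a))      # authority update
        h = matvec(A, matvec(AT, h))      # hub update
        print(step + 1, "a =", a, "h =", h)
    # step 3 prints a = [114, 114, 84] and h = [132, 36, 96]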


Web Usage Mining

What Is Weblog Mining?


Web servers register a log entry for every single access they get, so a huge number of accesses (hits) are collected in the Web log.
Weblog mining: condense these colossal files of raw Web log data in order to mine significant and useful information.
[Diagram: browsers on the WWW send requests to a Web server, which serves Web documents and records each request in an access log]

Motivation: Why Weblog Mining?

Enhance server performance
Improve Web site navigation
Improve the system design of Web applications
Target customers for electronic commerce
Identify potential prime advertisement locations

Existing Web Log Analysis Tools


There are more than 30 commercially available applications, and many more are available for free.
Many of them are slow and make simplifying assumptions to reduce the size of the log file to analyse.
Most are limited in the results they provide, mainly frequency counts.

Existing Web Log Analysis Tools (cont.)


Frequently used, pre-defined reports:
Summary report of hits and bytes transferred
List of top requested URLs
List of top referrers
List of most common browsers
Hits per hour/day/week/month reports
Hits per Internet domain
Error report
Directory tree report, etc.

Tools are limited in their performance, comprehensiveness, and depth of analysis.

Web Usage Mining (1)


Web Utilization Miner (WUM)
[Diagram: WUM architecture — the Aggregation Service reads the Web log and builds the Aggregation Log; the MINT Processor evaluates ad hoc queries posed through the Explorer and returns results; the Notifier raises alerts for stored MINT queries]
Aggregation service
Aggregate trees
Aggregated log
MINT queries:


SELECT a.url, b.url
FROM NODE AS a b, TEMPLATE a * b
WHERE a.support > 100
  AND a.title LIKE "%Corba%"
  AND b.support / a.support > 0.1

Web Usage Mining (2)


WEBMiner
http://maya.cs.depaul.edu/~mobasher/Research-01.html
Development of a flexible architecture for Web usage mining
A model for a user transaction consisting of multiple user references
Clustering algorithms for grouping log entries into transactions
Integration of data from other sources, such as user registration databases, with access log data
Adaptation of association rule, temporal sequence, and classification rule discovery algorithms to Web mining
Knowledge-based intelligent agents to interpret the discovered rules
A flexible query mechanism for querying the integrated data and the discovered rules in a unified manner

Web Usage Mining (3)


Web Log Miner (http://db.cs.sfu.ca/WebMiner/)
Information available: domain name, IP of the request, user ID, timestamp, method, server status code, parameters of the script, size of the data sent, browser type, referring page
Basic summarization:
  get the frequency of individual actions by user/domain/session
  group actions into activities
  get the frequency of errors
In-depth analysis:
  pattern analysis (between users)
  trend analysis (users' behavior changes over time, network traffic changes)

Web Usage Mining (4)


Web Log Data Cube
Web log → data cleaning → Database → Data cube → OLAP → Sliced and diced cube → data mining → Knowledge
Dimensions: URL, action, type of resource, size of resource, time of request, time spent with resource, Internet domain, server status

OLAP
  Which components/features are the most/least used?
  What is the distribution of the network traffic over time?
  What is the distribution of users over domains?

Data mining
  In what context are the components/features used?
  What are the typical event sequences?
  Are there general behavior patterns across all users?
  Does user behavior change over time, and how?

Design of a Web Log Miner


The Web log is filtered to generate a relational database
A data cube is generated from the database
OLAP is used to drill down and roll up in the cube
OLAM (On-Line Analytical Mining) is used for mining interesting knowledge

Web log → (1) Data Cleaning → Database → (2) Data Cube Creation → Data Cube → (3) OLAP → Sliced and diced cube → (4) Data Mining → Knowledge

Data Cleaning and Transformation


Raw log fields: IP address, User, Timestamp, Method, File+Parameters, Status, Size

Generic cleaning and transformation yields:
Machine, Internet domain, User, Day, Month, Year, Hour, Minute, Seconds, Method, File, Parameters, Status, Size

Site structure is then brought in for the next step.


Cleaning and transformation necessitating knowledge about the resources at the site then yields:
Machine, Internet domain, User, Field Site, Day, Month, Year, Hour, Minute, Seconds, Resource, Module/Action, Status, Size, Duration
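A sketch of the generic step, assuming Common Log Format input; the regex and the dictionary keys are illustrative, not the actual Web Log Miner code:

    import re
    from datetime import datetime

    # Parse one Common Log Format entry into the fields listed above.
    LOG_RE = re.compile(r'(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

    def clean(line):
        m = LOG_RE.match(line)
        if m is None:
            return None                       # malformed entry: drop it
        host, user, ts, method, uri, status, size = m.groups()
        t = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
        path, _, params = uri.partition("?")  # split File from Parameters
        return {
            "machine": host,
            "domain": host.rsplit(".", 1)[-1],   # crude Internet-domain guess
            "user": user,
            "day": t.day, "month": t.month, "year": t.year,
            "hour": t.hour, "minute": t.minute, "seconds": t.second,
            "method": method, "file": path, "parameters": params,
            "status": int(status), "size": 0 if size == "-" else int(size),
        }

    print(clean('foo.example.com - alice [10/Oct/2000:13:55:36 -0700] '
                '"GET /index.html?x=1 HTTP/1.0" 200 2326'))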

Data Cube Building

Cleansed and transformed Web log → multi-dimensional data cube

Web Log Data Cube


URL of the resource
Action
Type of the resource
Size of the resource
Time of the request
Time spent with the resource
Internet domain of the requestor
Requestor agent
User
Server status

Typical Summaries
Request summary: request statistics for all modules/pages/files
Domain summary: request statistics from different domains
Event summary: statistics on the occurrence of all events/actions
Session summary: statistics of sessions
Bandwidth summary: statistics of generated network traffic
Error summary: statistics of all error messages
Referring organization summary: statistics of where the users came from
Agent summary: statistics on the use of different browsers, etc.
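Once the cleaned log sits in a relational table, most of these summaries are one-line aggregations; a pandas sketch with assumed column names following the cleaned schema above:

    import pandas as pd

    # One row per cleaned request; column names follow the schema above.
    log = pd.DataFrame([
        {"file": "/index.html", "domain": "com", "status": 200, "size": 2326, "hour": 13},
        {"file": "/index.html", "domain": "edu", "status": 404, "size": 0, "hour": 14},
        {"file": "/about.html", "domain": "com", "status": 200, "size": 512, "hour": 14},
    ])

    top_urls = log["file"].value_counts()                 # request summary
    per_domain = log.groupby("domain").size()             # domain summary
    bandwidth = log.groupby("hour")["size"].sum()         # bandwidth summary
    errors = log.loc[log["status"] >= 400, "status"].value_counts()  # error summary
    print(top_urls, per_domain, bandwidth, errors, sep="\n\n")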


From OLAP to Mining (1)

OLAP can answer questions such as:
  Which components or features are the most/least used?
  What is the distribution of network traffic over time (hour of the day, day of the week, month of the year, etc.)?
  What is the user distribution over different domain areas?
  Are there differences in access between users from different geographic areas, and what are they?

From OLAP to Mining (2)


Some questions need further analysis, i.e., mining:
  In what context are the components or features used?
  What are the typical event sequences?
  Are there any general behavior patterns across all users, and what are they?
  What are the differences in usage and behavior between different user populations?
  Do user behaviors change over time, and how?

Web Log Data Mining


Data characterization
Class comparison
Association
Prediction
Classification
Time-series analysis
Web traffic analysis
Typical event sequence and user behavior pattern analysis
Transition analysis
Trend analysis

Web Access Pattern Mining


Web Access Pattern mining (the WAP algorithm)
Idea of the WAP-tree:
  nodes represent events from event sequences
  common prefixes are merged together
Conditional search over the WAP-tree, in contrast to Apriori-style sequential pattern mining
Access patterns are mined from the WAP-tree (see the sketch below)
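A minimal sketch of the prefix-merging idea only (counts in a trie; the conditional-search mining step of WAP is omitted, and the event sequences are made up):

    # Insert event sequences into a tree in which common prefixes share nodes.
    def insert(tree, sequence):
        node = tree
        for event in sequence:
            child = node.setdefault(event, {"count": 0})
            child["count"] += 1
            node = child

    tree = {}
    for seq in ["abdac", "eaebcac", "babfaec", "afbacfc"]:
        insert(tree, seq)
    print(tree["a"]["count"])   # 2: two sequences start with 'a' and share a node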


Mining Path Traversal Patterns


Solution to the problem of mining traversal patterns:
  first step: convert the original sequence of log data into a set of traversal subsequences (maximal forward references)
  second step: determine the frequent traversal patterns, termed large reference sequences
Problems remain with finding large reference sequences efficiently.

Mining Path Traversal Patterns - Example


Traversal path for a user: {A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V}

[Figure: the traversal tree rooted at A, with the visit order numbered along the edges]

The set of maximal forward references for this user:
{ ABCD, ABEGH, ABEGW, AOU, AOV }
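A Python sketch of the first step, reproducing the set above. A backward move (revisiting a page already on the current path) closes a maximal forward reference:

    # Scan the traversal path; a backward move ends a maximal forward reference.
    def max_forward_refs(path):
        refs, stack, extending = [], [], False
        for page in path:
            if page in stack:                 # backward reference
                if extending:
                    refs.append("".join(stack))
                stack = stack[:stack.index(page) + 1]
                extending = False
            else:
                stack.append(page)
                extending = True
        if extending:
            refs.append("".join(stack))
        return refs

    path = list("ABCDCBEGHGWAOUOV")
    print(max_forward_refs(path))   # ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']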

Discussion
Analyzing Web access logs can help understand user behavior and Web structure, thereby improving the design of Web collections and Web applications, targeting potential e-commerce customers, etc.
Web log entries do not collect enough information on their own.
Data cleaning and transformation is crucial and often requires knowledge of the site structure (metadata).
OLAP provides data views from different perspectives and at different conceptual levels.
Web log data mining provides in-depth reports: time-series analysis, associations, classification, etc.

Future Research Agenda


Data cleaning and transformation methods
Mining digital libraries
Unstructured and semi-structured Web pages
  schema information is missing or incomplete
  updating and growing constantly and rapidly
Further integration with data warehouse and OLAP technology
  regulations for warehousing and mining
Mining complex data on the Web: spatial, text, multimedia
Developing distributed and incremental algorithms
