Outline
Motivation
Problems
Profits
Web Mining taxonomy
Algorithms, methods and applications
Content mining
Structure mining
Usage mining
Conclusions and future research agenda
Motivation
The Internet as a source of information
  400 million pages, growing by 1 million pages per day and 1 server every 2 hours
  Unstructured, lacking any standards, highly heterogeneous
  Off-the-shelf database solutions impossible
  Search engines get stalled quickly (indices)
  Wealth of information (financial, marketing, personal, scientific)
Possible profits
  Personalization of service
  Optimization of Web sites and resource usage
  Optimization of information retrieval
Problems
The Web seems too huge for effective data warehousing and data mining (see the Internet archive, on the order of tens of terabytes: www.archive.org/index1.html)
The complexity of Web pages is far greater than that of any traditional text document collection
Web pages lack a unifying structure
The Web is a highly dynamic information source
The Web serves different user communities
Only a small portion of the information on the Web is relevant or useful to a given user
Web logs do not contain enough information (Extended Logs from W3C)
Restructuring Web documents using Datalog-like rules or graph tree representations
Two-way bridge between databases and the Web
select X&2 as X.text, [Tag:"A", Url:X.text, Text:X.text] + [Tag:"br"] as schema from X in "http://lists.html" via ^* where X.tag = "h2" and X!.tag = "ul"
Pros
support of DB technology, high level declarative interfaces and views, global view of DB content, incremental updates
Challenges
highly unstructured nature of the data, unified schema, automatic generation of the primitive layer
Describe the general characteristics, in relevance to authors' affiliations, publications, etc., of those documents that are popular on the Internet and are about data mining
mine description
in-relevance-to authors.affiliation, publication, pub_date
from document related-to computing science
where one of keywords like data mining and access_frequency = high
Web Search
There are two approaches:
  page rank: for discovering the most important pages on the Web (as used in Google)
  hubs and authorities: a more detailed evaluation of the importance of Web pages
Page Rank: the definition of importance:
  A page is important if important pages link to it
[Figure: example Web graph with pages A, B, and C]
The first four steps of the iterative solution are:
a = 1  1    3/4  5/8  1/2
b = 1  1/2  1/2  3/8  5/16
c = 1  1/2  1/4  1/4  3/16
Eventually, each of a, b, and c becomes 0.
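The numbers above can be reproduced with a short iteration. The link structure used here is an assumption, since the graph itself is not shown in the text: A links to itself and to B, B links to A and to C, and C has no out-links (a dead end). Under that assumption each step computes a = a/2 + b/2, b = a/2, c = b/2, starting from a = b = c = 1.

```python
from fractions import Fraction

# Assumed link structure (not given in the text): page -> pages it links to.
# C has no out-links, so the importance that reaches C leaks out of the system.
DEAD_END_LINKS = {"a": ["a", "b"], "b": ["a", "c"], "c": []}


def step(rank, links):
    """One iteration: every page splits its rank evenly among its out-links."""
    new_rank = {page: Fraction(0) for page in rank}
    for page, targets in links.items():
        for target in targets:
            new_rank[target] += rank[page] / len(targets)
    return new_rank


def iterate(links, steps):
    """Return the list of rank vectors, starting with every page at 1."""
    rank = {page: Fraction(1) for page in links}
    history = [rank]
    for _ in range(steps):
        rank = step(rank, links)
        history.append(rank)
    return history


history = iterate(DEAD_END_LINKS, 4)
for page in "abc":
    print(page, "=", "  ".join(str(rank[page]) for rank in history))
```

Exact fractions are used so the printed values can be compared directly with the table; the total importance shrinks at every step because C passes nothing on.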
The first four steps of the iterative solution are:
a = 1  1    3/4  5/8  1/2
b = 1  1/2  1/2  3/8  5/16
c = 1  3/2  7/4  2    35/16
c converges to 3, and a and b converge to 0.
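A minimal sketch of why c absorbs all the importance in this case. The link structure is again an assumption, because the graph is not shown in the text: A links to itself and to B, B links to A and to C, and C links only to itself. The function name `pagerank_iterate` and the step counts are illustrative.

```python
# Assumed link structure (not given in the text): C links only to itself,
# so whatever importance reaches C never leaves it.
TRAP_LINKS = {"a": ["a", "b"], "b": ["a", "c"], "c": ["c"]}


def pagerank_iterate(links, steps):
    """Run `steps` iterations from all ones; every page must have an out-link."""
    rank = {page: 1.0 for page in links}
    for _ in range(steps):
        new_rank = {page: 0.0 for page in links}
        for page, targets in links.items():
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


after_4 = pagerank_iterate(TRAP_LINKS, 4)
after_200 = pagerank_iterate(TRAP_LINKS, 200)
print(after_4)    # values after the four steps listed above
print(after_200)  # a and b are close to 0, c is close to 3
```

Unlike the dead-end case, the total stays at 3 here because every page passes all of its importance on; it simply accumulates at C.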
Searching for Cybercommunities
Constructing taxonomies semi-automatically
Assigning Web pages to categories
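\[
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \quad
A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \quad
AA^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}, \quad
A^TA = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}
\]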
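As a check on these matrices, and as a sketch of how they are used, the following pure-Python snippet recomputes the two products and runs a few rounds of the hubs-and-authorities iteration. Several things here are assumptions for illustration: A is read as a link matrix with A[i][j] = 1 when page i links to page j, authority scores are updated as A^T h and hub scores as A a, each vector is rescaled so that its largest entry is 1, and ten rounds are run.

```python
# Link matrix, taken as the transpose of the A^T given in the text.
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]


def transpose(m):
    return [list(column) for column in zip(*m)]


def matmul(x, y):
    return [[sum(p * q for p, q in zip(row, column)) for column in zip(*y)]
            for row in x]


def matvec(m, v):
    return [sum(p * q for p, q in zip(row, v)) for row in m]


def scale(v):
    """Rescale so the largest entry is 1 (one common normalization choice)."""
    top = max(v)
    return [x / top for x in v]


AT = transpose(A)
AAT = matmul(A, AT)
ATA = matmul(AT, A)
print("AAT =", AAT)
print("ATA =", ATA)

# Mutual reinforcement: good hubs point to good authorities and vice versa.
hubs = [1.0, 1.0, 1.0]
for _ in range(10):
    authorities = scale(matvec(AT, hubs))
    hubs = scale(matvec(A, authorities))
print("hubs        =", hubs)
print("authorities =", authorities)
```

Substituting one update into the other shows that the hub vector is repeatedly multiplied by AA^T and the authority vector by A^TA, which is why those two products appear alongside A.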
[Figure: Web Server, Web Documents, Access Log]
Enhance server performance
Improve web site navigation
Improve system design of web applications
Target customers for electronic commerce
Identify potential prime advertisement locations
[Figure: system architecture diagram with components labeled NOTIFIER, ALERT MINT QUERY, MINT PROCESSOR, AD HOC QUERY, AGGREGATION LOG, AGGREGATION SERVICE, WEB LOG, EXPLORER, RESULTS]
OLAP
Which components/features are the most/least used?
What is the distribution of the network traffic over time?
What is the distribution of users over domains?
Data mining
In what context are the components/features used?
What are the typical event sequences?
Are there general behavior patterns across all users?
Does user behavior change over time, and if so, how?
[Figure: Web log analysis pipeline. Stages: Web log, Database, Data Cube, Sliced and diced cube, Knowledge. Labeled steps: 1 Data Cleaning, 3 OLAP, 4 Data Mining]
Machine, Internet domain, User, Day, Month, Year, Hour, Minute, Seconds, Method, File, Parameters, Status, Size
Site Structure
Cleaning and transformation, which require knowledge about the resources at the site.
Machine, Internet domain, User, Field Site, Day, Month, Year, Hour, Minute, Seconds, Resource, Module/Action, Status, Size, Duration
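A minimal sketch of the cleaning and transformation step, assuming the raw entries follow the Common Log Format (`host ident user [timestamp] "method path protocol" status size`). The sample line, the `parse_entry` helper, and the `SITE_STRUCTURE` mapping are hypothetical; the mapping stands in for the site knowledge that this step needs in order to turn a file name into a resource and an action. Duration is left out because it depends on the next request in the same session.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# Common Log Format: host ident user [time] "method target protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<machine>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical site-structure knowledge: file path -> (resource, action).
SITE_STRUCTURE = {"/courses/list.cgi": ("course catalog", "browse")}


def parse_entry(line):
    """Turn one raw log line into a cleaned record, or None if malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # malformed entries are dropped during cleaning
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    target = urlsplit(match["target"])
    resource, action = SITE_STRUCTURE.get(target.path, (target.path, "unknown"))
    machine = match["machine"]
    return {
        "machine": machine,
        # Last label of the host name; not meaningful for bare IP addresses.
        "domain": machine.rsplit(".", 1)[-1],
        "user": match["user"],
        "day": when.day, "month": when.month, "year": when.year,
        "hour": when.hour, "minute": when.minute, "second": when.second,
        "method": match["method"],
        "file": target.path,
        "parameters": target.query,
        "resource": resource,
        "action": action,
        "status": int(match["status"]),
        "size": 0 if match["size"] == "-" else int(match["size"]),
    }


sample = ('lab1.cs.example.edu - alice [10/Oct/2000:13:55:36 -0700] '
          '"GET /courses/list.cgi?term=fall HTTP/1.0" 200 2326')
entry = parse_entry(sample)
print(entry)
```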
Action
Type of the Resource
Size of the Resource
Time of the Request
Time Spent with Resource
Internet Domain of the Requestor
Requestor Agent
User
Server Status
Typical Summaries
Request summary: request statistics for all modules/pages/files
Domain summary: request statistics from different domains
Event summary: statistics on the occurrence of all events/actions
Session summary: statistics of sessions
Bandwidth summary: statistics of generated network traffic
Error summary: statistics of all error messages
Referring Organization summary: statistics of where the users came from
Agent summary: statistics of the use of different browsers, etc.
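Several of these summaries reduce to counting cleaned log entries by a single field. A minimal sketch, assuming the entries have already been parsed into dictionaries; the field names, the `summarize` helper, and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned log entries.
entries = [
    {"file": "/index.html", "domain": "edu", "status": 200, "size": 1200},
    {"file": "/index.html", "domain": "com", "status": 200, "size": 1200},
    {"file": "/missing.html", "domain": "edu", "status": 404, "size": 0},
    {"file": "/paper.pdf", "domain": "edu", "status": 200, "size": 50000},
]


def summarize(entries):
    """Build request, domain, error, and bandwidth summaries in one pass each."""
    requests = Counter(e["file"] for e in entries)
    domains = Counter(e["domain"] for e in entries)
    # Treat HTTP status codes of 400 and above as errors.
    errors = Counter(e["status"] for e in entries if e["status"] >= 400)
    bandwidth = Counter()
    for e in entries:
        bandwidth[e["domain"]] += e["size"]  # bytes sent per domain
    return {"request": requests, "domain": domains,
            "error": errors, "bandwidth": bandwidth}


summary = summarize(entries)
for name, counts in summary.items():
    print(name, dict(counts))
```

Session, referrer, and agent summaries would follow the same pattern once the corresponding fields are available in the cleaned entries.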
OLAP can answer questions such as:
  Which components or features are the most/least used?
  What is the distribution of network traffic over time (hour of the day, day of the week, month of the year, etc.)?
  What is the user distribution over different domain areas?
  Are there differences in access for users from different geographic areas, and what are they?
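Questions of this kind amount to rolling a fact table up along some dimensions and, optionally, slicing on others. A minimal sketch under stated assumptions: the dimension names, the `roll_up` helper, and the sample facts are hypothetical, and a real OLAP engine would precompute such aggregates in a data cube rather than scan the facts on every query.

```python
from collections import defaultdict

# Hypothetical dimensions; each fact is (domain, weekday, hour, resource, bytes).
DIMENSIONS = ("domain", "weekday", "hour", "resource")

facts = [
    ("edu", "Mon", 9, "search", 2000),
    ("edu", "Mon", 14, "search", 1500),
    ("com", "Mon", 14, "download", 40000),
    ("com", "Tue", 14, "search", 1000),
]


def roll_up(facts, keep, where=None):
    """Aggregate (request count, bytes) over the dimensions in `keep`.

    `where` maps a dimension to a required value, i.e. it slices the cube.
    """
    where = where or {}
    index = {name: i for i, name in enumerate(DIMENSIONS)}
    cells = defaultdict(lambda: [0, 0])
    for fact in facts:
        if any(fact[index[dim]] != value for dim, value in where.items()):
            continue
        key = tuple(fact[index[dim]] for dim in keep)
        cells[key][0] += 1          # number of requests
        cells[key][1] += fact[-1]   # bytes transferred
    return {key: tuple(total) for key, total in cells.items()}


by_hour = roll_up(facts, keep=("hour",))              # traffic over time
by_resource = roll_up(facts, keep=("resource",))      # most/least used features
edu_by_hour = roll_up(facts, keep=("hour",), where={"domain": "edu"})
print(by_hour)
print(by_resource)
print(edu_by_hour)
```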
Discussion
Analyzing Web access logs helps in understanding user behavior and Web structure, thereby improving the design of Web collections and Web applications, targeting potential e-commerce customers, etc.
Web log entries do not collect enough information.
Data cleaning and transformation are crucial and often require knowledge of the site structure (metadata).
OLAP provides data views from different perspectives and at different conceptual levels.
Web log data mining provides in-depth reports such as time series analysis, associations, classification, etc.