Web Mining

Prof. Navneet Goyal BITS, Pilani
Web Mining
Web Mining is the use of data mining techniques to automatically discover and extract information from web documents/services
Discovering useful information from the World Wide Web and its usage patterns
My Definition: Using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
Web Mining
Data Mining Techniques
Association rules
Sequential patterns
Classification
Clustering
Outlier discovery
Applications to the Web
E-commerce
Information retrieval (search)
Network management
Examples of Discovered Patterns
Association rules
98% of AOL users also have E-trade accounts
Classification
People with age less than 40 and salary > 40k trade on-line
Clustering
Users A and B access similar URLs
Outlier Detection
User A spends more than twice the average amount of time surfing on the Web
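
As a minimal sketch of how an association rule like the first example above is scored, the snippet below computes support and confidence over a handful of invented user records. The data and the function name `rule_stats` are illustrative assumptions, not part of the original slides.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    records: list of sets of items (hypothetical per-user service sets).
    """
    n = len(records)
    with_antecedent = [r for r in records if antecedent <= r]
    with_both = [r for r in with_antecedent if consequent <= r]
    support = len(with_both) / n if n else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Toy data, invented for illustration only
users = [
    {"AOL", "E-trade"},
    {"AOL", "E-trade"},
    {"AOL"},
    {"E-trade"},
]
support, confidence = rule_stats(users, {"AOL"}, {"E-trade"})
```

A statement such as "98% of AOL users also have E-trade accounts" corresponds to the confidence of the rule AOL -> E-trade; support additionally says how much of the whole population the rule covers.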
Web Mining
The WWW is a huge, widely distributed, global information service centre for
Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
Hyper-link information
Access and usage information
WWW provides rich sources of data for data mining
Why Mine the Web?
Enormous wealth of information on the Web
Financial information (e.g. stock quotes)
Book/CD/Video stores (e.g. Amazon)
Restaurant information (e.g. Zagats)
Car prices (e.g. Carpoint)
Lots of data on user access patterns
Web logs contain sequence of URLs accessed by users
Possible to mine interesting nuggets of information
People who ski also travel frequently to Europe
Tech stocks have corrections in the summer and rally from November until February
Why is Web Mining Different?
The Web is not just a huge collection of documents; it also carries
Hyper-link information
Access and usage information
The Web is very dynamic
New pages are constantly being generated
Challenge: Develop new Web mining algorithms and adapt traditional data mining algorithms to
Exploit hyper-links and access patterns
Be incremental
Web Mining Applications
E-commerce (Infrastructure)
Generate user profiles
Targeted advertising
Fraud
Similar image retrieval
Information retrieval (Search) on the Web
Automated generation of topic hierarchies
Web knowledge bases
Extraction of schema for XML documents
Network Management
Performance management
Fault management
User Profiling
Important for improving customization
Provide users with pages, advertisements of interest
Example profiles: on-line trader, on-line shopper
Generate user profiles based on their access patterns
Cluster users based on frequently accessed URLs
Use classifier to generate a profile for each cluster
Engage Technologies
Tracks web traffic to create anonymous user profiles of Web surfers
Has profiles for more than 35 million anonymous users
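
The step "cluster users based on frequently accessed URLs" can be sketched as follows. The slides do not name a clustering algorithm, so this uses a simple greedy threshold clustering on Jaccard similarity as a stand-in; the function names, the threshold, and the visit data are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of URLs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_users(user_urls, threshold=0.5):
    """Greedy clustering: put each user in the first cluster whose seed
    user is at least `threshold` similar, else start a new cluster."""
    clusters = []  # list of (seed_url_set, [user names])
    for user, urls in user_urls.items():
        for seed, members in clusters:
            if jaccard(seed, urls) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((urls, [user]))
    return [members for _, members in clusters]


# Hypothetical access data: user -> set of frequently accessed URLs
visits = {
    "A": {"/quotes", "/trade", "/portfolio"},
    "B": {"/quotes", "/trade"},
    "C": {"/books", "/cart"},
    "D": {"/books", "/cart", "/checkout"},
}
groups = cluster_users(visits)
```

With this toy data, users A and B end up in one group (an "on-line trader" style profile) and C and D in another (an "on-line shopper" style profile). A classifier trained on these groups would then assign profiles to new users.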
Internet Advertising
Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites
Plenty of startups doing internet advertising
Doubleclick, AdForce, Flycast, AdKnowledge
Internet advertising is probably the hottest web mining application today
Scheme 1:
Manually associate a set of ads with each user profile
For each user, display an ad from the set based on profile
Scheme 2:
Automate association between ads and users
Use ad click information to cluster users (each user is associated with a set of ads that he/she clicked on)
For each cluster, find ads that occur most frequently in the cluster and these become the ads for the set of users in the cluster
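
The last step of Scheme 2 (picking the most frequently clicked ads within a cluster) might look like the sketch below. It assumes the clusters have already been computed; `ads_for_cluster`, the `top_n` cutoff, and the click data are hypothetical.

```python
from collections import Counter


def ads_for_cluster(cluster_members, clicks, top_n=2):
    """Return the top_n ads clicked by the most users in the cluster.

    clicks: dict mapping user -> set of ads that user clicked on.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for user in cluster_members:
        counts.update(clicks.get(user, ()))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ad for ad, _ in ranked[:top_n]]


# Invented click data for one cluster of three users
clicks = {
    "u1": {"A1", "A2"},
    "u2": {"A1", "A3"},
    "u3": {"A1", "A2"},
}
cluster_ads = ads_for_cluster(["u1", "u2", "u3"], clicks)
```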

Use collaborative filtering (e.g. Likeminds, Firefly)
Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
Rij: rating of user Ui for ad Aj
Problem: Compute user Ui's rating for an unrated ad Aj
[Figure: ads A1, A2, A3 with an unknown rating marked "?"]
Key Idea: User Ui's rating for ad Aj is set to Rkj, where Uk is the user whose ratings of ads are most similar to Ui's
User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows:
Consider each user Uk who has rated ad Aj
Compute Dik, the distance between Ui's and Uk's ratings on common ads
Ui's rating for ad Aj = Rkj (Uk is the user with the smallest Dik)
Display to Ui the ad Aj with the highest computed rating
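
A minimal sketch of the nearest-user scheme described above follows. The slides do not define the distance Dik, so mean absolute difference over commonly rated ads is used here as an assumption; the function names and the ratings are also invented for illustration.

```python
def distance(r_i, r_k):
    """Mean absolute difference over ads both users rated; None if no overlap.
    (Assumed metric; the slides leave D_ik unspecified.)"""
    common = set(r_i) & set(r_k)
    if not common:
        return None
    return sum(abs(r_i[a] - r_k[a]) for a in common) / len(common)


def predict_rating(ratings, user, ad):
    """Predict `user`'s rating for `ad` as R_kj of the closest user U_k who rated it."""
    best = None  # (distance, rating)
    for other, r_k in ratings.items():
        if other == user or ad not in r_k:
            continue
        d = distance(ratings[user], r_k)
        if d is not None and (best is None or d < best[0]):
            best = (d, r_k[ad])
    return None if best is None else best[1]


def best_ad(ratings, user, ads):
    """Among ads the user has not rated, pick the one with the highest predicted rating."""
    scored = []
    for ad in ads:
        if ad in ratings[user]:
            continue
        predicted = predict_rating(ratings, user, ad)
        if predicted is not None:
            scored.append((predicted, ad))
    return max(scored)[1] if scored else None


# Hypothetical ratings R_ij: user -> {ad: rating}
ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 5, "A2": 2, "A3": 4},
    "U3": {"A1": 1, "A2": 5, "A3": 1},
}
predicted = predict_rating(ratings, "U1", "A3")
chosen = best_ad(ratings, "U1", ["A1", "A2", "A3"])
```

Here U2 is much closer to U1 than U3 is on the ads they share, so U1 inherits U2's rating for A3.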
Fraud
With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important
Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
If buying pattern changes significantly, then signal fraud
HNC Software uses domain knowledge and neural networks for credit card fraud detection
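
One very simple way to realise the "signature" idea above is sketched below: the signature holds the mean and spread of past purchase amounts plus the set of categories seen, and a purchase is flagged when it departs from either. This rule, the threshold `k`, and the purchase history are assumptions for illustration; it is not how HNC Software's system works.

```python
from statistics import mean, pstdev


def build_signature(purchases):
    """purchases: list of (amount, category) pairs from a user's history."""
    amounts = [amount for amount, _ in purchases]
    return {
        "mean": mean(amounts),
        "std": pstdev(amounts),
        "categories": {category for _, category in purchases},
    }


def is_suspicious(signature, amount, category, k=3.0):
    """Flag a purchase far above the usual amount or in a never-seen category."""
    too_large = amount > signature["mean"] + k * signature["std"]
    new_category = category not in signature["categories"]
    return too_large or new_category


# Invented purchase history for one user
history = [(20, "books"), (25, "books"), (30, "music"), (25, "books")]
signature = build_signature(history)
normal = is_suspicious(signature, 28, "books")
unusual = is_suspicious(signature, 900, "jewelry")
```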
Retrieval of Similar Images
Given:
A set of images
Find:
All images similar to a given image
All pairs of similar images
Sample applications:
Medical diagnosis
Weather prediction
Web search engine for images
E-commerce
Retrieval of Similar Images

QBIC, Virage, Photobook
Compute feature signature for each image
QBIC uses color histograms
WBIIS, WALRUS use wavelets
Use spatial index to retrieve database image whose signature is closest to the query's signature
WALRUS decomposes an image into regions
A single signature is stored for each region
Two images are considered to be similar if they have enough similar region pairs
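
To make the "feature signature" idea concrete, the sketch below builds a coarse global color histogram per image and retrieves the database image with the closest histogram. It is only in the spirit of the color-histogram approach mentioned above, not QBIC's actual implementation; it uses a linear scan rather than a spatial index, and the bin count, L1 distance, and pixel data are assumptions.

```python
def color_histogram(pixels, bins=4):
    """Normalized histogram over quantized (r, g, b) values with 0..255 channels."""
    step = 256 // bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        index = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[index] += 1
    total = len(pixels)
    return [h / total for h in hist] if total else hist


def l1_distance(h1, h2):
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def most_similar(query_pixels, database):
    """Return the name of the database image whose signature is closest to the query's."""
    query_hist = color_histogram(query_pixels)
    return min(
        database,
        key=lambda name: l1_distance(query_hist, color_histogram(database[name])),
    )


# Tiny invented "images", each a list of RGB pixels
database = {
    "reddish": [(240, 20, 20)] * 3 + [(10, 10, 250)],
    "blue": [(10, 10, 250)] * 4,
}
query = [(255, 0, 0)] * 4
match = most_similar(query, database)
```

A region-based system such as WALRUS would instead compute one signature per region and count matching region pairs, which this global histogram does not attempt.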
[Figure: query image and the images retrieved by WALRUS]
Problems with Web Search Today
Today's search engines are plagued by problems:
the abundance problem (99% of info of no interest to 99% of people)
limited coverage of the Web (internet sources hidden behind search interfaces)
Largest crawlers cover < 18% of all web pages
limited query interface based on keyword-oriented search
limited customization to individual users
Problems with Web Search Today
Today's search engines are plagued by problems:
Web is highly dynamic
Lots of pages added, removed, and updated every day
Very high dimensionality
Improve Search By Adding Structure to the Web
Use Web directories (or topic hierarchies)
Provide a hierarchical classification of documents (e.g., Yahoo!)
[Figure: example topic hierarchy with the Yahoo home page at the root, top-level topics Recreation, Business, Science, News, and subtopics Travel, Sports, Companies, Finance, Jobs]
Searches performed in the context of a topic restrict the search to only a subset of web pages related to the topic
Automatic Creation of Web Directories
In the Clever project, hyper-links between Web pages are taken into account when categorizing them
Use a Bayesian classifier
Exploit knowledge of the classes of immediate neighbors of document to be classified
Show that simply taking text from neighbors and using standard document classifiers to classify page does not work
Inktomi's Directory Engine uses Concept Induction to automatically categorize millions of documents
Network Management
Objective: To deliver content to users quickly and reliably
Traffic management
Fault management
[Figure: service provider network connecting routers and servers]
Why is Traffic Management Important?
While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
Result is frequent congestion at servers and on network links
During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
Olympic sites during the games
NASA sites close to launch and landing of shuttles
Traffic Management
Key Ideas
Dynamically replicate/cache content at multiple sites within the network and closer to the user
Multiple paths between any pair of sites
Route user requests to server closest to the user or least loaded server
Use path with least congested network links
Akamai, Inktomi
Traffic Management
[Figure: service provider network of routers and servers, showing a request, a congested link, and a congested server]
Traffic Management
Need to mine network and Web traffic to determine
What content to replicate?
Which servers should store replicas?
Which server to route a user request to?
What path to use to route packets?
Network Design issues
Where to place servers?
Where to place routers?
Which routers should be connected by links?
One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at a server
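
As a rough sketch of how sequential patterns could drive prefetching, the code below counts page-to-page transitions in past sessions and suggests the pages that follow the current one often enough. This is a first-order simplification rather than a full sequential pattern mining algorithm; the function names, the `min_confidence` threshold, and the sessions are assumptions.

```python
from collections import Counter, defaultdict


def build_transitions(sessions):
    """Count how often each page is immediately followed by each other page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return counts


def pages_to_prefetch(counts, current, min_confidence=0.5):
    """Pages that follow `current` in at least `min_confidence` of observed cases."""
    followers = counts.get(current)
    if not followers:
        return []
    total = sum(followers.values())
    return sorted(
        page for page, count in followers.items() if count / total >= min_confidence
    )


# Invented sessions: each is the ordered list of URLs one user requested
sessions = [
    ["/home", "/news", "/sports"],
    ["/home", "/news", "/weather"],
    ["/home", "/shop"],
]
transitions = build_transitions(sessions)
after_home = pages_to_prefetch(transitions, "/home")
after_news = pages_to_prefetch(transitions, "/news")
```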
Fault Management
Fault management involves
Quickly identifying failed/congested servers and links in network
Re-routing user requests and packets to avoid congested/down servers and links
Need to analyze alarm and traffic data to carry out root cause analysis of faults
Bayesian classifiers can be used to predict the root cause given a set of alarms
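
The idea of predicting a root cause from a set of alarms can be illustrated with a small Bernoulli naive Bayes classifier, written out by hand below. The alarm names, root-cause labels, and training incidents are invented, and Laplace smoothing is an added assumption to avoid zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train(incidents):
    """incidents: list of (set_of_alarms, root_cause) pairs from past faults."""
    cause_counts = Counter()
    alarm_counts = defaultdict(Counter)
    vocabulary = set()
    for alarms, cause in incidents:
        cause_counts[cause] += 1
        for alarm in alarms:
            alarm_counts[cause][alarm] += 1
            vocabulary.add(alarm)
    return cause_counts, alarm_counts, vocabulary


def predict(model, alarms):
    """Return the root cause with the highest posterior log-probability."""
    cause_counts, alarm_counts, vocabulary = model
    total = sum(cause_counts.values())
    best_cause, best_score = None, -math.inf
    for cause, n in cause_counts.items():
        score = math.log(n / total)  # log prior
        for alarm in vocabulary:
            # P(alarm raised | cause), with Laplace smoothing
            p = (alarm_counts[cause][alarm] + 1) / (n + 2)
            score += math.log(p if alarm in alarms else 1 - p)
        if score > best_score:
            best_cause, best_score = cause, score
    return best_cause


# Hypothetical labelled incidents
incidents = [
    ({"link_down", "high_latency"}, "router_failure"),
    ({"link_down"}, "router_failure"),
    ({"cpu_high", "high_latency"}, "server_overload"),
    ({"cpu_high"}, "server_overload"),
]
model = train(incidents)
cause_a = predict(model, {"link_down"})
cause_b = predict(model, {"cpu_high", "high_latency"})
```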
Web Mining Issues
Size

Grows at about 1 million pages a day
Google indexes 9 billion documents
Number of web sites
Netcraft survey says 72 million sites
(http://news.netcraft.com/archives/web_server_survey.html)
Diverse types of data
Images
Text
Audio/video
XML
HTML
Number of Active Sites
[Figure: Total Sites Across All Domains, August 1995 - October 2007]
Systems Issues

Web data sets can be very large
Tens to hundreds of terabytes

Need large farms of servers
Cannot mine on a single server!
How to organize hardware/software to mine multi-terabyte data sets
Without breaking the bank!
Different Data Formats

Structured Data
Unstructured Data
OLE DB offers some solutions!
Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
Profiles
Registration information
Cookies
Web Usage Mining

Pages contain information
Links are roads
How do people navigate the Internet?
Web Usage Mining (clickstream analysis)
Information on navigation paths available in log files
Logs can be mined from a client or a server perspective
Website Usage Analysis

Why analyze Website usage?
Knowledge about how visitors use a Website could
Provide guidelines for web site reorganization
Help prevent disorientation
Help designers place important information where the visitors look for it
Support pre-fetching and caching of web pages
Provide an adaptive Website (personalization)
Questions which could be answered
What are the differences in usage and access patterns among users?
What user behaviors change over time?
How do usage patterns change with quality of service (slow/fast)?
What is the distribution of network traffic over time?
Analog Web Log File Analyser
Gives basic statistics such as
number of hits
average hits per time period
what are the popular pages in your site
who is visiting your site
what keywords are users searching for to get to you
what is being downloaded
http://www.analog.cx/
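
A few of the basic statistics listed above (hits per page, hits per visitor) can be computed from a server log with a short script. The sketch below assumes log lines in the Common Log Format and uses invented sample lines; it is not Analog's implementation.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize(lines):
    """Return (hits per page, hits per host) for the parsable log lines."""
    hits_per_page = Counter()
    hits_per_host = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip malformed lines
        hits_per_page[match["path"]] += 1
        hits_per_host[match["host"]] += 1
    return hits_per_page, hits_per_host


# Invented sample log lines
sample = [
    '10.0.0.1 - - [10/Oct/2007:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2007:13:56:01 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2007:13:57:12 +0000] "GET /papers.html HTTP/1.0" 200 512',
    'not a log line',
]
pages, hosts = summarize(sample)
most_popular_page = pages.most_common(1)[0][0]
```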
Web Usage Mining Process
Web Mining Outline

Goal: Examine the use of data mining on the World Wide Web
Web Content Mining
Web Structure Mining
Web Usage Mining
Web Mining Taxonomy
Modified from [zai01]
Web Content Mining
Examine the contents of web pages as well as the results of web searching
Can be thought of as extending the work performed by basic search engines
Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
Web Content Mining is the process of extracting knowledge from web contents
Semi-structured Data
Content is, in general, semi-structured
Example: Title, Author, Publication_Date, Length, Category, Abstract, Content
Structuring Textual Data

Many methods are designed to analyze structured data
If we can represent documents by a set of attributes, we will be able to use existing data mining methods
How to represent a document?
Vector based representation (referred to as bag of words, as it is invariant to permutations)
Use statistics to add a numerical dimension to unstructured text
Document Representation

A document representation aims to capture what the document is about
One possible approach:
Each entry describes a document
Attributes describe whether or not a term appears in the document
Another approach:
Each entry describes a document
Attributes represent the frequency with which a term appears in the document
Stop word removal: Many words are not informative and thus irrelevant for document representation
the, and, a, an, is, of, that, ...
Stemming: reducing words to their root form (reduces dimensionality)
A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but without stemming it would not be retrieved by a query with the keyword fishing
Different words that share the same word stem should be represented by that stem instead of the actual word (here, fish)
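
The frequency-based representation, stop word removal, and stemming can be combined in a few lines, as sketched below. The stemmer here is a deliberately crude suffix stripper standing in for a real algorithm such as Porter's; the stop word list is just the one from the slide, and the function names and sample sentence are assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text):
    """Bag-of-words vector: stem -> number of occurrences, ignoring stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


doc = "The fisher and the fishers: a fish, fishes, and fishing"
tf = term_frequencies(doc)

# Binary variant: only whether a term appears, not how often
presence = {term: 1 for term in tf}
```

Because every variant collapses to the stem fish, a query for fishing (stemmed the same way) would now match this document.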
