Ms.Revathy
4/22/2012
Noisy information
In web sites, noise refers to blocks that are not part of a page's main content, such as copyright notices, advertisements, and navigation panels.
Web mining
Web mining is the application of data mining techniques to
discover patterns from the web.
Web usage mining is the process of extracting useful information from server logs.
Web content mining is the process of discovering useful information in the text, image, audio, or video data on the web.
Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site.
Style tree
This technique proposes a tree structure, called a style tree, to capture the common presentation styles and the actual contents of the pages in a given site. A site style tree (SST) can be built for a web site, and web pages are then cleaned using a cleaning technique based on it.
Cleaning Technique
The technique is based on the observation that most web pages are automatically generated.
Parts of a page whose layouts and actual contents also appear in other pages of the site are more likely to be noise. Parts of a page whose layouts or actual contents are quite different from other pages are usually the main content of the page.
The paper proposes an information-based measure to determine which parts of the style tree are noise and which parts contain the main content of the pages in the web site. The proposed importance measure is entropy based. Experimental results show an increase in the accuracy of web mining when the proposed web-page cleaning method is used.
[Figure: example DOM tree — root, BODY (bgcolor=white), with TABLE (width=800, height=200), IMG, and TABLE (width=800, bgcolor=red) below it]
Disadvantage of DOM
A single DOM tree is insufficient: it is hard to study the overall presentation style and content of a set of HTML pages, or to clean them, based on individual DOM trees alone.
[Figure: DOM trees of two pages, d1 and d2, sharing common TABLE/IMG substructures]
Compressed representation of two DOM trees. It shows which parts of the DOM trees are common and which parts are different.
A style node S represents a layout or presentation style. It has two components, denoted (Es, n), where Es is a sequence of element nodes and n is the number of pages that have this particular style at this node level. An element node E has three components, denoted (TAG, Attr, Ss), where TAG is the tag name, Attr is the set of display attributes of TAG, and Ss is the set of style nodes below E.
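These two node types can be sketched as plain data classes (a minimal illustration; the Python class and field names are mine, not from the slides):

```python
from dataclasses import dataclass, field

@dataclass
class ElementNode:
    """An element node E = (TAG, Attr, Ss)."""
    tag: str                                    # TAG: the tag name, e.g. "TABLE"
    attrs: dict                                 # Attr: display attributes, e.g. {"width": "800"}
    styles: list = field(default_factory=list)  # Ss: style nodes below E

@dataclass
class StyleNode:
    """A style node S = (Es, n)."""
    elements: list                              # Es: sequence of ElementNode children
    page_count: int                             # n: number of pages with this style here

# A page whose BODY contains a TABLE followed by an IMG yields one style node:
body = ElementNode("BODY", {"bgcolor": "white"})
style = StyleNode([ElementNode("TABLE", {"width": "800"}),
                   ElementNode("IMG", {})], page_count=1)
body.styles.append(style)
```

When a second page shows the same TABLE-then-IMG sequence under BODY, the SST would increment `page_count` instead of adding a new style node.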
Importance measure
An entropy-based importance measure is used to determine the noisy elements in the Style Tree (ST). It is based on the following assumptions:
1. The more presentation styles an element node has, the more important it is, and vice versa.
2. The more diverse the actual contents of an element node are, the more important the element node is, and vice versa.
The importance of an element node is obtained by combining its presentation importance and its content importance.
Importance Measure
[Figure: example SST with Table, Img, Text, Tr, and P element nodes]
The composite importance of a node combines the importance of the element node itself with that of its descendants. For an internal element node E (appearing in m pages, with l style nodes in E.Ss), it is based on the presentation styles and the importance of its descendants:

CompImp(E) = (1 - \gamma)\, NodeImp(E) + \gamma \sum_{i=1}^{l} p_i \cdot CompImp(S_i)

where \gamma is the attenuating factor, set to 0.9, and p_i is the probability that a web page uses the i-th style node S_i in E.Ss.

NodeImp(E) = \begin{cases} -\sum_{i=1}^{l} p_i \log_m p_i & \text{if } m > 1 \\ 1 & \text{if } m = 1 \end{cases}

CompImp(S_i) = \frac{1}{k} \sum_{j=1}^{k} CompImp(E_j)

where E_1, \dots, E_k are the k element nodes in S_i.Es.
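The internal-node measure can be sketched as follows (a simplified illustration: the per-style page counts are passed in directly, and the child values CompImp(S_i) are assumed to be already computed):

```python
import math

GAMMA = 0.9  # attenuating factor from the slides

def node_imp(style_counts, m):
    """NodeImp(E): entropy of the presentation styles over m pages.
    style_counts[i] = number of pages using the i-th style node of E."""
    if m == 1:
        return 1.0
    probs = [c / m for c in style_counts]
    return -sum(p * math.log(p, m) for p in probs if p > 0)

def comp_imp_internal(style_counts, m, child_comp_imps):
    """CompImp(E) = (1-gamma)*NodeImp(E) + gamma * sum_i p_i * CompImp(S_i).
    child_comp_imps[i] = CompImp(S_i), the mean CompImp of S_i's elements."""
    probs = [c / m for c in style_counts]
    weighted = sum(p * ci for p, ci in zip(probs, child_comp_imps))
    return (1 - GAMMA) * node_imp(style_counts, m) + GAMMA * weighted
```

For an element seen in 4 pages with two styles used by 2 pages each, `node_imp([2, 2], 4)` gives 0.5; a style shared by all pages, `node_imp([4], 4)`, gives 0 — uniform presentation, hence likely noise.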
Leaf nodes are treated differently from internal nodes: the composite importance of a leaf node is based on the information in its actual content (the text with no tags):

CompImp(E) = \begin{cases} 1 & \text{if } m = 1 \\ 1 - \frac{1}{l} \sum_{i=1}^{l} H(a_i) & \text{if } m > 1 \end{cases}

where a_i is an actual feature of the content in E, l is the number of such features, and H(a_i) = -\sum_{j=1}^{m} p_{ij} \log_m p_{ij} is the information entropy of a_i within the context of E.
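A sketch of the leaf-node measure, assuming features are words and p_ij is estimated from word counts per page (the slides do not spell out how p_ij is computed, so that estimate is an assumption):

```python
import math

def leaf_comp_imp(pages_words):
    """CompImp for a leaf node whose content appears in m pages.
    pages_words[j] = list of words page j has at this leaf.
    CompImp(E) = 1 - (1/l) * sum_i H(a_i), with
    H(a_i) = -sum_j p_ij * log_m p_ij."""
    m = len(pages_words)
    if m == 1:
        return 1.0
    vocab = sorted({w for page in pages_words for w in page})
    ent_sum = 0.0
    for w in vocab:                           # each word is a feature a_i
        counts = [page.count(w) for page in pages_words]
        total = sum(counts)
        probs = [c / total for c in counts]   # assumed estimate of p_ij
        ent_sum += -sum(p * math.log(p, m) for p in probs if p > 0)
    return 1.0 - ent_sum / len(vocab)
```

Identical boilerplate on every page gives maximum entropy and importance near 0 (noise); completely page-specific content gives entropy 0 and importance 1 (main content).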
Overall algorithm
1. Randomly crawl k pages from the given web site S
2. Set up a null SST with virtual root E
3. For each page W in the k pages do
4.   Build PST(W)
5.   Merge PST(W) into the SST
6. End for
Algorithm: MarkNoise
Noisy: for an element node E in the SST, if E and all of its descendants have composite importance no greater than a specified threshold t, we say that E is noisy.
Input: E, the root element node of an SST.
Return: TRUE if E and all of its descendants are noisy, else FALSE.
MarkNoise(E)
1.  For each S in E.Ss do
2.    For each e in S.Es do
3.      If (MarkNoise(e) == FALSE) then
4.        Return FALSE
5.      End if
6.    End for
7.  End for
8.  If (E.CompImp <= t) then
9.    Mark E as noisy
10.   Return TRUE
11. Else return FALSE
12. End if
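The MarkNoise recursion can be sketched as below (the Elem/Style classes and the threshold value t = 0.5 are illustrative assumptions, not fixed by the slides):

```python
T = 0.5  # hypothetical noise threshold t

class Elem:
    """Minimal element node: composite importance plus child style nodes."""
    def __init__(self, comp_imp, styles=()):
        self.comp_imp = comp_imp
        self.styles = list(styles)   # each style holds a list of element nodes
        self.noisy = False

class Style:
    def __init__(self, elements):
        self.elements = list(elements)

def mark_noise(e):
    """Return True iff e and every descendant have CompImp <= t,
    marking such nodes as noisy (lines 1-12 of the algorithm)."""
    for s in e.styles:                 # 1. for each S in E.Ss
        for child in s.elements:       # 2. for each e in S.Es
            if not mark_noise(child):  # 3-5. a meaningful descendant
                return False           #      stops E from being marked
    if e.comp_imp <= T:                # 8.
        e.noisy = True                 # 9.
        return True                    # 10.
    return False                       # 11.

# A uniformly low-importance subtree gets marked noisy as a whole:
boiler = Elem(0.2, [Style([Elem(0.1), Elem(0.3)])])
# A subtree with one important descendant is left unmarked:
main = Elem(0.2, [Style([Elem(0.9)])])
```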
Algorithm: Definitions
Maximal noisy element node: if a noisy element node E in the SST is not a descendant of any other noisy element node, we call E a maximal noisy element node.
Meaningful: if an element node E in the SST does not contain any noisy descendant, we say that E is meaningful.
Maximal meaningful element node: if a meaningful element node E is not a descendant of any other meaningful element node, we say E is a maximal meaningful element node.
[Figure: the example SST, with maximal noisy and maximal meaningful element nodes marked]
A simplified SST
[Figure: a simplified SST — Root → Body → Table/Img/Table/Table, with Tr, Tr, and Text below]
Algorithm: MapSST
MapSST compares the simplified SST with the style tree of the page being cleaned and extracts the page's actual content.
Input: E, the root element node of the simplified SST.
Input: Ep, the root element node of the page style tree.
Return: the main content of the page after cleaning.
MapSST(E, Ep)
1.  If E is noisy then
2.    Delete Ep as noise
3.    Return NULL
4.  End if
5.  If E is meaningful then
6.    Ep is meaningful
7.    Return the content under Ep
8.  Else
9.    returnContent = NULL
10.   S2 is the style node in Ep.Ss
11.   If (there exists S1 in E.Ss such that S2 matches S1) then
12.     e1i is the i-th element node in sequence S1.Es
13.     e2i is the i-th element node in sequence S2.Es
14.     For each pair (e1i, e2i) do
15.       returnContent += MapSST(e1i, e2i)
16.     End for
17.     Return returnContent
18.   Else Ep is possibly meaningful
19.     Return the content under Ep
20.   End if
21. End if
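A sketch of MapSST over a simplified node type (collapsing each Ss into a single child list and using tag-sequence equality as the "matches" test are my simplifications, not the paper's exact matching rule):

```python
class Node:
    """Minimal stand-in for both SST and page-style-tree element nodes."""
    def __init__(self, tag, text="", children=(), noisy=False, meaningful=False):
        self.tag = tag
        self.text = text                 # content directly under this node
        self.children = list(children)   # style nodes collapsed to one child list
        self.noisy = noisy               # set by MarkNoise on the SST side
        self.meaningful = meaningful

def content_under(ep):
    """Concatenate all text in Ep's subtree."""
    return ep.text + "".join(content_under(c) for c in ep.children)

def matches(e, ep):
    """Hypothetical style match: same child tags in the same order."""
    return [c.tag for c in e.children] == [c.tag for c in ep.children]

def map_sst(e, ep):
    """Algorithm MapSST: return the main content of the page after cleaning."""
    if e.noisy:
        return ""                        # steps 1-3: delete Ep as noise
    if e.meaningful:
        return content_under(ep)         # steps 5-7: keep the whole subtree
    if matches(e, ep):                   # steps 10-17: recurse pairwise
        return "".join(map_sst(c1, c2)
                       for c1, c2 in zip(e.children, ep.children))
    return content_under(ep)             # steps 18-19: possibly meaningful

# An SST whose first TABLE is noisy strips the matching page block:
sst = Node("BODY", children=[Node("TABLE", noisy=True),
                             Node("TABLE", meaningful=True)])
page = Node("BODY", children=[Node("TABLE", text="nav links"),
                              Node("TABLE", text="article body")])
```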
Overall algorithm, revisited
1. Randomly crawl k pages from the given web site S
2. Set up a null SST with virtual root E
3. For each page W in the k pages do
4.   Build PST(W)
5.   Merge PST(W) into the SST
6. End for
7. Compute the composite importance of every node in the SST
8. Mark noisy nodes with MarkNoise
9. Clean each page by mapping it against the simplified SST with MapSST
Execution Time
Building the SST always takes under 20 seconds. Computing composite importance finishes within 2 seconds, and the final step of cleaning each page takes less than 0.1 second.
Advantages
The proposed system is faster than the existing one and very efficient at handling noise.
Disadvantages
The method has difficulty handling scripts inside the page body. The unformatted structure of a web page can cause exceptions. The approach is focused only on HTML pages.
Conclusions