
Eliminating Noisy Information in Web Pages for Data Mining

Presented by Jyothi.B S1 SE TKMIT


Guided by

Ms.Revathy

4/22/2012

Noisy information
On web sites, noise typically consists of blocks such as copyright notices, privacy notices and advertisements.

Web noise is of two types: global noise and local noise.


Web mining
Web mining is the application of data mining techniques to
discover patterns from the web.

Web usage mining is the process of extracting useful information from server logs.
Web content mining is the process of discovering useful information from text, image, audio or video data on the web.
Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site.

Sample web site


Style tree

This technique proposes a tree structure, called the style tree, to capture the common presentation styles and the actual contents of the pages in a given site.
A site style tree (SST) can be built for a web site.
Web pages are then cleaned using a cleaning technique based on the SST.


Cleaning Technique

Web page cleaning is a kind of preprocessing.

The technique is based on the observation that most web pages are automatically generated:
Parts of a page whose layouts and actual contents also appear in other pages of the site are more likely to be noise.
Parts of a page whose layouts or actual contents are quite different from other pages are usually the main content of the page.

Entropy based information measure

An information-based measure is proposed to determine which parts of the style tree indicate noise and which parts contain the main contents of the pages in the web site.
The proposed importance measure is entropy based.
Experimental results show an increase in the accuracy of web mining when the proposed web page cleaning method is used.


Data structure Site Style Tree

A web page is commonly represented by a DOM tree.


<BODY bgcolor=WHITE>
  <TABLE width=800 height=200> ... </TABLE>
  <IMG src="image.gif" width=800>
  <TABLE bgcolor=RED> ... </TABLE>
</BODY>

root
  BODY (bgcolor=white)
    TABLE (width=800, height=200)
    IMG (width=800)
    TABLE (bgcolor=red)
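
The DOM tree of the sample markup can be reproduced with a few lines of Python. This is a minimal sketch using the standard library `html.parser`; the names `Node`, `DomBuilder`, `build_dom` and `dump`, and the choice of which attributes count as display attributes (`DISPLAY_ATTRS`), are assumptions made for illustration and are not part of the presented method.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumption: attributes treated as "display attributes" in this sketch.
DISPLAY_ATTRS = {"bgcolor", "width", "height"}
# Void elements have no closing tag, so they must not stay on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Keep only display attributes; normalise values to lower case.
        shown = {k: (v or "").lower() for k, v in attrs if k in DISPLAY_ATTRS}
        node = Node(tag, shown)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Return the tree as indented text lines, one node per line."""
    attrs = ", ".join(f"{k}={v}" for k, v in sorted(node.attrs.items()))
    line = "  " * depth + node.tag + (f" ({attrs})" if attrs else "")
    return [line] + [l for c in node.children for l in dump(c, depth + 1)]


PAGE = """<BODY bgcolor=WHITE>
<TABLE width=800 height=200></TABLE>
<IMG src="image.gif" width=800>
<TABLE bgcolor=RED></TABLE>
</BODY>"""

print("\n".join(dump(build_dom(PAGE))))
```

The `src` attribute of the image is dropped on purpose, since only presentation-related attributes matter when layouts are compared across pages.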

Disadvantage of DOM

A DOM tree on its own is insufficient: it is hard to study the overall presentation style and content of a set of HTML pages, and to clean them, based on individual DOM trees.


DOM Trees and Style Tree

[Figure: DOM trees d1 and d2 of two pages. In both, the root has a BODY element (bgcolor=white) whose descendants include TABLE (width=800), TABLE (bgcolor=red) and IMG elements; further elements shown include BR and A.]

DOM Trees and Style Tree

A style tree is a compressed representation of the two DOM trees. It shows which parts of the DOM trees are common and which parts are different.

[Figure: style tree obtained by merging the DOM trees d1 and d2.]

Site Style Tree

A style node S represents a layout or presentation style. It has two components, denoted by (Es, n), where Es is a sequence of element nodes and n is the number of pages that have this particular style at this node level.

An element node E has three components, denoted by (TAG, Attr, Ss), where
TAG is the tag name,
Attr is the set of display attributes of TAG,
Ss is the set of style nodes below E.
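
The two node types and the merging of page DOM trees into a style tree can be sketched as follows. This is an illustrative sketch only: the class and function names, the `pages` counter on element nodes, the nested-tuple page format `(tag, attrs, children)` and the two small example pages are all assumptions, not the figure from the slides or the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ElementNode:
    tag: str                                     # TAG
    attrs: tuple = ()                            # Attr, e.g. (("bgcolor", "white"),)
    styles: list = field(default_factory=list)   # Ss: style nodes below E
    pages: int = 0                               # assumption: pages containing E


@dataclass
class StyleNode:
    elements: list                               # Es: sequence of element nodes
    count: int = 0                               # n: pages using this style


def merge(element, children):
    """Merge one page's child sequence under `element` into the style tree."""
    element.pages += 1
    if not children:
        return
    signature = tuple((tag, attrs) for tag, attrs, _ in children)
    for style in element.styles:
        if tuple((e.tag, e.attrs) for e in style.elements) == signature:
            break                                # same layout seen before
    else:
        style = StyleNode([ElementNode(t, a) for t, a, _ in children])
        element.styles.append(style)             # new presentation style
    style.count += 1
    for node, (_, _, grandchildren) in zip(style.elements, children):
        merge(node, grandchildren)


# Two hypothetical pages, each given as the child list of a virtual root.
WHITE = (("bgcolor", "white"),)
WIDE = (("width", "800"),)
RED = (("bgcolor", "red"),)
d1 = [("body", WHITE, [("table", WIDE, []), ("img", (), []), ("table", RED, [])])]
d2 = [("body", WHITE, [("table", WIDE, []), ("table", RED, [])])]

root = ElementNode("root")
for page in (d1, d2):
    merge(root, page)

body = root.styles[0].elements[0]
for style in body.styles:
    print([e.tag for e in style.elements], "used by", style.count, "page(s)")
```

Both pages share the BODY style, so it is stored once with n = 2, while the two different child sequences under BODY become two style nodes with n = 1 each.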

Importance measure

An entropy-based importance measure is used to determine the noisy elements in the style tree (ST). It is based on the following assumptions:
1. The more presentation styles an element node has, the more important it is, and vice versa.
2. The more diverse the actual contents of an element node are, the more important the element node is, and vice versa.


Importance of an element node is given by combining its presentation importance and content importance.


Importance Measure
[Figure: example SST with Root and BODY at the top and Table, Img, Tr, P and Text nodes below.]

Importance measure formula

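The composite importance of a node measures the importance of the element node and its descendants. For an internal node it is based on its presentation styles and the importance of its descendants:

\[
\mathrm{CompImp}(E) = (1-\gamma^{l})\,\mathrm{NodeImp}(E) + \gamma^{l}\sum_{i=1}^{l} p_i\,\mathrm{CompImp}(S_i)
\]

where \gamma is the attenuating factor, which is set to 0.9, l is the number of child style nodes of E (l = |E.Ss|), and p_i is the probability that E has the i-th child style node in E.Ss.

\[
\mathrm{NodeImp}(E) =
\begin{cases}
-\sum_{i=1}^{l} p_i \log_m p_i & \text{if } m > 1 \\
1 & \text{if } m = 1
\end{cases}
\]

where m is the number of pages containing E and p_i is the probability that a web page uses the i-th style node in E.Ss.

\[
\mathrm{CompImp}(S_i) = \frac{\sum_{j=1}^{k} \mathrm{CompImp}(E_j)}{k}
\]

where E_j is an element node in S_i.Es and k = |S_i.Es|.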
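
A small numeric sketch may make the recursion for internal nodes easier to follow. The class names and the `leaf_importance` field (a stand-in for the leaf-node formula) are assumptions made for illustration, and p_i is estimated here as the style's page count divided by m, which assumes every page containing E uses exactly one of its style nodes.

```python
import math
from dataclasses import dataclass, field

GAMMA = 0.9  # attenuating factor, set to 0.9 in the slides


@dataclass
class Element:
    pages: int                                   # m: pages containing E
    styles: list = field(default_factory=list)   # E.Ss
    leaf_importance: float = 0.0                 # assumption: precomputed for leaves


@dataclass
class Style:
    count: int       # number of pages using this style
    elements: list   # S.Es


def node_imp(e):
    """NodeImp(E): entropy (base m) of the style distribution below E."""
    m = e.pages
    if m == 1:
        return 1.0
    probs = [s.count / m for s in e.styles]
    return -sum(p * math.log(p, m) for p in probs if p > 0)


def style_imp(s):
    """CompImp(S_i): mean composite importance of the elements in S_i.Es."""
    return sum(comp_imp(e) for e in s.elements) / len(s.elements)


def comp_imp(e):
    """CompImp(E) for internal nodes; leaves return their stored value."""
    if not e.styles:
        return e.leaf_importance
    weight = GAMMA ** len(e.styles)
    below = sum((s.count / e.pages) * style_imp(s) for s in e.styles)
    return (1 - weight) * node_imp(e) + weight * below


# Hypothetical node seen on 4 pages, with two layouts used by 2 pages each.
varied = Element(pages=2, leaf_importance=1.0)
fixed = Element(pages=2, leaf_importance=0.2)
e = Element(pages=4, styles=[Style(2, [varied]), Style(2, [fixed])])
print(round(node_imp(e), 3), round(comp_imp(e), 3))
```

With more style nodes below E the weight \gamma^l shrinks, so the node's own style diversity counts for more and the importance of its descendants for less.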

Importance measure formula

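Leaf nodes are treated differently from internal nodes. The composite importance of a leaf node is based on the information in the actual contents of the node (its content without tags):

\[
\mathrm{CompImp}(E) =
\begin{cases}
1 & \text{if } m = 1 \\
1 - \dfrac{\sum_{i=1}^{l} H(a_i)}{l} & \text{if } m > 1
\end{cases}
\]

where a_i is an actual feature of the content in E and l is the number of features. H(a_i) is the information entropy of a_i within the context of E:

\[
H(a_i) = -\sum_{j=1}^{m} p_{ij} \log_m p_{ij}
\]

where p_{ij} is the probability that a_i appears in E of page j.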
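
The leaf-node measure can be tried out on toy data. The sketch below is illustrative only: the function names are made up, words are used as the features a_i, and p_ij is estimated as the share of a_i's occurrences that fall on page j, which is an assumption about how the probability is obtained.

```python
import math
from collections import Counter


def feature_entropy(occurrences, m):
    """H(a_i): entropy (base m) of one feature's distribution over the m pages."""
    total = sum(occurrences)
    probs = [c / total for c in occurrences if c > 0]
    return -sum(p * math.log(p, m) for p in probs)


def leaf_comp_imp(page_contents):
    """CompImp(E) of a leaf; page_contents holds one feature list per page containing E."""
    m = len(page_contents)
    if m == 1:
        return 1.0
    counts = [Counter(features) for features in page_contents]
    vocabulary = set().union(*counts)
    if not vocabulary:
        raise ValueError("leaf has no content features")
    entropies = [feature_entropy([c[a] for c in counts], m) for a in vocabulary]
    return 1 - sum(entropies) / len(entropies)


# A navigation bar repeated verbatim on four pages versus four distinct texts.
navigation = [["home", "about", "contact"]] * 4
articles = [["entropy"], ["crawler"], ["template"], ["clustering"]]
print(round(leaf_comp_imp(navigation), 3), round(leaf_comp_imp(articles), 3))
```

Content that is identical on every page spreads each feature evenly over the pages, so every H(a_i) is close to 1 and the importance is close to 0; content that differs from page to page does the opposite.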

Overall algorithm
SST: site style tree. PST: page style tree.

1. Randomly crawl k pages from the given web site S
2. Set null SST with virtual root E
3. for each page W in the k pages do
4.     Ew = BuildPST(W)
5.     BuildSST(E, Ew)
6. end for
7. CalcCompImp(E)
8. MarkNoise(E)
9. MarkMeaningful(E)
10. for each target web page P do
11.     Ep = BuildPST(P)
12.     MapSST(E, Ep)
13. end for

Algorithm-Mark Noise
Noisy: For an element node E in the SST, if E and all of its descendants have composite importance less than a specified threshold t, then we say that element node E is noisy.

Input: E: root element node of an SST.
Return: TRUE if E and all of its descendants are noisy, else FALSE.

MarkNoise(E)
1. for each S ∈ E.Ss do
2.     for each e ∈ S.Es do
3.         if (MarkNoise(e) == FALSE) then
4.             return FALSE
5.         end if
6.     end for
7. end for
8. if (E.CompImp <= t) then
9.     mark E as noisy
10.    return TRUE
11. else return FALSE
12. end if
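
The marking step can be sketched in Python as below. The class names, the example tree and the threshold value 0.15 are hypothetical. Unlike the pseudocode, this sketch does not return as soon as one child turns out to be non-noisy, so that every subtree is examined; that is a deliberate deviation for illustration, not something stated in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Style:
    elements: list                               # S.Es


@dataclass
class Element:
    name: str
    comp_imp: float                              # precomputed CompImp(E)
    styles: list = field(default_factory=list)   # E.Ss
    noisy: bool = False


def mark_noise(e, t):
    """Mark e as noisy if e and all of its descendants have CompImp <= t."""
    all_children_noisy = True
    for s in e.styles:
        for child in s.elements:
            # Visit every child, even after a non-noisy one has been found.
            if not mark_noise(child, t):
                all_children_noisy = False
    if all_children_noisy and e.comp_imp <= t:
        e.noisy = True
    return e.noisy


# Hypothetical SST fragment: a navigation table and an article table.
nav = Element("nav-table", 0.05, [Style([Element("nav-text", 0.02)])])
article = Element("article-table", 0.60, [Style([Element("article-text", 0.90)])])
body = Element("body", 0.30, [Style([nav, article])])

mark_noise(body, t=0.15)
print(nav.noisy, article.noisy, body.noisy)
```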

Algorithm -Definitions

Maximal noisy element node: If a noisy element node E in the SST is not a descendant of any other noisy element node, we call E a maximal noisy element node.
Meaningful: If an element node E in the SST does not contain any noisy descendant, we say that E is meaningful.
Maximal meaningful element node: If a meaningful element node E is not a descendant of any other meaningful element node, we say that E is a maximal meaningful element node.
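
These definitions can be illustrated on a toy tree. In this sketch the names, the flattened `children` list (style nodes are left out for brevity) and the example nodes are assumptions; it also assumes that a meaningful node is itself not noisy, which the definition above does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    noisy: bool = False
    children: list = field(default_factory=list)  # style nodes omitted for brevity
    meaningful: bool = False


def mark_meaningful(e):
    """Set e.meaningful for the whole subtree; return True if it contains noise."""
    child_noise = [mark_meaningful(c) for c in e.children]  # visits every child
    has_noise = e.noisy or any(child_noise)
    e.meaningful = not has_noise
    return has_noise


def maximal(e, flag, found=None):
    """Collect the topmost nodes for which flag(node) is true."""
    found = [] if found is None else found
    if flag(e):
        found.append(e.name)        # no flagged ancestor, so this one is maximal
    else:
        for c in e.children:
            maximal(c, flag, found)
    return found


# Hypothetical tree with noise already marked.
nav = Element("nav", noisy=True, children=[Element("nav-text", noisy=True)])
article = Element("article", children=[Element("text"), Element("img")])
mixed = Element("mixed", children=[Element("ad", noisy=True), Element("story")])
root = Element("root", children=[Element("body", children=[nav, article, mixed])])

mark_meaningful(root)
print(maximal(root, lambda n: n.noisy))
print(maximal(root, lambda n: n.meaningful))
```

Only the maximal nodes need to be kept when the tree is simplified, because everything below a maximal noisy node is noise and everything below a maximal meaningful node is content.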


Algorithm-Definition
[Figure: example SST illustrating the definitions; nodes shown: Root, BODY, Table, Img, Tr, P and Text.]

A simplified SST
[Figure: simplified SST containing the nodes Root, Body, Table, Img, Table, Table, Tr, Tr and Text.]

Algorithm-MapSST
MapSST uses the simplified SST, compares it with the page style tree (PST) and returns the actual contents.

Input: E: root element node of the simplified SST.
Input: Ep: root element node of the page style tree.
Return: the main content of the page after cleaning.

MapSST(E, Ep)
1. if E is noisy then
2.     delete Ep as noise
3.     return NULL
4. end if
5. if E is meaningful then
6.     Ep is meaningful
7.     return the content under Ep
8. else
9.     returnContent = NULL
10.    S2 is the style node in Ep.Ss
11.    if (there is an S1 ∈ E.Ss such that S2 matches S1) then
12.        e1,i is the i-th element node in sequence S1.Es
13.        e2,i is the i-th element node in sequence S2.Es
14.        for each pair (e1,i, e2,i) do
15.            returnContent += MapSST(e1,i, e2,i)
16.        end for
17.        return returnContent
18.    else Ep is possibly meaningful
19.        return the content under Ep
20.    end if
21. end if
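
The mapping step can be sketched in Python as follows. All names and the example trees are hypothetical. The sketch matches style nodes by their tag sequence only (display attributes are ignored), stores text only on leaf elements, and treats a page node without children as possibly meaningful; these are simplifying assumptions, not details taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Style:
    elements: list                               # Es


@dataclass
class Element:
    tag: str
    styles: list = field(default_factory=list)   # Ss
    text: str = ""                               # leaf content (page trees only)
    noisy: bool = False                          # set on SST nodes
    meaningful: bool = False                     # set on SST nodes


def content(e):
    """Collect all text under e in document order."""
    parts = [e.text] if e.text else []
    for s in e.styles:
        for child in s.elements:
            parts.extend(content(child))
    return parts


def matches(s1, s2):
    return [e.tag for e in s1.elements] == [e.tag for e in s2.elements]


def map_sst(e, ep):
    """Return the main content under page node ep, guided by SST node e."""
    if e.noisy:
        return []                                # drop ep as noise
    if e.meaningful or not ep.styles:
        return content(ep)
    s2 = ep.styles[0]                            # a page node has one style node
    for s1 in e.styles:
        if matches(s1, s2):
            kept = []
            for e1, e2 in zip(s1.elements, s2.elements):
                kept.extend(map_sst(e1, e2))
            return kept
    return content(ep)                           # unseen layout: possibly meaningful


# Simplified SST: a noisy navigation table and a meaningful content block.
sst = Element("root", [Style([Element("body", [Style([
    Element("table", noisy=True), Element("div", meaningful=True)])])])])

# Page style tree of one target page.
page = Element("root", [Style([Element("body", [Style([
    Element("table", text="Home | About"),
    Element("div", [Style([Element("p", text="Article text")])])])])])])

print(map_sst(sst, page))
```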

Overall algorithm-revisited
1. Randomly crawl k pages from the given web site S
2. Set null SST with virtual root E
3. for each page W in the k pages do
4.     Ew = BuildPST(W)
5.     BuildSST(E, Ew)
6. end for
7. CalcCompImp(E)
8. MarkNoise(E)
9. MarkMeaningful(E)
10. for each target web page P do
11.     Ep = BuildPST(P)
12.     MapSST(E, Ep)
13. end for

Execution Time

The time taken to build the SST is always below 20 seconds.
Computing the composite importance finishes in 2 seconds.
The final step of cleaning each page takes less than 0.1 second.


Advantages

The proposed system is faster than the existing one.
It is very efficient in handling noise.
It removes around 99% of the unwanted content.


Disadvantages

The technique has difficulty handling scripts inside the body.
Malformed web page structure causes exceptions.
It is focused on HTML pages only.


Conclusions

Proposes a technique to clean web pages for web mining.

Introduces a data structure SST to capture layout and presentation styles.


Proposes an information-based measure to evaluate the importance of element nodes in the SST so as to detect noise.

Results show that the proposed technique is highly effective.


