
Chinese Web Page Classification Based on Self-Organizing Mapping

Neural Networks
Jiu-zhen Liang
Institute of Computer Science, Zhejiang Normal University, Jinhua, China, 321004
E-mail: liangjiuzhen@yahoo.com
Abstract
This paper deals with the topology and learning algorithm of self-organizing mapping (SOM)
neural networks and their application to the automatic classification of Chinese web pages.
The SOM neural network has the advantages of a simple structure, an ordered mapping topology
and low learning complexity, which makes it suitable for complex problems such as multi-class
pattern recognition with high-dimensional input vectors and large quantities of training data.
The accuracy of clustering can be improved by combining the SOM's unsupervised learning
algorithm with the LVQ learning algorithm. At the end of the paper, we present the
classification results of the SOM neural network applied to 5087 HTML pages of the People's
Daily web edition, with an average precision of 90.08% and an average recall of 89.85%.

1. Introduction
Recently, the automatic classification of web pages has become popular in the
information-processing field, with more and more people engaged in it. Automatic web
classification deals with the text information, structure information and hyperlink
information of web pages [1,2]. Nowadays the focus of research is on the automatic
classification of web text information, that is, classification based on text content.
However, due to the diversity of web page content, the complexity of page structure and
other characteristics, it is very difficult to improve the accuracy of automatic
classification. Researchers have designed various classifiers for this problem, such as the
Naïve Bayes classifier [3], K-nearest-neighbor clustering [4], SOM neural networks [5] and
Support Vector Machines (SVM) [6], and each of these classifiers has its own characteristics
and conditions of application. For example, the Naïve Bayes classifier rests on the
assumption that the features to be classified are orthogonal and follow independent
multinomial distributions; K-nearest-neighbor clustering assumes that no sample of another
class appears in the neighborhood of a labeled sample; the SOM neural network provides an
order-preserving mapping but requires more training time; and the SVM demands the solution
of a quadratic programming problem in spite of its strong classifying ability.
Take the neural network classifier for example: it makes far fewer assumptions about the
distribution model of the problem than the Naïve Bayes classifier does, and therefore
depends less on a specific problem model. However, iterated training over a vast number of
samples is needed to fix the network parameters, which brings formidable difficulties to
the learning process when the samples are numerous and the dimension of the feature space
is very large. Furthermore, the sample quantity is sometimes small compared with the
dimension of the feature space. For example, our text classification problem has several
thousand samples while the dimension of the feature space is also several thousand, that
is, the two numbers are of the same order of magnitude. In such cases the distribution of
samples in the feature space is too sparse, which is bound to affect the generalization
capability of the classifier. Therefore, it is very important to introduce prior knowledge
into the selection of the network structure and into the training process.


The so-called prior knowledge for web page classification is, in the case of text
classification based on keyword frequency, the statistical information of word frequency:
in particular, the word frequency information of web pages of different classes and the
concentration ratio of keywords within web pages of the same class. This statistical
information is of crucial importance for web page classification, and also for the
selection of keyword features.
This paper comprises six parts. The first part is the introduction. The second part deals
with text feature extraction and selection based on keyword frequency. The third part
describes the topology of the SOM neural network. The fourth part gives the learning
algorithm corresponding to our classification problem. The fifth part presents the
experimental results of classification on the People's Daily web edition. The sixth and
last part is the conclusion of the paper.

2. Feature extraction of Chinese web pages


Keyword-based text feature extraction for web pages covers two aspects: whether a keyword
appears in a document of a certain class, and the frequency of that keyword. The process is
implemented by a parsing system. First a keyword dictionary is set up; then this dictionary
is used to scan the web pages and record the frequency of the keywords. The keyword
dictionary generally contains a great many words (several tens of thousands or even more),
whereas a large part of them, trivial words for example, contribute little to the
classification; to make matters worse, such words may even play a negative role in the
classifying process. So the selection of useful keywords is very important. Feature
compression is one way to select keywords, and a number of feature compression algorithms
are available, among them Principal Component Analysis (PCA), information entropy loss, and
word frequency thresholds [7].
Nowadays the Vector Space Model (VSM) is one of the most popular ways to represent text
features. In this model the document space is regarded as a vector space spanned by a group
of orthogonal word vectors, and every document d is a specific vector in it:
V(d) = (t_1, x_1(d); ...; t_i, x_i(d); ...; t_n, x_n(d)), where t_i is a word and x_i(d) is
the weight of t_i in document d. x_i(d) is usually defined as a function of tf_i(d), the
frequency of t_i in d, namely x_i(d) = M(tf_i(d)). Common choices of M are as follows.

Boolean function:
$M(tf_i(d)) = \begin{cases} 1, & tf_i(d) \ge 1 \\ 0, & tf_i(d) = 0 \end{cases}$  (1)

Square root function:
$M(tf_i(d)) = \sqrt{tf_i(d)}$  (2)

Logarithm function:
$M(tf_i(d)) = \log(tf_i(d) + 1)$  (3)

TFIDF function:
$M(tf_i(d)) = tf_i(d) \cdot \log\left(\frac{N}{n_i}\right)$  (4)
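As an illustration only, the four weighting functions above can be written as short Python helpers; here `tf` stands for tf_i(d) and, in the usual TFIDF reading, `N` for the total number of documents and `n_i` for the number of documents containing t_i (the function and variable names are our own, not from the paper).

```python
import math

# Term-weighting functions M(tf_i(d)) of equations (1)-(4).
# tf: frequency of term t_i in document d; N: total number of documents;
# n_i: number of documents containing t_i (illustrative names).

def m_boolean(tf):
    return 1 if tf >= 1 else 0          # equation (1)

def m_sqrt(tf):
    return math.sqrt(tf)                # equation (2)

def m_log(tf):
    return math.log(tf + 1)             # equation (3)

def m_tfidf(tf, N, n_i):
    return tf * math.log(N / n_i)       # equation (4)
```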

In practice we adopt the following four forms to denote our feature vectors.
1) Simple word frequency feature. Every record corresponds to a vector
V_j = (x_{j1}, x_{j2}, ..., x_{jn}), where x_{ji} is the number of times the i-th keyword
appears in the j-th document.
2) Word frequency divided by document length. Every record corresponds to a vector
V_j = (x_{j1}/L_j, x_{j2}/L_j, ..., x_{jn}/L_j), where x_{ji} is the same as in 1) and L_j
is the length of the j-th document.
3) TF/IDF feature set. Every record corresponds to a vector
V_j = (x_{j1}/v_1, x_{j2}/v_2, ..., x_{jn}/v_n), where x_{ji} is the same as in 1) and v_i
is the number of documents in which the i-th word appears (the so-called inverse document
frequency).


4) Normalized frequency feature set. Every record corresponds to a vector
V_j = (x_{j1}/S_j, x_{j2}/S_j, ..., x_{jn}/S_j), where x_{ji} is the same as in 1) and
$S_j = \sqrt{\sum_{i=1}^{n} x_{ji}^2}$ (see the sketch after this list).
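A minimal sketch of the four feature sets, assuming `counts` is a documents-by-keywords matrix of raw keyword frequencies and `doc_lengths` holds the document lengths; both names, and the function itself, are hypothetical.

```python
import numpy as np

def build_features(counts, doc_lengths):
    """counts: (n_docs, n_keywords) raw keyword frequencies (feature set 1).
    doc_lengths: (n_docs,) document lengths. Returns the four feature sets."""
    counts = np.asarray(counts, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)

    v1 = counts                                      # 1) simple word frequencies
    v2 = counts / doc_lengths[:, None]               # 2) divided by document length
    df = np.count_nonzero(counts, axis=0)            # v_i: documents containing word i
    v3 = counts / np.maximum(df, 1)                  # 3) divided by document frequency
    s = np.sqrt((counts ** 2).sum(axis=1))           # S_j = sqrt(sum_i x_ji^2)
    v4 = counts / np.maximum(s, 1e-12)[:, None]      # 4) normalized frequencies
    return v1, v2, v3, v4
```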

In order to filter useless keyword features, we introduce the concept of the Frequency
Covering Rate:

$x_{ij} = \frac{f_{ij}}{\sum_{j} f_{ij}} \cdot \frac{d_{ij}}{N_j}$  (5)

where $f_{ij}$ is the frequency of the i-th keyword in the documents of the j-th class,
$d_{ij}$ is the number of j-th class documents in which the i-th keyword appears, and
$N_j$ is the total number of j-th class documents. $x_{ij}$ reflects the degree of
correlation between the i-th keyword and the j-th class of documents: the bigger $x_{ij}$
is, the more precisely the keyword denotes the characteristics of the j-th class, and the
more advantageous it is for classification. We can therefore set a threshold T to filter
keyword features:

$x_i = \max_{j} \{ x_{ij} \}$  (6)

If $x_i > T$, then we keep the i-th keyword; otherwise we discard it.
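The selection rule of equations (5) and (6) might be sketched as follows, using the reconstruction of the Frequency Covering Rate given above; `freq`, `doc_count`, `class_sizes` and `T` are illustrative names for f_ij, d_ij, N_j and the threshold.

```python
import numpy as np

def select_keywords(freq, doc_count, class_sizes, T):
    """freq[i, j]: f_ij, frequency of keyword i in class j;
    doc_count[i, j]: d_ij, class-j documents containing keyword i;
    class_sizes[j]: N_j. Returns the indices of keywords with x_i > T."""
    freq = np.asarray(freq, dtype=float)
    doc_count = np.asarray(doc_count, dtype=float)
    class_sizes = np.asarray(class_sizes, dtype=float)

    freq_share = freq / np.maximum(freq.sum(axis=1, keepdims=True), 1e-12)
    cover_rate = doc_count / class_sizes            # d_ij / N_j
    x = freq_share * cover_rate                     # x_ij, equation (5)
    x_max = x.max(axis=1)                           # x_i = max_j x_ij, equation (6)
    return np.where(x_max > T)[0]
```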

3. Design of the SOM classifier


Among all the known SOM neural network models, the Kohonen model is the most famous. A
Kohonen network is composed of two layers of neurons: the input layer and the output layer,
which is also called the competition layer. If we denote the input vector as
$x = (x_1, x_2, \ldots, x_n)^T$, then the connecting weights between the input-layer
neurons and the j-th neuron in the competition layer form the vector
$w_j = (w_{1j}, w_{2j}, \ldots, w_{nj})^T$, j = 1, 2, ..., H, and the output of the
competition-layer neurons is

$y_j = w_j^T x = \sum_{i=1}^{n} w_{ij} x_i$,  j = 1, 2, ..., H  (7)

or, rewritten in vector form,

$y = W x$  (8)
By self-organizing competition, the SOM neural network transforms the disordered input set
$X = \{x^{(k)}\}_{k=1}^{K}$ into an ordered topological arrangement in the competition
layer, such as a distribution of clustering centers. Among all the objects related to the
input set, the clustering centers are the connection weights corresponding to the winning
units of the competition layer's output. Self-organizing is therefore the process of
looking for the connection weights that best match the input vector. We use the following
criterion to find the winning neuron:

$h(x) = \arg\min_{j} \| x - w_j \|$,  j = 1, 2, ..., H  (9)

where $\|\cdot\|$ is the Euclidean norm. Equation (9) describes a mapping from the
continuous input vector space $R^n$ (which contains X) to the discrete neuron output space
{1, 2, ..., H} through competitive actions among the competing units, with the restriction
$h(x) \in \{1, 2, \ldots, H\}$. By doing so we achieve an ordered division of the object
set X in the input vector space $R^n$.
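A minimal sketch of the competition step of equation (9); each row of the hypothetical matrix `W` is one competition-layer weight vector w_j.

```python
import numpy as np

def winner(W, x):
    """W: (H, n) weight matrix, one row per competition-layer neuron;
    x: (n,) input vector. Returns h(x) = argmin_j ||x - w_j||."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))
```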


Another problem to discuss is how to define the topological neighborhood of these centers.
Since in the competition layer a certain class is often represented by several neurons, the
usual practice is to define a neighbor distance. For example, if $d_{i,j}$ denotes the
distance between an excited neuron j and the winning neuron i, the topological neighborhood
of the center neuron i can be written as

$h_{i,j(x)} = f(d_{i,j}, x)$  (10)

where the neighborhood function $f(d_{i,j}, x)$ satisfies two restrictions: 1) it is
symmetrical about the center neuron, with $d_{i,j} = 0$ denoting the center neuron i;
2) it is a decreasing function of $d_{i,j}$. The Gaussian neighborhood function is one such
choice:

$h_{i,j(x)} = f(d_{i,j}, x) = \exp\left(-\frac{d_{i,j}^2}{2\sigma^2}\right)$  (11)

where $\sigma$ is the Gaussian width coefficient. The neighbor distance is defined as

$d_{i,j} = \| r_i - r_j \|$  (12)

where $r_i$ and $r_j$ are the locations of the winning neuron i and the excited neuron j in
the array-arranged two-dimensional discrete space, respectively.
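On a two-dimensional lattice of output neurons, equations (10)-(12) could be sketched like this; `grid` holds the coordinates r_j of each neuron and all names are illustrative.

```python
import numpy as np

def neighborhood(grid, winner_idx, sigma):
    """grid: (H, 2) positions r_j of the neurons on the output lattice;
    winner_idx: index i of the winning neuron; sigma: Gaussian width.
    Returns h_{i,j} = exp(-d_{i,j}^2 / (2 sigma^2)) for every neuron j."""
    d = np.linalg.norm(grid - grid[winner_idx], axis=1)   # d_{i,j} = ||r_i - r_j||
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))
```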

4. Learning Algorithm
The learning of the SOM neural network can resort to the Hebb learning rule. However,
considering the irreversible transfer of information, we have to introduce a forgetting
term $g(y_j) w_j$, where $w_j$ is the connecting weight corresponding to the j-th neuron in
the competition layer and $g(y_j)$ is a response function of $y_j$. If we let
$g(y_j) = \eta y_j$, with $\eta$ the learning step length, we obtain the expression for
modifying the weights:

$\Delta w_j = \eta y_j x - g(y_j) w_j$  (13)

If we let $y_j = h_{i,j(x)}$, then equation (13) can be rewritten as

$\Delta w_j = \eta h_{i,j(x)} (x - w_j)$  (14)

from which we obtain the weight iteration formula

$w_j(t+1) = w_j(t) + \eta(t) \, h_{i,j(x)}(t) \, (x - w_j(t))$  (15)

where $w_j(t)$, $\eta(t)$ and $h_{i,j(x)}(t)$ are the weight, the learning step length and
the neighborhood function after the t-th iteration, respectively. $h_{i,j(x)}(t)$ is
defined as

$h_{i,j(x)}(t) = \exp\left(-\frac{d_{i,j}^2}{2\sigma^2(t)}\right)$  (16)

in which $\sigma(t) = \exp(-t/\tau)$, with $\tau$ a constant determined by the maximum
number of iterations.
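Putting equations (9), (15) and (16) together, one training pass could look as follows. The paper does not give the schedule of eta(t) or an initial neighborhood width, so an exponential decay eta(t) = eta0*exp(-t/tau) and a width sigma0 are assumed here (the text's sigma(t) corresponds to sigma0 = 1); all names are illustrative.

```python
import numpy as np

def som_epoch(W, grid, X, t, tau, eta0=0.5, sigma0=3.0):
    """One pass over the training set X, updating the float matrix W in place.
    W: (H, n) weights; grid: (H, 2) neuron positions; t: iteration index;
    tau: constant tied to the maximum number of iterations (equation 16).
    eta0 and sigma0 are assumed initial values, not given in the paper."""
    eta = eta0 * np.exp(-t / tau)          # assumed decay of the step length eta(t)
    sigma = sigma0 * np.exp(-t / tau)      # sigma(t) as in equation (16)
    for x in X:
        i = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # winner, equation (9)
        d = np.linalg.norm(grid - grid[i], axis=1)          # d_{i,j}, equation (12)
        h = np.exp(-d ** 2 / (2.0 * sigma ** 2))            # neighborhood, equation (16)
        W += eta * h[:, None] * (x - W)                     # weight update, equation (15)
    return W
```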
The SOM described above is an unsupervised learning algorithm. If the labels of the samples
are known, we can go further and use the LVQ (Learning Vector Quantization) algorithm to
refine the clustering centers. The LVQ algorithm is briefly described as follows.


For every input vector $x^{(k)} \in X$, repeat the following process. Assume the weight
vector $w_c$ is the one most similar to the input vector $x^{(k)}$, and that $w_c$ and
$x^{(k)}$ belong to class $C_{w_c}$ and class $C_{x^{(k)}}$, respectively. Then the rule
for modifying the weight vector $w_c$ is as follows.

If $C_{w_c} = C_{x^{(k)}}$, then

$w_c(t+1) = w_c(t) + \alpha_t \, [x^{(k)} - w_c(t)]$,  $\alpha_t \in (0, 1)$.

If $C_{w_c} \ne C_{x^{(k)}}$, then

$w_c(t+1) = w_c(t) - \alpha_t \, [x^{(k)} - w_c(t)]$.

All other weight vectors remain unchanged.
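A sketch of one LVQ step as described above; `labels_w` holds the class assigned to each weight vector (obtained, for instance, by labeling the SOM clusters, which is an assumption on our part), and the other names are illustrative.

```python
import numpy as np

def lvq_step(W, labels_w, x, label_x, alpha):
    """Move the most similar weight vector toward x if their classes agree,
    away from x otherwise; all other weight vectors stay unchanged."""
    c = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # most similar weight vector w_c
    if labels_w[c] == label_x:
        W[c] += alpha * (x - W[c])                      # same class: move toward x
    else:
        W[c] -= alpha * (x - W[c])                      # different class: move away from x
    return W
```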

5. Analysis of the experimental results


Following [5], we obtained the sample set from the People's Daily web edition, 5096 web
pages in all, falling into eight classes: 594 web pages in the International Affairs class,
524 in the Environment Protection class, 820 in the Economy class, 777 in the Military
class, 701 in the Science and Education class, 713 in the Current Affairs and Politics
class, 495 in the Life class and 472 in the Entertainment class. The dictionary used for
parsing contains 316819 words; in other words, the dimension of the web page vector space
after parsing is 316819. We then used the frequency covering rate threshold to filter the
keywords, resulting in a keyword dictionary of 4831 keywords. Afterwards we established the
corresponding feature vectors by counting the number of times these keywords appear in each
web page, so each web page is represented by a feature vector of 4831 dimensions.

During the experiments we used all 5096 samples in the SOM learning and LVQ training
process. For the sake of comparison, we also give the experimental results of a linear
perceptron classifier (3617 samples in the training set, 1470 samples in the test set); the
results of the two classifiers are shown in Table 1 and Table 2.

According to these results, the SOM_LVQ classifier has clear advantages over the perceptron
in web page classification: the average precision and average recall are improved, and,
what is more, the uniformity of per-class performance of SOM_LVQ is also obviously better
than that of the perceptron classifier.
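For reference, the per-class precision and recall reported in Tables 1 and 2 can be computed from a confusion matrix as in this generic sketch (not code from the paper).

```python
import numpy as np

def precision_recall(confusion):
    """confusion[p, a]: number of pages of actual class a assigned to class p.
    Precision is taken over each predicted row, recall over each actual column."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)
    precision = correct / np.maximum(confusion.sum(axis=1), 1)
    recall = correct / np.maximum(confusion.sum(axis=0), 1)
    return precision * 100, recall * 100
```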

References
[1] J. Lv, M. Zhao, Research on the Automatic Information Extraction on Internet (in Chinese), Data Communication, 2000, (1): 5-8.
[2] M. Zhu, J. Wang, J. Wang, Research on the Feature Selection in Web Page Recognition (in Chinese), Computer Engineering, 2000, 26(8): 35-37.
[3] Y. Fan, C. Zheng, Q. Wang, Q. Cai, J. Liu, Web Page Classification with Naïve Bayes Classifier (in Chinese), Journal of Software, 2001, 12(9): 1386-1392.
[4] Z.Z. Shi, Knowledge Discovery (in Chinese), Beijing: Tsinghua University Press, 2002.
[5] Y.Z. Zhang, Web Pages Text Information Mining Based on Content (in Chinese), Postdoctoral Report, Tsinghua University, 2002.
[6] X.L. Li, J. Liu, Z.Z. Shi, A Chinese Web Page Classifier Based on Support Vector Machine and Unsupervised Clustering (in Chinese), Journal of Computers, 2001, 24(1): 62-68.
[7] F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, 2002, 34(1): 1-47.


Table 1. SOM_LVQ classifier's results
(average precision: 90.08%, average recall: 89.85%)

Class           Actual pages   Correctly classified   Precision (%)   Recall (%)
International        594              475                 88.95          79.97
Environment          524              478                 97.15          91.22
Economy              820              770                 87.30          93.90
Military             777              711                 94.17          91.51
Science              701              625                 92.59          89.16
Current              713              647                 84.58          90.74
Life                 495              421                 84.20          85.05
Entertainment        472              452                 91.68          95.76

Table 2. Results of the perceptron classifier
(average precision: 82.59%, average recall: 83.87%)

Class           Actual pages   Correctly classified   Precision (%)   Recall (%)
International        163              139                 63.18          85.28
Environment          149              146                 96.69          97.99
Economy              262              220                 87.65          83.97
Military             230              201                 89.73          87.39
Science              200              185                 84.47          92.50
Current              206               74                 84.09          35.92
Life                 134              122                 71.35          91.04
Entertainment        126              122                 83.56          96.83
