Neural Networks
Jiu-zhen Liang
Institute of Computer Science, Zhejiang Normal University, Jinhua, China, 321004
E-mail: liangjiuzhen@yahoo.com
Abstract
This paper deals with the topology and learning algorithm of self-organizing map (SOM) neural networks and their application to the automatic classification of Chinese web pages. The SOM neural network has the advantages of a simple structure, an ordered mapping topology and low learning complexity, which make it suitable for complex problems such as multi-class pattern recognition, high-dimensional input vectors and large training sets. The accuracy of clustering can be improved by combining the SOM's unsupervised learning algorithm with the LVQ learning algorithm. At the end of the paper, we report the classification results of the SOM neural network applied to 5087 HTML pages of the People's Daily web edition, with an average precision of 90.08% and an average recall of 89.85%.
1. Introduction
Recently, the automatic classification of web pages has become popular in the information-processing field, with more and more people engaged in it. Web automatic classification deals with the text information, structure information and hyperlink information of web pages [1,2]. Nowadays the focus of research is on the automatic classification of web text information, that is, classification based on text content. However, due to the diversity of web page content, the complexity of page structure and other characteristics, it is very difficult to improve the accuracy of automatic classification.
Researchers have designed many kinds of classifiers for this problem, such as the Naïve Bayes classifier [3], K-nearest-neighbor clustering [4], the SOM neural network [5] and the Support Vector Machine (SVM) [6], but each has its own characteristics and conditions of application. For example, the Naïve Bayes classifier rests on the assumption that the features to be classified are orthogonal and follow independent polynomial distributions; K-nearest-neighbor clustering assumes that no sample of another class appears in the neighborhood of the labeled samples; the SOM neural network gives an order-preserving mapping but requires more training time; and the SVM demands the solution of a quadratic programming problem in spite of its strong classifying ability.
Take the neural network classifier for example: its assumptions on the problem's distribution model are much weaker than those of the Naïve Bayes classifier, so it depends less on the particular problem. But fixing the network parameters requires iterated training over a vast number of samples, which brings formidable difficulties to the learning process when the sample set is large and the dimension of the feature space is also very large. Furthermore, the sample quantity is sometimes small compared with the dimension of the feature space. For example, our text-classification problem has several thousand samples while the dimension of the feature space is also several thousand; that is, the two numbers are almost of the same order of magnitude. In such cases, the distribution of samples in the feature space is too sparse, which is bound to harm the generalization capability of classifiers. It is therefore very important to introduce prior knowledge into the selection of the network structure and into the training process. So-called
Proceedings of the Fifth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA03)
0-7695-1957-1/03 $17.00 2003 IEEE
prior knowledge for web page classification means statistical information about word frequency, in the case of text classification based on keyword frequency: especially the word-frequency information of web pages of different classes, and the concentration ratio of keywords within web pages of the same class. This statistical information is of crucial importance for web page classification, and also for the selection of keyword features.
This paper consists of six parts. The first is this introduction. The second part deals with text feature extraction and selection based on keyword frequency. The third part describes the topology of the SOM neural network. The fourth part gives the learning algorithm for our classification problem. The fifth part presents the experimental classification results on the People's Daily web edition. The sixth and last part concludes the paper.
2. Feature Extraction and Selection

Let $tf_i(d)$ denote the frequency of the i-th keyword in document d, N the total number of documents, and $n_i$ the number of documents containing the i-th keyword. Four common weighting functions M are:

Boolean function:

    $M = \begin{cases} 1, & tf_i(d) \geq 1 \\ 0, & tf_i(d) = 0 \end{cases}$    (1)

Word frequency function:

    $M = tf_i(d)$    (2)

Logarithm function:

    $M = \log(tf_i(d) + 1)$    (3)

TFIDF function:

    $M = tf_i(d) \cdot \log(N / n_i)$    (4)
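The four weighting functions above can be sketched in a few lines; the function and parameter names here are ours, chosen for illustration:

```python
import math

def term_weight(tf, N, n_i, scheme="tfidf"):
    """Weight of a keyword with raw frequency tf in a document,
    where N is the corpus size and n_i is the number of documents
    containing the keyword."""
    if scheme == "boolean":        # equation (1)
        return 1 if tf >= 1 else 0
    if scheme == "frequency":      # equation (2)
        return tf
    if scheme == "log":            # equation (3)
        return math.log(tf + 1)
    if scheme == "tfidf":          # equation (4)
        return tf * math.log(N / n_i)
    raise ValueError(f"unknown scheme: {scheme}")
```

Note that the TFIDF weight grows with the keyword's local frequency but shrinks as the keyword becomes common across the corpus.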
Actually we adopt the following four forms for our feature vectors.
1) Simple word-frequency feature. Every record corresponds to a vector $V_j = (x_{j1}, x_{j2}, \ldots, x_{jn})$, where $x_{ji}$ is the number of times the i-th keyword appears in the j-th document.
2) Word frequency divided by document length. Every record corresponds to a vector $V_j = (x_{j1}/L_j, x_{j2}/L_j, \ldots, x_{jn}/L_j)$, where $x_{ji}$ is the same as in 1) and $L_j$ is the length of the j-th document.
3) TF/IDF feature. Every record corresponds to a vector $V_j = (x_{j1}/v_1, x_{j2}/v_2, \ldots, x_{jn}/v_n)$, where $x_{ji}$ is the same as in 1) and $v_i$ is the number of documents in which the i-th keyword appears (the so-called inverse document frequency).
4) Feature normalized by vector length. Every record corresponds to a vector $V_j = (x_{j1}/S_j, x_{j2}/S_j, \ldots, x_{jn}/S_j)$, where $x_{ji}$ is the same as in 1), while $S_j = \sqrt{\sum_{i=1}^{n} x_{ji}^2}$ is the Euclidean length of the raw frequency vector.

To measure how strongly a keyword is associated with a class, we use

    $x_{ij} = f_{ij} \cdot \dfrac{d_{ij}}{N_j}$    (5)
where $f_{ij}$ is the frequency of the i-th keyword in the documents of the j-th class, $d_{ij}$ is the number of j-th-class documents in which the i-th keyword appears, and $N_j$ is the total number of j-th-class documents. $x_{ij}$ reflects the degree of correlation between the i-th keyword and the j-th class: the bigger $x_{ij}$ is, the more precisely the keyword characterizes the j-th class, and the more useful it is for classification. So we can set a threshold T to filter the keyword features:
    $x_i = \max_j \{ x_{ij} \}$    (6)
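Equations (5) and (6) amount to scoring each keyword per class and keeping the keywords whose best class score clears the threshold T. A small sketch, with illustrative names of our own choosing:

```python
def keyword_score(f, d, n_docs):
    """Equation (5): x_ij = f_ij * d_ij / N_j, the correlation of a
    keyword with one class.  f: frequency of the keyword in that
    class's documents; d: number of that class's documents containing
    the keyword; n_docs: total documents in the class."""
    return f * d / n_docs

def select_keywords(scores, threshold):
    """Keep keyword i when x_i = max_j x_ij (equation (6)) exceeds
    the threshold T.  scores[i] is the list of x_ij over classes."""
    return [i for i, per_class in enumerate(scores)
            if max(per_class) > threshold]
```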
3. SOM Neural Network Topology

Each neuron j in the competition layer computes the inner product of its weight vector $w_j$ with the input vector x:

    $y_j = w_j^T x = \sum_{i=1}^{n} w_{ij} x_i$,   j = 1, 2, ..., H    (7)

or, in matrix form,

    $y = Wx$    (8)
By self-organizing competition, the SOM neural network can turn the disordered input set $X = \{x^{(k)}\}_{k=1}^{K}$ into an ordered topological arrangement in the competition layer, such as a distribution of clustering centers. Among all the objects in the input set, the clustering centers are the connection weight vectors of the winner units in the competition layer's output. Self-organization is thus the process of looking for the connection weights that best match the input vector. We use the following criterion to find the winner neuron:
    $h(x) = \arg\min_{j} \| x - w_j \|$,   j = 1, 2, ..., H    (9)
In the above equation, $\|\cdot\|$ is the Euclidean norm. In fact, equation (9) describes a mapping from the continuous input vector space $R^n \supseteq X$ to the discrete neuron output space {j = 1, 2, ..., H} through competition among the competing units, with the restriction $h(x) \in \{1, 2, \ldots, H\}$. By doing so we achieve an ordered division of the set X in the input space $R^n$.
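The winner search of equation (9) is a nearest-neighbor lookup over the weight rows; a minimal sketch (the function name is ours):

```python
import numpy as np

def winner(x, W):
    """Equation (9): the winner is the competition-layer neuron whose
    weight vector w_j is closest to x in the Euclidean norm,
    h(x) = argmin_j ||x - w_j||.  W has one row per neuron."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))
```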
Another problem to discuss is how to define the topological neighboring areas of these centers. Since in the competition layer a class is often represented by several neurons, the usual approach is to define a neighbor distance. For example, if $d_{i,j}$ denotes the distance between an excited neuron j and the winner neuron i, the topological neighborhood of the center neuron i can be written as follows:
    $h_{i,j}(x) = f(d_{i,j}, x)$    (10)
where the neighboring function $f(d_{i,j}, x)$ has two properties: 1) it is symmetrical about the center neuron, with $d_{i,j} = 0$ denoting the center neuron i itself; 2) it is a decreasing function of $d_{i,j}$. The Gaussian neighbor function is one such choice:
    $h_{i,j}(x) = f(d_{i,j}, x) = \exp\left( -\dfrac{d_{i,j}^2}{2\sigma^2} \right)$    (11)

    $d_{i,j} = \| r_i - r_j \|$    (12)
Here, $r_i$ and $r_j$ are the locations of the winner neuron i and the excited neuron j in the array-arranged two-dimensional discrete space, respectively.
4. Learning Algorithm
The learning of the SOM neural network can resort to the Hebb learning rule. However, considering the irreversible transfer of information, we have to introduce an oblivion (forgetting) term $g(y_j) w_j$, where $w_j$ is the connection weight vector of the j-th neuron in the competition layer and $g(y_j)$ is a response function of $y_j$. If we let $g(y_j) = \eta y_j$, with $\eta$ the learning step length, then we get the expression for modifying the weights:
    $\Delta w_j = \eta y_j x - g(y_j) w_j$    (13)

If we let $y_j = h_{i,j}(x)$, then

    $\Delta w_j = \eta \, h_{i,j}(x) \, (x - w_j)$    (14)

and the iterative update becomes

    $w_j(t+1) = w_j(t) + \eta(t) \, h_{i,j}(x)(t) \, [x - w_j(t)]$    (15)
where $w_j(t)$, $\eta(t)$ and $h_{i,j}(x)(t)$ are the weight, learning step length and neighbor function after the t-th iteration, respectively. $h_{i,j}(x)(t)$ is defined as follows:
    $h_{i,j}(x)(t) = \exp\left( -\dfrac{d_{i,j}^2}{2\sigma^2(t)} \right)$    (16)

in which $\sigma(t) = \sigma_0 \exp(-t/\tau)$, with $\tau$ a constant designating the maximum number of iterations.
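Putting equations (9) and (13)-(16) together, the training loop can be sketched as follows. The function name, the grid layout and the defaults sigma0, eta0 and seed are illustrative assumptions of ours, not values given in the paper:

```python
import numpy as np

def train_som(X, grid, T, sigma0=2.0, eta0=0.5, seed=0):
    """Sketch of SOM training: for each sample, find the winner by
    equation (9), then move every unit toward x by
    eta(t) * h_{i,j}(t) * (x - w_j) as in (14)-(15), with a Gaussian
    neighborhood (16) whose width sigma(t) = sigma0*exp(-t/T) shrinks
    over the T iterations.  `grid` lists the 2-D positions r_j of the
    competition-layer neurons."""
    rng = np.random.default_rng(seed)
    W = rng.random((len(grid), X.shape[1]))   # random initial weights
    R = np.asarray(grid, dtype=float)         # neuron positions r_j
    for t in range(T):
        sigma = sigma0 * np.exp(-t / T)       # shrinking width, eq. (16)
        eta = eta0 * np.exp(-t / T)           # decaying step length
        for x in X:
            i = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # winner, eq. (9)
            d2 = np.sum((R - R[i]) ** 2, axis=1)               # d_{i,j}^2, eq. (12)
            h = np.exp(-d2 / (2 * sigma ** 2))                 # neighborhood, eq. (16)
            W += eta * h[:, None] * (x - W)                    # update, eq. (14)
    return W
```

Early in training the wide neighborhood drags whole regions of the map together, which is what produces the ordered topology; late in training only the winner and its closest neighbors move.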
The above SOM is an unsupervised learning algorithm. If the labels of the samples are known, we can go further and use the LVQ (Learning Vector Quantization) algorithm to refine the clustering centers. The LVQ algorithm is described briefly as follows.
For every input vector $x^{(k)} \in X$, repeat the following process. Assume the weight vector $w_c$ is the one most similar to the input vector $x^{(k)}$, and that $w_c$ and $x^{(k)}$ belong to class $C_{w_c}$ and class $C_{x^{(k)}}$, respectively. Then the modification rule for the weight vector $w_c$ is:

If $C_{w_c} = C_{x^{(k)}}$, then

    $w_c(t+1) = w_c(t) + \eta(t) \, [x^{(k)} - w_c(t)]$

If $C_{w_c} \neq C_{x^{(k)}}$, then

    $w_c(t+1) = w_c(t) - \eta(t) \, [x^{(k)} - w_c(t)]$
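One supervised refinement step in the style just described (pull the closest codebook vector toward a sample of its own class, push it away from a sample of another class, as in standard LVQ1) can be sketched as follows; the function and variable names are ours:

```python
import numpy as np

def lvq_step(W, labels, x, x_label, eta):
    """One LVQ update: find the codebook vector w_c closest to x,
    then pull it toward x when labels match, push it away otherwise.
    W: one codebook vector per row; labels: class of each row."""
    c = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    if labels[c] == x_label:
        W[c] += eta * (x - W[c])   # matching class: move closer
    else:
        W[c] -= eta * (x - W[c])   # wrong class: move away
    return W
```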
References
[1] J. Lv, M. Zhao, Research on the Automatic Information Extraction on Internet (in Chinese), Data Communication, 2000, (1): 5-8.
[2] M. Zhu, J. Wang, J. Wang, Research on the Feature Selection in Web Page Recognition (in Chinese), Computer Engineering, 2000, 26(8): 35-37.
[3] Y. Fan, C. Zheng, Q. Wang, Q. Cai, J. Liu, Web Page Classification with Naïve Bayes Classifier (in Chinese), Journal of Software, 2001, 12(9): 1386-1392.
[4] Z.Z. Shi, Knowledge Discovery (in Chinese), Beijing: Tsinghua University Press, 2002.
[5] Y.Z. Zhang, Web Pages Text Information Mining Based on Content (in Chinese), Postdoctoral Report, Tsinghua University, 2002.
[6] X.L. Li, J. Liu, Z. Shi, A Chinese Web Page Classifier Based on Support Vector Machine and Unsupervised Clustering (in Chinese), Journal of Computer, 2001, 24(1): 62-68.
[7] F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, 2002, 34(1): 1-47.
5. Experiment

Table 1. Classification results on the People's Daily web edition pages (per-class totals, correctly classified pages, precision and recall).

Class           Total   Correct   Precision (%)   Recall (%)
International     594       475           88.95        79.97
Environment       524       478           97.15        91.22
Economy           820       770           87.30        93.90
Military          777       711           94.17        91.51
Science           701       625           92.59        89.16
Current           713       647           84.58        90.74
Life              495       421           84.20        85.05
Entertainment     472       452           91.68        95.76

Table 2. Classification results on a second, smaller page set.

Class           Total   Correct   Precision (%)   Recall (%)
International     163       139           63.18        85.28
Environment       149       146           96.69        97.99
Economy           262       220           87.65        83.97
Military          230       201           89.73        87.39
Science           200       185           84.47        92.50
Current           206        74           84.09        35.92
Life              134       122           71.35        91.04
Entertainment     126       122           83.56        96.83
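The precision and recall figures in these tables follow the usual definitions over a confusion matrix: precision divides the correctly classified pages of a class by all pages assigned to that class, recall divides them by all pages actually in that class. A minimal sketch (the helper name is ours):

```python
def precision_recall(confusion, k):
    """Precision and recall for class k from a confusion matrix whose
    entry confusion[i][j] counts class-i pages assigned to class j."""
    correct = confusion[k][k]
    predicted = sum(row[k] for row in confusion)   # column sum: assigned to k
    actual = sum(confusion[k])                     # row sum: truly in k
    return correct / predicted, correct / actual
```

For example, in Table 1 the recall for the International class is 475/594 = 79.97%.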