
Tạp chí Khoa học (Journal of Science) 2012:21a 52-63, Trường Đại học Cần Thơ (Can Tho University)

TEXT DOCUMENT CLASSIFICATION WITH SUPPORT VECTOR MACHINE
AND DECISION TREE
Trần Cao Đệ and Phạm Nguyên Khang¹

ABSTRACT
Text document classification can essentially be treated as a standard classification problem. Automatic text classification assigns a label to a new document based on its similarity to the labeled documents in a training set. Many machine learning and data mining methods have been applied to text classification, such as Naive Bayes, decision trees, k-nearest neighbors, and neural networks.
The support vector machine (SVM) is an efficient classification algorithm that has been widely applied in machine learning and pattern recognition. However, it has not yet been applied efficiently to text classification because this problem, by its nature, involves a very large feature space. This paper applies SVM to text document classification and compares its effectiveness with that of the decision tree, a traditional classification algorithm. The experiments show that SVM combined with feature selection based on singular value decomposition (SVD) reaches a higher average F1 than the decision tree method (91.0% versus 84.1%).
Keywords: decision tree, support vector machine (SVM), text document classification, singular value decomposition (SVD)
Title: Text document classification with support vector machine and decision tree
TÓM TẮT (SUMMARY)
The text classification problem can essentially be viewed as a classification problem. Automatic text classification assigns class labels to a new document based on how similar that document is to the labeled documents in the training set. Many machine learning and data mining techniques have been applied to text classification, for example Naive Bayes, decision trees, k-nearest neighbors (KNN), and neural networks.
The support vector machine (SVM) is a highly effective classification algorithm that is widely used in data mining and pattern recognition. However, SVM has not yet been applied effectively to text classification, because the feature space of this problem is usually very large. This paper studies the SVM, applies it to text classification, and compares its effectiveness with that of a classical and very popular classification algorithm, the decision tree. The study shows that SVM with feature selection by singular value decomposition (SVD) gives better results than the decision tree.
Keywords: decision tree, support vector machine, text classification, singular value decomposition
1 INTRODUCTION TO THE TEXT CLASSIFICATION PROBLEM
Text classification is a classical text processing problem: it maps a document to a known topic, chosen from a finite set of topics, based on the semantics of the document. For example, a newspaper article may belong to one (or several) topics such as sports, health, or information technology. Automatically assigning documents to topics makes it easier to organize, store, and retrieve them later.
¹ Faculty of Information and Communication Technology, Can Tho University
The distinguishing features of this problem are the diversity of document topics and the fact that one document may cover several topics. Because of this multi-topic nature, classification is only relative: it is somewhat subjective when done by people and prone to ambiguity when done automatically. An article about Education can clearly also be filed under Economics if it discusses funding for education and the effect of that investment on the economy and society. In essence, a document is a set of related words that together form its semantic content. The vocabulary of a document is diverse because language itself is diverse (synonyms, polysemy, loanwords, and so on), and the number of words to consider is large. Note that a single document may contain few words, but the number of words that must be considered is very large, because it has to cover every word of the language in question.
Many studies worldwide have reported promising results, especially for English text classification. However, research and applications for Vietnamese text remain limited because of the difficulty of word and sentence segmentation. Domestic studies have followed several different approaches to text classification, including: classification with support vector machines [1], an approach based on rough set theory [2], a statistical morpheme-based approach [3], an approach that uses unsupervised learning and indexing [4], and an approach based on association rules [5]. According to the results reported in these works, all of these approaches perform fairly well. However, the results are hard to compare with one another because each method was evaluated on a different data set. This paper compares the effectiveness of two approaches to text classification: classification with the decision tree algorithm, and classification with a support vector machine combined with singular value decomposition (SVD).
In both approaches, a document is first treated as a set of words. We apply the MMSEG algorithm [6] for word segmentation. The next section describes how documents are modeled before the decision tree and SVM classifiers are applied.
2 DOCUMENT MODELING
In practice, before a word segmentation algorithm can be applied, a document must go through basic preprocessing steps: diacritic normalization, normalization of "i" and "y", font normalization, and so on. These steps are not discussed here because of space limits. A document can be viewed as a set of "words". Here a "word" means a sequence of consecutive characters in the document; it need not be a meaningful word of the language. Identifying the words, or word segmentation, is done by some algorithm. The MMSEG method [6] and its improvements are currently widely used for Vietnamese word segmentation. Some proposals are language-independent, such as the n-gram method; in Vietnamese, for instance, two adjacent syllables in the text are taken as a 2-gram. A 2-gram is therefore not necessarily a correct Vietnamese word. In this study we use the MMSEG algorithm to segment Vietnamese words. The algorithm was originally designed for Chinese segmentation [7], where it reaches 99% accuracy. Many studies have applied MMSEG to Vietnamese word segmentation, but we have not found any formal report of the segmentation results. In our study, MMSEG is suitable for text classification for two reasons: it segments words with fairly high accuracy, above 95%, and a segmentation error rate of about 5% does not greatly affect the classification results.
After word segmentation, a document is regarded as a set of "words". The term is in quotation marks because these are words produced by the segmentation algorithm, and they need not be meaningful in the language. With MMSEG, every segmented word is meaningful (it appears in the dictionary), but it is not necessarily fully correct in the context of the document (semantically). Figure 1 shows an example of a passage segmented by the MMSEG algorithm.
Ai/ cũng/ biết/ không gian/ có thể/ tác động/ đến/ con người.
Mặt trời/ gây nên/ nhiều/ vấn đề/ nơi/ một số/ người/ nhạy cảm/ trước/ những/ đổi thay/ của/ thời tiết.
Bên cạnh/ việc/ gây nên/ biến động/ thuỷ triều/ mặt trăng/ còn/ là/ nguyên nhân/ của/ hiện tượng/ mộng du/ bước đi/ trong khi/ ngủ.
Dường như/ ai/ cũng/ nghe nói/ địa cầu/ chúng ta/ có thể/ là/ nơi/ đổ bộ/ của/ các/ thiên thạch/ vào/ một/ ngày/ vô định/ nào đó.
Figure 1: Example of word segmentation with the MMSEG algorithm
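The dictionary-based idea behind this kind of segmentation can be sketched with plain forward maximum matching. This is only the simplest variant that MMSEG builds on, not the full MMSEG rule set of [6]; the function name `max_match`, the `max_len` limit, and the four-entry lexicon are hypothetical and chosen for illustration.

```python
def max_match(syllables, lexicon, max_len=4):
    """Greedy forward maximum matching over a list of syllables."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable is always accepted.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + size])
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy lexicon; a real system would load a full dictionary.
LEXICON = {"không gian", "có thể", "tác động", "con người"}
sentence = "ai cũng biết không gian có thể tác động đến con người"
segmented = max_match(sentence.split(), LEXICON)
print("/ ".join(segmented))
```

Syllables that are not covered by any multi-syllable dictionary entry are emitted as one-syllable words, which mirrors the first sentence of Figure 1.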
Clearly, the words in a document differ in how important they are to the document and to its classification. Some words, such as conjunctions and quantifiers (và, các, những, mỗi, ...), carry no discriminating power for classification. In addition, many other words have no classification value, for example words that appear in almost every document or words that are very rarely used; these so-called stopwords need to be removed. There are several ways to remove stopwords, such as using a stopword list or filtering by word frequency (the TF*IDF score). In our experiments we use a stopword list combined with removing words whose TF*IDF score is low. A low TF*IDF score means that the word appears in almost all documents or that it appears very rarely.
After stopword removal, a document can be viewed as a set of features, namely the remaining "important" words that represent the document. Classification is based on these features. However, the number of features of one document is large, and the feature space (all features) of all the documents under consideration is very large; in principle it includes every word of the language. Classification based on these features therefore needs a feature processing and selection step that reduces the dimensionality of the feature space. In practice, one cannot consider every word of the language; instead one uses the set of words extracted from a (sufficiently large) collection of documents, called the corpus.
Next, each document $d_i$ in the corpus is modeled as a vector of feature weights, $d_i = (w_{i1}, \dots, w_{im})$. In this paper, the weight of a word is computed from the frequency of the word in the document (TF) and the inverse document frequency of the word in the corpus (IDF):

$$w_{ij} = TF_{ij} \cdot \log\left(\frac{N}{DF_j}\right) \qquad (1)$$

where:
- $TF_{ij}$ is the number of occurrences of the j-th word in the i-th document.
- $DF_j$ is the total number of documents in the corpus that contain the j-th word.
- $N$ is the total number of documents in the corpus.
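The TF*IDF weighting described above can be sketched in a few lines. The function name `tfidf_vectors` and the three toy documents are hypothetical; the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of weight vectors)."""
    n_docs = len(docs)
    # DF_j: number of documents that contain term j at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # TF_ij: raw count of term j in document i
        # Natural log is an assumption; any fixed base only rescales the weights.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Hypothetical toy corpus of already segmented documents.
docs = [["goal", "match", "goal"], ["market", "bank"], ["goal", "market"]]
vocab, vectors = tfidf_vectors(docs)
for doc_vector in vectors:
    print([round(w, 3) for w in doc_vector])
```

A word that occurs in every document gets weight 0 because log(N/DF) = 0, which is consistent with the remark that words found in almost all documents have a low TF*IDF score.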
3 TEXT CLASSIFICATION WITH THE DECISION TREE METHOD
The decision tree method [8] can be applied to text classification. A decision tree is built from the set of training documents (hereafter the training set). The tree is binary, and each internal node corresponds to a partition of the document set based on some attribute (a word). Building the tree depends on how the splitting attribute is chosen.
Following [8], we choose the splitting attribute with the largest information gain, which is the difference between the entropy before and after splitting on that attribute. The information gain is computed from the entropy given in formula (2). Suppose the training set S contains documents belonging to k topics; the entropy of S is then:

$$Entropy(S) = \sum_{i=1}^{k} \left( -p_i \log_2 p_i \right) \qquad (2)$$

where $p_i$ is the probability that an element (one document) belongs to the i-th topic; $p_i$ is simply the relative frequency of documents of the i-th topic in S.
The information gain obtained when attribute a is used to split S into subsets according to the values of a (denoted Values(a) in the formula) is:

$$Gain(S, a) = Entropy(S) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|} \, Entropy(S_v) \qquad (3)$$
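Formulas (2) and (3) can be checked on a tiny example. The sketch below is illustrative only: the four labels and the two boolean attributes (`has_goal`, `has_the`) are hypothetical, and the attribute is treated as categorical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) for a list of class labels, formula (2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, a), formula (3); values[i] is the value of attribute a for document i."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

labels = ["sports", "sports", "economy", "economy"]
has_goal = [True, True, False, False]   # separates the two topics perfectly
has_the = [True, False, True, False]    # says nothing about the topic
gain_goal = information_gain(has_goal, labels)
gain_the = information_gain(has_the, labels)
```

With two equally frequent topics the entropy of S is 1 bit; the attribute that splits the topics cleanly recovers the whole bit, while the uninformative one gains nothing.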
3.1 Decision tree construction algorithm
Input:
- The set M of all training documents, modeled as vectors $d_i = (w_{i1}, \dots, w_{im})$
- The set A of all words in the training set M (after stopword removal)
- A set of topics C.
Output: a binary decision tree for classification over the topic set C.
Algorithm (see [9]):
- Start: the root node contains all training documents.
- If the data at a node belongs to only one topic (one class), the node is a leaf and is labeled with that topic.
- If a node contains non-homogeneous data (belonging to different classes), choose the splitting attribute with the largest information gain (say attribute a with value y, where y is called the split value) and recursively divide the node into two sets M1 and M2: M1 holds the documents whose value for a is less than y, and M2 holds the documents that contain a with a value greater than or equal to y.
The algorithm stops when all leaf nodes are labeled. In practice, one may stop before the data at a node is homogeneous (belongs to a single class): splitting stops when the number of elements at the node falls below some threshold, and the node is labeled by majority vote over the elements it contains. This speeds up tree construction and avoids overfitting.
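The recursive construction above, including the early stop with majority voting, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the nested-dict tree layout, the `min_size` parameter, the choice of candidate split values (the distinct non-zero weights of a word), and the four toy documents are all assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, as in formula (2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weight(doc, word):
    """Weight of a word in a document; an absent word counts as 0."""
    return doc.get(word, 0.0)

def best_split(docs, labels, vocabulary):
    """Return (gain, word, y) for the binary split with the largest gain."""
    base, n = entropy(labels), len(labels)
    best = (0.0, None, None)
    for word in vocabulary:
        # Candidate split values y: the distinct non-zero weights of the word.
        for y in sorted({weight(d, word) for d in docs} - {0.0}):
            left = [lab for d, lab in zip(docs, labels) if weight(d, word) < y]
            right = [lab for d, lab in zip(docs, labels) if weight(d, word) >= y]
            if not left or not right:
                continue
            gain = (base - len(left) / n * entropy(left)
                    - len(right) / n * entropy(right))
            if gain > best[0]:
                best = (gain, word, y)
    return best

def build_tree(docs, labels, vocabulary, min_size=2):
    """Recursively build a binary tree stored as nested dicts."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop on a pure node or on a node that has become too small.
    if len(set(labels)) == 1 or len(labels) < min_size:
        return {"label": majority}
    _, word, y = best_split(docs, labels, vocabulary)
    if word is None:  # no split improves on the current node
        return {"label": majority}
    left = [(d, lab) for d, lab in zip(docs, labels) if weight(d, word) < y]
    right = [(d, lab) for d, lab in zip(docs, labels) if weight(d, word) >= y]
    return {
        "word": word,
        "threshold": y,
        "left": build_tree([d for d, _ in left], [lab for _, lab in left],
                           vocabulary, min_size),
        "right": build_tree([d for d, _ in right], [lab for _, lab in right],
                            vocabulary, min_size),
    }

# Hypothetical toy training set: word -> TF*IDF weight per document.
docs = [{"goal": 1.2, "team": 0.4}, {"goal": 0.8},
        {"market": 1.1}, {"market": 0.7, "team": 0.3}]
labels = ["sports", "sports", "economy", "economy"]
tree = build_tree(docs, labels, ["goal", "market", "team"])
print(tree)
```

Documents that lack the chosen word are given weight 0 and therefore fall into M1, which matches the traversal rule used when classifying a new document.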
3.2 Evaluating a machine learning algorithm
Several common measures are used to evaluate a machine learning algorithm or, more specifically, a two-class classifier whose classes are called positive and negative:
- True positives (TP): the number of positive elements classified as positive
- False negatives (FN): the number of positive elements classified as negative
- True negatives (TN): the number of negative elements classified as negative
- False positives (FP): the number of negative elements classified as positive
- Precision = TP/(TP + FP)
- Recall = TP/(TP + FN)
- F1 measure = 2*Precision*Recall/(Precision + Recall)
These measures are used later, in the experimental section, to evaluate the decision tree and the SVM.
3.3 Pruning the decision tree
A freshly built decision tree is usually large; it does not generalize and instead overfits the training set. To make the tree more general, so that it adapts to new, unseen samples, some branches are cut off. This is called pruning, and it uses a validation set that is independent of the training set. The procedure below is post-pruning:
- For each internal node (not a leaf), cut the branches below the node so that the node becomes a leaf, and label it by majority vote.
- Use the independent validation set to measure the accuracy (precision) of the new tree after each pruning step.
- If the accuracy of the tree increases after pruning, keep the pruning and continue with the remaining internal nodes; otherwise restore the previous state (do not prune).
The algorithm stops when every node has been considered for pruning.
Pruning in this way has a high time complexity because the validation set must be used to estimate the error caused by each pruning step. In practice we apply the tree construction algorithm with majority voting: if the majority exceeds a given threshold, splitting stops. We therefore do not perform the pruning step.
3.4 Classifying a new document
Once the decision trees are built, they are ready to be used for text classification. A new document (to be classified) is regarded as a set of features (words). We traverse the decision tree to assign a topic label to the document. The traversal is similar to searching a binary search tree:
- If the word is in the document and its value is less than the split value at the node, or the word is not in the document, continue with the left subtree.
- If the word is in the document and its value is greater than or equal to the split value at the node, continue with the right subtree.
- The process stops when the current node is a leaf; the document receives the label of that leaf.
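The traversal rules above translate directly into a short loop. The nested-dict tree layout, the function name `classify`, and the hand-built example tree are hypothetical and serve only as an illustration.

```python
def classify(tree, doc):
    """Walk the tree until a leaf is reached and return its label."""
    node = tree
    while "label" not in node:
        value = doc.get(node["word"])  # None when the word is absent
        if value is None or value < node["threshold"]:
            node = node["left"]        # absent word or value below the split value
        else:
            node = node["right"]       # value at or above the split value
    return node["label"]

# Hypothetical hand-built tree with two internal nodes.
tree = {
    "word": "goal", "threshold": 0.8,
    "left": {
        "word": "market", "threshold": 0.5,
        "left": {"label": "education"},
        "right": {"label": "economy"},
    },
    "right": {"label": "sports"},
}

print(classify(tree, {"goal": 1.0}))
print(classify(tree, {"market": 0.9}))
print(classify(tree, {"goal": 0.2}))
```

Each document follows exactly one root-to-leaf path, so classification costs at most as many comparisons as the tree is deep.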
4 TEXT CLASSIFICATION WITH THE SUPPORT VECTOR MACHINE
The support vector machine has recently been applied to text classification with promising results [1,12]. However, as noted earlier, the features in text classification are words, so the feature space is very large: it includes every word of the language or of the corpus. A high-dimensional feature space increases noise, which is a major obstacle to applying SVM to text classification. To apply SVM effectively, the dimensionality of the feature space must be reduced. In [1], the authors propose using mutual information to discard some features. In this study we use singular value decomposition (SVD) to reduce the dimensionality of the feature space.
4.1 Singular value decomposition (SVD)
Singular value decomposition is the mathematical foundation of latent semantic indexing (LSI), a technique widely used in text search and information retrieval. The main idea of the algorithm [10,11] is as follows.
Given a matrix A of size m x n, A can always be decomposed into a product of three matrices of the form $A = U \Sigma V^T$, where:
- $U$ is an m x m orthogonal matrix whose columns are the left singular vectors of A.
- $\Sigma$ is an m x n matrix whose diagonal holds the non-negative singular values in decreasing order: $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min(m,n)} \ge 0$
- $V$ is an n x n orthogonal matrix whose columns are the right singular vectors of A.
The rank of A is the number of non-zero entries on the main diagonal of $\Sigma$. A is usually a large sparse matrix. To reduce its dimensionality, one approximates A (of rank r) by a matrix $A_k$ of rank k, with k much smaller than r. The approximation of A under this technique is $A_k = U_k \Sigma_k V_k^T$, where:
- $U_k$ is the m x k matrix with orthonormal columns formed by the first k columns of $U$.
- $\Sigma_k$ is the k x k diagonal matrix holding the first k values $\sigma_1, \sigma_2, \dots, \sigma_k$ on its main diagonal.
- $V_k$ is the n x k matrix with orthonormal columns formed by the first k columns of $V$.
This approximation can be seen as mapping the space under consideration (r dimensions) to a k-dimensional space, with k much smaller than r. In practice, truncating A to k dimensions also removes noise and strengthens the latent semantic links between the words in the document set. We apply this approximation to reduce the dimensionality of the feature space. Initially, each document is modeled as a column vector in the space defined by $A_{m \times n}$. After $A_{m \times n}$ is truncated to $A_k$, every vector under consideration is projected onto the k-dimensional space of $A_k$ by the formula:

$$Proj(x) = x^T U_k \Sigma_k^{-1} \qquad (4)$$
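The truncation and the projection of formula (4) can be tried on a small matrix. The sketch below uses NumPy, which the paper does not mention (the authors work with Weka); the 6 x 4 random matrix, `k = 2`, and the helper name `project` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))  # toy term-by-document matrix: 6 terms, 4 documents
k = 2

# Thin SVD: A = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of A

def project(x, U_k, s_k):
    """Formula (4): Proj(x) = x^T U_k Sigma_k^{-1}."""
    return (x @ U_k) / s_k

# Project every document (column of A) into the k-dimensional space.
docs_k = np.array([project(A[:, j], U_k, s_k) for j in range(A.shape[1])])
print(docs_k.shape)
```

For a document that is already a column of A, the projection coincides with the corresponding row of $V_k$; the same function also maps a new, unseen document vector into the reduced space.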
4.2 Support vector machine

Figure 2: Example of a maximum-margin hyperplane in $R^2$

The support vector machine (SVM) is a machine learning algorithm based on the statistical learning theory developed by Vapnik and Chervonenkis [13]. The basic SVM problem is two-class classification: given n points in a d-dimensional space, each belonging to a class denoted +1 or -1, the goal of the SVM algorithm is to find an optimal separating hyperplane that divides the points into two parts so that the points of the same class lie on the same side of the hyperplane. Figure 2 illustrates SVM classification in the plane.
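Consider a linearly separable sample set $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ with $x_i \in R^d$ and $y_i \in \{\pm 1\}$. The optimal hyperplane is the one that separates the data into two distinct classes with the largest margin. That is, we look for the hyperplane $H: w \cdot x + b = 0$ and two supporting hyperplanes $H_1$, $H_2$ that are parallel to $H$ and at the same distance from $H$, under the condition that no sample lies between $H_1$ and $H_2$, so that:
$w \cdot x + b \ge +1$ for $y = +1$
$w \cdot x + b \le -1$ for $y = -1$
Combining the two conditions gives $y(w \cdot x + b) \ge 1$.
The distance from $H_1$ and $H_2$ to $H$ is $1/\|w\|$. We need the hyperplane $H$ with the largest margin, that is, we solve the optimization problem $\min_{w,b} \|w\|$ subject to $y(w \cdot x + b) \ge 1$.
This can be converted into an equivalent problem that is easier to solve, $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y(w \cdot x + b) \ge 1$. The solution of this optimization problem is obtained by minimizing the Lagrangian with respect to w and b:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \qquad (5)$$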
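where $\alpha = (\alpha_1, \dots, \alpha_n)$ are the Lagrange multipliers, $\alpha_i \ge 0$. The problem is then converted into its dual, which maximizes the function $W(\alpha)$:

$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) \qquad (6)$$

subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$. Solving it yields the optimal values of w, b, and $\alpha$. Afterwards, classifying a new sample only requires evaluating the sign function $sign(w \cdot x + b)$.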
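The margin constraint and the sign-based decision rule can be illustrated without a solver. In the sketch below the four 2-D samples and the values of `w` and `b` are chosen by hand for illustration, not obtained from an SVM training run.

```python
import math

def decision(w, b, x):
    """Value of w.x + b for a sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Class label given by the sign of w.x + b."""
    return 1 if decision(w, b, x) >= 0 else -1

# Hypothetical linearly separable samples (x, y) in the plane.
samples = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
w, b = (0.5, 0.5), -1.0  # hand-picked hyperplane x1 + x2 = 2, rescaled

# Every sample must satisfy y (w.x + b) >= 1.
feasible = all(y * decision(w, b, x) >= 1 for x, y in samples)
# Distance from H1 (or H2) to H.
margin = 1 / math.sqrt(sum(wi * wi for wi in w))
print(feasible, round(margin, 3))
```

The samples (2, 2) and (0, 0) satisfy the constraint with equality, so they lie on the supporting hyperplanes H1 and H2 and play the role of support vectors in this toy case.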
The search for the optimal hyperplane can be extended to data that is not linearly separable [11] by mapping the data into a higher-dimensional space with a kernel function K. Table 1 lists some commonly used kernel functions.
Table 1: Some commonly used kernel functions

We do not go into the details of solving the hyperplane problem here; interested readers can find the solution in the work of Vapnik [13]. We use the Weka software [14] to run the classification computations and to test the proposed method.
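For orientation, three kernels that are commonly used with SVMs can be written as plain functions. Whether these are exactly the kernels listed in Table 1 is an assumption; the parameter names and default values (`degree`, `coef0`, `gamma`) are illustrative and are not Weka option names.

```python
import math

def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return dot(x, z)

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=2 ** -8):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```

A kernel replaces the inner product $x_i \cdot x_j$ in the dual problem, so the data never has to be mapped explicitly into the higher-dimensional space.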
5 EXPERIMENTAL RESULTS
For the experiments, 7842 documents from 10 different topics were collected to build the classifiers and to validate their effectiveness. The documents were gathered from popular Vietnamese-language online newspapers such as vnexpress.net, vietnamnet.vn, and dantri.com.vn. After word segmentation and stopword removal, 14275 words remain. After modeling, each document is a vector of word weights, where the weights are the TF*IDF scores described above. The corpus is thus modeled as a matrix of TF*IDF scores of size 14275 x 7842.
Table 2 gives the number of documents per topic. In each topic, 500 documents are chosen at random for training, that is, for building the decision tree or training the SVM. The remaining documents are used for independent validation. For convenience, the two sets are called the training set and the independent validation set.
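The per-topic random split described above can be sketched as follows. The function name `split_per_class`, the fixed `seed`, and the use of only two topics (with their totals taken from Table 2) are assumptions made to keep the example small.

```python
import random

def split_per_class(labels, n_train=500, seed=0):
    """Pick n_train random documents per class for training; the rest is for validation."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)

# Two of the ten topics, with the document totals listed in Table 2.
labels = ["IT"] * 786 + ["Sports"] * 788
train, test = split_per_class(labels)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so that both classifiers can be trained and validated on exactly the same documents.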
The evaluation is based on precision, recall, and F1. The validation results of the decision trees on the independent validation set are given in Table 3. The resulting measures are given in Table 5, where they are compared with the validation results of the SVM.
Table 2: The 10 topics and the number of samples used in the experiments
Class             Training samples   Validation samples   Total samples (documents)
IT (CNTT)               500                 286                   786
Telecom (ĐTVT)          500                 282                   782
Education               500                 299                   799
Cuisine                 500                 291                   791
Real estate             500                 265                   765
Science                 500                 282                   782
Economics               500                 291                   791
Medicine                500                 287                   787
Sports                  500                 288                   788
Entertainment           500                 271                   771
Total                  5000                2842                  7842
Table 3: Validation results of the decision tree classifier (rows: true class; columns: predicted class code)
Class            Code     1     2     3     4     5     6     7     8     9    10
IT (CNTT)          1    250     6     8     3     4     3     2     3     3     4
Telecom (ĐTVT)     2     12   227     6     5     5     4     7     5     6     5
Education          3      9    10   231     7     9     6     5    12     5     5
Cuisine            4      2    10     2   253     6     3     2     3     7     3
Real estate        5      2     5     5     5   225     7     4     2     5     5
Science            6      4     3     8     8     7   226     5     7     8     6
Economics          7      5     7     4     5     5     7   243     7     5     3
Medicine           8      4     5     5     6     5     3     4   245     6     4
Sports             9      1     0     2     1     3     3     3     1   273     1
Entertainment     10      7     4     6     9     7     6     7     6     6   213
To train the SVM, the corpus (modeled as the matrix $A_{14275 \times 7842}$) is decomposed by SVD and reduced to k = 200 dimensions. All vectors corresponding to the 7842 documents are projected onto the space of $A_{200}$ by formula (4). The SVM is trained on the same training set that was used to build the decision tree, and the independent validation set is again used to validate it. The validation results are given in Table 4, and the evaluation measures are given in Table 5 for comparison with the decision tree. The SVM in this experiment uses the RBF kernel with parameter C equal to 12 and Gamma equal to $2^{-8}$. The experiment was also run with other values of C and Gamma; the values above were selected by trial and error. Because Gamma is small, an SVM with a linear kernel can also be used. On the same data set, the linear kernel (C=10 and eps=0.01) gives slightly better results than the RBF kernel, but the difference is small. Either the RBF kernel or the linear kernel can therefore be used with the parameters given above.
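One way to read the remark about the small Gamma, assuming the projected vectors are close enough that $\gamma \|x - z\|^2 \ll 1$ (an assumption that the text does not verify), is through a first-order expansion of the RBF kernel:

```latex
K(x, z) = e^{-\gamma \|x - z\|^2}
        \approx 1 - \gamma \|x - z\|^2
        = 1 - \gamma \|x\|^2 - \gamma \|z\|^2 + 2\gamma \, (x \cdot z),
\qquad \gamma = 2^{-8} \approx 0.0039 .
```

Under that assumption the kernel depends on the pair $(x, z)$ mainly through the inner product $x \cdot z$, which is the linear kernel up to scaling and norm-dependent offsets, so similar results for the two kernels are plausible.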
Table 4: Validation results of the SVM classifier (rows: true class; columns: predicted class code)
Class            Code     1     2     3     4     5     6     7     8     9    10
IT (CNTT)          1    265     5     3     1     2     1     3     3     2     1
Telecom (ĐTVT)     2     11   246     3     2     5     4     3     3     2     3
Education          3      3     5   276     2     3     4     1     3     1     1
Cuisine            4      1     1     3   273     1     3     4     2     2     1
Real estate        5      1     3     2     1   249     2     3     2     1     1
Science            6      4     3     7     4     1   251     3     4     2     3
Economics          7      4     7     2     5     5     3   254     3     4     4
Medicine           8      3     3     5     3     1     3     4   258     4     3
Sports             9      2     3     2     2     2     3     1     2   269     2
Entertainment     10      2     3     3     0     2     5     3     3     6   244
From the detailed validation data in Tables 3 and 4, the evaluation measures precision, recall, and F1 can be computed, as shown in Table 5.
Table 5: Comparison of text classification performance between the decision tree and the SVM
                        Decision tree                      SVM
Class             Precision   Recall     F1       Precision   Recall     F1
IT (CNTT)           84.5%     87.4%    85.9%        89.5%     92.7%    91.1%
Telecom (ĐTVT)      81.9%     80.5%    81.2%        88.2%     87.2%    87.7%
Education           83.4%     77.3%    80.2%        90.2%     92.3%    91.2%
Cuisine             83.8%     86.9%    85.3%        93.2%     93.8%    93.5%
Real estate         81.5%     84.9%    83.2%        91.9%     94.0%    92.9%
Science             84.3%     80.1%    82.2%        90.0%     89.0%    89.5%
Economics           86.2%     83.5%    84.8%        91.0%     87.3%    89.1%
Medicine            84.9%     89.9%    87.3%        91.2%     89.9%    90.5%
Sports              84.3%     94.8%    89.2%        91.8%     93.4%    92.6%
Entertainment       85.5%     78.6%    81.9%        92.8%     90.0%    91.4%
Average F1                             84.1%                           91.0%
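The step from a confusion matrix to per-class precision, recall, and F1 can be sketched as below, using the decision tree counts of Table 3 (rows taken as the true class, columns as the predicted class, which is consistent with the row totals matching the validation counts of Table 2). The names `CONFUSION` and `per_class_metrics` are hypothetical. Most computed values agree with Table 5 after rounding; for the Medicine row the matrix gives roughly 84.2%, 85.4%, and 84.8%, which differs from the printed row, so one of the two tables may contain a transcription slip.

```python
CLASSES = ["IT", "Telecom", "Education", "Cuisine", "Real estate",
           "Science", "Economics", "Medicine", "Sports", "Entertainment"]

# Decision tree counts from Table 3: rows = true class, columns = predicted class.
CONFUSION = [
    [250, 6, 8, 3, 4, 3, 2, 3, 3, 4],
    [12, 227, 6, 5, 5, 4, 7, 5, 6, 5],
    [9, 10, 231, 7, 9, 6, 5, 12, 5, 5],
    [2, 10, 2, 253, 6, 3, 2, 3, 7, 3],
    [2, 5, 5, 5, 225, 7, 4, 2, 5, 5],
    [4, 3, 8, 8, 7, 226, 5, 7, 8, 6],
    [5, 7, 4, 5, 5, 7, 243, 7, 5, 3],
    [4, 5, 5, 6, 5, 3, 4, 245, 6, 4],
    [1, 0, 2, 1, 3, 3, 3, 1, 273, 1],
    [7, 4, 6, 9, 7, 6, 7, 6, 6, 213],
]

def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1), one tuple per class."""
    n = len(matrix)
    results = []
    for i in range(n):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                       # true class i, predicted otherwise
        fp = sum(matrix[r][i] for r in range(n)) - tp  # predicted i, true class differs
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results

metrics = per_class_metrics(CONFUSION)
for name, (p, r, f1) in zip(CLASSES, metrics):
    print(f"{name:<14}{p:8.1%}{r:8.1%}{f1:8.1%}")
```

Each class is evaluated one-versus-rest, which is how the two-class definitions of TP, FP, and FN extend to the ten topics.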
Thus the SVM combined with singular value decomposition, which reduces the dimensionality of the feature space, classifies documents better than the decision tree method. We also tried the SVM on the original feature space, without dimensionality reduction. In that case the results are lower (the average F1 obtained in the experiment is 85.2%), only roughly equal to the performance of the decision tree shown in Table 5. Singular value decomposition and the reduction of the feature space therefore help raise the accuracy of the SVM by removing noise and strengthening the semantic relations between the words in the feature space.
6 CONCLUSION
In this paper we presented a text classification method based on the SVM. Our contribution is the proposal to use singular value decomposition (SVD) to reduce the dimensionality of the feature space. We validated the proposal on 2842 documents, independent of the training set and covering 10 topics, with the SVM implementation in the Weka software. The results show that using SVD to analyze and reduce the dimensionality of the feature space improves the performance of SVM classification. The experiments compared SVM classification with decision tree classification and show that the SVM is indeed better than the decision tree when the dimensionality of the feature space is reduced appropriately. Feature reduction also lowers storage requirements and speeds up classification, because the reduced feature space has far fewer dimensions than the original one.
The experimental validation, carried out on samples independent of those used to build the classifier, shows that the performance of the SVM on text classification is stable and is not the result of overfitting. Using singular value decomposition to reduce the dimensionality of the feature space is well suited to text classification, a problem whose feature space is large and noisy.
The results of this study can be applied to other classification and recognition problems, such as handwriting recognition and image recognition (faces, fingerprints). These problems are essentially no different from text classification, because the processing pipeline and methods are similar: feature extraction, feature selection, learning, and classification. We will continue to study feature selection with SVD and hope to improve the performance of image recognition in general and handwriting recognition in particular.
REFERENCES
1. Nguyễn Linh Giang, Nguyễn Mạnh Hiển, Phân loại văn bản tiếng Việt với bộ phân loại vectơ hỗ trợ SVM. Tạp chí CNTT&TT, June 2006.
2. Nguyễn Ngọc Bình, Dùng lý thuyết tập thô và các kỹ thuật khác để phân loại, phân cụm văn bản tiếng Việt, Kỷ yếu hội thảo ICT.rda'04. Hà Nội, 2004.
3. Nguyễn Linh Giang, Nguyễn Duy Hải, Mô hình thống kê hình vị tiếng Việt và ứng dụng, Chuyên san "Các công trình nghiên cứu, triển khai Công nghệ Thông tin và Viễn thông", Tạp chí Bưu chính Viễn thông, no. 1, July 1999, pp. 61-67.
4. Huỳnh Quyết Thắng, Đinh Thị Thu Phương, Tiếp cận phương pháp học không giám sát trong học có giám sát với bài toán phân lớp văn bản tiếng Việt và đề xuất cải tiến công thức tính độ liên quan giữa hai văn bản trong mô hình vectơ, Kỷ yếu Hội thảo ICT.rda'04, pp. 251-261, Hà Nội, 2005.
5. Đỗ Phúc, Nghiên cứu ứng dụng tập phổ biến và luật kết hợp vào bài toán phân loại văn bản tiếng Việt có xem xét ngữ nghĩa, Tạp chí phát triển KH&CN, vol. 9, no. 2, pp. 23-32, 2006.
6. Chih-Hao Tsai, MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm. http://technology.chtsai.org/MMSEG/, 2000.
7. Keh-Jiann Chen, Shing-Huan Liu, Word Identification for Mandarin Chinese Sentences, Proceedings of COLING 92, Nantes, pp. 23-28, 1992.
8. J. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
9. Đỗ Thanh Nghị, Khai mỏ dữ liệu minh họa bằng ngôn ngữ R (chapter 4), NXB Đại học Cần Thơ, 2010.
10. M.W. Berry, Z. Drmac, E.R. Jessup, Matrices, Vector Spaces and Information Retrieval, SIAM Review (Society for Industrial and Applied Mathematics), Vol. 41, No. 2, 1999, pp. 335-362.
11. T. Letsche, M. Berry, Large-scale Information Retrieval with Latent Semantic Analysis. SIGIR 2001, pp. 19-25.
12. Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In European Conference on Machine Learning (ECML), 1998.
13. V. Vapnik, The Nature of Statistical Learning Theory. Springer, New York, 1995.
14. Weka, http://www.cs.waikato.ac.nz/ml/weka/
