
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Trần Thị Ngân

VIETNAMESE MEDICAL INFORMATION EXTRACTION FOR SEMANTIC SEARCH

UNDERGRADUATE THESIS, FULL-TIME PROGRAM
Major: Information Technology

HANOI - 2009

Supervisor: Assoc. Prof. Dr. Hà Quang Thụy
Co-supervisor: MSc. Nguyễn Cẩm Tú

ACKNOWLEDGMENTS

First of all, I would like to express my deepest gratitude to Assoc. Prof. Dr. Hà Quang Thụy and MSc. Nguyễn Cẩm Tú, who guided me wholeheartedly throughout the work on this thesis. I met many difficulties during the research, but thanks to their dedicated guidance I gradually overcame them and completed the thesis.

I am grateful to the lecturers of the University of Engineering and Technology, whose teaching gave me the valuable knowledge that underlies this thesis and that will support my future research and work.

I would like to thank the senior members of the Lab for their valuable and useful advice while I was working on the thesis. I also thank my dear friends, especially my dormitory roommates, who stood by me, encouraged me, and helped me to complete the thesis and to get through many difficulties in life.

Finally, my deepest thanks go to my family, my father, mother, older sister, and younger sibling, for their love and timely encouragement, which helped me overcome difficulties and complete this thesis.

ABSTRACT

Extracting medical information in order to build a good, sufficiently complete data set that supports semantic search is an essential need that has received particular attention recently. An ontology represents the concepts, properties, and relations of an application domain in a consistent and rich way. Building an information extraction system on top of a Vietnamese medical ontology makes searching and mining data in this domain more effective.

This thesis addresses the construction of an information extraction system based on an ontology for the Vietnamese medical domain. It analyzes several ontology construction methods and tools in order to choose a model, and builds a Vietnamese medical ontology with 21 entity classes, 13 relations, and more than 500 instances of the entity classes. The thesis annotates 96 data files containing more than 1500 instances. The experimental entity recognition system is feasible, with an average F1 measure of about 64% over 10 experimental runs.


TABLE OF CONTENTS

Introduction
Chapter 1. OVERVIEW OF SEMANTIC SEARCH
  1.1. The need for semantic search
  1.2. Foundations of semantic search
    1.2.1. The Semantic Web
    1.2.2. Ontology
  1.3. Architecture of a semantic search engine
  1.4. Information extraction
Chapter 2. BUILDING A VIETNAMESE MEDICAL ONTOLOGY
  2.1. Introduction to ontologies
    2.1.1. The concept of an ontology
    2.1.2. Components of an ontology
    2.1.3. Related work on ontology construction
  2.2. Theory of ontology construction
    2.2.1. Ontology construction methods
    2.2.2. Ontology construction tools
    2.2.3. Ontology languages
  2.3. Building the Vietnamese medical ontology
Chapter 3. ENTITY RECOGNITION
  3.1. Introduction to the entity recognition problem
    3.1.1. General introduction to entity recognition
    3.1.2. Some research results on entity recognition
  3.2. Characteristics of Vietnamese data
    3.2.1. Phonetic characteristics
    3.2.2. Lexical characteristics
    3.2.3. Grammatical characteristics
  3.3. Some entity recognition methods
    3.3.1. Rule-based and semi-supervised methods
    3.3.2. Finite-state machine methods
    3.3.3. Gazetteer-based methods
  3.4. Vietnamese medical entity recognition
    3.4.1. Vietnamese entity recognition
    3.4.2. Vietnamese medical entity recognition
Chapter 4. IDENTIFYING SEMANTIC RELATIONS
  4.1. Overview of semantic relation identification
    4.1.1. Overview of semantic relations
    4.1.2. Semantic relation extraction
    4.1.3. Related research on semantic relation identification
  4.2. Semantic labeling of sentences
    4.3.1. Classification for relation identification and entity recognition
    4.3.2. The SVM (Support Vector Machine) algorithm
    4.3.3. Multi-class classification with SVM
    4.3.4. Applying SVM to semantic relation classification in the Vietnamese medical domain
Chapter 5. EXPERIMENTS
  5.1. Experimental environment
    5.1.1. Hardware
    5.1.2. Software
    5.1.3. Test data
  5.2. Building the ontology
    5.2.1. Entity class hierarchy
    5.2.2. Relations between entity classes
  5.3. Data annotation
  5.4. Entity recognition
    5.4.1. Building the gazetteer set
    5.4.2. Evaluating the entity recognition system
    5.4.3. Results
    5.4.4. Remarks and evaluation
  5.5. Semantic labeling of sentences
APPENDIX - SOME ENGLISH-VIETNAMESE TERMS
CONCLUSION

LIST OF TABLES

Table 1: Explanation of the semantic relations
Table 2: Number of instances of each entity class in the gazetteer data set
Table 3: Values used to evaluate an entity-type recognition system
Table 4: Results of 10 entity recognition runs
Table 5: Examples of sentences labeled with relations

LIST OF FIGURES

Figure 1: Example of the Semantic Web
Figure 2: Architecture of a semantic search engine
Figure 3: Illustration of an information extraction system
Figure 4: The meaning of an ontology
Figure 5: Hierarchical structure of the BioCaster ontology
Figure 6: Some gazetteer files built for the entity recognition problem
Figure 7: A semantic relation for the entity "car"
Figure 8: Illustration of semantic relation extraction
Figure 9: The place of semantic relation mining in natural language processing
Figure 10: Semantic relations given in WordNet
Figure 11: Some of the semantic relations that were built
Figure 12: General task of the relation identification problem
Figure 13: Components of the SR semantic analyzer [24]
Figure 14: Framework for resolving proper names across documents
Figure 15: Some semantic labels assigned to sentences [30]
Figure 16: Semantic labeling of sentences describing President Bill Clinton [30]
Figure 17: Stages of the classification process
Figure 18: Partition of documents by the sign of the function f(d)
Figure 19: Learning process of the classifier for relation-bearing sentences [2]
Figure 20: Classes in the constructed ontology
Figure 21: Layered structure of the constructed ontology
Figure 22: Instances of entity classes and relations between instances
Figure 23: A data item annotated with the ontology
Figure 24: Files containing entities in the constructed gazetteer set
Figure 25: Results of 10 entity recognition runs


Introduction

Health care has always been an essential human need, and so searching the Internet for medical information is an essential need as well. The issue deserves particular attention now that people face many infectious epidemics; a typical example is influenza A (H1N1), which is spreading and has tended to increase recently. Online resources keep appearing and growing, and exploiting them effectively to deliver useful knowledge to users contributes to public health education and improvement.

The explosion of medical resources, especially online health information, the large number of websites, the amount of redundant information, and the free-form (unstructured or semi-structured) organization of that information make it hard for users to follow and grasp the most up-to-date information. In addition, traditional information retrieval technology either returns too few results, because natural language expression is rich and complex, or too many, in the sense that the searcher wants the underlying knowledge rather than merely the documents that contain the search keywords.

Making the best use of these rich resources has therefore become an important topic that has attracted many researchers over the last two decades. Many works aim to extract structured information from these resources in order to build knowledge bases for organizing, searching, querying, managing, and analyzing information. Many tasks have been posed in medical information extraction, such as BioCreative-I (recognizing gene and protein names in text) [32], LLL05 (extracting information about genes) [33], and BioCreative-II (extracting protein interaction relations) [49]. These tasks were set up to evaluate strategies for mining medical data, and they focus on two subproblems: entity recognition and relation extraction. Entity recognition requires identifying basic elements such as drug names, disease names, symptoms, genes, and proteins in text. Identifying a relation, given a predefined pattern, means recognizing an occurrence of that relation in text, for example the relation <gây_ra> ("causes") between a specific disease and a specific virus.

An ontology is one of the most consistent and expressive ways to represent such patterns of concepts and relations. Building an ontology for medicine in Vietnamese provides a basis for searching and mining this kind of information effectively. A survey of available data shows that Vietnamese medical ontologies hardly exist in Vietnam at present. Some research groups have, however, built ontologies for other specific domains and for various purposes. One example is the VN-KIM ontology [34], developed at the University of Technology, Vietnam National University, Ho Chi Minh City. It contains 347 entity classes and 114 relations and properties. The VN-KIM ontology covers commonly named entity classes such as Con_người (person), Tổ_chức (organization), Tỉnh (province), and Thành_phố (city), the relations between entity classes, and the properties of each class.

Many methods have been proposed for building an information extraction system and a semantic network and then applying them to semantic search. This thesis presents an ontology-based representation, one of the approaches in fairly wide use today. It presents several methods for building an ontology and extending it automatically, and introduces the entity recognition and relation classification problems using several different methods. The thesis also builds a medical data set that supports more effective recognition of entities and relations.

Chapter 1
OVERVIEW OF SEMANTIC SEARCH

1.1. The need for semantic search

The explosion of online information on the Internet and the World Wide Web has produced an enormous volume of information, and the challenge is how to mine all of it effectively for the benefit of human life. Search engines such as Google and Yahoo were created to help users find and use information. Although the quality and quantity of their results keep improving, the results are still simply lists of documents that contain the words of the query. The information in these results can be understood only by people; computers cannot "understand" it, which makes further processing of the retrieved information difficult.

The appearance of entity search engines (the Cazoodle system at http://www.cazoodle.com/, the Arnetminer system at http://www.arnetminer.org/, and others) marked a new step in the development of search engines. Interest in knowledge search grew further with the arrival of the Wolfram semantic search engine, built and developed by Wolfram Research, Inc. on the proposal of Stephen Wolfram [35].

The Semantic Web, initiated by the W3C (World Wide Web Consortium), opened a new stage in Web technology: information on the Semantic Web is fully structured and carries semantics that computers can "understand". Such information can be reused without preprocessing steps. Searching the Semantic Web with conventional search engines (Google, Yahoo, and so on) does not exploit these advantages, and the results are no better. In other words, for current search engines the Semantic Web and the ordinary Web are the same thing.

A semantic search system is therefore needed: one that searches the Semantic Web, or a semantic knowledge network, and returns fully structured information that computers can "understand", so that using and processing the information becomes easier [6][26][2]. Moreover, building a concrete semantic search system lays the groundwork for automatic question answering systems in specific domains such as health care and culture, which is of practical value in everyday life.

1.2. Foundations of semantic search

1.2.1. The Semantic Web

According to Tim Berners-Lee, the Semantic Web is an extension of the current World Wide Web that contains well-defined information so that people and computers can work together more effectively. Its goal is to build on common standards and technologies that let computers understand more of the information contained in Web pages, in order to better support people in data mining, information aggregation, and the construction of other automated systems.

Unlike the conventional Web, whose content consists only of text resources, links, images, and video, the Semantic Web can include more abstract information resources such as places, people, organizations, and even real-life events. In addition, links in the Semantic Web are not just hyperlinks between resources; they include many other kinds of links and relations. These characteristics make Semantic Web content more diverse, detailed, and complete. At the same time, the pieces of information in the Semantic Web are tightly connected to one another, which makes it easier for users to use and search for information. This is the greatest advantage of the Semantic Web over conventional Web technology [2].

Figure 1. Example of the Semantic Web [6]

Figure 1 is an example of a Semantic Web page that holds information about a person named Yo-Yo Ma. The page is structured as a directed graph with labeled edges. Each vertex describes a resource contained in the page. Each edge represents a link between resources (also called a property of the resource), and the label on the edge (called its weight in the original figure) gives the name of that link, that is, the name of the property. Specifically, Yo-Yo Ma has a date of birth of 10/07/55 and a place of birth of Paris, France, and Paris, France has a temperature of 62 F.
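The graph reading of Figure 1 can be sketched as a small set of (resource, property, value) triples. This is only an illustrative sketch: the property names `birthDate`, `birthPlace`, and `temperature` and both helper functions are assumptions made here, not identifiers from the figure or from any Semantic Web library.

```python
# Hypothetical encoding of the Figure 1 graph as (subject, property, value)
# triples. Property names are illustrative assumptions.
triples = [
    ("Yo-Yo Ma", "birthDate", "10/07/55"),
    ("Yo-Yo Ma", "birthPlace", "Paris, France"),
    ("Paris, France", "temperature", "62 F"),
]


def properties_of(resource, graph):
    """Return {property name: value} for every edge leaving `resource`."""
    return {prop: value for subj, prop, value in graph if subj == resource}


def follow(resource, path, graph):
    """Follow a chain of property names starting from `resource`.

    Returns None as soon as a property on the path is missing.
    """
    current = resource
    for prop in path:
        current = properties_of(current, graph).get(prop)
        if current is None:
            return None
    return current


print(properties_of("Yo-Yo Ma", triples))
print(follow("Yo-Yo Ma", ["birthPlace", "temperature"], triples))
```

Because a property value can itself be a resource, the second call walks two edges: from the person to the birthplace, and from the birthplace to its temperature.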

Thus each resource described in the Semantic Web is an object. The object has a name, properties, property values (a value may itself be another object), and links to other resources (objects), if any. Building a Semantic Web page requires a sufficiently complete data set; in other words, a set of objects that describe resources must be built. Related objects form a wide network of links called a semantic network. Because a semantic network is shared widely, its objects must be described according to a common standard. Ontologies are used to describe the objects and resources of the Semantic Web [2].

1.2.2. Ontology

Put simply, an ontology is a data model that represents a set of concepts in a domain and the relations between those concepts. It is used to reason about the objects in that domain [12]. An ontology is one of the most consistent and expressive ways to represent patterns of concepts and relations. For that reason it is used to build a semantic network from raw (unstructured or semi-structured) data, which provides the foundation for an effective semantic search engine. Ontologies are introduced in more detail in Chapter 2.

1.3. Architecture of a semantic search engine

In essence, a semantic search engine is structured like a conventional search engine and likewise consists of two main parts [2]:

- The user interface (front end), which has two main functions:
  - a query interface that lets the user enter a question or query;
  - display of the answers and results.
- The internal architecture (back end), the core of the search engine, which has three main components:
  - question analysis;
  - search for results for the query or question;
  - the document collection, search data, and semantic network.

The architecture of a semantic search engine is shown in Figure 2.

[Figure 2 diagram. Labels: Search Services; 1. Query input; 2. Question classification; 3. Question transformation; 4. Information extraction; 5. Search; results returned; Semantic Web/Ontology; semantic network]

Figure 2. Architecture of a semantic search engine [2]

The structural difference between a semantic search engine and a conventional one lies in the internal architecture, specifically in two components: question analysis and the search data set. Question analysis is covered in detail in [2]. The search data set is the Semantic Web and the semantic network, which are built from an ontology and an information extraction system. This thesis concentrates on building an ontology and extending it automatically through information extraction, specifically entity recognition. It also addresses semantic relation recognition and the classification of relation-bearing sentences, with the aim stated above: to build a sufficiently complete search data set for a future semantic search engine.

1.4. Information extraction

Information extraction is an important area of text mining that extracts structured information from unstructured text. In other words, an information extraction system pulls predefined information about entities and the relations between them out of a natural language text and fills it into a structured record or a predefined template. Information can be extracted from text at several levels: identifying entities (Element Extraction), identifying relations between entities (Relation Extraction), identifying and tracking events and scenarios (Event and Scenario Extraction and Tracking), and resolving co-reference (Co-reference Resolution). The techniques used in information extraction include segmentation, classification, association, and clustering [1].

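Input text:

Acute lung disease is one of the leading causes of death among the elderly and is even more dangerous than influenza-related lung disease. Common symptoms are fatigue, occasional confusion, irregular fever, a frequent and heavy dry cough, and sometimes shortness of breath. Sedatives and cough suppressants must be used with caution; if wheezing is present, it must be established whether it is due to bronchial asthma, in which case corticoid and bronchodilators must be used.

Output of IE:

Disease: acute lung disease
Symptoms: fatigue; confusion; irregular fever; dry cough; shortness of breath
Drugs: sedatives; cough suppressants; corticoid; bronchodilators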

Figure 3. Illustration of an information extraction system

An information extraction system first needs an entity recognition system; only then can relation classification be considered. Recognizing entity types is the simplest of the information extraction problems, yet it is the most basic step before the more complex problems in this field can be tackled. Besides its use in information extraction systems, it can also be applied in information retrieval, machine translation, and question answering.

Many tasks have been posed in medical information extraction, such as BioCreative-I (recognizing gene and protein names in text) [32], LLL05 (extracting information about genes) [33], and BioCreative-II (extracting protein interaction relations) [49]. These tasks were set up to evaluate strategies for mining medical data, and they focus on two subproblems: entity recognition and relation extraction. Entity recognition requires identifying basic elements such as drug names, disease names, symptoms, genes, and proteins in text. Identifying a relation, given a predefined pattern, means recognizing an occurrence of that relation in text, for example the relation <gây_ra> ("causes") between a specific disease and a specific virus.

An ontology is one of the most consistent and expressive ways to represent such patterns of concepts and relations. Building an ontology for medicine in Vietnamese provides a basis for searching and mining this kind of information effectively. Once the ontology has been built, the next task, which is also very important, is to extend it automatically. An information extraction system (covering entity recognition, relation extraction, and so on) is the prerequisite for extending the ontology automatically.
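The template filling illustrated in Figure 3 can be sketched with a simple dictionary lookup. This is a minimal sketch under stated assumptions, not the recognizer built in the thesis: the `GAZETTEER` term lists are English glosses of the Figure 3 entities chosen here for illustration, the input is a shortened English rendering of the sample text, and `fill_template` is a hypothetical helper.

```python
import re

# Hypothetical gazetteer: entity type -> known surface forms. These are
# English glosses of the Figure 3 entities; a real system for this task
# would use Vietnamese term lists.
GAZETTEER = {
    "Disease": ["acute lung disease"],
    "Symptom": ["fatigue", "confusion", "irregular fever", "dry cough",
                "shortness of breath"],
    "Drug": ["sedatives", "cough suppressants", "corticoid",
             "bronchodilators"],
}

# Shortened English rendering of the Figure 3 input text (an assumption).
TEXT = (
    "Acute lung disease is one of the leading causes of death among the "
    "elderly. Common symptoms are fatigue, occasional confusion, irregular "
    "fever, a heavy dry cough, and sometimes shortness of breath. "
    "Sedatives and cough suppressants must be used with caution; corticoid "
    "and bronchodilators are used when bronchial asthma is the cause."
)


def fill_template(text, gazetteer):
    """Return {entity type: [gazetteer terms that occur in `text`]}.

    Matching is case-insensitive and restricted to whole words.
    """
    template = {}
    for entity_type, terms in gazetteer.items():
        found = []
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append(term)
        template[entity_type] = found
    return template


for entity_type, terms in fill_template(TEXT, GAZETTEER).items():
    print(entity_type, "->", terms)
```

A lookup of this kind only finds terms that are already listed, which is why entity recognition is treated as the first step and why extending the term lists automatically matters.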

Chapter 2
BUILDING A VIETNAMESE MEDICAL ONTOLOGY

2.1. Introduction to ontologies

2.1.1. The concept of an ontology

In recent years the term "ontology" has moved beyond artificial intelligence laboratories and has become common in many domains. From the viewpoint of artificial intelligence, an ontology is a description of concepts and of the relations between them, intended to express a view of the world. In other scientific application domains, an ontology comprises a basic vocabulary, or a resource for a specific domain, that lets researchers store, manage, and exchange knowledge as conveniently as possible [2].

Many definitions of an ontology exist today, and some of them contradict one another. This thesis presents only one general and fairly widely used definition, given by Kincho H. Law: "An ontology represents a set of concepts (objects) in a specific domain and the relations between these concepts." An ontology thus combines a shared vocabulary with descriptions of the meaning of its terms in a form that computers can understand.

[Figure 4 diagram. Labels: a shared vocabulary; a formal characterization of its meaning; Ontology]

Figure 4. The meaning of an ontology

Figure 4 describes the meaning of an ontology, in which the shared vocabulary consists of the classes and relations. For example, there may be a Vocabulary (...), Categories (Cat, White, Leg, Fish, Animal, ...), Relations (Is-a, Part-of, hasMother, ...), a Characterization (...), and relation instances such as "A cat is an animal" and "A cat has four legs".
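The Figure 4 example can be sketched as a shared vocabulary plus a set of statements that may only use that vocabulary. This is an illustrative sketch: the category and relation names are taken from the figure, while the `FACTS` list and both functions are assumptions introduced here.

```python
# Shared vocabulary, using the category and relation names of Figure 4.
CATEGORIES = {"Cat", "White", "Leg", "Fish", "Animal"}
RELATIONS = {"Is-a", "Part-of", "hasMother"}

# Hypothetical statements in the form (subject, relation, object).
FACTS = [
    ("Cat", "Is-a", "Animal"),   # "A cat is an animal"
    ("Fish", "Is-a", "Animal"),
    ("Leg", "Part-of", "Cat"),   # legs are parts of a cat
]


def check_vocabulary(facts):
    """Return the facts that use a term outside the shared vocabulary."""
    return [
        (subj, rel, obj)
        for subj, rel, obj in facts
        if subj not in CATEGORIES or obj not in CATEGORIES or rel not in RELATIONS
    ]


def related(subject, relation, facts):
    """Return the sorted objects linked to `subject` by `relation`."""
    return sorted(obj for subj, rel, obj in facts if subj == subject and rel == relation)


print(check_vocabulary(FACTS))
print(related("Cat", "Is-a", FACTS))
```

The vocabulary check is what makes the vocabulary "shared" in practice: a statement about a term that nobody declared, such as a `Dog`, is reported instead of being silently accepted.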

Figure 5. Hierarchical structure of the BioCaster ontology [11]

2.1.2. Components of an ontology

The main components of an ontology are classes, properties, and individuals.

A class is a set of individuals; the individuals are described logically in order to define the objects of the class. Classes are arranged in a parent-child hierarchy that acts as a classification of objects. An individual is an instance of a class; it makes the class more concrete and can be understood as some object in the world (England, Manchester United, measles, chickenpox, and so on). A property is a binary relation between individuals, that is, a link between two individuals. For example, the property do_virus ("caused by virus") links a "disease" individual to a "virus" individual.

A property can have four characteristics:
(1) Functional: an individual is related to at most one other individual, for example the property "has flavor" for individuals of the class thức_ăn (food);
(2) Inverse functional: the inverse of the property is functional, for example the property "is the flavor of";
(3) Transitive: if individual a is related to individual b and b is related to c, then a is related to c;
(4) Symmetric: if individual a is related to individual b, then b is related to a.

There are three kinds of property:
(1) Object property: links one individual to another individual;
(2) Datatype property: links an individual to an XML Schema datatype value or an RDF literal;
(3) Annotation property: adds metadata about a class, an individual, or a property of the two kinds above.

Working with Web ontologies requires the Web Ontology Language (OWL). In OWL, annotation properties are used to attach information (metadata, that is, data about data) to classes, individuals, and object or datatype properties.

2.1.3. Related work on ontology construction

Ontologies are now widely used in semantics-related fields such as artificial intelligence (AI), the Semantic Web, and software engineering. Because of these applications, many projects in Vietnam and around the world build ontologies for different data domains and purposes.

In the medical domain, many medical and biological ontologies have been released by the National Center for Biomedical Ontology [52], for example ontologies of cell types, genes, FMA, and human diseases; the list of released ontologies is given in [41]. Another example is the Disease Ontology [42], a medical vocabulary developed at the Bioinformatics Core Facility in collaboration with the NuGene Project at the Center for Genetic Medicine. It was designed to map diseases and associated conditions to specific medical codes such as ICD9CM, SNOMED, and others. The Disease Ontology is also used to link model organism phenotypes to human diseases and to mine medical data. It is implemented as a directed acyclic graph and uses UMLS (Unified Medical Language System) as the vocabulary for accessing other medical ontologies such as ICD9CM. An English ontology that has been cited frequently in the medical field in recent years is GENIA [43].
The main purpose of this ontology is to model cell reactions in humans. It focuses mainly on medical domains and is also used in natural language processing tasks: information retrieval (IR), information extraction, and text classification and summarization. The following figure shows the hierarchical structure of the GENIA ontology.

Many medical ontologies have been built around the world. In Vietnam, however, although semantic search is an active research topic, medical ontologies hardly exist, so users' searches for web pages about drugs and diseases do not yet return complete and effective results. One ontology that does cover Vietnamese medical terms is the BioCaster ontology [44]. It was produced by the BioCaster project, developed at the National Institute of Informatics of Japan in collaboration with universities in Japan, Thailand, Vietnam, and other countries. It is a multilingual ontology covering Japanese, English, Thai, Vietnamese, and other languages. The BioCaster ontology [11] contains terms in many languages, including 371 Vietnamese terms related to diseases, viruses, and symptoms in Vietnam. Although the ontology supports extraction from Vietnamese text, its output is English-language reports about health in Vietnam. Consequently the terms, entities, diseases, and viruses are written in Vietnamese while the relations are described in English. For example, the term Vietnamese_103, labeled "vi rút gây bệnh thủy đậu" (the virus that causes chickenpox), has hasLanguage: vi (Vietnamese) and hasRootTerm: VIRUS_124.

2.2. Theory of ontology construction

2.2.1. Ontology construction methods

Research on the ontology construction process is attracting growing interest, and many groups have proposed different construction methods.

The Uschold & King method grew out of the development of the Enterprise Ontology. It focuses on helping developers derive development directions from the purpose of the ontology, and then on evaluating and documenting the ontology. During construction, existing ontologies can be integrated into the ontology being built. Three approaches are proposed for defining the main concepts of the ontology: top-down, bottom-up, and middle-out. The methodology is application-independent: the purpose for which the ontology is built and the construction process do not depend on each other, so the same method can be used for any application [17].

The next methodology was developed by Gruninger and Fox [16] through the Toronto Virtual Enterprise (TOVE) ontology project. It is rooted in the idea of developing knowledge-based systems using first-order logic. In this method, the most salient concepts are defined first and are then specialized and generalized in the appropriate directions. The method thus starts from a number of salient concepts, moves down to lower-level concepts, and generalizes to higher levels. It uses a middle-out approach to define concepts and depends partly on the later application of the ontology: before building the ontology, the user must decide its purpose and the application into which it will be integrated.

METHONTOLOGY is an ontology construction method developed by the artificial intelligence laboratory of the Polytechnic University of Madrid. It lets users build a new ontology from a new design or reuse existing ontologies. The METHONTOLOGY framework helps users build ontologies at the knowledge level and includes a definition of the ontology development process and a number of techniques used in carrying it out (for example management and scheduling, quality control, data and knowledge acquisition, and configuration management). The methodology uses a middle-out strategy and is application-independent.

2.2.2. Ontology construction tools

Ontology construction and development toolkits consist of supporting tools and environments that help users build a new ontology from a new design or reuse existing ontologies. Earlier development environments include Ontosaurus, Ontolingua, and WebOnto. Newer toolkits that have been widely used recently include OntoEdit, OilEd, WebODE, Chimaera, DAG-Edit, and Protégé.

The Ontolingua server [45] is an ontology construction toolkit developed in the 1990s at the Knowledge Systems Laboratory (KSL) of Stanford University (USA). Its main modules are an ontology editor and other modules such as Webster and an OKBC (Open Knowledge Base Connectivity) server.

Ontosaurus [46] c pht trin cng trong khong thi gian bi Vin Khoa hc Thng tin ISI ca Trng H South Calfornia (M). OntoSaurus bao gm 2 module chnh: ontology server (s dng Loom) v mt web browser cho Loom ontology. Ngoi ra, b cng c cn h tr KIF, KRSS v C++, ng thi OntoSaurus ontology cng c th c truy cp da trn protocol OKBC ca Ontoligua server. WebOnto l mt ontology editor cho cc Ontology OCML (Operational Conceptual Modelling Language), c pht trin bi Vin Truyn thng Tri thc (KMI) ti Trng H m (Open University). B cng c ny l s dng Java vi webserver, cho php ngi dng c th duyt v thay i cc m hnh tri thc thng qua Internet. im mnh chnh ca b cng c ny l c th cho php cng tc gia nhiu ngi nhm thay i v hon thin ontology [26]. Cc b cng c trn (Ontolingua server, Ontosaurus v WebOnto) c xy dng n thun nhm h tr duyt v bin tp cc Ontology c vit bng nhng ngn ng ring (Ontolingua, LOOM v OCML). Nhng b cng c bin tp ny hin nay khng cn p ng nhu cu ca ngi s dng. Th h mi cc b cng c xy dng Ontology c nhiu u vit cng nh tnh nng hn hn cc b cng c ny, v d nh kh nng m rng, h thng kin trc cc thnh phn gip ngi dng c th cung cp thm cc tnh nng cho mi trng pht trin mt cch d dng. WebODE [47] l mt b cng c c kh nng m rng c pht trin bi nhm Ontology ca trng H Technical Madrid (UPM), c xem nh mt thnh cng ca ODE (Ontology Design Environment). WebODE c s dng nh mt Web server vi giao din web. Phn li chnh ca mi trng ny l mt dch v (service) ontology, trong tt c cc dch v v ng dng khc u c th s dng dch v ny. Phn son tho Ontology cng ng thi cung cp cng c kim tra rng buc, to cc lut tin (axiom rule creation) v phn tch vi WebODE Axiom Builder (WAB), ti liu trong HTML, kt hp ontology vi cc nh dng khc nhau [XML\RDF[s], OIL, DAML+OIL, CARIN, Flogic, Java v Jess]. OilED [48] l mt b cng c son tho ontology cho php ngi dng c th xy dng Ontology bng OIL v DAML+OIL, c xy dng bi Trng H Manchester, i hc Amsterdam v Interprice GmbH. Protg 2000 [51] l mt trong nhng b cng c c s dng rng ri nht hin nay, c pht trin bi Trng H Stanford. 
This toolkit was developed with two goals: compatibility with other systems, and ease of use together with support for information extraction tools. The main part of the environment is an ontology editor. In addition, Protégé includes a large number of plugins that support functions such as managing multiple ontologies, inference services, and ontology language import and export.

2.1.3. Ontology languages

At present, typical ontology construction languages (ontology languages) include LOOM, LISP, Ontolingua, XML, SHOE, OIL, DAML+OIL and OWL. Ontology languages fall into three groups: vocabularies defined in natural language, object-based knowledge representation languages such as UML, and languages based on first-order predicate logic such as Description Logics. An ontology language needs to be compatible with other tools, natural and easy to learn, and compatible with current web standards such as XML, XML Schema, RDF and UML. Some web-based languages are described below.

- Extensible Markup Language (XML) is an open W3C standard for representing data; it is more flexible and more powerful than HTML.
- RDF (Resource Description Framework) was developed as a framework for describing and exchanging metadata [12].
- SHOE (Simple HTML Ontology Extensions) was built in 1996 at the University of Maryland as an extension of HTML that incorporates semantic knowledge into existing web documents by annotating HTML pages [27].
- OIL (Ontology Inference Layer) is an extension of RDF developed by the On-To-Knowledge project as a language for describing and exchanging ontologies. It combines a frame-based language with formal semantics and inference services from description logics. The language has three levels: the object level (concrete instances), the first meta level (first-meta, ontological definitions) and the second meta level (second-meta, relations) [8].
- DAML+OIL was developed under a DARPA project in 2000. Both OIL and DAML+OIL can describe concepts, taxonomies, binary relations, functions and instances [9].


- OWL is a widely used ontology language, optimized for data exchange and knowledge sharing. It is used when the information contained in documents needs to be processed by applications. OWL can represent the meaning of the terms in a vocabulary and the relations among those terms. OWL comprises OWL Lite, OWL DL [RDF] and OWL Full.

2.3. Building a Vietnamese medical ontology

Designing and building an ontology involves the following steps:

- Define the classes of the ontology.
- Arrange the classes in a taxonomic hierarchy.
- Define the properties (slots) and describe the values allowed for them.
- Fill in the slot values for the instances.

The knowledge base is then created by defining the instances of these classes together with their values.
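The four steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the ontology built in this thesis: the class names (`Thuc_the`, `Benh`, `Benh_truyen_nhiem`, `Thuoc`), the slot `chua_bang`, and the instances `thuoc_A` and `benh_B` are hypothetical placeholders.

```python
# Steps 1-2: classes arranged in a taxonomy (child -> parent, None marks the root).
# All names here are hypothetical placeholders.
TAXONOMY = {
    "Thuc_the": None,
    "Benh": "Thuc_the",
    "Benh_truyen_nhiem": "Benh",
    "Thuoc": "Thuc_the",
}

# Step 3: slots, each with the class it applies to and the class of its allowed values.
SLOTS = {"chua_bang": {"domain": "Benh", "range": "Thuoc"}}

# Step 4: instances and their slot values form the knowledge base.
INSTANCES = {}


def is_a(cls, ancestor):
    """Return True if cls equals ancestor or lies below it in the taxonomy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = TAXONOMY[cls]
    return False


def add_instance(name, cls, **slot_values):
    """Register an instance after checking each slot's domain and range."""
    if cls not in TAXONOMY:
        raise ValueError(f"unknown class: {cls}")
    for slot, value in slot_values.items():
        spec = SLOTS.get(slot)
        if spec is None or not is_a(cls, spec["domain"]):
            raise ValueError(f"slot {slot} is not allowed on class {cls}")
        target = INSTANCES.get(value)
        if target is None or not is_a(target["class"], spec["range"]):
            raise ValueError(f"{value} is not a valid value for slot {slot}")
    INSTANCES[name] = {"class": cls, "slots": dict(slot_values)}


add_instance("thuoc_A", "Thuoc")
add_instance("benh_B", "Benh_truyen_nhiem", chua_bang="thuoc_A")
print(INSTANCES["benh_B"])
```

The slot check walks up the taxonomy, so a slot declared on a class is also accepted on its subclasses, which is the usual reason for arranging classes hierarchically before defining slots.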

No single method can be called the correct method for building every ontology [18]. The choice of a suitable construction method depends on the purpose and nature of each ontology. After surveying medical data and several ontology development methods, we chose the Protégé OWL environment to build an experimental medical ontology in Vietnamese. After collecting and surveying the data, we listed the important terms so that definitions can be offered to users; a direction for future work is to link automatically to the definitions available on Wikipedia. Starting from these terms, the next step is to define their properties. Building an ontology is an iterative process that begins with defining the concepts in the class hierarchy and describing the properties of those concepts.


Chapter 3 ENTITY RECOGNITION

3.1. Introduction to the entity recognition problem

3.1.1. General introduction to entity recognition

Entity recognition can be understood simply as classifying the words of a document into predefined entity classes such as person (PER), organization (ORG), location (LOC), disease (BENH), symptom (TCHUNG) and drug (THUOC). Entity recognition gives us a shallow analysis in which the entities answer important questions, which makes it applicable to question answering systems. Many methods have been used for entity recognition, from hand-crafted methods to machine learning methods such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), Conditional Random Fields (CRF) and Support Vector Machines (SVM).

A typical example of the hand-crafted approach is the Proteus entity recognition system of New York University, which took part in MUC-6. The system is written in Lisp and relies on a large number of rules. However, most of the rules have many exceptions, some of which only appear once the system is put into use, and it is hard to handle all of them. Below are some examples of rules used by Proteus together with their exceptions [1]:

- Rule: Title Capitalized_Word => Title Person Name
  Correct cases: Mr. Johns, Gen. Schwarzkopf
  Exception: Mrs. Fields Cookies (a company).
- Rule: Month_name number_less_than_32 => Date
  Correct cases: February 28, July 15
  Exception: Long March 3 (the name of a Chinese rocket).

Hand-crafted methods cost time and effort and still do not deliver the desired results, so machine learning methods are currently receiving more research attention. Most methods have their own strengths as well as limitations that stem from the characteristics of each model. Typical examples are Hidden Markov Models and their improved variants such as MEMM and CRF; in these models each state corresponds to one of the entity labels and the observations are the words of the sentence under consideration. Support Vector Machines (SVM) are another machine learning method that gives very promising results.

3.1.2. Some research results on entity recognition

Worldwide, entity recognition has been studied for a long time and has achieved impressive results. Many methods, from hand-crafted to machine learning, have been applied to this problem. In his 2007 work [5], David Nadeau reviewed a number of representative earlier studies related to entity recognition. His review is summarized below.

A typical example of the hand-crafted approach is the Proteus system of New York University, which took part in MUC-6; it is written in Lisp and relies on a large number of rules. In 1998, Radev studied the recognition of descriptions of a given entity; for instance, Bill Clinton may be described as "the President of the U.S.", "the democratic presidential candidate" or "an Arkansas native". The systems of Fung (1995) and Huang (2005) address the translation of entities from one language into another (for example, the Vietnamese translation of the entity "College of Technology" is "Trường Đại học Công nghệ"). This system is reported to make fewer than 10% translation errors. Next, in 2001, Charniak and colleagues published research on recognizing the structure of the parts of a person name; for example, the phrase "Doctor Paul R. Smith" is split into a title, a family name, a middle name and a given name. This is an important preprocessing step in an entity recognizer, making it possible to determine that "John F. Kennedy" and "President Kennedy" are the same person. Also in 2001, the record linkage system of Cohen and Richman was built to find all forms of the same entity across an entire database. In 2002, Dimitrov and colleagues addressed pronoun resolution; for example, in the sentence "Rabi finished reading the book and he replaced it in the library", the pronoun "he" stands for Rabi.
This research has many practical applications, for example in automatic question answering systems. In 2003, Mann and Yarowsky built a system that resolves ambiguity in person names; the technique is used to build biographies, the foundation of search engines such as Zoominfo.com and Spock.com. In 2005, Nadeau and Turney published research on recognizing the full form of abbreviations in a given document, for example IBM as the abbreviation of International Business Machines. A 2006 study by Agbago aimed to build a system that restores the correct case of words, including ensuring that the first character of a sentence and of an entity is capitalized, which is very useful in machine translation.

In the same work [5], David Nadeau used the ENAMEX entity label set in the style of MUC-7 (Message Understanding Conference 7) and ran training and testing on the Medstract Gold Standard Evaluation Corpus (built by Pustejovsky in 2001). The author used the Weka machine learning toolkit to test several supervised learning algorithms and concluded that the quality of the system depends heavily on the algorithm used and that his semi-supervised learning method gave the most promising results.

To date, a number of major international conferences have discussed entity recognition and evaluated the entity recognition systems that have been built. Notable examples are MUC (Message Understanding Conference, 1987-1997), MET (Multilingual Entity Task Conference, 1998), ACE (Automatic Content Extraction Program, 2000), HAREM (Evaluation contest for named entity recognizers in Portuguese, 2004-2006) and IREX (Information Retrieval and Extraction Exercise, 1998-1999).

3.2. Characteristics of Vietnamese data

Vietnamese is an isolating language: each "tiếng" (syllable) is pronounced separately and written as one written unit. This characteristic shows clearly in phonetics, vocabulary and grammar. The following presents some characteristics of Vietnamese as described by authors at the Vietnam Linguistics Center. Studying these characteristics gives the author an overview of the features of Vietnamese data. A clearer understanding of the data makes ontology construction and information extraction more effective.

3.2.1. Phonetic characteristics

Vietnamese has a special unit called the "tiếng"; phonetically, each tiếng is a syllable. The Vietnamese phoneme system is rich and balanced, which gives Vietnamese phonetics great potential for expressing meaningful units. Many ideophones and onomatopoeic words have a distinctive evocative value. When forming sentences and utterances, Vietnamese speakers pay close attention to phonetic harmony and to the rhythm of the sentence.

3.2.2. Lexical characteristics

In general, each tiếng is a meaningful element. The tiếng is the basic unit of the system of meaningful units in Vietnamese. From tiếng, other lexical units are created to name things, phenomena and so on, mainly through compounding and reduplication.

Lexical units formed by compounding always obey rules of semantic combination, for example: đất nước, máy bay, nhà lầu xe hơi, nhà tan cửa nát... At present this is the main way of producing lexical units. Following this method, Vietnamese makes full use of native word-forming elements or elements borrowed from other languages to create new words and phrases, for example tiếp thị, karaoke, thư điện tử (e-mail), thư thoại (voice mail), phiên bản (version), xa lộ thông tin, siêu liên kết văn bản, truy cập ngẫu nhiên, etc.

Lexical units formed by reduplication are governed mainly by rules of phonetic combination, for example chôm chỉa, chỏng chơ, đỏng đa đỏng đảnh, thơ thẩn, lúng lá lúng liếng, etc.

The core vocabulary of Vietnamese consists largely of monosyllabic words (one syllable, one tiếng). Flexibility in use and the ease of creating new words and phrases have favored the growth of the vocabulary, making it both rich in quantity and diverse in use. The same thing, phenomenon, activity or feature can be expressed by many different words. The potential of the Vietnamese vocabulary is exploited to a high degree in the functional styles of the language, especially in literary style. Today, with the rapid development of science and technology, especially information technology, this potential is being exploited even more strongly.

3.2.3. Grammatical characteristics

Vietnamese words do not inflect. This characteristic governs the other grammatical characteristics. When words combine into structures such as phrases and sentences, Vietnamese relies heavily on word order and function words.

Arranging words in a particular order is the main way of expressing syntactic relations. In Vietnamese, "Anh ta lại đến" differs from "Lại đến anh ta". When words of the same class combine in a head-modifier relation, the word that comes first is the head and the word that follows is the modifier. Because of word order, "củ cải" differs from "cải củ" and "tình cảm" differs from "cảm tình". Subject before predicate is the common order of Vietnamese sentence structure.

Function words are also a principal grammatical device of Vietnamese. Because of function words, the combination "anh của em" differs from "anh và em" and "anh vì em". Together with word order, function words let Vietnamese form many sentences that carry the same basic message but differ in expressive nuance. For example, compare the following sentences:

- Ông ấy không hút thuốc.
- Thuốc, ông ấy không hút.
- Thuốc, ông ấy cũng không hút.

Besides word order and function words, Vietnamese also uses intonation. Intonation plays a role in expressing the syntactic relations among the elements of a sentence and thereby conveys the intended message. In writing, intonation is usually expressed by punctuation. The difference in message can be seen by comparing the following two sentences:

- Đêm hôm qua, cầu gãy.
- Đêm hôm, qua cầu gãy.

From the salient characteristics above, we can form some idea of the identity and potential of Vietnamese, as well as of the difficulties encountered in entity recognition and information extraction for Vietnamese.

3.3. Some entity recognition methods

Many methods have been proposed for entity recognition. However, the main stages of the task can be summarized as follows:

- Preprocessing: remove HTML, split sentences, segment words.
- Feature selection: choose the tags and the context patterns (features: upper case, lower case, ...).
- Training and self-learning stage: use HMM, CRF, MEMM, SVM.
- Labeling and recovery.

The choice of tags depends on the domain of the entity recognition task. The seven most basic and general label types, usually chosen first (following Ralph & Beth, [5]), are ORG (organization), LOC (location), PER (person), DATE, TIME, CUR (currency) and PCT (percentage). The label set can be changed and extended depending on the project. The Biocaster project [11] defines 22 labels for the medical domain. Each assigned label consists of three parts:

- Boundary category: identifies the position of the current word within an entity.
- Entity category: identifies the entity type.
- Feature set: identifies the context information (context patterns).
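The boundary and entity parts of a label can be combined mechanically once the entity spans are known. The sketch below uses the B_/I_/O prefixes that the next paragraph describes; the function name `bio_tags`, the syllable-level tokenization and the example sentence are assumptions made for illustration only.

```python
def bio_tags(tokens, spans):
    """Tag tokens with B_/I_/O labels.

    spans is a list of (start, end_exclusive, entity_type) token ranges.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = "B_" + entity_type          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I_" + entity_type          # tokens inside the entity
    return tags


# Hypothetical sentence, split by syllable; "viêm xoang" is marked as a disease (DIS).
tokens = ["bệnh", "nhân", "bị", "viêm", "xoang", "nặng"]
spans = [(3, 5, "DIS")]
for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(token, tag)
```

Keeping the entity type in the tag suffix means a single sequence labeler can predict both the boundary and the type of each entity at once.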

There are several ways to represent the boundary part of words. The most commonly mentioned and used representation gives each label a prefix: B_ (beginning of an entity), I_ (inside an entity), or the label O (not an entity). For example, "bệnh viêm não nhật bản" can be labeled B_DIS I_DIS I_DIS I_DIS.

Choosing context patterns is an important problem that determines the accuracy of entity recognition. The context pattern at any observed position gives us the context information. Any complete entity recognition system must build an accurate set of context patterns that describes the specific domain of the recognition task. For general entity recognition these are: upper case, lower case, the % character, digits, period, comma, and so on. The analogous problem in the medical domain is choosing context patterns for recognizing proteins, genes, drugs, cells and so on. The types of context patterns are [6]:

- Basic predefined patterns (upper case, lower case, period, comma): comma, dot, oneDigit, AllDigits...
- Morphological patterns: prefixes, suffixes (~virus, ~lipid, ~vitamin, ...).
- Grammatical patterns: verb phrases, noun phrases...
- Semantic trigger patterns:
  o Head-noun trigger: the head noun of a word group ("B Cell" in "activated human B cells", "bệnh" in "bệnh viêm xoang"...).
  o Special-verb trigger: nhiễm (infect), lây (transmit), bao gồm (include), gây ra (cause)...

3.3.1. Rule-based, semi-supervised methods

A rule-based system consists of a set of basic (If-Then) rules, a set of facts, and an interpreter that uses the rules to derive facts. With a rule-based method, we first build an initial set of rules and entities. Through semi-supervised learning and the bootstrapping technique, we then extend the initial sets of entities and rules.

Semi-supervised learning [28] is a machine learning method that uses both labeled and unlabeled data for training. It combines the advantages and reduces the drawbacks of supervised and unsupervised learning. The main task of semi-supervised algorithms is to expand a small initial training set into a larger one.

A key technique of semi-supervised learning is bootstrapping. It involves a small degree of supervision: training starts from an initial data set (also called the seed set). For example, a disease-name recognition system initially requires a small sample of disease names. The system then searches for sentences containing these disease names and tries to find context information shared by several disease names in the set (for example, similar context information within each group of 5 sample disease names). From this context information, the system then finds instances of disease names that appear in similar contexts. This training process is repeated to find new examples and to discover new relevant context information. By repeating the process, a large number of disease names and a large amount of context information are collected.

3.3.2. Finite state machine methods

Finite state machine methods share the general scheme of a finite state machine (FSM) or finite state automaton (FSA). A finite state machine can be regarded as an abstract machine, used in the study of computation and languages, with a finite, fixed number of states. It is represented as a directed graph with a finite number of nodes (the states), where each node has zero or more arcs (transitions) leading to other nodes. Given an input string, the task is to determine the matching sequence of transitions. Several kinds of finite state machine exist. An acceptor answers "yes or no" as to whether it accepts the input string. A recognizer classifies the input string. A transducer produces an output string corresponding to the input string. The finite state machine model applied in information extraction is a transducer: for an input text string, the system outputs a string of features corresponding to the keywords in that text.

Under another classification, there are two kinds of finite state machine: deterministic (Deterministic Finite Automaton, DFA) and non-deterministic (Non-deterministic Finite Automaton, NFA).

A finite state machine consists of:

- An alphabet Σ.
- A set of states S, where
  o for a DFA: there is one start state and zero or more accepting (final) states;
  o for an NFA: there are one or more start states and zero or more accepting (final) states.
- A transition function T : S × Σ → S.
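A deterministic machine with these three components can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `run_dfa`, the two-symbol alphabet and the state names `q0` and `q1` are hypothetical, and the toy machine (accepting binary strings that end in "1") is not taken from the thesis. The function also returns the sequence of visited states, which the text treats as the output string of the machine.

```python
def run_dfa(transitions, start, accepting, text):
    """Run a DFA over text.

    transitions maps (state, symbol) -> next state, i.e. T : S x Sigma -> S.
    Returns (accepted, visited_states).
    """
    state = start
    visited = [state]
    for symbol in text:
        key = (state, symbol)
        if key not in transitions:      # symbol outside the alphabet: reject
            return False, visited
        state = transitions[key]
        visited.append(state)
    return state in accepting, visited


# Toy DFA over the alphabet {"0", "1"}: accepts strings that end in "1".
T = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q0", ("q1", "1"): "q1",
}

print(run_dfa(T, "q0", {"q1"}, "011"))
print(run_dfa(T, "q0", {"q1"}, "10"))
```

Storing T as a dictionary keyed by (state, symbol) keeps the machine deterministic by construction, since each pair can map to only one next state.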

The operation of the machine is as follows. Starting from the start state (or set of start states), the machine examines each character of the input string, drawn from the alphabet Σ, in turn and uses the transition function T to move to the next state until every character of the string has been examined. If it ends in an accepting state, the run succeeds. In that case, the sequence of states visited while processing the input string is taken as the output string, also called the label string matching the input string.

The finite state machine model applied in information extraction is extended with some additional elements, mainly concerning the transition function T, which is usually described as a Markov process.

3.3.3. Gazetteer-based methods

A Gazetteer dictionary (or simply Gazetteer) is a list of entities such as person names, organizations and locations; for the medical domain specifically, it is a list of diseases, drug names, symptoms and causes. A gazetteer data set that is good, complete and accurate is an important prerequisite for an entity recognition system. In addition to building the ontology, this thesis also covers building an initial gazetteer set for Vietnamese medicine. Entity recognition based on this Gazetteer set gives promising results.

Gazetteer files are declared in the following format: a.lst:b:c, where a.lst is the file containing the instances of entity class a, b is the major type and c is the minor type. Put simply, a class of the minor type is a subclass of the class of the major type. For example, the gazetteer files for causes of disease are declared as follows: nguyen_nhan.lst:nguyen_nhan:vikhuan, nguyen_nhan.lst:nguyen_nhan:tac_nhan.
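A lookup over such declarations can be sketched as follows. This is a minimal sketch, not the recognizer used in the thesis: the helper names, the file names `vikhuan.lst` and `tac_nhan.lst`, the in-memory file contents and the example sentence are all hypothetical, and matching is plain case-insensitive substring search.

```python
def parse_definition(line):
    """Split one 'a.lst:major:minor' declaration into its three fields."""
    filename, major, minor = line.strip().split(":")
    return filename, major, minor


def build_lookup(definitions, file_contents):
    """Map each lower-cased gazetteer entry to its (major, minor) type."""
    lookup = {}
    for line in definitions:
        filename, major, minor = parse_definition(line)
        for entry in file_contents.get(filename, []):
            lookup[entry.lower()] = (major, minor)
    return lookup


def find_entities(text, lookup):
    """Return sorted (entry, major, minor) tuples for entries occurring in text."""
    lowered = text.lower()
    return sorted((entry, *types) for entry, types in lookup.items() if entry in lowered)


# Hypothetical declarations and file contents (kept in memory instead of on disk).
definitions = [
    "vikhuan.lst:nguyen_nhan:vikhuan",
    "tac_nhan.lst:nguyen_nhan:tac_nhan",
]
file_contents = {
    "vikhuan.lst": ["vi khuẩn lao"],
    "tac_nhan.lst": ["khói thuốc"],
}

lookup = build_lookup(definitions, file_contents)
print(find_entities("Bệnh do vi khuẩn lao gây ra.", lookup))
```

A production recognizer would match on token boundaries and prefer the longest match; substring search is used here only to keep the sketch short.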

Figure 6: Some Gazetteer files built for the entity recognition task.

A fair number of papers discuss using a data set for entity recognition. In the paper on building a data set for entity recognition (presented in Section 3.4.1), the authors stress the importance of building an initial data set for the entity recognition process. The paper uses BioCaster NE to annotate the data and Yamcha to learn an SVM model from the annotated articles [20].

3.4. Vietnamese medical entity recognition

3.4.1. Vietnamese entity recognition

Several studies discuss using a data set for Vietnamese entity recognition. Nguyễn Cẩm Tú [1] built an entity recognition system based on Conditional Random Fields (CRF) to identify 8 entity types, corresponding to 17 labels. The author ran experiments with the FlexCRFs tool (an open-source tool developed by Phan Xuân Hiếu and Nguyễn Lê Minh) on 50 business news articles (nearly 1400 sentences) taken from http://vnexpress.net.

Thao P.T.X. and colleagues [21] exploit voting strategies by combining learners that use a word-based approach. Their main idea is that combining learners built on different classification algorithms (SVM, CRF, TBL, Naïve Bayes) gives better results than using each algorithm separately.

In [20], Thao P.T.X. and colleagues stress the importance of building an initial data set for the entity recognition process. The authors use BioCaster NE to annotate the data and Yamcha to learn an SVM model based on related research. Working on the detection of infectious diseases from online health articles, they note that building the data set plays a very important role in entity recognition and propose 22 entity labels for labeling and annotating the data.

A representative study related to entity recognition in Vietnam is the VN-KIM IE tool [40], built by the research group led by Assoc. Prof. Dr. Cao Hoàng Trụ at Ho Chi Minh City University of Technology. VN-KIM IE recognizes named entities on Vietnamese web pages and automatically annotates them with classes.

3.4.2. Vietnamese medical entity recognition

Worldwide, several researchers (John McNaught [10], Sammy Wang [25], ...) have pointed out difficulties in processing medical data. The most typical difficulties are the ambiguity and diversity of words; entities in medical data have complex structure and their formation rules are sometimes unusual; there is still no clear convention for entity names; there are problems of synonyms, antonyms and abbreviations, and in many cases a word is not used in its common sense; many words denote the same concept and one word can have several meanings.

For Vietnamese medical entity recognition, there are further obstacles beyond the general difficulties of entity recognition mentioned above. Vietnamese texts lack training data and lookup resources (such as WordNet for English), and lack grammatical information (POS) and phrase information such as noun phrases and verb phrases, even though this information plays an important role in entity recognition; word boundaries are unclear, which easily causes ambiguity.

Moreover, the characteristics of medical data themselves cause considerable difficulty for entity recognition: the information is stored in unstructured or semi-structured form (drug names, viruses...), entity names are abbreviated in various ways, entity names are long and diverse, and the same entity is written in different ways.

For Vietnamese disease entities in particular, the following characteristics make entity recognition difficult:

- They follow no rule about capitalization.
- The number of words is hard to bound: some disease names consist of a single word (such as "bệnh sởi", measles, or "bnh chn"...), while others consist of many words, such as "chứng rối loạn tâm thần thể hoang tưởng" (paranoid psychotic disorder).
- The structure of the words forming an entity can be very complex: "rối loạn chức phận não nhẹ ở trẻ em" (mild brain dysfunction in children), ...
- There are many loanwords and Sino-Vietnamese words: Stress, bệnh paranoia, bệnh gout, bệnh thiên đầu thống...
- The same disease sometimes has several spellings that are not quite identical or even entirely different: "thủy đậu" or "trái rạ"; "bệnh gút" or "gout", also called "thống phong"; "bệnh ung thư máu", also called "bệnh máu trắng"...
- There are many abbreviations: AIDS (from English Acquired Immunodeficiency Syndrome or Acquired Immune Deficiency Syndrome) is rendered in many Vietnamese medical documents as "hội chứng suy giảm miễn dịch mắc phải", ...
- They contain words that are easily missed, because the phrase still counts as an entity with or without them, such as mãn tính (chronic), cấp tính (acute), nguyên phát (primary), thứ phát (secondary)...

Entity recognition specific to biological and medical data is also a very active research topic. The entities most often of interest in biomedical data are: Disease, Drug, Gene, Organism, Protein, Enzyme, Malignancies, Fibrinogen... [10] [23]

One of the simplest methods proposed for entity recognition in medical data is to use dictionaries or predefined vocabularies. An example is MeSH [23], a controlled medical vocabulary used for indexing. In essence it is a list of terms approved for use as index terms, and only terms in this list are accepted in that role. The terms in MeSH are arranged in a tree structure. The MeSH tree has 16 branches in all, which are the largest and most characteristic term groups in medical data, including branch A - Anatomy, branch B - Organisms, branch C - Diseases, branch D - Chemicals and Drugs, branch G - Biological Sciences... The branches are divided into sub-branches, for example A01 - Body Regions, A09 - Sense Organs...

In the international BioCreAtIvE series (Critical Assessment of Information Extraction systems in Biology), organized as a competition, BioCreAtIvE I (2003-2004) focused on recognizing Gene and Protein entity names. Some representative results are as follows [32]:

- Alexander Yeh and colleagues used the data and evaluation software of W. John Wilbur and Lorraine Tanabe and obtained an F-measure of about 80-83%.
- Shuhei Kinoshita and colleagues treated entity recognition as a form of part-of-speech tagging, adding a GENE label to the usual tag set. The authors used Brill's part-of-speech tagging method and the TnT tool, which is based on the HMM model. Without post-processing the system achieved a precision of 68.0%, a recall of 77.2% and an F-measure of 72.3%; with a post-processing step (a number of error-catching rules) it achieved a precision of 80.3%, a recall of 80.5% and an F-measure of 80.4%; with an additional dictionary-based post-processing step it achieved an F-measure of 80.9%.

In 2004, Yi-Feng Lin, Tzong-Han Tsai, Wen-Chi Chou, Kuen-Pin Wu, Ting-Yi Sung and Wen-Lian Hsu published research on applying the Maximum Entropy Markov Model to entity recognition in medical data. The results, given as precision P, recall R and F-measure (2PR/(P+R)), are (0.512, 0.538, 0.525); after post-processing the corresponding results are (0.729, 0.711, 0.72).

In 2004, Haochang Wang and colleagues [7] proposed an entity recognition method for medical data based on a classifier that combines Generalized Winnow, Conditional Random Fields, Support Vector Machine and Maximum Entropy, with the methods combined under three different strategies. The system achieved an F-measure of about 77.57%, a fairly good result compared with other studies of the same period.

In 2007, Andreas Vlachos [3] compared two entity recognition methods for medical data, one based on the HMM model and one based on the CRF model with syntactic parsing. The two tables below show the experimental results: the table on the left gives the results of training on a small set of manually annotated data and testing on the whole training set, and the table on the right gives the results of training on a small set of noisy data and testing on the whole training set...

Most recently, in March 2009, Razvan C. Bunescu [45], in presenting relation extraction from medical data, drew attention to the recognition of entities specific to medical data; the entities of interest are Disease, Gene and Protein. After recognizing these entities, the author takes a further important step and extracts the interaction relations among them (for example, a Gene encodes a Protein, or a Protein fulfils its function by interacting with another Protein...).
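The F-measure figures quoted in Section 3.4.2 can be checked against the formula F = 2PR/(P+R) stated there. The short sketch below recomputes F from each reported precision and recall pair; the function name `f_measure` is an assumption, and the numbers are those quoted in the text.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) triples quoted in Section 3.4.2.
reported = [
    (0.512, 0.538, 0.525),   # MEMM, before post-processing
    (0.729, 0.711, 0.72),    # MEMM, after post-processing
    (68.0, 77.2, 72.3),      # TnT, no post-processing (percent)
    (80.3, 80.5, 80.4),      # TnT, rule-based post-processing (percent)
]

for p, r, f in reported:
    print(p, r, "reported:", f, "recomputed:", round(f_measure(p, r), 3))
```

Because F is a harmonic mean, it always lies between P and R and sits closer to the smaller of the two, which is why a system cannot reach a high F-measure by raising only one of them.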


Chapter 4 IDENTIFYING SEMANTIC RELATIONS

4.1. Overview of semantic relation identification

4.1.1. Semantic relations in general

As presented above, once we have a set of entity classes (from the entity recognition step), obtaining a semantic network of entities requires a further step: extracting semantic relations. A semantic relation can be understood as a latent relation between two concepts expressed by words or phrases [24]. Semantic relations play an important role in lexical semantic analysis and can therefore be applied to many other tasks: building lexical semantic knowledge bases, question answering systems, text summarization, and so on. Some typical semantic relations in the medical domain are IS_A (Flu - Disease), PART_WHOLE (Virus - Cause) and CAUSE_EFFECT (Virus - Disease).

Figure 7: Illustration of a semantic relation for the entity "car"

Although semantic relations play an important role in semantic analysis, they usually exist in implicit form, which makes them hard to extract. The question is how to mine these semantic relations effectively from raw (unstructured or semi-structured) data. Answering this question is the main goal of the relation extraction problem, which has received much attention recently.

4.1.2. Semantic relation extraction

The goal of semantic relation extraction is to extract particular, specific relations between entities from a large text corpus. In essence, given a pair of entities x-y, the task is to determine the meaning of that pair [24]. For example, from the sentence "mất ngủ do căng thẳng, hồi hộp" (insomnia due to stress and anxiety) we can infer the semantic relation: stress and anxiety are causes of insomnia.

Figure 8. Illustration of semantic relation extraction

The resources for semantic relation extraction include:

- Data sets: based on co-occurrence and statistical methods.
- Existing resources on semantic relations such as WordNet and reference standards.
- Human judgment.

Like entity recognition, semantic relation recognition has its own difficulties: (1) there is no agreement on the number of semantic relations, and semantic relations are hidden under different forms; (2) noun-noun combinations do not fully obey fixed constraint rules, semantic relations are usually implicit, there may be several relations between a pair of concepts, interpretation may depend heavily on context, and there is no well-defined set of semantic relations.

Semantic relation extraction is part of important international projects in knowledge discovery [24], for example ACE (Automatic Content Extraction), DARPA EELD (Evidence Extraction and Link Discovery), ARDA-AQUAINT (Question Answering for Intelligence), ARDA NIMD (Novel Intelligence from Massive Data) and Global WordNet.

Figure 9. The place of semantic relation mining in natural language processing

The semantic relations differ from one domain or field to another. The table in Figure 10 illustrates some semantic relations in WordNet.


Figure 10. Illustration of the semantic relations given in WordNet [37]

For the medical data domain, our survey collected 12 types of semantic relation, which are described in detail in Chapter 5.


Figure 11. Some of the semantic relations constructed

Figure 11 shows some of the semantic relations; their meanings are described in Table 1.


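Relation | Meaning | Inverse relation
Gây_ra | A cause (nguyên_nhân) causes a disease | Bị_gây_ra_bởi
Có_triệu_chứng | A disease has symptoms |
T i | An organization (Tổ_chức) is located at a place (Địa_điểm) |
Chữa_bằng | A disease is treated with a drug |
Làm_việc | A person works at an organization (tổ_chức) |
Biến_chứng | A disease develops into another disease as a complication |
Tương_tác_thuốc | A drug interacts with a drug |
Phát_hiện_tại | A disease is detected at an organization (Tổ_chức) |
Tác_động_tốt | Food (Thực_phẩm), activity (Hoạt_động) or chemical (Chất_hóa_học) has a good effect on the human body (cơ_thể_người) or a disease |
Tác_động_xấu | Food (Thực_phẩm), activity (Hoạt_động) or chemical (Chất_hóa_học) has a bad effect on the human body (cơ_thể_người) or a disease |

Inverse relations also listed: Chữa, Liên_quan.

Table 1. Explanation of the semantic relations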
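A relation and its inverse can be stored as triples and converted mechanically. The sketch below uses only the pair Gây_ra / Bị_gây_ra_bởi from Table 1 and the insomnia example from Section 4.1.2; the function name `invert` and the triple layout (subject, relation, object) are assumptions made for illustration.

```python
# Relation -> inverse relation; only the pair given explicitly in Table 1 is listed.
INVERSE = {"Gây_ra": "Bị_gây_ra_bởi"}
# Make the mapping symmetric so that the inverse of the inverse is the original.
INVERSE.update({inv: rel for rel, inv in list(INVERSE.items())})


def invert(triple):
    """Return the inverse triple, or None if the relation has no known inverse."""
    subject, relation, obj = triple
    if relation not in INVERSE:
        return None
    return (obj, INVERSE[relation], subject)


# "căng thẳng" (stress) causes "mất ngủ" (insomnia).
fact = ("căng thẳng", "Gây_ra", "mất ngủ")
print(invert(fact))
```

Storing only one direction and deriving the other keeps the knowledge base consistent, since the two directions can never disagree.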

4.1.3. Some research related to semantic relation identification

At the SemEval 2007 workshop [38], recognizing semantic relations between two nouns was one of the main topics. The meaning of two entities is tied to the meaning of the other words in the context, and the pair is recognized as an instance of some relation type, for example "cycling" and "happiness" (a cause-effect relation). Semantic relation extraction there is based on 7 basic relations: Cause-Effect, Instrument-Agency, Product-Producer, Origin-Entity, Theme-Tool, Part-Whole and Content-Container.

In addition, there are methods that extract relations between two concepts of the following kind: a drug is a treatment for a disease, or a gene is a cause of a disease. Swanson [29] introduced a model that extracts such relations from biomedical databases by finding a third concept (for example a physiological function) that is related to both of the concepts concerning the disease. Extracting this third concept reveals a relation between the two main concepts that is not stated in any single document. More concretely: X relates to some disease, Z relates to a drug, and Y is a pathological or physiological function, a symptom, and so on. X and Y, and Y and Z, are often mentioned together, whereas X and Z do not appear together in any one research document. We can then use concept Y to establish a link between the two concepts X and Z.

Regarding the use of ontologies, several groups of authors discuss semi-supervised learning with an ontology as a new approach. In this approach, the input is a set of text documents (entity names corresponding to concepts in the ontology that have just been identified). With existing data sets such as the GENIA corpus [14], labeling is done manually, but corpus data can also be generated automatically with a corresponding NER system. The output is a set of templates consisting of class pairs and a relation in the GENIA ontology (for example the template: virus infect cell).

Many methods have been proposed for identifying relations. The common task, however, is this: starting from raw text such as web pages, documents and news, a semantic parser produces knowledge bases (KB) about the concepts, the relations and the links between documents [24]. Figure 12 shows the general task of the relation identification problem.

Figure 12. The general task of the relation identification problem

Relation identification can also be understood as determining, for a pair of nouns (entities), the meaning of that pair [24]. The meaning is expressed through a list of relations, the recognized entity pairs and some other resources.

As presented above, the semantic parser plays an important role in extracting semantic relations. It consists of the components shown in Figure 13:


Figure 13. The components of the semantic parser SR [24]

- Preprocessing: tokenizer, part-of-speech tagger, syntactic parser, word sense disambiguation, named entity recognition.
- Feature selection: identify the properties and constraints (or context) that the classifier uses to distinguish the semantic relations.
- Learning model: classify the input instances into the appropriate relations.

The semantic parser (SR) performs two main tasks:

- Labeling: given the predefined semantic relations and a pair of entities (noun - noun), label the relation between the two entities. For example, "bánh xe ô tô" (car wheel) <Part_Whole>.
- Paraphrasing: given a pair of nouns or entities, produce an expression of their relation in the context of those nouns. For example, from "bệnh mất ngủ do căng thẳng" (insomnia due to stress) we can infer the relation that stress is a cause of insomnia.

4.2. Semantic labeling of sentences

In [30], Xuan-Hieu Phan and colleagues address cross-document entity disambiguation by assigning semantic labels to the sentences of a document. Cross-document entity disambiguation means distinguishing entities that share the same surface form within a given set of documents. For example, given a set of entities that all appear as "Bill Clinton", we must determine which subset of documents really is about Bill Clinton the former US president, which subset is about Bill Clinton the golfer, and which is about some other Bill Clinton.

Semantic labeling can be viewed as the problem of classifying sentences that contain semantic relations. The paper uses a Maxent-based classifier that takes the sentences of a personal summary as input and outputs them with semantic labels. A Maxent-based classifier has the advantage of tightly integrating a very large number (up to hundreds of thousands or millions) of overlapping features that are independent at different levels.

The authors of [30] also propose a framework for cross-document entity disambiguation with three main parts, of which semantic labeling of sentences is an indispensable one:

- Preprocessing: use shallow processing to collect a summary consisting of the sentences related to the entity in question.
- Assign semantic labels to the sentences of the summary in order to place them in different classes of facts. The assignment is performed by a high-accuracy Maxent-based classifier, trained with a semi-supervised learning method.
- Apply a clustering method: the similarity between personal summaries is computed over sentences that carry the same semantic labels, in order to measure semantic closeness.

Figure 14. The framework for resolving personal names across documents.

Figure 14 shows that sentence-level semantic labeling plays an important role in cross-document name resolution and also serves as a basis for identifying semantic relations. Some semantic labels for sentences are illustrated in Figure 15 below.


Figure 15. Some semantic labels assigned to sentences [30]

With these labels, the personal summary of Bill Clinton is labeled as shown in Figure 16 below.

Figure 16. Semantic labels for the sentences describing President Bill Clinton [30].

The thesis experimentally labels 1000 sentences with labels for relations in the medical domain. The labels and the labeled data are presented in detail in Chapter 5.

4.3. Classifying sentences that contain relations


4.3.1. Classification in relation identification and entity recognition

The entities to recognize and the relations to identify depend on the task and on the application domain. For example, entity names may be person names, organization names, locations, ... (the usual named entity recognition task). In the domain of this thesis, entity names may be names of diseases, drugs, symptoms, causes, ... However, for some entity names and relations, e.g. disease names, symptoms, causes, the relation có_triệu_chứng (has_symptom) and the relation có_biến_chứng (has_complication), recognizing and distinguishing them is itself a complex problem. A disease name often coincides with a symptom or a cause: headache and cough, for example, can be read as diseases, but also as causes or symptoms, depending on the context.

Entity recognition and relation identification are closely tied to classification: once recognized, entities must be assigned to the correct classes. Moreover, as presented in the previous section, sentence-level semantic labeling is essentially based on classification algorithms. For these reasons the thesis addresses the classification problem and the classification algorithms studied in recent years.

Figure 17 describes the stages of the classification process. The model has three main stages. The first is data representation, i.e. converting the data (sentences) into a structured form and collecting the given samples into a training set. The second applies machine learning techniques to the training samples just represented, so the representation produced in stage one is the input to stage two. The third adds knowledge supplied by the user in order to increase the accuracy of the text representation or of the learning process.

[Figure 17: flow diagram with the boxes Data (sentences); Initial representation; Dimensionality reduction or attribute selection; Final representation; Inductive learning; Classification tools; Added knowledge]

Figure 17. Stages of the classification process

In recent years many algorithms have been proposed for classification, e.g. SVM (Support Vector Machine), k-nearest neighbors, decision-tree classification, ... These algorithms are described in some detail by Nguyễn Minh Tuấn [2]. We use SVM to classify sentences that contain relations; the following sections present this algorithm in more detail.

4.3.2. The SVM (Support Vector Machine) algorithm

The Support Vector Machine (SVM) algorithm was introduced by Cortes and Vapnik in 1995. SVM is very effective for problems with high-dimensional data (such as the vectors that represent text). The algorithm is run on a training set D = {(X_i, C_i), i = 1, ..., n}, where C_i ∈ {-1, 1} indicates whether an example is positive or negative. Its goal is to find a hyperplane svm · d + b = 0 that divides the data into two regions. Classifying a new document d amounts to determining the sign of f(d) = svm · d + b: the document belongs to the positive class if f(d) > 0 and to the negative class if f(d) < 0.
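As a minimal sketch of the decision rule above (not the implementation used in the thesis), the following snippet evaluates f(d) = w · d + b for a document vector and derives the class from the sign. The weight vector `w`, the bias `b` and the toy vectors are illustrative assumptions; in practice they are the result of SVM training.

```python
from typing import Sequence


def decision_value(w: Sequence[float], d: Sequence[float], b: float) -> float:
    """Return f(d) = w . d + b for the hyperplane parameters (w, b)."""
    if len(w) != len(d):
        raise ValueError("w and d must have the same dimension")
    return sum(wi * di for wi, di in zip(w, d)) + b


def classify(w: Sequence[float], d: Sequence[float], b: float) -> int:
    """Return +1 if f(d) > 0, -1 if f(d) < 0, and 0 exactly on the hyperplane."""
    f = decision_value(w, d, b)
    return (f > 0) - (f < 0)


# Hypothetical, hand-picked hyperplane in a 3-dimensional feature space.
w = [1.0, -2.0, 0.5]
b = -0.5

print(classify(w, [2.0, 0.0, 1.0], b))  # f = 2.0 + 0.5 - 0.5 = 2.0, positive class
print(classify(w, [0.0, 1.0, 0.0], b))  # f = -2.0 - 0.5 = -2.5, negative class
```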

Figure 18. Separation of documents by the sign of the function f(d) = svm · d + b

4.3.3. Multi-class classification with SVM

Relation classification requires a multi-class classifier, so the basic (binary) SVM has to be extended into a multi-class classifier. One such extension is the one-against-all algorithm [13]. The basic idea is as follows. Suppose the sample set is (x_1, y_1), ..., (x_m, y_m), where x_i is an n-dimensional vector and y_i ∈ Y is the class label assigned to x_i. Split Y into m two-class subsets of the form z_i = {y_i, Y \ y_i}. Apply the basic binary SVM to each of the m sets z_i to build the separating hyperplane for that split. The classifier obtained by combining these m classifiers is called the multi-class extension of SVM.
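The one-against-all idea can be sketched as follows. This is a hedged illustration, not the method of [13] in detail: the text does not say how the binary classifiers are combined, so the sketch assumes the common convention of choosing the class with the largest decision value. The label names `CAUSE`, `HAS`, `TREAT` and the linear scorers are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def one_vs_rest_labels(labels: List[str], positive: str) -> List[int]:
    """Relabel a multi-class problem as binary: `positive` -> +1, every other class -> -1."""
    return [1 if y == positive else -1 for y in labels]


def linear(w: Vector, b: float) -> Callable[[Vector], float]:
    """Build a decision function f(x) = w . x + b (stands in for a trained binary SVM)."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(scorers: Dict[str, Callable[[Vector], float]], x: Vector) -> str:
    """Combine the binary classifiers: return the class whose decision value is largest."""
    return max(scorers, key=lambda c: scorers[c](x))


labels = ["CAUSE", "HAS", "TREAT", "HAS"]
print(one_vs_rest_labels(labels, "HAS"))  # [-1, 1, -1, 1]

# One hypothetical binary scorer per class, in a 2-dimensional feature space.
scorers = {
    "CAUSE": linear([1.0, 0.0], 0.0),
    "HAS": linear([0.0, 1.0], 0.0),
    "TREAT": linear([-1.0, -1.0], 0.0),
}
print(predict(scorers, [0.2, 0.9]))  # HAS
```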

4.3.4. Applying SVM to semantic relation classification in the Vietnamese medical domain

Although SVM was originally designed for binary classification, it has since been extended to multi-class classification, and this extension can be used to classify sentences that contain relations [2]. Building an SVM-based relation classification model involves two data preparation steps:

- Design a taxonomy for the set of relation classes. The application domain of the relations determines the complexity (depth) of the taxonomy.
- Build a corpus labeled for each relation class. In this step the choice of the features that represent a relation plays an important role. The feature set differs with the characteristics of each language; for English, for example, the features are words.

Once the set of relation classes and the data set have been built, learning proceeds according to the following model:

[Figure 19: Sentence (containing a relation) -> Preprocessing -> Sentence -> Feature extraction -> Feature vector set -> SVM multi-class classification]

Figure 19. Learning process of the classifier for sentences that contain relations [2]
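The feature extraction step of this pipeline can be sketched as a word-level bag-of-words representation. This is an assumption for illustration only: the text states that words are the features for English but does not fix the feature set for Vietnamese. The toy sentences are assumed to be already word-segmented, with underscores joining the syllables of one word.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(sentences: List[List[str]]) -> Dict[str, int]:
    """Map each distinct token to a column index, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in sentences:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_vector(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Return a bag-of-words count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(t for t in tokens if t in vocab)
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        vec[vocab[tok]] = n
    return vec


# Toy, pre-segmented training sentences (hypothetical examples).
train = [
    "bệnh do vi_khuẩn gây_ra".split(),
    "bệnh có triệu_chứng sốt".split(),
]
vocab = build_vocabulary(train)
vectors = [to_vector(s, vocab) for s in train]

print(len(vocab))   # 7 distinct tokens
print(vectors[0])   # [1, 1, 1, 1, 0, 0, 0]
print(vectors[1])   # [1, 0, 0, 0, 1, 1, 1]
```

Vectors of this kind would then be passed to the multi-class SVM; any real system would likely add richer features than raw word counts.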


Chapter 5 EXPERIMENTS

Building an ontology for Vietnamese medicine and extending it automatically through the steps of information extraction (entity recognition, relation identification) lays the groundwork for building a semantically annotated data set (a semantic network). The result of this work plays an important role in building a semantic search engine in the future.

5.1. Experimental environment

5.1.1. Hardware

We used a personal computer with the following configuration: Genuine Intel CPU T2050 1.60 GHz, CHIP 798 MHz, 1 GB RAM.

5.1.2. Software

We integrated utilities from the Protégé and Gate toolkits to build the ontology, annotate data and recognize Vietnamese entities in the medical domain.

Protégé [51] is an ontology-building tool developed at the Stanford Center for Biomedical Informatics Research of the Stanford University School of Medicine. Protégé comes in two variants: Protégé Frames and Protégé OWL. Protégé Frames provides a full user interface and a ready-made model for creating and storing ontologies in frame form. Protégé OWL supports the Web Ontology Language, which is endorsed by the W3C for the Semantic Web.

Gate [31] is a software architecture for developing and deploying software components that process human language. Gate helps developers in three ways:

- It specifies an architecture, or organizational structure, for language processing software.
- It provides a framework, or class library, that implements this architecture and can be used in natural language processing applications.
- It provides a development environment built on top of the framework, with convenient graphical tools for developing components.


Gate exploits component-based software development, object orientation and mobile code. The framework and the development environment are written in Java and released as open-source software under the GNU library licence. Gate uses Unicode (Unicode Consortium 96) and has been tested on several languages, including German and Indic languages. Gate has been developed at the University of Sheffield since 1995 and has since been used in research and in a variety of projects. Version 1 was released in 1996 and was licensed by hundreds of organizations. It has been used in a wide range of language analysis contexts and in many languages: English, Greek, Swedish, German, Italian, French, ... Later versions have served both research and applications increasingly well.

5.1.3. Experimental data

We first collected more than 500 web pages from the site http://suckhoedoisong.vn and then removed or cleaned noisy documents that do not help ontology construction or entity recognition. After this processing about 400 pages remained, corresponding to more than 5000 sentences, which are used for building the ontology, recognizing entities and laying the foundation for sentence relation classification. We used JvnTextPro, the word segmentation tool of Nguyễn Cẩm Tú [1], to strip the HTML from the pages and to split the documents into sentences and words.

5.2. Building the ontology

5.2.1. Entity class hierarchy

From the medical data collected from web pages and from existing ontologies, we listed the important terms so that definitions can be given to users; a follow-up research direction is to link these terms automatically to the definitions available on Wikipedia. From these terms we then define their properties. Building an ontology is an iterative process that starts by defining the concepts in the class hierarchy and describing the properties of those concepts. Based on a survey of the BioCaster ontology against Vietnamese terms, together with a large number of current Vietnamese medical web pages, we built a set of the most basic terms and relations and from it proposed an initial experimental ontology. The following are some of the entity classes that the thesis proposes for the ontology:

- Thuốc (Drug): traditional Eastern medicine and Western medicine. Examples are the anticancer drug 5-Fluorouracil Ebewe (colorectal, breast, esophageal and stomach cancer) and Ciloxan, an antiseptic against eye infections. The traditional remedy ngũ gia bì treats rheumatism and strengthens tendons and bones.
- Bệnh, hội chứng (Disease, syndrome): diseases such as avian flu and gastric ulcer, and syndromes such as insomnia and heart failure.
- Triệu chứng (Symptom): for example, the symptoms of H5N1 flu are high fever, headache, aches all over the body, ...
- Nguyên nhân (Cause): agents (viruses, bacteria, mosquitoes, chickens, birds, ...) and other causes such as lack of sleep, lack of exercise and passive smoking.
- Thực phẩm (Food): dishes that benefit or harm human health, or that suit certain diseases.
- Người (Person): doctors and professors whom patients can seek out for examination and help when they fall ill.
- Tổ chức (Organization): hospitals, clinics and pharmacies, the places patients can go to when they fall ill.
- Địa điểm (Location): the address of an organization that patients can go to, and places where an epidemic is emerging and spreading.
- Cơ thể người (Human body): all parts of the human body that can be affected by disease: eyes, nose, liver, heart, ...
- Hoạt động (Activity): diagnosis and treatment, testing, resuscitation, artificial respiration, prevention, vaccination, ...
- Hóa chất (Chemical): vitamins and minerals that affect the human body for better or worse; e.g. vitamin A is good for the eyes, and vitamins C and E reduce the risk of heart disease.
- Hội chứng (Syndrome): a syndrome that may appear in a disease (e.g. the shock syndrome of dengue hemorrhagic fever).
- Biến chứng (Complication): a disease may develop into another disease as a complication (e.g. mumps with the complication meningitis).


Figure 20. The classes in the constructed ontology.

Figure 21. The hierarchical structure of the constructed ontology.



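5.2.2. Relations between entity classes

The thesis uses the following semantic relations between entities, both to build the semantic relations of the ontology and to assign semantic labels to sentences:

- Drug - drug interaction: one drug may cause side effects when taken with another, or drugs may be combined to treat a disease. For example, the anticancer drug Alexan should not be used together with methotrexate or 5-fluorouracil.
- Food affects a disease or the human body, for better or worse. For example, drinking a lot of soda carries a risk of metabolic disorders, increased waist circumference, high blood pressure, ...
- Disease - drug relation.
- Cause - disease relation: a cause gives rise to a disease, or a disease has a cause.
- Disease - symptom relation.
- Disease - complication relation: a disease develops into another disease as a complication.
- Activities that act on a disease.
- A person works in an organization at some location.
- A disease belongs to the specialty of a person.
- A disease is detected and treated at an organization.
- A disease gives rise to another disease as a complication.
- Disease - syndrome relation.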


Figure 22. Instances of entity classes and the relations between them

Figure 22 illustrates relations between instances of the entity classes. It shows the instance sốt Dengue (dengue fever) and its relations to instances of other entity classes: Gán_nhãn (labeled), phát_hiện_tại (detected_at), có_triệu_chứng (has_symptom), biến_chứng (complication), chữa_bằng (treated_with), bị_gây_ra_bởi (caused_by).

The thesis built an ontology with 21 entity classes, 13 relations and more than 500 instances of the entity classes.

5.3. Data annotation

The thesis integrates the ontology into Gate (General Architecture for Text Engineering) in order to annotate data. Given the collected data and the constructed ontology, annotation proceeds through the following steps:

- Open the file that contains the data to annotate; a folder containing many such files can also be opened.
- Use Gate's Data_Store to save the data, both as opened and after annotation.
- Open the constructed ontology. Gate can also be used to edit the ontology's classes, properties, ...
- Change the annotation colors of the ontology entities as appropriate, so that the entities can be told apart clearly.
- Select the entity to annotate and select the name of the ontology class to annotate it with.

After annotation we obtain data that contains entities corresponding to the classes defined in the ontology. Annotating data makes it easier to build a corpus over the medical data and also contributes to automatically extending the entities in the ontology. The thesis annotated 96 data files, corresponding to more than 1500 instances.

Figure 23. A data file annotated using the ontology.



5.4. Entity recognition

5.4.1. Building the gazetteer

After annotation we have data files annotated with distinct entity classes. From these annotated files we can build a data set of entity names; a good set of names makes entity recognition more effective. To build it, the thesis uses the ontology together with a Gate extension, the gazetteer. Besides providing data for entity extraction, the gazetteer lets us list words that are directly associated with certain relations. For example, the relation gây_ra (causes) between the entities nguyên_nhân (cause) and bệnh (disease) commonly appears with words such as gây, gây_ra, làm, làm_cho, ... Table 2 shows the number of instances of each entity class in the gazetteer.

Entity class                              Count
Bệnh (disease)                              232
Triệu chứng (symptom)                       246
Cơ_thể_người (human body)                    78
Virut (virus)                                53
Vi_khuẩn (bacterium)                         38
Phòng_khám (clinic)                          27
Bệnh_viện (hospital)                         52
Hiệu thuốc (pharmacy)                        81
Biến_chứng (complication)                    93
Gây_ra (causes)                              15
Thuốc (Đông y) (drug, Eastern medicine)     212
Thuốc (Tây y) (drug, Western medicine)      151
Thực phẩm (food)                            145
Chất_hóa_học (chemical)                     122
Hoạt_động (activity)                        147
Total                                      1692

Table 2. Number of instances of each entity class in the gazetteer.
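Gazetteer-based recognition can be illustrated with a small dictionary lookup. The sketch below is not Gate's gazetteer implementation; it assumes a greedy longest-match strategy over whitespace tokens, and the gazetteer entries are a tiny hypothetical sample using two class names from Table 2.

```python
from typing import Dict, List, Tuple

Index = Dict[Tuple[str, ...], str]


def build_index(gazetteer: Dict[str, List[str]]) -> Index:
    """Map each entry (as a tuple of lower-cased tokens) to its entity class."""
    index: Index = {}
    for entity_class, names in gazetteer.items():
        for name in names:
            index[tuple(name.lower().split())] = entity_class
    return index


def annotate(tokens: List[str], index: Index) -> List[Tuple[int, int, str]]:
    """Greedy longest-match lookup; returns (start, end, class) token spans."""
    longest = max((len(key) for key in index), default=0)
    lowered = [t.lower() for t in tokens]
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "sốt cao" wins over "sốt".
        for n in range(min(longest, len(tokens) - i), 0, -1):
            entity_class = index.get(tuple(lowered[i:i + n]))
            if entity_class is not None:
                spans.append((i, i + n, entity_class))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans


# Hypothetical miniature gazetteer.
gazetteer = {
    "Bệnh": ["cúm gà", "mắt hột"],
    "Triệu chứng": ["sốt", "sốt cao", "nhức đầu"],
}
index = build_index(gazetteer)
tokens = "Cúm gà có triệu chứng sốt cao và nhức đầu".split()
spans = annotate(tokens, index)
print(spans)
```

A lookup of this kind only finds names that are already listed, which is consistent with the remark in Section 5.4.4 that results depend on the similarity between the gazetteer source and the test data.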


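Figure 24. The files that contain entities in the constructed gazetteer

5.4.2. Evaluating the entity recognition system

Entity type recognition systems are evaluated with three measures: precision, recall and the F-measure. The three measures are computed by the following formulas:

Precision = correct / (correct + incorrect + spurious)
Recall    = correct / (correct + incorrect + missing)
F-measure = 2 * Precision * Recall / (Precision + Recall)

The meanings of the values correct, incorrect, missing and spurious are defined in Table 3 below.

Value       Meaning
Correct     Number of cases labeled correctly
Incorrect   Number of cases labeled incorrectly
Missing     Number of cases missed
Spurious    Number of spurious cases

Table 3. Values used to evaluate an entity type recognition system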
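A minimal sketch of these measures, assuming the usual MUC-style definitions in terms of the four counts of Table 3 (precision divides by everything the system produced, recall by everything it should have produced, and the F-measure is their harmonic mean). The counts in the example are hypothetical.

```python
def precision(correct: int, incorrect: int, spurious: int) -> float:
    """correct / (correct + incorrect + spurious); 0.0 if the system output nothing."""
    denom = correct + incorrect + spurious
    return correct / denom if denom else 0.0


def recall(correct: int, incorrect: int, missing: int) -> float:
    """correct / (correct + incorrect + missing); 0.0 if there was nothing to find."""
    denom = correct + incorrect + missing
    return correct / denom if denom else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one evaluation run.
p = precision(correct=8, incorrect=1, spurious=1)   # 8 / 10
r = recall(correct=8, incorrect=1, missing=3)       # 8 / 12
f = f_measure(p, r)                                 # 8 / 11
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

The harmonic-mean form can be checked against the reported figures: with the precision 57.89 and recall 61.1 of the first run, `f_measure(57.89, 61.1)` gives approximately 59.45, the F-measure reported for that run.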



5.4.3. Results

The results of 10 entity recognition runs on the semantically annotated files are shown in Table 4 below.

Measure         Run 1  Run 2  Run 3  Run 4  Run 5  Run 6  Run 7  Run 8  Run 9  Run 10
Precision [%]   57.89  56.52  66.67  66.67  57.89  77.06  65.2   60     56.25  73.3
Recall [%]      61.1   59.09  76.92  72.22  64.70  66.67  65.2   57.14  50     68.75
F-measure [%]   59.45  57.77  71.42  69.33  61.10  71.49  65.2   58.53  52.94  70.45

Table 4. Results of 10 entity recognition runs.
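As a worked check, the short script below averages the reported values over the 10 runs. The numbers are copied from Table 4; only the variable names are introduced here. The mean F-measure comes out at about 63.8%, consistent with the average of approximately 64% stated in the abstract.

```python
from statistics import mean

# Values reported in Table 4, runs 1 to 10, in percent.
precision_pct = [57.89, 56.52, 66.67, 66.67, 57.89, 77.06, 65.2, 60, 56.25, 73.3]
recall_pct = [61.1, 59.09, 76.92, 72.22, 64.70, 66.67, 65.2, 57.14, 50, 68.75]
f_measure_pct = [59.45, 57.77, 71.42, 69.33, 61.10, 71.49, 65.2, 58.53, 52.94, 70.45]

mean_p = mean(precision_pct)
mean_r = mean(recall_pct)
mean_f = mean(f_measure_pct)

print(f"mean precision: {mean_p:.1f}%")
print(f"mean recall:    {mean_r:.1f}%")
print(f"mean F-measure: {mean_f:.1f}%")  # about 63.8%
```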
[Figure 25: bar chart of Pre, Rec and F-Measure (scale 0 to 90) for runs 1 to 10]

Figure 25. Results of 10 entity recognition runs

5.4.4. Remarks and evaluation

Entity recognition with the gazetteer gives fairly good results (lowest value 50%, highest 77.06%). The gazetteer method performs well here because the training and test documents are similar to a certain degree, so the entities to be recognized usually appear in the gazetteer lists. If the test data were taken from a different source, the method might not give good results. In future work we will use data features, regular expressions, ... to obtain better results for entity recognition.


5.5. Assigning semantic labels to sentences

The ontology describes a number of relations between Vietnamese medical entity classes. From the relations in the thesis we filtered out some and use only 6 relation types:

- LÀ (IS): one entity is another entity (cúm gà - cúm A H5N1, i.e. avian flu - influenza A H5N1).
- CÓ (HAS): a disease has symptoms, complications, syndromes.
- GÂY_RA (CAUSES): the causes that give rise to a disease.
- LIÊN_QUAN (RELATED): a symptom is related to some disease.
- ĐIỀU_TRỊ (TREATS): the methods of treating a disease.
- TÁC_ĐỘNG (AFFECTS): food or activities that affect some disease.

From the collected data we labeled 1000 sentences as training data. Because time was limited and the data set to be built was large, the thesis only managed to build the data. In future work we will use 500 of these sentences for training and 500 for testing when classifying sentences that contain relations with the SVM algorithm. Table 5 shows some medical sentences labeled with the relations presented above (the sentences are translated from Vietnamese).

GÂY_RA     Trachoma is a conjunctivitis caused by the bacterium Chlamydia.
CÓ         The disease has recurrent episodes, conjunctivitis and corneal epithelitis.
CÓ         The disease manifests in very diverse ways, from mild cases with no symptoms at all to severe, prolonged cases with dangerous complications that can lead to blindness.
CÓ         The common symptoms are: gritty eyes, a feeling as if there were a grain of dust in the eye, itchy eyes, or tired eyes.
CÓ         Scarring of the conjunctiva leads to inturned eyelids, misdirected lashes and trichiasis.
TÁC_ĐỘNG   Prevent the disease by: washing the face with a clean personal towel and clean water, keeping the hands clean, not rubbing dirt into the eyes, not bathing in ponds and lakes, avoiding splashes of dirty water in the eyes, wearing glasses when going out, washing the face thoroughly on returning home; killing flies.
ĐIỀU_TRỊ   See a doctor as soon as the eyes show uncomfortable symptoms. When ill, follow the doctor's treatment instructions.
ĐIỀU_TRỊ   When you notice abnormal signs, you should visit an eye clinic or eye hospital for advice on how to treat the disease.
GÂY_RA     After the recent historic flood, many people in some localities came down with pink eye.
GÂY_RA     This is a disease commonly seen in flooded areas, due to the lack of clean water for daily use or to contact with chemicals.
LÀ         Pink eye (đau mắt đỏ, ĐMĐ) is also called conjunctivitis.

Table 5. Examples of sentences labeled with relations

APPENDIX - ENGLISH-VIETNAMESE GLOSSARY

Term                               Explanation
Assign sentence label              Gán nhãn ngữ nghĩa cho câu
Classifier                         Phân loại, phân lớp
Information Extraction             Trích chọn thông tin
Information Retrieval              Tìm kiếm thông tin
Machine Translation                Dịch máy
NE - Name Entity                   Tên thực thể
NER - Name Entity Recognition      Nhận dạng tên thực thể
Semantic Relation                  Quan hệ ngữ nghĩa
Semantic Search                    Tìm kiếm ngữ nghĩa
Semi-Supervised                    Học bán giám sát


CONCLUSION

Recognizing the importance of using online resources in the medical domain to serve people's lives, the thesis presents and experiments with several methods for mining this medical data, with the aim of providing knowledge for other problems such as semantic search.

The thesis presents a number of methods and tools for building ontologies and has built an ontology for Vietnamese medicine. This ontology gives an overall description of the basic entities in medical data and lays the groundwork for building a semantic network for semantic search.

The thesis also presents methods and tools for annotating data and builds an initial data set for entity recognition as well as for automatically extending the ontology using a gazetteer. The experimental results on this data set are relatively encouraging (lowest value 50%, highest 77.06%).

In addition, the thesis addresses a problem that has attracted much attention recently: relation identification. For this problem we give an overview of relations, relation identification, sentence-level semantic labeling and the classification of sentences that contain relations.

In future work we will extend the ontology automatically and use feature extraction, regular expressions and rule-based methods to improve the results of the entity recognition system. The thesis has made a first attempt at sentence-level semantic labeling with about 1000 sentences; in the near future these sentences will be used with the SVM algorithm to learn and classify the semantic relations contained in sentences.


REFERENCES

Vietnamese
[1] Nguyễn Cẩm Tú. Nhận biết các loại thực thể trong văn bản tiếng Việt nhằm hỗ trợ Web ngữ nghĩa và tìm kiếm hướng thực thể. Khóa luận tốt nghiệp ĐHCN, 5/2005, tr. 3, tr.
[2] Nguyễn Minh Tuấn. Phân lớp câu hỏi hướng tới tìm kiếm ngữ nghĩa tiếng Việt trong lĩnh vực y tế. Khóa luận tốt nghiệp ĐHCN, 5/2008, tr. 2-26.

English
[3] Andreas Vlachos. Evaluating and combining biomedical named entity recognition systems. Computer Laboratory, University of Cambridge, 2007.
[4] Brandon Beamer, Alla Rozovskaya, Roxana Girju. Automatic Semantic Relation Extraction with Multiple Boundary Generation. University of Illinois at Urbana-Champaign, 2008, pp. 3-4.
[5] David Nadeau. Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for the PhD degree in Computer Science, 2007, pp. 15-16.
[6] GuoDong Zhou, Jian Su. Named Entity Recognition using an HMM-based Chunk Tagger. Laboratories for Information Technology, Singapore, 2002, pp. 3-4.
[7] Haochang Wang, Tiejun Zhao, Hongye Tan, Shu Zhang. Biomedical Named entity recognition based on classifiers ensemble. International Journal of Computer Science and Applications, 2004, Vol. 5, No. 2, pp. 1-11.
[8] I. Horrocks, D. Fensel, F. Harmelen, S. Decker, M. Erdmann, M. Klein. OIL in a Nutshell. ECAI'00 Workshop on Application of Ontologies and PSMs, Berlin, 2000.
[9] I. Horrocks, F. van Harmelen. Reference Description of the DAML+OIL Ontology Markup Language. Technical report, 2001.
[10] John McNaught. Challenges for Terminology Management in Biomedicine. NaCTeM Associate, University of Manchester, 2005.


[11] Kawazoe, A., and Collier, N. BioCaster Project Working Report on English Named Entity Annotation. National Institute of Informatics, Japan, April 2007, pp. 4-6.
[12] Lassila, R. Swick. Resource description framework (RDF) model and syntax specification. W3C Recommendation, 1999, http://www.w3.org/TR/REC-rdfsyntax/.
[13] Liu Yi, Zheng Y. F. One-against-all multi-class SVM classification using reliability measures. Proceedings of the 2005 International Joint Conference on Neural Networks, Montreal, Canada, 2005.
[14] Massimiliano Ciaramita, Aldo Gangemi, Esther Ratsch, Jasmin Saric, Isabel Rojas. Unsupervised Learning of Semantic Relations between Concepts of a Molecular Biology Ontology. Institute for Cognitive Science and Technology (CNR), Italy, 2005, pp. 1-5.
[15] M. Fernández-López, A. Gómez-Pérez, A. Pazos-Sierra, J. Pazos-Sierra. Building a chemical ontology using METHONTOLOGY and the ontology design environment. IEEE Intelligent Systems & their applications 4 (1), 1999.
[16] M. Grüninger, M.S. Fox. Methodology for the design and evaluation of ontologies. Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, 1995.
[17] M. Uschold, M. King. Towards a Methodology for Building Ontologies. IJCAI'95 Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, 1995.
[18] Noy, N.F., and McGuinness, D.L. Ontology Development 101: A Guide to Creating Your First Ontology. SMI Technical report SMI-2001-0880, Stanford University, 2001.
[19] N. Guarino. Formal Ontology in Information Systems. Proceedings of FOIS'98: 3-15, Trento, Italy, 6/1998. Amsterdam, IOS Press.
[20] Thao Pham T. X., Tri T. Q., Ai Kawazoe, Dien Dinh, Nigel Collier. Construction of Vietnamese corpora for Named Entity Recognition. VNU of HCMC, Vietnam; National Institute of Informatics, Tokyo, Japan, pp. 1-3.
[21] Thao, P.T.X., Tri, T.Q., Dien, D., and Collier, N. Named entity recognition in Vietnamese using classifier voting. ACM Trans. Asian Lang. Inf. Process. 6, 4, Article 14, 12/2007, pp. 2-3.
[22] Tim Berners-Lee. Semantic Web Road map. http://www.w3.org/DesignIssues/Semantic.html.

[23] Razvan C. Bunescu. Learning to Extract Relations from Biomedical Corpora. Electrical Engineering and Computer Science, Ohio University, Athens, OH, 3/2009.
[24] Roxana Girju. Semantic relation extraction and its applications. 20th European Summer School in Logic, Language and Information, 4/2008, pp. 2-10.
[25] Sammy Wang. Application of Data and Text Mining to Bioinformatics. University of Georgia, 2008.
[26] S. Cohen, J. Mamou, Y. Kanza, Y. Sagiv. XSEarch: A semantic search engine for XML. In: Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003.
[27] S. Luke, J. Heflin. SHOE 1.01. Proposed Specification. SHOE Project technical report, University of Maryland, 2000.
[28] Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Edition: 3, illustrated. Morgan Kaufmann, 2003. Chapter: Semi-supervised Learning.
[29] Swanson DR. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med, 1986.
[30] Xuan-Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi. Personal Name Resolution Crossover Documents by a Semantics-Based Approach. IEICE Trans. Inf. & Syst., 2006, pp. 1-5.
[31] http://gate.ac.uk/
[32] http://www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html
[33] http://genome.jouy.inra.fr/texte/LLLchallenge/
[34] http://www.dit.hcmut.edu.vn/~tru/VN-KIM/products/vnkim-kb.htm
[35] http://www.wolframalpha.com/
[36] http://www.w3.org/
[37] http://wordnet.princeton.edu/
[38] http://nlp.cs.swarthmore.edu/semeval/
[39] http://www.nlm.nih.gov/mesh/-meshhome.html
[40] http://www.dit.hcmut.edu.vn/~tru/VN-KIM/products/vnkim-ie.htm
[41] http://www.bioontology.org/ncbo/faces/pages/ontology_list.xhtml
[42] http://diseaseontology.sourceforge.net/
[43] http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi
[44] http://biocaster.nii.ac.jp/
[45] http://www.ksl.stanford.edu/software/ontolingua/
[46] http://www.isi.edu/isd/ontosaurus.html
[47] http://www-sop.inria.fr/acacia/ekaw2000/ode.html

[48] http://www.xml.com/pub/r/861
[49] http://biocreative.sourceforge.net/
[50] http://www.owlseek.com/whatis.html
[51] http://protege.stanford.edu/
[52] http://www.bioontology.org/

