X.ENT PACKAGE FOR EXTRACTION OF ENTITIES, RELATIONSHIPS BETWEEN ENTITIES AND SUPPORT FOR DATA ANALYSIS IN EPIDEMIOLOGICAL JOURNALS IN FRENCH AGRICULTURE
Phan Trọng Tiến*, Ngô Công Thắng
Faculty of Information Technology, Vietnam National University of Agriculture
Email*: ptgtien@vnua.edu.vn
ABSTRACT
Entity extraction is the task of extracting information from text and classifying it into predefined categories such as the names of persons, organizations, locations and times. At a higher level, it also covers finding relationships between entities, such as the relationship between a person's name and an organization's name. The x.ent tool was built to solve this task. It extracts entities using dictionary matching and hand-crafted rules. To extract relationships between entities, we applied two methods: analysis of the text structure, and an unsupervised learning approach called co-occurrence analysis, based on how often entities appear together. The tool is available on the R site at: http://cran.r-project.org/web/packages/x.ent/index.html.
Keywords: Entity extraction, finite automata, information extraction (IE), named entity recognition (NER), Perl, R, relation extraction.
Information extraction (IE) is a very important step for obtaining the information needed for data analysis. IE is now used in many application areas, such as understanding users' main business trends, disease prevention, crime prevention, bioinformatics and stock market analysis.
X.ent is a tool we built for extracting data from text (entities and the relationships between them). We also built several graphical functions, written in R, that let users analyze the data after extraction. The tool combines two programming languages: Perl for data extraction and R for supporting analysis of the results. Once it was complete, we submitted the tool to CRAN (the site that hosts R packages), where it was accepted by the statisticians there; users can now download and install it directly from the CRAN servers. I completed this product during my master's studies in France in 2012-2014.

2. MATERIALS AND METHODS
2.1. Materials
The data we extract come from French reports on the prevention and control of crop diseases. We are interested in 12 entity types: crops, diseases, pests, other beneficial organisms (auxiliaries), geographic location (regions, towns), report date (date), report number (issues), chemicals used (chemicals), crop developmental stage (developmental stage), crop damage, climate, and negativity level (negative). The relationships of interest are crop with disease and crop with pest.
In France, agronomists produce weekly reports that inform farmers about disease and insect attacks on crops. The aim of these reports is to encourage farmers to use treatment methods against harmful organisms. The first issue appeared in 1946, and the early issues were typewritten (printed); since 2001 all issues have been published in PDF format. France is divided into 22 regions plus the overseas territories, and each region publishes its own reports.
The project's data source comprises 50,000 reports, of which about 20,000 are printed pages. These paper copies had to be scanned; the scans are shared at the BNF library (Bibliothèque François-Mitterrand) and were then converted to text by Jouve Corp. using OCR (Optical Character Recognition).
The project is funded by the French Ministry of Agriculture and Research. It brings together biologists and ecologists who study pathogens, in the fields of epidemiology and environmental science (pest and disease forecasting), within a network called PIC (Integrated Crop Protection). Four PIC experts on potato and wheat worked with us on this project, which is named VESPA (Valeur et optimisation des dispositifs d'épidémiosurveillance dans une stratégie durable de protection des cultures; that is, value and optimization of epidemiological surveillance systems in a sustainable crop protection strategy).

2.2. Methods
Information extraction (IE) is the task of automatically extracting structured information from machine-readable unstructured or semi-structured documents. In most cases this involves processing human-language text, in other words natural language processing (NLP).
Our main goal is to extract the relationships between crop entities and the agents that harm crops, together with the degree of damage they cause. Information extraction is a useful tool in natural language processing. The steps in processing data for information extraction are:
Figure 3. Date extraction rule built with the Unitex tool
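The rule in Figure 3 is a Unitex graph and is not reproduced here. As a rough illustration of what a date rule does, the base R sketch below swaps in a regular expression for the graph. It assumes dates are written like "12 mars 2010" (an assumption about the report format) and normalizes them to the "MM.YYYY" form. The names `months_fr`, `date_pattern` and `extract_dates` are hypothetical and not part of x.ent.

```r
# Hypothetical month lexicon (accents written as \u escapes to avoid encoding issues).
months_fr <- c("janvier", "f\u00e9vrier", "mars", "avril", "mai", "juin",
               "juillet", "ao\u00fbt", "septembre", "octobre", "novembre",
               "d\u00e9cembre")

# Day (optionally "1er"), month name, four-digit year.
date_pattern <- paste0("([0-9]{1,2})(?:er)? (",
                       paste(months_fr, collapse = "|"),
                       ") ([0-9]{4})")

# Return every date found in `text`, normalized to "MM.YYYY".
extract_dates <- function(text) {
  hits <- regmatches(text, gregexpr(date_pattern, text, perl = TRUE))[[1]]
  if (length(hits) == 0) return(character(0))
  parts <- regmatches(hits, regexec(date_pattern, hits, perl = TRUE))
  # parts[[k]] = c(full match, day, month name, year)
  unname(vapply(parts,
                function(p) sprintf("%02d.%s", match(p[3], months_fr), p[4]),
                character(1)))
}

extract_dates("Bulletin du 12 mars 2010 et du 1er septembre 2010")
```

A regular expression is only a stand-in here: a Unitex graph can also use dictionaries and context, which this sketch does not attempt.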
\[
\mathrm{cooc}(E_i, E_j) =
\begin{cases}
1 & \dots \\
0 & \dots
\end{cases}
\]
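As a minimal sketch of co-occurrence analysis, the base R code below assumes that two entities co-occur when they are found in the same report; this reading of cooc(Ei, Ej) is an assumption. The `reports` list is made-up toy data, and `incidence`, `cooc_counts` and `cooc_binary` are illustrative names, not x.ent objects.

```r
# Toy data (made up): entities found in each report.
reports <- list(
  r1 = c("colza", "mildiou"),
  r2 = c("colza", "puceron"),
  r3 = c("tournesol", "puceron")
)

entities <- sort(unique(unlist(reports)))

# Incidence matrix: one row per report, one column per entity (1 = present).
incidence <- t(vapply(reports,
                      function(r) as.integer(entities %in% r),
                      integer(length(entities))))
colnames(incidence) <- entities

# Entity-by-entity counts of reports in which both entities appear.
cooc_counts <- crossprod(incidence)

# Binary indicator: 1 if the pair co-occurs in at least one report, else 0.
cooc_binary <- (cooc_counts > 0) * 1

cooc_counts["colza", "puceron"]     # one shared report (r2)
cooc_binary["colza", "tournesol"]   # never in the same report
```

The diagonal of `cooc_counts` simply holds the number of reports mentioning each entity.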
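\[
0 \le P \le 1, \qquad P = \frac{\#\,\text{correctly extracted entities}}{\#\,\text{extracted entities}} \tag{1}
\]
\[
0 \le R \le 1, \qquad R = \frac{\#\,\text{correctly extracted entities}}{\#\,\text{entities in the reference}} \tag{2}
\]
\[
0 \le F_1 \le 1, \qquad F_1 = \frac{2\,P\,R}{P + R} \tag{3}
\]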
        P       R       F1     |  P       R       F1     |  P       R       F1
PET     96.46   95.52   95.98  |  92.66   71.41   80.52  |  96.45   95.53   95.99
MAL     96.97   95.53   96.24  |  95.46   77.38   85.38  |  96.97   95.52   96.24
PLA     88.80   98.67   93.47  |  93.99   82.68   87.94  |  88.80   98.67   93.47
REG     100     100     100    |  93.20   73.73   81.92  |  100     100     100
TOT     94.33   96.67   95.48  |  93.68   76.85   84.41  |  94.34   96.65   95.48
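The measures in (1) to (3) can be checked with a few lines of base R. This is a sketch assuming the standard definitions of precision, recall and F1 (the harmonic mean of P and R). The function names and the example counts are illustrative only; the last call reuses the first P and R values of the PET row.

```r
# Standard definitions, assumed to match equations (1)-(3).
precision <- function(n_correct, n_extracted) n_correct / n_extracted
recall    <- function(n_correct, n_reference) n_correct / n_reference
f1_score  <- function(p, r) 2 * p * r / (p + r)

# Made-up counts: 90 correct among 100 extracted, 120 entities in the reference.
p <- precision(90, 100)
r <- recall(90, 120)
f1_score(p, r)

# Consistency check on the PET row (values in percent):
# the result is close to the reported F1 of 95.98.
f1_score(96.46, 95.52)
```

Because F1 is homogeneous in P and R, it can be computed directly on the percentage values in the table without rescaling to [0, 1].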
3.2. Analysis and statistics of the extracted data
The x.ent tool implements entity and relationship extraction in Perl, is packaged as an R package, and is available on the R platform (R Development Core Team). The package also provides R functions that help users analyze and explore the results after extraction, such as co-occurrence plots, histograms, Venn diagrams and stacked charts, and statistical hypothesis tests to check the associations between relationships.
Figure 8 shows an example of a parallel display of the co-occurrence of two entities (e1 and e2). Here e1 is the root entity whose relationships we are looking for, and e2 is an entity of a different type. For example, "mouche du chou" is an instance of the pest entity and "mildiou" is an instance of the disease entity.
In R, you can type:
    xplot(e1 = "colza", e2 = c("mouche du chou", "mildiou"))
Time constraints can be added, for example:
    xplot(e1 = "colza", e2 = c("mouche du chou", "mildiou"), t = c("09.2010", "02.2011"))
From the chart, the user can see in which reports a relationship exists and in which it does not. A red symbol indicates that the relationship exists in the report; a purple symbol indicates that it does not.
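To make the time constraint concrete, the base R sketch below filters a small made-up table of extracted relations by a "MM.YYYY" window and tabulates in which reports each relation is present. It does not call x.ent. The data frame `relations`, its columns, the helper `in_window` and the inclusive treatment of the end month are all assumptions for illustration.

```r
# Toy data (made up): one row per relation found in a report.
relations <- data.frame(
  report = c("r1", "r2", "r3", "r4"),
  date   = as.Date(c("2010-09-07", "2010-11-16", "2011-01-11", "2011-03-08")),
  crop   = c("colza", "colza", "colza", "colza"),
  agent  = c("mildiou", "mouche du chou", "mildiou", "mildiou"),
  stringsAsFactors = FALSE
)

# TRUE for dates between the first day of `from` and the last day of `to`,
# both given as "MM.YYYY" strings.
in_window <- function(d, from, to) {
  start    <- as.Date(paste0("01.", from), format = "%d.%m.%Y")
  to_start <- as.Date(paste0("01.", to), format = "%d.%m.%Y")
  end      <- seq(to_start, by = "month", length.out = 2)[2] - 1
  d >= start & d <= end
}

sel <- relations[relations$crop == "colza" &
                 relations$agent %in% c("mouche du chou", "mildiou") &
                 in_window(relations$date, "09.2010", "02.2011"), ]

# Presence (TRUE) or absence (FALSE) of each relation in each selected report.
presence <- table(sel$report, sel$agent) > 0
presence
```

Report r4 falls outside the window and is dropped, so `presence` has one row for each of r1, r2 and r3.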
The frequency chart (histogram) counts how many reports contain an entity, or a given relationship, over time. Figure 9 shows the result of the command:
    xhist("colza: mildiou")
From the chart, the user can see in which periods the disease "mildiou" (downy mildew) occurs frequently on "colza".
The stacked chart is another way for the user to analyze relationships between entities, for example relationships with crops. Based on the extracted data, the user can see which crops are often attacked by which pests and which are not. Figure 10 shows the result of the command:
    xprop(c("blé", "maïs", "tournesol", "colza"), c("mouche du chou", "puceron"))
From the resulting chart, we see that "colza" (rapeseed) can be attacked by both "mouche du chou" (cabbage fly) and "puceron" (aphid), whereas other crops such as "tournesol" (sunflower), "maïs" (maize) and "blé" (wheat) are attacked only by "puceron".
Another problem that arises after extraction is analyzing the co-occurrence of entities or relationships in the reports. Figure 11 gives an example comparing the co-occurrence of the crops "blé", "orge de printemps" and "tournesol", which can be produced in R as follows:
    xvenn(c("blé", "orge de printemps", "tournesol"))
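The regions of such a Venn diagram are set intersections of the reports in which each entity occurs. The base R sketch below shows the idea on made-up report identifiers. The vectors `ble`, `orge` and `tournesol` are hypothetical data, and the code does not use x.ent.

```r
# Made-up identifiers of the reports in which each crop was extracted.
ble       <- c(1, 2, 3, 5)   # "blé"
orge      <- c(2, 3, 4)      # "orge de printemps"
tournesol <- c(3, 5, 6)      # "tournesol"

# Pairwise overlaps: reports mentioning both crops.
n_ble_orge      <- length(intersect(ble, orge))
n_ble_tournesol <- length(intersect(ble, tournesol))

# Central region: reports mentioning all three crops.
n_all <- length(Reduce(intersect, list(ble, orge, tournesol)))

# Reports mentioning only "blé".
n_only_ble <- length(setdiff(ble, union(orge, tournesol)))

c(ble_orge = n_ble_orge, ble_tournesol = n_ble_tournesol,
  all_three = n_all, only_ble = n_only_ble)
```

With these toy sets, only report 3 mentions all three crops and only report 1 mentions "blé" alone.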
http7 Traitement Automatique du Langage Naturel (2014). http://lipn.univ-paris13.fr/~audibert/pages/enseignement/TAL_ITCN.pdf.
http8 Stanford Named Entity Recognizer (2014). http://nlp.stanford.edu/software/CRF-NER.shtml.
http9 LingPipe (2014). http://alias-i.com/lingpipe/.
http10 Information Extraction And Named Entity Recognition (2014). https://web.stanford.edu/class/cs124/lec/Information_Extraction_and_Named_Entity_Recognition.pdf.
http11 Les Réseaux Bayésiens. http://www.bayesia.com/fr/technologie/reseaux-bayesiens.php.
Lafferty J., McCallum A. et Pereira F. C. N. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Dep. Pap. CIS.
Moncla L. (2013). Automatic Annotation of Motion Expressions and Place Named Entities. 2nd Unitex/GramLab.
Paumier S. et Martineau C. (2006). Manuel d'Utilisateur Unitex 3.1 Beta. Université Paris-Est Marne-la-Vallée. Version 1.2.
Sutton C. et McCallum A. (2010). An Introduction to Conditional Random Fields for Relational Learning. 1011.4088 [stat], p. 5-32.
R Development Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL http://www.R-project.org/
Tannier X. (2012). Traitement Automatique des Langues. Université Paris-Sud.
Turenne N. (2013). Knowledge Needs and Information Extraction. Wiley-ISTE.
Zettlemoyer L. (2012). Relation Extraction. University of Washington.