Major assignment: Natural Language Processing

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

MAJOR ASSIGNMENT
COURSE: NATURAL LANGUAGE PROCESSING
TOPIC: A STUDY OF SEARCH ENGINE ARCHITECTURE,
THE LUCENE FRAMEWORK & THE NUTCH APPLICATION
Supervisor: Hong Anh Vit, M.Sc.
Student group:
Nguyn Th Anh       20080070
Trn Anh Th         20082569
Nguyn Vng Quyn     20082142
Nguyn Vn Hng       20081293

Class: Software Engineering K53

Hanoi - 11/2012

TABLE OF CONTENTS

PART I: PRINCIPLES AND GENERAL MODEL OF SEARCH ENGINES
1. General introduction
2. Classification
2.1. Conventional search engines
2.2. Meta search engines
3. Operating principle of a search engine
4. Search engine model
4.1. Crawler
4.2. Repository
4.3. Indexer
4.4. Search Engine (information retrieval component)
4.5. Page ranking (PageRank)
PART II: THE LUCENE FRAMEWORK
1. Introduction
2. Lucene within the components of a search application
2.1. Indexing components
2.2. Components for searching
3. Main Lucene classes
3.1. Main classes for indexing
3.2. Main classes for searching
PART III: THE NUTCH APPLICATION
1. Introduction to Nutch
1.1. Definition
1.2. Characteristics
1.3. Nutch and Lucene
2. Development platforms of Nutch
3. Nutch architecture
REFERENCES

PART I: PRINCIPLES AND GENERAL MODEL OF SEARCH ENGINES

1. General introduction
A search engine was originally a piece of software that finds pages on the Internet whose content matches a user's request, based on the keywords the user supplies. The search engine looks the keywords up in its database and returns a list of the web pages that contain them.
The term "search engine" is commonly used for two kinds of search systems: those built automatically by computer programs (Crawler-Based Search Engines) and Internet directories maintained by people (Human-Powered Directories). The two find and catalogue websites in different ways.
Crawler-Based Search Engines: these engines use programs called Robots, Spiders, or Crawlers to traverse pages on the web, automatically analyze the pages they fetch, and store them in a database. When a search request arrives, the engine matches the keywords against its index table and returns the corresponding stored information. Engines of this kind re-crawl web content periodically to detect changes (if any) to the pages.
Human-Powered Directories: Internet directories depend entirely on human management. Users who want the search engine to find their website must register it in the directory by submitting a registration form to the search engine's editorial board.
Today most search systems combine automatic search with human-managed directories.
2. Classification
By search method, search engines fall into two main types: conventional search and meta search.
2.1. Conventional search engines
Conventional search engines work by collecting documents, classifying them, and building an index. There are two kinds: search engines that use subject directories and search engines that build their index automatically.
Search engines that use subject directories pre-classify Internet pages into subject directories, organized by increasingly detailed levels of each subject.
2.2. Meta search engines
A meta search engine is a virtual search engine that relies on existing search engines. It has no database of its own. When a search request arrives, it sends the keywords to several other search engines at once and collects their results. Its remaining task is to analyze and re-rank the results it receives.


3. Operating principle of a search engine
A search engine directs robots to collect information on the web by following hyperlinks. When a robot discovers a new site, it sends the document (web page) back to the main server, which builds the index database used to answer searches.
Because information on the web changes constantly, robots must keep re-crawling the sites they already know. The update frequency depends on the particular search engine. When the search engine receives a query from a user, it parses the query, looks it up in the index database, and returns the documents that satisfy the request.


4. Search engine model
4.1. Crawler
The Crawler collects pages from the Internet and passes them to the Indexer. It starts from an initial set of URLs S0. It first places the elements of S0 in a queue, then takes URLs from the queue in order and downloads the corresponding pages. The Crawler extracts all URLs found in each downloaded page and adds them to the queue. This process continues until the Crawler decides to stop. Because the number of pages is very large and the Web changes quickly, several problems must be solved:
- Selecting which pages to download.
- Refreshing pages: the Crawler must decide which pages to revisit and which not.
- Parallelizing the crawl: parallel Crawlers must be arranged so that one Crawler does not visit pages another Crawler has already visited.
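The queue-driven loop described above can be sketched as follows. This is a minimal illustration, not a real crawler: the dictionary WEB is a hypothetical in-memory stand-in for downloading a page and extracting its links, and the stopping rule (a page limit max_pages) is an assumption.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs found on that page.
# A real crawler would download and parse each page instead.
WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def crawl(seed_urls, max_pages):
    """Visit pages in queue order, starting from the seed set S0."""
    queue = deque(seed_urls)   # URLs waiting to be downloaded
    seen = set(seed_urls)      # URLs already queued, so none is queued twice
    visited = []               # download order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in WEB.get(url, []):   # "extract" the URLs on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(["a"], max_pages=10)
```

Keeping a separate `seen` set is what prevents the loop from re-queueing pages that link back to each other, such as "c" pointing to "a" here.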
4.1.1. Page selection
The Crawler downloads pages in order of importance, most important first. We therefore need to look at how the Crawler determines the importance of a page.
Given a web page P, its importance can be defined in one of the following ways:
- Interest Driven: importance is determined by how interested users are in the page.
- Popularity Driven: importance is determined by how popular the page is. One common way to define popularity is to count the links pointing to the page (the number of back links).
- Location Driven: the importance of P is determined by its address.

4.1.2 Crawler model
The Crawler is designed to visit pages in order of importance: pages of higher importance are visited before pages of lower rank. The Crawler evaluates pages by their ranking value, and from this evaluation it knows which higher-importance pages to fetch next. To evaluate the quality of a Crawler, its quality metric can be defined in one of the following two ways:
- Crawl & Stop: Crawler C starts from an initial page P0 and stops after visiting k pages, where k is the number of pages the Crawler can download in one crawl. An ideal Crawler would visit pages in the order R1, ..., Rk, where R1 is the highest-ranked page, followed by R2, and so on. Call R1, ..., Rk the hot pages. Of the k pages an actual Crawler visits, only m pages (m <= k) rank at or above Rk, so the fraction m/k can serve as the measure of quality.
- Crawl & Stop with Threshold: again assume the Crawler visits k pages. This time an importance target G is given, and any page whose importance is greater than G is considered a hot page.
Ordering metric: a Crawler keeps the URLs it has seen during the crawl in a queue and selects one URL from the queue for the next visit. At each selection it picks the URL u with the highest ordering value. The ordering metric is based on one of the importance measures. For example, if we are looking for pages with a high value IB(P), we use IB(P) as the ordering metric, where P is the page that URL u points to.
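As an illustration of an ordering metric and of the Crawl & Stop measure, the sketch below picks the next URL by the number of back links seen so far and then computes m/k. The link graph WEB, the use of back-link counts as the importance measure, and the alphabetical tie-break are all assumptions made for the example.

```python
def crawl_by_backlinks(web, seed, k):
    """Visit at most k pages, always taking the frontier URL with the
    most back links seen so far (the ordering metric)."""
    seen_backlinks = {seed: 0}   # URL -> back links observed during the crawl
    frontier = {seed}            # URLs seen but not yet visited
    visited = []
    while frontier and len(visited) < k:
        # Highest ordering value first; ties broken alphabetically.
        url = min(frontier, key=lambda u: (-seen_backlinks[u], u))
        frontier.remove(url)
        visited.append(url)
        for link in web.get(url, []):
            seen_backlinks[link] = seen_backlinks.get(link, 0) + 1
            if link not in visited:
                frontier.add(link)
    return visited


def crawl_and_stop_quality(visited, importance):
    """Return m/k: the share of the k visited pages whose importance is at
    least that of R_k, the k-th most important page overall."""
    k = len(visited)
    r_k = sorted(importance.values(), reverse=True)[k - 1]
    m = sum(1 for url in visited if importance[url] >= r_k)
    return m / k


# Hypothetical link graph: page -> pages it links to.
WEB = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
# True importance here is the full back-link count of each page.
IMPORTANCE = {page: sum(page in links for links in WEB.values()) for page in WEB}

order = crawl_by_backlinks(WEB, "a", k=3)
quality = crawl_and_stop_quality(order, IMPORTANCE)
```

In this small graph the crawler visits a, b, c; two of those three pages rank at or above the third most important page, so the quality is 2/3. The page with the most back links, d, is missed because its back links only become visible as the crawl proceeds.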
4.1.3 Page Refresh
After the Crawler has selected and downloaded the "important" pages, it must refresh them after some period of time. There are many ways to refresh web pages, and different approaches give different results. Two of them are:
- Uniform refresh policy: the Crawler revisits all pages at the same frequency f.
- Proportional refresh policy: the more often a page changes, the more often the Crawler revisits it. More precisely, let λi be the change frequency of page ei and fi the frequency at which the Crawler revisits ei; fi is then chosen in proportion to λi.
The Crawler must estimate λi for each page in order to set an appropriate revisit policy. The estimate is based on the page's change history as recorded by the Crawler.
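The difference between the two policies can be sketched as a split of a fixed revisit budget. The function names, the `budget` parameter (total revisits per unit of time), and the sample change rates are assumptions for illustration only.

```python
def uniform_policy(change_rates, budget):
    """Every page gets the same revisit frequency f."""
    f = budget / len(change_rates)
    return {page: f for page in change_rates}


def proportional_policy(change_rates, budget):
    """Revisit frequency f_i is proportional to the change rate lambda_i."""
    total = sum(change_rates.values())
    return {page: budget * lam / total for page, lam in change_rates.items()}


# Hypothetical change rates (changes per day) and a budget of 10 visits per day.
rates = {"e1": 9.0, "e2": 1.0}
uniform = uniform_policy(rates, budget=10)
proportional = proportional_policy(rates, budget=10)
```

With these numbers the uniform policy visits each page 5 times a day, while the proportional policy visits e1 9 times and e2 once.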
Freshness metric:
Suppose there are two sets of web pages, A and B, each with 20 pages; on average 10 pages of A are up to date and 15 pages of B are up to date. We then say that B is "fresher" than A. Furthermore, if set A was refreshed one day ago and set B one year ago, we say that A is "more current" than B, which leads to the notion of "age".
Based on these intuitive notions, freshness and age are defined as follows:
- Freshness: let S = {e1, ..., eN} be the set of N pages the Crawler has downloaded. The freshness of page ei at time t is
\[ F(e_i, t) = \begin{cases} 1 & \text{if } e_i \text{ is up to date at time } t \\ 0 & \text{otherwise} \end{cases} \]
- Age: to capture how old a set is, the age of page ei at time t is
\[ A(e_i, t) = \begin{cases} 0 & \text{if } e_i \text{ is up to date at time } t \\ t - (\text{time at which } e_i \text{ was modified}) & \text{otherwise} \end{cases} \]
The freshness (or age) of the set S is the average of these values over its N pages.
The freshness and age of the local collection change over time. We therefore compute the average freshness over a long period of time and use that value as the freshness of the whole collection.
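A small sketch of how the two measures could be computed for a local collection at one point in time. The record layout (`up_to_date`, `changed_at`) and the sample values are hypothetical; `changed_at` stands for the time at which the source page changed after the local copy was taken.

```python
def freshness(pages):
    """Fraction of pages whose local copy is up to date."""
    return sum(1 for p in pages if p["up_to_date"]) / len(pages)


def age(pages, now):
    """Average age: 0 for up-to-date pages, otherwise the time elapsed
    since the source page changed."""
    total = sum(0 if p["up_to_date"] else now - p["changed_at"] for p in pages)
    return total / len(pages)


# Hypothetical collection of four pages observed at time 10.
pages = [
    {"up_to_date": True, "changed_at": None},
    {"up_to_date": True, "changed_at": None},
    {"up_to_date": False, "changed_at": 3},
    {"up_to_date": False, "changed_at": 7},
]
f = freshness(pages)      # 2 of 4 pages are fresh
a = age(pages, now=10)    # (0 + 0 + 7 + 3) / 4
```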
Refresh strategy: within a given period of time, a Crawler can download or refresh only a limited number of pages. Depending on the refresh policy, this download capacity is distributed among the pages in different ways.
4.2. Repository
The page Repository is a scalable storage system that manages a large collection of web pages. It performs two main functions: first, it lets the Crawler store web pages; second, it must provide an efficient access API that the Indexer and the Collection Analysis module can use to retrieve pages from the repository.
4.2.1 Requirements for the repository
- A Repository manages a large collection of data objects.
- It must be scalable.
- It must support dual access modes: random access is used to quickly retrieve a specific web page through the unique identifier assigned to it; sequential access is used to retrieve the whole collection or a large subset of it.
- It must handle large bulk updates.
- It must remove pages that no longer exist.
- The repository is designed to be distributed (Distributed Repository).
4.2.2 Page distribution policies
Pages can be assigned to nodes according to several policies. Under the uniform distribution policy, all nodes are treated identically: a page can be assigned to any node in the system, and each node holds a portion of the collection according to its storage capacity. In contrast, under hash distribution a page is placed on a node determined by the page's identifier.
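Hash distribution can be sketched as follows. The choice of SHA-256, the use of the URL as the page identifier, and the modulo mapping onto node indices are assumptions for illustration; the report does not specify a hash function.

```python
import hashlib


def node_for_page(page_id, num_nodes):
    """Map a page identifier to a node index using a stable hash."""
    digest = hashlib.sha256(page_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


# Hypothetical page identifiers spread over 4 storage nodes.
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
placement = {url: node_for_page(url, num_nodes=4) for url in urls}
```

Because the node is derived from the identifier alone, any component can locate a page without consulting a central directory, which is what makes random access by identifier cheap under this policy.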
4.2.3 Physical page organization
Within a single node, three operations can be performed: page addition/insertion, high-speed sequential access, and random access. How pages are physically organized on a node determines how well the node supports each of these operations.
4.2.4 Update strategy
The design of the update strategy depends on the nature of the Crawler. A Crawler can be structured in two ways, batch-mode or steady. A batch-mode Crawler runs periodically, for example once a month, and is allowed to crawl for a certain amount of time (or until all pages in the target set have been crawled), after which it stops. With such a Crawler, the web repository is updated only on certain days of the month. In contrast, a steady Crawler runs continuously, constantly updating the Repository and adding new pages to it.


4.3. Indexer
Downloaded documents must be processed appropriately before they can be searched. Describing the content of a document in a more compact form, using keywords or terms, is called indexing the document.
The Indexer and Collection Analysis modules build several kinds of indexes. The Indexer module builds two main indexes: the Text Index (content index) and the Structure Index (link index). From these two, the Collection Analysis module builds other useful indexes:
- Link Index: the crawled portion of the web is represented as a graph with nodes and edges.
- Text Index: text-based indexing is an important method for identifying the pages relevant to a search request.
- Utility (combined) indexes: the number and type of Utility indexes are determined by the Collection Analysis module, depending on the features of the query engine and the kind of information the Ranking module uses.
4.3.1 Indexing steps
Step 1: Identify the terms and concepts that can represent the document to be stored.
Step 2: Assign a weight to each term; the weight reflects the importance of the term in the document.
4.3.1.1 Identifying important terms
The terms of a document are determined from the content of the document itself, or from its title or abstract.
Automatic indexing usually begins by examining how often each distinct word occurs in the text. The occurrence characteristics of the vocabulary can be described by rank-frequency (Rank_Frequency). The steps for identifying important terms are:
- Given a collection of n documents, compute the frequency of each term in each document. Let Fik (Frequency) denote the number of occurrences of term k in document i.
- Compute the total frequency TFk (Total Frequency) of each term by summing the frequencies of that term over all n documents:
\[ TF_k = \sum_{i=1}^{n} F_{ik} \]
- Sort the terms in decreasing order of total frequency. Choose a threshold value and remove all terms whose total frequency is higher than this threshold (stop words).
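The three steps above can be sketched as follows. Splitting on whitespace, the sample documents, and the threshold value are assumptions chosen to keep the example small.

```python
from collections import Counter


def total_frequencies(documents):
    """TF_k: sum over all documents i of F_ik, for every term k."""
    totals = Counter()
    for text in documents:
        totals.update(text.lower().split())   # adds F_ik for this document
    return totals


def split_stop_words(totals, threshold):
    """Terms whose total frequency exceeds the threshold are stop words."""
    ranked = [term for term, _ in totals.most_common()]   # decreasing frequency
    stop_words = {t for t in ranked if totals[t] > threshold}
    index_terms = [t for t in ranked if t not in stop_words]
    return stop_words, index_terms


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat",
]
totals = total_frequencies(docs)
stop_words, index_terms = split_stop_words(totals, threshold=3)
```

Here "the" occurs 4 times in total and is removed as a stop word, while "cat" (3 occurrences) stays as an index term.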

4.3.1.2 Computing term weights
The weight of a term is based on its frequency of occurrence across the documents: terms that occur frequently in all documents are less significant than terms concentrated in a few documents.
Conversely, the more often term k occurs within a given document, the more significant it is for that document.
Automatic indexing means automatically determining the index terms for each document. Stop words are removed because they discriminate poorly and cannot be used to characterize a document's content.
The next step is term normalization, that is, reducing each term to its root form by removing prefixes, suffixes, and other variations such as plural or past-tense forms.
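One common weighting that behaves as described is tf-idf; the report does not name it, so treat its use here as an assumption. The sketch multiplies the frequency of a term within a document by the logarithm of the inverse share of documents containing the term.

```python
import math


def tf_idf(term, doc_tokens, all_docs_tokens):
    """Weight = (frequency in this document) * log(n / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in all_docs_tokens if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs_tokens) / df)


# Hypothetical, already tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked", "dog"],
    ["the", "bird"],
]
w_the = tf_idf("the", docs[0], docs)   # occurs in every document
w_dog = tf_idf("dog", docs[1], docs)   # occurs twice, in one document only
```

A term present in every document gets weight 0, while a term repeated inside a single document gets a high weight, matching the two observations in the text.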
4.3.2 Structure of the inverted index
After the web pages have been parsed, the words extracted and normalized to their root form, and the stop words removed, we obtain a list of terms in which each term is associated with the list of pages that contain it. This list is called an inverted index.
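A minimal sketch of building such a structure. The stop-word list and sample pages are illustrative, tokenization is a plain whitespace split, and normalization to root form is omitted to keep the example short.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "is", "it"}   # illustrative list only


def build_inverted_index(pages):
    """Map each term (except stop words) to the sorted ids of pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                index[token].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


pages = {
    1: "the crawler fetches a page",
    2: "the indexer reads the page",
    3: "a crawler is fast",
}
index = build_inverted_index(pages)
```

Looking up a term is then a single dictionary access, for example `index["page"]`, instead of a scan over every page.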
4.4. Search Engine (information retrieval component)
"Search engine" refers to the whole system, whose main components are the information collector, the indexer, and the searcher. These components run continuously from system start-up; they depend on one another for data but operate independently.
The search engine interacts with users through a web interface: it receives requests and returns the documents that satisfy them.
Keyword search finds the pages in which the words of the query occur, except for stop words (very common words such as the articles a, an, the). For the display and ordering of results, the more often a word occurs in a page, the more likely that page is to be returned to the user, and a page that contains all the query words is preferred over one that lacks one or more of them. Today most search engines support basic and advanced search, search for single words, compound words, phrases, and proper nouns, and restriction of the search scope to headings, titles, or the descriptive text of a web page.
Beyond exact keyword matching, search engines also try to "understand" the real meaning of the question from the wording of the user's query. This shows in features such as spelling correction and matching different forms of a word; for example, the search engine also looks for speaker, speaking, and spoke when the user enters speak.
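The two ordering rules mentioned above (pages containing more of the query words first, then pages with more occurrences) can be sketched like this. The function name, the stop-word list, and the sample pages are hypothetical, and real engines use considerably richer scoring.

```python
def search(query, pages, stop_words=frozenset({"a", "an", "the"})):
    """Rank pages by number of distinct query terms matched, then by the
    total number of occurrences of those terms."""
    terms = [t for t in query.lower().split() if t not in stop_words]
    results = []
    for page_id, text in pages.items():
        tokens = text.lower().split()
        counts = [tokens.count(t) for t in terms]
        matched = sum(1 for c in counts if c > 0)
        if matched:
            results.append((matched, sum(counts), page_id))
    # More matched terms first, then more occurrences, then page id.
    results.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [page_id for _, _, page_id in results]


pages = {
    "p1": "lucene search search engine",
    "p2": "search search search",
    "p3": "the cooking page",
}
ranking = search("the lucene search", pages)
```

Page p1 comes first because it contains both query words; p2 repeats "search" more often but lacks "lucene"; p3 matches only the stop word and is not returned.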


4.5. Page ranking (PageRank)
Sergey Brin and Lawrence Page proposed a method for computing page rank. It is based on the idea that a link from page A to page B is an endorsement of B by A. If page B is pointed to by many "important" pages while some page C is pointed to by few "important" pages, then B is more important than C. Given a set of web pages and the links between them, we obtain a graph whose nodes are the web pages and whose edges are the links.
PR is expressed as one of 11 values, from 0 to 10; the higher the PR, the greater the reputation of the website.
If you use Firefox, you can use the SearchStatus tool at https://addons.mozilla.org/en-US/firefox/addon/searchstatus/ . Alternatively, you can install the SEOQuake tool from http://seoquake.com , which works with all browsers.
PageRank formula:
PR(A) = (1 - d) + d (PR(t1)/C(t1) + ... + PR(tn)/C(tn))
where:
- PR(A) is the PageRank of page A
- t1, t2, ..., tn are the pages that link to page A
- C(ti) is the number of outbound links of source page ti
- d is the damping factor
Example: suppose three pages link to your page A: page B (PR = 6), page C (PR = 3), and page D (PR = 4). Page B has 3 outbound links, page C has 6, and page D has 12. With d = 0.85:
PR(A) = 0.15 + 0.85 * (6/3 + 3/6 + 4/12) ≈ 2.56
Knowing the formula lets us be more deliberate when exchanging links with other sites. For example, a link from a site with PR 7 or 8 that has only one outbound link, pointing to an article on our site, passes on its full rank: with PR = 7 it contributes 0.85 * 7/1 = 5.95.
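The formula and the worked example can be checked with a few lines of code. The function name and the (PR, outbound links) pair representation are choices made for this sketch; it evaluates the formula once for a single page and does not iterate over a whole link graph.

```python
def page_rank(inbound, d=0.85):
    """PR(A) = (1 - d) + d * sum(PR(t_i) / C(t_i)) over the pages t_i
    linking to A. `inbound` is a list of (PR(t_i), C(t_i)) pairs."""
    return (1 - d) + d * sum(pr / out_links for pr, out_links in inbound)


# Pages B, C, D from the example: (PageRank, number of outbound links).
pr_a = page_rank([(6, 3), (3, 6), (4, 12)])
print(round(pr_a, 2))   # 2.56

# A single link from a PR 7 page that has only one outbound link.
pr_single = page_rank([(7, 1)])
```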


PART II: THE LUCENE FRAMEWORK

1. Introduction
Lucene is an open-source library written in Java that makes it easy to add search functionality to any application. It provides APIs for indexing and searching. Lucene can be used to integrate search into an existing application or to build a complete search engine.
Lucene supports the following parts of the search process:
- Analyzing data (in text form) for indexing: Analyze document
- Indexing (Index document & Index)
- Building queries and searching the index: Build query, render result, run query
Lucene can index and search any kind of data as long as the data can be converted to text. Lucene therefore does not care about the data source, the data format, or even the language, provided the content can be converted to text. It can index data on remote servers or on the local machine, in files of various formats such as txt, Word, HTML, and PDF.
Lucene has also been ported to many other languages, such as C#, PHP, C, C++, Python, and Ruby.

2. Lucene within the components of a search application
This section presents the components that the Lucene library supports and the components that must be added to build a complete search engine. As we will see, a search application retrieves raw content, creates documents from it, extracts text from binary documents, and indexes the documents. Once the index is built, we need a search interface, a way to execute searches, and a way to display the results.
Search applications come in many variations. Some run as a small component embedded in an application. Some run on a website, on dedicated server infrastructure, and interact with many users through web browsers or mobile devices. Others run on a company intranet and search the company's internal documents.
All these variations nevertheless share the same overall architecture, illustrated in the figure below.
[Figure: overall architecture of a search application]
We will look at the components of a search engine system in turn.


2.1. Indexing components
Suppose we need to search a large number of files. The naive approach is to scan each file to see whether it contains the given word or phrase. This works, but it is inefficient when the number or the size of the files is large. To search more efficiently, we index the text collection, converting it into a format that allows fast lookup and eliminates the sequential scan. This process is called indexing, and its output is called an index.
Indexing consists of a sequence of distinct steps, described below.
a. Acquire content
The first step is to acquire content. This step uses a crawler or spider to gather the content that needs to be indexed. Crawlers typically run continuously as a service, waiting for new or changed content and updating it when it appears.
Lucene is a core library and provides no functionality for acquiring content. Several open-source libraries can be used for this purpose:
- Solr: a sub-project of Apache Lucene that supports acquiring content from relational databases and XML feeds, and handles documents through Tika
- Nutch: a sub-project of Apache Lucene with a crawler for collecting website content
- Grub: an open-source web crawler
- Heritrix, Droids, Aperture
Once the content has been acquired, we break it into small pieces called documents.
b. Build document
Once we have the raw content, we must convert it into the units, usually called documents, that the search engine uses. A document contains several separately named fields with corresponding values, for example title, body, abstract, author, url. We have to design how to divide the raw data into documents and fields, and how to compute the value of each field. Often the approach is obvious: one email becomes one document, as does one PDF file or one web page. In other cases it is less clear, for example how to handle the attachments of an email: should we merge all the text extracted from the attachments into a single document, or create separate documents?
Once this design is settled, we need to extract text from the raw content for each document. If the content is already plain text in a known encoding, the job is simple. But what if the documents are binary (PDF, Microsoft Office, Open Office, Adobe Flash, streaming video, audio multimedia) or contain markup that must be stripped before indexing (RDF, XML, HTML)? In that case we need to run document filters to extract the text before creating the search engine document.
This step may also require business logic to create fields. For example, if we have a large "body text" field, we can run semantic analyzers to extract properties such as names, places, dates, and times and put them into appropriate fields.
Lucene provides an API for building fields and documents, but it provides no logic for constructing a document and no document filters. Libraries such as Tika can be used for document filtering.
After the documents and their text fields have been extracted, the search engine still cannot index them directly; the text must first go through analysis.
c. Analyze document
No search engine indexes text directly; rather, the text must be split into a sequence of atomic elements called tokens. That is the job of the analysis step. Each token corresponds roughly to a word in the language, and this step determines how the text fields of a document are divided into a sequence of tokens.
Lucene provides a set of built-in analyzers that make it easier to control the analysis process. The final step is to index the document.
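To make the idea of analysis concrete, the sketch below splits text into lowercase tokens and drops stop words. It is a conceptual illustration in Python, not Lucene's Analyzer API; the regular expression and the stop-word list are assumptions.

```python
import re

STOP_WORDS = {"a", "an", "is", "it", "the"}   # illustrative list only


def analyze(text):
    """Split text into lowercase alphanumeric tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


tokens = analyze("It is a Quick, brown FOX!")
```

Lowercasing at this stage is what later makes searches case-insensitive, provided the same analysis is applied to the query text.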
d. Index document
In this step, the document is added to the index. Lucene provides everything needed for this step through a simple API.
2.2. Components for searching
Searching is the process of looking up words in an index to find the documents in which they occur. Search quality is described by two measures, precision and recall. Recall measures how well the system finds the relevant documents, while precision measures how well it filters out the irrelevant ones. We now look at the search components of a search engine in turn.
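Both measures can be computed from the set of retrieved documents and the set of documents known to be relevant. The function name and the sample document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The engine returned documents 1-4; documents 2, 4 and 5 were relevant.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).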
a. Search user interface
The user interface is the part users interact with directly, in a browser, a desktop application, or a mobile device, when they use the search application. Lucene provides no default search interface; we must build it ourselves. When a user submits a search request, it must be translated into a Query object suitable for the search engine.
b. Build query
When users search, they submit a search request to the server. The request must then be translated into the search engine's Query object. We call this step build query.
Query objects can be simple or complex. Lucene provides a package called QueryParser that turns the text the user enters into a Query object according to a search syntax. A query may contain Boolean operators, phrase queries, or wildcards. The query may also carry additional constraints that restrict which documents the user is allowed to search. Some applications also modify the query to change the order of the results or to filter for the important parts, if boosting (adjusting the weights of indexed items so that results are displayed in the required priority order) was not done at indexing time. Typically, an e-commerce website boosts the product categories that are more profitable, or filters products.
Lucene's default QueryParser usually meets the needs of a typical application. Sometimes we will want to take the output of QueryParser and add our own logic to adjust the Query object. We are then ready to execute the search request.

c. Search query
Search query is the process of consulting the search index and retrieving the documents that match the Query, sorted in the requested order. This component encapsulates the complex inner workings of the search engine, and Lucene handles all of them for you. Lucene is also extensible at this point, so it is fairly easy to customize how results are collected, filtered, and sorted.
There are three common theoretical models of search:
- Pure Boolean model: documents either match or do not match the query, and they are not scored. No score is associated with a matching document and the matching documents are unordered; a query only identifies the set of matching documents.
- Vector space model: both queries and documents are modeled as vectors in a high-dimensional space. The relevance of a document to a query is computed by a vector distance measure between these vectors.
- Probabilistic model: the probability that a document matches a query is computed using a full probabilistic approach.
Lucene's approach combines the vector space model and the pure Boolean model, and lets us decide which model is used. Finally, Lucene returns the documents, which we then display to the user as a result list.
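The difference between the first two models can be sketched in a few lines. This is a conceptual illustration, not Lucene's scoring: the Boolean match is a plain AND over the query terms, and the vector space score is cosine similarity over raw term counts, both chosen as assumptions for the example.

```python
import math
from collections import Counter


def boolean_match(query_terms, doc_tokens):
    """Pure Boolean AND: the document either matches or it does not."""
    return all(term in doc_tokens for term in query_terms)


def cosine_similarity(query_terms, doc_tokens):
    """Vector space model with raw term counts as vector components."""
    q, d = Counter(query_terms), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


query = ["lucene", "search"]
doc1 = ["lucene", "search", "library"]
doc2 = ["lucene", "lucene", "java"]

matches = [boolean_match(query, doc1), boolean_match(query, doc2)]
scores = [cosine_similarity(query, doc1), cosine_similarity(query, doc2)]
```

The Boolean model accepts doc1 and rejects doc2 without ranking them, whereas the vector space model assigns both a score and ranks doc1 above doc2.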
d. Render results
Once we have the set of documents that match the query, sorted in the appropriate order, we can display them in the user interface.
This concludes our review of the indexing and searching components of a search application, although the picture is not yet complete.
3. Main Lucene classes
3.1. Main classes for indexing
The main classes used for indexing are:
- IndexWriter
- Directory
- Analyzer
- Document
- Field
The figure below shows the role of these classes in the indexing process.
[Figure: classes involved in the indexing process]
A brief description of each class follows.
a. IndexWriter
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, updates, or deletes Documents in it. IndexWriter gives write access to the index but does not allow reading or searching it. IndexWriter needs somewhere to write the index, and that place is a Directory.
b. Directory
This class represents the location of a Lucene index. Directory is an abstract class that defines the common methods for its subclasses. For example, we can use FSDirectory.open to obtain a concrete FSDirectory instance of Directory, which stores the index files in a directory of the file system.
IndexWriter cannot index text directly; the text must first be broken into separate words by an Analyzer.
c. Analyzer
Before text is indexed, it is processed by an Analyzer. The Analyzer is specified in the IndexWriter constructor; it extracts the tokens that should be indexed from the text and discards the rest. If the content to be indexed is not plain text, the plain text must be extracted first, which can be done with Tika. Analyzer is an abstract class, but Lucene provides several concrete implementations. Some of them remove stop words (common words that do not help distinguish documents from one another, e.g. a, an, is, it); some convert tokens to lowercase so that searches are case-insensitive. Analyzers are an important part of Lucene and can be used for much more than filtering the input.
Next, the indexing process requires a Document containing separate Fields.
d. Document
This class represents a collection of Fields. A Document can be thought of as a virtual document, such as a web page, an email message, or a text file. The fields of a Document represent the document itself or metadata associated with it. The original data source (a database field, a Microsoft Office file, a chapter of an ebook) cannot be indexed by Lucene as it is. We must extract the text from such binary documents and add it as Field instances so that Lucene can process it. Fields can hold metadata such as author, title, content, or date.
Lucene handles only string and numeric data: it supports java.lang.String, java.io.Reader, and the usual numeric types such as int, float, double, and long.
A Document is a container holding multiple Fields, and a Field is the class that holds the text content to be indexed.
e. Field
Each Document in an index contains one or more named fields, represented by the Field class. Each Field has a name, a corresponding value, and a set of options that control how Lucene indexes the value. A Document may contain more than one Field with the same name. In that case the values of those fields are appended to one another during indexing, in the order in which the fields were added to the document.
3.2. Main classes for searching
The classes provided for searching are also quite simple. The main ones are:
- IndexSearcher
- Term
- Query
- TermQuery
- TopDocs
a. IndexSearcher
IndexSearcher searches the index that IndexWriter has built. It can be thought of as a class that opens the index in read-only mode. It requires a Directory instance holding a previously created index, and it provides a number of search methods. The simplest search method takes two parameters, a Query object and the number of results to return, and returns a TopDocs object.
Example:
[Code listing]
b. Term
A Term is the basic unit of search. Like a Field, it consists of a pair of string values: the name of the field and the value of that field. When searching, we can construct Term objects and use them together with TermQuery.
[Code listing]
The code above instructs Lucene to find the top 10 documents that contain the word lucene in the contents field, sorted by decreasing relevance.
c. Query
Query is an abstract class. It contains several utility methods, the most commonly used being setBoost(float), which tells Lucene that one query should carry more weight than another. Lucene also provides a number of concrete subclasses of Query, such as TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, and PhrasePrefixQuery.
d. TermQuery
TermQuery is the most basic query type supported by Lucene and one of its primitive query types. It is used to find documents that contain a field with a value matching the query.
e. TopDocs
TopDocs represents the set of results returned by a search. It is a container of pointers to the N highest-ranked results, which are the documents that match the query. For each of the N results, TopDocs records the docID, which can be used to retrieve the Document, and the score of the corresponding document.


PART III: THE NUTCH APPLICATION

1. Introduction to Nutch
1.1. Definition
Nutch is an open-source search engine implemented in Java. Nutch provides all the tools needed to build your own search engine (SE).
Nutch can be deployed at one of three scales: a local file system, an intranet, or the web. Each has different characteristics. For example, crawling a local file system is more reliable than the other two, which may suffer from network errors or unavailable caches.
Challenges for Nutch: crawling billions of web pages raises several questions. Where do we start? How often do we re-crawl? How do we handle broken links, unresponsive sites, and duplicate or unintelligible content? A further set of challenges lies in providing scalable search that can handle hundreds of concurrent queries over a large data set.
1.2. c im
1.2.1. H tr nhiu protocol
Hin nay Nutch h tr cc cc protocol khc nhau nh: http, ftp, file.
Nh kin trc plugin-based, ta c th d dng m rng ra thm cc
protocol cho Nutch
1.2.2. Support for many document types
Nutch can currently parse a fairly wide range of document types, such as:
o Plain text
o HTML
o XML
o JavaScript: Nutch does not only extract links from js code, as other search engines do, but also parses the js code itself.
o Open Office (OpenOffice.org): parses Open Office and Star Office documents.
o Microsoft documents: Word (.doc), Excel (.xls), PowerPoint (.ppt).
o Adobe PDF
o RSS
o Rich text format documents (.rtf)
o MP3: parses the ID3v1 or ID3v2 fields that describe the song, such as title, album, and artist.
o ZIP
By default, Nutch processes only two document types, HTML and plain text. To extend it to other document types, Nutch must be reconfigured. Thanks to Nutch's plugin-based architecture, we can also write additional plugins to handle other document types.
1.2.3. Running on a distributed system
Nutch is developed on top of Hadoop, so a Nutch application reaches its full power when it runs on a Hadoop cluster.
1.2.4. Flexibility through plugins
Thanks to the plugin-based architecture, Nutch's features can easily be extended by developing additional plugins. All document parsing, indexing, and searching operations are in fact carried out by plugins.
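
The plugin idea can be illustrated with a small registry that dispatches a URL to the handler registered for its protocol. This is only a sketch in plain Java: ProtocolRegistry and its methods are hypothetical names and do not reproduce Nutch's real extension-point mechanism.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy plugin registry. Hypothetical names, not Nutch's real plugin API.
class ProtocolRegistry {
    // scheme (for example "http") -> handler that turns a URL into content
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    // Adding support for a new protocol only requires registering one more handler.
    void register(String scheme, Function<String, String> handler) {
        handlers.put(scheme.toLowerCase(), handler);
    }

    // Dispatch a URL to the handler registered for its scheme.
    String fetch(String url) {
        int colon = url.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("URL has no scheme: " + url);
        }
        String scheme = url.substring(0, colon).toLowerCase();
        Function<String, String> handler = handlers.get(scheme);
        if (handler == null) {
            throw new IllegalArgumentException("No plugin for protocol: " + scheme);
        }
        return handler.apply(url);
    }
}
```

The core code never changes when a protocol is added, which is the property the plugin-based design aims for.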
1.3. Nutch and Lucene
Nutch is built on top of Lucene, an API for indexing and searching text. A common question is "Should I use Lucene or Nutch?" The simple answer is that you should use Lucene if you do not need a web crawler. A common scenario is that you have a web database that you want to make searchable. The best way to do this is to index the data directly from the database using the Lucene API and then implement the search over the index, again using Lucene. Nutch is a better fit for websites where you cannot access the underlying data directly, or where the data comes from disparate sources.
2. Platforms underlying Nutch
2.1. Lucene
Lucene is an open-source library for building indexes and searching. Lucene is itself a project of the Apache Software Foundation and was initiated by Doug Cutting. Nutch uses Lucene to index the documents it collects and to implement the search and ranking functions of the search engine.
2.2. Hadoop
Hadoop is an open-source framework written in Java for developing distributed applications that can process large data sets. It allows applications to work with thousands of nodes and petabytes of data. Hadoop was developed from ideas published by Google on the MapReduce model and the Google File System (GFS) distributed file system.
2.3. Tika
Tika is a toolkit for detecting and extracting metadata and structured text content from many different document types. Nutch uses Tika to parse different document types so that they can be used in the indexing process.
3. Nutch architecture
The architecture of Nutch divides naturally into three main components: crawler, indexer, and searcher. These components perform the following functions:
o The crawler collects documents and parses them; its output is a segments data set made up of many segments.
o The indexer takes the data produced by the crawler and builds an inverted index.
o The searcher answers users' search queries using the inverted index built by the indexer.

The components are described in detail below.


3.1. Crawler
The crawling system is driven by Nutch's crawl tool and a set of related tools that build and maintain several kinds of data structures, including the web database, a set of segments, and the index.
The web database, or webdb, is a data structure that mirrors the structure and properties of the web graph being crawled. It persists for as long as the web graph being crawled (and recrawled) exists, which may be several months or many years. The webdb is used only by the crawler and plays no role during search. The webdb stores two kinds of entities: pages and links. A page represents a page on the Web and is indexed by its URL and by the MD5 hash of its content. Other relevant information is stored as well, including the number of links in the page (also called outlinks) and fetch information, such as when the page is to be refetched. A link represents a link from one web page (the source) to another (the target). In the webdb web graph, the nodes are pages and the edges are links.
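
The page and link entities described above can be sketched as a small in-memory graph. The class TinyWebDb and its members are hypothetical names chosen for illustration and do not reflect Nutch's actual storage format; only the use of an MD5 hash of the page content comes from the text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy web graph. Hypothetical names, not Nutch's on-disk webdb format.
class TinyWebDb {
    // Pages: URL -> MD5 hex digest of the page content
    final Map<String, String> contentHash = new HashMap<>();
    // Links: source URL -> target URLs (the outlinks of the source page)
    final Map<String, List<String>> outlinks = new HashMap<>();

    // MD5 of the UTF-8 bytes of the content, as a lowercase hex string.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Add a node (page) to the graph.
    void addPage(String url, String content) {
        contentHash.put(url, md5Hex(content));
        outlinks.computeIfAbsent(url, k -> new ArrayList<>());
    }

    // Add an edge (link) from source to target.
    void addLink(String source, String target) {
        outlinks.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
    }

    // Pages with identical content share a hash, which makes duplicates easy to spot.
    boolean sameContent(String urlA, String urlB) {
        String a = contentHash.get(urlA);
        return a != null && a.equals(contentHash.get(urlB));
    }
}
```

Keying pages by content hash as well as by URL is one way to detect the duplicate content mentioned earlier as a crawling problem.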
A segment is a set of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to process, generated from the webdb. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for a given segment is indexed, and the index is stored in that segment. Every segment has a limited lifetime, 30 days by default, so old segments should be deleted.
The crawl is started by the injector module, which inserts a list of URLs into the crawldb to initialize it.

The crawler has four main components, the generator, fetcher, parser, and updater, which run one after another to form a loop. In each iteration the crawler produces one segment. At the start of the loop, the generator scans the crawldb for URLs that need to be fetched and produces a list of URLs to fetch. At the same time, the generator creates a new segment and stores the list of URLs to fetch in segment/crawl_generate. Next, the fetcher reads the list of URLs from segment/crawl_generate and downloads the document at each URL. The fetcher stores the raw content of each document in segment/content and the fetch status of each URL in segment/crawl_fetch. The parser then reads the raw data of the documents from segment/content and parses them to extract their textual information:
o Extract the outlinks and store them in segment/crawl_parse
o Extract the outlinks and metadata into segment/parse_data
o Extract the full text of the document into segment/parse_text
Finally, the updater uses the fetch status of each URL in segment/crawl_fetch and the list of newly discovered URLs in segment/crawl_parse to update the crawldb. The process then repeats. The number of iterations of this loop is called the depth.
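
The generate, fetch, parse, update loop can be sketched over an in-memory "web" as follows. This is an illustrative simplification in plain Java, not Nutch code: TinyCrawler is a hypothetical name, fetching and parsing are reduced to a map lookup, and the crawldb is assumed to record only whether each URL has been fetched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy crawl loop. Hypothetical names, not Nutch code.
class TinyCrawler {
    // web maps each URL to the outlinks found on that page.
    // Returns the crawldb: URL -> true if the URL has been fetched.
    static Map<String, Boolean> crawl(List<String> seeds,
                                      Map<String, List<String>> web,
                                      int depth) {
        Map<String, Boolean> crawlDb = new LinkedHashMap<>();
        for (String seed : seeds) crawlDb.put(seed, false);      // injector

        for (int round = 0; round < depth; round++) {            // one segment per round
            // Generator: select the URLs that still need fetching.
            List<String> fetchList = new ArrayList<>();
            for (Map.Entry<String, Boolean> e : crawlDb.entrySet()) {
                if (!e.getValue()) fetchList.add(e.getKey());
            }
            if (fetchList.isEmpty()) break;

            // Fetcher and parser: "download" each page and extract its outlinks.
            List<String> discovered = new ArrayList<>();
            for (String url : fetchList) {
                discovered.addAll(web.getOrDefault(url, Collections.<String>emptyList()));
            }

            // Updater: record fetch status, then add newly discovered URLs.
            for (String url : fetchList) crawlDb.put(url, true);
            for (String url : discovered) crawlDb.putIfAbsent(url, false);
        }
        return crawlDb;
    }
}
```

With depth 1 only the seed URLs are fetched; each additional round fetches the URLs discovered in the previous one, which is why the number of iterations is called the depth.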
3.2. Indexer and searcher
The index is the inverted index of all the pages the system has retrieved, created by merging all the segment indexes. Nutch uses Lucene for indexing, so all the existing Lucene tools and APIs can work with the generated index.

The Link Invertor takes data from all the segments to build the linkdb. The linkdb contains every URL the system knows about, together with its inlinks.
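
Link inversion amounts to turning a "source to targets" map into a "target to sources" map. The sketch below shows this in plain Java; TinyLinkInverter is a hypothetical name, and the real Nutch job runs over segment data rather than an in-memory map.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy link inversion. Hypothetical names, not Nutch code.
class TinyLinkInverter {
    // Input: source URL -> outlinks. Output (linkdb): URL -> set of inlink sources.
    static Map<String, Set<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, Set<String>> linkDb = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            // The source is a known URL too, even if nothing links to it.
            linkDb.computeIfAbsent(e.getKey(), k -> new TreeSet<>());
            for (String target : e.getValue()) {
                linkDb.computeIfAbsent(target, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return linkDb;
    }
}
```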
For each segment in segments, the indexer builds an inverted index for that segment. It then merges all of these partial indexes together in indexes. Users interact with the system through client-side search programs. Nutch itself also ships with a web application for searching. These client programs receive queries from the user and send them to the searcher. The searcher runs the search over the index and returns the results to the client program, which displays them to the user.
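
The per-segment indexing, merging, and lookup steps can be sketched as follows. This is a simplified model in plain Java rather than Nutch or Lucene code: TinyIndexer is a hypothetical name, a segment is assumed to be a map from URL to parsed text, and the index stores only which URLs contain each term.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy indexer and searcher. Hypothetical names, not Nutch or Lucene code.
class TinyIndexer {
    // Build an inverted index (term -> URLs containing it) for one segment.
    static Map<String, Set<String>> indexSegment(Map<String, String> segment) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : segment.entrySet()) {
            for (String term : page.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, k -> new TreeSet<>()).add(page.getKey());
                }
            }
        }
        return index;
    }

    // Merge the per-segment indexes into a single index.
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> parts) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, Set<String>> part : parts) {
            for (Map.Entry<String, Set<String>> e : part.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    // Searcher: look up a single term in the merged index.
    static Set<String> search(Map<String, Set<String>> index, String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.<String>emptySet());
    }
}
```

A client program would pass the user's query term to search and display the returned URLs.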

