MAJOR ASSIGNMENT
COURSE: NATURAL LANGUAGE PROCESSING
TOPIC: A STUDY OF SEARCH ENGINE ARCHITECTURE,
THE LUCENE FRAMEWORK AND THE NUTCH APPLICATION
Supervisor: Hoàng Anh Việt, M.Sc.
Student group:
Nguyn Th Anh
Trn Anh Th
Nguyn Vng Quyn
Nguyn Vn Hng
20080070
20082569
20082142
20081293
Hanoi - 11/2012
Major assignment: Natural Language Processing
TABLE OF CONTENTS
Introduction
2.
3. Main classes of Lucene
3.1. Main indexing classes
3.2. Main search classes
2.
Definition
1.2. Characteristics
1.3. Nutch and Lucene
3.
A search engine directs robots to gather information on the web by following hyperlinks. When a robot discovers a new site, it sends the document (web page) back to the main server, which builds an index database to serve search requests. Because information on the web changes constantly, robots must keep revisiting the sites they already know. How often they do so depends on the individual search engine. When the search engine receives a query from a user, it analyzes the query, looks it up in the index database, and returns the documents that satisfy the request.
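The way a robot discovers pages by following hyperlinks can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the web is simulated by an in-memory map from each page to its links, and the names `LinkCrawlerSketch` and `crawl` are hypothetical, not taken from any particular search engine.

```java
import java.util.*;

/** Breadth-first link crawler over a simulated, in-memory web. */
class LinkCrawlerSketch {
    /**
     * Starts at the seed page and follows hyperlinks until no new page is found.
     * @param links hypothetical stand-in for the web: page URL -> hyperlinks on that page
     * @return pages in the order the robot discovered them
     */
    static List<String> crawl(String seed, Map<String, List<String>> links) {
        List<String> discovered = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            discovered.add(url); // a real robot would send the page to the indexing server here
            for (String next : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(next)) {
                    frontier.add(next); // queue each page only once
                }
            }
        }
        return discovered;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("a.html", Arrays.asList("b.html", "c.html"));
        web.put("b.html", Arrays.asList("c.html"));
        web.put("c.html", Arrays.asList("a.html"));
        web.put("d.html", Arrays.asList("a.html")); // no page links here, so it is never found
        System.out.println(crawl("a.html", web));
    }
}
```

A page that nothing links to, such as `d.html` in the sample data, is never reached, which is why crawlers depend on good seed pages.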
Based on this intuitive notion, we define freshness and age as follows:
- Freshness: let $S = \{e_1, \dots, e_N\}$ be the set of N pages that the crawler has downloaded.
Freshness is defined as follows: the freshness of page $e_i$ at time $t$ is
4.3. The Indexer
Sergey Brin and Lawrence Page proposed a method for computing page rank. The method rests on the idea that a link from page A to page B is an endorsement of page B by page A. If page B is pointed to by many important pages, while some page C is pointed to by few important pages, then B is more important than C. Given a set of web pages and the links between them, we obtain a graph whose vertices are the pages and whose edges are the links.
PR is expressed as one of 11 values, from 0 to 10; the higher the PR, the greater the reputation of the website.
If you use the Firefox browser, you can check a page's PR with the SearchStatus tool at https://addons.mozilla.org/en-US/firefox/addon/searchstatus/ . Alternatively, you can go to http://seoquake.com to install the SEOQuake tool, which works with any browser.
The PageRank formula

$$PR(A) = (1 - d) + d\left(\frac{PR(t_1)}{C(t_1)} + \dots + \frac{PR(t_n)}{C(t_n)}\right)$$
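The formula can be evaluated by repeated substitution. The sketch below assumes the usual reading of the symbols: t1..tn are the pages that link to A, C(t) is the number of outbound links on page t, and d is a damping factor. The value 0.85 for d, the starting rank of 1.0, the fixed iteration count, and the names `PageRankSketch` and `pageRank` are all assumptions made for this illustration.

```java
import java.util.*;

/** Iterative evaluation of PR(A) = (1 - d) + d * sum(PR(t_i) / C(t_i)). */
class PageRankSketch {
    /**
     * @param outlinks page -> pages it links to (every page must appear as a key)
     * @param d        damping factor (0.85 is a common choice, assumed here)
     * @param rounds   number of iterations to run
     */
    static Map<String, Double> pageRank(Map<String, List<String>> outlinks, double d, int rounds) {
        Map<String, Double> pr = new HashMap<>();
        for (String page : outlinks.keySet()) {
            pr.put(page, 1.0); // initial guess for every page
        }
        for (int r = 0; r < rounds; r++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : outlinks.keySet()) {
                next.put(page, 1.0 - d); // the (1 - d) term
            }
            for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue; // a page without outlinks passes on nothing in this sketch
                }
                double share = d * pr.get(e.getKey()) / targets.size(); // d * PR(t) / C(t)
                for (String target : targets) {
                    next.merge(target, share, Double::sum);
                }
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("A", Arrays.asList("B", "C"));
        web.put("B", Arrays.asList("C"));
        web.put("C", Arrays.asList("A"));
        System.out.println(pageRank(web, 0.85, 50));
    }
}
```

In the sample graph, C is linked from both A and B while B is linked only from A, so C ends up with the higher rank, matching the endorsement idea described above. The values produced here are raw scores, not the 0 to 10 scale.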
search, converting them into a format that lets us search faster and eliminates the sequential scan. This process is called indexing, and its output is called an index.
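The difference between a sequential scan and an index lookup can be made concrete with a small inverted index. This is a minimal sketch, not Lucene's index format: the class `InvertedIndexSketch`, its methods, and the simple tokenization on non-word characters are assumptions made for illustration.

```java
import java.util.*;

/** Minimal inverted index: term -> sorted ids of the documents that contain it. */
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    /** Indexing: tokenize the text once, up front, and record where each term occurs. */
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    /** Index lookup: a single map access, with no pass over the documents. */
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase());
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    /** Sequential scan, for comparison: every document is re-read on every query. */
    SortedSet<Integer> scan(String term) {
        SortedSet<Integer> hits = new TreeSet<>();
        for (int id = 0; id < docs.size(); id++) {
            List<String> words = Arrays.asList(docs.get(id).toLowerCase().split("\\W+"));
            if (words.contains(term.toLowerCase())) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

Both methods return the same document ids. The scan re-reads every document on each query, whereas the lookup is a single map access.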
c. Search Query
Search Query is the process of looking up the search index and retrieving the documents that match the query, sorted in the requested order. This component encapsulates the complex inner workings of the search engine, and Lucene handles all of them for you. Lucene is also extensible at this point, so it is fairly easy to customize how results are collected, filtered, and sorted.
There are three common theoretical models of search:
- Pure Boolean model: a document either matches the query or it does not. Matching documents carry no score and are returned unordered; a query only identifies the set of matching documents.
- Vector space model: both queries and documents are modeled as vectors in a high-dimensional space. The relevance of a document to a query is computed from a vector distance measure between the two vectors.
- Probabilistic model: the probability that a document matches a query is computed using a full probabilistic approach.
Lucene's approach combines the vector space model and the pure Boolean model, and lets us decide which model to use. Finally, Lucene returns the documents, which we then display to the user as a result list.
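The first two models can be contrasted in a short sketch. It is illustrative only and does not reproduce Lucene's scoring: cosine similarity over raw term counts is one common choice of vector measure and is assumed here, and the names `RetrievalModelsSketch`, `booleanMatch`, and `cosine` are hypothetical.

```java
import java.util.*;

/** Pure Boolean matching versus vector-space scoring on raw term counts. */
class RetrievalModelsSketch {
    /** Turns text into a term -> count vector. */
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Pure Boolean model (AND semantics): match or no match, with no score. */
    static boolean booleanMatch(String query, String doc) {
        return termFreq(doc).keySet().containsAll(termFreq(query).keySet());
    }

    /** Vector space model: cosine of the angle between the query and document vectors. */
    static double cosine(String query, String doc) {
        Map<String, Integer> q = termFreq(query);
        Map<String, Integer> d = termFreq(doc);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0);
        }
        double normQ = 0.0;
        for (int v : q.values()) {
            normQ += (double) v * v;
        }
        double normD = 0.0;
        for (int v : d.values()) {
            normD += (double) v * v;
        }
        if (normQ == 0.0 || normD == 0.0) {
            return 0.0; // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(normQ) * Math.sqrt(normD));
    }
}
```

A combined approach could, for example, use `booleanMatch` to select candidate documents and `cosine` to order them, so that the Boolean part decides what is returned and the vector part decides the ranking.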
d. Render results
Once we have the set of documents that match the query, sorted by relevance, we can present them in the user interface.
At this point we have finished reviewing the indexing and searching components of a search application, but the picture is not yet complete.
3. Main classes of Lucene
3.1. Main indexing classes
The main classes that carry out indexing include:
IndexWriter
Directory
Analyzer
Document
Field
b. Term
A Term is the basic unit of search. Like a Field object, it consists of a pair of strings: the name of the field and the value of the field. When searching, we can construct Term objects and use them together with TermQuery.
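To show why a term pairs a field name with a value, the sketch below keys a small index by such pairs. The classes `FieldTerm` and `TermLookupSketch` are hypothetical stand-ins written for this illustration. They are not Lucene's Term or TermQuery classes and say nothing about how Lucene implements them.

```java
import java.util.*;

/** A (field name, field value) pair; a hypothetical stand-in, not Lucene's Term class. */
final class FieldTerm {
    final String field;
    final String text;

    FieldTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FieldTerm)) {
            return false;
        }
        FieldTerm other = (FieldTerm) o;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, text);
    }
}

/** Toy index keyed by (field, value) pairs. */
class TermLookupSketch {
    private final Map<FieldTerm, Set<Integer>> postings = new HashMap<>();

    /** Splits a field value into lowercase words and records each (field, word) pair. */
    void addField(int docId, String field, String value) {
        for (String word : value.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(new FieldTerm(field, word), k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Plays the role of a single-term query: exact match on one (field, value) pair. */
    Set<Integer> termQuery(FieldTerm term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }
}
```

Because the field name is part of the key, the word "lucene" in a title and the same word in a body are distinct terms and can be searched separately.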
1.3. Nutch and Lucene
Nutch is built on top of Lucene, an API for indexing and searching text. A common question is "Should I use Lucene or Nutch?" The simple answer is that you should use Lucene if you do not need a web crawler. A common scenario is that you have a web database that you want to make searchable. The best way to do this is to index directly from the database using the Lucene API and then implement the search against the index, again using Lucene. Nutch is a better fit for websites where you have no direct access to the underlying data, or where the data comes from disparate sources.
2. Platforms on which Nutch is built
2.1. Lucene
2.2. Hadoop
Hadoop is an open-source framework written in Java for developing distributed applications that process large data sets. It lets applications work with thousands of nodes and petabytes of data. Hadoop was developed from ideas in Google's publications on the MapReduce model and on the Google File System (GFS), a distributed file system.
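The MapReduce programming model itself is small enough to imitate in memory. The sketch below is an assumption-laden illustration using word counting as the example job: it runs in a single process, does not use Hadoop's classes, and the names `MapReduceSketch`, `map`, `reduce`, and `run` are hypothetical.

```java
import java.util.*;

/** In-memory imitation of the MapReduce model for word counting (no Hadoop involved). */
class MapReduceSketch {
    /** Map phase: one input record -> a list of (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    /** Reduce phase: all values emitted for one key -> a single result. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Driver: map every record, group the pairs by key (the "shuffle"), then reduce each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }
}
```

Each `map` call depends only on its own record and each `reduce` call only on its own key. That independence is what allows a framework such as Hadoop to spread the work over many nodes.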
2.3. Tika
Tika is a toolkit for detecting and extracting metadata and structured text content from many different types of documents. Nutch uses Tika to parse the various document types so that their content can be used in the indexing process.
3. Architecture of Nutch
The architecture of Nutch divides naturally into three main components: crawler, indexer, and searcher. Each performs the following function:
o The crawler collects documents and parses them; its output is a data set made up of multiple segments.
o The indexer takes the data produced by the crawler and builds an inverted index.
o The searcher answers users' search queries using the inverted index built by the indexer.
3.1. Crawler
The crawler system is driven by the Nutch crawl tool and a family of related tools that build and maintain several types of data structures, including the web database, a set of segments, and the index.
The web database, or webdb, is a data structure that mirrors the structure and properties of the web graph being crawled. It persists for as long as the web graph being crawled (and re-crawled) exists, which may be a few months or many years. The webdb is used only by the crawler and plays no role during searching. It stores two types of entities: pages and links. A page represents a page on the Web and is indexed by its URL and by the MD5 hash of its content. Other relevant information is stored as well, including the number of links in the page (also called outlinks) and fetch information (such as when the page is to be fetched again). A link represents a link from one web page (the source) to another page (the target). In the webdb web graph, the nodes are pages and the edges are links.
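The two entity types and the two page keys can be sketched as follows. This is not Nutch's implementation or storage format: the class `WebDbSketch`, its nested `Page` and `Link` types, and the method names are hypothetical. Only the MD5 computation relies on a standard API, `java.security.MessageDigest`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

/** Toy webdb: pages indexed by URL and by the MD5 of their content, plus links between pages. */
class WebDbSketch {
    static final class Page {
        final String url;
        final String md5;
        final int outlinkCount;

        Page(String url, String md5, int outlinkCount) {
            this.url = url;
            this.md5 = md5;
            this.outlinkCount = outlinkCount;
        }
    }

    static final class Link {
        final String source;
        final String target;

        Link(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    final Map<String, Page> pagesByUrl = new HashMap<>();
    final Map<String, Page> pagesByMd5 = new HashMap<>();
    final List<Link> links = new ArrayList<>();

    /** Hex-encoded MD5 hash of the page content. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required algorithm on the Java platform
        }
    }

    /** Stores the page (a node) under both keys and one link (an edge) per outlink. */
    void addPage(String url, String content, List<String> outlinks) {
        Page page = new Page(url, md5Hex(content), outlinks.size());
        pagesByUrl.put(url, page);
        pagesByMd5.put(page.md5, page);
        for (String target : outlinks) {
            links.add(new Link(url, target));
        }
    }
}
```

One consequence of the content key is that two URLs serving identical content produce the same hash, so a crawler can recognize the duplicate without comparing the pages themselves.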
A segment is a collection of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to process, and it is generated from the webdb. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for a given segment is indexed, and the index is stored in that segment. Every segment has a limited lifetime, 30 days by default, so old segments should be deleted.
The crawl process starts when the injector module inserts a list of URLs into the crawldb in order to initialize it.
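The bookkeeping described above (inject seed URLs, generate a fetchlist, retire old segments) can be sketched as follows. This is a toy model under stated assumptions, not Nutch's code: days are plain integers, a URL is treated as due again once 30 days have passed since its last fetch, and the names `CrawlCycleSketch`, `inject`, `generateFetchlist`, `markFetched`, and `isExpired` are hypothetical.

```java
import java.util.*;

/** Toy crawl bookkeeping: inject seeds, generate a fetchlist, expire old segments. */
class CrawlCycleSketch {
    static final int LIFETIME_DAYS = 30; // default segment lifetime given in the text

    // Hypothetical crawldb: URL -> day of the last fetch (null = never fetched).
    final Map<String, Integer> crawlDb = new TreeMap<>();

    /** Injector: adds the seed URLs that are not yet known. */
    void inject(List<String> seeds) {
        for (String url : seeds) {
            if (!crawlDb.containsKey(url)) {
                crawlDb.put(url, null);
            }
        }
    }

    /** Fetchlist: URLs never fetched, or (assumption) last fetched at least LIFETIME_DAYS ago. */
    List<String> generateFetchlist(int today) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Integer> e : crawlDb.entrySet()) {
            Integer last = e.getValue();
            if (last == null || today - last >= LIFETIME_DAYS) {
                due.add(e.getKey());
            }
        }
        return due;
    }

    /** Records that the URLs of a fetchlist were fetched today. */
    void markFetched(List<String> urls, int today) {
        for (String url : urls) {
            crawlDb.put(url, today);
        }
    }

    /** A segment older than LIFETIME_DAYS can be deleted. */
    static boolean isExpired(int segmentCreatedDay, int today) {
        return today - segmentCreatedDay > LIFETIME_DAYS;
    }
}
```

A typical round would call `inject` once, then repeat `generateFetchlist` and `markFetched` for each new segment, deleting any segment for which `isExpired` returns true.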